The Art and Science of Randomization in Oracle
Why Generate Random Values
Databases are supposed to be for maintaining factual data — with purity, integrity and reliability. So why are we interested in generating random data? The answer to this is that data is a representation of realworld facts and figures. While building a system, we try to mimic the reallife data when testing proper indexing, materialized view creation, partitioning, optimization paths, access methodologies, and so on. The trick is to generate such a data set that is random enough to represent the real world, yet follows the same statistical pattern as the target database. Note the key words here — it must be random, but needs to follow the same statistical pattern as real life.
So how about generating some values such as the following:
32nr 32nr3121ne –e21e
323=11r r
0vmdwdwv0[o rr032r2 0
r4i32r –rm32r3p=x ewifef432fr32o3==
The values in this previous code example are all generated by randomly pecking the keys of the keyboard. Random enough, right?
Not quite. Consider the following: What if you are generating a data set to test out a new application by populating bank accounts. In the U.S., account last names vary widely, but first names are usually from a limited set — John, Jane, Michael, and so on. While generating a sample, you may want to make sure that the first names are populated from actual names such as these only, not from some random, arbitrary characters as shown in the previous example. They must be randomly distributed throughout the table in such a way that the optimizer will follow the same path as in a reallife system. Further, you may want it to follow some more statistical patterns (e.g., 25 percent of the first names should be John, eight percent should be Michael, and so on). Creating a data set that closely mimics real life is important in building a test bench to test application and database development.
On the surface this seems like a trivial and/or fruitless exercise; but in reality, it is one of the most important activities of system design. A few examples will perhaps explain why.
Building a Perfect System
Say an application is going to be developed at Acme Bank. The team is discussing the nuts and bolts of the project — everything from tablespace and filelayout to indexing and partitioning schemes. Their biggest problem is that the database they are designing will contain projected data only, and no meaningful data exists with which to test their plan.
Jill, the lead developer, is mulling over the type of indexing — should she use regular tables with btree indexes or Index Organized Tables? How would she know without generating some type of data?
Next, Deborah is wondering whether partitioning will help or hurt the performance. It would make management easier, sure; but query performance? To test her ideas, she needs a table prepopulated with sample data.
Cathy, the system architect, is proposing a few materialized views. But will this performance enhancement justify the stale data?
At the same time, John, the DBA, is thinking about the best partitioning strategy and is conflicted about whether to use range or to use hash.
Finally, Jack, another DBA, is thinking about placing the datafiles for certain tablespaces on fast disks; but which ones?
These are not hypothetical problems, but reallife challenges faced by a team designing a system for optimal performance and value. There has never been a lack of ideas, but a common problem all these people face is the lack of a dataset that represents the database they are going to build. They could build such a dataset from scratch, but they don’t have the means to enter each and every piece of data by hand.
Solution?
The answer to this problem is to generate random values, but not arbitrary ones. But the values must be distributed following some statistical pattern to reflect real life; a mere update will not help in this case. For instance, suppose your requirement calls for distributing first names as follows:
 John — 30 percent
 Abby — 10 percent
 Scott — 5 percent
 Jill — 5 percent Mike — 50 percent
There are 1,000,000 rows in the test table, so you could specify the following to achieve this pattern:
To update 350,000 or 35 percent of 1 million rows:
update accounts set first_name = ‘John’
where rownum < 350001;
To update the next 5 percent of 1 million rows, or 50,000 records, to Abby:
update accounts set first_name = ‘Abby’
where first_name != ‘John’
and rownum < 50000
... and so on. This procedure will satisfy the overall distribution of the values, but the values will be distributed as follows: The first 35 percent of the records will be for the firstname John, which just won’t happen in real life. In a real table, the first name John will be distributed throughout the table, not in the first 35 percent of the rows. This imperfect distribution would impact heavily on operations such as index scans. Since the optimizer decides between fulltable scan and index scan based on on how the data is distributed, its choice would be greatly affected by the unusual, lopsided availability of values at the head end of the table.
Similarly, these values would probably be concentrated in a small portion of the disk or disk groups, and this would skew the results of the I/O test. So, such an unrealistic distribution of values of the first name John is neither helpful nor desirable.
Other Uses
Values must also be randomly generated when you are creating Web site user IDs for registered users, or when you are generating temporary passwords. Although these are jumbled characters and numbers, the data must fit a pattern (e.g., passwords must be between eight and 12 characters and should have at least four numbers and four characters, and so on).
Although it’s not important anymore in Oracle 10g, the encryption feature requires random keys to be generated. And the keys must be random enough to be difficult to guess.
Building a Random Character Value Generator
Oracle Database 10g provides a random value generator via a package named DBMS_RANDOM, which can be used to generate random values. However, those values are arbitrary, not representative of the reallife system. For instance, the following an example of how you can generate a string 30 characters long:
SQL> select dbms_random.string('P',30) from dual;
DBMS_RANDOM.STRING('P',30)

O=*KXh}#O`D~1'^VheWr:jZ>!:P}3
The package DBMS_RANDOM has several functions and procedures, one of which is STRING(), which accepts two parameters and returns a character value. The types of characters generated are determined by the first parameter — OPTION. Here are the options and the types of string generated:
Option  Type of String Produced 
U  Any uppercase alphabetic characters 
L  Any lowercase alphabetic characters 
A  Any mixed case alphabetic characters 
X  Any alphanumeric character in upper case 
P  Any printable character 
You can also use lowercase characters for the options; e.g., x instead of X. In this example, we used X, which produced a string of printable characters, not just alphabetic ones. Clearly, this kind of character string will not represent a customer name. Using an option such as “A” would generate a more appropriate string.
Populating a Table
Let’s start with a reallife example. In the case of Acme Bank’s application development, the team is mulling over the proper indexing, partitioning, physical layout, materialized view planning, and so on, for the ACCOUNTS table, which holds the customers’ account data. Here is how the table looks :
SQL> desc accounts
Name Null? Type
  
ACC_NO NOT NULL NUMBER
FIRST_NAME NOT NULL VARCHAR2(30)
LAST_NAME NOT NULL VARCHAR2(30)
ACC_TYPE NOT NULL VARCHAR2(1)
FOLIO_ID NUMBER
SUB_ACC_TYPE VARCHAR2(30)
ACC_OPEN_DT NOT NULL DATE
ACC_MOD_DT DATE
ACC_MGR_ID NUMBER
The columns are fairly selfexplanatory. To represent customers in the real world, the following requirements have been specified for the columns:
Column Name  Purpose  Data Pattern 
ACC_NO  Account Number  Any number less than 10 digits 
FIRST_NAME  The first name 

LAST_NAME  The last name  Any alphabetic character between four and 30 but 25 percent should be “Smith” 
ACC_TYPE  The type of account — Savings, Checking, and so on  20 percent each of S, C, M, D, and X 
FOLIO_ID  The folio ID from the other systems  Half NULL and the rest half a number related to the account number 
SUB_ACC_TYPE  If the customer is incorporated, then subaccount types, if any 75 percent null  From the values populated:

ACC_OPEN_DT  Date account was opened  A date between now and 500 days ago 
ACC_MGR_ID  The ID of the account manager servicing the account  There are five account managers, with account percentages distributed as follows:

As you can see, some fairly complex requirements were specified, but for good reasons. These accurately reflect how the data will be distributed in real life. In real life, there will be customers with first names like “Josh” and “Ellen,” not “XepqjEuF”; so, the names must be chosen from the set of possible names. And, in the U.S., people are called by a variety of last names. Thus, we want a semirandom distribution with 25 percent of a very popular last name, “Smith.”
Generating Random Numbers
Before we go any further, we have to discuss how to generate random numbers. The package dbms_random contains a function to return random numbers. If you want to return any random number between 2^31 and +2^31, use the function RANDOM. The following shows how we have generated numbers 10 times:
SQL> begin
2 for i in 1..10 loop
3 dbms_output.put_line('Random Number='dbms_random.random);
4 end loop;
5 end;
6 /
Random Number=81420432
Random Number=1024262734
Random Number=1965250926
Random Number=1439118604
Random Number=675429938
Random Number=466227661
Random Number=613708106
Random Number=1154140330
Random Number=643127572
Random Number=133140229
This function is available but deprecated. Oracle recommends using a new function called VALUE().
SQL> l
1 begin
2 for i in1..10 loop
3 dbms_output.put_line('Random Number='dbms_random.value);
4 end loop;
5* end;
SQL> /
Random Number=.547109841457281046373693994862305187
Random Number=.00500720861223234783817674992068380455
Random Number=.86974675431616311272549579082479240362
Random Number=.16306119784529083761710557238498944243
Random Number=.39327329046753189206427695323437381763
Random Number=.68307381430584611139249432690613072007
Random Number=.24640768414299435941101562183729221882
Random Number=.50568180275705934132098716939122047439
Random Number=.73345513921455391594299189253661168712
Random Number=.71764049051903979360796313613342367114
Note the difference; VALUE returns a positive random decimal number less than one, with 38 digits after the decimal point. In most cases, this is probably enough, with some modification (if you are looking for a 10digit whole number, for instance, simply multiply it by 10,000,000,000 and use FLOOR() to discard the decimals).
1 begin
2 for i in 1..10 loop
3 dbms_output.put_line('Random Number='
4 floor (10000000000 * dbms_random.value)
5 );
6 end loop;
7* end;
SQL> /
Random Number=939084911
Random Number=690705371
Random Number=4696700513
Random Number=7978266084
Random Number=5157885833
Random Number=8902042948
Random Number=5839885968
Random Number=6207324613
Random Number=5633096626
Random Number=1891871746
The function VALUE() is overloaded; another variation of the functions accepts a set of low and high values and returns a number between them. To generate a number between 10 and 20, for instance, you will use
dbms_random.value (10, 20)
Note: The numbers generated will be more than or equal to 10, but less than (never equal to) 20.
Using it to generate multiple values, we get
1 begin
2 for i in 1..10 loop
3 dbms_output.put_line('Random Number='
4 dbms_random.value (10,20)
5 );
6 end loop;
7* end;
SQL> /
Random Number=11.0908840421899990054735051046498510644
Random Number=14.5391525077977177619575126573478641174
Random Number=11.1343623028437386191996545034616142284
Random Number=14.8831724931888089801812805693409810454
Random Number=11.7294096895635532492353976105129628219
Random Number=15.0589018451126293465505440533299716691
Random Number=15.6152418028444556585301007183814067306
Random Number=18.2903839241741813303152060555229669855
Random Number=13.0914352695913402037626709927051618367
Random Number=12.6027933168087755621076348461280998683
As you can see, the numbers are generated with 37 digits after the decimal point. This is a very valuable function and will suffice for lmost types of random number generation. If you want to generate only whole numbers, use the same approach used earlier, using the FLOOR() function.
floor(dbms_random.value (10,20))
Generating Specific Characters and Strings
So, how can we generate specific strings that follow a predetermined statistical pattern?
We can borrow a page from the probability theory text to accomplish this. The trick is to use a method similar to MonteCarlo simulation used by statisticians worldwide. In this approach, we generate a random number, between one and 100 (both inclusive). Over a period of time, the probability that a specific number, say six, will turn up is exactly one time out of 100, or 1 percent of the time. In fact, all the numbers have 1/100 probability. Going by the same approach, the probability that either of two numbers — say, 1 and 2 — will be 2 percent. And, of course, the probability that any one of numbers between one and 10 will turn up is 10 percent. We will use this to configure the probability of the random value.
Take, for instance, the value of the column ACC_TYPE, which calls for equal probability of S, C, M, D and X; or 20 percent probability each. If we generate a whole number between one and five (both inclusive), the probability of each number will be 20 percent. Then we can use a DECODE() function to get the ACC_TYPE value.
SQL> select
2 decode (
3 floor (
4 dbms_random.value (1,6)
5 ),
6 1,'S',
7 2,'C',
8 3,'M',
9 4,'D',
10 'X'
11 )
12 from dual;
First, we are generating a number between one and five (line 4). Since the number is generated is less than the highest value passed as a parameter, we have specified six. And since we want a whole number, we have used the FLOOR() function in line 3. It truncates all decimal values from the generated number. Depending on the number obtained, we used DECODE() to get one of the values — S, C, M, D, or X. Since the numbers 1, 2, 3, 4, and 5 will have equal probability of being generated, so will be the letters — at 20 percent each.
This technique is very valuable for generating random, but useful values, as shown previously. The same approach can be used to generate almost all types of predetermined random values.
Random Values with NULLs
Remember, the requirement for FOLIO_ID is a little different. It needs only 50 percent of the values populated; the rest should be NULL. How can we achieve this?
Quite simply, we will use the same probability approach with a twist: we will use a determination of yes or no. Generating a random number between one and 100 will ensure 1 percent probability of each number. Hence, a number less than 51 will have exactly 50 percent probability of occurring. We can use this in a CASE statement to get the value.
SQL> select
2 case
3 when dbms_random.value (1,100) < 51 then null
4 else
5 floor(dbms_random.value(1,100))
6 end
7 from dual;
On line 3, we should check to see whether the number generated is less than 51. If so, we return NULL. Since the probability of a sub51 number is 50 percent , we have NULLs occurring 50 percent of the time as well. In the other 50 percent of the time, we have generated a value to be used as a FOLIO_ID.
Random Strings of Random Length
In dbms_random.string, a random string is generated, but of fixed length. That is not representative of real life; in reality, people have last names of varying lengths. In this example, the requirement is to have a length between four and 30 characters. To facilitate this, we can pass the length as a random number as well to the function dbms_random.string in line 6 below.
1 begin
2 for i in 1..10 loop
3 dbms_output.put_line('Random String='
4 dbms_random.string ( \
5 'A',
6 dbms_random.value(4,30)
7 )
8 );
9 end loop;
10* end;
SQL> /
Random String=RniQZGquFVJYFpGLOvtNd
Random String=GhcphpcsaCXlhigRQY
Random String=JtakoelUf
Random String=BgCOu
Random String=QFBzQxcHqGlHWkZFmnN
Random String=lSxVjqJvpwBB
Random String=jfhNARzALrLOKZRpOwnhrzz
Random String=KuFtdJcqQpjkrFmzFbzcXnYFGjWo
Random String=BhuZ
Random String=GebcqcgvzBfEpTYnJPmYAQdb
Notice that the strings are of different lengths. Remember, 25 percent of the last names must be “Smith,” and the rest must have random lengths. We can accomplish this by combining the random strings and the MonteCarlo approach:
decode (
floor(dbms_random.value(1,5)),
1,'Smith',
dbms_random.string ('A',dbms_random.value(4,30))
)
The previous expression will return “Smith” 25 percent of the time and a random alphabetic string between four and 30 characters long the rest of the time.
Putting it All Together
Now that you understood the building blocks of the randomization approach, you can put them together to build the account record generation PL/SQL block as shown below. In the following example, we are loading 100,000 records into the table ACCOUNTS. Here is the loading program in full:
begin
for l_acc_no in 1..100000 loop
insert into accounts
values
(
l_acc_no,
 First Name
decode (
floor(dbms_random.value(1,21)),
1, 'Alan',
2, 'Alan',
3, 'Barbara',
4, 'Barbara',
5, 'Charles',
6, 'David',
7, 'Ellen',
8, 'Ellen',
9, 'Ellen',
10, 'Frank',
11, 'Frank',
12, 'Frank',
13, 'George',
14, 'George',
15, 'George',
16, 'Hillary',
17, 'Iris',
18, 'Iris',
19, 'Josh',
20, 'Josh',
'XXX'
),
 Last Name
decode (
floor(dbms_random.value(1,5)),
1,'Smith',
dbms_random.string ('A',dbms_random.value(4,30))
),
 Account Type
decode (
floor(dbms_random.value (1,5)),
1,'S',2,'C',3,'M',4,'D','X'
),
 Folio ID
case
when dbms_random.value (1,100) < 51 then null
else
l_acc_no + floor(dbms_random.value(1,100))
end,
 Sub Acc Type
case
when dbms_random.value (1,100) < 76 then null
else
decode (floor(dbms_random.value (1,6)),
1,'S',2,'C',3,'C',4,'C',5,'C',null)
end,
 Acc Opening Date
sysdate  dbms_random.value(1,500),
 Account Manager ID
decode (
floor(dbms_random.value (1,11)),
1,1,2,1,3,1,4,1,5,2,6,3,7,4,8,5,9,5,10,5,0
)
);
end loop;
commit;
end;
Now, how do we know that all these exercise yielded fruit? After this table is loaded, let’s see the actual distribution:
SQL> select first_name, count(*) from accounts group by first_name;
FIRST_NAME COUNT(*)
 
Alan 9834
Barbara 10224
Charles 5046
David 4980
Ellen 15094
Frank 14960
George 14890
Hillary 4898
Iris 10009
Josh 10065
Great! The distribution for each first name is exactly we wanted. For instance, we wanted to have 10 percent rows with first name “Alan,” and we got 9,834 out or 100,000; this equates to approximately 10 percent . We wanted 15 percent with the name “Ellen” and we got 15.094% — pretty close to that number and statistically significant. Similarly, you can go through all the other columns and see how they are actually distributed.
Conclusion
Generating random values in PL/SQL is a muchsoughtafter, but often lessunderstood aspect of the PL/SQL language. As we saw in the previously shown case, generating random numbers or strings is not what is most often needed. The real need is to generate data representing the real world, which requires a different approach, using the same builtin functions. In this article, you learned how to generate random values following a predetermined statistical approach, and then how to use it to populate a system mimicking real life.
Further Reading
An updated copy of this article may be found on www.proligence.com/downloads.html.

Arup Nanda is the Lead DBA at Starwood Hotels & Resorts. He has been an Oracle DBA for more than 11 years, touching all aspects of database management — from modeling to performance tuning to disaster recovery. He is the coauthor of the book Oracle Privacy Security Auditing (2003, Rampant Tech Press).
Contributors : Arup Nanda
Last modified 20050418 09:23 PM