The Art and Science of Randomization in Oracle

by Arup Nanda

Why Generate Random Values

Databases are supposed to be for maintaining factual data — with purity, integrity and reliability. So why are we interested in generating random data? The answer to this is that data is a representation of real-world facts and figures. While building a system, we try to mimic the real-life data when testing proper indexing, materialized view creation, partitioning, optimization paths, access methodologies, and so on. The trick is to generate such a data set that is random enough to represent the real world, yet follows the same statistical pattern as the target database. Note the key words here — it must be random, but needs to follow the same statistical pattern as real life.

So how about generating some values such as the following:

-32nr -32nr3121ne –e21e
323-=11r- r
0-vmdw-dwv0-[o- rr0-32r2 0
r4i32r –rm32r3p=x ewifef-432fr32o3-==

The values in this previous code example are all generated by randomly pecking the keys of the keyboard. Random enough, right?

Not quite. Consider the following: What if you are generating a data set to test out a new application by populating bank accounts. In the U.S., account last names vary widely, but first names are usually from a limited set — John, Jane, Michael, and so on. While generating a sample, you may want to make sure that the first names are populated from actual names such as these only, not from some random, arbitrary characters as shown in the previous example. They must be randomly distributed throughout the table in such a way that the optimizer will follow the same path as in a real-life system. Further, you may want it to follow some more statistical patterns (e.g., 25 percent of the first names should be John, eight percent should be Michael, and so on). Creating a data set that closely mimics real life is important in building a test bench to test application and database development.

On the surface this seems like a trivial and/or fruitless exercise; but in reality, it is one of the most important activities of system design. A few examples will perhaps explain why.

Building a Perfect System

Say an application is going to be developed at Acme Bank. The team is discussing the nuts and bolts of the project — everything from tablespace and file-layout to indexing and partitioning schemes. Their biggest problem is that the database they are designing will contain projected data only, and no meaningful data exists with which to test their plan.

Jill, the lead developer, is mulling over the type of indexing — should she use regular tables with b-tree indexes or Index Organized Tables? How would she know without generating some type of data?

Next, Deborah is wondering whether partitioning will help or hurt the performance. It would make management easier, sure; but query performance? To test her ideas, she needs a table prepopulated with sample data.

Cathy, the system architect, is proposing a few materialized views. But will this performance enhancement justify the stale data?

At the same time, John, the DBA, is thinking about the best partitioning strategy and is conflicted about whether to use range or to use hash.

Finally, Jack, another DBA, is thinking about placing the datafiles for certain tablespaces on fast disks; but which ones?

These are not hypothetical problems, but real-life challenges faced by a team designing a system for optimal performance and value. There has never been a lack of ideas, but a common problem all these people face is the lack of a dataset that represents the database they are going to build. They could build such a dataset from scratch, but they don’t have the means to enter each and every piece of data by hand.

Solution?

The answer to this problem is to generate random values, but not arbitrary ones. But the values must be distributed following some statistical pattern to reflect real life; a mere update will not help in this case. For instance, suppose your requirement calls for distributing first names as follows:

John — 30 percent
Abby — 10 percent
Scott — 5 percent
Jill — 5 percent Mike — 50 percent

There are 1,000,000 rows in the test table, so you could specify the following to achieve this pattern:

To update 350,000 or 35 percent of 1 million rows:

update accounts set first_name = ‘John’
where rownum < 350001;

To update the next 5 percent of 1 million rows, or 50,000 records, to Abby:

update accounts set first_name = ‘Abby’
where first_name != ‘John’
and rownum < 50000

... and so on. This procedure will satisfy the overall distribution of the values, but the values will be distributed as follows: The first 35 percent of the records will be for the first-name John, which just won’t happen in real life. In a real table, the first name John will be distributed throughout the table, not in the first 35 percent of the rows. This imperfect distribution would impact heavily on operations such as index scans. Since the optimizer decides between full-table scan and index scan based on on how the data is distributed, its choice would be greatly affected by the unusual, lopsided availability of values at the head end of the table.

Similarly, these values would probably be concentrated in a small portion of the disk or disk groups, and this would skew the results of the I/O test. So, such an unrealistic distribution of values of the first name John is neither helpful nor desirable.

Other Uses

Values must also be randomly generated when you are creating Web site user IDs for registered users, or when you are generating temporary passwords. Although these are jumbled characters and numbers, the data must fit a pattern (e.g., passwords must be between eight and 12 characters and should have at least four numbers and four characters, and so on).

Although it’s not important anymore in Oracle 10g, the encryption feature requires random keys to be generated. And the keys must be random enough to be difficult to guess.

Building a Random Character Value Generator

Oracle Database 10g provides a random value generator via a package named DBMS_RANDOM, which can be used to generate random values. However, those values are arbitrary, not representative of the real-life system. For instance, the following an example of how you can generate a string 30 characters long:

SQL> select dbms_random.string('P',30) from dual;

DBMS_RANDOM.STRING('P',30)
-----------------------------------------------------
O=*KXh}#O`D~1|'^VheWr:jZ>!:P}3

The package DBMS_RANDOM has several functions and procedures, one of which is STRING(), which accepts two parameters and returns a character value. The types of characters generated are determined by the first parameter — OPTION. Here are the options and the types of string generated:

Option	Type of String Produced
U	Any uppercase alphabetic characters
L	Any lowercase alphabetic characters
A	Any mixed case alphabetic characters
X	Any alphanumeric character in upper case
P	Any printable character

You can also use lowercase characters for the options; e.g., x instead of X. In this example, we used X, which produced a string of printable characters, not just alphabetic ones. Clearly, this kind of character string will not represent a customer name. Using an option such as “A” would generate a more appropriate string.

Populating a Table

Let’s start with a real-life example. In the case of Acme Bank’s application development, the team is mulling over the proper indexing, partitioning, physical layout, materialized view planning, and so on, for the ACCOUNTS table, which holds the customers’ account data. Here is how the table looks :

SQL> desc accounts
 Name              Null?    Type             
 ----------------- -------- ------------
 ACC_NO            NOT NULL NUMBER 
 FIRST_NAME        NOT NULL VARCHAR2(30)
 LAST_NAME         NOT NULL VARCHAR2(30)
 ACC_TYPE          NOT NULL VARCHAR2(1)
 FOLIO_ID                   NUMBER
 SUB_ACC_TYPE               VARCHAR2(30) 
 ACC_OPEN_DT       NOT NULL DATE
 ACC_MOD_DT                 DATE
 ACC_MGR_ID                 NUMBER

The columns are fairly self-explanatory. To represent customers in the real world, the following requirements have been specified for the columns:

Column Name	Purpose	Data Pattern
ACC_NO	Account Number	Any number less than 10 digits
FIRST_NAME	The first name	10% Alan 10% Barbara 5% Charles 5% David 15% Ellen 20% Frank 10% George 5% Hillary 10% Iris 10% Josh
LAST_NAME	The last name	Any alphabetic character between four and 30 but 25 percent should be “Smith”
ACC_TYPE	The type of account — Savings, Checking, and so on	20 percent each of S, C, M, D, and X
FOLIO_ID	The folio ID from the other systems	Half NULL and the rest half a number related to the account number
SUB_ACC_TYPE	If the customer is incorporated, then sub-account types, if any 75 percent null	From the values populated: 5% — S 20% — C
ACC_OPEN_DT	Date account was opened	A date between now and 500 days ago
ACC_MGR_ID	The ID of the account manager servicing the account	There are five account managers, with account percentages distributed as follows: 1 — 40 percent 2 — 10 percent 3 — 10 percent 4 —10 percent 5 — 30 percent

As you can see, some fairly complex requirements were specified, but for good reasons. These accurately reflect how the data will be distributed in real life. In real life, there will be customers with first names like “Josh” and “Ellen,” not “XepqjEuF”; so, the names must be chosen from the set of possible names. And, in the U.S., people are called by a variety of last names. Thus, we want a semi-random distribution with 25 percent of a very popular last name, “Smith.”

Generating Random Numbers

Before we go any further, we have to discuss how to generate random numbers. The package dbms_random contains a function to return random numbers. If you want to return any random number between -2^31 and +2^31, use the function RANDOM. The following shows how we have generated numbers 10 times:

SQL> begin  
  2     for i in 1..10 loop 
  3        dbms_output.put_line('Random Number='||dbms_random.random); 
  4     end loop;  
  5  end;  
  6  /
Random Number=-81420432
Random Number=-1024262734
Random Number=-1965250926
Random Number=1439118604
Random Number=675429938
Random Number=-466227661
Random Number=613708106
Random Number=-1154140330
Random Number=-643127572
Random Number=-133140229

This function is available but deprecated. Oracle recommends using a new function called VALUE().

SQL> l  
  1  begin  
  2     for i in1..10 loop  
  3        dbms_output.put_line('Random Number='||dbms_random.value); 
  4     end loop;  
  5* end;
SQL> /
Random Number=.547109841457281046373693994862305187
Random Number=.00500720861223234783817674992068380455
Random Number=.86974675431616311272549579082479240362
Random Number=.16306119784529083761710557238498944243
Random Number=.39327329046753189206427695323437381763
Random Number=.68307381430584611139249432690613072007
Random Number=.24640768414299435941101562183729221882
Random Number=.50568180275705934132098716939122047439
Random Number=.73345513921455391594299189253661168712
Random Number=.71764049051903979360796313613342367114

Note the difference; VALUE returns a positive random decimal number less than one, with 38 digits after the decimal point. In most cases, this is probably enough, with some modification (if you are looking for a 10-digit whole number, for instance, simply multiply it by 10,000,000,000 and use FLOOR() to discard the decimals).

   1  begin  
  2     for i in 1..10 loop  
  3        dbms_output.put_line('Random Number='|| 
  4            floor (10000000000 * dbms_random.value) 
  5        ); 
  6     end loop;  
  7* end;
SQL> /
Random Number=939084911
Random Number=690705371
Random Number=4696700513
Random Number=7978266084
Random Number=5157885833
Random Number=8902042948
Random Number=5839885968
Random Number=6207324613
Random Number=5633096626
Random Number=1891871746

The function VALUE() is overloaded; another variation of the functions accepts a set of low and high values and returns a number between them. To generate a number between 10 and 20, for instance, you will use

dbms_random.value (10, 20)

Note: The numbers generated will be more than or equal to 10, but less than (never equal to) 20.

Using it to generate multiple values, we get

  1  begin  
  2     for i in 1..10 loop  
  3        dbms_output.put_line('Random Number='|| 
  4           dbms_random.value (10,20) 
  5        ); 
  6     end loop;  
  7* end;
SQL> /
Random Number=11.0908840421899990054735051046498510644
Random Number=14.5391525077977177619575126573478641174
Random Number=11.1343623028437386191996545034616142284
Random Number=14.8831724931888089801812805693409810454
Random Number=11.7294096895635532492353976105129628219
Random Number=15.0589018451126293465505440533299716691
Random Number=15.6152418028444556585301007183814067306
Random Number=18.2903839241741813303152060555229669855
Random Number=13.0914352695913402037626709927051618367
Random Number=12.6027933168087755621076348461280998683

As you can see, the numbers are generated with 37 digits after the decimal point. This is a very valuable function and will suffice for lmost types of random number generation. If you want to generate only whole numbers, use the same approach used earlier, using the FLOOR() function.

floor(dbms_random.value (10,20))

Generating Specific Characters and Strings

So, how can we generate specific strings that follow a predetermined statistical pattern?

We can borrow a page from the probability theory text to accomplish this. The trick is to use a method similar to Monte-Carlo simulation used by statisticians worldwide. In this approach, we generate a random number, between one and 100 (both inclusive). Over a period of time, the probability that a specific number, say six, will turn up is exactly one time out of 100, or 1 percent of the time. In fact, all the numbers have 1/100 probability. Going by the same approach, the probability that either of two numbers — say, 1 and 2 — will be 2 percent. And, of course, the probability that any one of numbers between one and 10 will turn up is 10 percent. We will use this to configure the probability of the random value.

Take, for instance, the value of the column ACC_TYPE, which calls for equal probability of S, C, M, D and X; or 20 percent probability each. If we generate a whole number between one and five (both inclusive), the probability of each number will be 20 percent. Then we can use a DECODE() function to get the ACC_TYPE value.

SQL> select  
  2        decode (  
  3           floor ( 
  4              dbms_random.value (1,6) 
  5           ), 
  6           1,'S', 
  7           2,'C', 
  8           3,'M', 
  9           4,'D',
 10             'X' 
 11        ) 
 12  from dual;

First, we are generating a number between one and five (line 4). Since the number is generated is less than the highest value passed as a parameter, we have specified six. And since we want a whole number, we have used the FLOOR() function in line 3. It truncates all decimal values from the generated number. Depending on the number obtained, we used DECODE() to get one of the values — S, C, M, D, or X. Since the numbers 1, 2, 3, 4, and 5 will have equal probability of being generated, so will be the letters — at 20 percent each.

This technique is very valuable for generating random, but useful values, as shown previously. The same approach can be used to generate almost all types of pre-determined random values.

Random Values with NULLs

Remember, the requirement for FOLIO_ID is a little different. It needs only 50 percent of the values populated; the rest should be NULL. How can we achieve this?

Quite simply, we will use the same probability approach with a twist: we will use a determination of yes or no. Generating a random number between one and 100 will ensure 1 percent probability of each number. Hence, a number less than 51 will have exactly 50 percent probability of occurring. We can use this in a CASE statement to get the value.

SQL> select  
  2        case 
  3             when dbms_random.value (1,100) < 51 then null 
  4        else 
  5             floor(dbms_random.value(1,100)) 
  6        end  
  7  from dual;

On line 3, we should check to see whether the number generated is less than 51. If so, we return NULL. Since the probability of a sub-51 number is 50 percent , we have NULLs occurring 50 percent of the time as well. In the other 50 percent of the time, we have generated a value to be used as a FOLIO_ID.

Random Strings of Random Length

In dbms_random.string, a random string is generated, but of fixed length. That is not representative of real life; in reality, people have last names of varying lengths. In this example, the requirement is to have a length between four and 30 characters. To facilitate this, we can pass the length as a random number as well to the function dbms_random.string in line 6 below.

  1  begin  
  2     for i in 1..10 loop  
  3        dbms_output.put_line('Random String='|| 
  4           dbms_random.string ( \
  5               'A', 
  6               dbms_random.value(4,30) 
  7           ) 
  8        ); 
  9     end loop; 
 10* end;
SQL> /
Random String=RniQZGquFVJYFpGLOvtNd
Random String=GhcphpcsaCXlhigRQY
Random String=JtakoelUf
Random String=BgCOu
Random String=QFBzQxcHqGlHWkZFmnN
Random String=lSxVjqJvpwBB
Random String=jfhNARzALrLOKZRpOwnhrzz
Random String=KuFtdJcqQpjkrFmzFbzcXnYFGjWo
Random String=BhuZ
Random String=GebcqcgvzBfEpTYnJPmYAQdb

Notice that the strings are of different lengths. Remember, 25 percent of the last names must be “Smith,” and the rest must have random lengths. We can accomplish this by combining the random strings and the Monte-Carlo approach:

   decode (        
         floor(dbms_random.value(1,5)),           
            1,'Smith',           
            dbms_random.string ('A',dbms_random.value(4,30))     
      )

The previous expression will return “Smith” 25 percent of the time and a random alphabetic string between four and 30 characters long the rest of the time.

Putting it All Together

Now that you understood the building blocks of the randomization approach, you can put them together to build the account record generation PL/SQL block as shown below. In the following example, we are loading 100,000 records into the table ACCOUNTS. Here is the loading program in full:

begin
   for l_acc_no in 1..100000 loop
   insert into accounts
   values  
   (
      l_acc_no,     
      -- First Name
      decode (         
         floor(dbms_random.value(1,21)),
         1, 'Alan',
         2, 'Alan',
         3, 'Barbara',
         4, 'Barbara',
         5, 'Charles',
         6, 'David',
         7, 'Ellen',
         8, 'Ellen',
         9, 'Ellen',
         10, 'Frank',
         11, 'Frank',
         12, 'Frank',
         13, 'George',
         14, 'George',
         15, 'George',
         16, 'Hillary',
         17, 'Iris',
         18, 'Iris',
         19, 'Josh',        
         20, 'Josh',           
             'XXX'
      ),
      -- Last Name
      decode (        
         floor(dbms_random.value(1,5)),           
            1,'Smith',           
            dbms_random.string ('A',dbms_random.value(4,30))     
      ),
      -- Account Type
      decode (        
         floor(dbms_random.value (1,5)),           
             1,'S',2,'C',3,'M',4,'D','X'     
      ),
      -- Folio ID     
      case           
           when dbms_random.value (1,100) < 51 then null     
      else           
           l_acc_no + floor(dbms_random.value(1,100))      
      end,
      -- Sub Acc Type     
      case           
           when dbms_random.value (1,100) < 76 then null     
      else
          decode (floor(dbms_random.value (1,6)),            
             1,'S',2,'C',3,'C',4,'C',5,'C',null)      
      end,
      -- Acc Opening Date
      sysdate - dbms_random.value(1,500),
      -- Account Manager ID
      decode (        
         floor(dbms_random.value (1,11)),           
            1,1,2,1,3,1,4,1,5,2,6,3,7,4,8,5,9,5,10,5,0     
      )
   );
   end loop;
   commit;
end;

Now, how do we know that all these exercise yielded fruit? After this table is loaded, let’s see the actual distribution:

SQL> select first_name, count(*) from accounts group by first_name;

FIRST_NAME                      COUNT(*)
------------------------------ ----------
Alan                                9834
Barbara                            10224
Charles                             5046
David                               4980
Ellen                              15094
Frank                              14960
George                             14890
Hillary                             4898
Iris                               10009
Josh                               10065

Great! The distribution for each first name is exactly we wanted. For instance, we wanted to have 10 percent rows with first name “Alan,” and we got 9,834 out or 100,000; this equates to approximately 10 percent . We wanted 15 percent with the name “Ellen” and we got 15.094% — pretty close to that number and statistically significant. Similarly, you can go through all the other columns and see how they are actually distributed.

Conclusion

Generating random values in PL/SQL is a much-sought-after, but often less-understood aspect of the PL/SQL language. As we saw in the previously shown case, generating random numbers or strings is not what is most often needed. The real need is to generate data representing the real world, which requires a different approach, using the same built-in functions. In this article, you learned how to generate random values following a predetermined statistical approach, and then how to use it to populate a system mimicking real life.

DBAzine.com

Sections

Personal tools

Menu

Who Are You?