Collection vs. IN-LIST. Performance Study

Objective of the study

Several years ago I needed to determine the efficiency of collections vs. IN-LIST. The ultimate goal was to migrate an application from using huge IN-LIST that made query text as big as 2M. Goal of this study was to figure whether and/or when PL/SQL tables (collections) become useful vs. IN-LIST predicates. Is it possible to build competitive code on top of this solution using JDBC/PRO*C or SQL*Plus.

I took part in changing two big applications since this study was completed. As I’m reviewing this study right now, collections proved to be very efficient substitution for IN-LISTs. It worked in PRO*C and JAVA code and was similar across Oracle 8i, 9i and 10g. There were of course certain limitations and issues that were needed special attention.

Introduction

There were many questions regarding retrieving data from database according to some predefined data set. Originally, it is done using sql statements similar to the following: 

SELECT col1,col2,col3...
  From tab1,tab2,tab3
  Where tab1.col1 in (:1, :2, :3, ...) ...;

That method was convenient till the list become quiet big. At some point Oracle restricted using lists of more then 1000 values. Error “ORA-01795: maximum number of expressions in a list is 1000″. The work around was to combine predicates using or condition, like this: 

SELECT col1,col2,col3...
  From tab1,tab2,tab3
  Where (tab1.col1 in (:1, :2, :3, ..., :1000) or tab1.col1 in (:1001, :1002 ...)) ...;

or

SELECT col1,col2,col3...
  From tab1,tab2,tab3
  Where (tab1.col1 in (:1, :2, :3, ..., :1000)) ...
 UNION ALL
  SELECT col1,col2,col3...
  From tab1,tab2,tab3
  Where (tab1.col1 in (:1001, :1002 ...)) ...
 UNION ALL ...;

Still the overall size of the query was limited and developers were free to run several queries to find final row set. This method called chunking. Of course, sorting and finding distinctive sets were on the shoulders of the developers. The other option was to set some table to contain the predefined set and subsequently use it in a statement like this: 

SELECT col1,col2,col3...
  From tab1,tab2,tab3
  Where tab1.col1 in (select driving_set_column from some_temp_table) ...;

Unfortunately, this solution involves slow operation of insertion data into a table.

In Oracle8i it become possible to use different technique – object types and casting them to a table structure. Like in following statement: 

SELECT col1,col2,col3...
  From tab1,tab2,tab3
  Where tab1.col1 in (select column_name from table(cast (some_collection as some_object_type))) ...;

Defining test methods and terms

Single test using SQL*Plus showed that parse/execution time to fill up a collection table exceeds all possible advantages, unless it is done using data from other database objects. It was decided not to pursue this path in the tests. Let’s see if JDBC would do it. I used JDBC driver from classes12.zip from Oracle distribution. Any given test would consist of multiple queries with equal size IN-LIST sets. Queries were fed to the server until all required values were processed. So, the total input size didn’t vary between tests, only IN-LIST set sizes. Input and output result sets were guarantied to be exactly the same in any given test. The test case for collections was using two additional PL/SQL procedures to put data into it. Data binding was done using ARRAY class. Resulting times included time spent to call those procedures.

Obviously, type of cursor sharing would change the test outcome. Thus, tests were done with both exact and similar cursor sharing parameter. There were four sets of tests: generic IN-LIST with similar and exact cursor sharing and object type IN-LISTS with similar and exact cursor sharing. IN-LIST set sizes were chosen to be 10, 20, 40, 60, 80, 100, 200, 400, 800, 1000, 2000, 4000, 20000, 40000.Prefix for generic IN-LISTs with similar cursor sharing will be G-SIM; for generic IN-LISTs with exact cursor sharing – G-EX; for object type IN-LISTs with similar cursor sharing T-SIM and T-EX for object type IN-LISTs with exact cursor sharing, where suffix is a size of IN-LIST in a single query.In each test there was I/O wait time that was ignored. It represents disc access time and is not helpful when hard disk speed was not constant and was affected by multiple factors. The number of blocks read in one test proven to be virtually same. Hence, possible disc speed fluctuation between tests can be substructed and ignored for the further analysis.

Results

Execution time minus I/O time:
 

+-----------------------------------------------------------+
|                IN-LIST               |                    |
+--------------------------------------+     COLLECTION     |
|      Similar      |       Exact      |                    |
+-------------------+------------------+--------------------+
| G-SIM_10     7.80 | G-EX_10   22.65  |    T-EX_10   13.69 |
| G-SIM_20     6.65 | G-EX_20   14.58  |    T-EX_20    8.96 |
| G-SIM_40     5.98 | G-EX_40   10.28  |    T-EX_40    6.72 |
| G-SIM_60     5.69 | G-EX_60    8.96  |                    |
| G-SIM_80     5.68 | G-EX_80    8.56  |    T-EX_80    5.80 |
| G-SIM_100    5.53 | G-EX_100   8.72  |    T-EX_100   5.27 |
| G-SIM_200    5.64 | G-EX_200   7.76  |    T-EX_200   5.08 |
| G-SIM_400    9.46 | G-EX_400   8.72  |    T-EX_400   4.96 |
| G-SIM_800   10.30 | G-EX_800  10.89  |    T-EX_800   4.95 |
| G-SIM_1000  13.28 | G-EX_1000 11.78  |    ...             |
| G-SIM_1500  28.90 | ...              |    ...             |
| G-SIM_2000  41.68 | G-EX_2000 17.90  |    T-EX_2000  4.83 |
| G-SIM_4000 347.69 | G-EX_4000 30.90  |    T-EX_4000  4.65 |
| ...               |                  |    T-EX_20000 4.76 |
+-----------------------------------------------------------+

Analysis

Let’s see how to use these results and what the obstacles are and where to watch out for reefs. As one can see the results greatly depends on the cursor sharing mode. The mechanism of sharing cursors between queries is based on the similarity of a query text source. If exact sharing is used then the query text should have exact match in shared pool to make sharing possible. “Similar cursor sharing” is possible if queries differ only in literal values. Precisely, number of IN-LIST members in our subject query should be the same only their values can vary. Oracle then separates those literals from the query internally rewriting it to use bind place holders.

Gaining high performance from sharing cursors in “exact” environment for IN-LISTs is virtually impossible (what good can those queries do if they are using same predefined sets of data?). For instance in test  G-EX_10 74% of execution time was spent to parse new queries. Whatever we do we can’t speed up this particular operation.Environment with similar cursor sharing more flexible and favorable for queries with IN-LIST predicates. Several things must not be forgotten. One is that developers should implement some universal interface to form equally sized IN-LISTs “chunked” queries. As we see the optimum would be near 100 elements per query. Second, there should be a collector that would parse results from different queries and combine them back to a final result set. Third, the internal mechanism of removing values from place holders and binding them back seems to be performing badly on large queries (starting from around 800 literals per query – see test G-SIM_800). Lastly, this would not work unless read consistency is guaranteed across all smallish queries.

Using collection objects as a substitution of original IN-LIST predicates produced stable deterministic performance results. Those queries do not depend on whether the environment set to share similar or exact queries. This because the quiry have exact same text no matter how many elements is in IN-LIST. We didn’t reach limit of number of elements in these tests, in contrary to generic IN-LISTs where the queries with more then 10000 IN-LIST elements were unstable and often produced Oracle errors. In fact they all failed after 20000 elements for one or another reason.

Summary

It’s not easy to decide which way to choose in general application. What kind of queries are ran in the application? When the number of small queries ran against a database greatly outweigh number of large ones it is wise to use similar cursor sharing and made those queries match each other, so their literals are the only what differentiate them. Fighting with big amount of large queries is not that easy. The best performing solution is to have them chopped into several separate queries with around 100 elements in IN-LIST. The application then should take care of combining results. But most important, as already mentioned this would not work unless read consistency is guaranteed across all queries. All this doesn’t make application’s persistent layer any simpler. If it is even possible to change application, advantage of using collection types for IN-LISTs should be taken. Ones it’s done bigger size of a driving result set will have no influence on the performance of queries. The relative speed (time to process one element in this set) will only rise.

Complete test data can be found here.

One Response to “Collection vs. IN-LIST. Performance Study”

  1. Sachin Says:

    Useful document – I tried getting the test data but the link http://www.fourthelephant.com/blog/uploads/collection_vs_inlist_data.txt did not open.

    Can you please upload the test data.

    We are exploring collection way of using IN-LIST.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: