Bulk Data Loading: The available approaches

This article explores the various programmatic methods of moving large amounts of data between Oracle tables and presents the relative pros and cons of each.

Database solutions such as data warehouses and data conversion/migration projects can require very large amounts of data to be moved in a highly efficient and robust manner. What solutions are available? There are plenty of ETL tools on the market, but often these are expensive, complex, and time-consuming to implement and configure. Developing a simple PL/SQL-based solution may be all you need, but what is the best approach to satisfy your specific requirements?

This article is directed primarily at developers wanting to develop bulk data loading programs in PL/SQL.

Background

The first part of this article identifies the available approaches for bulk loading data and compares the pros and cons of each. The second part then looks at the refinements that you may want to consider adding to your solution.

The Approaches

Create Table As (CTAS)

Creating a table based on a SELECT statement neither requires UNDO space nor writes to the redo log. It is the quickest and simplest method of loading a temporary table, and at minimal execution cost the result can be added to a partitioned table using a mechanism known as partition exchange, as shown in the sketch below.

CREATE TABLE emp_temp AS SELECT ...

CREATE TABLE emp
  PARTITION ...
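
As a minimal sketch, assume a source table emp_source feeding a staging table emp_temp, which is then exchanged into a partition p_2004 of the partitioned table emp; all of these names, columns, and the partitioning scheme are illustrative only:

CREATE TABLE emp_temp NOLOGGING AS
  SELECT empno, ename, deptno, sal
  FROM   emp_source
  WHERE  hire_year = 2004;

-- Swap the loaded staging table into the partitioned table. This is a
-- data dictionary operation only, so no rows are physically moved.
ALTER TABLE emp
  EXCHANGE PARTITION p_2004 WITH TABLE emp_temp;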
Pros
  • Fastest method available to load between tables.
  • No Undo space is used.
  • No writes to the redo log performed.
Cons
  • Useful for loading into partitioned tables or temporary tables; otherwise the benefits are lost if the final destination is a permanent non-partitioned table.
  • All or nothing if Oracle errors are encountered during loading, i.e. exceptions cannot be trapped and bad rows siphoned off to one side.
  • No constraints or predefined precision on data types to trap data errors.
  • Partition exchange requires the temporary table to be of identical shape to the partitioned table, i.e. it must have the full complement of columns in the correct sequence.

Insert Append

The next best performing method after CREATE TABLE AS is the INSERT APPEND technique. APPEND is not a syntactical component of the INSERT command, but a tuning hint recognised within the /*+ */ notation. This approach does have to use UNDO (or rollback) space, but writing to the redo log can be avoided by specifying the NOLOGGING property on the table. This effectively provides the equivalent of the direct-path loading achievable with SQL*Loader.

INSERT /*+ APPEND */ INTO ... SELECT ...
Pros
  • A very fast method of loading (second to CREATE TABLE AS).
  • Able to load into any target table.
  • Can insert a sub-set of columns.
  • Able to enable Oracle’s parallel processing during the insert.
Cons
  • Hindered by the excessive use of UNDO, which needs to be sized very high when loading very large sets of data.
  • Long retention of rollback can cause ORA-01555 “snapshot too old” errors. This is not so much of a problem with Oracle 10g.
  • All or nothing loading if an Oracle error is encountered, followed by a potentially long rollback period, i.e. no opportunity to trap bad rows.

There is no practical means to batch inserts into smaller chunks to avoid rollback issues, unless the data has a convenient key to do this, e.g. a serial number range, date range, etc. Even if the data could be batched on a suitable key, the efficiency of the execution plan for each batch would need to be adequate to justify the multiple executions, particularly if full table scans on large tables are involved.
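
As a brief sketch of the technique (emp and emp_source are assumed names):

-- NOLOGGING allows the direct-path insert to bypass most redo generation
ALTER TABLE emp NOLOGGING;

INSERT /*+ APPEND */ INTO emp (empno, ename, deptno, sal)
SELECT empno, ename, deptno, sal
FROM   emp_source;

-- The table cannot be read in this session until the insert is committed
COMMIT;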

Forall

This is a PL/SQL approach that has the performance to compete fairly well with the INSERT APPEND method and the flexibility to contend with the limitations of the methods discussed so far. To be more accurate, the FORALL command needs to be combined with a BULK COLLECT cursor fetch to be most effective.

BULK COLLECT enables a single fetch to populate arrays mapped to the columns of the cursor, retrieving either the full set of data or a limited number of rows (via the LIMIT clause). The buffer size, i.e. the number of entries in each array, is governed by the amount of PGA memory available.

The FORALL command then enables DML commands such as INSERT to be used in conjunction with arrays as bind variables (known as bulk binding). Effectively, an INSERT with bind variables is fired at the SQL engine only once, with the data passed via an array for each bind variable.

To reduce the number of arrays and bind variables used, it is possible to use record types with both BULK COLLECT and FORALL, as long as the columns of the record type are not referenced individually.

FORALL i IN 1 .. v_rec_typ.COUNT
  INSERT INTO ... VALUES v_rec_typ(i);

or

FORALL i IN 1 .. v_rec_typ.COUNT
  UPDATE ... SET ROW = v_rec_typ(i) WHERE id = v_id(i);

FORALL is able to trap bad rows and exclude them from the DML operation by specifying the SAVE EXCEPTIONS clause. This option saves the offending array index and SQLCODE into a pseudo-array (SQL%BULK_EXCEPTIONS) which can be accessed by the PL/SQL program. Being able to trap exceptions also provides the added advantage of responding to expected errors such as a duplicate value on an index and performing an UPDATE instead. In Oracle9i the MERGE command combines the INSERT/UPDATE logic into one statement, and helps to simplify this type of requirement.

Determining the optimal batch size is a trade-off between excessive use of UNDO and the amount of PGA memory the PL/SQL program has available to store the arrays.
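
Putting the pieces together, a minimal sketch of the batched BULK COLLECT/FORALL loop might look like the following; emp_source, emp_target and the batch size of 1,000 are illustrative assumptions:

DECLARE
  CURSOR c_src IS
    SELECT * FROM emp_source;

  TYPE t_emp_tab IS TABLE OF c_src%ROWTYPE;
  v_emp_tab t_emp_tab;

  -- ORA-24381 is raised when SAVE EXCEPTIONS has collected errors
  bulk_errors EXCEPTION;
  PRAGMA EXCEPTION_INIT(bulk_errors, -24381);
BEGIN
  OPEN c_src;
  LOOP
    FETCH c_src BULK COLLECT INTO v_emp_tab LIMIT 1000;
    EXIT WHEN v_emp_tab.COUNT = 0;

    BEGIN
      FORALL i IN 1 .. v_emp_tab.COUNT SAVE EXCEPTIONS
        INSERT INTO emp_target VALUES v_emp_tab(i);
    EXCEPTION
      WHEN bulk_errors THEN
        -- Report each bad row; good rows in the batch are still inserted
        FOR j IN 1 .. SQL%BULK_EXCEPTIONS.COUNT LOOP
          DBMS_OUTPUT.PUT_LINE(
            'Bad row at index ' || SQL%BULK_EXCEPTIONS(j).ERROR_INDEX ||
            ': ' || SQLERRM(-SQL%BULK_EXCEPTIONS(j).ERROR_CODE));
        END LOOP;
    END;

    COMMIT;  -- commit per batch to keep UNDO use down
  END LOOP;
  CLOSE c_src;
END;
/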

Pros
  • Performance is on par with INSERT APPEND up until a threshold batch size.
  • Able to trap exceptions on individual bad rows.
  • Able to batch up inserts into more manageable chunks followed by COMMITs and therefore avoid excessive UNDO use.
Cons
  • Batch sizes are limited by the amount of PGA memory available.
  • Unable to avoid writes to the redo log.
  • Exception handling is very slow in Oracle9i (not really an issue if few exceptions are being trapped).
  • Committing across fetches can cause ORA-01555 errors to be encountered (depending on what other activity is occurring on the database).

Build Your Own Application

Which Approach

Deciding which method to use for your application depends very much on your requirements. If all of your target tables are always partitioned and you can fully guarantee the integrity of the data you are loading, then exploit the CREATE TABLE AS approach. If your target tables are non-partitioned, then consider INSERT APPEND. If you want ultimate flexibility, expect “dirty data”, and are happy to compromise slightly on performance, then FORALL would be the most attractive approach.

Refinements

Disable Indexes

Avoiding the overhead of maintaining indexes whilst loading into a table can save considerable processing time, particularly if a table has a number of multi-column indexes. The traditional method of achieving this has been to drop the indexes prior to loading and recreate them afterwards, but there is now a more convenient means of managing the overhead of indexes.

Indexes are not disabled and enabled in the same way as constraints, but are instead made UNUSABLE. Whilst an index is in this state it is not maintained during DML operations on the table, which, as the term suggests, also makes it completely unusable, even for read operations. If you attempt a SELECT that tries to use the index, you get the error ORA-01502: index …. or partition of such index is in unusable state. The workaround is to alter your session to skip unusable indexes:

ALTER SESSION SET SKIP_UNUSABLE_INDEXES = TRUE;

However, if the unavailable index is a unique index, or if it is used to enforce a constraint, the setting of SKIP_UNUSABLE_INDEXES does not suppress the ORA-01502 error and the data manipulation is not allowed to continue. So limit the setting of UNUSABLE to non-unique indexes only.
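
Marking an index unusable in the first place is a simple DDL operation (the index name here is illustrative):

ALTER INDEX emp_name_idx UNUSABLE;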

An index is re-enabled by rebuilding it. Loading the table and then rebuilding is considerably quicker than loading with the indexes enabled, hence the considerable performance gain.

Each index can be rebuilt individually using:

ALTER INDEX <index name> REBUILD;

Alternatively, on partitioned tables all indexes on a specific partition can be rebuilt with the following command:

ALTER TABLE <table name>
MODIFY PARTITION <partition>
REBUILD UNUSABLE LOCAL INDEXES;

Nologging

The redo logs are essential for recovering the database from a catastrophic failure. Having logging enabled on any critical tables in an Online Transaction Processing (OLTP) system would be mandatory in order to guarantee point-of-failure recovery. However, when loading into the staging or work tables often found in data warehousing and migration projects, being able to recover the table is not so crucial, since the data is often temporary or replaceable from the source data. The NOLOGGING option set on a table helps to avoid the overhead of maintaining the redo logs during loading. However, this no-logging feature only takes effect with the INSERT APPEND command discussed earlier.
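
Setting the option is a one-line DDL change (emp_stage is an assumed table name):

ALTER TABLE emp_stage NOLOGGING;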

Analyze Tables

Tables that are referenced in downstream data loading steps, particularly those loaded whilst empty or that are partitioned, need to have statistics gathered at either table (global) or partition level as appropriate. This is essential for the Cost Based Optimiser (CBO) to produce an optimal execution plan. If tables do not have statistics and the database is running under the CHOOSE optimiser mode, then the inferior rule-based optimiser will be used, which could lead to a totally inappropriate execution plan.
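
Statistics are typically gathered with the DBMS_STATS package; here is a minimal sketch for an assumed table EMP_STAGE:

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname     => USER,
    tabname     => 'EMP_STAGE',
    granularity => 'ALL',   -- global and partition-level statistics
    cascade     => TRUE);   -- gather index statistics too
END;
/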

No Constraints

There is an overhead in enforcing foreign key (FK) constraints, since each foreign key value requires a unique key index look-up on the parent table. From my experience, this overhead is nowhere near as significant as some of the other topics discussed, in particular having indexes enabled. FK constraints can be disabled prior to data loading and re-enabled afterwards (see the sketch below), but unless you can guarantee 100% data integrity, the re-enable may fail, leaving you with the task of correcting or eliminating the offending data. If you leave FK constraints enabled whilst using the PL/SQL bulk loading approach, then you will be able to trap individual loading errors. The CTAS and INSERT APPEND approaches would favour the constraints being disabled, but would force you to deal with any invalid data before the constraints can be re-enabled.
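
As a minimal sketch, disabling and re-enabling an FK constraint looks like this (the table and constraint names are illustrative):

ALTER TABLE emp DISABLE CONSTRAINT emp_dept_fk;

-- ... perform the bulk load ...

-- Fails with ORA-02298 if orphaned child rows were loaded
ALTER TABLE emp ENABLE CONSTRAINT emp_dept_fk;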

Deleting from parent tables forces Oracle to check that no rows exist in the child tables as defined by the FK constraint. So if no index is set up on the FK columns of the child table, the child table will require a full table scan for each parent row being deleted. On large tables this would be very impractical. If it is not possible to have an index on the FK columns, then disabling the constraint is a viable option; however, unless data integrity can be guaranteed, re-enabling the constraint may fail, leading to the complication of cleaning up the data.

Primary key (PK) constraints suffer similar limitations to FK constraints. However, if the loading task performs INSERTs on new data and UPDATEs on existing data (effectively a MERGE), then the PK must be enabled in order to establish whether a row already exists.

Parallel Processing

On a multi-processor database server, any session running on Oracle will typically utilise one CPU only, unless told otherwise. Spreading the processing effort across multiple processors can be achieved in several ways:

Oracle’s built-in parallel processing functionality. Parallel processing of DML operations such as INSERT, UPDATE, and DELETE can be enabled at three levels:

  • Instance level, defined in the parameter file.
  • Object level, defined in CREATE/ALTER TABLE/INDEX.
  • Query level, defined in a PARALLEL hint.

Oracle will break the operation up into however many parallel tasks are specified by the degree of parallelism and spread them equally across the available processors.

There is an overhead for Oracle to manage parallel processing, so performance gains are only achieved beyond a certain threshold. There is no definitive guideline for this, but processing tens of millions of rows would probably benefit from having parallelism enabled.
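
As a sketch of the query-level approach (the table names and degree of 4 are illustrative), note that parallel DML must first be enabled for the session:

ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND PARALLEL(emp_target, 4) */ INTO emp_target
SELECT /*+ PARALLEL(emp_source, 4) */ *
FROM   emp_source;

COMMIT;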

Programmatic methods to perform asynchronous tasks. The various refinements to loading that have been mentioned so far do not necessarily need to be performed serially, for example the rebuilding of indexes, the enabling of constraints, etc. PL/SQL is able to simulate a crude form of “multi-threading” via the DBMS_JOB built-in package. An anonymous PL/SQL block can be submitted to the job queue to be executed immediately as a one-off job. This is simple if there is no dependency on the task completing (successfully or otherwise). For more sophisticated requirements, each asynchronous task could signal its completion, and a succeeding task can then be started once all of the tasks it depends on have completed.
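
A minimal sketch of submitting such a one-off job; the index being rebuilt is an illustrative assumption:

DECLARE
  v_job BINARY_INTEGER;
BEGIN
  -- Submit an anonymous block to the job queue for immediate execution
  DBMS_JOB.SUBMIT(
    job  => v_job,
    what => 'BEGIN EXECUTE IMMEDIATE ''ALTER INDEX emp_name_idx REBUILD''; END;');
  COMMIT;  -- the job only becomes visible to the job queue once committed
END;
/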

Flexible Metadata

The traditional approach to storing metadata in a relational database solution would be to design a series of tables and hold the metadata as row data. A far more flexible method is to use XML configuration files. Using the built-in DBMS_XMLDOM package, XML metadata can easily be read, mapped to a PL/SQL nested table type, and used to drive the flow and behaviour of your data loading solution.
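
As a minimal sketch of the idea, the fragment below parses a small inline XML document of <step> elements; the structure is purely illustrative, and in practice the XML would be read from a file or table:

DECLARE
  v_doc   DBMS_XMLDOM.DOMDocument;
  v_steps DBMS_XMLDOM.DOMNodeList;
BEGIN
  v_doc := DBMS_XMLDOM.newDOMDocument(
             XMLType('<load><step name="emp"/><step name="dept"/></load>'));
  v_steps := DBMS_XMLDOM.getElementsByTagName(v_doc, 'step');
  FOR i IN 0 .. DBMS_XMLDOM.getLength(v_steps) - 1 LOOP
    -- Each step name would drive a stage of the loading process
    DBMS_OUTPUT.PUT_LINE(
      DBMS_XMLDOM.getAttribute(
        DBMS_XMLDOM.makeElement(DBMS_XMLDOM.item(v_steps, i)), 'name'));
  END LOOP;
END;
/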

Conclusion

Hopefully this article has given you an overview of the things you may want to consider for your own bulk data loading solution.
