|
by Philip Lambert
Database solutions such as data warehouses and
data conversion/migration projects can require very large amounts of
data to be moved in a highly efficient and robust manner. What
solutions are there available? There are plenty of ETL tools on the
market, but often these are expensive, complex, and time consuming
to implement and configure. Developing a simple PL/SQL based
solution may be all you need, but what is the best approach to
satisfy your specific requirements?
This article is directed primarily at
developers wanting to develop bulk data loading programs in PL/SQL.
Background
The first part of this article identifies the
available approaches for bulk loading data and compares the pros and
cons of each. The second part of the articles then looks at the
refinements that you may want to consider adding to your solution.
The Approaches
Create Table As
(CTAS)
Creating a table based on a SELECT statement
neither requires UNDO space nor writes to the redo log. It is the
quickest and simplest method of loading a temporary table, and with
minimal execution cost can be added to a partition table using a
mechanism known as partition exchange.
CREATE TABLE emp_temp AS SELECT ....
CREATE TABLE emp
PARTITION ...
Pros
Cons
-
Useful for loading into partition tables or
into temporary tables, otherwise benefits are lost if the final
destination is a permanent non-partitioned table.
-
All or nothing if Oracle errors encountered
during loading, i.e. exceptions cannot be trapped and and bad
rows ciphered to one side.
-
No constraints or predefined precision on
data types to trap data errors.
-
Partition exchange would require the
temporary table to be of identical shape to the partitioned
table, i.e. must have full compliment of columns in the correct
sequence.
Insert Append
Next best performing method to CREATE TABLE AS is the INSERT APPEND
technique. Append is not a syntactical component of the INSERT
command, but a tuning hint recognised by the /*+ */ notation. This
approach does have to use UNDO (or rollback) space, but writing to
the redo log can be avoided by specifying the NOLOGGING property on
the table. This effectively provides the equivalent to direct
loading achievable with SQL*Loader.
INSERT /*+
APPEND */ INTO ... SELECT ...
Pros
-
A very fast
method of loading (second to CREATE TABLE AS).
-
Able to load into
any target table
-
Can insert a
sub-set of columns.
-
Able to enable
Oracle's parallel processing during the insert.
Cons
-
Hindered by the
excessive use of UNDO which needs to be sized very high for
loading very large sets of data.
-
Long retention of
rollback can cause ORA-1555 Snapshot too old errors. This is not
so much of a problem with Oracle10g.
-
All or nothing
loading if an Oracle error is encountered followed by
potentially long rollback period, i.e. no opportunity to trap
bad rows.
-
No practical
means to batch inserts into smaller chunks to avoid rollback
issues, unless data has a convenient key to do this, e.g. serial
number range, date range, etc. Even if data could be batched on
a suitable key, the efficiency of the execution plan for each
batch would need to be adequate to justify the multiple
executions, particularly if FULL TABLE SCANS on large tables are
involved.
Forall
This is a PL/SQL approach that has the performance to compete fairly
well with the INSERT APPEND method and flexibility to contend with
the limitations of the methods discussed so far. To be more
accurate, the FORALL command would need to be combined with BULK
COLLECT on a cursor fetch to be most effective.
The BULK COLLECT enables a single fetch to populate array mapped to
columns on the cursor and either retrieve the full set of data or
limited to a number of rows. The buffer size, i.e. the number of
entries in the array, is governed by the amount of PGA memory
available.
The FORALL command then enables DML commands such as INSERT to be
used in conjunction with arrays used as bind variables (known as
bulk binding). Effectively an INSERT with bind variables is only
fired once at the SQL engine and the data passed via an array for
each bind variable.
To simplify the number of arrays and bind variables used, it is
possible to use record types with both BULK COLLECT and FORALL, as
long as the individual columns of the record type are not referenced
individually.
FORALL ...
INSERT
INTO ..... VALUES v_rec_typ(i);
or
FORALL ...
UPDATE
.... SET ROW
= v_rec_typ(i) WHERE id = v_id(i);
FORALL is able to trap bad rows and exclude these from the DML
operation by specifying the SAVE EXCEPTIONS clause. This option
saves the offending array index and SQLCODE into a pseudo array
which can be accessed by the PL/SQL program. Being able to trap
exceptions also provides the added advantage of responding to
expected errors such as duplicate value on index and
performing an UPDATE instead. In Oracle9i the MERGE command combine
the INSERT/UPDATE logic into one statement, and helps to simply this
type of requirement.
Determining the optimal batch size is a trade off between over
excessive use of UNDO and the amount of PGA memory the PL/SQL
program has to store the arrays in.
Pros
-
Performance is on
par with INSERT APPEND up until a threshold batch size.
-
Able to trap
exception on individual bad rows.
-
Able to batch up
inserts into more manageable chunks followed by COMMITs and
therefore avoid excessive UNDO use.
Cons
-
Batch sizes are
limited by the amount of PGA memory available.
-
Unable to avoid
writes to the redo log.
-
Exception
handling is very slow in Oracle9i (not really an issue if not
many exceptions being trapped).
-
Committing across
fetches can cause ORA-1555 errors to be encountered (depends on
what other activities are occurring on the database).
Build Your Own Application
Which Approach
Deciding on which method to use for your
application is much dependent on what your requirements are. If all
of your target tables are always partitioned and you can fully
guarantee the integrity of the data you are loading, then exploit
the CREATE TABLE AS approach. If your target tables are
non-partitioned, then consider the INSERT APPEND. If you want
ultimate flexibility, expect "dirty data", and happy to
compromise on the performance slightly, then the FORALL approach
would be the most attractive approach.
Refinements
Disable Indexes
Avoiding the overhead of maintaining indexes
whilst loading into a table can save considerable processing time,
particularly if a table has quite a number of multi-columned
indexes. The traditional method of achieving this has been to drop
indexes prior to loading and recreating them afterwards, but there
is now a more convenient means of managing the overhead of indexes.
Indexes are not disabled and enabled in the
same way as constraints, but made UNUSABLE. Whilst an index is in
this state, the index is not maintained during DML operations on the
table, which as the term suggest makes them completely unusable,
even for read operations. If you attempt an operation within a
SELECT that attempts to use the index, then you get the error
ORA-1502: index .... or partition of such
index is in unusable state. The work around is to alter
your session skip using unusable indexes:-
ALTER SESSION SET SKIP_UNUSABLE_INDEXES =
TRUE;
However, if the unavailable index is a unique
index, or if it is used to enforce a constraint, the setting of
SKIP_UNUSABLE_INDEXES does not suppress the 1502 error and the data
manipulation is not allowed to continue. So limit the setting of
UNUSABLE to non-unique indexes only.
An index is re-enabled by rebuilding it. The
rebuild execution time is considerably quicker than attempting to
load a table with enabled indexes attached, and hence a considerable
performance gain.
Each index can be rebuilt individually using:
ALTER INDEX <index name> REBUILD;
Alternatively, on partitioned tables all
indexes on a specific partition can be rebuilt with the following
command:
ALTER TABLE <table
name> MODIFY PARTITION <partition>
REBUILD UNUSABLE LOCAL
INDEXES;
Nologging
The redo logs are essential for recovering the
database from a catastrophic failure. Having them enabled on any
critical tables in an Online Transaction Processing (OLTP) system
would be mandatory in order to guarantee point of failure recovery.
However, loading into staging or work tables often found in data
warehousing and migration projects, being able to recover the table
is not so crucial since the data is often temporary or replaceable
from the source data. The NOLOGGING option set on a table helps to
avoid the overhead of maintaining the redo logs during loading.
However, this no logging feature can only be achieved by the INSERT
APPEND command discussed earlier.
Analyze Tables
Tables that are later referenced in down stream
data loading steps, particularly those loaded whilst empty or are
partitioned, need to have statistics gathered at either table
(global) or partition level as appropriate. This is essential for
the Cost Based Optimiser (CBO) to produce an optimal explain plan.
If tables do not have statistics and the database is running under
CHOOSE optimiser mode, then the inferior RULES based optimizer will
be used which could lead to a totally in appropriate execution plan.
No Constraints
There is an overhead with enforcing foreign key
(FK) constraints, since each foreign key value will require a unique
key index look up on the parent table. From my experience, this
overhead is no where near as significant as some of the other topics
discussed, in particular having indexes enabled. FK constraints can
be disabled prior to data loading and re-enabled afterwards, but
unless you can guaranteed 100% data integrity, the re-enable
may fail, leaving you with a task of correcting or eliminating the
offending data. If you leave FK constraints enabled whilst using the
PL/SQL bulk loading approach, then you will be able to trap
individual loading errors. CTAS and INSERT APPEND approaches would
favour the constraints being disabled, but would force you to deal
with any invali data before the constraints can be enabled.
Deleting from parent tables will force Oracle
to check that no rows exist in the child tables as defined by the FK
constraint. So if no indexes are set up for the FK on the child
table, then the child table will require a full table scan for each
parent row being deleted. On large tables, this would be very
impractical. If it is not possible to have an index on the FK
columns, then disabling of the constraint is a viable option,
however, unless data integrity can be guaranteed, then re-enabling
the constraint may fail leading to the complication of clean up the
data.
Primary Key (PK) constraints suffer similar
limitations as FK constraints. However, if the loading task is
performing INSERT on new data and UPDATEs on existing data
(effectively a MERGE), then the PK must be enabled in order
establish whether a row already exists.
Parallel Processing
On a multi-processor database server, any
session running on Oracle will typically utilise one CPU only,
unless otherwise told to. Spreading the processing effort across
multiple processors can be achieved in several ways:
Oracle's built-in parallel processing
functionality. Parallel processing of DML operations such as
INSERT, UPDATE, and DELETE can be enabled at three levels
-
Instance level, defined in the parameter
file.
-
Object level, defined in create/alter
table/index
-
Query level, defined in a PARALLEL hint
Oracle will break the operation up into however
many parallel tasks that has been specified by the degree of
parallelism and spread them equally across the available processors.
There is an overhead for Oracle to manage
parallel processing, so performance gains are only achieved beyond a
certain threshold. There is no definitive guideline for this, but
processing tens of millions of rows would probably benefit from
having parallelism enabled.
Programmatic methods to perform asynchronous
tasks. The various refinements to loading that have been
mentioned so far do not necessarily need to be performed serially,
for example, rebuilding of indexes, enabling constraints, etc.
PL/SQL is able to simulate a crude form of "multi-threading" via the
DBMS_JOB built in package. An anonymous PL/SQL block can be
submitted to the job queue to be executed immediately as one-off
job. This is simple, if there is no dependency on the task
completing (successfully or failing). For more sophisticated
requirements, the asynchronous task could signal its completion to a
proceeding task, and on completion of all its dependent tasks, the
task can be started.
Flexible Metadata
The traditional approach to storing metadata in
a relational database solution would be to design a series of tables
and hold metadata as row data. A far more flexible method is to use
XML configuration files. Using the built-in DBMS_XMLDOM package, XML
metadata can be easily read and mapped to a PL/SQL nested table type
and used to drive the flow and behaviour of your data loading
solution.
Conclusion
Hopefully this article has given you an
overview of the things you may want to consider for you own bulk
data loading solution.
If you wish to consider using an already
developed solution, then please review the
dbi_loader data loading product available from Database
Innovations.
|