Data Loading starts after the data sets are ready in the Data Staging in the presentation server. This is considered to be a straightforward process. The key consideration in the Loading process is to achieve the speed of Loading. This is achieved by using various methods:
Turn off Logging during loading
If the logging is on, the system will create the transaction log with every write action it will perform on the database. By turning off logging, one can save on this overhead.
Drop and recreate Indexes
Like logging, indexing also has an associated overhead with most of the write operations in database. One can drop the index and recreate the same after Loading is done. This will make the indexing a batch process instead of an activity done with every new record added into the system. However, this trick is relevant, if you are adding a high proportion (say 10%..) to the existing tables. If you are adding only a small %age say 2% of the size of the existing tables, it will take more time to re-create the index.
Pre-Sort the File as per the primary key index for Data Warehouse loading
This is a good practice as it speeds the up the indexing process tremendously. This is among the most recommended techniques. If you also need to create indexes on other fields (which are not primary key), you can drop and re-create them later.
Take a balanced view on the number of indexes.This is a concept, which is as much applicable for source systems as it is for the Data warehouse. Too many indexes will increase the Loading time, and too few will increase the user access time. Use the append (new OR updated values, tables) instead of Full Refresh
This is a conventional wisdom that adding new records will take less time compared to rebuilding the table from scratch. However, if there are too many new records to be added, and volume is high, it is faster to drop and re-create.
Limit the frequency of updationThere are opportunities (though not many.) when you may update certain data marts, OR sections of Data Warehouse by less than Daily frequency. This will save a lot of time, if Data Warehouse is used for analytics, modeling OR score carding, which does not necessarily need daily updation. Parallelize table load and index maintenance
Dropping indices and bulk Loading in parallel can significantly improve Loading time. All the known data warehouse platforms provide this capability.
Stream Data ETL
Certain ETL tools will allow you to extract, transform, and load in one process. That is, it is not necessary to create intermediate files. However, this is a new concept and works only for some very fire-wall segregated staging processes, which are complete on their own. Moreover, this will work when the transformation is simplistic.
Take a balanced view on integrity checks
If your quality assurance routines over the days and months are giving clean chit to the Loading process, one can reduce these checks as they place an extra load. Given your level of comfort, this can be done post-facto. |