|
Most of the source system transactional tables store the new activities as
additional records (append). However the master tables (Customer
s, products,
suppliers..) etc are mostly "updated" with changes like activation status, addresses,
marital status (refer slowly changing dimensions ), over-writing the previous/historical
information. For a large company with millions of such changes happening
every day, an ETL has a big challenge to identify the records which have changed.
There are two ways to handle it:
- Refer the transaction log of the source system:
Even mediocre systems have transaction logging capability in their
database systems, which keep a track of all the updates done to the
tables. The ETL system can scan the log and find out the master (or
for that matter any) table which has changed. It goes to those records
and picks them for processing.
However here can be the issues with this approach when the database
administrators turn-off transaction logging due to performance overhead
reasons during peak periods, or due simple lack of skills.
- Use the check-sum approach:
There are many check-sum algorithms, which are used to compute a unique
number using the contents of a record. The next day when the ETL picks
up the master tables, it can calculate the check-sum of all the records
and compare it with previous day check-sum for the corresponding records.
If there is a mis-match, it means that something in the record has changed.
This will allow ETL to quickly filter out the changed records and then
do field by field comparison of this short-listed set of records.
|