Semi-Additivity is, when you can have a measure aggregated on a certain dimension, but not all the dimensions. Another phrase for semi-additivity is, when you have the summarization with an index of in-accuracy. Semi-Additivity happens primarily in four scenarios:
Semi-additive Missing OR dirty data
There are many reasons and manifestations for missing and dirty data. One can refer to Customer Data Quality and reasons for bad data quality. The missing OR dirty data, provides a wrong picture for the overall submissions. For example if you have empty records for sales of some offices, you will end up under-reporting the sales figures. Similarly, if you have wrong data in the same case, you may end-up under OR over-reporting.
The solution to this issue (you can refer Data Correction techniques in customer data quality to get more detailed listing):
- Don't include those offices in the summarization and specify that the report does not include the specific instances.
- Fill-up an average figure for the instance. For example, if the sales figure for can office is not available for this month, you can (temporarily) assign a value, which is closer to the past patterns. One option is to put the average of last 12 months sales. This is generally a preferred solution. Apart from just putting the average for past periods, you can also use various extrapolation and forecasting techniques to calculate the stop-gap figure. This however, will be done, only if the number of cases of missing OR dirty data is within certain limits (for example- 5% offices not having data..)
- In case of a high proportions of instances having missing OR dirty data, one needs to apply the above-said tricks and also mention the caveat, that the data could have an inaccuracy index of some percentage.
Historical Data
Historical data falls in two categories, and both have different treatments:
Historical snap-shots:
When you have historical snap-shots, and over the time, you have situations, where you have changed the instances of your dimensions. For example- you may have changed your product categorization ('home- segment' and 'small business' product segment is not combined and re categorized into 'hand-held' and 'Table-top') OR sales locations ('New-York' sales area and 'New-Jersey' sales areas now combined and split into 5 sales areas, as company business has grown..)
In this case, when there is an incompatibility across the instances, it is not possible to add the measures across those dimensions, across time. Referring to the above-said example, it might still be possible to add the sales figures for office instance across the time, as that level in the location dimension has not changed. However, adding sales on 'sales area' basis over the time dimension will not be possible.
Solution: You may like to apply some smart transformation rules, to translate the historical categories into new categories. For example- you may know that 'Hand-Helds' typically formed 30% of the sales in the home and small business segment. You may apply this %age to historical snap-shots and make it aligned to the current categories.
Slowly Changing dimensions (SCD):
Please refer to Special Situations in Dimensional Modeling to understand on what we mean by slowly changing dimensions. In-short, a Business Intelligence system will be storing the various instances of changing values with a dimension, as a new record OR a field, whereas the transaction system will typically overwrite. For example, ZIP code of a customer may be over-written by a transaction system, as customer moves to a new address. The Business Intelligence system, may append a new record for this change, without erasing the previous ZIP-code. This may be needed to do sales analysis on basis of ZIP codes in the previous months.
In case of SCD, one is not able to summarize the data on the time dimension. For example, if the customer has changed the ZIP code, you will not be able to summarize the sales related to the customer, as there are two records related to the customer (with two different ZIP-codes). One has to take this statement with a little care. It is 'possible' to summarize, but one has place an extra filter, so that you do not double calculate.
Solution : Track the constraints in the meta-data and apply the right filters.
Snap-shot data
There are many measures, which are not the 'activity' for a period, but the state of the measure at a given point of time. The example of the 'at the moment' measures is the line items in a balances sheet (which provides the assets and liabilities at a given point of time). example of 'Activity for a period' is the line-items in profit & Loss account (which provides expenses and revenues for a given period).
The 'Snap-Shot' measures cannot be added over the time dimension. If you want to find out the account balances for the year, you will not add the balances at the end of each of the 12 months. As you will see in the averaging of measures topic, these are best handled through averaging.
Solution: Apply other operators like averages/Maximum/Minimum..
Category data
When you have measures, which provide 'type of magnitude' and not the 'magnitude', it is not possible to add them across some dimensions. The example of this difference is- 'Number of Sales units' and 'units of inventory' are value of Magnitude measures whereas 'number of product-categories sold' and 'Number of inventory parts types' is type of magnitude measure.
example of above semi-additivity is - if you have sold 2000 different product models across US in this month, and 1500 models in the previous month, you cannot sum them up to provide the total numbers of models sold in US over past two months. You will need to have a way to identify the distinct models sold across this period.
Solution: Apply other operators like averages/Maximum/Minimum..
|