Define Naming Standards for Data Warehouse
The naming standards of the Data Warehouse will be essentially on how to name the Data-Base objects, File Names. We will not go into the naming standards for routines and processes, as it is not linked to database design. Here are some methods:
Naming of Dimension Tables
The dimension tables should be named as per the standard dimensions list which you have created during the analyze phase. The assumption here is that you are using the foundations dimensions and measures. You can also add a prefix or suffix in terms of that:
- It is a dimension table
- It is a detailed table or an aggregate table
- It is current table or historical snapshot table
Naming of Fact Tables
The fact tables will vary with each schema unlike the dimension tables (assuming you are using foundation dimensional tables). There can be two suggested ways to name:
Name it as the business themes of the schema. For example, if a star schema is doing the sales Revenue Analysis, you can provide similar names. The reason is that you will not find two different schemas addressing the same analysis objective. So the distinctness in naming is maintained.
The other way to name a fact table is to name a fact table as per the fact-table grain you are using. For example in sales revenue analysis schema, you may be using a sales transaction as the fact table grain. Therefore, you can name the fact table accordingly. In the fact table, you can add a prefix or suffix on:
- If it is a detailed or aggregate fact table
- If it is current fact table or historical snapshot table
Naming of the Database Fields:
A database field name can have the following components:
- The attribute Word: The field will represent an attribute of some entity. For example pin code, telephone no, date of birth
- The attribute type word: This part of the name suggests on what is the kind of attribute it is. Some examples are
- Summary Types: Average, sum, total, count
- Functional Reason: Flag, ID, category, class...
- Others: Description, Title, Serial_number
- Additional Identifier: This is optional if needed. Some examples are end, start, initial, final, previous, later, Ist, 2nd, primary, secondary, main... (like start_date, 1st_address, primary_email...)
Few Points to keep in mind, as you are setting the naming standards:
- Avoid overuse of abbreviations: If you have to use abbreviations, use them in a standard way (like use 'descr' universally and not 'desc' and 'descr' being used at different instances...). Ideally you should have a standard reference list for abbreviations. However avoid using unnecessary abbreviations like 'cust' for 'customer'.
- Avoid using too long a name: Avoid using examples like 'Purchase_order_header_parts_total_amount_dollars'. The best way to handle it is that one can look at the field and table together. The name of the table may not be needed to be fully repeated in the field name, though it need not be totally eliminated. For example I will use the name customer_ID in the table customer_master (where the customer part is both in the table name and field name). However, I will not need to have field name called customer_master_ID.
- Use underscores instead of hyphens: This is the standard best practice for any kind of naming. Underscores are always preferred.
- Don't and never use uppercase: To make sure that you do not get caught into any case sensitive platform or database handling function, you should always use lower-case.
- Document your naming standards: Your naming standards should be universally available, and they should be a living document.
Naming standards of the Data Warehouse code files, data files and data definition files
Apart from the naming standards for the database, one will need to define the naming standards for the directory paths, code files, the dbf files and database creation scripts. It’s a fairly simple and doable concept. The release manager or version control manager can easily create some rules for naming. Some examples are-
- Name the ETL script names for the customer entity to go as per the customer name. You may also add the source system name in the script name.
- Name the dbf file as that of the business theme (this is more applicable for data-mart...)
- Name the table creation script as per the name of the entity or fact table grain.
Design Physical Tables & Columns
The physical design or columns need to be same as the logical model. The changes will be typically will be driven by the following consideration:
- Determine the Null Fields
- Create the surrogate keys
- Change the structure, if it will make the access by the access tools more productive.
- Add the indexing structure
- Making the changes in the staging, if your defined logical staging model is not turning out to be efficient.
Estimate Database Size
The database size estimates tricks are similar to that of an OLTP. While one can go into a complex calculation, keep your database size estimate as five times that of the base size of your dimensional tables (including historical snap-shots). The components are:
- Aggregate tables
- Index Tables
- Metadata tables
- Temp Space
- Future expansion
Develop Index Plan
Indexing plan will keep on changing with time as you learn more about the usage. We will not spend too much time on this aspect as most of the principles are same as that of OLTP. There are however few differences:
- Special optimization of dimensional model schema: This includes creation of special indexing, so that you can actually search for the combination of dimensional table fields. By this optimization, you can create index on a combination of keys not from a single table (as you typically do in OLTP), but across various tables. This means that when a query is searching across these dimensional tables, one can find the records across these dimensional tables and then link up to fact table for getting the data. This kind of multi-table indexing needs to be constantly updated as the query types change over time. To further explain this point, if you feel that there is a heavy query load across customer, time, location and product dimension table (people want to find which customer went to which branch at what time and bought what product). You may create a multi-key index for the combination of primary keys of these four dimension tables.
- High fact Table indexing: Fact table is a kind of table which is unique and a parallel is not found in OLTP environment. A fact table will have more indexes than a typical OLTP table. This is because it is the centre point of all queries done on a dimensional model schema. Depending upon query usage, you can define multiple multi-key indexes. Taking the examples from the previous point, you can create an index on the four keys which are linked to the customer, product, location and time dimension tables.
Develop Aggregation Plan
Aggregation is used for the purpose of increasing the response time to the queries. The aggregation is done to handle few predictable and frequent queries which need to look at the summary data. This is straight as one has to look at the expected queries. Now two things to consider:
- If you are using an OLAP layer, you may need to create too many aggregates as OLAP itself is an aggregator.
- You have to ruthlessly question the need of aggregates, as they do increase the load on the whole loading process.
Resource Handling
- Determine the partition plan: There are innumerable partitioning plans. This is nothing but the same principles you apply on OLTP. Typical partitioning happens with partitioning of the fact table and the associated slices of the dimensional tables in a partition.
- Determine DBMS Memory handling: You can refer in-memory analytics for more details on the benefit of placing the reference data in Random Access Memory. There are various ways in which you can cut it. You can place entire set of dimensional model schema in the memory (or OLAP cube) to speed up the access OR If you are not able to do so, you can have most accessed data set in the memory. The suggestion is to maximize the use of memory.
- Determine the Cache handling: Cache is used when you want to maximize the response time by using the storage which is local or closer. You can decide about the cache distribution across the local memory, local disc, and central database memory or in the database itself. You can also decide to maintain the cache in the memory or disk of the application server.
Installation and Implementation of DBMS design
There are some other aspects of physical database design. However, they are same as that in OLTP. The design and implementation will go through some iterations as you run your testing around volume and stress factors to ensure the meeting of expectations.
|