|
|

|
|
 |
Data Warehouse Overview
Data warehousing is the combination of data and tools to store and access that data. Any discussion about data warehousing should center on data. What kind of data should be included, the best way to retrieve the data from other systems and how to store the data, and how to access that data in a meaningful way once it is in the data warehouse.
Not just any data should go into the data warehouse. The data warehouse should not be used as just a means of storing data. If we use a physical warehouse analogy, a company is not going to store useless materials or products in their warehouse; it costs money and affects the ability to get products in and out of the warehouse. As with a physical warehouse, a data warehouse should not have useless data. The data should be meaningful data that can be used for business analysis by the end users. It should be used to help in making business decisions. Any useless data will cost money and affect performance, disk space costs money and the more data the worse the performance of the database will be for loading information and for retrieving and reporting on that information.
Before any data gets put into the data warehouse, there must be an understanding of the end users needs. Many discussions must occur with the decision-makers to understand how they make their business decisions and how their current data can assist in making faster and more accurate decisions. The data warehouse is not needed just for replacing reports that can already be accessed from their current systems, whether it is an ERP or any OLTP system. The data warehouse can also satisfy the needs of multiple business groups from finance to marketing. It is important to remember that all needs can't be satisfied when first getting the data warehouse up. It is important to keep a reasonable project scope so expectations can be met.
So how does the data warehouse assist in analysis and decision-making. The following is some of the advantages of having a data warehouse and how it helps in analysis.
Different sources of data. The data warehouse can load data from multiple data sources having similar data or different data and presenting that data in a meaningful way. Different sources lends itself to comparing different parts of a company or perhaps multiple subsidiaries. It allows for comparing different data, an example would be comparing forecast numbers coming from a forecast application to actual numbers coming from the ERP.
Tracking data over time. The data warehouse takes a snapshot of the OLTP and a certain point in time. Depending on how often the data warehouse is loaded, the state of the OLTP can be found out for past dates. For example, this can help as dates promised to deliver to customers change; this change can't be seen in the OLTP but is available in the data warehouse. As those dates change the data warehouse keeps the information. Perhaps the most important part of the data is its time element. It allows for seeing trends over time, comparing time periods to each other and comparing the same time period over years. Another example would be the data warehouses ability to help forecasters see trends over time of sales and revenue and to see how sales may do during different times of the year, as some company's have seasonal products.
Creating data summaries. Most business decisions are not made using detailed data. That is why summary tables are created and data marts or cubes are generally summarized. There are many advantages to summarized data. Summary data is quicker to access and analyze as it contains less physical data to manipulate. Since summary tables take up less disk space, older detail data can be purged when there are space concerns and the information is still available in summary tables. This can be the case because detail data is not generally needed, as it becomes older. Decisions are not made on a single sales order that happened three years ago. Another advantage to summaries is it lets the users see the overall view and gives them then the ability to "drill down" into certain areas and see an increasing more amount of detail.
Persistence of data. Many times data must be purged from the OLTP system to open disk space and improve performance. While this data is purged from the OLTP it can remain in the data warehouse either as detail or as summarized data.
Organization of data. Data in the warehouse can be organized in ways that lend itself to easier reporting and better analysis. A discussion of the database schema will follow in which this can be seen. Data can be summarized by areas that are of importance to decision-makers. It helps to show weak point and strong points. It allows the users to easily see and compare different aspects or areas of their business, everything from comparing sales division performance, comparing customer information to identify where more of the business is coming from, or even in a multinational company, identify which country is creating more revenue. The data warehouse not only shows this, but also allows the users to "drill down" or look at more detail in those areas, to see where the weak point may occur.
Typically Data Warehouses are implemented with a star schema. Star schemas consist of dimension and fact tables. Dimension tables consist of data in which the fact tables can be reported and summarized on. These are typically smaller parent tables. Fact tables hold the important statistical information in which business decisions can be made. The fact tables are larger than dimension tables and many of the fields are numeric and can be summarized. Dimensions help make the fact table information meaningful.
Tables in the star schema are generally more de-normalized. De-normalized tables allow for easier and quicker reporting, because there are not as many joins and less complicated joins. In order to track the data a schema using start or beginning dates and ending dates can be used. This works by time stamping new records coming in and time stamping the old records.
Another challenge in implementing a data warehouse is getting the data into the schema. An important question is how often should data be loaded? This can range from taking all transactions and applying them real time to the data warehouse, to loading data once a month. Since the source data can be a large amount and as it may not be persistent due to archiving, it is appropriate to find adds, updates, and deletes that occurs on the source data, and apply those to the data warehouse.
This also allows for tracking the data over time. It is important especially with regards to updated data in which a view of the data can be seen before and after the changes. By using the beginning and end dates in which the old record would be end time stamped and the new record would come in with a new begin timestamp and no end timestamp. Also, in the case of deletes, the delete does not have to be applied to the warehouse as a delete, but may take the form of putting an ending timestamp on the record. New records, or adds, would come in with a begin timestamp and no end timestamp.
Another part of the load procedure should be analyzing and cleansing the data. As most systems and data are imperfect, the load process can be a great way of identifying problems such as referential integrity issues. Cleansing of data will have to occur to get data into the schema as some data may be missing, or in the cases of multiple data sources, one data set may not have all of the attributes of another data set. It is important to report the load problems in an understandable report that can be viewed by the data warehouse administrator.
For ease of reporting the load process will typically include calculations and creations of summaries. This prevents the extra calculations and summaries being done at the time the reports are run, improving performance in reporting.
With the data in the data warehouse, the users will then need to get at the data. This can be in the form of "canned" reports, data marts or cubes, or ad-hoc queries. Canned reports can be run on demand or scheduled and waiting for users when the need them. Data marts or data cubes allow for a subset of the data to be easily accessed by the users. These are typically summarized data, or particular slices or dimensions of the business that different business units may be interested in. An example would be different data marts for the financial officers and the marketing group, as they may be interested in different data. We see the "slicing and dicing" typically in these cubes, this means the users can take closer looks at particular dimensions allowing users to see where business strengths and weaknesses lie. The third aspect of reporting is ad-hoc querying. It is important to find a tool that lays the data out in a simple, understandable format so users can create their own reports on the fly.
In preparing to implement a data warehouse it is important to keep the following in mind. Do some good research with the decision-makers to identify the information needed to make business decisions. When designing the schema, keep in mind the possibility of multiple data sources and how to design the database to accommodate future, unknown data sources. Identify the information that needs to be tracked for historical purposes. Include calculated and summarized information in the schema. Choose tools, programming languages that will aid in loading the data accurately and efficiently. And finally, choose a reporting / data mart tool that allows the end users to do their analysis easily and efficiently.
|
|