OSCON Day #2: Building Data Warehouses with MySQL (by Jeremy Zawodny)

I'm in John Ashenfelter's Data Warehouses talk this morning. He's an excellent presenter who really knows his material and is completely sold on MySQL. At least one person in the audience has already extolled the use of MySQL in data warehousing applications.

I'm taking notes in real time, so this will be a bit disjointed, but that's life.

Idea for O'Reilly book: Data Warehousing in a Nutshell.

The talk starts with a story about why MySQL is the most cost-effective data warehousing solution available. Compared to Microsoft SQL Server (the cheapest closed-source solution), MySQL is a big savings.

Warehouses vs. Marts

Data Warehouse vs. Data Mart. Plan for a warehouse, but build a mart. There are a lot of things you might need, but there's no need to build all of it until you need it. Data marts lock together to become a warehouse.

Book recommendation: The Data Warehouse Toolkit.

Data warehouse: focused on business processes, using standardized granular facts. It's a collection of standardized marts.

Data mart: focused on one narrow business, includes lightly summarized data. It's a component of a data warehouse.

Metadata

Metadata capture. Need to get terms and definitions correct and agreed upon in advance. Policies and company practices factor into the decision making. How does your business really quantify things? You need to ask users what their business needs are. Sometimes this involves going quite high in the organization.

Dimensions

Every time you hear "by", think about dimensions. They're wide and flat tables (compared to fact tables). Lots of redundant data. Make sure the measurement units are the same. This is hard in multi-national companies. Which day (time zones)? Sizes and volumes of items, etc. How will they be formatted? What enumerations (possible values) will exist? M/F, 0/1, Y/N, Small/Medium/Large/X-Large.

Calendar dimension example: date_id, date_value, description, month, day, year, quarter, is_weekday, day_of_week, fiscal_month, fiscal_quarter, astrological_sign, etc. Lots of duplication (imagine one record for each day of the year). You could add weather info, abnormal business closes, etc.
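To make the calendar dimension concrete, here's a minimal sketch of generating one row per day. It uses Python's built-in sqlite3 module as a stand-in for MySQL (my assumption, purely so it runs anywhere), and the function name build_calendar is made up for illustration. Only a handful of the columns from the talk are included.

```python
import sqlite3
from datetime import date, timedelta

def build_calendar(conn, start, end):
    """Create a calendar dimension and fill it with one row per day."""
    conn.execute("""
        CREATE TABLE calendar (
            date_id     INTEGER PRIMARY KEY,  -- surrogate key, no business meaning
            date_value  TEXT NOT NULL,        -- ISO date, e.g. 2003-07-08
            year        INTEGER,
            quarter     INTEGER,
            month       INTEGER,
            day         INTEGER,
            day_of_week TEXT,
            is_weekday  INTEGER               -- 1 = Mon..Fri, 0 = Sat/Sun
        )""")
    rows = []
    d = start
    while d <= end:
        rows.append((
            d.isoformat(), d.year, (d.month - 1) // 3 + 1, d.month, d.day,
            d.strftime("%A"), 1 if d.weekday() < 5 else 0,
        ))
        d += timedelta(days=1)
    conn.executemany(
        "INSERT INTO calendar (date_value, year, quarter, month, day,"
        " day_of_week, is_weekday) VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
n_days = build_calendar(conn, date(2003, 1, 1), date(2003, 12, 31))
```

Fiscal periods, weather, or business closures would just be more columns on the same rows.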

Fact Tables

All about keys. Many keys, few facts. Very deep (tall) and narrow. It's best not to store calculated values because you may need to recalculate someday (margin, for example). You can calculate on the fly (in the query) or in code that's pulling the data. Use facts that the business users understand.
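A rough sketch of the "calculate on the fly" idea: the fact table stores only keys plus the raw facts (quantity, price, cost), and margin is derived in the query. The table and column names are my own invention, and sqlite3 stands in for MySQL here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_item_fact (
        date_id INTEGER, product_id INTEGER, customer_id INTEGER,  -- keys
        quantity INTEGER, unit_price REAL, unit_cost REAL          -- raw facts
    );
    INSERT INTO order_item_fact VALUES
        (1, 10, 100, 2,  8.00, 5.00),
        (1, 11, 100, 1, 12.00, 9.00),
        (2, 10, 101, 4,  8.00, 5.00);
""")

def margin_by_product(conn):
    """Margin is computed at query time, so its definition can change later."""
    return conn.execute("""
        SELECT product_id,
               SUM(quantity * (unit_price - unit_cost)) AS margin
        FROM order_item_fact
        GROUP BY product_id
        ORDER BY product_id
    """).fetchall()
```

If the business later redefines margin, only the query changes; nothing stored has to be recomputed.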

Don't use anything meaningful for keys. Never. Ever. Meaningful things change when companies merge, change, etc. Just invent numbers that are meaningful to the database only.
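Here's a small sketch of what a meaningless key buys you. The dimension gets an invented integer key, and the source system's own code is just another attribute. Everything here (customer_dim, source_code, the surrogate_key helper) is hypothetical, with sqlite3 standing in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_dim (
        customer_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        source_code TEXT UNIQUE,                        -- key used by the source system
        name        TEXT
    )""")

def surrogate_key(conn, source_code, name):
    """Return the surrogate key for a source-system code, creating a row if new."""
    row = conn.execute(
        "SELECT customer_id FROM customer_dim WHERE source_code = ?",
        (source_code,)).fetchone()
    if row:
        return row[0]
    cur = conn.execute(
        "INSERT INTO customer_dim (source_code, name) VALUES (?, ?)",
        (source_code, name))
    return cur.lastrowid

acme = surrogate_key(conn, "ACME-001", "Acme Corp")

# After a merger the source system renumbers the customer. Only the
# dimension row changes; fact rows keyed on customer_id are untouched.
conn.execute("UPDATE customer_dim SET source_code = ? WHERE customer_id = ?",
             ("MEGA-778", acme))
```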

Star Schema

There's a central fact, many dimensions (the arms), and no other tables. Don't "snowflake" or over-normalize by hanging new tables off of the dimension tables.
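A toy star schema, sketched with sqlite3 in place of MySQL: one fact table in the middle, two dimensions as arms, and a "sales by month by category" query that joins straight from the fact to each dimension. The table names and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE calendar (date_id INTEGER PRIMARY KEY, month INTEGER, year INTEGER);
    CREATE TABLE product  (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE sales_fact (
        date_id INTEGER, product_id INTEGER,   -- keys into the dimensions
        quantity INTEGER, unit_price REAL      -- facts
    );
    INSERT INTO calendar VALUES (1, 6, 2003), (2, 7, 2003);
    INSERT INTO product VALUES (10, 'Turkey wrap', 'Lunch'),
                               (11, 'Bagel tray', 'Breakfast');
    INSERT INTO sales_fact VALUES (1, 10, 3, 7.50), (2, 10, 2, 7.50), (2, 11, 1, 20.00);
""")

def sales_by_month_and_category(conn):
    """Every 'by' in the question maps to a column on a dimension table."""
    return conn.execute("""
        SELECT c.year, c.month, p.category,
               SUM(f.quantity * f.unit_price) AS sales
        FROM sales_fact f
        JOIN calendar c ON c.date_id = f.date_id
        JOIN product  p ON p.product_id = f.product_id
        GROUP BY c.year, c.month, p.category
        ORDER BY c.year, c.month, p.category
    """).fetchall()
```

Note that category lives directly on the product row (redundantly) rather than in its own table; splitting it out would be the snowflaking the talk warns against.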

When building the schema, decide on the grain. What's the smallest bit of data anyone will ask for? One day? One hour? One week?

Scenario

Sales from a web-based meal order system (Vmeals.com). Many clients/customers, caterer/restaurants, delivery locations, etc. The database is relatively small now (300MB or so).

Lots of data to track (on the white board). The most important bit will likely be order items. Starting simple, our facts are orders and customer service metrics. Dimensions are calendar, customers, products, promotions, and so on. Create a bus design.

We'll use sales for this example.

Determine Grain

Pick the lowest possible grain that makes sense. For this example, it's orders or more specifically ordered items. Order will be the second fact table. Dimensions: calendar (order date, deliver date), product (menu items), customer, delivery location, provider (vendor), licensee (market).
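One way to see why the lowest grain matters: rows stored per ordered item can always be rolled up to order totals, but order totals can never be broken back down into items. A quick sketch in plain Python, with invented sample data:

```python
from collections import defaultdict

# Fact rows at the "ordered item" grain (sample data, made up).
order_items = [
    {"order_id": 501, "item": "Turkey wrap", "quantity": 3, "unit_price": 7.50},
    {"order_id": 501, "item": "Bagel tray",  "quantity": 1, "unit_price": 20.00},
    {"order_id": 502, "item": "Turkey wrap", "quantity": 2, "unit_price": 7.50},
]

def order_totals(items):
    """Roll item-grain rows up to the coarser order grain."""
    totals = defaultdict(float)
    for it in items:
        totals[it["order_id"]] += it["quantity"] * it["unit_price"]
    return dict(totals)
```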

Finding the Data

Need to pull data from the on-line system to populate the warehouse. Some may come from other places too: market information system (MS Access), promotion engine, etc. Larger companies will have many more.

The order fact table will be primarily built from line items from orders. Think about additive, non-additive, and semi-additive values. You generally want additive data. Store the data needed to compute things, not the resultant values. Snapshot values (such as a daily bank balance) are not additive.
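The bank balance case is worth a tiny sketch. Balances can be summed across accounts on a single day, but summing one account's balances across days gives a meaningless number; you want the closing value instead. The data and function names below are invented.

```python
# (day, account, balance) snapshot rows; sample data, made up.
balances = [
    ("2003-07-01", "A", 100.0), ("2003-07-02", "A", 120.0),
    ("2003-07-01", "B", 50.0),  ("2003-07-02", "B", 40.0),
]

def total_on(day, rows):
    """Summing across accounts for one day is meaningful."""
    return sum(b for d, _, b in rows if d == day)

def closing_balance(account, rows):
    """Across time, take the latest snapshot rather than a sum."""
    history = sorted((d, b) for d, a, b in rows if a == account)
    return history[-1][1]

# Summing account A over both days would give 220.0, which is not
# an amount of money anyone ever had.
```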

Special Dimension Types

Degenerate dimensions have no corresponding dimension table. Invoice or order numbers are common examples. They're only used in groups or rollups, typically.

Role-playing dimensions are used over and over. Dates are a good example: payment date, order date, delivery date. In some systems you'd use views for this. They'd all be views over the underlying calendar table. You can use MySQL Merge tables to work around the lack of views. Or you can create several one-table merge tables.
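For systems that do have views, the role-playing trick looks roughly like this. I'm using sqlite3 (which supports views) as a stand-in, and the view and table names are hypothetical. The same calendar table is joined twice under two different roles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE calendar (date_id INTEGER PRIMARY KEY, date_value TEXT);

    -- One physical calendar, exposed under two role names.
    CREATE VIEW order_date    AS SELECT date_id, date_value FROM calendar;
    CREATE VIEW delivery_date AS SELECT date_id, date_value FROM calendar;

    CREATE TABLE order_fact (
        order_id INTEGER, order_date_id INTEGER, delivery_date_id INTEGER
    );
    INSERT INTO calendar VALUES (1, '2003-07-07'), (2, '2003-07-08');
    INSERT INTO order_fact VALUES (501, 1, 2);
""")

def order_and_delivery_dates(conn):
    """Join the fact table to the calendar once per role."""
    return conn.execute("""
        SELECT f.order_id, od.date_value, dd.date_value
        FROM order_fact f
        JOIN order_date    od ON od.date_id = f.order_date_id
        JOIN delivery_date dd ON dd.date_id = f.delivery_date_id
    """).fetchall()
```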

Slowly changing dimensions. Fixed data: just update. Changed data: add a new row (like a new address). Fundamental schema change: add a new column, keep old column data around.
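A sketch of the "add a new row" case. When a customer's address changes, the old row stays (so old facts still point at the old address) and a new row with a new surrogate key is added. The is_current flag is my own addition for illustration, not something from the talk, and sqlite3 stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_dim (
        customer_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        source_code TEXT,                               -- one customer, many rows
        address     TEXT,
        is_current  INTEGER DEFAULT 1                   -- hypothetical flag column
    )""")
conn.execute(
    "INSERT INTO customer_dim (source_code, address) VALUES ('ACME-001', '1 Old Road')")

def change_address(conn, source_code, new_address):
    """Retire the current row and add a new one; history is preserved."""
    conn.execute(
        "UPDATE customer_dim SET is_current = 0"
        " WHERE source_code = ? AND is_current = 1", (source_code,))
    cur = conn.execute(
        "INSERT INTO customer_dim (source_code, address) VALUES (?, ?)",
        (source_code, new_address))
    return cur.lastrowid

new_id = change_address(conn, "ACME-001", "9 New Street")
```

New facts get loaded against new_id, while older facts keep pointing at the original row.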

Getting the Data

ETL process: extract, transform, load. Go to the source of the data. Extraction can be tricky if you have lots of data and little time (24x7 system). If the source data has date stamps, you can perform incremental dumps. With some systems you can use a row checksum and a computed index. Transformation is about making the data match the warehouse's metadata standards. Perl to the rescue! (Or maybe using an intermediate database.)
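The two extraction tricks can be sketched in a few lines of Python (standing in for the Perl mentioned above). Both helpers are hypothetical: one filters on a date stamp, the other compares a per-row checksum against the one remembered from the previous extract. I'm assuming ISO-formatted date strings, which sort correctly as text.

```python
import hashlib

def rows_modified_since(rows, last_run):
    """Incremental dump: keep rows whose date stamp is newer than the last extract."""
    return [r for r in rows if r["modified"] > last_run]

def row_checksum(row, columns):
    """Checksum over the columns we care about, for sources without date stamps."""
    joined = "|".join(str(row[c]) for c in columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def changed_rows(rows, columns, previous_checksums):
    """Keep rows that are new or whose checksum differs from the stored one."""
    return [r for r in rows
            if previous_checksums.get(r["id"]) != row_checksum(r, columns)]
```

In practice the previous checksums would live in an indexed column or table rather than an in-memory dict.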

Microsoft DTS is a good option in the Windows world. It comes with SQL Server and is scripted via a scripting host language. Lots of expensive commercial tools to do this too.

When dumping from other systems, watch out for blobs. They probably don't belong anyway. Make sure that stuff comes out as quoted ASCII rather than some internal representation that MySQL won't grok.

To load, MySQL's LOAD DATA INFILE works quite well for this. It's fast and flexible.
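As a rough illustration (not the demo from the talk), here's a Python sketch that writes rows as fully quoted, comma-separated text, plus a LOAD DATA INFILE statement that would match that format. The file path and table name are hypothetical, and the statement is only held in a string here, not run against a server.

```python
import csv
import io

def dump_rows(rows, out):
    """Write every field quoted, one record per line, '\\n' line endings."""
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)

# Statement matching the dump format above. The path and table name are
# placeholders; this string is never executed in this sketch.
LOAD_STATEMENT = """
LOAD DATA INFILE '/tmp/order_items.csv'
INTO TABLE order_item_fact
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\\n'
"""

buf = io.StringIO()
dump_rows([(1, 'Turkey "club" wrap', 7.5)], buf)
```

Embedded quotes come out doubled, which is one of the forms LOAD DATA accepts inside an enclosed field. Data containing backslashes would need more care, since backslash is the default escape character.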

Demo: Dump via MSSQL BCP and load into MySQL using LOAD DATA.

Having a staging environment is important. Rather than loading all the data straight into the warehouse, you can do a lot of intermediate work on a different server before loading into the "real" warehouse. You can use this staging area to run validation checks, manage any changes needed (SQL, Perl, custom apps, DTS or other ETL tools), and perform multi-step extractions.
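A validation pass in staging might look something like this sketch. The specific checks (positive quantities, known product keys) and the function name are my own examples, not rules from the talk.

```python
def validate_staged_rows(rows, known_product_ids):
    """Return (row_index, problem) pairs for rows that should not be loaded yet."""
    errors = []
    for i, r in enumerate(rows):
        if r["quantity"] <= 0:
            errors.append((i, "non-positive quantity"))
        if r["product_id"] not in known_product_ids:
            errors.append((i, "unknown product"))
    return errors
```

An empty result means the batch can move on to the real warehouse; anything else gets fixed in staging first.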

A freshness date helps users understand when the latest data isn't as new as they might think.
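One simple way to provide that date, sketched with sqlite3 standing in for MySQL: each load records how far the data goes, and reports look up the latest entry. The load_log table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE load_log (table_name TEXT, loaded_through TEXT)")
conn.executemany(
    "INSERT INTO load_log VALUES (?, ?)",
    [("order_item_fact", "2003-07-06"), ("order_item_fact", "2003-07-07")])

def freshness_date(conn, table_name):
    """Latest date covered by any load of the given table, or None if never loaded."""
    row = conn.execute(
        "SELECT MAX(loaded_through) FROM load_log WHERE table_name = ?",
        (table_name,)).fetchone()
    return row[0]
```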

Reporting Tools

All Java, open source: Jasper Reports, jFreeReport/Chart, DataViz.

OLAP Tools

Open Source: Mondrian (Java), JPivot (Java/JSP), BEE (perl).

More demos at the end.

Posted by jzawodn at July 08, 2003 12:02 PM

Reader Comments

# tlack said:

This seems like a major application of MySQL, as it's suited to the inherent speed and lack of safety (which isn't required) that MySQL provides. I'm surprised there aren't more discussion groups and websites dedicated to this topic.

on July 9, 2003 12:22 PM

# jacky said:

very interesting, thanks

on July 11, 2003 07:39 AM

# Ganesh said:

Nice one! This gives a quick refresh for DW designers of the points to keep in mind.

on September 7, 2003 05:53 AM

# Greece said:

very interesting, thanks

on September 24, 2003 07:51 AM

# Alexandru Toth said:

Hi,

There is a new Open Source OLAP project, called Cubulus OLAP. It is in early alpha; nonetheless, it may already be able to show some measures across some hierarchical dimensions.

OLAP aggregation engine uses "hierarchical range-clustering of keys" (read "V. Markl, F.Ramsak, R. Bayer: Improving OLAP Performance by Multidimensional Hierarchical Clustering. Proc. of the Intl. Database Engineering and Applications Symposium, pp. 165-177, 1999") . Dr Volker Markl expressed in mail his consent to use the range clustering in an OSS project.
The main difference to the closest Open Source OLAP project - Mondrian a.k.a Pentaho, is the fact that Cubulus does all the aggregations inside the relational database (by generating multiple queries).
Even if there are multiple tables, there is no "star schema", no GROUP BY and no JOINs. Due to the "Multidimensional Hierarchical Clustering", all SQL statements are simple SELECT SUM(measure) with a range WHERE clause for each dimension. There is always a full-table scan, so indexes provide only a little help. Partitioning is also of no great use, because PARTITION BY is just by one dimension. The "curse of dimensionality" makes full-table scans more attractive in large databases due to better sequential reads from disk (and memory pre-fetching), as opposed to random seeking based on indexes. Also, tuning indexes requires a lot of manual work, and slows data loading.
Theoretical background for simple SELECTS with WHERE clauses can be derived from the article "Anastassia Ailamaki , David J. DeWitt , Mark D. Hill , David A. Wood, DBMSs on a Modern Processor: Where Does Time Go?, Proceedings of the 25th International Conference on Very Large Data Bases", p.266-277, September 07-10, 1999 . A C language case study is presented "Conjunctive Selection Conditions in Main Memory", Kenneth A. Ross, 2002 (?)

Cheers, Alex
http://sourceforge.net/projects/cubulus
alxtoth at users . sourceforge . net

on March 23, 2007 04:28 AM

# Robert said:

You wisely wrote in 2003:

"Idea for O'Reilly book: Data Warehousing in a Nutshell."

Here we are about six years later, and it continues to make sense.

Take care.

on December 12, 2008 04:13 PM
