braindump ... cause thread-dumps are not enough ;)

The Data Warehouse Toolkit Chapter VI - notes

  • Avoid normalizing fact tables - in most situations it is far better to have some duplication and a simple schema than no redundant data but a very complex schema.
  • Instead of using a single date dimension table it might be better to create one physical table and multiple views - each for a different context (Kimball calls this role playing). See the first sketch after this list.
  • Product dimension guidelines:
    • Avoid normalizing the product dimension table (rather than using “snowflakes” just create one wide table with duplicated data inside)
    • Don’t use the operational product key - use a surrogate key
    • Instead of the codes used in the operational database - use descriptive attribute values (eg. instead of product type MGHM use “Magical Hat - Color Magenta”)
    • Do quality checks which rule out misspellings, impossible values etc.
    • Document all assumptions in the metadata (table comments?)
  • Customer dimension guidelines:
    • Avoid creating geographical hierarchies :)
  • Junk dimension - sometimes there is a bunch of indicators/flags/enumerations which don’t fit anywhere - you can create one table which includes them all and just reference the row with the matching combination of those indicators (see the junk-dimension sketch after this list).
  • There should be an orders-of-magnitude difference in size between dimension and fact tables.
  • Strategies for storing multiple currencies:
    • Duplicate fact rows in different currencies
    • Hold converted currencies in columns
    • Create a fact table with daily currency conversions.
  • Don’t join fact tables!
  • If you need to store the same information in different units:
    • Think about adding already converted values as columns
    • If there are too many columns which would have to be converted - add the conversion factor as a column (see the conversion-factor sketch after this list)
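
A minimal sketch of the role-playing idea in SQL - the table, view and column names (dim_date, order_date, ship_date) are my own invention, not from the book:

-- One physical date dimension table...
CREATE TABLE dim_date (
    date_key     INT PRIMARY KEY,
    full_date    DATE,
    day_of_week  VARCHAR(9),
    fiscal_month VARCHAR(20)
);

-- ...and one view per role, with role-specific column names,
-- so a fact table can join the "same" dimension twice without ambiguity.
CREATE VIEW order_date AS
SELECT date_key    AS order_date_key,
       full_date   AS order_full_date,
       day_of_week AS order_day_of_week
FROM dim_date;

CREATE VIEW ship_date AS
SELECT date_key    AS ship_date_key,
       full_date   AS ship_full_date,
       day_of_week AS ship_day_of_week
FROM dim_date;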
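
The junk dimension could look roughly like this (again, all names are made up). The point is that the table holds the combinations of the indicator values and each fact row points at the row with its combination:

CREATE TABLE dim_order_flags (
    order_flags_key SMALLINT PRIMARY KEY,
    payment_type    VARCHAR(10), -- eg. 'CASH', 'CARD'
    gift_wrap       VARCHAR(3),  -- 'Yes' / 'No'
    sales_channel   VARCHAR(10)  -- eg. 'WEB', 'STORE'
);

-- Pre-populated with all combinations (the table stays tiny):
INSERT INTO dim_order_flags VALUES (1, 'CASH', 'Yes', 'WEB');
INSERT INTO dim_order_flags VALUES (2, 'CASH', 'No', 'WEB');
-- ...and so on for the remaining combinations.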
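
And a made-up sketch combining the “converted values as columns” idea (currencies) with the “conversion factor as a column” idea (units):

CREATE TABLE fact_shipment (
    product_key    INT,
    date_key       INT,
    amount_usd     DECIMAL(12, 2), -- currency pre-converted into columns
    amount_eur     DECIMAL(12, 2),
    quantity_cases INT,            -- quantity stored in one "base" unit
    case_to_pallet DECIMAL(10, 4)  -- conversion factor as a column
);

-- Other units are derived at query time from the factor:
SELECT product_key,
       quantity_cases,
       quantity_cases * case_to_pallet AS quantity_pallets
FROM fact_shipment;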

The Data Warehouse Toolkit Chapter V - notes

  • Sometimes a fact table which seems monolithic at the beginning turns out to contain facts which are not compatible with each other - this requires creating a separate fact table for each identified type. Indicators that this might be the case:
    • The facts have different granularity levels.
    • Facts are completely different from each other - they describe different business processes
    • Users want to analyze those groups of facts separately
  • An accumulating snapshot models something that has well-defined milestones.
  • Types of dimension changes (… i have a weird feeling that i’ve already seen that somewhere before …)
    • Type 0: retain original - nothing ever changes
    • Type 1: overwrite value in row
    • Type 2: add a new row and reference that new row only in new facts
      • You may want to add additional columns which denote the effective and expiration dates (dates when the value started being used and when it stopped) - see the sketch after this list.
    • Mix of Type 1 and Type 2
    • Type 3: add a new attribute (column) (waaaat?) - do this only when the change impacts a lot of rows in dimensions
    • Type 4: Mini-dimension - create a new dimension (with a small number of rows) which contains the attributes that change frequently. Unfortunately you need to add a new key to the fact table which will refer to the mini-dimension.
    • Type 5: add the mini-dimension key to the primary dimension. Overwrite the key reference with each profile change.
    • Type 6: Type 1 (overwrite) attributes in Type 2 (adding new row) dimension (eg. a column denoting the current value of something - there might be a second column which will denote the historic value)
    • Type 7: Dual Type 1 and Type 2 dimensions
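
A minimal Type 2 sketch with effective/expiration dates - the table and column names are my own, not from the book:

CREATE TABLE dim_customer (
    customer_key    INT PRIMARY KEY, -- surrogate key, new per version
    customer_id     INT,             -- natural key, stays the same
    city            VARCHAR(50),     -- the tracked attribute
    effective_date  DATE,
    expiration_date DATE             -- '9999-12-31' marks the current row
);

-- Customer 42 moves: expire the current row...
UPDATE dim_customer
SET expiration_date = DATE '2014-06-30'
WHERE customer_id = 42
  AND expiration_date = DATE '9999-12-31';

-- ...and add a new version; only new facts will reference key 1042.
INSERT INTO dim_customer VALUES
    (1042, 42, 'Gdansk', DATE '2014-07-01', DATE '9999-12-31');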

It seems that those types are mostly about introducing a combination of volatile columns (which get updated over time) and non-volatile columns which preserve historical values. If you look at it from a Hadoop perspective (we can pretty much only append - unless you use HBase), all the types which involve updates (1, 4, 5, 6, 7) aren’t of much use.

The Data Warehouse Toolkit Chapter IV - notes

  • Three ways of modelling inventory:
    • Daily snapshots with amounts on hand and sold.
    • Record all transactions which affect inventory (but it is hard to use for analytics!)
    • Accumulating snapshot: one row denotes the whole lifecycle of an item (eg. there is a column with the date when it arrived at the warehouse, the date when it was inspected and the date when it left the warehouse - the row is updated over a longer timespan). See the first sketch after this list.
  • Enterprise Data Warehouse Bus Matrix - the whole thing just means that different business processes contribute to different dimensions. Some of the dimensions are common across the company. If you’re implementing a warehouse, dear reader - just do it one process at a time.
  • The author emphasizes that the “bus matrix” is a tool for communication and work organization (you can use it to tell someone what is being implemented, and you can plan, eg., that you will now implement that particular row).
  • Don’t overgeneralize - ‘employees != customers’ (although in physical world both are people… well at least usually ;))
  • Ideally - use the same dimension tables across all fact tables (from different business processes)
  • Unfortunately - the world isn’t ideal (mostly…). It is also acceptable to have dimensions conform not fully, but in a subset of attributes:
    • A dimension for one of the fact tables contains a subset of another dimension’s columns (eg. a Product Dimension may contain Brand Description, Category Description and Subcategory Description while a Brand Dimension may contain only Brand Description and Category Description). Those tables have the same values in the shared columns: Brand Description and Category Description. This is what makes them conform. See the second sketch after this list.
    • A dimension which contains a subset of the rows of another dimension.
  • Sometimes it doesn’t make sense to integrate everything (eg. each department might have own customers and there is no will to do a cross-sell deal).
  • Fact tables also need to conform!
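
A sketch of the accumulating snapshot from the inventory list above (names invented) - one row per item, with milestone dates filled in as they happen:

CREATE TABLE fact_item_lifecycle (
    item_key        INT PRIMARY KEY,
    arrival_date    DATE, -- set when the item arrives
    inspection_date DATE, -- NULL until the item is inspected
    departure_date  DATE  -- NULL until the item leaves the warehouse
);

-- The same row is revisited as each milestone is reached:
UPDATE fact_item_lifecycle
SET inspection_date = DATE '2014-03-02'
WHERE item_key = 12345;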
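
The “subset of columns” flavour of conformance can even be enforced mechanically, eg. by deriving the smaller dimension from the bigger one (a sketch, assuming a dim_product table that holds the brand/category/subcategory descriptions):

CREATE VIEW dim_brand AS
SELECT DISTINCT
    brand_description,
    category_description
FROM dim_product;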

You must be disciplined in your data naming practices. If it is impossible to conform a fact exactly, you should give different names to the different interpretations so that business users do not combine these incompatible facts in calculations.

Oozie MR1 compatibility mode

It took me some time to figure out what this means:

> java.io.IOException: mapreduce.job.outputformat.class is incompatible with reduce compatability mode.
>   at org.apache.hadoop.mapreduce.Job.ensureNotSet(Job.java:1190)
>   at org.apache.hadoop.mapreduce.Job.setUseNewAPI(Job.java:1242)
>   at org.apache.hadoop.mapreduce.Job.submit(Job.java:1279)
>   at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:606)
>   at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:601)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
>   at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:601)
>   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:586)
>   at org.apache.oozie.action.hadoop.MapReduceMain.submitJob(MapReduceMain.java:97)
>   at org.apache.oozie.action.hadoop.MapReduceMain.run(MapReduceMain.java:57)
>   at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:37)
>   at org.apache.oozie.action.hadoop.MapReduceMain.main(MapReduceMain.java:40)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:495)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:160)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:155)

Turns out that Oozie (at least in version 3.3.2) runs map-reduce jobs as if they were designed for MR1. To make it treat your nice-and-shiny MR2 code properly, use these options in the configuration of the <map-reduce> action:

<!-- New API for map -->
<property>
    <name>mapred.mapper.new-api</name>
    <value>true</value>
</property>

<!-- New API for reducer -->
<property>
    <name>mapred.reducer.new-api</name>
    <value>true</value>
</property>

The Data Warehouse Toolkit Chapter III - notes

The chapter is a case study of applying dimensional modelling in a typical retail organization scenario.

  1. Select business process
    • Sales transactions at POS (Point Of Sale).
  2. Declare the grain
    • Grain level == single product on a POS transaction.
  3. Identify dimensions:
    • Date of sale (when)
    • Where the sale occurred
    • Promotions (context)
    • Cashiers (context)
    • Method of payment (context)
  4. Identify facts (may require changing the grain!). Note that they are mostly nicely additive:
    • Sales quantity
    • Per unit regular price
    • Discount
    • Net paid prices
    • Extended discount
    • Sales dollar amount
    • Extended dollar amount (if provided)
    • Extended cost amount (if provided)
    • Extended …
    • Quote: “Percentages and ratios, such as gross margin, are non-additive.”
  5. Next, estimate the size of the data inflow (rows per day/week/month) into the fact table.
  6. Define dimension table attributes
    • Columns in date dimension table can include the regular date info, fiscal year/month/date info, date exploded to some particular format, quarter indicator (pretty much anything which enables querying by some specific date conditions).
    • Use textual attributes instead of flags and indicators (Holiday, Non-holiday instead of Y/N) - see the sketch after this list.
    • Products dimension illustrates the “dimension hierarchy within table” technique. Instead of creating a complicated table structure - just duplicate the data in several columns (brand, category, etc)
    • If product key contains embedded information - explode it to separate columns.
    • Rule of thumb when in doubt whether something is a fact or a dimension attribute:

    Data elements that are used both for fact calculations and dimension constraining, grouping, and labeling should be stored in both locations.
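
A few of those date-dimension columns sketched out (the same made-up dim_date as in the Chapter VI notes above; fact_sales is also my invention):

CREATE TABLE dim_date (
    date_key          INT PRIMARY KEY,
    full_date         DATE,
    day_of_week       VARCHAR(9),  -- 'Monday', 'Tuesday', ...
    fiscal_year       INT,
    calendar_quarter  VARCHAR(2),  -- 'Q1'..'Q4'
    holiday_indicator VARCHAR(11)  -- 'Holiday' / 'Non-holiday', not Y/N
);

-- Textual attributes make report labels self-describing:
SELECT d.holiday_indicator,
       SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY d.holiday_indicator;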

  • Kimball suggests that derived data such as gross profit (the difference between cost and sales price) should be computed at the ETL stage and made available to users. I don’t think it is a good idea - I would prefer a view and dynamic computation of this kind of value. On the other hand, depending on the actual data set size, it might be the only option.
  • Drill down - just add a column to the report which slices the already presented rows into more detailed categories.
  • Degenerate dimension - a dimension without a table. It exists just as a column in the fact table. Can be used for grouping facts or just as a back reference to the operational system.
  • Adding a new dimension: create a table, add a single row “Prior to introducing new dimension”, add a new column to the fact table and populate it with the key of that “gravestone” row (see the first sketch after this list).
  • Sometimes it is necessary to introduce a factless fact table - eg. to describe what didn’t happen. Example with promotions: we want to know whether promoted products were sold during the promotion. The sales fact table doesn’t contain that - we store only info about what was actually sold. Solution: create a new fact table (artificially) which will contain the number of product units sold per day. A bit aggregatish - but for the specified use it should be ok. See the second sketch after this list.
  • Dimension primary keys should be surrogate keys instead of natural keys.
    • They protect the whole warehouse from operational changes (if a natural key changes - you still have your own).
    • They enable integration of different source systems whose keys might potentially clash. The other observation: the same fact in different sources can have a different natural key!
    • Performance (nobody should need more than an int as a dimension table key - LOL :))
    • It is easier to handle natural key null values.
    • Can handle changes of dimension tables gracefully.
  • Dimension natural keys should be stored as a separate attribute in dimension table
  • A fact table key is often a subset of the table’s foreign keys (to dimension tables) plus a degenerate dimension.
    • Having a surrogate key as the primary key in a fact table has many benefits (technical - not analytical)
      • Checkpoint in bulk load
      • Immediate unique row identification
      • Replace updates with row insert/delete - now this is something useful for hive @ HDFS
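
The “gravestone” trick in SQL (made-up names; ALTER TABLE syntax varies between databases):

CREATE TABLE dim_promotion (
    promotion_key  INT PRIMARY KEY,
    promotion_name VARCHAR(100)
);

-- The single placeholder row for pre-dimension history:
INSERT INTO dim_promotion VALUES (0, 'Prior to introducing new dimension');

-- Wire the fact table to it and backfill the old rows:
ALTER TABLE fact_sales ADD COLUMN promotion_key INT;
UPDATE fact_sales SET promotion_key = 0;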
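
And one common reading of the factless-fact-table technique (fact_promotion_coverage and all column names are mine): a coverage table records which products were on promotion in which store on which day, and “what didn’t happen” falls out of an anti-join with sales:

-- fact_promotion_coverage: one row per product/store/day on promotion,
-- no measures at all - hence "factless".
SELECT c.product_key, c.store_key, c.date_key
FROM fact_promotion_coverage c
LEFT JOIN fact_sales s
       ON s.product_key = c.product_key
      AND s.store_key   = c.store_key
      AND s.date_key    = c.date_key
WHERE s.product_key IS NULL; -- promoted but never sold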

Final thoughts:

  • don’t do “snowflakes” in dimensions (don’t normalize dimensions) - doesn’t make much sense and can actually harm performance (joins!)
  • outriggers (referencing dimensions in dimensions) are permissible - but not advised
  • avoid using too many dimensions in a single fact table - 20/25 tops. If you have more - try to combine correlated dimensions into one.