braindump ... cause thread-dumps are not enough ;)

The Data Warehouse Toolkit Chapter II - notes

  • Fundamental concepts of dimensional modelling:
    1. Gather requirements and needs of the business.
    2. Before designing anything - talk with experts (workshops - WAAT?).
    3. Design the dimensional model:
      1. Select process.
      2. Define the granularity level.
      3. Identify dimensions.
      4. Identify facts.
    4. Business process is something done by the organization - one execution of a process is usually a single row in fact table.
    5. Grain (granularity) - define what is a single row in a table (atomic grain - lowest level at which data is captured within the business process - can’t be split up; atomic grain - food for atomic zombies).
    6. Identify the context - which is enclosed in dimensions.
    7. Identify facts - which are almost always numeric and usually correspond to some physical action.
  • Star schema = dimensional structure deployed in a RDBMS.
  • OLAP cube - dimensional structure implemented in a multidimensional database.
  • Facts can be additive (can be summed across all dimensions related to the table), semi-additive (can be summed across some dimensions), non-additive.
  • Instead of putting NULL values into FKs in fact table - create guard rows in dimension tables.
  • Dimension table primary key can’t be the operational system’s natural key - it needs to be artificial, because there will be multiple dimension rows for that natural key (dimension row changes are tracked over time).
  • Dimension tables can contain multiple hierarchies (eg. for products: categories, brands etc)
  • Dimensions can be used multiple times in the same fact table row in different contexts. It is recommended to create views for dimension table to make the context explicit.
  • Merge low-cardinality flags/indicator dimensions into a single dimension table (less joins!). Don’t create cross product of all combinations - create rows only for combinations which are actually used.
  • Avoid normalizing dimension tables (multilevel structure of snowflake) - it is difficult for business users to understand it.
  • Conformed Dimensions - dimensions which are used across different fact tables - can be used to join data from different business processes.
  • Changing dimension attributes:
    1. No need to change - attributes never change (!)
    2. Overwrite - old row gets overwritten with new value
    3. Add new row - old fact rows still refer to the old rows - thus history is preserved (see the Java sketch after these notes).
    4. Add new attribute/column to dimension table (alternate reality) - the old value is still visible and can easily be used at the same time as the new one.
    5. Create a mini-dimension - containing the frequently changing attributes within a dimension table. The mini-dimension has its own primary key - this key is stored in the fact table. (table with 2 primary keys?)
    6. Combination of overwriting, adding new rows etc - based upon a specific need.
  • Dimension hierarchies:
    • Create column per hierarchy level.
    • Create bridge table (which can represent the hierarchy).
    • Create a string attribute which contains the path in the hierarchy.
    • Don’t create separate dimensions for each level of hierarchy!
  • Creating audit dimensions is very useful and makes debugging easier.
  • Abstract Generic Dimensions - avoid creating generic dimensions which eg. contain all location information: location in a warehouse, location in a country etc. It will most probably have a negative impact on performance.
  • In case facts form hierarchies (subtypes sharing common attributes): create one core fact table which stores the common information, and separate fact tables for each subtype.
  • Create a fact table for error situations: eg. if there is an error during processing, a row describing what failed should be added.
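
A minimal, hypothetical Java sketch of the "add new row" technique from the list above (all class and field names are invented for illustration, no real persistence involved): each change of a tracked attribute inserts a new dimension row with a fresh surrogate key, the natural key repeats across versions, and existing fact rows keep referencing the surrogate key that was current when they were loaded, so history is preserved.

import java.util.ArrayList;
import java.util.List;

public class Scd2Sketch {

    // One version of a customer; a new object is created for every attribute change.
    static class DimensionRow {
        final long surrogateKey;  // artificial key referenced by fact rows
        final String naturalKey;  // operational system's key - repeats across versions
        final String segment;     // the tracked, slowly changing attribute

        DimensionRow(long surrogateKey, String naturalKey, String segment) {
            this.surrogateKey = surrogateKey;
            this.naturalKey = naturalKey;
            this.segment = segment;
        }
    }

    private final List<DimensionRow> customerDimension = new ArrayList<DimensionRow>();
    private long nextSurrogateKey = 1;

    // Called for new customers and for attribute changes alike - in both cases
    // a new row with a new surrogate key is added instead of updating in place.
    long addVersion(String naturalKey, String segment) {
        DimensionRow row = new DimensionRow(nextSurrogateKey++, naturalKey, segment);
        customerDimension.add(row);
        return row.surrogateKey; // this value goes into newly loaded fact rows as a FK
    }

    public static void main(String[] args) {
        Scd2Sketch dim = new Scd2Sketch();
        long keyV1 = dim.addVersion("CUST-42", "standard"); // facts loaded now reference keyV1
        long keyV2 = dim.addVersion("CUST-42", "premium");  // facts loaded later reference keyV2
        System.out.println("old facts -> " + keyV1 + ", new facts -> " + keyV2);
    }
}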

The Data Warehouse Toolkit Chapter I - notes

  • Data Warehouses are analytical tools - based on them business is going to make decisions.
  • It doesn’t make any sense to create a warehouse if nobody wants to use it.
  • Warehouse needs to be:
    • understandable by business people - using business vocabulary
    • trustworthy
    • consistent
    • easily accessible
    • adaptable (needs to change with business needs)
    • secure
    • fast enough
  • Star schema - dimensional model implemented in RDBMS.
  • Fact == business measure
  • Fact table - stores the performance measurements result from a business event. All measurement rows must be on the same grain (level of detail). This is a bedrock principle for dimensional modeling.
    • The most useful facts are numeric and additive - it is possible to carry out aggregations on them.
    • There are semi-additive facts (eg. account balance) - can be summed across some dimensions, but eg. not across the time dimension (see the sketch after these notes).
    • Facts are often described as continuously valued (dimensions can be discrete)
    • Include only true activity (don’t fill fact tables with zeros - those zeros will probably overwhelm the whole table).
    • Transaction grain fact tables are the most common.
    • Fact tables have foreign keys to dimension tables.
    • A fact table has its own primary key composed of a subset of its foreign keys. This key is called a composite key.
    • Every table having a composite key is a fact table - all others are dimension tables.
  • Dimension tables contain context associated with the measurement event.
    • They describe the “who, what, where, when, how, and why”.
    • They have many columns/attributes (even up to 100!).
    • Have fewer rows than fact tables - but are wide.
    • Have a single primary key.
    • Are a primary source of query constraints, groupings, report labels.
    • Attributes should be described with actual human-understandable names
    • In many ways, the data warehouse is only as good as the dimension attributes.
    • Rule of a thumb:

      Continuously valued numeric observations are almost always facts; discrete numeric observations drawn from a small list are almost always dimension attributes.

    • Dimension table space should almost always be traded off for simplicity and accessibility.
  • Each business process should be modelled with its own star schema.
  • Kimball’s DW/BI architecture
    • Comprises four components:
      • Operational source systems - the systems the data is copied from.
      • ETL system - everything between the operational systems and the presentation area; basically tools which gather the data and pre-process it (extract), clean it, combine multiple sources, de-duplicate etc. (generally speaking, transform), and transfer the data to the target physical storage (load).
        • Using a 3NF structure for ETL processing doesn’t make much sense - it would require one ETL process from the operational systems into that format, and then another extract-transform-load into the presentation area.
      • Presentation area (supporting business intelligence) - it is all BI sees of the warehouse project. The data available to users should be fine-grained - it is necessary to be able to use the warehouse for ad-hoc queries.
        • All dimensional structures need to be built using common, conformed dimensions.
      • BI applications - from query tools to machine learning algorithms.
  • Other (than Kimball’s) architectures:
    • Data mart per company department (large duplication of data, processes and funding).
    • Corporate Information Factory - ingest data into an Enterprise Data Warehouse (normalized to 3NF) and later pump it to individual departments’ data marts (summaries/aggregated data). The EDW with fine-grained data is available to users.
    • Hybrid - pretty much the same as CIF - but this time users can’t use EDW - atomic data is pushed to individual marts.
  • “Pearls of wisdom”:
    • Data needs to be on the most detailed level - summaries will never be able to answer all questions.
    • Dimensions should be organized around business processes - not around departments.
    • Dimensional models are scalable (? no evidence provided).
    • Dimensional models are for ad-hoc usage (aggregates, summaries etc. exist only for performance reasons, as an addition to the raw data).
    • Need to rely on conformed naming schemes to be able to integrate dimensional models with outside tools.
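
A small, hypothetical Java sketch of the semi-additive account-balance example mentioned above (class and field names are made up): summing balance snapshots across the account dimension for a single day is meaningful, while summing the same rows across the date dimension is not - an average (or a period-end value) has to be used instead.

import java.util.Arrays;
import java.util.List;

public class SemiAdditiveFactSketch {

    static class BalanceSnapshot {
        final String date;     // date dimension
        final String account;  // account dimension
        final double balance;  // the semi-additive fact

        BalanceSnapshot(String date, String account, double balance) {
            this.date = date;
            this.account = account;
            this.balance = balance;
        }
    }

    public static void main(String[] args) {
        List<BalanceSnapshot> facts = Arrays.asList(
                new BalanceSnapshot("2014-01-01", "A", 100.0),
                new BalanceSnapshot("2014-01-01", "B", 50.0),
                new BalanceSnapshot("2014-01-02", "A", 120.0),
                new BalanceSnapshot("2014-01-02", "B", 30.0));

        // Additive across the account dimension: total money held on a given day.
        double totalOnDay = 0;
        for (BalanceSnapshot f : facts) {
            if (f.date.equals("2014-01-01")) {
                totalOnDay += f.balance; // 100 + 50 = 150.0 - a meaningful number
            }
        }

        // Not additive across the time dimension: 100 + 120 = 220.0 is not a balance
        // account A ever had; the average (110.0) is the meaningful aggregate here.
        double sumOverTime = 0;
        int days = 0;
        for (BalanceSnapshot f : facts) {
            if (f.account.equals("A")) {
                sumOverTime += f.balance;
                days++;
            }
        }
        double averageForA = sumOverTime / days;

        System.out.println("total on 2014-01-01: " + totalOnDay);
        System.out.println("average balance of A: " + averageForA);
    }
}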

Erik Meijer drops atomic bombs on agile dogmas

Based on the title I was expecting that video to be yet another talk about how modern software companies are disrupting all kinds of business. I couldn’t have been more wrong :) Erik thoroughly demolished commonly used agile rules and techniques. I wouldn’t agree with everything he said there … but he made his point: I started to think about those rules with more criticism. For me this is the whole point: stop merely following the process. Think for yourself and work in the way that makes you most effective.

I almost forgot: the video of the talk is here.

Use Text or create new Text - this is the question

One of the things which gets repeated in all MapReduce tutorials is that if you use TextOutputFormat, you should avoid creating new instances of the Text class. Instead, you can just create a single instance, store it in a field and use the set method. Unfortunately, set for a byte array doesn’t behave in the most intuitive way… The input array is copied into an internal buffer. Unless the new data chunk is bigger than the buffer, the same piece of memory is used over and over again. I guess that, for performance reasons, the unused part of the array is left dirty. If you want to get the actual data from the internal buffer, you need to use copyBytes. Unfortunately, it seems that not all parts of the Hadoop code know that. For example, the TextOutputFormat class uses this method:

private void writeObject(Object o) throws IOException {
    if (o instanceof Text) {
        Text to = (Text) o;
        out.write(to.getBytes(), 0, to.getLength());
    } else {
        out.write(o.toString().getBytes(utf8));
    }
}
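
To make the dirty-buffer behaviour concrete, here is a hypothetical snippet (not part of Hadoop, names are mine): after reusing the instance with a shorter value, getBytes() still exposes the stale tail of the previous value, while copyBytes() returns only the valid bytes.

import org.apache.hadoop.io.Text;

public class TextReuseDemo {
    public static void main(String[] args) {
        Text text = new Text();
        text.set("hello world".getBytes()); // fills the internal buffer
        text.set("hi".getBytes());          // shorter value - the buffer is reused

        System.out.println(text.getLength());             // 2 - only this part is valid
        System.out.println(new String(text.getBytes()));  // something like "hillo world" - stale tail included
        System.out.println(new String(text.copyBytes())); // "hi" - only the valid bytes
    }
}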

There are two solutions for this problem:

  • create a new Text instance each time a new tuple is going to be emitted (some information on the internet indicates this might not be the worst idea)
  • clear the Text instance by calling set("") and then set it again

But which one is better? The only metric in this case is the speed of execution. Let’s settle that question with a small experiment. Below you can find a test case which carries out a simple benchmark.

@Test
public void shouldTellWhatIsBetterCreatingTextOrReusing() throws Exception {

    byte[] longText = toBytes("123456789");

    int iterations = 1000000;

    System.out.println("reusing... ");
    long start1 = System.nanoTime();
    Text text = new Text();
    for (int i = 0; i < iterations; i++) {
        text.set(longText);
        text.set(""); // resets the 
    }
    long stop1 = System.nanoTime();

    long diff1 = stop1 - start1;
    System.out.println("reusing took: " + diff1 + " " + (diff1/1000000000));

    System.out.println("creating... ");
    long start2 = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
        new Text(longText);
    }
    long stop2 = System.nanoTime();

    long diff2 = stop2 - start2;
    System.out.println("creating took: " + diff2 + " " + (diff2/1000000000));
}

The result is … interesting:

reusing... 
reusing took: 104000088 0
creating... 
creating took: 4898000991 4

So actually it is faster to clear the Text instance! I was expecting that kind of result, but to be honest the difference is much higher than I thought. Either way, problem solved! :)

Using mini-clusters to do integration testing in Hadoop

One of the most painful things for me while working with Apache Hadoop is testing. Until recently, to test eg. a Hive query or an MR job I had to use a VM with everything installed inside. Thankfully, this changed when I found this. It turns out you can create nice in-memory mini clusters with HBase/HDFS/YARN (at least that is the part I was playing around with). I created this sample project to document how to make it work (gradle build definition included!). It contains only a single test case, which adds some rows to a test HBase table and then checks whether the table actually contains them. A rough sketch of such a test is included below.

The code is based on CDH 4.7.0; however, it should work if you just bump up the versions or switch to artifacts from other providers.
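
For reference, a minimal test against an HBase mini cluster looks roughly like this. This is a sketch only, not the exact code from the sample project - the table and class names are made up, and the exact testing-utility API may differ slightly between HBase versions.

import static org.junit.Assert.assertArrayEquals;

import org.apache.hadoop.hbase.HBaseTestingUtility;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.Test;

public class MiniClusterSketchTest {

    @Test
    public void shouldWriteAndReadRowUsingMiniCluster() throws Exception {
        // spins up in-memory HDFS, ZooKeeper and HBase - no VM needed
        HBaseTestingUtility hbase = new HBaseTestingUtility();
        hbase.startMiniCluster();
        try {
            HTable table = hbase.createTable(Bytes.toBytes("test_table"), Bytes.toBytes("cf"));

            // write a single row
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
            table.put(put);

            // read it back and verify the content
            byte[] stored = table.get(new Get(Bytes.toBytes("row1")))
                    .getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
            assertArrayEquals(Bytes.toBytes("value"), stored);
        } finally {
            hbase.shutdownMiniCluster();
        }
    }
}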