braindump ... cause thread-dumps are not enough ;)

Sample projects - dropwizard and Apache Spark

Recently I was playing a bit with dropwizard and Apache Spark. Both projects are extremely interesting, although they are completely different and come from separate domains. I hope that someone will find these simple projects helpful:

Data Warehouse Toolkit Chapter XXI - notes

  • First build things in a sandbox environment; productionize them once they prove successful
  • Try with simple applications first
  • In general, do the same as with an RDBMS:
    • Drive the data sources with business needs
    • Focus on usability and good-enough performance
    • Think dimensionally
    • Use conformed dimensions
    • Use Slowly Changing Dimensions techniques
    • Use surrogate keys
  • Don’t do ad hoc querying - rather focus on analytics
  • Use clouds for prototyping
  • Search for performant, highly tuned dedicated solutions for your specific use case
  • Use Hadoop for integrating many sources of data (possibly some of them unstructured)
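The SCD and surrogate-key bullets carry over directly from the RDBMS world. As a minimal sketch (the table shape, field names, and the "city" attribute are all made up for illustration), a Type 2 slowly changing dimension update expires the current row and inserts a new one with a fresh surrogate key, so old facts keep pointing at the historical version:

```python
from datetime import date

# Hypothetical in-memory "customer" dimension with surrogate keys.
dim_customer = [
    {"sk": 1, "natural_key": "C-42", "city": "Krakow",
     "valid_from": date(2012, 1, 1), "valid_to": None, "current": True},
]

def scd2_update(dim, natural_key, new_city, change_date):
    """Type 2 change: expire the current row and insert a new one
    with a fresh surrogate key instead of overwriting in place."""
    next_sk = max(row["sk"] for row in dim) + 1
    for row in dim:
        if row["natural_key"] == natural_key and row["current"]:
            row["valid_to"] = change_date
            row["current"] = False
    dim.append({"sk": next_sk, "natural_key": natural_key, "city": new_city,
                "valid_from": change_date, "valid_to": None, "current": True})
    return next_sk

new_sk = scd2_update(dim_customer, "C-42", "Warsaw", date(2013, 6, 1))
# Facts loaded after the change reference the new surrogate key (sk=2);
# facts loaded earlier still reference sk=1 and keep the historical city.
```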

HBase - copy table

In case you’re looking for a quick & dirty (ok, maybe not so dirty) way of copying a table in HBase: hbase docs to the rescue! The bundled `CopyTable` MapReduce job works like a charm.

Data Warehouse Toolkit Chapter VIII - notes

  • Customer contact table:
    • Break it down into as many elemental parts as possible.
  • Use Unicode as the encoding - otherwise handling multiple encodings will be a nightmare
  • If the warehouse is to be used internationally, translate the reports, not the entire content of the warehouse
  • According to the authors, data mining teams should get exports from the warehouse so they can run computations on their own servers (imagine exporting a couple hundred TB :))
  • Solution for tracking a user’s activity: create a “steps” dimension table describing what a user can do, then link each event to that dimension.
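The “steps” dimension idea from the last bullet can be sketched like this (the step names and event fields are made up for illustration): events carry a key into the dimension instead of repeating free-text action strings.

```python
# Hypothetical "steps" dimension enumerating what a user can do.
dim_steps = {
    1: "view_product",
    2: "add_to_cart",
    3: "checkout",
}
step_ids = {name: sk for sk, name in dim_steps.items()}

# Each event in the fact table references a step by its dimension key.
fact_events = [
    {"user_id": 7, "step_sk": step_ids["view_product"]},
    {"user_id": 7, "step_sk": step_ids["add_to_cart"]},
]

# Resolving a user's activity back through the dimension:
activity = [dim_steps[e["step_sk"]] for e in fact_events]
```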

Data Warehouse Toolkit Chapter VII - notes

  • Don’t include “to-date” totals in the fact table - they should be calculated dynamically, not stored.
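A minimal sketch of that advice (the monthly sales rows are made up for illustration): the fact table keeps only atomic, period-level rows, and any to-date total is derived at query time instead of being stored alongside them.

```python
# Fact table stores one row per month -- no stored year-to-date column.
fact_sales = [
    {"month": 1, "sales": 100},
    {"month": 2, "sales": 150},
    {"month": 3, "sales": 120},
]

def year_to_date(facts, through_month):
    """Compute the to-date total dynamically from the atomic facts."""
    return sum(f["sales"] for f in facts if f["month"] <= through_month)

ytd_march = year_to_date(fact_sales, 3)  # 100 + 150 + 120 = 370
```

Storing the running total instead would force a rewrite of every affected row whenever a late-arriving fact is loaded.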