braindump ... cause thread-dumps are not enough ;)

Sample projects - dropwizard and Apache Spark

Recently I was playing a bit with dropwizard and Apache Spark. Both projects are extremely interesting, although they are completely different and come from separate domains. I hope that someone will find these simple projects helpful:

Data Warehouse Toolkit Chapter XXI - notes

  • First build things in a sandbox environment; productionize them once they prove successful
  • Try with simple applications first
  • In general, do the same as with an RDBMS:
    • Drive the data sources with business needs
    • Focus on usability and good-enough performance
    • Think dimensionally
    • Use conformed dimensions
    • Use Slowly Changing Dimensions techniques
    • Use surrogate keys
  • Don’t do ad hoc querying - rather focus on analytics
  • Use clouds for prototyping
  • Search for performant, highly tuned dedicated solutions for your specific use case
  • Use Hadoop for integrating many sources of data (possibly some of them unstructured)
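The SCD and surrogate-key bullets carry over directly from the RDBMS world. As a minimal sketch (the table shape, field names, and the "city" attribute are all made up for illustration), a Type 2 slowly changing dimension update expires the current row and inserts a new one with a fresh surrogate key, so old facts keep pointing at the historical version:

```python
from datetime import date

# Hypothetical in-memory "customer" dimension with surrogate keys.
dim_customer = [
    {"sk": 1, "natural_key": "C-42", "city": "Krakow",
     "valid_from": date(2012, 1, 1), "valid_to": None, "current": True},
]

def scd2_update(dim, natural_key, new_city, change_date):
    """Type 2 change: expire the current row and insert a new one
    with a fresh surrogate key instead of overwriting in place."""
    next_sk = max(row["sk"] for row in dim) + 1
    for row in dim:
        if row["natural_key"] == natural_key and row["current"]:
            row["valid_to"] = change_date
            row["current"] = False
    dim.append({"sk": next_sk, "natural_key": natural_key, "city": new_city,
                "valid_from": change_date, "valid_to": None, "current": True})
    return next_sk

new_sk = scd2_update(dim_customer, "C-42", "Warsaw", date(2013, 6, 1))
# Facts loaded after the change reference the new surrogate key (sk=2);
# facts loaded earlier still reference sk=1 and keep the historical city.
```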

HBase - copy table

In case you’re looking for a quick & dirty (ok, maybe not so dirty) way of copying a table in HBase: hbase docs to the rescue! The bundled `CopyTable` MapReduce job works like a charm.

Data Warehouse Toolkit Chapter VIII - notes

  • Customer contact table:
    • Break it down into as many elemental parts as possible.
  • Use Unicode as the encoding - otherwise handling multiple encodings will be a nightmare
  • If the warehouse is to be used internationally, translate the reports, not the entire content of the warehouse
  • According to the authors, data mining teams should get exports from the warehouse so they can run computations on their own servers (imagine exporting a couple hundred TB :))
  • Solution for tracking a user’s activity: create a “steps” dimension table describing what a user can do, then link each event to that dimension.
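The “steps” dimension idea from the last bullet can be sketched like this (the step names and event fields are made up for illustration): events carry a key into the dimension instead of repeating free-text action strings.

```python
# Hypothetical "steps" dimension enumerating what a user can do.
dim_steps = {
    1: "view_product",
    2: "add_to_cart",
    3: "checkout",
}
step_ids = {name: sk for sk, name in dim_steps.items()}

# Each event in the fact table references a step by its dimension key.
fact_events = [
    {"user_id": 7, "step_sk": step_ids["view_product"]},
    {"user_id": 7, "step_sk": step_ids["add_to_cart"]},
]

# Resolving a user's activity back through the dimension:
activity = [dim_steps[e["step_sk"]] for e in fact_events]
```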

Data Warehouse Toolkit Chapter VII - notes

  • Don’t include “to-date” totals in the fact table - they should be calculated dynamically, not stored.
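A minimal sketch of that advice (the monthly sales rows are made up for illustration): the fact table keeps only atomic, period-level rows, and any to-date total is derived at query time instead of being stored alongside them.

```python
# Fact table stores one row per month -- no stored year-to-date column.
fact_sales = [
    {"month": 1, "sales": 100},
    {"month": 2, "sales": 150},
    {"month": 3, "sales": 120},
]

def year_to_date(facts, through_month):
    """Compute the to-date total dynamically from the atomic facts."""
    return sum(f["sales"] for f in facts if f["month"] <= through_month)

ytd_march = year_to_date(fact_sales, 3)  # 100 + 150 + 120 = 370
```

Storing the running total instead would force a rewrite of every affected row whenever a late-arriving fact is loaded.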