06 Dec 2014
Recently I was playing a bit with Dropwizard and Apache Spark.
Both projects are extremely interesting, although they are completely different and come from separate domains.
I hope that someone will find these simple projects helpful:
06 Dec 2014
- Build things in a sandbox environment first; productionize them once they prove successful
- Try with simple applications first
- In general, do the same as with an RDBMS:
  - Drive the data source with business needs
  - Focus on usability and good-enough performance
  - Think dimensionally
  - Use conformed dimensions
  - Use Slowly Changing Dimensions techniques
  - Use surrogate keys
- Don’t do ad hoc querying; focus on analytics instead
- Use clouds for prototyping
- Search for performant, highly tuned dedicated solutions for your specific use case
- Use Hadoop to integrate many data sources (some of which may be unstructured)
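To make the "surrogate keys" and "Slowly Changing Dimensions" points a bit more concrete, here is a minimal in-memory sketch of a Type 2 slowly changing dimension. Everything in it (`CustomerRow`, `CustomerDimension`, the `city` attribute, the sample customer) is made up for illustration; a real warehouse would do this in its ETL layer rather than in Python lists.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerRow:
    customer_key: int          # surrogate key, meaningful only inside the warehouse
    customer_id: str           # natural (business) key from the source system
    city: str                  # the attribute whose history we track
    valid_from: date
    valid_to: Optional[date]   # None marks the current version


class CustomerDimension:
    """Hypothetical Type 2 dimension: every change creates a new row."""

    def __init__(self) -> None:
        self.rows: List[CustomerRow] = []
        self._next_key = 1

    def current(self, customer_id: str) -> Optional[CustomerRow]:
        for row in self.rows:
            if row.customer_id == customer_id and row.valid_to is None:
                return row
        return None

    def upsert(self, customer_id: str, city: str, as_of: date) -> int:
        """Apply a change and return the surrogate key that facts should reference."""
        row = self.current(customer_id)
        if row is not None and row.city == city:
            return row.customer_key      # nothing changed, reuse the key
        if row is not None:
            row.valid_to = as_of         # close the old version
        new_row = CustomerRow(self._next_key, customer_id, city, as_of, None)
        self._next_key += 1
        self.rows.append(new_row)
        return new_row.customer_key


dim = CustomerDimension()
k1 = dim.upsert("C-42", "Krakow", date(2014, 1, 1))
k2 = dim.upsert("C-42", "Krakow", date(2014, 6, 1))    # no change, same key
k3 = dim.upsert("C-42", "Warsaw", date(2014, 12, 6))   # change, new key
```

Facts loaded before the move keep pointing at the old surrogate key, so historical reports still show the old city, while new facts pick up the new key.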
24 Nov 2014
In case you’re looking for a quick and dirty (OK, maybe not so dirty) way of copying a table in HBase: the HBase docs to the rescue! Works like a charm.
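The link above did not survive, so the following is only a guess at the approach: one option the HBase reference guide describes is the `CopyTable` MapReduce utility. A small sketch of wrapping it from Python is below. The helper names (`copy_table_command`, `copy_table`) and the table names are hypothetical, and the sketch assumes the `hbase` launcher is on `PATH` and that the target table already exists with the same column families.

```python
import subprocess
from typing import List


def copy_table_command(source: str, target: str) -> List[str]:
    """Build the argv for HBase's CopyTable MapReduce job (same cluster)."""
    return [
        "hbase",
        "org.apache.hadoop.hbase.mapreduce.CopyTable",
        f"--new.name={target}",
        source,
    ]


def copy_table(source: str, target: str) -> None:
    # Assumption: `hbase` is on PATH and `target` was created beforehand
    # with the same column families as `source`.
    subprocess.run(copy_table_command(source, target), check=True)


# Only builds the command; call copy_table(...) on a machine with HBase installed.
cmd = copy_table_command("orders", "orders_copy")
```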
23 Nov 2014
- Customer contact table:
  - Split it into as many elemental parts as possible.
- Use Unicode as the encoding; otherwise handling multiple encodings will be a nightmare
- If the warehouse is to be used internationally, translate the reports rather than the entire content of the warehouse
- According to the authors, data mining teams should get exports from the warehouse so they can run computations on their own servers (imagine exporting a couple hundred TB :))
- Solution for tracking users’ activity: create a “steps” dimension table that describes what a user can do, then link each event to that dimension.
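The "steps" dimension idea from the last bullet could look roughly like this. It is a toy sketch using an in-memory SQLite database; the table names (`step_dim`, `user_event_fact`), the step names, and the sample events are all invented for illustration.

```python
import sqlite3

events_db = sqlite3.connect(":memory:")
events_db.executescript("""
CREATE TABLE step_dim (
    step_key  INTEGER PRIMARY KEY,       -- surrogate key
    step_name TEXT NOT NULL UNIQUE       -- what the user can do
);
CREATE TABLE user_event_fact (
    event_id    INTEGER PRIMARY KEY,
    user_id     TEXT NOT NULL,
    step_key    INTEGER NOT NULL REFERENCES step_dim(step_key),
    occurred_at TEXT NOT NULL
);
""")

events_db.executemany(
    "INSERT INTO step_dim (step_key, step_name) VALUES (?, ?)",
    [(1, "view_product"), (2, "add_to_cart"), (3, "checkout")],
)
events_db.executemany(
    "INSERT INTO user_event_fact (user_id, step_key, occurred_at) VALUES (?, ?, ?)",
    [
        ("u1", 1, "2014-11-23T10:00:00"),
        ("u1", 2, "2014-11-23T10:01:00"),
        ("u2", 1, "2014-11-23T10:02:00"),
        ("u1", 3, "2014-11-23T10:05:00"),
    ],
)

# Each event links to the dimension, so activity can be sliced by step.
events_per_step = events_db.execute("""
    SELECT s.step_name, COUNT(*)
    FROM user_event_fact AS f
    JOIN step_dim AS s ON s.step_key = f.step_key
    GROUP BY s.step_key, s.step_name
    ORDER BY s.step_key
""").fetchall()
```

Adding a new kind of user action then means adding a row to `step_dim`, not changing the fact table schema.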
23 Nov 2014
- Don’t include “to-date” totals in the fact table; they should be calculated dynamically, not stored.
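As a rough illustration of computing a to-date total at query time, here is a small sketch with an in-memory SQLite database. The `sales_fact` table and its numbers are made up; the fact table stores only the per-day amount, and the running total is derived in the query.

```python
import sqlite3

sales_db = sqlite3.connect(":memory:")
sales_db.execute(
    "CREATE TABLE sales_fact (sale_date TEXT PRIMARY KEY, amount INTEGER NOT NULL)"
)
sales_db.executemany(
    "INSERT INTO sales_fact (sale_date, amount) VALUES (?, ?)",
    [("2014-11-21", 100), ("2014-11-22", 50), ("2014-11-23", 70)],
)

# The to-date total is computed on the fly with a correlated subquery,
# so a late-arriving or corrected row never leaves a stale stored total.
totals_to_date = sales_db.execute("""
    SELECT f.sale_date,
           f.amount,
           (SELECT SUM(p.amount)
              FROM sales_fact AS p
             WHERE p.sale_date <= f.sale_date) AS total_to_date
    FROM sales_fact AS f
    ORDER BY f.sale_date
""").fetchall()
```

If a stored total were kept in each row instead, fixing one day's amount would mean rewriting every later row.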