braindump ... cause thread-dumps are not enough ;)

Notes on "Clean Code - Systems" chapter

  • Dependency Injection (using Inversion of Control to manage dependencies) separates the construction of objects from the places where they are used (this helps preserve the Single Responsibility Principle).
  • With dependency injection, the client class exposes only constructor parameters or setter methods, which are later used to inject its dependencies (see the sketch after this list).
  • It is impossible to build the system right the first time. Instead, it is much safer to implement only what is actually required and then rebuild and extend the system along the way. An iterative approach is essential.
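
A minimal Java sketch of constructor injection, with made-up class names: construction happens in one place (the composition root), while the client class only declares what it needs.

    // The client depends on an abstraction and only declares the dependency
    // in its constructor - it never constructs the dependency itself.
    interface PaymentGateway {
        void charge(String account, long amountCents);
    }

    class OrderService {
        private final PaymentGateway gateway;

        OrderService(PaymentGateway gateway) {   // the dependency is injected here
            this.gateway = gateway;
        }

        void placeOrder(String account, long amountCents) {
            // business logic only - no construction logic
            gateway.charge(account, amountCents);
        }
    }

    // Construction is separated from use: wiring happens at the entry point
    // (or is delegated to a DI container such as Spring or Guice).
    class Main {
        public static void main(String[] args) {
            PaymentGateway consoleGateway = new PaymentGateway() {
                @Override
                public void charge(String account, long amountCents) {
                    System.out.println("charging " + account + ": " + amountCents);
                }
            };
            new OrderService(consoleGateway).placeOrder("acc-42", 1999);
        }
    }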

Software systems are different from physical systems. Their architecture can be expanded iteratively, but this requires careful separation of concerns.

  • In AOP, aspects define the points in the system whose behavior should be changed in a consistent, cohesive way. This specification is provided either declaratively or programmatically.
  • Software systems should start with a “naively simple” architecture that delivers working software to the users. More functionality can be added later as you scale up. (Doesn’t that sound like an MVP?)

An optimal architecture consists of modular problem subdomains, each implemented with POJOs. The separate domains are integrated using non-intrusive aspect-oriented tools, and such an architecture can be tested in the same way as ordinary business logic code.
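
As an illustration, a minimal sketch of such a non-intrusive aspect, assuming AspectJ/Spring AOP annotations on the classpath (the package and class names are made up): the business POJOs stay unaware of the cross-cutting timing concern.

    import org.aspectj.lang.ProceedingJoinPoint;
    import org.aspectj.lang.annotation.Around;
    import org.aspectj.lang.annotation.Aspect;

    // The cross-cutting concern lives outside the business POJOs: the aspect
    // declares *where* it applies (the pointcut) and *what* it does (the advice).
    @Aspect
    public class TimingAspect {

        // Applies to every public method of every class in the (hypothetical)
        // com.example.orders package and its sub-packages.
        @Around("execution(public * com.example.orders..*.*(..))")
        public Object time(ProceedingJoinPoint joinPoint) throws Throwable {
            long start = System.nanoTime();
            try {
                return joinPoint.proceed();  // invoke the actual business method
            } finally {
                long tookMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println(joinPoint.getSignature() + " took " + tookMs + " ms");
            }
        }
    }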

  • Delaying architectural decisions is good - you can then make the decision with the most up-to-date information.
  • Standards are not everything. Because establishing a standard takes a lot of time, standards often lose focus on the needs of the users they were meant to serve.
  • Use domain-specific languages.
  • Regardless of whether you design a big system or an individual module, use the simplest, most effective tools.

Provided scope in Gradle

A very useful thing (especially when you’re more familiar with Maven :)): link.

The trick is to create a provided configuration and add its artifacts to the compile classpath.

EDIT: You often need to add the provided scope separately in the idea plugin configuration: this explains how.

Notes on "Clean Code - Classes" chapter

  • Order of class elements (note no public variables on the list :)):
    1. public static final constants
    2. private static variables
    3. instance variables
    4. public methods
    5. private methods
  • Uncle Bob allows methods and variables to have default or protected access modifiers for testing purposes. (Personally, I don’t like this idea - it still breaks encapsulation.)
  • First rule of classes: they should be small.
  • Second rule of classes: they should be smaller than they are.
  • Class should have a single responsibility
    • SRP principle: a class should have a single reason to change.
  • Rule of thumb: if you can’t come up with a concise name for a class, the class is too big.
  • Processor, Manager, or Super in a class name suggests that it has too broad a range of responsibilities.
  • Rule of thumb: you should be able to write a short class description - around 25 words - without using the words “if”, “and”, “or”, “but”.

    Every larger system contains a lot of business logic and is complicated. The basic way to manage this complexity is to organize the code so that a programmer knows where to find specific elements and, at any given time, needs to analyze only the parts directly connected to the problem.

  • Each class should have a small number of instance variables, and each method should use one or more of them. The more instance variables the methods use, the more cohesive the class is. A perfectly cohesive class uses all instance variables in all methods.
  • When a class loses cohesion, you probably need to split it.
  • Refactoring of legacy code:
    1. Cover the code with unit tests.
    2. Make a small change to the code to improve it.
    3. Run the tests to prove no bugs were introduced.
    4. Go to 2. :)
  • OCP - Open/Closed Principle: classes should be open for extension and closed for modification.
  • DIP - Dependency Inversion Principle: classes should rely on abstractions, not implementations (see the sketch after this list).
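
A small Java sketch of OCP and DIP together (the types are invented for illustration): ReportPrinter depends only on the ReportFormatter abstraction, and supporting a new output format means adding a class rather than modifying existing ones.

    // Abstraction: high-level code depends on this, not on concrete formatters (DIP).
    interface ReportFormatter {
        String format(String title, String body);
    }

    // Adding a new format is an extension - existing classes stay untouched (OCP).
    class PlainTextFormatter implements ReportFormatter {
        public String format(String title, String body) {
            return title + "\n" + body;
        }
    }

    class HtmlFormatter implements ReportFormatter {
        public String format(String title, String body) {
            return "<h1>" + title + "</h1><p>" + body + "</p>";
        }
    }

    // A small, cohesive class with a single responsibility: printing reports.
    class ReportPrinter {
        private final ReportFormatter formatter;

        ReportPrinter(ReportFormatter formatter) {
            this.formatter = formatter;
        }

        void print(String title, String body) {
            System.out.println(formatter.format(title, body));
        }
    }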

MythBusters, Impala edition: "Impala udf functions runs even after jar is removed" - BUSTED

While I was working on a Java Impala UDF, a colleague pointed me to: Impala udf functions runs even after jar is removed

I thought I should give it a try - it could turn out that all my work would be useless because of this. I prepared two jar files with the same class names (but different implementations!) and identical UDF signatures (the shape of the UDF class is sketched after the scenario). The scenario was as follows:

  1. Create a function from the first jar in schema A.
  2. Execute a query with that function.
  3. Drop the function.
  4. Try to execute a query with the function (it failed - as expected).
  5. Replace the first UDF jar file in HDFS with the second implementation.
  6. Create a function from the same path in HDFS in schema A (with the same UDF signature!).
  7. Execute a query with that function - the query worked and the second version of the code was invoked.
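
For reference, a minimal sketch of what such a Java Impala UDF can look like - it reuses the Hive UDF interface and is registered in Impala with CREATE FUNCTION, a LOCATION and a SYMBOL clause. The package, class name and the "v1:"/"v2:" prefixes are made up; per the scenario above, the two jars differed only in the implementation of evaluate().

    package com.example.udf;  // hypothetical package

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // Same class name and evaluate() signature in both jars -
    // only the implementation (the prefix below) differed.
    public class MyUpperUdf extends UDF {
        public Text evaluate(Text input) {
            if (input == null) {
                return null;
            }
            // "v1:" in the first jar, "v2:" in the second
            return new Text("v1:" + input.toString().toUpperCase());
        }
    }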

No signs of the problems described in the original post. Version of Impala used: 1.4.1. I’ve also tried a different scenario:

  • two jars with the same name (but different paths in HDFS)
  • same class names
  • two different implementations
  • each implementation registered in a different schema

Unfortunately - in this case it also worked fine ;)

Hive fails when inserting data into a dynamically partitioned table

Today I spent some time investigating why one of our Hive queries failed. Cause: the YARN container for a map task got killed:

Container[(…)] is running beyond physical memory limits. Current usage: XXX MB of XXX MB physical memory used; XXX MB of XXX GB virtual memory used. Killing container. (…)

A quick look at the configuration used for dynamic partitioning:

    <property>
        <name>hive.exec.dynamic.partition.mode</name>
        <value>nonstrict</value>
    </property>
    <property>
        <name>hive.exec.dynamic.partition</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.exec.max.dynamic.partitions</name>
        <value>1000</value>
    </property>
    <property>
        <name>hive.exec.max.dynamic.partitions.pernode</name>
        <value>1000</value>
    </property>

First idea: the number of partitions can be limited with a WHERE clause (narrowing the data range). Let’s try it out… The number of partitions dropped to ~400 and the updated query worked fine. Solution no. 1 is therefore to split the processing into multiple queries over separate data ranges. Uffff… Let’s have another look. I made a couple of interesting observations:

  • There wasn’t that much data to insert (~600 MB)… Hive estimated that it needed only 2 mappers for processing.
  • The JVM -Xmx is set to 1536M, which is below the container limit (2 GB).
  • The stage that fails is a MapReduce TableScan stage, and it uses only 2 mappers:

    Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0

After some googling I found this. The symptoms are the same - a large number of partitions (in my case > 2.5k) and containers being killed because of high memory usage. This suggests the problem may lie in how the data is distributed across the cluster. There is a part of HiveQL that can influence how this is done: the DISTRIBUTE BY clause. What does it do? This seems to provide quite a nice explanation. After adding DISTRIBUTE BY on the dynamic partition column, Hive stopped failing (hurray!). But what actually happened?

Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1

Turns out the query plan got slightly changed. Execution plan before:

 Stage: Stage-1
   Map Reduce
     Alias -> Map Operator Tree:
       source_table
         TableScan
           alias: source_table
           Filter Operator
             predicate:
               expr: (some expr)
               type: boolean
             Select Operator
               expressions:
                 expr: column_name
                 type: string
                 ...
               outputColumnNames: _col0, ...
               Select Operator
                 expressions:
                   expr: UDFToLong(_col0)
                   type: bigint
                   expr: _col1
                   ...
                 outputColumnNames: _col0, ...
               File Output Operator
               ... (output to file)

Execution plan after:

  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        source_table 
          TableScan
            alias: source_table
            Filter Operator
              predicate:
                expr: (some expression)
                type: boolean
              Select Operator
                expressions:
                  expr: column_name
                  type: string
                  ...
                outputColumnNames: _col0, ...
                Reduce Output Operator <- added output operator
                  sort order: 
                  Map-reduce partition columns:
                    expr: _col5 <- column used in DISTRIBUTE BY
                    type: int
                  tag: -1
                  value expressions:
                    expr: _col0
                    type: string
                  ...
    Reduce Operator Tree: <- added reduce step
      Extract
        Select Operator
          expressions:
            expr: UDFToLong(_col0)
            type: bigint
            ...
          outputColumnNames: _col0, ...
          File Output Operator
          ... (output to file)

Hive added a reduce step after the initial table scan. Processing seems kind of slow (only a single reducer is used - though I’m not sure whether that isn’t simply due to the data size). The reducer probably processes partitions separately, which makes it require less memory. One important note: hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode still need to be set high enough that Hive doesn’t fail just because it has too many partitions to process (you need to estimate how many partitions will actually be used).

For dessert: the Hortonworks documentation suggests turning off memory limits when you hit this kind of problem :) A short quote:

Problem: When using the Hive script to create and populate the partitioned table dynamically, the following error is reported in the TaskTracker log file:

TaskTree [pid=30275,tipID=attempt_201305041854_0350_m_000000_0] is running beyond memory-limits. Current usage : 1619562496bytes. Limit : 1610612736bytes. Killing task. Dump of the process-tree for attempt_201305041854_0350_m_000000_0 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 30275 20786 30275 30275 (java) 2179 476 1619562496 190241 /usr/jdk64/jdk1.6.0_31/jre/bin/java ...

Workaround: The workaround is to disable all the memory settings (…)