30 Oct 2014
- Dependency Injection (using Inversion of Control to manage dependencies) separates the construction of objects from the places where they are used (this helps preserve the Single Responsibility Principle).
- When using dependency injection, the using class (the client) exposes only constructor parameters or setter methods, which are later used to inject the dependencies.
- It is impossible to build the system right the first time. It is much safer to implement only what is actually required and to rebuild and extend the system along the way. An iterative approach is essential.
Software systems are different from physical systems: their architecture can grow iteratively, but that requires taking care of the separation of concerns.
- In AOP, aspects define the points in the system that should be changed in a consistent way. Such a specification is provided declaratively or programmatically.
- Software systems should be designed with a “naively simple” architecture that delivers some working software to the users. More functionality can be added later, as you scale up. (Doesn’t that sound like an MVP?)
An optimal architecture consists of modular problem subdomains, each implemented with POJOs. The separate domains are integrated using non-intrusive aspect-oriented tools. That kind of architecture can be tested just like business logic code.
- Delaying decisions about architecture is good: you can make the decision with the most up-to-date information.
- Standards are not everything. Because establishing a standard takes a lot of time, standards often lose focus on the needs of the users they were meant to serve.
- Use domain-specific languages.
- Regardless of whether you are designing a big system or an individual module, use the simplest, most effective tools.
29 Oct 2014
A very useful thing (especially when you’re more familiar with Maven :)): link. The trick is to create a provided configuration and add its artifacts to compile. EDIT: you often need to add the provided scope separately in the idea plugin configuration: this explains how.
28 Oct 2014
While I was working on a Java Impala UDF, a colleague pointed me to: “Impala udf functions runs even after jar is removed”.
I thought I should give it a try: it might turn out that all my work would be useless because of this. I prepared two jar files with the same class names (but different implementations!). The UDF signatures were the same. The scenario was as follows (a rough Impala SQL sketch of these steps follows the list):
- Create a function from the first jar in schema A.
- Execute a query with that function.
- Drop the function.
- Try to execute a query with the function (failed, as expected).
- Replace the first UDF jar file in HDFS with the second implementation.
- Create a function from the same path in HDFS in schema A (with the same UDF signature!).
- Execute a query with that function: the query worked and the second version of the code was invoked.
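Expressed as Impala SQL, the steps looked roughly like this (the function name, signature, class and HDFS path are made up for illustration):
CREATE FUNCTION schema_a.my_udf(STRING) RETURNS STRING
LOCATION '/user/me/udfs/my-udf.jar' SYMBOL='com.example.MyUdf';
SELECT schema_a.my_udf(some_column) FROM some_table;   -- works
DROP FUNCTION schema_a.my_udf(STRING);
SELECT schema_a.my_udf(some_column) FROM some_table;   -- fails, as expected
-- the jar under /user/me/udfs/my-udf.jar is now replaced in HDFS with the second implementation
CREATE FUNCTION schema_a.my_udf(STRING) RETURNS STRING
LOCATION '/user/me/udfs/my-udf.jar' SYMBOL='com.example.MyUdf';
SELECT schema_a.my_udf(some_column) FROM some_table;   -- the second implementation is invoked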
No signs of the problems described in the original post. Impala version used: 1.4.1.
I’ve also tried a different scenario (sketched in SQL below the list):
- two jars with the same name (but different paths in HDFS)
- same class names
- two different implementations
- each implementation registered in a different schema
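In SQL terms, this boils down to registering the same class from two different jar paths in two different schemas (again, all names are made up):
CREATE FUNCTION schema_a.my_udf(STRING) RETURNS STRING
LOCATION '/user/me/udfs/path_a/my-udf.jar' SYMBOL='com.example.MyUdf';
CREATE FUNCTION schema_b.my_udf(STRING) RETURNS STRING
LOCATION '/user/me/udfs/path_b/my-udf.jar' SYMBOL='com.example.MyUdf';
SELECT schema_a.my_udf(some_column) FROM some_table;   -- uses the jar from path_a
SELECT schema_b.my_udf(some_column) FROM some_table;   -- uses the jar from path_b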
Unfortunately - in this case it also worked fine ;)
28 Oct 2014
Today I spent some time investigating why one of our Hive queries failed. The cause: the YARN container for a map task got killed:
Container[(…)] is running beyond physical memory limits. Current usage: XXX MB of XXX MB physical memory used; XXX MB of XXX GB virtual memory used. Killing container. (…)
A quick look at the configuration used for dynamic partitioning:
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>
<property>
<name>hive.exec.dynamic.partition</name>
<value>true</value>
</property>
<property>
<name>hive.exec.max.dynamic.partitions</name>
<value>1000</value>
</property>
<property>
<name>hive.exec.max.dynamic.partitions.pernode</name>
<value>1000</value>
</property>
First idea: the number of partitions can be limited with a WHERE clause (narrowing the data range). Let’s try it out… The number of partitions dropped to ~400 and the updated query worked fine. Solution number one, then, is to split the processing into multiple queries with separate data ranges. Uffff… Let’s have another look; I made a couple of interesting observations.
After some googling I found this. The symptoms are the same: a big number of partitions (in my case more than 2.5k) and the container being killed because of high memory usage. This suggests that the problem might be with how the data is distributed across the cluster, and there is a part of HiveQL that can influence how that is done: the DISTRIBUTE BY clause. What does it do? This seems to provide quite a nice explanation. After adding DISTRIBUTE BY on the dynamic partition column, Hive stopped failing (hurray!).
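For reference, the statement involved was an INSERT with dynamic partitions, roughly of the following shape once DISTRIBUTE BY was added (table, column and partition names are made up; the real query was more complex):
INSERT OVERWRITE TABLE target_table PARTITION (day)
SELECT
col_a,
col_b,
day                -- dynamic partition column, must come last in the SELECT
FROM source_table
WHERE day BETWEEN '2014-01-01' AND '2014-10-28'
DISTRIBUTE BY day; -- rows with the same day value end up on the same reducer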
But what actually happened?
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
Turns out the query plan got slightly changed. Execution plan before:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
source_table
TableScan
alias: source_table
Filter Operator
predicate:
expr: (some expr)
type: boolean
Select Operator
expressions:
expr: column_name
type: string
...
outputColumnNames: _col0, ...
Select Operator
expressions:
expr: UDFToLong(_col0)
type: bigint
expr: _col1
...
outputColumnNames: _col0, ...
File Output Operator
... (output to file)
Execution plan after:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
source_table
TableScan
alias: source_table
Filter Operator
predicate:
expr: (some expression)
type: boolean
Select Operator
expressions:
expr: column_name
type: string
...
outputColumnNames: _col0, ...
Reduce Output Operator <- added output operator
sort order:
Map-reduce partition columns:
expr: _col5 <- column used in DISTRIBUTE BY
type: int
tag: -1
value expressions:
expr: _col0
type: string
...
Reduce Operator Tree: <- added reduce step
Extract
Select Operator
expressions:
expr: UDFToLong(_col0)
type: bigint
...
outputColumnNames: _col0, ...
File Output Operator
... (output to file)
Hive added a reduce step after the initial table scan. Processing seems kind of slow (only a single reducer is used, though I’m not sure whether that isn’t simply caused by the data size). The reducer probably processes partitions separately, which makes it require less memory. One important note: hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode still need to be set in such a way that Hive doesn’t fail just because it has too many partitions to process (you need to estimate how many partitions will actually be used).
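With more than 2.5k partitions involved, that means raising the limits from the configuration shown earlier, for example per session (the values here are only an illustration):
SET hive.exec.max.dynamic.partitions=5000;
SET hive.exec.max.dynamic.partitions.pernode=5000;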
As a dessert: the Hortonworks documentation suggests turning off memory limits when you hit this kind of problem :)
Short quote:
Problem: When using the Hive script to create and populate the partitioned table dynamically, the following error is reported in the TaskTracker log file:
TaskTree [pid=30275,tipID=attempt_201305041854_0350_m_000000_0] is running beyond memory-limits. Current usage : 1619562496bytes. Limit : 1610612736bytes. Killing task. TaskTree [pid=30275,tipID=attempt_201305041854_0350_m_000000_0] is running beyond memory-limits. Current usage : 1619562496bytes. Limit : 1610612736bytes. Killing task. Dump of the process-tree for attempt_201305041854_0350_m_000000_0 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 30275 20786 30275 30275 (java) 2179 476 1619562496 190241 /usr/jdk64/jdk1.6.0_31/jre/bin/java ...
Workaround: The workaround is disable all the memory settings (…)