14 Nov 2014
Based on the title I was expecting that video to be yet another talk about how modern software companies are disrupting all kinds of businesses. I couldn't have been more wrong :) Erik thoroughly demolished commonly used agile rules and techniques. I wouldn't agree with everything he said there… but he made his point: I started to look at those rules with more criticism. For me this is the whole point: stop merely following the process. Think for yourself and work in the way that makes you most effective.
I almost forgot: the video from the talk is here.
13 Nov 2014
One of the things that gets repeated in every MapReduce tutorial is that if you use TextOutputFormat, you should avoid creating new instances of the Text class. Instead, you can create a single instance, store it in a field and reuse it via the set method. Unfortunately, set for a byte array doesn't behave in the most intuitive way… The input array is copied into an internal buffer. Unless the new data chunk is bigger than the buffer, the same piece of memory is reused over and over again, and, presumably for performance reasons, the unused part of the array is left dirty. If you want to get the actual data out of the internal buffer, you need to use copyBytes. Unfortunately, it seems that not all parts of the Hadoop code know that. For example, the TextOutputFormat class uses this method:
private void writeObject(Object o) throws IOException {
    if (o instanceof Text) {
        Text to = (Text) o;
        out.write(to.getBytes(), 0, to.getLength());
    } else {
        out.write(o.toString().getBytes(utf8));
    }
}
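To make the gotcha more concrete, here is a minimal sketch (not from the original post) showing what getBytes() hands back after the internal buffer has been reused; the strings are purely illustrative and the exact dirty tail may differ between Hadoop versions:

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.Text;

public class DirtyBufferDemo {
    public static void main(String[] args) {
        Text text = new Text();
        // the first call grows the internal buffer to hold 11 bytes
        text.set("hello world".getBytes(StandardCharsets.UTF_8));
        // the second call reuses the same buffer and overwrites only its first 3 bytes
        text.set("bye".getBytes(StandardCharsets.UTF_8));

        // getBytes() returns the whole internal buffer, dirty tail included
        System.out.println(new String(text.getBytes(), StandardCharsets.UTF_8));
        // likely prints: byelo world

        // getLength() marks where the valid data ends...
        System.out.println(new String(text.getBytes(), 0, text.getLength(), StandardCharsets.UTF_8));
        // prints: bye

        // ...and copyBytes() returns a trimmed copy of just the valid part
        System.out.println(new String(text.copyBytes(), StandardCharsets.UTF_8));
        // prints: bye
    }
}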
There are two solutions to this problem:
- create a new Text instance each time a new tuple is going to be emitted (some information around the internet indicates this might not be the worst idea)
- clear the Text instance by calling set("") and then set it again
But which one is better? The only metric that matters here is execution speed, so let's settle the question with a small experiment. Below you can find a test case which carries out a simple benchmark.
@Test
public void shouldTellWhatIsBetterCreatingTextOrReusing() throws Exception {
    byte[] longText = toBytes("123456789");
    int iterations = 1000000;

    System.out.println("reusing... ");
    long start1 = System.nanoTime();
    Text text = new Text();
    for (int i = 0; i < iterations; i++) {
        text.set(longText);
        text.set(""); // resets the Text to an empty string
    }
    long stop1 = System.nanoTime();
    long diff1 = stop1 - start1;
    System.out.println("reusing took: " + diff1 + " " + (diff1 / 1000000000));

    System.out.println("creating... ");
    long start2 = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
        new Text(longText);
    }
    long stop2 = System.nanoTime();
    long diff2 = stop2 - start2;
    System.out.println("creating took: " + diff2 + " " + (diff2 / 1000000000));
}
The result is … interesting:
reusing...
reusing took: 104000088 0
creating...
creating took: 4898000991 4
So it is actually faster to clear the Text instance! I was expecting that kind of result, but to be honest, the difference is much bigger than I thought. Either way, problem solved! :)
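Just to sketch how the winning option could look in practice, here is a minimal, hypothetical mapper (not from any real job of mine) that keeps a single Text field and clears it with set("") before storing the next value:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReusingTextMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    // a single instance reused for every emitted tuple
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // copyBytes() gives us exactly the valid part of the incoming value
        byte[] payload = value.copyBytes();

        outValue.set("");      // clear the previously stored value first
        outValue.set(payload); // then store the new payload
        context.write(NullWritable.get(), outValue);
    }
}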
11 Nov 2014
One of the most painful things for me while working with Apache Hadoop is testing. Until recently, to test e.g. a Hive query or an MR job, I had to use a VM with everything installed inside. Thankfully, this changed when I found this. It turns out you can create nice in-memory mini clusters with HBase/HDFS/YARN (at least that's the part I've been playing around with). I created this sample project to document how to make it work (Gradle build definition included!). It contains only a single test case which adds some rows to a test HBase table and then checks that the table actually contains them.
The code is based on CDH 4.7.0, but it should work if you just bump up the versions or switch to artifacts from other providers.
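For the record, a test against such a mini cluster can be as short as the sketch below. It is only an illustration of the idea, assuming JUnit and HBaseTestingUtility from the HBase test artifacts are on the classpath, and uses the HBase 0.94-era client API (method names differ slightly in newer versions); it is not a copy of the sample project:

import static org.junit.Assert.assertArrayEquals;

import org.apache.hadoop.hbase.HBaseTestingUtility;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.Test;

public class MiniClusterTest {

    @Test
    public void shouldReadBackWhatWasWritten() throws Exception {
        HBaseTestingUtility utility = new HBaseTestingUtility();
        utility.startMiniCluster(); // spins up in-memory HDFS, ZooKeeper and HBase

        try {
            HTable table = utility.createTable(Bytes.toBytes("test"), Bytes.toBytes("cf"));

            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("row1")));
            assertArrayEquals(Bytes.toBytes("value"),
                    result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")));
        } finally {
            utility.shutdownMiniCluster();
        }
    }
}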