braindump ... cause thread-dumps are not enough ;)

notes on "Clean Code - Formatting" chapter

  • Formatting is important
  • A source code file should read like a newspaper article.
    • The name should be simple and explanatory.
    • Top parts of the file should provide high-level concepts.
    • The lower in the file, the more detailed the information.
  • Blank lines help in identification of separate concepts.
  • Pieces of code which are related should be kept close in source code.
  • Instance variables should live at the top of the class file.
  • Local variables should be declared as close as possible to the place of usage.
  • A function that is called should be placed below the function that calls it.
  • Lines should be at most 120 columns wide. (Uncle Bob’s personal preference)
  • Preserve indentation of blocks.
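
A minimal Java sketch of the layout rules above. The class InvoicePrinter and all of its members are hypothetical names made up for illustration; amounts are plain integers to keep the example short.

```java
import java.util.List;

// Hypothetical example class; all names are illustrative only.
class InvoicePrinter {

    // Instance variables live at the top of the class.
    private final String customer;
    private final List<Integer> amounts;

    InvoicePrinter(String customer, List<Integer> amounts) {
        this.customer = customer;
        this.amounts = amounts;
    }

    // High-level concept first: what an invoice consists of.
    String print() {
        return header() + "\n" + totalLine();
    }

    // Helpers follow, each below the function that calls it.
    private String header() {
        return "Invoice for " + customer;
    }

    private String totalLine() {
        return "Total: " + total();
    }

    private int total() {
        // Local variable declared right where it is used.
        int sum = 0;
        for (int amount : amounts) {
            sum += amount;
        }
        return sum;
    }
}
```

For example, new InvoicePrinter("Alice", List.of(10, 20)).print() yields "Invoice for Alice" followed by "Total: 30" on the next line.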

Set team formatting rules and make everyone use them!

notes on "Clean Code - Comments" chapter

  • Better refactor code than comment it.
  • Maintaining code takes time and requires great discipline, which is usually hard in fast-paced projects.
  • Don’t rely too much on comments: they get outdated very easily. This makes them inaccurate, irrelevant, misleading or just simply lying.
  • Instead of commenting out code, throw it away. A copy of it is kept in the version control system.
  • “HTML in source code comments is an abomination”.
  • Most comments are useless (too far from code, too obvious, just mumbling, not accurate enough).
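
A small Java sketch of "better refactor code than comment it". Employee, BenefitsCalculator and the numbers used (age 65, bonus 100) are hypothetical and exist only for illustration.

```java
// Hypothetical example; all names and numbers are illustrative only.
class Employee {
    private final boolean hourly;
    private final int age;

    Employee(boolean hourly, int age) {
        this.hourly = hourly;
        this.age = age;
    }

    boolean isHourly() {
        return hourly;
    }

    int age() {
        return age;
    }

    // The method name carries the meaning a comment would otherwise carry.
    boolean isEligibleForFullBenefits() {
        return hourly && age > 65;
    }
}

class BenefitsCalculator {

    // Before: a comment is needed to explain the condition.
    static int bonusWithComment(Employee employee) {
        // check whether the employee is eligible for full benefits
        if (employee.isHourly() && employee.age() > 65) {
            return 100;
        }
        return 0;
    }

    // After: the code explains itself, so the comment can go.
    static int bonus(Employee employee) {
        if (employee.isEligibleForFullBenefits()) {
            return 100;
        }
        return 0;
    }
}
```

Both methods return the same result; the second one simply cannot drift out of sync with a comment.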

When comments might be a good idea:

  • Legal obligation to write specific comments (copyrights).
  • Clarification of code which just can’t be written in a more expressive way (quite rare).
  • Docs on public API.

What I disagree with:

  • TODO comments are bad, not good (the book counts them among the acceptable comments). Instead of a TODO comment there should be a task in the backlog explaining what should be corrected.

notes on "Clean Code - Functions" chapter

Main thoughts I want to remember from that chapter:

  • Functions should be small. Small means up to several lines.
  • Functions should have descriptive names.
  • Functions should do one thing. They should do it well. They should do it only. This means that side effects should be as limited as possible (preferably there should be none).
  • It should be possible to read the function as a “To …” sentence:

    TO functionName we do something and then we do something else...
    
  • Functions should operate on a single level of abstraction.
  • switch statements should be avoided. They can probably be replaced by inheritance (polymorphism); if not, they should be hidden in a single place so that it won’t be necessary to write similar switch statements again.
  • The fewer arguments, the better. Rule of thumb: three parameters are almost always too many. One or two parameters are acceptable.
  • Don’t use arguments as a way to communicate with the outside world: don’t use output parameters.
  • Throw exceptions instead of returning error codes. The caller is then not forced to handle the error right away; it can be dealt with at a convenient place in the code.
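
A minimal Java sketch of two of the points above: the only switch is hidden in a factory, the rest of the code works through polymorphism, and an unknown input raises an exception instead of returning an error code. Shape, Circle, Square and ShapeFactory are hypothetical names chosen for illustration.

```java
// Hypothetical example; all type names are illustrative only.
interface Shape {
    double area();
}

final class Circle implements Shape {
    private final double radius;

    Circle(double radius) {
        this.radius = radius;
    }

    @Override
    public double area() {
        return Math.PI * radius * radius;
    }
}

final class Square implements Shape {
    private final double side;

    Square(double side) {
        this.side = side;
    }

    @Override
    public double area() {
        return side * side;
    }
}

final class ShapeFactory {

    // The single place that switches on the kind of shape.
    static Shape create(String kind, double size) {
        switch (kind) {
            case "circle":
                return new Circle(size);
            case "square":
                return new Square(size);
            default:
                // Throw instead of returning null or an error code.
                throw new IllegalArgumentException("Unknown shape kind: " + kind);
        }
    }
}
```

Callers only ever see Shape and call area(), so adding a new shape means one new class and one new case in the factory, not another switch elsewhere.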

Remember: functions are not written correctly at first. Write something meeting functional requirements with TDD and then refactor mercilessly.

faking DUAL table in hive

It is sometimes beneficial to have something constant in a database. RDBMS engines like Oracle or DB2 have tables like DUAL or SYSIBM.SYSDUMMY1. In Hive there is no such thing by default … But why not create a custom one? The easiest way (which I think works on all (???) Linux boxes) is to create one based on /etc/hostname.

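-- Recreate the table from scratch (CREATE OR REPLACE TABLE is not valid HiveQL).
DROP TABLE IF EXISTS dual;
CREATE TABLE dual (dummy STRING);
-- Seed the table from a small local file so that it has at least one row.
LOAD DATA LOCAL INPATH '/etc/hostname' OVERWRITE INTO TABLE dual;

-- Collapse the contents to a single row holding the constant "X".
INSERT OVERWRITE TABLE dual
SELECT
    "X" AS dummy
FROM
    dual
LIMIT 1;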

Hacky… But on the other hand pretty simple. The other way to do it is … write a custom UDTF. Check how it can be done with this project: github

Oracle docs on DUAL: Oracle docs

hive <=> operator

One never stops learning … I’ve just found out that there is a wonderfully simple way of writing equality conditions that also match NULL values in Hive: the <=> operator. I used to write something like this:

...
WHERE 
  column1 = column2 OR (column1 IS NULL AND column2 IS NULL)

Now it boils down to:

WHERE
  column1 <=> column2

For reference: in SQL, NULL = NULL evaluates to NULL (unknown) rather than true, so a plain = never matches two NULL values.