braindump ... cause thread-dumps are not enough ;)

Four principles of scalable Big Data systems

Below are my notes from listening to a podcast with Ian Gorton, who talked about principles for designing systems which process massive amounts of data.

The key problems with Big Data:

  • Volume of the data: there is a lot of it, storing it and providing resources for processing is a challenge.
  • Velocity of the data: how fast it changes, the rate of change.
  • Variety of data types which need to be analyzed.
  • Veracity which is about data integrity, quality etc.

They all boil down to a single issue: very large, complex data sets are available now and it is not possible anymore to process them with traditional databases and data processing techniques.

My comment: in 2014 it was indeed true. I think that there is one caveat to this: Moore’s law. 4 years ago CPUs and memory were not as fast or as optimized for parallel processing as they are today. Data sizes which might have been considered Big Data back then could now be handled by older technologies, and this boundary is only going up. This means that for some slower organizations moving to Big Data tech doesn’t make any sense, as they can solve the problem by just buying faster hardware instead of building a Hadoop/Spark cluster.

The processing systems need to be distributed and horizontally scalable (i.e. you can add capacity by adding more hosts of the same type instead of building a faster CPU).

Principles to follow when building Big Data systems:

  • “You can’t scale your efforts and costs for building a big data system at the same rate that you scale the capacity of the system.” - if you estimate that within a year your system will be 4x bigger, you can’t expect to have a 4x bigger team.
  • “The more complex solution, the less it will scale” - this one is more about making the right decisions on choosing technologies. If you choose something very complex - it will be very complex to scale.
  • “When you start to scale things, statefulness is a real problem because it’s really hard to balance the load of the stateful objects across the available servers” - it is difficult to handle failures, because losing state and recreating it is hard. Using stateless objects is the way to go here.
  • Failure is inevitable - redundancy and resilience to failure is key. You have to account for failure and be ready for problems with many parts of the system.
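To illustrate the statelessness point, here is a minimal Python sketch (the names and the dict-as-session-store are my own illustration, not from the talk): when all state lives in an external shared store rather than in worker memory, any server can handle any request, so load balancing and recovering from a crashed host become trivial.

```python
# A stateless request handler: all session state lives in an external
# store (here a plain dict standing in for something like Redis), so
# any worker on any host can serve any request and a crashed worker
# can simply be replaced.

SESSION_STORE = {}  # stand-in for a shared store such as Redis

def handle_request(session_id, action):
    # load state from the shared store, never from worker memory
    state = SESSION_STORE.get(session_id, {"items": []})
    if action == "add":
        state["items"].append("item")
    # persist state back, so the next request may hit a different worker
    SESSION_STORE[session_id] = state
    return len(state["items"])

# two calls that could have been served by two different workers
handle_request("user-1", "add")
print(handle_request("user-1", "add"))  # → 2
```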

I think one golden thought which may be easily overlooked is this:

“Hence, how do you know that your test cases and the actual code that you have built are going to work anymore? The answer is you do not, and you never will.”

If you are working with a Big Data system, you can never know how it will behave in production, because recreating the real conditions is too costly. This means that the only reliable and predictable way to build such a system is to introduce a feedback loop which will tell you as early as possible whether you have broken anything, which boils down to: continuous, in-depth monitoring of the infrastructure and using CI/CD in connection with techniques like blue/green deployment.

My observations:

  • It is important to do proper capacity planning - or at least an estimation of future data inflow (e.g. over the next year).
  • A key factor which allows you to introduce efficiency and cut the costs of operating a large system is automation (e.g. instead of manually installing servers you need to automate this).
  • Simplicity allows for a better understanding of what is happening in the system => this leads to a better understanding of bottlenecks and figuring out how to avoid them.

Link to the original talk and transcript.

Apache Airflow tricks

I will update this post from time to time with more learnings.

Problem: I fixed a problem in my pipeline but Airflow doesn’t see the change.

Possible things you can do:

  • check if you actually did fix it :)
  • try to refresh the DAG through UI
  • remove *.pyc files from the dags directory
  • check if you have installed the dependencies in the right python environment (e.g. in the right version of system-wide python or in the right virtualenv)
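The *.pyc cleanup can also be done from Python - a small sketch (the dags path below is just an example, adjust it to your AIRFLOW_HOME):

```python
from pathlib import Path

def remove_pyc(dags_dir):
    """Delete all *.pyc files under the given dags directory."""
    removed = 0
    for pyc in Path(dags_dir).rglob("*.pyc"):
        pyc.unlink()
        removed += 1
    return removed

# example (hypothetical path): remove_pyc("/home/airflow/airflow/dags")
```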

Problem: This DAG isn’t available in the web server’s DagBag object. It shows up in this list because the scheduler marked it as active in the metadata database.

  • Refresh the DAG code from the UI
  • Restart the webserver - this did the trick in my case. Some people report that there might be a stalled gunicorn process; if a restart doesn’t help, try to find rogue processes and kill them manually (source, source 2)

Problem: I want to delete a DAG

  • Airflow 1.10 has a command for this: airflow delete ...
  • Prior to 1.10, you can use the following script:
import sys

from airflow.hooks.postgres_hook import PostgresHook

# usage: python delete_dag.py <dag_id>
dag_input = sys.argv[1]
hook = PostgresHook(postgres_conn_id="airflow_db")

for t in ["xcom", "task_instance", "sla_miss", "log", "job", "dag_run", "dag"]:
    # the second argument enables autocommit
    hook.run("delete from {} where dag_id='{}'".format(t, dag_input), True)

(tested: works like a charm :))


Do not update pip installed from a system package!

Today I spent quite a bit of my time trying to understand why the hell I was getting weird errors when trying to work with multiple versions of Python on the same Debian machine. Below are my (quite emotional) notes from the process:

  • So here’s the deal: we want to use two versions of Python in parallel. In a normal world - not a huge deal, just install the python2 and python3 packages and then python-pip and python3-pip. All cool, right? Until… you decide to set the default version to python3 and pip2 begins to die in weird circumstances. Why? What changed? “I haven’t changed anything about python2” you say. It turns out… pip2 is a script which does the following:
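The stub looked roughly like this - reconstructed from memory, so the exact contents differ by pip/Debian version; the crucial detail is the shebang:

```python
#!/usr/bin/env python
# Sketch of the /usr/bin/pip2 wrapper script (reconstructed, not
# verbatim). "env python" resolves to whatever "python" points at
# *right now*, not to python2 -- so making python3 the default
# silently changes which interpreter pip2 runs under.
import sys

def main(argv):
    # the real stub dispatches into pip's entry point here; stubbed
    # out so this sketch is self-contained
    print("would run pip under: {}".format(sys.executable))
    return 0

if __name__ == "__main__":
    main(sys.argv[1:])
```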

This means it will always use the default version of python. You change the default to python3 - and pip2 switches sides and stabs you in the back.

Solution: always use pythonX -m pip ....
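A quick sanity check for this: invoking pip via sys.executable ties it to one exact interpreter, which is what the pythonX -m pip pattern buys you (a sketch; output format varies by pip version):

```python
import subprocess
import sys

# "python -m pip" runs the pip module of this exact interpreter,
# no matter where the "python" symlink currently points
result = subprocess.run(
    [sys.executable, "-m", "pip", "--version"],
    capture_output=True, text=True,
)
print(result.stdout.strip())
```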

  • It turns out that deb packages do not contain scripts… they contain binaries! I’ve been replacing them with scripts through pip install --upgrade pip. Never EVER update a system package through a different package manager.


Dockerizing Scala apps. Notes.

At some point I realized that building and shipping my software with deb packages sucks a little bit (especially managing a deb repo on a small scale). I had been looking at Docker and containers for some time already and it seemed to be a great idea. When I started working on my new home backup system I decided to give it a try. Since the app I was building was a Scala app, the learning curve was a little bit steeper than for other people. Along the way I noticed the following things:

  • the Dockerfile can be created automatically in a couple of ways: by using the sbt-docker plugin, the sbt-native-packager plugin, or it can be written manually
  • GitLab hosts its own Docker image registry and you can use it for free (yay!)
  • GitLab also offers CI pipelines, which makes it super easy to build everything after a push and then automatically deploy to the infrastructure

The sbt-native-packager is probably the right way to go in my case, as I just want something working and don’t want to configure a lot of things (and this plugin’s philosophy is exactly that). For more complex setups sbt-docker is probably better.

The changes I had to make in build configuration to get the Dockerfile generated correctly:

  • get correct imports: add import com.typesafe.sbt.packager.docker.DockerPlugin.autoImport._
  • enable the plugin: add enablePlugins(DockerPlugin)
  • customize the dockerfile (add the keys from autoImport to settings sequence):
    dockerExposedPorts := Seq(8080),
    dockerUpdateLatest := true,

There are some hacks which are not obvious (and helped in my scenario):

  • adding commands in the middle of the dockerfile:

      dockerCommands := {
        val (left, right) = dockerCommands.value.splitAt(3)
        left ++ Seq(
          ExecCmd("RUN", "mkdir", "-p", "/directory/path")
        ) ++ right
      }
  • pushing to gitlab registry requires some hacks:

      dockerRepository := Some(""), // set the repository url
      packageName in Docker := "project-name", // UUUGLY hack - do this if your repo name is different than project name in sbt
  • if you’re using a private repository (as I do) remember to include a docker login ... somewhere before ./sbt docker:publish in the build pipeline
  • when you want to access the image from outside, remember to prefix it with the private repo name!

      $ docker pull


Lusca Csrf Problem

TL;DR: When you try to upload a file to an Express.js app from Angular, use the $http service and don’t try to push it with a <form .../> element.

I tried to start with the most obvious and simple way: just create a form with a file input element, create a simple backend endpoint in Express and combine them together. It turns out that this approach doesn’t work well when you integrate the lusca anti-CSRF library. Upon the request there is going to be an ugly message that the CSRF token wasn’t found.

Funny enough: Angular has built-in support for XSRF protection, so at least in theory the integration with lusca should be transparent. I started with checking the communication in the dev console in Chrome: the XSRF-TOKEN cookie was indeed being set in response messages but was missing in requests. Then I modified the local copy of lusca to check when the token was actually being set. I noticed that the first call I was doing to the API was actually the file upload one. That meant that there was no chance the server was able to set up the token prior to the actual API call. This suggested that there was a problem on the client side. I’ve tried some weird approaches (e.g. adding a hidden input field with the _csrf name - a solution suggested here), however nothing worked. Then the enlightenment came: the integration was implemented in the $http service and the form submit wasn’t using AJAX at all! I switched to sending the file through $http service AJAX calls and… it just started working!