braindump ... cause thread-dumps are not enough ;)

Do not update pip installed from a system package!

Today I spent quite a bit of my time trying to understand why the hell I was getting weird errors when trying to work with multiple versions of python on the same debian machine. Below are my (quite emotional) notes from the process:

  • So here's the deal: we want to use two versions of python in parallel. In a normal world - not a huge deal, just install the python2 and python3 packages and then python-pip and python3-pip. All cool, right? Until… you decide to make python3 the default version and pip2 begins to die in weird circumstances. Why? What changed? "I haven't touched anything about python2" you say. It turns out… pip2 is a script which starts like this:
#!/usr/bin/python
# GENERATED BY DEBIAN

This means it will always run under the default version of python. The moment you make python3 the default, pip2 switches sides and stabs you in the back.

Solution: always invoke pip through an explicit interpreter: pythonX -m pip ....

  • It turns out that the deb packages do not contain scripts… they contain binaries! And I had been replacing them with scripts by running pip install --upgrade pip. Never EVER update a system package through a different package manager.


Dockerizing Scala apps. Notes.

At some point I realized that building and shipping my software as deb packages sucks a little bit (especially managing a deb repo on a small scale). I had been looking at docker and containers for some time already and it seemed like a great idea, so when I started working on my new home backup system I decided to give it a try. Since the app I was building was a Scala app, the learning curve was a little bit steeper than for other people. Along the way I noticed the following things:

  • a Dockerfile can be created in a couple of ways: automatically with the sbt-docker plugin or the sbt-native-packager plugin, or manually
  • gitlab hosts its own docker image registry and you can use it for free (yay!)
  • gitlab also offers CI pipelines, which makes it super easy to build everything after a push and then automatically deploy it to the infra

sbt-native-packager is probably the right way to go in my case, as I just want something that works without configuring a lot of things (and that is exactly this plugin's philosophy). For more complex setups sbt-docker is probably the better choice.

The changes I had to make in the build configuration to get the Dockerfile generated correctly:

  • get correct imports: add import com.typesafe.sbt.packager.docker.DockerPlugin.autoImport._
  • enable the plugin: add enablePlugins(DockerPlugin)
  • customize the dockerfile (add the keys from autoImport to the settings sequence):
    ...
    dockerExposedPorts := Seq(8080),
    dockerUpdateLatest := true,
    ...

There are also some non-obvious hacks which helped in my scenario:

  • adding commands in the middle of the dockerfile:

      dockerCommands := {
        // keep the first three generated commands, splice in our RUN, then append the rest
        val (left, right) = dockerCommands.value.splitAt(3)

        left ++ Seq(
          ExecCmd("RUN", "mkdir", "-p", "/directory/path")
        ) ++ right
      },
    
  • pushing to the gitlab registry requires some hacks:

      dockerRepository := Some("registry.gitlab.com/..."), // set the repository url
      packageName in Docker := "project-name", // UUUGLY hack - needed if your registry repo name is different from the sbt project name
    
  • if you're using a private repository (as I do), remember to include a docker login ... somewhere before ./sbt docker:publish in the build pipeline
  • when you want to access the image from outside, remember to prefix it with the private repo name!

      $ docker pull registry.gitlab.com/username/image:tag
    


Lusca CSRF Problem

TL;DR: When you try to upload a file to an Express.js app from angular, use the $http service and don't try to push it with a <form .../> element.

I started with the most obvious and simple way: just create a form with a file input element, create a simple backend endpoint in Express and wire them together. It turns out that this approach doesn't play well with the lusca anti-CSRF library: upon the request you get an ugly message that the CSRF token wasn't found.

Funnily enough, angular has built-in support for XSRF protection, so at least in theory the integration with lusca should be transparent. I started by checking the communication in the chrome dev console: the XSRF-TOKEN cookie was indeed being set in the response messages but was missing from the requests. Then I modified my local copy of lusca to check when the token was actually being set. I noticed that the first call I was making to the API was actually the file upload one. That meant there was no chance the server could set up the token prior to the actual API call. This suggested a problem on the client side. I tried some weird approaches (e.g. adding a hidden input field with the _csrf name - a solution suggested here), but nothing worked. Then the enlightenment came: the integration is implemented in the $http service and the form submit wasn't using AJAX at all! I switched to sending the file through $http AJAX calls (see the sketch below) and … it just started working!
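
For reference, a minimal sketch of the working approach (AngularJS 1.x; the 'app' module, 'uploadService' name and '/api/upload' endpoint are made up for illustration):

    // Sketch only: names are invented. The point is that the upload goes through
    // $http, so angular's XSRF machinery (XSRF-TOKEN cookie -> X-XSRF-TOKEN header)
    // actually runs for the request.
    angular.module('app').factory('uploadService', ['$http', function ($http: ng.IHttpService) {
      return {
        upload: function (file: File) {
          const fd = new FormData();
          fd.append('file', file);
          return $http.post('/api/upload', fd, {
            transformRequest: angular.identity,     // keep the FormData as-is, don't JSON-serialize it
            headers: { 'Content-Type': undefined }  // let the browser set the multipart boundary
          });
        }
      };
    }]);

The cookie was already arriving with the responses; what was missing was the request header, and that only gets added when the call goes through $http.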

Notes on scalability techniques

Scalability is probably one of the hardest challenges when you're building a distributed system. It is also a desirable one: if you need to scale your app, it means you have many happy users! Recently I found a great resource which tries to answer the question of how to design a scalable system. Funnily enough, it is a website with materials and tips for preparing for software engineering interviews (hired in tech). Below are my notes from reading through those materials.

The design process can be summarized as a list of steps one has to follow. It also doubles as a strategy for approaching system design questions at tech interviews :)

  1. Define the use-cases of the system. This means specifying exactly what the system should do and, maybe more importantly, what it shouldn't. The objective is to get a clear understanding of the scope of the requirements (e.g. does it have to gather specific extra statistics, do we plan to provide a UI). Setting some priorities is also important: it will allow you to make better trade-off decisions later on.
  2. Identify the constraints. In this step we want to identify the sources of CPU cost and the volume of processed/stored data. For the first part we typically want to estimate:
    • How many users are we going to have? The time-frame (per second/per day/per week etc.) is important: it lets you roughly estimate how many machines are needed.
    • What operations will be typically executed?
    • How many requests of each kind can be expected based on the number of users?
    • Is there a type of request which dominates everything else?
    • Does it require using a special system?
    • What are the limits within which we want to operate (e.g. is there a max cap on concurrent users)?

To estimate the volume of data we need to think about those topics:

  • What information is stored?
  • What is the size of each of the atomic pieces of information?
  • What is the size of an atomic piece of information multiplied by the number of requests?
  • What do you need to do with that data? Only store it for further analysis? Or read/update it constantly?
  • Can you purge some of it after some time?

At this stage it is also OK to make some assumptions: just remember to clearly document/specify what assumptions are being made. A quick back-of-envelope example is sketched below.
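
For example, a back-of-envelope pass over the request side might look like the snippet below (all numbers are invented; the point is just the shape of the arithmetic):

    // Invented numbers, labelled as assumptions - only the shape of the estimation matters.
    const dailyActiveUsers = 1_000_000;   // assumption
    const requestsPerUserPerDay = 10;     // assumption
    const avgResponseBytes = 2_000;       // assumption: ~2 KB per response

    const secondsPerDay = 24 * 60 * 60;                                          // 86 400
    const avgRps = (dailyActiveUsers * requestsPerUserPerDay) / secondsPerDay;   // ~116 req/s on average
    const peakRps = 3 * avgRps;                                                  // rough peak factor, ~350 req/s
    const dailyTrafficGB = (dailyActiveUsers * requestsPerUserPerDay * avgResponseBytes) / 1e9; // ~20 GB/day

    console.log({ avgRps, peakRps, dailyTrafficGB });

Numbers like these immediately tell you whether you are in "a couple of boxes" or "a small data centre" territory.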

  3. Create the abstract design. The goal here is to create a diagram of the components which will be present in the system, specify how each of them works and how they communicate with each other. This can be organized as layers and services. The specification of each component should include the crucial details (e.g. if this service computes an integral - what kind of algorithm it will use): this allows you to further refine the data load and computational costs. Try to analyze how the system works end-to-end (you're checking whether the use-cases from step 1 are fulfilled).

  4. Understand the bottlenecks. First: identify where the bottlenecks are (e.g. how many requests per second are there - can a single machine cope with that? how much data is going to be read in each request? if there is a db: can it be served from a single machine? can you distribute it?). Second: understand how they impact the initial design: maybe with more fine-grained components it would be easier to work around the bottlenecks?

  5. Scaling. Here we address the problems found in step 4. You should first try to use some well-known techniques to avoid reinventing the wheel. A list of techniques and their trade-offs is below.

The important side notes are:

  • The scope is really important. It is necessary to understand the priorities to make good trade-offs.
  • The abstract design is not final - you can tweak it :)
  • ALL estimations on this level are just ballpark figures: we want to know whether the system requires thousands of servers or just several to handle the projected load. To know better: get the current numbers and run them against your projections. Based on experience this will feel right or wrong. In the end you can always experiment by letting only a small percentage of users in and checking whether the projections still hold. If not - go back to the design, because you have just identified new bottlenecks.

After writing this down I realized that it looks like a check-list for any new system design process :) So, here goes the list of techniques which can be used to solve scalability/availability problems:

  1. Load-balancing. This can be done on different levels (e.g. DNS round-robin over a pool of IPs, or a dedicated entry point to the stack - the load-balancer). The trade-off is introducing a single point of failure (the load balancer itself). To solve this: buy another load-balancer and set it up to replicate the first one's work in an active-active model (each of them handles a portion of the traffic; if one goes down the other takes all the load). There is also a problem with session data: a user can potentially hit a different server each time. This can be solved by delegating storage to a centralized external data store available to all servers (an RDBMS or an external persistent cache, e.g. Redis). The application code should therefore be stateless, which means it doesn't store any extra data on the side (it just contains business logic). This way it is very easy to scale the pool of application servers by simply adding more units. External means that the db doesn't reside on the same machines as the application code (if an app server crashes, the db is left intact).
  2. Caching. Replace costly queries (typically db queries) by storing their results in memory. The next time a service runs the same query, the result is almost instantaneous. The caveats are: 1) the size of the cache is limited, e.g. by the size of RAM, 2) when data in the db is updated you need to update the cached value (and possibly evict all results of queries which depend on that value), 3) cached queries have to be frequent enough to see a boost (if you issue a query only once in a while, the cached result might be evicted before the query hits again). Examples of in-memory caches (other types are usually not performant enough): memcached and redis. The Le Cloud blog suggests that to avoid some of these problems (e.g. no 2 from the list) it is better to cache objects as your programming language understands them, instead of raw query results. The pattern is as follows: when the first request comes in, the code 'assembles' the object, which may involve multiple db queries. Then you serialize the object and send it to the cache. When the next request comes in you just deserialize the necessary objects from the cache (see the first sketch after this list). The other upside is that it is possible to apply the idea of asynchronous processing: instead of filling the cache with the objects required by the current request, you can fill it with pre-made objects.
  3. DB optimization. Basically, the more queries you throw at an RDBMS the slower it gets (well, after reaching a threshold set by the machine's capabilities and how well the db is tuned). There are actually many things you can do: 1) replicate the db in a master-slave scheme (the master handles writes, the slaves handle reads), 2) sharding - a table is split across multiple nodes, each node holding a distinct set of rows (see the second sketch after this list); this brings multiple problems: synchronization across nodes, network connections, handling failures, transactions spanning several nodes etc., 3) partitioning - dividing data logically into independent parts which can be processed on different machines/nodes (the problem arises when you need data from multiple partitions at the same time). It turns out that instead of going this route you can often just switch to a NoSQL-type database, loosen the transactional constraints and get better performance, 4) denormalizing - usually the most costly things are the JOINs, so restructuring data in such a way that you don't need them results in a speedup. The other option is doing the costly JOINs in the application layer (it is easier to scale application servers than the db!).
  4. Asynchronous processing. As mentioned under caching, you can change the way you look at processing data: instead of waiting for requests, you can prepare some data structures/objects upfront and just use them when needed. This means that even if they take a considerable amount of time to compute, that time is hidden from the user. Asynchronous might also mean giving the user feedback more often: instead of freezing the browser until the request is handled, you can show a progress bar, or just a notification that the request was added to a processing queue and that the user will be notified about its progress/completion.
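
A minimal sketch of the object-caching pattern from point 2. A plain in-process Map stands in for memcached/redis, and loadUserFromDb is an invented stand-in for the multiple queries that assemble the object:

    // Sketch only: Map instead of memcached/redis, loadUserFromDb is invented.
    interface UserProfile { id: string; name: string; friends: string[]; }

    const cache = new Map<string, string>(); // key -> serialized object

    async function loadUserFromDb(id: string): Promise<UserProfile> {
      // imagine several queries/JOINs assembling the object here
      return { id, name: "Jane", friends: ["bob", "eve"] };
    }

    async function getUserProfile(id: string): Promise<UserProfile> {
      const key = `user:${id}`;
      const cached = cache.get(key);
      if (cached !== undefined) {
        return JSON.parse(cached) as UserProfile; // cheap: deserialize, no db round-trips
      }
      const profile = await loadUserFromDb(id);   // expensive path, only taken on a miss
      cache.set(key, JSON.stringify(profile));    // caveat 2 from above: evict/refresh this on writes
      return profile;
    }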
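
And a similarly simplified sketch of the sharding idea from point 3: the node for a row is picked deterministically from its key (the host names and shard count are made up):

    // Sketch only: hosts are invented; a real setup also needs resharding, failover etc.
    const SHARDS = ["db-node-0", "db-node-1", "db-node-2"];

    // Small deterministic string hash (FNV-1a), enough for a routing illustration.
    function fnv1a(key: string): number {
      let h = 0x811c9dc5;
      for (let i = 0; i < key.length; i++) {
        h ^= key.charCodeAt(i);
        h = Math.imul(h, 0x01000193) >>> 0;
      }
      return h;
    }

    // All rows for a given user id live on (and are read from) the same node.
    function shardFor(userId: string): string {
      return SHARDS[fnv1a(userId) % SHARDS.length];
    }

The hard parts mentioned above (synchronization, failures, cross-shard transactions) are exactly what this toy routing function does not solve.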

It is very important to understand the difference between horizontal and vertical scaling. Vertical means increasing the capacity of the system by using more powerful hardware (e.g. better CPUs, more/faster RAM, bigger hard drives). Horizontal means adding more units of hardware of similar capacity, so instead of 2 servers with 2GHz CPUs and 4GB RAM, you're using 10 or 20. Horizontal scaling is widely considered the better way of solving capacity issues: it is way cheaper to add more commodity hardware and you can do it using existing technologies. Vertical scaling is limited by the capacity of the currently available top-shelf hardware.

In each of the sentences from the previous paragraph you can replace the hardware with virtual machines from cloud providers.

Things to consider when thinking about scalability:

  • failures - each part of the system may fail and you have to remember to think about those scenarios
  • read-heavy or write-heavy - depending on which category the system falls into, the design will be completely different (there are also some templates for dealing with the specific problems of each case)
  • security - how the system design affects security and where the weakest link of the whole chain is


Bootstrapping a new web app

Several weeks ago one of my friends asked whether I could help bootstrap a new start-up. It seemed like a completely crazy idea. I rolled up my sleeves and was about to jump into coding … and then I realised that there was no version control and no infra to build on. Just to make my life (and maybe someone else's) easier, this is my list of things to do when you're starting a new system completely from scratch.

  1. Set up gitlab: this is an absolute must-have. Unlimited private repos, unlimited code space. For free!
  2. Set up a free slack account - communicating in style is important ;)
  3. Bootstrap the web-app. yeoman is a great help! To set up a complete, working web app with OAuth2 login (integrated with Facebook, Google+, Github etc.) I chose the Angular full-stack generator: generator-angular-fullstack. There are many others that offer similar functionality (like MEAN) but ultimately this one turned out to be the easiest to tune and configure. It includes BrowserSync for automatic code reload, a template of an Angular (v1.x) web app and a template of an express.js app. Everything is wrapped up in a nice npm & gulp build and gets you running really quickly.
  4. Wrap the different parts of the app into docker containers. I have chosen to use one container for the front-end static files (just a plain nginx image) and one for the back-end express service. There should probably be another one for a DB - but it only needs to be added when you actually need a db of your own (and are not using something managed like Amazon Dynamo, etc.).
  5. An AWS EC2 account - an absolute must-have… or at least some kind of cloud infrastructure provider which can reliably host all the docker containers and integrates nicely with the docker toolkit.