braindump ... cause thread-dumps are not enough ;)

Uber's Michelangelo notes

  • Michelangelo is built on top of open source tools + integrations with Uber’s compute and data infra
  • Unified infrastructure for ML projects
  • Abstracts one particular version of an ML workflow
  • Both Cassandra (for streaming) and Hive (for batch) are mentioned as feature stores - this suggests that “regular” data warehouse/datalake assets are/can be used as features
  • Uber supports a concept of a “partitioned” model (basically the same model applied to a different subset of data, e.g. at a city or country level)
  • Architecture includes explicit feature and model stores
    • feature store seems like a subset of datalake/warehouse with a special purpose
    • model store is more tailored - this is a db with records of all models trained in Michelangelo (see excerpts for a list)
  • Supports offline training, models can be deployed to make batch or online predictions
  • After training completes, a set of metrics is computed and assembled into a report; all of that is stored in the model repository
    • popular/important model types have special visualizations implemented
  • There is a feature for automatic hyperparameter search
  • Michelangelo also creates a “feature report” which explains the influence of each feature on the model
  • Models can be deployed as:
    • Offline job (spark job) which generates batch predictions (the predictions are stored back to Hive - from there they are copied to target systems)
    • Online prediction service (there is a prediction service cluster - a set of machines behind a Load Balancer) - clients can send RPC calls to get a prediction for a sample/batch of samples
    • Library deployment - model is embedded into a library and can be invoked as Java API
  • Training is triggered through an API - people can start it manually or write scripts which schedule it automatically over time (this does not seem like the best idea)
  • Models deployed to prod are uniquely identified when serving predictions (seems like the online use-case); the client can choose which model it wants to use (this nicely integrates with the client experimentation API) - see the sketch after this list
  • Online predictions can be done at P95 of 5ms (without accessing Cassandra) or P95 of 10ms (with accessing Cassandra)
  • The whole system seems to have an API (you can control everything through an API)
    • this makes it very similar to kubeflow
  • While a model is being served, Michelangelo can log and store a percentage of predictions, calculate metrics on top of them, and use this information to measure/track the model’s accuracy over time (the numbers go to an internal tracking tool which allows e.g. configuring alerts).
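
The blog post does not show the actual RPC interface, so purely as an illustration, here is a minimal sketch of what a client call against such a prediction service could look like (the service URL, endpoint path, payload fields and the model ID are all made up):

    # Hypothetical client call to an online prediction service behind a load balancer.
    # The endpoint, payload shape and model identifier are invented for illustration -
    # the blog post only says clients send RPC calls and can pick a specific deployed
    # model by its unique ID.
    import requests

    PREDICTION_SERVICE_URL = "http://prediction-service.internal/api/v1/predict"  # made up

    def predict(model_id: str, samples: list) -> list:
        """Request predictions for a batch of samples from one specific deployed model."""
        payload = {
            "model_id": model_id,   # uniquely identifies the deployed model version
            "samples": samples,     # feature dicts; online features (Cassandra) would be joined server-side
        }
        # Tight timeout, since online serving targets P95 latencies of ~5-10ms
        response = requests.post(PREDICTION_SERVICE_URL, json=payload, timeout=0.05)
        response.raise_for_status()
        return response.json()["predictions"]

    # Example: ask a (hypothetical) per-city partitioned model for two predictions
    eta = predict("eta_model_city_42_v7", [{"trip_distance_km": 3.2}, {"trip_distance_km": 11.5}])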

Key excerpts

For every model that is trained in Michelangelo, we store a versioned object in our model repository in Cassandra that contains a record of:

* Who trained the model
* Start and end time of the training job
* Full model configuration (features used, hyper-parameter values, etc.)
* Reference to training and test data sets
* Distribution and relative importance of each feature
* Model accuracy metrics
* Standard charts and graphs for each model type (e.g. ROC curve, PR curve, and confusion matrix for a binary classifier)
* Full learned parameters of the model
* Summary statistics for model visualization
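
Purely to make the shape of that record concrete, here is a sketch of how such a versioned model record could be modelled (the field names are my own - the actual Cassandra schema is not published in the post):

    # Illustrative only: one possible shape for the versioned record Michelangelo
    # keeps per trained model. Field names are invented.
    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class ModelRecord:
        model_id: str                                # unique, versioned identifier
        trained_by: str                              # who trained the model
        training_started_at: datetime                # start time of the training job
        training_finished_at: datetime               # end time of the training job
        config: dict                                 # features used, hyper-parameter values, etc.
        training_dataset_ref: str                    # reference to the training data set
        test_dataset_ref: str                        # reference to the test data set
        feature_importance: dict                     # distribution / relative importance per feature
        accuracy_metrics: dict                       # e.g. AUC, RMSE
        charts: dict = field(default_factory=dict)   # ROC/PR curves, confusion matrix, ...
        learned_parameters_ref: str = ""             # pointer to the full learned parameters
        summary_statistics: dict = field(default_factory=dict)  # for model visualization
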
Michelangelo consists of a mix of open source systems and components built in-house. The primary open sourced components used are HDFS, Spark, Samza, Cassandra, MLLib, XGBoost, and TensorFlow. We generally prefer to use mature open source options where possible, and will fork, customize, and contribute back as needed, though we sometimes build systems ourselves when open source solutions are not ideal for our use case.
Michelangelo is built on top of Uber’s data and compute infrastructure, providing a data lake that stores all of Uber’s transactional and logged data, Kafka brokers that aggregate logged messages from all Uber’s services, a Samza streaming compute engine, managed Cassandra clusters, and Uber’s in-house service provisioning and deployment tools.
The same general workflow exists across almost all machine learning use cases at Uber regardless of the challenge at hand, including classification and regression, as well as time series forecasting. The workflow is generally implementation-agnostic, so easily expanded to support new algorithm types and frameworks, such as newer deep learning frameworks. It also applies across different deployment modes such as both online and offline (and in-car and in-phone) prediction use cases.
We designed Michelangelo specifically to provide scalable, reliable, reproducible, easy-to-use, and automated tools to address the following six-step workflow:

1. Manage data
2. Train models
3. Evaluate models
4. Deploy models
5. Make predictions
6. Monitor predictions
A platform should provide standard tools for building data pipelines to generate feature and label data sets for training (and re-training) and feature-only data sets for predicting. These tools should have deep integration with the company’s data lake or warehouses and with the company’s online data serving systems.

Michelangelo seems to use a similar model for organizing workflow execution as kubeflow:

Training jobs can be configured and managed through a web UI or an API, often via Jupyter notebook. Many teams use the API and workflow tools to schedule regular re-training of their models.
When a model is trained and evaluated, historical data is always used. To make sure that a model is working well into the future, it is critical to monitor its predictions so as to ensure that the data pipelines are continuing to send accurate data and that production environment has not changed such that the model is no longer accurate.

To address this, Michelangelo can automatically log and optionally hold back a percentage of the predictions that it makes and then later join those predictions to the observed outcomes (or labels) generated by the data pipeline. With this information, we can generate ongoing, live measurements of model accuracy. In the case of a regression model, we publish R-squared/coefficient of determination, root mean square logarithmic error (RMSLE), root mean square error (RMSE), and mean absolute error metrics to Uber’s time series monitoring systems so that users can analyze charts over time and set threshold alerts, as depicted below:
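
A rough sketch of that join-and-score step (the data frames and column names are made up - this is not Michelangelo code), e.g. with pandas/scikit-learn:

    # Join logged predictions to observed outcomes and compute the regression
    # metrics mentioned above; the resulting numbers would then be pushed to a
    # time series monitoring system so charts and threshold alerts can be set up.
    import numpy as np
    import pandas as pd
    from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                                 mean_squared_log_error, r2_score)

    def live_accuracy(predictions: pd.DataFrame, outcomes: pd.DataFrame) -> dict:
        """predictions: [request_id, predicted]; outcomes: [request_id, actual]."""
        joined = predictions.merge(outcomes, on="request_id", how="inner")
        y_true, y_pred = joined["actual"], joined["predicted"]
        return {
            "r2": r2_score(y_true, y_pred),
            "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
            "rmsle": float(np.sqrt(mean_squared_log_error(y_true, y_pred))),  # assumes non-negative values
            "mae": mean_absolute_error(y_true, y_pred),
        }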

kubeflow evaluation notes

  • platform for machine learning which is supposed to streamline the whole process of data prep, research, coding a model and then hosting it as an API
  • youtube playlist with Kubeflow 101
  • it focuses mostly on standardizing the whole flow of work, joining together different tools, e.g. JupyterLab and KFServing
  • kubeflow works by “extending” kubernetes into a stack/environment for ML
    • it enhances it with CRDs which fall into the ML domain (like Jupyter notebooks, or ML processing pipelines)
    • Operators make sure Kubernetes executes what is defined in CRDs
    • It uses Kubernetes more like a framework to implement something
    • The issue seems to be that the “abstraction” level isn’t fully adjusted to the use-cases (e.g. for scheduling I find Airflow a more mature/featureful tool)
  • runs everywhere kubernetes runs
  • there is a nice web app - the dashboard - which allows managing the resources
  • provides managed Jupyter notebooks (you can start them from the central dashboard)
    • they run inside the Kubernetes cluster
  • resources can be split between teams in form of namespaces
  • there is a pipeline component, where you plug in different steps of the model processing pipeline (data collection, training, serving etc) as Kubernetes resources (I guess CRDs)
  • you can easily serve your model by using one of the available serving platforms (KFServing or Seldon Core)
  • easiest way to do a test install - through microk8s:
      $ sudo snap install microk8s --classic
      $ microk8s status --wait-ready
      $ microk8s enable kubeflow
    
  • provides an abstraction for “pipelines” which are essentially DAGs of pods (see the sketch below this list):
    • you specify the docker image
    • you specify the input (parameters, data etc)
    • you specify what is the output (file, model etc)
      • output of one of the steps can become input of others
    • you define how the steps are executed in python in a notebook
    • you trigger a run from the notebook (this makes it also visible in the kubeflow dashboard)
    • once deployed to k8s, pipelines are visible
    • video which shows an example of a pipeline
  • has a built in metadata store
    • can be interacted with through a Python SDK
    • link to docs
    • kubeflow tools write a bunch of info in there (e.g. pipelines will write info about each run execution)
    • contains an artifact store which can be used e.g. to host models (“model store”)
  • KFserving
    • takes a definition + an ML model and turns it into an API (see the request sketch below this list)
    • supports GPUs, autoscaling, canary roll-outs
    • built on top of Istio and Knative
    • you specify the model (effectively the api config) as a k8s CRD
  • kubeflow pipelines disadvantages:
    • you need to create a docker image for every step (or have a template one)
    • the pipeline needs to be compiled from python code
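
To make the pipeline abstraction above more concrete, here is a rough sketch using the kfp v1 SDK - the container images, commands and output paths are made up and this is not a tested pipeline:

    # Rough sketch of a two-step Kubeflow pipeline using the kfp v1 SDK.
    # Container images, commands and file paths are invented for illustration.
    from kfp import compiler, dsl

    @dsl.pipeline(name="train-and-evaluate", description="Toy pipeline sketch")
    def train_and_evaluate(dataset_path: str = "gs://my-bucket/data.csv"):
        # Step 1: training - a container that reads the dataset and writes a model file
        train = dsl.ContainerOp(
            name="train",
            image="registry.example.com/train:latest",     # hypothetical image
            command=["python", "train.py"],
            arguments=["--data", dataset_path, "--out", "/out/model.bin"],
            file_outputs={"model": "/out/model.bin"},       # becomes this step's output
        )

        # Step 2: evaluation - consumes the output of the training step,
        # which is how the DAG edge between the two pods is defined
        dsl.ContainerOp(
            name="evaluate",
            image="registry.example.com/evaluate:latest",   # hypothetical image
            command=["python", "evaluate.py"],
            arguments=["--model", train.outputs["model"]],
        )

    if __name__ == "__main__":
        # Compiling produces an archive that can be uploaded/run via the dashboard,
        # or triggered directly from a notebook with kfp.Client().
        compiler.Compiler().compile(train_and_evaluate, "train_and_evaluate.yaml")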
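
And once a model is exposed through KFServing, calling it boils down to a plain HTTP request against its v1 REST protocol; a small sketch (host, model name and feature values are made up):

    # Sketch of calling a model served by KFServing over its v1 REST protocol.
    # Host, model name and feature values are made up.
    import requests

    MODEL_NAME = "sklearn-iris"   # hypothetical InferenceService name
    URL = f"http://models.example.com/v1/models/{MODEL_NAME}:predict"

    payload = {"instances": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]}
    resp = requests.post(URL, json=payload, timeout=5)
    resp.raise_for_status()
    print(resp.json()["predictions"])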

FB Learner Flow notes

  • FB’s ML infrastructure project
  • Not mentioned explicitly, but probably very well integrated with the other parts of the FB stack
  • There are 3 key concepts:
    • Workflows - pipelines defined in python; they are DAGs of operators
    • Operators - individual building blocks (similar to Operators in Airflow)
    • Channels - represent inputs and outputs (they mark what flows between operators); use a type system to avoid errors when connecting operators
  • Platform has 3 main components:
    • Tool to author the workflows
    • Tool to execute workflows (launch experiments) with different inputs and view their results
    • A catalog of ready to use workflows/pipelines
  • Platform seems to support only offline models (but hey, the description is from 2016, so by now it probably also supports online learning)
  • The UI has the following functions:
    • Can launch a workflow
      • You can specify dates of data (partitions), features (those come from an internal system), which workflow is being executed, name/desc/tags and other metadata.
    • Visualize output
      • Type system is used to infer how each output can be visualized
      • Output reports can be enhanced with custom plugins (e.g. if a model/algorithm has a special visualization, a plugin which supports it can be added)
    • Manage experiments
      • This is basically running a few versions of the same workflow, but with different hyperparameters of training
      • experiment reports are centrally stored and easy to query (indexed with elasticsearch)
  • Core properties:

    * Each machine learning algorithm should be implemented once in a reusable manner.
    * Engineers should be able to write a training pipeline that parallelizes over many machines and can be reused by many engineers.
    * Training a model should be easy for engineers of varying ML experience, and nearly every step should be fully automated.
    * Everybody should be able to easily search past experiments, view results, share with others, and start new variants of a given experiment.
    
  • Conceptually this is a set of easy-to-reuse workflows, which can be plugged into various sets of features
  • Workflow definitions are Python-based (similarly to Airflow).
  • Core insight: model kind isn’t that important, it is more important to correctly choose the set of features
    • This means it is far more important to make it easy to switch between features

Key excerpts from blog

Explanation of how the execution works:

Rather than execute serially, workflows are run in two stages: 1) the DAG compilation stage and 2) the
operator execution stage. In the DAG compilation stage, operators within the workflow are not actually
executed and instead return futures. Futures are objects that represent delayed pieces of computation.
So in the above example, the dt variable is actually a future representing decision tree training that has
not yet occurred. FBLearner Flow maintains a record of all invocations of operators within the DAG
compilation stage as well as a list of futures that must be resolved before it operates. For example, both
ComputeMetricsOperator and PredictOperator take dt.model as an input, thus the system knows that dt
must be computed before these operators can run, and so they must wait until the completion of
TrainDecisionTreeOperator.
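
A toy sketch of that two-stage idea (simplified illustration, not FBLearner Flow code): calling an operator only records the invocation and returns a future; a separate execution stage resolves futures in dependency order:

    # Toy illustration of the two-stage execution described above. The real system
    # additionally lets you reference fields of a future (e.g. dt.model), which
    # creates DAG edges in the same way.
    class Future:
        def __init__(self, op, inputs):
            self.op, self.inputs, self.value = op, inputs, None

    RECORDED = []  # invocations recorded during the "DAG compilation" stage

    def operator(fn):
        def wrapper(*args, **kwargs):
            fut = Future(fn, (args, kwargs))
            RECORDED.append(fut)   # record the invocation; nothing is executed yet
            return fut
        return wrapper

    def resolve(fut):
        """Execute a future, resolving its (possibly future-valued) inputs first."""
        if fut.value is None:
            args, kwargs = fut.inputs
            args = [resolve(a) if isinstance(a, Future) else a for a in args]
            kwargs = {k: resolve(v) if isinstance(v, Future) else v for k, v in kwargs.items()}
            fut.value = fut.op(*args, **kwargs)
        return fut.value

    @operator
    def train(dataset):
        return f"model({dataset})"

    @operator
    def predict(model, dataset):
        return f"predictions({model}, {dataset})"

    # "DAG compilation" stage: builds the graph, runs nothing
    m = train("labeled_data")
    p = predict(m, "unlabeled_data")

    # "Operator execution" stage: predict depends on train, so train runs first
    print(resolve(p))   # -> predictions(model(labeled_data), unlabeled_data)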

Philosophy of the system:

In some of our earliest work to leverage AI and ML — such as delivering the most relevant content to each
person — we noticed that the largest improvements in accuracy often came from quick experiments,
feature engineering, and model tuning rather than applying fundamentally different algorithms. An
engineer may need to attempt hundreds of experiments before finding a successful new feature or set of
hyperparameters. Traditional pipeline systems that we evaluated did not appear to be a good fit for our
uses — most solutions did not provide a way to rerun pipelines with different inputs, mechanisms to
explicitly capture outputs and/or side effects, visualization of outputs, and conditional steps for tasks like
parameter sweeps.

Sample workflow

# The typed schema of the Hive table containing the input data
feature_columns = (
    ('petal_width', types.INTEGER),
    ('petal_height', types.INTEGER),
    ('sepal_width', types.INTEGER),
    ('sepal_height', types.INTEGER),
)
label_column = ('species', types.TEXT)
all_columns = feature_columns + (label_column,)

# This decorator annotates that the following function is a workflow within
# FBLearner Flow
@workflow(
    # Workflows have typed inputs and outputs declared using the FBLearner type
    # system
    input_schema=types.Schema(
        labeled_data=types.DATASET(schema=all_columns),
        unlabeled_data=types.DATASET(schema=feature_columns),
    ),
    returns=types.Schema(
        model=types.MODEL,
        logloss=types.DOUBLE,
        predictions=types.DATASET(schema=all_columns),
    ),
)
def iris(labeled_data, unlabeled_data):
    # Divide the dataset into separate training and evaluation dataset by random
    # sampling.
    split = SplitDatasetOperator(labeled_data, train=0.8, evaluation=0.2)

    # Train a decision tree with the default settings then evaluate it on the
    # labeled evaluation dataset.
    dt = TrainDecisionTreeOperator(
        dataset=split.train,
        features=[name for name, type in feature_columns],
        label=label_column[0],
    )
    metrics = ComputeMetricsOperator(
        dataset=split.evaluation,
        model=dt.model,
        label=label_column[0],
        metrics=[Metrics.LOGLOSS],
    )

    # Perform predictions on the unlabeled dataset and produce a new dataset
    predictions = PredictOperator(
        dataset=unlabeled_data,
        model=dt.model,
        output_column=label_column[0],
    )

    # Return the outputs of the workflow from the individual operators
    return Output(
        model=dt.model,
        logloss=metrics.logloss,
        predictions=predictions,
    )

A few notes about Objectives and Key Results

My notes from reading John Doerr’s great “Measure What Matters” book (highly recommended, btw).

  • OKR = Objective and Key Results
  • Methodology of setting goals and tracking them over time
  • “[…] OKRs are not a silver bullet. They cannot substitute for sound judgement, strong leadership, or a creative workplace culture. But if those fundamentals are in place, OKRs can guide you to the mountaintop.”
  • OKRs are: “A management methodology that helps to ensure that the company focuses efforts on the same important issues throughout the organization”.
  • Objective:
    • “[…] WHAT is to be achieved, no more no less.”
    • “By definition, objectives are significant, concrete, action oriented and (ideally) inspirational”.
    • “When properly designed and deployed, they’re a vaccine against fuzzy thinking - and fuzzy execution”
    • Can be long-lived (e.g. rolled over for a year or longer)
    • OKRs should surface primary goals and make it possible to say no to things which do not contribute towards reaching the goals.
    • must be significant (to stay motivated to achieve them)
    • Andy Grove “The art of management lies in the capacity to select from the many activities of seemingly comparable significance the one or two or three that provide leverage well beyond the others and concentrate on them”
  • Key results:
    • “[…] benchmark and monitor HOW we get to the objective”
    • “Effective KRs are specific and time-bound, aggressive yet realistic.”
    • “they are measurable and verifiable” - “It’s not a key result unless it has a number”
    • There is no judgement/gray area - you either meet it or not.
    • they evolve as the work progresses
    • Ideal number: 3 - 5
  • Connection between objectives and key results lies in the “as measured by”: “We will achieve a certain OBJECTIVE as measured by the following KEY RESULTS.”
  • Roughly half of the OKRs should be set by the specific team itself (in consultation with managers).
  • OKRs are not set in stone - if situation changes, they should be re-assessed and changed.
  • OKRs should be made public to everyone in the company (this creates accountability and makes it easier to say “no” to things)
  • OKR superpowers:
    • focus and commitment to priorities - OKRs make it clear what people need to work on; key results are clearly measurable, so it should be quite easy to understand what needs to be done
    • alignment and connecting the team - OKRs should be publicly visible and people should be able to view one another’s OKRs, which should enable collaboration; this also provides context for why one team does not want to focus on something you ask them to do - because they have clear OKRs which point in a different direction
    • tracking for accountability - OKR progress can be tracked (because key results are measurable); daily progress measurements are not required, but a weekly check-in is beneficial to avoid spending time on stuff that does not matter
      • you can do 4 things when you review an OKR:
        • continue (green - all is ok)
        • update (yellow - needs attention) - think about what needs to be done to get on track
        • start - launch a new OKR mid-cycle if necessary
        • stop - kill an OKR if it doesn’t make sense anymore
      • after an OKR finishes, a post-mortem needs to be done on it
        • if it was killed - why? was there anything that would have indicated issues with that OKR early enough to save that time?
        • if it was successful - it should be scored
          • 0.7-1.0 - green (delivered)
          • 0.4-0.6 - yellow (progress made but not delivered)
          • 0.0-0.3 - red (failed to make real progress)
      • the goals should be set in a way that it is probably impossible to hit all of them (if someone does, it means the bar was set too low so they should plan for much more)
      • after the cycle ends, people who set the OKRs should reflect on them, and see if there are some learnings from misses/hits
    • stretch for amazing
      • the OKRs become a “north star”
      • even if the goal seems completely unrealistic, stretching for achieving it can make a huge difference
      • Google aims to meet stretch goals at 70% (so going far in the right direction is good enough)
  • At Intel, OKRs were disconnected (largely but not fully) from performance reviews and decisions about promotions.

The book provides a load of examples of how introducing OKRs worked out for a lot of different organizations (bigger and smaller). It also describes practices and processes which ideally should be implemented alongside OKRs - e.g. CFR (Conversation, Feedback, Recognition) - an alternative to performance reviews.

Kubernetes testing environments

Testing deployment pipelines, service setups etc. in Kubernetes is quite a complex task. The first instinct for playing around with the configuration in a safe way is to set up a new cluster in your cloud-vendor-of-choice environment. However, that usually turns out to be a pricey (and actually quite sluggish) solution. On second thought, you will probably realize that testing locally might be a better option (it’s cheaper, faster, probably nicely isolated etc). Below are a few of my notes on tools which should enable testing Kubernetes stuff locally.

  • k3s
    • this isn’t a test environment per se, it is just a pretty compact Kubernetes distribution which is supposed to work even on smaller devices
    • installation
      • you can choose to use the script: curl -sfL https://get.k3s.io | sh - however this is the option if you want to run k8s as a service on the node where you run the command
        • the script is dedicated for systemd / openrc systems
      • you can also just download the binary (from here) and run it from the command line
    • batteries included: has traefik, a helm controller, flannel etc. embedded
  • k3d
  • kind
    • kind is a tool for running local Kubernetes clusters using Docker container “nodes”. kind was primarily designed for testing Kubernetes itself, but may be used for local development or CI.
    • has interesting features which allow it to be easily used in k8s development (e.g. you can pass your own k8s image when creating the cluster)
  • minikube
    • sets up a cluster in Docker (or in a VM)
    • works on Windows/Mac/Linux (because the cluster can live within the VM)
    • adding typically included pieces of setup is done through add-ons
    • initially seemed to be a bit heavier than kind, but not really - the Docker-based version is quite snappy
    • great tutorial
    • supports multi-node setups: minikube start --nodes 2
  • microk8s
    • made by canonical (yup, the ubuntu folks)
      MicroK8s is the smallest, fastest, fully-conformant Kubernetes that tracks upstream releases and makes clustering trivial. MicroK8s is great for offline development, prototyping, and testing. Use it on a VM as a small, cheap, reliable k8s for CI/CD. It’s also the best production grade Kubernetes for appliances. Develop IoT apps for k8s and deploy them to MicroK8s on your Linux boxes.
      
    • seems to be using a different daemon as the controller for containers (snapd instead of containerd) - but I can’t find a good confirmation for that information (apart from the presentation below)
    • https://microk8s.io/docs
    • seems to position itself similarly to k3s, as a production-ready platform and a potential dev env at the same time
    • more closely tracks the main kubernetes development
    • nice write-up about minikube vs microk8s: blog link
    • seems to be limited to Ubuntu (and those few platforms which support snap format)

Finally, as a summary: