braindump ... cause thread-dumps are not enough ;)

MLFlow evaluation notes

  • MLFlow is an open-source framework (a set of libs + a model server + a tracking server + a UI) which organizes your ML work/research into a more reproducible and reliable process
  • What problem does it solve?
    • making the research part of ML development more transparent and trackable
    • the artifacts of training and their properties are stored in a model store
    • it can organize your data processing into pipelines
  • It is distributed as a Python package and can be installed with pip install mlflow
  • What I like about it?
    • Quite lightweight
    • Nice UI with metric visualizations
    • This is a real improvement in ML research
    • The concept of a universal model format is really nice
  • Tracking data can be managed centrally:
    To log runs remotely, set the MLFLOW_TRACKING_URI environment variable to a
    tracking server’s URI or call mlflow.set_tracking_uri().
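
    For example, pointing the client at a remote server and logging a run (the server URL here is made up):

      import mlflow

      # Send runs to a remote tracking server instead of the local ./mlruns directory
      mlflow.set_tracking_uri("http://mlflow.example.com:5000")

      with mlflow.start_run():
          mlflow.log_param("lr", 0.01)
          mlflow.log_metric("accuracy", 0.93)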
    
  • If you register the model in the model registry, MLFlow has ways to serve it.
  • What is a model?
    • Each MLflow Model is a directory containing arbitrary files, together with an MLmodel file in the root of the directory that can define multiple flavors that the model can be viewed in. (https://mlflow.org/docs/latest/models.html#storage-format)
    • MLFlow introduces its own model format. After you are done with training (in any supported framework) you can simply save the model to storage in that format (or log it, if you are using the model registry). Using this format enables MLFlow to e.g. easily wrap the model with a REST API service (so you can easily deploy it somewhere).
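
    A minimal sketch of saving a scikit-learn model in that format (the model and path are illustrative):

      import mlflow.sklearn
      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression

      X, y = load_iris(return_X_y=True)
      model = LogisticRegression(max_iter=200).fit(X, y)

      # Writes a directory containing an MLmodel file plus the serialized model
      mlflow.sklearn.save_model(model, "iris_model")

    The saved directory can then be served as a REST API with mlflow models serve -m iris_model -p 5001.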
  • What is a “run”?
    • A run seems to be a logging session which starts when the process starts and ends when the process ends.
    • There can be multiple runs within the same process; you need to create them explicitly with the mlflow.start_run() context API or through MlflowClient:
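
    For example (the run names are arbitrary):

      import mlflow
      from mlflow.tracking import MlflowClient

      # Two separate runs in the same process, via the fluent context API
      with mlflow.start_run(run_name="baseline"):
          mlflow.log_metric("loss", 0.52)

      with mlflow.start_run(run_name="tuned"):
          mlflow.log_metric("loss", 0.31)

      # The same thing through the lower-level client API
      client = MlflowClient()
      run = client.create_run(experiment_id="0")  # "0" is the default experiment
      client.log_metric(run.info.run_id, "loss", 0.40)
      client.set_terminated(run.info.run_id)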
  • In MLFlow two things are tracked: runs and artifacts.
    • Runs can be stored in databases, files, etc.
    • Artifacts need a form of blob storage (S3, Azure Blob Storage, FTP/SFTP, HDFS etc)
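
    For example, run metadata can go to a local SQLite file while artifacts land in a bucket (the bucket name is made up):

      import mlflow

      # Run metadata (params, metrics, tags) goes to a database...
      mlflow.set_tracking_uri("sqlite:///mlflow.db")

      # ...while artifacts for this experiment go to blob storage
      mlflow.create_experiment("demo", artifact_location="s3://my-mlflow-artifacts/demo")

    A shared setup would typically start a server with mlflow server --backend-store-uri <db-uri> --default-artifact-root <blob-uri> instead.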
  • MLFlow has a “Model registry”

    The MLflow Model Registry component is a centralized model store, set of APIs, and UI, to collaboratively manage the full lifecycle of an MLflow Model. It provides model lineage (which MLflow experiment and run produced the model), model versioning, stage transitions (for example from staging to production), and annotations. (https://mlflow.org/docs/latest/model-registry.html)
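
    A minimal sketch of registering and promoting a model (the run id and model name are made up; this assumes a database-backed tracking server, which the registry requires):

      import mlflow
      from mlflow.tracking import MlflowClient

      # Register a model that was previously logged under a run
      result = mlflow.register_model("runs:/<run_id>/model", "IrisClassifier")

      # Promote that version to Production via the client API
      client = MlflowClient()
      client.transition_model_version_stage(
          name="IrisClassifier", version=result.version, stage="Production"
      )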

  • Tracking is the component that records runs: their parameters, metrics, tags, and artifacts
  • MLFlow introduces a concept of “projects”:
    • Those seem to be “pipelines” which combine multiple steps into a reproducible workflow
    • They seem quite simplistic compared to other solutions like DAGs in Airflow or Kubeflow
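
    A project is described by an MLproject file in the repository root; a minimal sketch (names and parameters are made up):

      name: example_project
      conda_env: conda.yaml
      entry_points:
        main:
          parameters:
            alpha: {type: float, default: 0.5}
          command: "python train.py --alpha {alpha}"

    Such a project can then be executed with mlflow run . -P alpha=0.1.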
  • MLFlow can be extended by writing plugins:
    The MLflow Python API supports several types of plugins:
    
    * Tracking Store: override tracking backend logic, e.g. to log to a
      third-party storage solution
    * ArtifactRepository: override artifact logging logic, e.g. to log to a
      third-party storage solution
    * Run context providers: specify context tags to be set on runs created via
      the mlflow.start_run() fluent API.
    * Model Registry Store: override model registry backend logic, e.g. to log to
      a third-party storage solution
    * MLFlow Project backend: override the local execution backend to execute a
      project on your own cluster (Databricks, kubernetes, etc.)
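
    Plugins are ordinary Python packages that register themselves through setuptools entry points; a sketch for a custom tracking store (the package and class names are made up):

      from setuptools import setup

      setup(
          name="my-mlflow-tracking-plugin",
          install_requires=["mlflow"],
          entry_points={
              # Tracking URIs with scheme "my-scheme" will resolve to this store class
              "mlflow.tracking_store": "my-scheme=my_plugin.store:MyTrackingStore",
          },
      )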
    
  • Logging/storing of data can be done in different modes (e.g. artifacts go to blob storage, while run metadata goes to the tracking backend)

While researching MLFlow's capabilities, I prepared some sample notebooks here.