braindump ... cause thread-dumps are not enough ;)

Model stores in ML platforms

What is a model store/registry?

A component of an ML platform which is responsible for storing models and related metadata. That metadata can be anything from performance metrics, through source code, to related datasets. The exact content depends on the use-cases implemented by the model store. Introducing a model store has the following advantages:

  • it makes the model development process more structured and systematic: it is possible to easily compare different versions of the same model (e.g. with different training parameters)
  • the deployment of the model to production can be simplified: one can prepare common infrastructure with well-defined requirements towards the model format; those requirements can be fulfilled and enforced centrally in the model store
  • it makes it possible to browse through existing models, which should encourage knowledge sharing

Examples of model stores

Michelangelo

Uber’s Michelangelo platform introduced an explicit model store component:

For every model that is trained in Michelangelo, we store a versioned object in our model repository in Cassandra that contains a record of:

* Who trained the model
* Start and end time of the training job
* Full model configuration (features used, hyper-parameter values, etc.)
* Reference to training and test data sets
* Distribution and relative importance of each feature
* Model accuracy metrics
* Standard charts and graphs for each model type (e.g. ROC curve, PR curve, and confusion matrix for a binary classifier)
* Full learned parameters of the model
* Summary statistics for model visualization
The information is easily available to the user through a web UI and programmatically through an API, both for inspecting the details of an individual model and for comparing one or more models with each other.

source

MLFlow

  • The Model Registry is one of the central concepts in MLFlow
  • Data gets into MLFlow by embedding calls to the MLFlow SDK in your model code
  • MLFlow introduces its own model format. After you are done with training (in any supported framework) you can simply save (or log - if you are using the model registry) the model to storage in that format. Using it enables MLFlow to e.g. easily wrap the model with a REST API service, so you can easily deploy it somewhere (see the sketch below).
  • MLFlow provides infra for easy model serving (although compared to KFServing it looks a bit like a toy)

source
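
A minimal sketch of what saving/logging a model in the MLFlow format can look like (the sklearn flavor, parameter names and registry name below are made up for illustration - MLFlow supports many frameworks):

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=200).fit(X, y)

    with mlflow.start_run():
        mlflow.log_param("max_iter", 200)
        mlflow.log_metric("train_accuracy", model.score(X, y))
        # save the model in the MLFlow model format and (optionally)
        # register it in the Model Registry under a made-up name
        mlflow.sklearn.log_model(model, artifact_path="model",
                                 registered_model_name="iris-classifier")

The saved directory contains an MLmodel file describing the model "flavors", which is what later allows MLFlow to wrap the model with a REST API service (e.g. via mlflow models serve).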

Kubeflow

In Kubeflow there is no explicit model registry, although its features seem to be implemented to some degree in KFServing. KFServing introduces a description of a model deployment (a Custom Resource in k8s) which ties together:

  • framework which is used for inference (e.g. pytorch) - this is done through different resource specs
  • parameters (e.g. weights of the model) - just a reference to a resource like a file blob in S3
  • resource requirements (CPU, mem)
  • serving configuration (e.g. the k8s service spec wrapped in a relevant kubeflow resource)

This introduces a common description of models, which is kind of a model registry.
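
A rough sketch of creating such a deployment description with the generic Kubernetes Python client (the apiVersion/group and the exact spec fields depend on the KFServing version you run; the model location, namespace and resource values are made up):

    from kubernetes import client, config

    config.load_kube_config()

    # InferenceService custom resource tying together the inference
    # framework, the location of the weights and resource requirements
    # (field names follow the v1alpha2 API - double-check them against
    # your KFServing version)
    inference_service = {
        "apiVersion": "serving.kubeflow.org/v1alpha2",
        "kind": "InferenceService",
        "metadata": {"name": "my-pytorch-model", "namespace": "models"},
        "spec": {
            "default": {
                "predictor": {
                    "pytorch": {
                        "storageUri": "s3://my-bucket/models/my-pytorch-model",
                        "resources": {"requests": {"cpu": "1", "memory": "2Gi"}},
                    }
                }
            }
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="serving.kubeflow.org",
        version="v1alpha2",
        namespace="models",
        plural="inferenceservices",
        body=inference_service,
    )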

FB Learner Flow

  • An explicit model registry seems not to be present
  • … however its role is implemented by two other components:
    • the repository of reusable training pipelines (they have a notion of a model embedded; that model just needs to be trained)
    • the experiment/run repository (which allows comparing runs with each other)

If an environment doesn’t provide an explicit model store, one could be implemented using version control tools (see the sketch after this list), e.g.

  • keepsake
    • versioning for ML artifacts (code, hyperparameters, weights etc)
    • a little bit similar to MLFLow: installed through embedding in code
  • dvc
    • “git” for ML
    • similar to keepsake
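
For example, a model tracked with dvc in a Git repository could be loaded back programmatically like this (repository URL, path and tag are made up; a sketch, not a full setup):

    import pickle

    import dvc.api

    # open a DVC-tracked model artifact at a specific Git tag
    with dvc.api.open(
        "models/model.pkl",
        repo="https://github.com/my-org/my-ml-repo",
        rev="v1.2.0",
        mode="rb",
    ) as f:
        model = pickle.load(f)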

Feature stores from Data Engineering perspective

What is a feature store?

Introduction to feature stores
Feature stores are systems that help to address some of the key challenges that ML teams face when productionizing features

* Feature sharing and reuse: Engineering features is one of the most time consuming activities in building an end-to-end ML system, yet many teams continue to develop features in silos. This leads to a high amount of re-development and duplication of work across teams and projects.

* Serving features at scale: Models need data that can come from a variety of sources, including event streams, data lakes, warehouses, or notebooks. ML teams need to be able to store and serve all these data sources to their models in a performant and reliable way. The challenge is scalably producing massive datasets of features for model training, and providing access to real-time feature data at low latency and high throughput in serving.

* Consistency between training and serving: The separation between data scientists and engineering teams often lead to the re-development of feature transformations when moving from training to online serving. Inconsistencies that arise due to discrepancies between training and serving implementations frequently leads to a drop in model performance in production.

* Point-in-time correctness: General purpose data systems are not built with ML use cases in mind and by extension don’t provide point-in-time correct lookups of feature data. Without a point-in-time correct view of data, models are trained on datasets that are not representative of what is found in production, leading to a drop in accuracy.

* Data quality and validation: Features are business critical inputs to ML systems. Teams need to be confident in the quality of data that is served in production and need to be able to react when there is any drift in the underlying data.

Source
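
The point-in-time correctness problem above is essentially an as-of join: for every training example you want the latest feature value known at or before the label's timestamp, never a later one. A minimal sketch with pandas (column names and values are made up):

    import pandas as pd

    # label events (e.g. "was the delivery late?") with their timestamps
    labels = pd.DataFrame({
        "restaurant_id": [1, 1, 2],
        "event_time": pd.to_datetime(["2021-01-05", "2021-01-20", "2021-01-10"]),
        "late_delivery": [0, 1, 0],
    })

    # feature values computed at different points in time
    features = pd.DataFrame({
        "restaurant_id": [1, 1, 2],
        "feature_time": pd.to_datetime(["2021-01-01", "2021-01-15", "2021-01-08"]),
        "avg_prep_time_7d": [12.0, 15.0, 9.5],
    })

    # point-in-time correct join: take the most recent feature value
    # at or before each label's timestamp
    training_set = pd.merge_asof(
        labels.sort_values("event_time"),
        features.sort_values("feature_time"),
        left_on="event_time",
        right_on="feature_time",
        by="restaurant_id",
        direction="backward",
    )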

In my opinion most of those reasons boil down to three key things:

  • feature data needs to be stored in a well-specified, consistent format
  • feature data quality needs to stay high
  • feature metadata needs to make it clear and easy to find relevant features, check their provenance and understand whether they fit a given context

With that in mind I would argue that the existence of feature stores is caused by shortcomings and inconsistencies of the data infrastructures deployed in different companies. Effectively, introducing a feature store just duplicates data (which could/should be stored back in the existing datalake/data warehouse) and introduces complexity (e.g. by adding another type of database to your infrastructure). On the other hand, it is perfectly reasonable to use a dedicated system to solve a very specific use-case (especially if the technical requirements are very different, e.g. latency needs to be low compared to e.g. HDFS used in a traditional Hive warehouse).

Ideally, the data warehouse (and/or datalake) can be used as a feature store (this means your data is clean enough and has good enough metadata to trust that it can be fed to an ML model). If there are performance issues related to the technology choices made in the past, I would look into creating a general solution (e.g. introduce an additional layer to the datalake which caches the data in another type of database, but hides the complexity of managing it behind a nice interface).

Examples of feature stores

Michelangelo - explicit feature store in the architecture

Offline
Uber’s transactional and log data flows into an HDFS data lake and is easily accessible via Spark and Hive SQL compute jobs. We provide containers and scheduling to run regular jobs to compute features which can be made private to a project or published to the Feature Store (see below) and shared across teams, while batch jobs run on a schedule or a trigger and are integrated with data quality monitoring tools to quickly detect regressions in the pipeline–either due to local or upstream code or data issues.

Online
Models that are deployed online cannot access data stored in HDFS, and it is often difficult to compute some features in a performant manner directly from the online databases that back Uber’s production services (for instance, it is not possible to directly query the UberEATS order service to compute the average meal prep time for a restaurant over a specific period of time). Instead, we allow features needed for online models to be precomputed and stored in Cassandra where they can be read at low latency at prediction time.

We support two options for computing these online-served features, batch precompute and near-real-time compute, outlined below:

Batch precompute. The first option for computing is to conduct bulk precomputing and loading historical features from HDFS into Cassandra on a regular basis. This is simple and efficient, and generally works well for historical features where it is acceptable for the features to only be updated every few hours or once a day. This system guarantees that the same data and batch pipeline is used for both training and serving. UberEATS uses this system for features like a ‘restaurant’s average meal preparation time over the last seven days.’
Near-real-time compute. The second option is to publish relevant metrics to Kafka and then run Samza-based streaming compute jobs to generate aggregate features at low latency. These features are then written directly to Cassandra for serving and logged back to HDFS for future training jobs. Like the batch system, near-real-time compute ensures that the same data is used for training and serving. To avoid a cold start, we provide a tool to “backfill” this data and generate training data by running a batch job against historical logs. UberEATS uses this near-realtime pipeline for features like a ‘restaurant’s average meal preparation time over the last one hour.’

Shared feature store
We found great value in building a centralized Feature Store in which teams around Uber can create and manage canonical features to be used by their teams and shared with others. At a high level, it accomplishes two things:

It allows users to easily add features they have built into a shared feature store, requiring only a small amount of extra metadata (owner, description, SLA, etc.) on top of what would be required for a feature generated for private, project-specific usage.
Once features are in the Feature Store, they are very easy to consume, both online and offline, by referencing a feature’s simple canonical name in the model configuration. Equipped with this information, the system handles joining in the correct HDFS data sets for model training or batch prediction and fetching the right value from Cassandra for online predictions.
At the moment, we have approximately 10,000 features in Feature Store that are used to accelerate machine learning projects, and teams across the company are adding new ones all the time. Features in the Feature Store are automatically calculated and updated daily.

In the future, we intend to explore the possibility of building an automated system to search through Feature Store and identify the most useful and important features for solving a given prediction problem.

Feast, the feature store for Kubeflow

  • Deployed directly in the k8s cluster
  • After deployment, users can use the API right away
  • Also uses a concept of an Entity (this basically defines a primary key in a “table”)
  • Then you define some features (basically names + data types) …
  • … and a data source (e.g. the format, physical location, timestamp columns etc)…
  • … and you tie all those things together in a FeatureTable object
  • Finally you call the “ingest()” function
  • All of the above steps are done in the Python SDK (see the demo notebook and the sketch after this list)
  • Data can be retrieved in large offline batches (get_historical_features)
  • there is an online access mode (get_online_features) for low-latency access
    • but this requires triggering loading from offline -> online storage
  • source
  • sample notebook with API demo
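
A rough sketch of those steps with the Feast Python SDK (based on the FeatureTable-style API from the 0.x releases described above - class names and arguments differ between Feast versions, and the endpoint, paths and feature names are made up):

    import pandas as pd
    from feast import Client, Entity, Feature, FeatureTable, ValueType
    from feast.data_format import ParquetFormat
    from feast.data_source import FileSource

    client = Client(core_url="feast-core.feast.svc:6565")

    # the Entity basically defines the primary key of the "table"
    restaurant = Entity(name="restaurant_id", value_type=ValueType.INT64,
                        description="Restaurant identifier")

    # the data source: format, physical location and timestamp column
    batch_source = FileSource(
        file_format=ParquetFormat(),
        file_url="s3://my-bucket/features/restaurant_stats",
        event_timestamp_column="event_timestamp",
    )

    # tie the entity, the features and the source together
    restaurant_stats = FeatureTable(
        name="restaurant_stats",
        entities=["restaurant_id"],
        features=[Feature("avg_prep_time_7d", ValueType.DOUBLE)],
        batch_source=batch_source,
    )

    client.apply(restaurant)
    client.apply(restaurant_stats)

    # finally, load a dataframe into the store
    df = pd.read_parquet("restaurant_stats.parquet")
    client.ingest(restaurant_stats, df)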

FB Learner Flow

  • Does not seem to have a distinct feature store

Tecton

  • paid tool
  • seems to be a bit broader than just a “feature store” - it also provides serving and transformations for input data
  • provides online and offline serving (batch and API access)
  • interactions with the system are implemented in 3 ways:
    • cli - feature definition changes
    • web ui - monitoring of the environment
    • python sdk - allows accessing the feature store from Databricks or EMR notebooks, or accessing the data through the Consumption API (REST API?)
  • data model
    • first, tecton introduces Entities (object classes)
    • then every feature value is associated with one or more entities
  • input data can be transformed with SQL or PySpark, there can be online transformations too
  • source

AWS Sage Maker

  • Note: weirdly similar to Feast
  • Fully integrated into Amazon’s console (there is a Jupyter notebook though :))
  • Feature records are analogs of db rows
  • You work with it in the following way:
    • in a jupyter notebook you load/clean the data
    • then you define the “schema” - you create “features” in a “feature group”
    • there are two key “features”:
      • unique ids (transaction id, etc. - something that uniquely identifies a “row”)
      • event time - timestamp of the event/feature
    • then you define the data location
    • … and call the “ingest()” function to actually load the data (see the sketch after this list)
    • there is a batch API and a single record API
  • after ingestion data is both in offline and online stores
  • you can access the data by:
    • fetching it from s3
    • reading it from a hive table (there is a nice api to generate external hive tables)
    • dumping it to csv
  • source
  • video
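
A rough sketch of those steps with the SageMaker Python SDK (bucket, role ARN and feature group name are made up; check the SDK docs for the exact signatures and required column types):

    import time

    import pandas as pd
    import sagemaker
    from sagemaker.feature_store.feature_group import FeatureGroup

    session = sagemaker.Session()

    # data already loaded/cleaned in the notebook
    df = pd.DataFrame({
        "transaction_id": [1, 2, 3],
        "amount": [10.5, 20.0, 7.25],
    })
    df["event_time"] = float(round(time.time()))

    feature_group = FeatureGroup(name="transactions-features",
                                 sagemaker_session=session)

    # derive the "schema" (feature definitions) from the dataframe
    feature_group.load_feature_definitions(data_frame=df)

    # create the group with the two key features: unique id + event time
    feature_group.create(
        s3_uri="s3://my-bucket/feature-store",          # offline store location
        record_identifier_name="transaction_id",
        event_time_feature_name="event_time",
        role_arn="arn:aws:iam::123456789012:role/sagemaker-feature-store-role",
        enable_online_store=True,
    )

    # batch ingestion; there is also a single-record API
    feature_group.ingest(data_frame=df, max_workers=4, wait=True)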

More links

Salesforce Marketing Cloud REST API upload errors

While trying to insert data into Salesforce Marketing Cloud I frequently encounter one of the following errors:

{"message":"Primary key 'subscriberkey' does not exist.","errorcode":10000,"documentation":""}

… or …

{"message":"Parameter {guid} is invalid.","errorcode":10001,"documentation":""}

… or …

{"message":"Unable to save rows for data extension ID <ID>","errorcode":10006,"documentation":""}

Unfortunately those messages are pretty cryptic and it’s quite hard to identify what exactly the root cause is. Fortunately, there are a couple of things which you can try out:

  • Verify that the url has the correct format (mind the key: portion):
    • https://<YOUR_SUBDOMAIN>.rest.marketingcloudapis.com/hub/v1/dataevents/key:<DATA_EXTENSION_NAME>/rowset
  • Check if there are “null” values in your data. Even if you allowed a specific column to be null in the DE definition, you still have to send empty strings to make the API call work (see the sketch after this list).
  • Check if you have set the external key of Data Extension correctly. I found that setting it from Audience Builder does not usually work, but in Email Studio everything is ok.
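
A minimal sketch of such an insert with Python requests (the subdomain, Data Extension external key and column names are made up; note the key: prefix in the URL and the empty string sent instead of a null):

    import requests

    ACCESS_TOKEN = "..."  # OAuth token obtained from the SFMC auth endpoint
    URL = ("https://YOUR_SUBDOMAIN.rest.marketingcloudapis.com"
           "/hub/v1/dataevents/key:MY_DATA_EXTENSION/rowset")

    rows = [
        {
            "keys": {"subscriberkey": "12345"},
            "values": {
                "email": "jane.doe@example.com",
                # even for nullable columns, send an empty string, not None
                "last_purchase_date": "",
            },
        }
    ]

    response = requests.post(
        URL,
        json=rows,
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    )
    response.raise_for_status()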

ML platforms evaluation summary

Below are my general notes from the recent evaluation of various ML tools/platforms. At the end I try to summarize the features of an ideal platform (at least from my humble POV :)).

Which things did I like?

  • KFserving from kubeflow for serving
    • uses knative + istio to provide a feature-rich environment for microservices, which in this case delivers ML model inference
    • using those tools makes it possible to apply advanced techniques like canary deployments
  • Concept of a workflow from FB Learner: the complete pipeline stays the same and people can just plug in different features
  • MLFlow has a good idea of tracking the experiments at the sdk/framework level (e.g. there are defaults for pytorch etc.) and, in general, of organizing the research workflow
  • MLFlow has a really nice concept of a universal model format. On one hand it ensures that the right things get stored in the Model Registry, on the other it is a nice, common definition which e.g. can be used for deployment/model serving.
    • If this could be combined with KFserving, it would be an ML platform superpower.

What did I learn?

  • Data infra (datalake/data warehouse) and elastic compute are a foundation for an ML platform
  • Tight integration with the datalake/warehouse (FB Learner/Michelangelo) is a very good idea.
    • Enables using available data sources as features (so the datalake/warehouse == feature store)
  • The pipelines/workflows should have data-quality checks built-in to allow for quick catching of problems with data
  • Using DAGs and python as a workflow definition (FB Learner) seems to have the best abstraction level, e.g. Airflow DAGs seem like a better abstraction than a series of interconnected container executions (although they seem a bit limited in the sense that you need to follow specific rules when you integrate)
  • Kubeflow, Michelangelo and FB Learner support automated hyperparameter search functionality
  • Model evaluation reports should be centrally stored and be easy to query (e.g. indexed with elasticsearch)
  • A model should have its training report associated with it
  • Support for both online/offline training is a nice feature (although this would be more of a requirements-based decision)
  • MLFlow seems to be a good step towards making research reproducible, or in general just more organized; it needs some improvements though to make it digestible for a wider audience:
    • configuration patterns should be easier (e.g. the environment - Jupyter Notebooks - injects the default tracking configuration)
    • some defaults on metrics to track
    • artifact storage for the tracking server setup should be solved a bit better (the client should only see the server and not the storage accounts in AWS/Azure)

What would be the features of an ideal ML platform?

  • A solid foundation with a datalake/data warehouse with well organized metadata
    • the DL/DWH in general becomes one large feature store - this way we do not have to have an explicit feature store
  • Pipelines are written as Python code in something similar to Airflow
  • There is a range of ready-to-use ML pipelines (pre-processing pipeline + model definition + training procedure + easy to change feature configuration)
  • Provides easily accessible Jupyter notebooks which have good integration with:
    • Scalable Spark clusters
    • Metadata store from datalake/DWH
    • Default MLFlow setup for remote tracking, artifact and model storage out of the box
  • The underlying infrastructure is an autoscalable, multitenant Spark cluster
    • This can be achieved with Spark-on-Kubernetes, where k8s effectively becomes the queue manager for Spark jobs.
  • Universal model format, which can be used by KFserving to expose the model through an API or by an automated pipeline to produce batch predictions.
  • Support for online and batch training
  • Support for automated hyperparameter tuning
  • Data Quality checks built into the ML pipelines (something similar to great_expectations)
  • MLFlow Model Registry integrated and extended with easy browsing and searching functionality, integrated also with KFserving and batch prediction calculation pipelines.

This would enable the following:

  • A Data Scientist can easily iterate on their model with MLFlow; their work is nicely tracked and models are ready to deploy in the Model Registry. MLFlow integration is done without any additions on their side.
  • An ML Engineer can easily deploy those models to production with KFserving or batch pipelines.
  • If a model needs to be periodically retrained, an Airflow DAG can be written to create a pipeline
  • Both Data Scientist and ML Engineer have full access to all DWH/DL data, so their models/pipelines can be easily adjusted if new data begins to be captured.
  • Quality is ensured on every stage with great_expectations (and its extension to cover e.g. accuracy of the production models).
  • All information about past workflows and models can be easily searched through.

MLFlow evaluation notes

  • MLFlow is an Open Source framework (set of libs + a model server + tracking server + ui) which organizes your work/research in ML into a more reproducible/reliable process
  • What problem does it solve?
    • making the research part of ML development more transparent and trackable
    • the artifacts of training and their properties are stored in a model store
    • it can organize your data processing into pipelines
  • It is distributed as a python package and can be installed with pip install mlflow
  • What do I like about it?
    • Quite lightweight
    • Nice UI with metric visualizations
    • This is a real improvement in ML research
    • The concept of a universal model format is really nice
  • Tracking data can be managed centrally:
    To log runs remotely, set the MLFLOW_TRACKING_URI environment variable to a
    tracking server’s URI or call mlflow.set_tracking_uri().
    
  • If you register the model in the model registry, MLFlow has ways to serve it.
  • What is a model?
    • Each MLflow Model is a directory containing arbitrary files, together with an MLmodel file in the root of the directory that can define multiple flavors that the model can be viewed in. (https://mlflow.org/docs/latest/models.html#storage-format)
    • MLFlow introduces its own model format. After you are done with training (in any supported framework) you can simply save (or log - if you are using model registry) the model to storage in that format. Using it enables MLFlow to e.g. easily wrap it with a REST API service (so you can easily deploy it somewhere).
  • What is a “run”?
    • A run seems to be a logging session which starts when the process starts and finishes when it ends.
    • There can be multiple runs within the same process; you need to explicitly create them with use of the mlflow.start_run() context API or through MlflowClient.
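      A minimal sketch with the context manager (tracking URI, experiment and metric names are made up):

      import mlflow

      mlflow.set_tracking_uri("http://mlflow.my-company.internal:5000")
      mlflow.set_experiment("churn-model")

      # first run
      with mlflow.start_run(run_name="baseline"):
          mlflow.log_param("lr", 0.01)
          mlflow.log_metric("val_accuracy", 0.81)

      # a second, independent run within the same process
      with mlflow.start_run(run_name="tuned"):
          mlflow.log_param("lr", 0.001)
          mlflow.log_metric("val_accuracy", 0.84)
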
  • In MLFlow two things are tracked: runs and artifacts.
    • Runs can be stored in dbs, files etc
    • Artifacts need a form of blob storage (S3, Azure Blob Storage, FTP/SFTP, HDFS etc)
  • MLFlow has a “Model registry”

    The MLflow Model Registry component is a centralized model store, set of APIs, and UI, to collaboratively manage the full lifecycle of an MLflow Model. It provides model lineage (which MLflow experiment and run produced the model), model versioning, stage transitions (for example from staging to production), and annotations. Source

  • Tracking
  • MLFlow introduces a concept of “projects”:
    • Those seem to be “pipelines” which combine multiple steps into a reproducible workflow
    • Seem quite simplistic compared to other solutions like DAGs in Airflow or in Kubeflow
  • MLFlow can be extended by writing plugins:
    The MLflow Python API supports several types of plugins:
    
    * Tracking Store: override tracking backend logic, e.g. to log to a
      third-party storage solution
    * ArtifactRepository: override artifact logging logic, e.g. to log to a
      third-party storage solution
    * Run context providers: specify context tags to be set on runs created via
      the mlflow.start_run() fluent API.
    * Model Registry Store: override model registry backend logic, e.g. to log to
      a third-party storage solution
    * MLFlow Project backend: override the local execution backend to execute a
      project on your own cluster (Databricks, kubernetes, etc.)
    
  • Logging/storing of data can be done in different modes (e.g. artifacts are written by the client directly to the blob storage, while run metadata goes through the tracking server)

While researching MLFlow capabilities, I have prepared some sample notebooks here.