Aarno@TheGlobalMinima
Blueprint of ML System Design
For most MLEs, ML is only 10% of the job. The engineering aspect is crucial and
often requires skills from multiple fields. A good starting point is a robust
design of how the ML models will interact with users and other services.
These are the major components of a typical ML System Design:
> Data Ingestion & ETL service
> Data Storage & Feature engineering
> Training (and Retraining) pipelines, model management
> Inference service & monitoring
These components, along with the actual machine learning and data modelling, need to be combined
into an architecture that can support and scale all of the above functions.
Data Ingestion & ETL service
Sourcing of data is the first step in this design. While you're not responsible for
actually finding the data (orgs have their sources), the existing data is often scattered
across multiple systems.
The major responsibility here is to design an extensible, fault-tolerant system
that can collect raw data arriving at different volumes and velocities, and to design
the schemas and snapshots it is stored under.
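As a rough illustration, here is a minimal ingestion sketch in Python. The required fields, the raw event source and the snapshot directory are all hypothetical placeholders; a real pipeline would usually sit behind a message queue or a batch loader.

import json
import time
from pathlib import Path

# Hypothetical schema: every raw event must carry these fields.
REQUIRED_FIELDS = {"event_id", "user_id", "timestamp", "payload"}

def ingest(raw_events, snapshot_dir="raw_snapshots"):
    """Validate incoming records and persist them as a timestamped snapshot."""
    valid, rejected = [], []
    for event in raw_events:
        if REQUIRED_FIELDS.issubset(event):
            valid.append(event)
        else:
            rejected.append(event)  # dead-letter / log instead of failing the whole batch

    snapshot_path = Path(snapshot_dir) / f"snapshot_{int(time.time())}.jsonl"
    snapshot_path.parent.mkdir(parents=True, exist_ok=True)
    with snapshot_path.open("w") as f:
        for event in valid:
            f.write(json.dumps(event) + "\n")
    return snapshot_path, rejected

The point is the shape of the responsibility: validate, tolerate bad records gracefully, and snapshot what was accepted.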
Data Storage & Feature engineering
Once you have data consistently flowing into the system, the next step is to organize and
persist it. This phase requires an understanding of databases and data storage architectures.
The significant challenge here is to engineer learnable features out of a raw dump of data.
There are major considerations to be made in this phase:
> The data should exhibit meaningful distributions and patterns worth learning from
> Chosen features should be available consistently.
> No Personally Identifiable Information (PII) should be present, or it should be maskable / mockable
> Features should be interpretable and explainable
> Features must be computable at inference time under the same constraints as training
A common tool here is a feature store, which maintains and versions the data
used specifically for training the ML models.
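As a minimal sketch of the idea (not any particular feature store product), the snippet below versions feature rows by entity and timestamp, and routes training and serving through the same transform so the two stay consistent. All field and class names here are assumptions.

from collections import defaultdict

def compute_features(raw_row):
    """Shared transform: used at training time and at inference time alike."""
    return {
        "purchase_count_7d": raw_row["purchases_last_7_days"],
        "avg_basket_value": raw_row["total_spend"] / max(raw_row["order_count"], 1),
    }

class InMemoryFeatureStore:
    """Toy feature store: keeps versioned feature rows keyed by entity id."""
    def __init__(self):
        self._rows = defaultdict(list)  # entity_id -> list of (timestamp, features)

    def write(self, entity_id, timestamp, features):
        self._rows[entity_id].append((timestamp, features))

    def read_latest(self, entity_id, as_of):
        """Point-in-time read: return the newest features at or before `as_of`."""
        candidates = [r for r in self._rows[entity_id] if r[0] <= as_of]
        return max(candidates, key=lambda r: r[0])[1] if candidates else None

The point-in-time read is what keeps training data honest: a model trained on a snapshot only ever sees feature values that existed at that moment.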
Training pipelines & Model management
With trainable features now available and consistently updated, it's time to train your ML model.
This step requires proper monitoring since you train, evaluate and test models that may or may
not end up in production. This is where the idea of a Model Registry becomes prominent. A registry
allows you to record and version models along with their parameters and hyperparameters, data snapshots
and other metadata. You also log metrics and errors over time, which helps in choosing the best model
for production.
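A registry can be as simple as an append-only record per training run. The sketch below is a toy illustration (real setups typically use a dedicated tool such as MLflow or a managed registry); every field name here is an assumption.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ModelRecord:
    """One versioned entry in a toy model registry."""
    name: str
    version: int
    hyperparameters: dict
    data_snapshot: str          # which snapshot of data this model was trained on
    metrics: dict               # e.g. {"auc": 0.91, "val_loss": 0.23}
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

class ModelRegistry:
    def __init__(self):
        self._records = []

    def register(self, record: ModelRecord):
        self._records.append(record)

    def best(self, name: str, metric: str):
        """Pick the production candidate: the version with the highest value for `metric`."""
        candidates = [r for r in self._records if r.name == name and metric in r.metrics]
        return max(candidates, key=lambda r: r.metrics[metric], default=None)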
The training workflow is orchestrated using a DAG-based system (Airflow, Prefect, etc.). These
workflows need to be loosely coupled to ensure failures are graceful and logged. We'll cover more
on this phase separately, since a large part of it is more machine learning than engineering.
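For example, a retraining workflow in Airflow might look roughly like the sketch below (assuming Airflow 2.x; the task functions are placeholders). Each step is its own task, so a failure in evaluation does not silently corrupt training, and every run is logged by the scheduler.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; in practice each would call into its own module or service.
def extract_features(): ...
def train_model(): ...
def evaluate_model(): ...

with DAG(
    dag_id="weekly_retraining",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_features", python_callable=extract_features)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    evaluate = PythonOperator(task_id="evaluate_model", python_callable=evaluate_model)

    extract >> train >> evaluate  # loose, linear coupling between steps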
Inference service & monitoring
This part of the system faces the users. It is also the only source of real-world feedback for
the models, which ultimately becomes very crucial and should ideally influence the model's learning,
either as a learnable feature or as a hyperparameter.
The choice of inference technique depends on the required frequency of predictions and on the consuming system.
Here are a couple of scenarios:
> The model faces real-world users, so inference needs to be real-time and low latency. An API is the best way here (sketched after this list).
> The predictions are used in another service (another ML model // analytics), so inference needs to be scheduled and the predictions batched.
Here, another DAG-based workflow or a serverless function with a scheduled job works well.
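For the real-time case, a minimal API sketch using FastAPI could look like the following; the model and the feature names are hypothetical stand-ins for whatever the registry promotes to production.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Stand-in for the real model; in production this would be loaded from the registry.
class DummyModel:
    def predict(self, rows):
        return [sum(row) for row in rows]

model = DummyModel()

class PredictionRequest(BaseModel):
    purchase_count_7d: float
    avg_basket_value: float

@app.post("/predict")
def predict(request: PredictionRequest):
    features = [[request.purchase_count_7d, request.avg_basket_value]]
    score = model.predict(features)[0]
    # Log the request, features and score here: this is the feedback loop mentioned above.
    return {"score": float(score)}

Something like `uvicorn inference:app` would serve this; the same handler is also where prediction logging for monitoring naturally lives.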
How to approach designing these systems
Here are the first questions to ask:
1. What is the source of the data, and what are its volume, variety and velocity? (Determines pipeline // storage choices)
2. What's the target feature, and what are the most important features affecting it? (This is from a domain knowledge pov)
3. Latency vs correctness: some cases need near-absolute correctness, others need low latency; ask what accuracy is acceptable.
4. Where are the model predictions used? (Choose between real-time // batch inference)
These are preliminary questions, which lead into further investigations.
Remember, no system design starts off with perfect choices, so it's essential to keep things simple
initially. Add a feature store or tiered data architecture only when the complexity demands it.
These systems evolve over time, and even years later they are not perfect. The idea is to ensure
that the architecture remains well-abstracted and extensible.
Start simple, add complexity slowly.