ML System Lifecycle

End-to-end stages for building and operating ML systems (roles in ml-ecosystem.md map to these stages).

| Stage | Goal | Typical technologies |
| --- | --- | --- |
| 1. Frame | Define the ML problem, task, success metrics, constraints, assumptions, and product tradeoffs. | Product docs, design docs, Jira/Linear, stakeholder interviews, success-metric definitions |
| 2. Data | Prepare, validate, version, and monitor data quality, labeling, splits, leakage, feature availability, and drift. | DVC, LakeFS, S3/GCS/Azure Blob, Hugging Face Datasets, Great Expectations, Pandera, data catalogs |
| 3. Train | Train models reproducibly while understanding model fundamentals, bias-variance tradeoffs, overfitting, regularization, loss functions, optimizers, learning rate schedules, and batch effects. | PyTorch, TensorFlow, JAX, Lightning, Hugging Face Transformers, MLflow, Weights & Biases, TensorBoard |
| 4. Evaluate | Analyze model performance with validation/test sets, per-class and slice analysis, thresholding, calibration, robustness checks, and responsible ML considerations. | scikit-learn metrics, custom evaluation scripts, Evidently, SHAP/LIME, fairness and calibration reports |
| 5. Package | Export model artifacts, track versions, define preprocessing/postprocessing, and choose appropriate formats and runtimes. | TorchScript, ONNX, MLflow Model Registry, model cards, artifact stores, Docker build artifacts |
| 6. Serve | Run models under real latency, throughput, memory, CPU/GPU placement, scaling, batching, and application integration constraints. | FastAPI, BentoML, KServe, Seldon Core, TorchServe, Triton Inference Server, Ray Serve, Docker, Docker Compose, Infisical, HashiCorp Vault, Doppler, AWS Secrets Manager, GCP Secret Manager, Azure Key Vault |
| 7. Optimize | Improve inference for target hardware using quantization, distillation, pruning, runtime tuning, and preprocessing/postprocessing optimization. | ONNX Runtime, TensorRT, OpenVINO, torch.compile, quantization, pruning, distillation, profiling tools |
| 8. Test | Build testing and CI/CD workflows for code, data, models, pipelines, performance, and deployment gates. | pytest, GitHub Actions, GitLab CI, Jenkins, pre-commit, model/data validation gates, load tests |
| 9. Operate | Monitor model quality, data drift, latency, errors, cost, and business impact; use alerting, canaries, rollback, retraining, and reliability safeguards. | Prometheus, Grafana, OpenTelemetry, Loki/ELK, Datadog, Arize, WhyLabs, PagerDuty, Kubernetes |
| 10. Communicate | Explain decisions, assumptions, metrics, validation evidence, fairness/privacy/security risks, and product tradeoffs to technical and non-technical stakeholders. | README files, model cards, system cards, dashboards, reports, runbooks, release notes |
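
The slice analysis in stage 4 and the deployment gates in stage 8 can be tied together with a small check that runs in CI. The following is a minimal sketch under stated assumptions, not a prescribed implementation: the function names `accuracy_by_slice` and `deployment_gate`, the slice labels, and the thresholds (0.90 overall, 0.80 per slice) are all hypothetical and chosen only for illustration. It uses the Python standard library only.

```python
from collections import defaultdict


def accuracy_by_slice(y_true, y_pred, slices):
    """Return (overall accuracy, {slice name: accuracy}) for labeled predictions."""
    if not (len(y_true) == len(y_pred) == len(slices)):
        raise ValueError("y_true, y_pred, and slices must have equal length")
    if not y_true:
        raise ValueError("inputs must be non-empty")

    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, name in zip(y_true, y_pred, slices):
        total[name] += 1
        correct[name] += int(truth == pred)

    overall = sum(correct.values()) / len(y_true)
    per_slice = {name: correct[name] / total[name] for name in total}
    return overall, per_slice


def deployment_gate(overall, per_slice, min_overall=0.90, min_slice=0.80):
    """Return a list of failure messages; an empty list means the gate passes.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    failures = []
    if overall < min_overall:
        failures.append(f"overall accuracy {overall:.2f} < {min_overall:.2f}")
    for name, acc in sorted(per_slice.items()):
        if acc < min_slice:
            failures.append(f"slice '{name}' accuracy {acc:.2f} < {min_slice:.2f}")
    return failures


if __name__ == "__main__":
    # Hypothetical evaluation data: labels, predictions, and a slice key per row.
    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
    slices = ["web"] * 5 + ["mobile"] * 3

    overall, per_slice = accuracy_by_slice(y_true, y_pred, slices)
    for message in deployment_gate(overall, per_slice):
        print("GATE FAILED:", message)
```

In this made-up data the `web` slice is classified perfectly while `mobile` is not, so the per-slice check flags a weakness that the overall number alone would only hint at. In a CI job, a non-empty failure list would typically be turned into a non-zero exit code so the pipeline blocks the release.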