ML System Lifecycle

End-to-end stages for building and operating ML systems. The roles described in ml-ecosystem.md map to these stages.

| Stage | Goal | Typical technologies |
|---|---|---|
| 1. Frame | Define the ML problem, task, success metrics, constraints, assumptions, and product tradeoffs. | Product docs, design docs, Jira/Linear, stakeholder interviews, success-metric definitions |
| 2. Data | Prepare, validate, and version datasets; check label quality, split integrity, leakage, and feature availability; monitor data quality and drift. | DVC, LakeFS, S3/GCS/Azure Blob, Hugging Face Datasets, Great Expectations, Pandera, data catalogs |
| 3. Train | Train models reproducibly, grounded in model fundamentals: the bias-variance tradeoff, overfitting, regularization, loss functions, optimizers, learning-rate schedules, and batch-size effects. | PyTorch, TensorFlow, JAX, Lightning, Hugging Face Transformers, MLflow, Weights & Biases, TensorBoard |
| 4. Evaluate | Analyze model performance with validation/test sets, per-class and slice analysis, thresholding, calibration, robustness checks, and responsible ML considerations. | scikit-learn metrics, custom evaluation scripts, Evidently, SHAP/LIME, fairness and calibration reports |
| 5. Package | Export model artifacts, track versions, define preprocessing/postprocessing, and choose appropriate formats and runtimes. | TorchScript, ONNX, MLflow Model Registry, model cards, artifact stores, Docker build artifacts |
| 6. Serve | Run models under real latency, throughput, memory, CPU/GPU placement, scaling, batching, and application integration constraints. | Serving: FastAPI, BentoML, KServe, Seldon Core, TorchServe, Triton Inference Server, Ray Serve; containers: Docker, Docker Compose; secrets management: Infisical, HashiCorp Vault, Doppler, AWS Secrets Manager, GCP Secret Manager, Azure Key Vault |
| 7. Optimize | Improve inference for target hardware using quantization, distillation, pruning, runtime tuning, and preprocessing/postprocessing optimization. | ONNX Runtime, TensorRT, OpenVINO, torch.compile, profiling tools |
| 8. Test | Build tests and CI/CD workflows that cover code, data, models, pipelines, and performance, with gates that block deployments failing those checks. | pytest, GitHub Actions, GitLab CI, Jenkins, pre-commit, model/data validation gates, load tests |
| 9. Operate | Monitor model quality, data drift, latency, errors, cost, and business impact; use alerting, canaries, rollback, retraining, and reliability safeguards. | Prometheus, Grafana, OpenTelemetry, Loki/ELK, Datadog, Arize, WhyLabs, PagerDuty, Kubernetes |
| 10. Communicate | Explain decisions, assumptions, metrics, validation evidence, fairness/privacy/security risks, and product tradeoffs to technical and non-technical stakeholders. | README files, model cards, system cards, dashboards, reports, runbooks, release notes |
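
As an illustration of the split and leakage checks named in stage 2, the sketch below looks for example IDs that appear in more than one split. It assumes every example carries a stable identifier (a user ID, document hash, or similar); the function names `find_split_overlap` and `assert_no_leakage` are hypothetical and not part of any listed tool. It catches only exact-ID overlap, not near-duplicates or temporal leakage.

```python
from typing import Hashable, Iterable


def find_split_overlap(
    train_ids: Iterable[Hashable],
    val_ids: Iterable[Hashable],
    test_ids: Iterable[Hashable],
) -> dict[str, set]:
    """Return the IDs shared by each pair of splits (empty sets mean no overlap)."""
    train, val, test = set(train_ids), set(val_ids), set(test_ids)
    return {
        "train_val": train & val,
        "train_test": train & test,
        "val_test": val & test,
    }


def assert_no_leakage(overlap: dict[str, set]) -> None:
    """Raise if any pair of splits shares an ID, so a pipeline step can fail fast."""
    leaking = {pair: ids for pair, ids in overlap.items() if ids}
    if leaking:
        raise ValueError(f"Split leakage detected: {leaking}")


if __name__ == "__main__":
    # Made-up IDs for illustration: "u3" appears in both train and test.
    overlap = find_split_overlap(
        train_ids=["u1", "u2", "u3"],
        val_ids=["u4", "u5"],
        test_ids=["u3", "u6"],
    )
    print(overlap["train_test"])  # {'u3'}
```

Calling `assert_no_leakage(overlap)` on the result above would raise `ValueError`, which is the behavior a data validation step would typically want before training starts.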
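
The deployment gates in stage 8 can be sketched as a small function that compares a candidate model's metrics with the current production baseline. This is a minimal sketch under stated assumptions: the metric names (`accuracy`, `p95_latency_ms`), the threshold values, and the names `GateThresholds` and `check_release_gate` are all hypothetical; real limits would come from the success metrics defined in stage 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    """Hypothetical limits; replace with values agreed during the Frame stage."""

    min_accuracy: float = 0.90
    max_accuracy_drop: float = 0.01  # allowed regression vs. the production model
    max_p95_latency_ms: float = 50.0


def check_release_gate(
    candidate: dict, baseline: dict, limits: GateThresholds
) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    if candidate["accuracy"] < limits.min_accuracy:
        failures.append(
            f"accuracy {candidate['accuracy']:.3f} is below "
            f"the minimum {limits.min_accuracy:.3f}"
        )
    drop = baseline["accuracy"] - candidate["accuracy"]
    if drop > limits.max_accuracy_drop:
        failures.append(
            f"accuracy dropped {drop:.3f} vs. baseline "
            f"(allowed {limits.max_accuracy_drop:.3f})"
        )
    if candidate["p95_latency_ms"] > limits.max_p95_latency_ms:
        failures.append(
            f"p95 latency {candidate['p95_latency_ms']:.1f} ms exceeds "
            f"{limits.max_p95_latency_ms:.1f} ms"
        )
    return failures


if __name__ == "__main__":
    # Made-up metrics for illustration only.
    baseline = {"accuracy": 0.925, "p95_latency_ms": 38.0}
    candidate = {"accuracy": 0.920, "p95_latency_ms": 41.0}
    problems = check_release_gate(candidate, baseline, GateThresholds())
    print("gate passed" if not problems else problems)
```

In a CI job, a wrapper script would exit with a non-zero status when the returned list is non-empty, which is what causes the pipeline step to fail and block the deployment.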