ML System Lifecycle

The ten stages below cover building and operating an ML system end to end. The roles described in ml-ecosystem.md map onto these stages.

Stage	Goal	Typical technologies
1. Frame	Define the ML problem, task, success metrics, constraints, assumptions, and product tradeoffs.	Product docs, design docs, Jira/Linear, stakeholder interviews, success-metric definitions
2. Data	Prepare, validate, and version data; manage labeling and train/validation/test splits; check for leakage and for feature availability at inference time; monitor data quality and drift.	DVC, LakeFS, S3/GCS/Azure Blob, Hugging Face Datasets, Great Expectations, Pandera, data catalogs
3. Train	Train models reproducibly while understanding model fundamentals: bias-variance tradeoffs, overfitting, regularization, loss functions, optimizers, learning rate schedules, and batch-size effects.	PyTorch, TensorFlow, JAX, Lightning, Hugging Face Transformers, MLflow, Weights & Biases, TensorBoard, Optuna
4. Evaluate	Analyze model performance with validation/test sets, per-class and slice analysis, thresholding, calibration, robustness checks, and responsible ML considerations.	scikit-learn metrics, custom evaluation scripts, Evidently, SHAP/LIME, fairness and calibration reports
5. Package	Export model artifacts, track versions, define preprocessing/postprocessing, and choose appropriate formats and runtimes.	TorchScript, ONNX, MLflow Model Registry, model cards, artifact stores, Docker build artifacts
6. Serve	Run models under real latency, throughput, memory, CPU/GPU placement, scaling, batching, and application integration constraints.	FastAPI, BentoML, KServe, Seldon Core, TorchServe, Triton Inference Server, Ray Serve, Docker, Docker Compose, Infisical, HashiCorp Vault, Doppler, AWS Secrets Manager, GCP Secret Manager, Azure Key Vault
7. Optimize	Improve inference for target hardware using quantization, distillation, pruning, runtime tuning, and preprocessing/postprocessing optimization.	ONNX Runtime, TensorRT, OpenVINO, `torch.compile`, quantization/pruning/distillation toolkits, profiling tools
8. Test	Build testing and CI/CD workflows for code, data, models, pipelines, performance, and deployment gates.	pytest, GitHub Actions, GitLab CI, Jenkins, pre-commit, model/data validation gates, load tests
9. Operate	Monitor model quality, data drift, latency, errors, cost, and business impact; use alerting, canaries, rollback, retraining, and reliability safeguards.	Prometheus, Grafana, OpenTelemetry, Loki/ELK, Datadog, Arize, WhyLabs, PagerDuty, Kubernetes
10. Communicate	Explain decisions, assumptions, metrics, validation evidence, fairness/privacy/security risks, and product tradeoffs to technical and non-technical stakeholders.	README files, model cards, system cards, dashboards, reports, runbooks, release notes
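
The slice analysis in stage 4 and the deployment gates in stage 8 can be combined into one small check. The sketch below is an illustration only, not a prescribed implementation: the names `GateResult` and `evaluate_gate`, the threshold values, and the toy labels are all hypothetical. It uses plain accuracy and the standard library; a real pipeline would likely substitute task-appropriate metrics.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    """Outcome of a hypothetical pre-deployment quality gate."""
    passed: bool
    overall_accuracy: float
    slice_accuracy: dict[str, float]
    failing_slices: list[str]


def evaluate_gate(y_true, y_pred, slices, min_overall=0.90, min_slice=0.80):
    """Compute overall and per-slice accuracy, then apply both thresholds.

    y_true, y_pred, and slices are equal-length sequences; slices[i] names
    the data slice (e.g. a platform or region) that example i belongs to.
    The default thresholds are placeholder assumptions.
    """
    if not (len(y_true) == len(y_pred) == len(slices)):
        raise ValueError("y_true, y_pred, and slices must have equal length")
    if len(y_true) == 0:
        raise ValueError("cannot evaluate an empty dataset")

    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, slice_name in zip(y_true, y_pred, slices):
        total[slice_name] += 1
        correct[slice_name] += int(truth == pred)

    slice_accuracy = {name: correct[name] / total[name] for name in total}
    overall = sum(correct.values()) / len(y_true)

    # A slice fails when its accuracy is below the per-slice floor.
    failing = sorted(n for n, acc in slice_accuracy.items() if acc < min_slice)
    passed = overall >= min_overall and not failing
    return GateResult(passed, overall, slice_accuracy, failing)


if __name__ == "__main__":
    # Toy data: the "mobile" slice is much weaker than the "web" slice.
    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
    slices = ["web"] * 5 + ["mobile"] * 3

    result = evaluate_gate(y_true, y_pred, slices,
                           min_overall=0.70, min_slice=0.50)
    print(result)
```

With the toy data, overall accuracy is 0.75 and clears the 0.70 bar, but the gate still fails because the `mobile` slice scores 1/3. That is the case slice analysis exists to catch. In CI, a script like this could exit non-zero when `passed` is false so the pipeline blocks the release.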