🐴 Horse Colic Survival Analysis

An end-to-end pipeline from exploratory data analysis to machine learning survival prediction, built on the UCI Horse Colic dataset.

Python 3.12 · scikit-learn · pandas · seaborn · n = 299

Dataset Overview

Samples: 299
Features: 28
Missing (avg): 17.9%

Outcome      Count   Share
Lived          178   59.5%
Died            77   25.8%
Euthanized      44   14.7%
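As a quick arithmetic check, the shares in the table follow directly from the counts. The snippet below is a minimal sketch that uses only the three counts reported above; the variable names are illustrative.

```python
# Outcome counts as reported in the overview table.
counts = {"Lived": 178, "Died": 77, "Euthanized": 44}

total = sum(counts.values())  # 299 samples in total

# Share of each outcome as a percentage, rounded to one decimal place.
shares = {outcome: round(100 * n / total, 1) for outcome, n in counts.items()}

for outcome, pct in shares.items():
    print(f"{outcome:<11}{counts[outcome]:>5}{pct:>7.1f}%")
```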

1 — Exploratory Data Analysis

Univariate and bivariate examination of the four key clinical vitals (pulse, rectal temperature, PCV, total protein) stratified by outcome.

Fig 1 — Tachycardia is the strongest single biomarker separating survivors from non-survivors.

Fig 2 — PCV and pulse show the strongest positive correlation (r ≈ 0.41), reflecting haemoconcentration-driven tachycardia.

Fig 3 — Cases below the dashed line cluster among non-survivors, suggesting protein loss compounds circulatory failure.

Fig 4 — Mild pain is the most common presentation, but extreme pain predicts surgical necessity.

Key finding: Median pulse in survivors is ~50 bpm versus ~90 bpm in non-survivors — a 1.8× difference that is visible without any feature engineering.
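A comparison of this kind can be sketched with a pandas group-by. The rows below are toy values chosen for illustration, not records from the dataset, and the column names `outcome` and `pulse` are assumptions about how the data might be labelled.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT values from the dataset.
toy = pd.DataFrame(
    {
        "outcome": ["lived", "lived", "lived", "died", "died", "euthanized"],
        "pulse": [44, 50, 56, 88, 96, 90],
    }
)

# Collapse 'died' and 'euthanized' into a single non-survivor group.
toy["group"] = toy["outcome"].map(
    lambda o: "survivor" if o == "lived" else "non-survivor"
)

# Median pulse per group, then the ratio of non-survivors to survivors.
median_pulse = toy.groupby("group")["pulse"].median()
ratio = median_pulse["non-survivor"] / median_pulse["survivor"]

print(median_pulse)
print(f"ratio: {ratio:.1f}x")
```

The median is used rather than the mean because it is less sensitive to a few extreme pulse readings.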

2 — Dimensionality Reduction (PCA)

Principal Component Analysis was applied to five standardised numeric vitals. The scree plot and 2D projection show why numeric vitals alone are insufficient for clean class separation.

Fig 5 — PC1 + PC2 capture 55.2% of variance. Outcome classes overlap substantially, confirming that categorical features (pain level, mucous membrane) carry critical predictive weight beyond raw vitals.

Implication: Because outcomes overlap in PCA space, the categorical features (pain level, mucous membrane, surgery) are essential for accurate classification, which the feature importance results below confirm.
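The standardise-then-project step might look roughly like the sketch below. It runs on random numbers standing in for five vitals, so its variance shares will not match the 55.2% reported here. On the real data, missing values would need to be imputed first, since scikit-learn's PCA does not accept NaNs.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for five numeric vitals (299 rows x 5 columns).
# These are random numbers, not the real measurements.
X = rng.normal(size=(299, 5))
X[:, 1] += 0.5 * X[:, 0]  # add some correlation between two columns

# Standardise each column to zero mean and unit variance before PCA.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=5).fit(X_std)
scores = pca.transform(X_std)[:, :2]  # 2D projection onto PC1 and PC2

# Scree data: variance share per component and its running total.
explained = pca.explained_variance_ratio_
cumulative = np.cumsum(explained)

print(np.round(explained, 3))
print(f"PC1 + PC2: {cumulative[1]:.1%}")
```

Colouring `scores` by outcome would give a 2D projection of the kind described for Fig 5.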

3 — Random Forest Classifier

A RandomForestClassifier (300 trees, balanced class weights) was trained on eight features, covering both numeric vitals and encoded categorical variables. Missing values were imputed with per-feature medians.
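A configuration along these lines can be sketched as follows. The data are synthetic, and placing the imputer inside a scikit-learn pipeline is a choice made for this sketch (so medians are learned on each training fold), not necessarily how analyse.py does it. The random seeds are also arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in: 299 rows, 8 features, 3 outcome classes.
X = rng.normal(size=(299, 8))
y = rng.choice(3, size=299, p=[0.6, 0.25, 0.15])
X[rng.random(X.shape) < 0.18] = np.nan  # blank out roughly 18% of cells

# Median imputation followed by a 300-tree forest with balanced class weights.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(
        n_estimators=300, class_weight="balanced", random_state=0
    ),
)

# 5-fold stratified cross-validation scored by balanced accuracy.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")

print(f"{scores.mean():.1%} ± {scores.std():.1%}")
```

Because the labels here are random, the printed score should sit near chance level for three classes; it only illustrates the mechanics.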

5-Fold Balanced Accuracy: 55.0% ± 9.6%

Class        Precision   Recall   F1     Support
Lived        0.73        0.75     0.74   36
Died         0.53        0.67     0.59   15
Euthanized   0.50        0.22     0.31   9
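The F1 column can be checked against the precision and recall columns, since F1 is their harmonic mean. The sketch below uses the rounded table values, so it is a consistency check rather than a recomputation from raw predictions.

```python
# Precision and recall per class, copied from the table above.
report = {
    "Lived": (0.73, 0.75),
    "Died": (0.53, 0.67),
    "Euthanized": (0.50, 0.22),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_scores = {cls: round(f1(p, r), 2) for cls, (p, r) in report.items()}

# Balanced accuracy on one split is the unweighted mean of per-class recall.
macro_recall = sum(r for _, r in report.values()) / len(report)

print(f1_scores)
print(f"mean recall on the hold-out split: {macro_recall:.3f}")
```

The mean recall here describes the single hold-out split, which is a different quantity from the 5-fold cross-validated figure quoted above, although the two come out close.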

Fig 6 — Pulse is the dominant predictor, consistent with the EDA finding. Pain level and surgery status together contribute as much as the remaining vitals combined.

Fig 7 — 'Lived' cases are classified with the highest precision. 'Euthanized' cases show most confusion with 'Died', reflecting clinical similarity.

Interpretation: Pulse is the dominant predictor (consistent with EDA), but pain level and surgery status together contribute as much as the four remaining vitals combined. The model struggles most at the Died / Euthanized boundary (recall for Euthanized is only 0.22): clinically, these are similar physiological states resolved by veterinary judgement rather than biomarkers alone.

4 — Methods & Reproducibility

Pipeline

1. Load CSV → standardise column values → coerce numerics.
2. Encode categoricals with LabelEncoder; impute numeric NAs with median.
3. 80/20 stratified train-test split; 5-fold cross-validation scored by balanced accuracy.
4. Feature importance extracted from mean decrease in impurity.
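The steps above can be sketched end to end on a tiny inline CSV. Everything about the data is hypothetical: the column names (`pulse`, `pain`, `outcome`), the `?` missing-value marker and the values themselves are invented for illustration. Cross-validation is left out because the toy sample is too small for it.

```python
import io

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Tiny inline CSV standing in for the real file; all values are made up.
rows = (
    ["48, mild ,lived", "52,Mild,lived", "?,none,lived"] * 3
    + ["92,Severe,died", "88,severe,died", "100,extreme,died"]
    + ["96,extreme,euthanized", "?,Extreme,euthanized", "84,severe,euthanized"]
)
raw = io.StringIO("pulse,pain,outcome\n" + "\n".join(rows))

# Step 1: load, standardise text values, coerce numerics ('?' becomes NaN).
df = pd.read_csv(raw)
df["pain"] = df["pain"].str.strip().str.lower()
df["pulse"] = pd.to_numeric(df["pulse"], errors="coerce")

# Step 2: label-encode the categorical and median-impute the numeric column.
df["pain_code"] = LabelEncoder().fit_transform(df["pain"])
df["pulse"] = df["pulse"].fillna(df["pulse"].median())

# Step 3: 80/20 split, stratified on the outcome.
X, y = df[["pulse", "pain_code"]], df["outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 4: impurity-based (mean decrease in impurity) feature importances.
rf = RandomForestClassifier(
    n_estimators=300, class_weight="balanced", random_state=0
).fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X.columns)

print(importances.sort_values(ascending=False))
```

LabelEncoder assigns integer codes in alphabetical order, so the codes carry no clinical ordering; tree-based models are fairly tolerant of this.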

Limitations

The dataset has substantial missingness (~30% of rows for some features). Median imputation preserves sample size but understates variance. Balanced accuracy was chosen as the metric because the Lived class is overrepresented (59.5% of samples). This report should not be used for clinical decision-making.
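The variance point can be illustrated with a few numbers. The readings below are hypothetical, chosen only to show that filling gaps with the median pulls the sample variance down.

```python
import statistics

# Hypothetical pulse readings; None marks a missing value.
pulse = [40, 44, 48, 52, 60, 72, 88, 96, None, None, None]

observed = [v for v in pulse if v is not None]
median = statistics.median(observed)  # 56.0

# Replace every missing reading with the median of the observed ones.
imputed = [median if v is None else v for v in pulse]

# Sample variance before and after imputation.
var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

print(f"observed: {var_observed:.1f}, imputed: {var_imputed:.1f}")
```

Every imputed value sits at the centre of the distribution, so the spread shrinks as the share of missing values grows.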

Reproduce

pip install -r requirements.txt && python analyse.py