Dataset Overview
| Outcome | Count | Share |
|---|---|---|
| Lived | 178 | 59.5% |
| Died | 77 | 25.8% |
| Euthanized | 44 | 14.7% |
1 — Exploratory Data Analysis
Univariate and bivariate examination of the four key clinical vitals (pulse, rectal temperature, PCV, total protein) stratified by outcome.
Fig 1 — Tachycardia is the strongest single biomarker separating survivors from non-survivors.
Fig 2 — PCV and pulse show the strongest positive correlation (r ≈ 0.41), reflecting haemoconcentration-driven tachycardia.
Fig 3 — Cases below the dashed line cluster among non-survivors, suggesting protein loss compounds circulatory failure.
Fig 4 — Mild pain is the most common presentation, but extreme pain predicts surgical necessity.
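The r value quoted for Fig 2 is presumably a Pearson coefficient. The sketch below shows one way such a value can be computed, assuming rows with a missing reading in either column are dropped pairwise (the report does not state how missing values were handled for this figure). The readings are made-up illustrative values, not rows from the dataset, and `pearson_r` is a hypothetical helper name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)

# Hypothetical readings (None marks a missing value); not taken from the dataset.
pulse = [48, 60, 88, None, 104, 72, 120]
pcv = [37, 44, 50, 45, None, 41, 60]

r = pearson_r(pulse, pcv)
print(f"r = {r:.2f} over pairwise-complete rows")
```

Pairwise deletion keeps every row that has both readings, which matters for a dataset with this much missingness; imputing first would generally give a different r.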
2 — Dimensionality Reduction (PCA)
Principal Component Analysis on five standardised numeric vitals. The scree plot and 2D projection reveal why numeric vitals alone are insufficient for clean class separation.
Fig 5 — PC1 + PC2 capture 55.2% of variance. Outcome classes overlap substantially, confirming that categorical features (pain level, mucous membrane) carry critical predictive weight beyond raw vitals.
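The PCA step can be sketched as follows, assuming scikit-learn's `StandardScaler` and `PCA` and an input matrix with no missing values (PCA cannot handle NaNs, so imputation is assumed to have happened first). The matrix here is synthetic, so the printed share will not match the 55.2% reported for the real data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the five numeric vitals: 299 rows x 5 columns.
vitals = rng.normal(size=(299, 5))
vitals[:, 1] += 0.6 * vitals[:, 0]  # induce some correlation between two columns

# Standardise each column to zero mean and unit variance before PCA.
scaled = StandardScaler().fit_transform(vitals)

pca = PCA(n_components=5)
scores = pca.fit_transform(scaled)       # projected coordinates, one row per case
ratio = pca.explained_variance_ratio_    # per-component share of variance (scree plot)
top2_share = np.cumsum(ratio)[1]         # share captured by PC1 + PC2

print(f"PC1 + PC2 capture {top2_share:.1%} of variance")
```

Plotting `scores[:, 0]` against `scores[:, 1]`, coloured by outcome, gives the kind of 2D projection described for Fig 5.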
3 — Random Forest Classifier
A RandomForestClassifier (300 trees, balanced class weights) was trained on eight features, including both numeric vitals and encoded categorical variables. Missing values were imputed with per-feature medians.
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Lived | 0.73 | 0.75 | 0.74 | 36 |
| Died | 0.53 | 0.67 | 0.59 | 15 |
| Euthanized | 0.50 | 0.22 | 0.31 | 9 |
Fig 6 — Pulse is the dominant predictor, consistent with the EDA finding. Pain level and surgery status together contribute as much as the remaining vitals combined.
Fig 7 — 'Lived' cases are classified with the highest precision. 'Euthanized' cases are most often confused with 'Died', reflecting clinical similarity.
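The F1 column in the classification table follows from precision and recall as their harmonic mean, and the per-class recalls also determine balanced accuracy on the test split. The sketch below recomputes both from the reported figures; because the inputs are already rounded to two decimals, the derived values are approximate.

```python
# Precision, recall and support as reported in the classification table.
report = {
    "Lived": {"precision": 0.73, "recall": 0.75, "support": 36},
    "Died": {"precision": 0.53, "recall": 0.67, "support": 15},
    "Euthanized": {"precision": 0.50, "recall": 0.22, "support": 9},
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_scores = {name: round(f1(m["precision"], m["recall"]), 2) for name, m in report.items()}

# Balanced accuracy is the unweighted mean of per-class recall.
balanced_accuracy = sum(m["recall"] for m in report.values()) / len(report)

# Plain accuracy weights each class's recall by its support.
total = sum(m["support"] for m in report.values())
accuracy = sum(m["recall"] * m["support"] for m in report.values()) / total

print(f1_scores)
print(f"balanced accuracy ~ {balanced_accuracy:.2f}, accuracy ~ {accuracy:.2f}")
```

Plain accuracy comes out higher than balanced accuracy here because the well-classified 'Lived' class dominates the support, while the low 'Euthanized' recall counts for a full third of the balanced score.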
4 — Methods & Reproducibility
Pipeline
1. Load CSV → standardise column values → coerce numerics.
2. Encode categoricals with LabelEncoder; impute numeric NAs with median.
3. 80/20 stratified train-test split; 5-fold cross-validation scored on balanced accuracy.
4. Feature importance extracted from mean decrease in impurity.
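The steps above can be sketched with scikit-learn roughly as follows. This is not the contents of `analyse.py`: the data is synthetic, the column names (`pulse`, `pcv`, `pain`) are hypothetical stand-ins, and only three of the eight features are mocked up. Placing the median imputer inside the pipeline is also an assumption; it means medians are learned from the training folds only, whereas the listed order suggests the original may impute before splitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)
n = 299  # row count matching the outcome table; all values below are synthetic

# Hypothetical stand-ins: two numeric vitals and one categorical feature.
pulse = rng.normal(70, 25, n)
pcv = rng.normal(45, 10, n)
pulse[rng.random(n) < 0.1] = np.nan  # simulate missing readings
pain = rng.choice(["mild", "moderate", "extreme"], n)
y = rng.choice(["lived", "died", "euthanized"], n, p=[0.6, 0.25, 0.15])

# Step 2: integer-encode the categorical column with LabelEncoder.
pain_codes = LabelEncoder().fit_transform(pain)
X = np.column_stack([pulse, pcv, pain_codes])

# Step 3: 80/20 split, stratified on the outcome.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Median imputation followed by a 300-tree forest with balanced class weights.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0),
)

# Step 3 (continued): 5-fold cross-validation scored on balanced accuracy.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="balanced_accuracy")

# Step 4: impurity-based (mean decrease in impurity) feature importances.
model.fit(X_train, y_train)
importances = model.named_steps["randomforestclassifier"].feature_importances_

print("CV balanced accuracy:", cv_scores.mean().round(3))
print(dict(zip(["pulse", "pcv", "pain"], importances.round(3))))
```

Since the synthetic outcome is random, the cross-validated score should sit near chance (about one third for three classes); the point is the shape of the pipeline, not the numbers.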
Limitations
The dataset has substantial missingness (~30% of rows for some features). Median imputation preserves sample size but understates variance. Balanced accuracy was chosen as the metric because the Lived class is overrepresented (~60% of cases, per the outcome table). This report should not be used for clinical decision-making.
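The variance effect of median imputation can be seen in a small worked example. The readings below are hypothetical, not dataset values; the sketch simply fills three missing entries with the observed median and compares the population variance before and after.

```python
from statistics import median, pvariance

# Hypothetical observed readings for one feature; three further rows are missing.
observed = [36, 40, 44, 48, 52, 60, 66]
n_missing = 3

# Median imputation: every missing row receives the same central value.
fill = median(observed)
imputed = observed + [fill] * n_missing

var_observed = pvariance(observed)
var_imputed = pvariance(imputed)

print(f"variance before: {var_observed:.1f}, after imputation: {var_imputed:.1f}")
```

Every imputed row sits at the centre of the distribution, so the spread shrinks as the share of imputed rows grows. For features with around 30% missingness, this shrinkage can be large enough to affect anything that depends on feature variance.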
Reproduce
pip install -r requirements.txt && python analyse.py