Model Details — CMU March Madness ML

Model Performance

Metrics are cross-validated with walk-forward evaluation: each season is scored by a model trained only on earlier seasons, so no future data leaks into training.
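The walk-forward idea can be sketched in a few lines. This is a minimal illustration, not the project's actual splitter: the function name `walk_forward_splits` and the `min_train_seasons` parameter are hypothetical, and the season list is made up.

```python
def walk_forward_splits(seasons, min_train_seasons=3):
    """Yield (train_seasons, test_season) pairs in chronological order.

    Each test season is paired only with seasons that came before it,
    so a model fit on the training side never sees future games.
    """
    ordered = sorted(set(seasons))
    for i in range(min_train_seasons, len(ordered)):
        yield ordered[:i], ordered[i]


# Hypothetical season labels, one entry per available season.
for train, test in walk_forward_splits([2019, 2021, 2022, 2023, 2024]):
    print(f"train on {train}, test on {test}")
```

Averaging the per-season scores from such folds gives a cross-validated metric in which every prediction is out-of-sample in time.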

🏀 Men's Model Metrics

🏀 Women's Model Metrics

Feature Engineering

The model uses 28 differential features, each computed by subtracting one team's statistic from the other's: team1_stat - team2_stat.
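As a rough sketch of the team1_stat - team2_stat pattern, assuming each team's season statistics are available as a dictionary. The function name, the `_diff` suffix, and the sample numbers are illustrative and not taken from src/feature_engineering.py.

```python
def differential_features(team1_stats, team2_stats):
    """Return {"<stat>_diff": team1 - team2} for every stat both teams have."""
    shared = sorted(team1_stats.keys() & team2_stats.keys())
    return {f"{name}_diff": team1_stats[name] - team2_stats[name] for name in shared}


# Made-up season averages for two teams.
team1 = {"win_pct": 0.80, "points_for": 78.5}
team2 = {"win_pct": 0.65, "points_for": 71.0}
features = differential_features(team1, team2)
```

Swapping the two teams flips the sign of every feature, which is what lets a single model score a matchup from either side.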

📊 Statistical Features (differential)

✓ Win Percentage: season win rate
✓ Points For / Against: offensive and defensive scoring averages
✓ Point Differential: average margin of victory or defeat
✓ Shooting Efficiency: FG%, 3P%, FT% (all three shooting categories)
✓ Rebounds (Off + Def): rebounding prowess
✓ Assists, Turnovers, Steals, Blocks, Fouls: full box score statistics

🎯 Contextual Features

★ Seed Difference: tournament seed differential (most predictive!)
★ Higher Seed Flag: binary, is team1 the higher seed?
★ Massey Rankings: average/best ordinal rank across multiple systems (NET, KPI, KenPom, SAG)
★ Offensive Efficiency: derived from the points-scored-per-game differential
★ Net Efficiency: offensive and defensive efficiency combined
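The two seed features can be sketched as follows. This assumes seeds arrive as Kaggle-style strings such as "W01" or "X16a" (a region letter, a two-digit seed, and an optional play-in suffix); the function and key names are hypothetical.

```python
def seed_number(seed):
    """Extract the numeric seed from a string such as 'W01' or 'X16a'."""
    return int(seed[1:3])


def seed_features(team1_seed, team2_seed):
    """Seed differential and a flag for whether team1 is the higher seed."""
    s1, s2 = seed_number(team1_seed), seed_number(team2_seed)
    return {
        "seed_diff": s1 - s2,
        # A lower number is the better ("higher") seed: a 1 seed outranks a 16.
        "team1_higher_seed": int(s1 < s2),
    }


example = seed_features("W01", "X16a")
```

Note that the differential is negative when team1 is the stronger seed, which is consistent with the team1 - team2 convention used for the statistical features.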

Model Hyperparameters

Configuration of each component in the ensemble.

🚀 XGBoost (40% weight)

Gradient boosted trees with regularization. Excels at capturing non-linear interactions.

n_estimators: 300
max_depth: 4
learning_rate: 0.05
subsample: 0.8
colsample: 0.8
reg_alpha: 0.1

⚡ LightGBM (40% weight)

Leaf-wise tree growth for efficiency. Handles ordinal ranking features naturally.

n_estimators: 300
max_depth: 4
learning_rate: 0.05
subsample: 0.8
colsample: 0.8
min_data: 10

📉 Logistic Regression (20% weight)

Calibrated linear baseline. Provides stable probability estimates and tempers overfitting by the tree-based models.

C: 1.0
penalty: L2
max_iter: 1000
scaler: Standard
imputer: Median
calibration: Sigmoid
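The 40/40/20 weights suggest the three models' outputs are blended into one probability. The page does not say how, so the weighted average below is an assumption, and the dictionary keys and sample probabilities are illustrative.

```python
# Weights quoted in the section headings above.
WEIGHTS = {"xgboost": 0.4, "lightgbm": 0.4, "logistic": 0.2}


def blend(probabilities, weights=WEIGHTS):
    """Weighted average of per-model win probabilities for one matchup.

    Assumption: a plain weighted mean; the project may combine models differently.
    """
    total = sum(weights.values())
    return sum(weights[name] * probabilities[name] for name in weights) / total


# Made-up probabilities that team1 wins, one per model.
p_team1_wins = blend({"xgboost": 0.70, "lightgbm": 0.66, "logistic": 0.60})
```

Dividing by the weight total keeps the result a valid probability even if the weights are later changed and no longer sum to 1.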

Using Real Kaggle Data

For production predictions, use real NCAA basketball data from Kaggle. See the full guide at Getting Started.

📥 Data Setup

# 1. Get Kaggle API credentials
# Go to: https://www.kaggle.com/settings/account → Create New Token
# Place kaggle.json at ~/.kaggle/kaggle.json (chmod 600 on Linux/Mac)

# 2. Download NCAA data (~500 MB)
python scripts/download_data.py

# Key files downloaded to data/raw/:
#   MRegularSeasonDetailedResults.csv  — All regular season game results (men's)
#   MNCAATourneyDetailedResults.csv    — Tournament results (training labels)
#   MSeeds.csv                          — Tournament seedings by year
#   MMasseyOrdinals.csv                 — Rankings from 30+ systems
#   W*.csv                              — Corresponding women's files

# 3. Retrain with real data (~2-5 minutes)
python -m src.train --data-dir data/raw

# 4. Backtest to verify performance
python -m src.evaluate --data-dir data/raw

# 5. Generate 2026 predictions
python -m src.predict --data-dir data/raw

Expected improvement: with synthetic data, accuracy is about 55%; with real Kaggle data, expect 65–75%. Historical Kaggle leaderboards show top models at 73–75% on tournament games. A seed-only heuristic baseline achieves about 69%, so treat that figure, rather than 50%, as the bar the ML model needs to clear with full data.
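For reference, a seed-only baseline of the kind mentioned above can be sketched like this, assuming it simply picks the better-seeded team. The function name and the half-credit rule for equal seeds are assumptions, and the toy games are invented, not real results.

```python
def seed_baseline_accuracy(games):
    """Accuracy of always picking the better (lower-numbered) seed.

    games: iterable of (winner_seed, loser_seed) numeric seeds.
    Equal-seed games are treated as a coin flip and earn half credit.
    """
    credit = 0.0
    count = 0
    for winner_seed, loser_seed in games:
        count += 1
        if winner_seed < loser_seed:
            credit += 1.0
        elif winner_seed == loser_seed:
            credit += 0.5
    return credit / count if count else float("nan")


# Invented outcomes: two favorites win, one upset, one equal-seed game.
toy_games = [(1, 16), (5, 12), (11, 6), (1, 1)]
toy_accuracy = seed_baseline_accuracy(toy_games)
```

Running the same function over the backtest seasons gives a like-for-like number to compare against the ensemble's accuracy.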

How to Improve the Model

Ideas for increasing prediction accuracy beyond the baseline ensemble.

🧪 More Features

Edit src/feature_engineering.py to add:

Strength of schedule
Conference strength rating
Tempo and pace statistics
Home/away game splits
Recent form (last 10 games)
Head-to-head historical record

⚙️ Better Hyperparameters

Edit src/model.py and tune with Optuna or scikit-learn's GridSearchCV:

Increase n_estimators (500+)
Tune learning_rate (0.01–0.1)
Adjust subsample ratio
Change ensemble weights
Add CatBoost or Neural Net
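As one concrete example from the list above, ensemble weights can be tuned with a small grid search. The sketch below is dependency-free and swaps in a plain exhaustive search instead of Optuna or GridSearchCV; the function names and toy predictions are hypothetical, and log loss is assumed as the objective.

```python
import math
from itertools import product


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def best_weights(y_true, model_probs, step=0.1):
    """Search non-negative weights summing to 1 that minimize log loss.

    model_probs maps a model name to its list of predicted probabilities.
    Returns (best_score, {model_name: weight}).
    """
    names = sorted(model_probs)
    steps = round(1 / step)
    best = None
    # Choose integer shares for all but the last model; the last gets the rest.
    for combo in product(range(steps + 1), repeat=len(names) - 1):
        if sum(combo) > steps:
            continue
        ws = [c / steps for c in combo] + [(steps - sum(combo)) / steps]
        blended = [
            sum(w * model_probs[n][i] for w, n in zip(ws, names))
            for i in range(len(y_true))
        ]
        score = log_loss(y_true, blended)
        if best is None or score < best[0]:
            best = (score, dict(zip(names, ws)))
    return best


# Toy held-out predictions: model "a" is well calibrated, model "b" is not.
score, weights = best_weights(
    [1, 0, 1, 0],
    {"a": [0.9, 0.1, 0.9, 0.1], "b": [0.4, 0.6, 0.4, 0.6]},
)
```

The weights should be chosen on held-out seasons (for example the walk-forward folds), not on the data the base models were trained on, or the search will favor whichever model overfits most.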

📊 Better Training Strategy

Edit src/train.py to improve the training and cross-validation strategy:

Weight recent seasons more
Add tournament-specific features
Use calibrated probabilities
Optimize for bracket score (not accuracy)
Train a meta-model on ensemble outputs
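Weighting recent seasons more heavily, the first idea in the list above, can be sketched with exponentially decayed sample weights. The function name and the `half_life` parameter are hypothetical, and the decay scheme is one choice among several.

```python
def season_weights(seasons, half_life=5.0):
    """Per-row training weights that decay with the age of the season.

    The most recent season gets weight 1.0; a season `half_life` years
    older gets 0.5, twice that age gets 0.25, and so on.
    """
    latest = max(seasons)
    return [0.5 ** ((latest - season) / half_life) for season in seasons]


# One entry per training row, holding that row's season.
weights = season_weights([2015, 2020, 2025])
```

A list like this can be passed as the `sample_weight` argument that the scikit-learn-style `fit` methods of XGBoost, LightGBM, and LogisticRegression accept.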

Feature Importance

Feature importance averaged across XGBoost and LightGBM.
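One way such an average could be computed is sketched below. Because the two libraries can report importances on different scales, each model's scores are normalized to sum to 1 before averaging; that normalization step, the function name, and the sample numbers are assumptions rather than the project's documented method.

```python
def average_importance(*importances):
    """Average per-feature importance across models.

    Each argument maps feature name -> raw importance for one model.
    Scores are normalized per model so no single model dominates the mean.
    Returns (feature, importance) pairs sorted from most to least important.
    """
    features = set().union(*importances)
    averaged = {}
    for feature in features:
        shares = [imp.get(feature, 0.0) / sum(imp.values()) for imp in importances]
        averaged[feature] = sum(shares) / len(importances)
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Made-up raw importances on two different scales.
xgb_importance = {"seed_diff": 6.0, "win_pct_diff": 3.0, "fg_pct_diff": 1.0}
lgbm_importance = {"seed_diff": 50, "win_pct_diff": 40, "fg_pct_diff": 10}
ranked = average_importance(xgb_importance, lgbm_importance)
```

Taking the first ten entries of the sorted result yields a top-10 list like the ones shown for each tournament.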

Men's Top 10 Features

Women's Top 10 Features

CI/CD Pipeline

Automated GitHub Actions workflows for training, evaluation, and deployment.

🔬 ci.yml — Tests

Runs on every push and PR. Validates data pipelines, feature engineering, model training, and prediction generation. Ensures submission files have the expected number of rows (72,390 for the men's file, 71,631 for the women's).

Triggers: on push, on PR
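The submission format check described for ci.yml could look roughly like this. It covers only the row counts quoted above; the function names and the "ID,Pred" header in the example are illustrative, and the real tests may check more than this.

```python
import csv
import io

# Expected data-row counts quoted above (header row excluded).
EXPECTED_ROWS = {"M": 72_390, "W": 71_631}


def count_data_rows(csv_file):
    """Count rows after the header in an open CSV file object."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row if present
    return sum(1 for _ in reader)


def check_submission(csv_file, division):
    """Raise ValueError if the file's row count does not match the division."""
    found = count_data_rows(csv_file)
    expected = EXPECTED_ROWS[division]
    if found != expected:
        raise ValueError(f"{division}: expected {expected} rows, found {found}")
    return found


# Tiny in-memory file standing in for a real submission.
sample = io.StringIO("ID,Pred\n2026_1101_1102,0.5\n2026_1101_1103,0.5\n")
sample_rows = count_data_rows(sample)
```

In a test suite, the same check would be called with the generated submission file opened via `open(path, newline="")`.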

🤖 train.yml — Training

Manual trigger or weekly schedule. Downloads latest data, retrains models, runs evaluation, generates predictions, and commits results back to the repo.

Triggers: manual, weekly schedule

🌐 pages.yml — Deploy

Deploys the docs/ directory to GitHub Pages on every push to main. Copies model metrics and backtest scores to the site's data directory.

Trigger: push to main
