ML-powered NCAA tournament bracket predictions using an XGBoost + LightGBM + Logistic Regression ensemble. Trained on 15+ years of basketball data.
Jump to the most important pages and resources.
Second Annual CMU March Madness Machine Learning Competition. Deadline: March 17 at noon EDT.
Walk-forward cross-validation: trained on prior seasons, tested on each subsequent season. No data leakage.
Ensemble of three complementary ML models, each contributing different strengths to the final prediction.
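One simple way to combine the three models is a weighted average of their predicted win probabilities. The sketch below assumes equal weights and plain Python lists; the repo's actual blending weights and interfaces may differ.

```python
def blend_probabilities(p_xgb, p_lgbm, p_logreg, weights=(1/3, 1/3, 1/3)):
    """Blend per-matchup win probabilities from three models by weighted average.

    Each argument is a list of probabilities, one per matchup. Equal weights
    are an illustrative assumption, not the repo's tuned values.
    """
    w1, w2, w3 = weights
    return [w1 * a + w2 * b + w3 * c for a, b, c in zip(p_xgb, p_lgbm, p_logreg)]

# Three models disagree on one matchup; the blend averages them out.
blended = blend_probabilities([0.70], [0.60], [0.50])
```

Averaging calibrated probabilities tends to reduce variance relative to any single model, which is the usual motivation for this kind of ensemble.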
15+ seasons of NCAA basketball data including regular season games, tournament results, team rankings (Massey/NET/KPI), and seedings. Features include all available box-score statistics.
Per-team season averages → differential features (team A minus team B). 28 features per matchup including win%, point differential, shooting efficiency, rebounds, assists, turnovers, steals, blocks, and rankings.
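The differential construction above can be sketched as follows. The stat names and dict layout here are illustrative placeholders, not the repo's actual 28-feature schema.

```python
def matchup_features(team_a_stats, team_b_stats, feature_names):
    """Build differential features: team A's season average minus team B's.

    team_a_stats / team_b_stats: dicts of per-team season averages.
    Returns one feature vector (list of floats) for the matchup.
    """
    return [team_a_stats[f] - team_b_stats[f] for f in feature_names]

# Hypothetical two-feature example (the real pipeline uses 28 features).
a = {"win_pct": 0.80, "point_diff": 9.5}
b = {"win_pct": 0.65, "point_diff": 4.0}
feats = matchup_features(a, b, ["win_pct", "point_diff"])
```

A nice property of differential features is antisymmetry: swapping the two teams flips the sign of every feature, so the model sees matchups consistently regardless of ordering.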
Models trained on seasons up to year N, validated on season N+1. This prevents data leakage and reflects real-world deployment where we only have past data at prediction time.
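The walk-forward split can be sketched as a generator over sorted seasons; this is a minimal illustration of the scheme described above, not the repo's actual splitter.

```python
def walk_forward_splits(seasons):
    """Yield (train_seasons, test_season) pairs for walk-forward validation.

    Each split trains on every season strictly before the test season,
    mirroring real deployment where only past data is available.
    """
    seasons = sorted(seasons)
    for i in range(1, len(seasons)):
        yield seasons[:i], seasons[i]

splits = list(walk_forward_splits([2019, 2020, 2021, 2022]))
# First split trains on 2019 only and tests on 2020; the last trains on
# 2019-2021 and tests on 2022. No test season ever leaks into training.
```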
All C(381,2) = 72,390 men's team pairs and C(379,2) = 71,631 women's pairs are predicted. Output: CSV files with WTeamID/LTeamID columns per competition format.
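Enumerating every unordered team pair is straightforward with `itertools.combinations`; the team-ID range below is a hypothetical placeholder for illustration, but the pair count matches C(381, 2) from the text.

```python
from itertools import combinations

def all_matchup_pairs(team_ids):
    """All unordered team pairs, each ordered (low_id, high_id)."""
    return list(combinations(sorted(team_ids), 2))

# 381 hypothetical men's team IDs -> C(381, 2) = 72,390 pairs.
n_mens = len(all_matchup_pairs(range(1101, 1101 + 381)))
```

Each pair then gets one prediction row in the output CSV, keyed by the two team IDs.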
Both Regular (pre-tournament) and Progressive (updated after each round) brackets are supported. Historical backtesting validates performance across previous tournaments.
Top features driving predictions (averaged across XGBoost and LightGBM). Higher = more influential.
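Averaging importances across the two tree models can be sketched as below. Normalizing each model's importances first is an assumption made here so the two scales are comparable; the repo may aggregate differently.

```python
def average_importances(imp_xgb, imp_lgbm):
    """Average feature importances from two models over shared feature names.

    Each input is a dict of raw importance scores. Scores are normalized
    per model (so they sum to 1) before averaging -- an assumption made
    for illustration, since XGBoost and LightGBM report different scales.
    """
    def normalize(d):
        total = sum(d.values())
        return {k: v / total for k, v in d.items()}

    a, b = normalize(imp_xgb), normalize(imp_lgbm)
    return {k: (a[k] + b[k]) / 2 for k in a}

# Hypothetical raw scores from the two models.
avg = average_importances({"win_pct": 30, "point_diff": 70},
                          {"win_pct": 50, "point_diff": 50})
```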
Points are awarded for each correct prediction, and the value doubles each round, so a correct champion pick is worth 32 points!
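The doubling schedule can be written as a one-liner. A base value of 1 point in round 1 is assumed here; it yields the 32-point champion pick stated above.

```python
def round_points(round_number, base=1):
    """Points per correct pick double each round: 1, 2, 4, 8, 16, 32.

    round_number is 1-indexed (1 = first round, 6 = championship game).
    base=1 is an assumed first-round value consistent with a 32-point final.
    """
    return base * 2 ** (round_number - 1)

# A correct champion pick (round 6) is worth 32 points.
champ = round_points(6)
```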
Get up and running with the full pipeline. See the detailed guide →
# 1. Clone and install
git clone https://github.com/Qrytics/cmuMarchMadness-ML
cd cmuMarchMadness-ML
pip install -r requirements.txt
# 2a. Use real Kaggle data (recommended; see getting-started.html)
# Place ~/.kaggle/kaggle.json first, then:
python scripts/download_data.py
python -m src.train --data-dir data/raw
python -m src.predict --data-dir data/raw
# 2b. Or use synthetic sample data (quick start, lower accuracy)
python scripts/generate_sample_data.py
python -m src.train
python -m src.predict
# 3. Evaluate historical performance
python -m src.evaluate --data-dir data/raw
# 4. Update this dashboard
python scripts/export_site_data.py
# Submission files:
# predictions/MNCAATourneyPredictions.csv (72,010 rows) -- submit this
# predictions/WNCAATourneyPredictions.csv (71,253 rows) -- and this