ML-powered NCAA tournament bracket predictions using an XGBoost + LightGBM + Logistic Regression ensemble. Trained on 15+ years of basketball data.
Jump to the most important pages and resources.
Second Annual CMU March Madness Machine Learning Competition. Deadline: March 17 at noon EDT.
Walk-forward cross-validation: trained on prior seasons, tested on each subsequent season. No data leakage.
Ensemble of three complementary ML models, each contributing different strengths to the final prediction.
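One simple way to combine the three models is a weighted average of their predicted win probabilities. The sketch below assumes equal weights and plain Python lists; the repo's actual blending weights and interfaces may differ.

```python
def blend_probabilities(p_xgb, p_lgbm, p_logreg, weights=(1/3, 1/3, 1/3)):
    """Blend per-matchup win probabilities from three models by weighted average.

    Each argument is a list of probabilities, one per matchup. Equal weights
    are an illustrative assumption, not the repo's tuned values.
    """
    w1, w2, w3 = weights
    return [w1 * a + w2 * b + w3 * c for a, b, c in zip(p_xgb, p_lgbm, p_logreg)]

# Three models disagree on one matchup; the blend averages them out.
blended = blend_probabilities([0.70], [0.60], [0.50])
```

Averaging calibrated probabilities tends to reduce variance relative to any single model, which is the usual motivation for this kind of ensemble.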
15+ seasons of NCAA basketball data including regular season games, tournament results, team rankings (Massey/NET/KPI), and seedings. Features include all available box-score statistics.
Per-team season averages → differential features (team A minus team B). 28 features per matchup including win%, point differential, shooting efficiency, rebounds, assists, turnovers, steals, blocks, and rankings.
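The differential construction above can be sketched as follows. The stat names and dict layout here are illustrative placeholders, not the repo's actual 28-feature schema.

```python
def matchup_features(team_a_stats, team_b_stats, feature_names):
    """Build differential features: team A's season average minus team B's.

    team_a_stats / team_b_stats: dicts of per-team season averages.
    Returns one feature vector (list of floats) for the matchup.
    """
    return [team_a_stats[f] - team_b_stats[f] for f in feature_names]

# Hypothetical two-feature example (the real pipeline uses 28 features).
a = {"win_pct": 0.80, "point_diff": 9.5}
b = {"win_pct": 0.65, "point_diff": 4.0}
feats = matchup_features(a, b, ["win_pct", "point_diff"])
```

A nice property of differential features is antisymmetry: swapping the two teams flips the sign of every feature, so the model sees matchups consistently regardless of ordering.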
Models trained on seasons up to year N, validated on season N+1. This prevents data leakage and reflects real-world deployment where we only have past data at prediction time.
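The walk-forward split can be sketched as a generator over sorted seasons; this is a minimal illustration of the scheme described above, not the repo's actual splitter.

```python
def walk_forward_splits(seasons):
    """Yield (train_seasons, test_season) pairs for walk-forward validation.

    Each split trains on every season strictly before the test season,
    mirroring real deployment where only past data is available.
    """
    seasons = sorted(seasons)
    for i in range(1, len(seasons)):
        yield seasons[:i], seasons[i]

splits = list(walk_forward_splits([2019, 2020, 2021, 2022]))
# First split trains on 2019 only and tests on 2020; the last trains on
# 2019-2021 and tests on 2022. No test season ever leaks into training.
```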
All C(381,2) = 72,390 men's team pairs and C(379,2) = 71,631 women's pairs are predicted. Output: CSV files with WTeamID/LTeamID columns per competition format.
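Enumerating every unordered team pair is straightforward with `itertools.combinations`; the team-ID range below is a hypothetical placeholder for illustration, but the pair count matches C(381, 2) from the text.

```python
from itertools import combinations

def all_matchup_pairs(team_ids):
    """All unordered team pairs, each ordered (low_id, high_id)."""
    return list(combinations(sorted(team_ids), 2))

# 381 hypothetical men's team IDs -> C(381, 2) = 72,390 pairs.
n_mens = len(all_matchup_pairs(range(1101, 1101 + 381)))
```

Each pair then gets one prediction row in the output CSV, keyed by the two team IDs.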
Both Regular (pre-tournament) and Progressive (updated after each round) brackets are supported. Historical backtesting validates performance across previous tournaments.
Top features driving predictions (averaged across XGBoost and LightGBM). Higher = more influential.
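Averaging importances across the two tree models can be sketched as below. Normalizing each model's importances first is an assumption made here so the two scales are comparable; the repo may aggregate differently.

```python
def average_importances(imp_xgb, imp_lgbm):
    """Average feature importances from two models over shared feature names.

    Each input is a dict of raw importance scores. Scores are normalized
    per model (so they sum to 1) before averaging -- an assumption made
    for illustration, since XGBoost and LightGBM report different scales.
    """
    def normalize(d):
        total = sum(d.values())
        return {k: v / total for k, v in d.items()}

    a, b = normalize(imp_xgb), normalize(imp_lgbm)
    return {k: (a[k] + b[k]) / 2 for k in a}

# Hypothetical raw scores from the two models.
avg = average_importances({"win_pct": 30, "point_diff": 70},
                          {"win_pct": 50, "point_diff": 50})
```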
Points are awarded for each correct prediction, and the value doubles each round, so a correct champion pick is worth 32 points!
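The doubling schedule can be written as a one-liner. A base value of 1 point in round 1 is assumed here; it yields the 32-point champion pick stated above.

```python
def round_points(round_number, base=1):
    """Points per correct pick double each round: 1, 2, 4, 8, 16, 32.

    round_number is 1-indexed (1 = first round, 6 = championship game).
    base=1 is an assumed first-round value consistent with a 32-point final.
    """
    return base * 2 ** (round_number - 1)

# A correct champion pick (round 6) is worth 32 points.
champ = round_points(6)
```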
Get up and running with the full pipeline. See the detailed guide →
# 1. Clone and install
git clone https://github.com/Qrytics/cmuMarchMadness-ML
cd cmuMarchMadness-ML
pip install -r requirements.txt
# 2a. Use real Kaggle data (recommended; see getting-started.html)
# Place ~/.kaggle/kaggle.json first, then:
python scripts/download_data.py
python -m src.train --data-dir data/raw
python -m src.predict --data-dir data/raw
# 2b. Or use synthetic sample data (quick start, lower accuracy)
python scripts/generate_sample_data.py
python -m src.train
python -m src.predict
# 3. Evaluate historical performance
python -m src.evaluate --data-dir data/raw
# 4. Update this dashboard
python scripts/export_site_data.py
# Submission files:
# predictions/MNCAATourneyPredictions.csv (72,010 rows) -- submit this
# predictions/WNCAATourneyPredictions.csv (71,253 rows) -- and this