Everything you need to know to go from zero to submitted predictions before the March 17 deadline.
You don't need to start from scratch. Here's the current state of the project.
predictions/MNCAATourneyPredictions.csv (72,010 rows)
predictions/WNCAATourneyPredictions.csv (71,253 rows)
The prediction files already exist. Just download and submit them; no setup required. Accuracy will be competitive out of the box, since the models are already trained on real NCAA data.
Use real NCAA data from Kaggle to train a much better model. Expected accuracy: 65–75%. Takes 20–30 minutes total.
Follow these steps in order for the best results. Steps 1–3 are one-time setup.
Create an API token from your Kaggle account settings; a kaggle.json file downloads. It looks like:
{"username": "yourusername", "key": "your-api-key-abc123"}
# Linux / macOS:
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
# Windows (PowerShell):
mkdir $env:USERPROFILE\.kaggle
move $env:USERPROFILE\Downloads\kaggle.json $env:USERPROFILE\.kaggle\kaggle.json
chmod 600 on Linux/Mac sets file permissions so only you can read the file. Kaggle will refuse to work if the permissions are too open.
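If you want to sanity-check the credentials file before running anything, a small script can verify its location, permissions, and contents. This is a hypothetical helper (check_kaggle_credentials is not part of the repo), sketched from the rules above:

```python
import json
import os
import stat
from pathlib import Path

def check_kaggle_credentials(path: Path) -> list:
    """Return a list of problems with a kaggle.json file (empty = looks OK)."""
    problems = []
    if not path.exists():
        return [f"{path} not found"]
    mode = stat.S_IMODE(path.stat().st_mode)
    if os.name != "nt" and mode & 0o077:
        # Kaggle refuses overly open files; 0o600 = readable by owner only.
        problems.append(f"permissions are {oct(mode)}, expected 0o600")
    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return problems + ["file is not valid JSON"]
    for key in ("username", "key"):
        if not creds.get(key):
            problems.append(f"missing '{key}' field")
    return problems

print(check_kaggle_credentials(Path.home() / ".kaggle" / "kaggle.json"))
```

An empty list means the file should be usable by the Kaggle CLI.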
# Clone the repository
git clone https://github.com/Qrytics/cmuMarchMadness-ML
cd cmuMarchMadness-ML
# Install Python dependencies
pip install -r requirements.txt
If pip doesn't work, try pip3. If you're using a virtual environment, activate it first.
python scripts/download_data.py
This downloads the March Machine Learning Mania dataset from Kaggle into data/raw/. Key files:
| File | Contents |
|---|---|
| MRegularSeasonDetailedResults.csv | Full box scores for every regular season game (men's) |
| MNCAATourneyDetailedResults.csv | Tournament results – used as training labels |
| MSeeds.csv | Tournament seedings by year |
| MMasseyOrdinals.csv | Rankings from 30+ systems (NET, KPI, KenPom, SAG, ...) |
| WRegularSeasonDetailedResults.csv | Women's regular season games |
| WNCAATourneyDetailedResults.csv | Women's tournament results |
| WSeeds.csv | Women's seedings |
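As a quick orientation to the seed files: in the Kaggle dataset each row is a (Season, Seed, TeamID) triple, where seeds are strings like "W01" (region W, 1-seed) or "X16a" (a play-in team). A minimal pandas sketch of turning those strings into numeric seeds (the inline DataFrame is illustrative, not real data):

```python
import pandas as pd

# A few rows in the shape of the seeds file: one row per (Season, Seed, TeamID).
seeds = pd.DataFrame({
    "Season": [2024, 2024, 2024],
    "Seed": ["W01", "X16a", "Y08"],  # region letter + seed + optional play-in suffix
    "TeamID": [1101, 1202, 1303],
})

# Strip the region letter and play-in suffix, keeping just the number.
seeds["SeedNum"] = seeds["Seed"].str.extract(r"(\d+)", expand=False).astype(int)
print(seeds[["Seed", "SeedNum"]])
```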
python scripts/generate_sample_data.py
This creates realistic-but-fake data in data/sample/. The pipeline will work end-to-end, but accuracy will be lower (~55%).
# Train both Men's and Women's models with real data:
python -m src.train --data-dir data/raw
# Or train each gender separately:
python -m src.train --gender M --data-dir data/raw
python -m src.train --gender W --data-dir data/raw
# Using synthetic data instead:
python -m src.train
You'll see walk-forward cross-validation output like:
Season 2020: acc=0.671 logloss=0.612 auc=0.714
Season 2021: acc=0.659 logloss=0.624 auc=0.698
Season 2022: acc=0.683 logloss=0.599 auc=0.731
...
Walk-forward CV Results:
Accuracy: 0.671 ± 0.012
Log Loss: 0.612 ± 0.010
AUC: 0.714 ± 0.015
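The summary block is just the mean and (sample) standard deviation of the per-season scores. A quick sketch using the three illustrative accuracies printed above:

```python
import statistics

# Per-season accuracies like the ones printed above (illustrative numbers).
season_acc = {2020: 0.671, 2021: 0.659, 2022: 0.683}

mean_acc = statistics.mean(season_acc.values())
std_acc = statistics.stdev(season_acc.values())
print(f"Accuracy: {mean_acc:.3f} ± {std_acc:.3f}")  # Accuracy: 0.671 ± 0.012
```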
Omitting the --data-dir argument falls back to the synthetic sample data.
# Backtest the model on historical tournament seasons:
python -m src.evaluate --data-dir data/raw
# Evaluate specific seasons:
python -m src.evaluate --gender M --seasons 2022 2023 2024 --data-dir data/raw
This generates per-season accuracy reports and bracket simulation scores. Results are saved to models/m_backtest_scores.json and evaluation plots to docs/assets/.
# Generate all possible matchup predictions:
python -m src.predict --data-dir data/raw
# Using synthetic data:
python -m src.predict
Output files:
predictions/MNCAATourneyPredictions.csv – 72,010 rows (all Division I men's team pairs)
predictions/WNCAATourneyPredictions.csv – 71,253 rows (all Division I women's team pairs)
# Export updated model metrics and predictions to docs/data/:
python scripts/export_site_data.py
# Commit and push to trigger automatic deployment:
git add -A
git commit -m "Update predictions and model metrics after retraining"
git push
GitHub Actions automatically deploys the docs/ directory to GitHub Pages within about 60 seconds after a push to main. The dashboard will then show your updated model accuracy, backtest scores, and feature importance.
Submit both of these files to the competition form. See the Submit page for the full checklist.
| File | Rows | Team ID Range | Columns |
|---|---|---|---|
| MNCAATourneyPredictions.csv | 72,010 | 1000–1999 | WTeamID, LTeamID |
| WNCAATourneyPredictions.csv | 71,253 | 3000–3999 | WTeamID, LTeamID |
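Before submitting, it can be worth sanity-checking that every team ID falls in the expected range. A hypothetical validator (validate_predictions is not part of the repo; it assumes the WTeamID/LTeamID columns shown in the table):

```python
import csv
import io

def validate_predictions(rows, lo, hi):
    """Hypothetical sanity check: every team ID must fall in [lo, hi]."""
    for row in rows:
        for col in ("WTeamID", "LTeamID"):
            if not lo <= int(row[col]) <= hi:
                return False
    return True

# Toy two-row file in the same shape as the prediction CSVs.
sample = io.StringIO("WTeamID,LTeamID\n1101,1202\n1303,1404\n")
rows = list(csv.DictReader(sample))
print(validate_predictions(rows, 1000, 1999))  # men's ID range
```

In practice you would pass the rows of the real CSVs and also assert the expected row counts (72,010 and 71,253).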
Set up GitHub Actions to retrain and update predictions automatically – without running anything locally.
Go to your repository on GitHub: Settings → Secrets and variables → Actions → New repository secret
| Secret Name | Value |
|---|---|
| KAGGLE_USERNAME | Your Kaggle username (e.g., johndoe) |
| KAGGLE_KEY | Your Kaggle API key (the long string from kaggle.json) |
Once the secrets are set, trigger a retrain by going to Actions → train.yml → Run workflow.
The train.yml workflow also runs on a weekly schedule, keeping predictions fresh as new NCAA season data becomes available.
No – the prediction files already exist. You can download MNCAATourneyPredictions.csv and WNCAATourneyPredictions.csv directly from this site and submit them immediately. Running the full pipeline is only needed if you want to improve accuracy.
Regular: all round picks are locked in before the tournament starts, so upsets in early rounds cascade and affect your later picks.
Progressive: after each round, the actual winners are used to set up the next round's matchups, and new predictions are made. This avoids cascading errors and typically scores higher.
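The progressive format's re-pairing step can be sketched in a few lines: after a round completes, adjacent winners meet in the next round. This is an illustration, not the repo's bracket code:

```python
def next_round(winners):
    """Pair adjacent winners to form the next round's matchups."""
    return [(winners[i], winners[i + 1]) for i in range(0, len(winners), 2)]

# Four round winners (placeholder names) become two new matchups:
print(next_round(["A", "B", "C", "D"]))  # [('A', 'B'), ('C', 'D')]
```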
The CMU competition format requires a full prediction matrix for all C(N, 2) team pairs. This way, whatever matchup actually occurs in the tournament, the judge already has your predicted outcome without needing to ask you again.
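The row counts follow directly from that: with N teams there are C(N, 2) = N(N−1)/2 unordered pairs. For example, 380 men's Division I teams gives exactly the 72,010 rows above:

```python
from itertools import combinations

# 380 men's Division I teams -> one prediction per unordered pair.
team_ids = range(1000, 1380)  # illustrative contiguous IDs
pairs = list(combinations(team_ids, 2))
print(len(pairs))  # 380 * 379 // 2 = 72010
```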
With synthetic data: ~55% (barely above random guessing). With real Kaggle NCAA data: 65–75%, which is competitive with top submissions on the Kaggle leaderboard. Seed-based heuristics alone give ~69%; the ML model should match or beat that with real data.
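For intuition, a seed-only baseline can be as simple as a logistic curve in the seed gap. This toy function is not the repo's model, and the steepness constant k is made up:

```python
import math

def seed_win_prob(seed_a: int, seed_b: int, k: float = 0.18) -> float:
    """Toy heuristic: logistic curve in the seed gap. Equal seeds -> 0.5;
    a 1-seed vs a 16-seed is a heavy favorite. k is a made-up constant."""
    return 1.0 / (1.0 + math.exp(-k * (seed_b - seed_a)))

print(seed_win_prob(5, 5))               # 0.5 (equal seeds: coin flip)
print(round(seed_win_prob(1, 16), 3))    # strong favorite
```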
Instead of randomly splitting data for validation, we simulate real deployment: train on all seasons up to year N, test on year N+1, then train up to N+1 and test on N+2, etc. This prevents data leakage (using future tournament data to predict past results) and gives realistic performance estimates.
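The split logic can be sketched in a few lines (walk_forward_splits is an illustration, not the repo's implementation):

```python
def walk_forward_splits(seasons):
    """Yield (train_seasons, test_season): each season is predicted using
    only the seasons that came before it, so no future data leaks in."""
    ordered = sorted(seasons)
    for i in range(1, len(ordered)):
        yield ordered[:i], ordered[i]

for train, test in walk_forward_splits([2019, 2020, 2021, 2022]):
    print(train, "->", test)
```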
Yes! The best places to improve are:
(1) src/feature_engineering.py – add more features (conference strength, strength of schedule, tempo stats, etc.)
(2) src/model.py – change model hyperparameters or add new models to the ensemble
(3) src/train.py – change the training strategy (e.g., tune hyperparameters with Optuna)
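As a starting point for (1), here is a tiny feature-engineering sketch: each team's average scoring margin, computed from results in the Kaggle column format. The inline DataFrame is toy data, and this feature is a suggestion, not one the repo necessarily uses:

```python
import pandas as pd

# Toy regular-season results in the Kaggle column format.
games = pd.DataFrame({
    "WTeamID": [1101, 1101, 1202],
    "LTeamID": [1202, 1303, 1303],
    "WScore": [80, 75, 70],
    "LScore": [70, 60, 65],
})

# New feature: each team's average scoring margin (positive = convincing wins).
wins = games.assign(TeamID=games.WTeamID, Margin=games.WScore - games.LScore)
losses = games.assign(TeamID=games.LTeamID, Margin=games.LScore - games.WScore)
margin = pd.concat([wins, losses])[["TeamID", "Margin"]].groupby("TeamID").mean()
print(margin)
```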