Everything you need to know to go from zero to submitted predictions before the March 17 deadline.
You don't need to start from scratch. Here's the current state of the project.
predictions/MNCAATourneyPredictions.csv (72,010 rows)
predictions/WNCAATourneyPredictions.csv (71,253 rows)
The prediction files already exist. Just download and submit them; no setup required. Accuracy will be competitive out of the box, since the models are already trained on real NCAA data.
Use real NCAA data from Kaggle to train a much better model. Expected accuracy: 65–75%. Takes 20–30 minutes total.
Follow these steps in order for the best results. Steps 1–3 are one-time setup.
Create an API token from your Kaggle account settings; a kaggle.json file downloads. It looks like:
{"username": "yourusername", "key": "your-api-key-abc123"}
# Linux / macOS:
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
# Windows (PowerShell):
mkdir $env:USERPROFILE\.kaggle
move $env:USERPROFILE\Downloads\kaggle.json $env:USERPROFILE\.kaggle\kaggle.json
chmod 600 on Linux/Mac sets file permissions so only you can read the file. Kaggle will refuse to work if the permissions are too open.
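If you want to sanity-check the credentials file before running anything, a small script can verify its location, permissions, and contents. This is a hypothetical helper (check_kaggle_credentials is not part of the repo), sketched from the rules above:

```python
import json
import os
import stat
from pathlib import Path

def check_kaggle_credentials(path: Path) -> list:
    """Return a list of problems with a kaggle.json file (empty = looks OK)."""
    problems = []
    if not path.exists():
        return [f"{path} not found"]
    mode = stat.S_IMODE(path.stat().st_mode)
    if os.name != "nt" and mode & 0o077:
        # Kaggle refuses overly open files; 0o600 = readable by owner only.
        problems.append(f"permissions are {oct(mode)}, expected 0o600")
    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return problems + ["file is not valid JSON"]
    for key in ("username", "key"):
        if not creds.get(key):
            problems.append(f"missing '{key}' field")
    return problems

print(check_kaggle_credentials(Path.home() / ".kaggle" / "kaggle.json"))
```

An empty list means the file should be usable by the Kaggle CLI.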
# Clone the repository
git clone https://github.com/Qrytics/cmuMarchMadness-ML
cd cmuMarchMadness-ML
# Install Python dependencies
pip install -r requirements.txt
If pip doesn't work, try pip3. If you're using a virtual environment, activate it first.
python scripts/download_data.py
This downloads the March Machine Learning Mania dataset from Kaggle into data/raw/. Key files:
| File | Contents |
|---|---|
| MRegularSeasonDetailedResults.csv | Full box scores for every regular season game (men's) |
| MNCAATourneyDetailedResults.csv | Tournament results – used as training labels |
| MSeeds.csv | Tournament seedings by year |
| MMasseyOrdinals.csv | Rankings from 30+ systems (NET, KPI, KenPom, SAG, ...) |
| WRegularSeasonDetailedResults.csv | Women's regular season games |
| WNCAATourneyDetailedResults.csv | Women's tournament results |
| WSeeds.csv | Women's seedings |
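As a quick orientation to the seed files: in the Kaggle dataset each row is a (Season, Seed, TeamID) triple, where seeds are strings like "W01" (region W, 1-seed) or "X16a" (a play-in team). A minimal pandas sketch of turning those strings into numeric seeds (the inline DataFrame is illustrative, not real data):

```python
import pandas as pd

# A few rows in the shape of the seeds file: one row per (Season, Seed, TeamID).
seeds = pd.DataFrame({
    "Season": [2024, 2024, 2024],
    "Seed": ["W01", "X16a", "Y08"],  # region letter + seed + optional play-in suffix
    "TeamID": [1101, 1202, 1303],
})

# Strip the region letter and play-in suffix, keeping just the number.
seeds["SeedNum"] = seeds["Seed"].str.extract(r"(\d+)", expand=False).astype(int)
print(seeds[["Seed", "SeedNum"]])
```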
python scripts/generate_sample_data.py
This creates realistic-but-fake data in data/sample/. The pipeline will work end-to-end, but accuracy will be lower (~55%).
# Train both Men's and Women's models with real data:
python -m src.train --data-dir data/raw
# Or train each gender separately:
python -m src.train --gender M --data-dir data/raw
python -m src.train --gender W --data-dir data/raw
# Using synthetic data instead:
python -m src.train
You'll see walk-forward cross-validation output like:
Season 2020: acc=0.671 logloss=0.612 auc=0.714
Season 2021: acc=0.659 logloss=0.624 auc=0.698
Season 2022: acc=0.683 logloss=0.599 auc=0.731
...
Walk-forward CV Results:
Accuracy: 0.671 ± 0.012
Log Loss: 0.612 ± 0.010
AUC: 0.714 ± 0.015
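The summary block is just the mean and (sample) standard deviation of the per-season scores. A quick sketch using the three illustrative accuracies printed above:

```python
import statistics

# Per-season accuracies like the ones printed above (illustrative numbers).
season_acc = {2020: 0.671, 2021: 0.659, 2022: 0.683}

mean_acc = statistics.mean(season_acc.values())
std_acc = statistics.stdev(season_acc.values())
print(f"Accuracy: {mean_acc:.3f} ± {std_acc:.3f}")  # Accuracy: 0.671 ± 0.012
```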
Omitting the --data-dir argument falls back to the synthetic sample data.
# Backtest the model on historical tournament seasons:
python -m src.evaluate --data-dir data/raw
# Evaluate specific seasons:
python -m src.evaluate --gender M --seasons 2022 2023 2024 --data-dir data/raw
This generates per-season accuracy reports and bracket simulation scores. Results are saved to models/m_backtest_scores.json and evaluation plots to docs/assets/.
# Generate all possible matchup predictions:
python -m src.predict --data-dir data/raw
# Using synthetic data:
python -m src.predict
Output files:
predictions/MNCAATourneyPredictions.csv – 72,010 rows (all Division I men's team pairs)
predictions/WNCAATourneyPredictions.csv – 71,253 rows (all Division I women's team pairs)
# Export updated model metrics and predictions to docs/data/:
python scripts/export_site_data.py
# Commit and push to trigger automatic deployment:
git add -A
git commit -m "Update predictions and model metrics after retraining"
git push
GitHub Actions automatically deploys the docs/ directory to GitHub Pages within about 60 seconds after a push to main. The dashboard will then show your updated model accuracy, backtest scores, and feature importance.
Submit both of these files to the competition form. See the Submit page for the full checklist.
| File | Rows | Team ID Range | Columns |
|---|---|---|---|
| MNCAATourneyPredictions.csv | 72,010 | 1000–1999 | WTeamID, LTeamID |
| WNCAATourneyPredictions.csv | 71,253 | 3000–3999 | WTeamID, LTeamID |
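Before submitting, it can be worth sanity-checking that every team ID falls in the expected range. A hypothetical validator (validate_predictions is not part of the repo; it assumes the WTeamID/LTeamID columns shown in the table):

```python
import csv
import io

def validate_predictions(rows, lo, hi):
    """Hypothetical sanity check: every team ID must fall in [lo, hi]."""
    for row in rows:
        for col in ("WTeamID", "LTeamID"):
            if not lo <= int(row[col]) <= hi:
                return False
    return True

# Toy two-row file in the same shape as the prediction CSVs.
sample = io.StringIO("WTeamID,LTeamID\n1101,1202\n1303,1404\n")
rows = list(csv.DictReader(sample))
print(validate_predictions(rows, 1000, 1999))  # men's ID range
```

In practice you would pass the rows of the real CSVs and also assert the expected row counts (72,010 and 71,253).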
Set up GitHub Actions to retrain and update predictions automatically – without running anything locally.
Go to your repository on GitHub: Settings → Secrets and variables → Actions → New repository secret
| Secret Name | Value |
|---|---|
| KAGGLE_USERNAME | Your Kaggle username (e.g., johndoe) |
| KAGGLE_KEY | Your Kaggle API key (the long string from kaggle.json) |
Once the secrets are set, trigger a retrain by going to Actions → train.yml → Run workflow.
The train.yml workflow also runs on a weekly schedule, keeping predictions fresh as new NCAA season data becomes available.
No – the prediction files already exist. You can download MNCAATourneyPredictions.csv and WNCAATourneyPredictions.csv directly from this site and submit them immediately. Running the full pipeline is only needed if you want to improve accuracy.
Regular: all round picks are locked in before the tournament starts, so upsets in early rounds cascade and affect your later picks.
Progressive: after each round, the actual winners are used to set up the next round's matchups, and new predictions are made. This avoids cascading errors and typically scores higher.
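The progressive format's re-pairing step can be sketched in a few lines: after a round completes, adjacent winners meet in the next round. This is an illustration, not the repo's bracket code:

```python
def next_round(winners):
    """Pair adjacent winners to form the next round's matchups."""
    return [(winners[i], winners[i + 1]) for i in range(0, len(winners), 2)]

# Four round winners (placeholder names) become two new matchups:
print(next_round(["A", "B", "C", "D"]))  # [('A', 'B'), ('C', 'D')]
```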
The CMU competition format requires a full prediction matrix for all C(N, 2) team pairs. This way, whatever matchup actually occurs in the tournament, the judge already has your predicted outcome without needing to ask you again.
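The row counts follow directly from that: with N teams there are C(N, 2) = N(N−1)/2 unordered pairs. For example, 380 men's Division I teams gives exactly the 72,010 rows above:

```python
from itertools import combinations

# 380 men's Division I teams -> one prediction per unordered pair.
team_ids = range(1000, 1380)  # illustrative contiguous IDs
pairs = list(combinations(team_ids, 2))
print(len(pairs))  # 380 * 379 // 2 = 72010
```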
With synthetic data: ~55% (barely above random guessing). With real Kaggle NCAA data: 65–75%, which is competitive with top submissions on the Kaggle leaderboard. Seed-based heuristics alone give ~69%; the ML model should match or beat that with real data.
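For intuition, a seed-only baseline can be as simple as a logistic curve in the seed gap. This toy function is not the repo's model, and the steepness constant k is made up:

```python
import math

def seed_win_prob(seed_a: int, seed_b: int, k: float = 0.18) -> float:
    """Toy heuristic: logistic curve in the seed gap. Equal seeds -> 0.5;
    a 1-seed vs a 16-seed is a heavy favorite. k is a made-up constant."""
    return 1.0 / (1.0 + math.exp(-k * (seed_b - seed_a)))

print(seed_win_prob(5, 5))               # 0.5 (equal seeds: coin flip)
print(round(seed_win_prob(1, 16), 3))    # strong favorite
```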
Instead of randomly splitting data for validation, we simulate real deployment: train on all seasons up to year N, test on year N+1, then train up to N+1 and test on N+2, etc. This prevents data leakage (using future tournament data to predict past results) and gives realistic performance estimates.
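The split logic can be sketched in a few lines (walk_forward_splits is an illustration, not the repo's implementation):

```python
def walk_forward_splits(seasons):
    """Yield (train_seasons, test_season): each season is predicted using
    only the seasons that came before it, so no future data leaks in."""
    ordered = sorted(seasons)
    for i in range(1, len(ordered)):
        yield ordered[:i], ordered[i]

for train, test in walk_forward_splits([2019, 2020, 2021, 2022]):
    print(train, "->", test)
```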
Yes! The best places to improve are:
(1) src/feature_engineering.py – add more features (conference strength, strength of schedule, tempo stats, etc.)
(2) src/model.py – change model hyperparameters or add new models to the ensemble
(3) src/train.py – change the training strategy (e.g., tune hyperparameters with Optuna)
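As a starting point for (1), here is a tiny feature-engineering sketch: each team's average scoring margin, computed from results in the Kaggle column format. The inline DataFrame is toy data, and this feature is a suggestion, not one the repo necessarily uses:

```python
import pandas as pd

# Toy regular-season results in the Kaggle column format.
games = pd.DataFrame({
    "WTeamID": [1101, 1101, 1202],
    "LTeamID": [1202, 1303, 1303],
    "WScore": [80, 75, 70],
    "LScore": [70, 60, 65],
})

# New feature: each team's average scoring margin (positive = convincing wins).
wins = games.assign(TeamID=games.WTeamID, Margin=games.WScore - games.LScore)
losses = games.assign(TeamID=games.LTeamID, Margin=games.LScore - games.WScore)
margin = pd.concat([wins, losses])[["TeamID", "Margin"]].groupby("TeamID").mean()
print(margin)
```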