🚀 Getting Started

Everything you need to know to go from zero to submitted predictions before the March 17 deadline.

What's Already Done

You don't need to start from scratch. Here's the current state of the project.

✅ Models are trained and predictions are generated! You can download and submit the prediction files immediately. The models are trained on real Kaggle NCAA data, with historical cross-validation accuracy around 65–75% for tournament games.

Choose Your Path

⚡ Path A – Submit Now (5 minutes)

The prediction files already exist. Just download and submit them. No setup required. Accuracy will be competitive out of the box, since the models are already trained on real NCAA data.

  1. Download MNCAATourneyPredictions.csv
  2. Download WNCAATourneyPredictions.csv
  3. Email/submit both files to your team captain

๐Ÿ† Path B โ€” Retrain with Real Data (recommended)

Use real NCAA data from Kaggle to train a much better model. Expected accuracy: 65–75%. Takes 20–30 minutes total.

  1. Set up Kaggle API (Step 1)
  2. Install dependencies (Step 2)
  3. Download NCAA data (Step 3)
  4. Retrain models (Step 4)
  5. Generate predictions (Step 5)
  6. Submit files (Step 6)
| Accuracy | Data | Notes |
| --- | --- | --- |
| ~55% | Current (synthetic data) | slightly better than guessing |
| 65–75% | With real Kaggle data | top Kaggle competition range |

Step-by-Step Guide

Follow these steps in order for the best results. Steps 1–3 are one-time setup.

Step 1: Get Kaggle API Credentials (required for real data)
One-time setup · ~5 minutes
  1. Create a free account at kaggle.com
  2. Go to kaggle.com/settings/account
  3. Scroll to the "API" section → click "Create New Token"
  4. A file kaggle.json downloads. It looks like:
    {"username": "yourusername", "key": "your-api-key-abc123"}
  5. Move it to the right location:
    # Linux / macOS:
    mkdir -p ~/.kaggle
    mv ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
    chmod 600 ~/.kaggle/kaggle.json
    
    # Windows (PowerShell):
    mkdir -Force $env:USERPROFILE\.kaggle
    move $env:USERPROFILE\Downloads\kaggle.json $env:USERPROFILE\.kaggle\kaggle.json
Note: The chmod 600 on Linux/Mac sets file permissions so only you can read the file. Kaggle will refuse to work if the permissions are too open.
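If you want to sanity-check the credentials file from Python before running anything else, a helper along these lines works. This is an illustrative sketch, not part of the repo: the function name is made up, and the permission check mirrors what the Kaggle client enforces on Linux/macOS.

```python
import json
import os
import stat
from pathlib import Path

def check_kaggle_credentials(path: Path) -> list:
    """Return a list of problems with a kaggle.json file (empty list = OK)."""
    if not path.exists():
        return [f"{path} not found"]
    problems = []
    # The Kaggle client rejects credential files readable by group/other.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f"permissions too open ({oct(mode)}); run chmod 600 on it")
    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return problems + ["file is not valid JSON"]
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing field: {field}")
    return problems

# Usage: check_kaggle_credentials(Path.home() / ".kaggle" / "kaggle.json")
```

An empty list means the file should be usable; anything else tells you which of the steps above to redo.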
Step 2: Clone the Repo & Install Dependencies (required)
One-time setup · ~2 minutes
# Clone the repository
git clone https://github.com/Qrytics/cmuMarchMadness-ML
cd cmuMarchMadness-ML

# Install Python dependencies
pip install -r requirements.txt
Python version: Python 3.9+ recommended. If pip doesn't work, try pip3. If you're using a virtual environment, activate it first.
Step 3: Download Real NCAA Data from Kaggle (recommended)
One-time setup · ~5 minutes (500 MB download)
python scripts/download_data.py

This downloads the March Machine Learning Mania dataset from Kaggle into data/raw/. Key files:

| File | Contents |
| --- | --- |
| MRegularSeasonDetailedResults.csv | Full box scores for every regular season game (men's) |
| MNCAATourneyDetailedResults.csv | Tournament results, used as training labels |
| MSeeds.csv | Tournament seedings by year |
| MMasseyOrdinals.csv | Rankings from 30+ systems (NET, KPI, KenPom, SAG, ...) |
| WRegularSeasonDetailedResults.csv | Women's regular season games |
| WNCAATourneyDetailedResults.csv | Women's tournament results |
| WSeeds.csv | Women's seedings |
No Kaggle account yet? Use synthetic data for now:
python scripts/generate_sample_data.py
This creates realistic-but-fake data in data/sample/. The pipeline will work end-to-end, but accuracy will be lower (~55%).
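Whichever data you end up with, a quick schema check before training can save a confusing debugging session. A minimal sketch using only the standard library; the column names are the leading columns of Kaggle's detailed-results files, and `missing_columns` is an illustrative helper, not a repo script:

```python
import csv
from pathlib import Path

# Leading columns of the detailed-results files; the real files carry
# many more box-score columns after these.
EXPECTED_COLS = {"Season", "DayNum", "WTeamID", "WScore", "LTeamID", "LScore"}

def missing_columns(csv_path: Path) -> set:
    """Return the expected columns absent from the CSV's header row."""
    with open(csv_path, newline="") as f:
        header = set(next(csv.reader(f)))
    return EXPECTED_COLS - header
```

Run it over each file in data/raw/ (or data/sample/) and investigate any non-empty result before training.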
Step 4: Retrain the Models (required for accuracy)
~2–5 minutes
# Train both Men's and Women's models with real data:
python -m src.train --data-dir data/raw

# Or train each gender separately:
python -m src.train --gender M --data-dir data/raw
python -m src.train --gender W --data-dir data/raw

# Using synthetic data instead:
python -m src.train

You'll see walk-forward cross-validation output like:

Season 2020: acc=0.671 logloss=0.612 auc=0.714
Season 2021: acc=0.659 logloss=0.624 auc=0.698
Season 2022: acc=0.683 logloss=0.599 auc=0.731
...
Walk-forward CV Results:
  Accuracy: 0.671 ± 0.012
  Log Loss: 0.612 ± 0.010
  AUC:      0.714 ± 0.015
Expected with real data: 65–75% accuracy, AUC 0.70–0.78. If you're seeing ~55% accuracy, you may still be using synthetic data; double-check the --data-dir argument.
Step 5: Evaluate Historical Performance (optional but useful)
~1 minute · generates plots and bracket scores
# Backtest the model on historical tournament seasons:
python -m src.evaluate --data-dir data/raw

# Evaluate specific seasons:
python -m src.evaluate --gender M --seasons 2022 2023 2024 --data-dir data/raw

This generates per-season accuracy reports and bracket simulation scores. Results are saved to models/m_backtest_scores.json and evaluation plots to docs/assets/.

What does "bracket score" mean? It simulates submitting your predictions to the competition and counts how many points you'd earn. A score of 100/196 means you predicted ~51% of games correctly when weighted by round importance.
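That arithmetic can be sketched in a few lines. The doubling weights below are a common bracket-scoring scheme assumed for illustration; the actual competition weights may differ:

```python
def bracket_score(correct_by_round, weights):
    """Points earned: correct picks in each round times that round's weight."""
    return sum(c * w for c, w in zip(correct_by_round, weights))

# Hypothetical weights that double each round for a 64-team bracket.
WEIGHTS = [1, 2, 4, 8, 16, 32]
PERFECT = bracket_score([32, 16, 8, 4, 2, 1], WEIGHTS)  # all 63 picks right

# A middling bracket: most first-round picks right, champion pick right.
score = bracket_score([22, 10, 5, 2, 1, 1], WEIGHTS)
ratio = score / PERFECT
```

Under these weights a perfect bracket is worth 192 points, so late-round picks dominate the ratio even though they are few.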
Step 6: Generate Final Predictions (required)
~2–3 minutes
# Generate all possible matchup predictions:
python -m src.predict --data-dir data/raw

# Using synthetic data:
python -m src.predict

Output files:

  • predictions/MNCAATourneyPredictions.csv โ€” 72,010 rows (all Division I men's team pairs)
  • predictions/WNCAATourneyPredictions.csv โ€” 71,253 rows (all Division I women's team pairs)
Why predict every possible pair? The competition format requires predictions for all possible matchups in advance, not just the actual bracket. This way, no matter who wins each game, the submission already contains the correct matchup prediction.
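The row counts above are just n choose 2: 72,010 corresponds to 380 men's teams. A sketch of how such a matrix can be enumerated; the 380-team count and the lower-ID-first ordering are inferred from the row count, not read from the repo:

```python
from itertools import combinations

# 380 hypothetical men's team IDs in the 1000-1999 range used by Kaggle.
team_ids = range(1000, 1380)

# Every unordered pair, emitted with the lower ID first.
pairs = list(combinations(team_ids, 2))
# len(pairs) == 380 * 379 // 2 == 72,010
```

Each pair then gets one predicted probability, regardless of whether that matchup ever occurs in the bracket.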
Step 7: Update the Live Dashboard (optional)
~1 minute · updates the GitHub Pages site
# Export updated model metrics and predictions to docs/data/:
python scripts/export_site_data.py

# Commit and push to trigger automatic deployment:
git add -A
git commit -m "Update predictions and model metrics after retraining"
git push

GitHub Actions automatically deploys the docs/ directory to GitHub Pages within about 60 seconds after a push to main. The dashboard will then show your updated model accuracy, backtest scores, and feature importance.

Step 8: Submit Your Predictions 🏆 (deadline: March 17, 2026, Noon EDT)
Submit BEFORE the tournament bracket is announced on Selection Sunday (March 15, 2026).

Submit both of these files to the competition form. See the Submit page for the full checklist.

📥 Download MNCAATourneyPredictions.csv · 📥 Download WNCAATourneyPredictions.csv
| File | Rows | Team ID Range | Columns |
| --- | --- | --- | --- |
| MNCAATourneyPredictions.csv | 72,010 | 1000–1999 | WTeamID, LTeamID |
| WNCAATourneyPredictions.csv | 71,253 | 3000–3999 | WTeamID, LTeamID |
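Before submitting, a quick structural check of each file can catch an accidental men's/women's mix-up. A minimal validator, assuming the columns shown above; `count_bad_rows` is an illustrative helper, not a repo script:

```python
import csv

# Valid team-ID ranges per division, per the table above.
ID_RANGES = {"M": (1000, 1999), "W": (3000, 3999)}

def count_bad_rows(path, gender):
    """Return (total row count, rows whose team IDs fall outside the range)."""
    lo, hi = ID_RANGES[gender]
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    bad = [r for r in rows
           if not (lo <= int(r["WTeamID"]) <= hi and lo <= int(r["LTeamID"]) <= hi)]
    return len(rows), bad
```

A non-empty `bad` list for the men's file usually means a women's row (3000-range IDs) slipped in, or vice versa.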
Important deadlines:
  • March 15, 2026 – Selection Sunday. Tournament bracket announced. Progressive bracket predictions can update after each round.
  • March 17, 2026 at Noon EDT – Final submission deadline. Regular bracket must be submitted before games begin.

Optional: Automated Retraining

Set up GitHub Actions to retrain and update predictions automatically, without running anything locally.

Add Kaggle Secrets to GitHub

Go to your repository on GitHub: Settings → Secrets and variables → Actions → New repository secret

| Secret Name | Value |
| --- | --- |
| KAGGLE_USERNAME | Your Kaggle username (e.g., johndoe) |
| KAGGLE_KEY | Your Kaggle API key (the long string from kaggle.json) |

Once set, trigger a retrain by going to Actions → train.yml → Run workflow. The workflow will:

  1. Download the latest Kaggle NCAA data
  2. Retrain both men's and women's models
  3. Run evaluation and backtest
  4. Generate new prediction files
  5. Commit and push results back to the repo
  6. Trigger dashboard re-deployment automatically
Weekly schedule: The train.yml workflow is also set to run on a weekly schedule, keeping predictions fresh as new NCAA season data becomes available.

FAQ

Do I need to run anything to submit?

No; the prediction files already exist. You can download MNCAATourneyPredictions.csv and WNCAATourneyPredictions.csv directly from this site and submit them immediately. Running the full pipeline is only needed if you want to improve accuracy.

What's the difference between Regular and Progressive brackets?

Regular: All round picks are locked in before the tournament starts. Upsets in early rounds cascade and affect your later picks. Progressive: After each round, actual winners are used to set up the next round's matchups, and new predictions are made. This avoids cascading errors and typically scores higher.

Why predict every possible pair, not just bracket matchups?

The CMU competition format requires a full prediction matrix for all C(N, 2) team pairs. This way, whatever matchup actually occurs in the tournament, the judge already has your predicted outcome without needing to ask you again.

How accurate are the current predictions?

With synthetic data: ~55% (barely above random guessing). With real Kaggle NCAA data: 65–75%, which is competitive with top submissions on the Kaggle leaderboard. Seed-based heuristics alone give ~69%; the ML model should match or beat that with real data.
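For intuition, the seed heuristic can be sketched as a logistic curve on the seed gap. The 0.18 slope below is an illustrative constant, not a value fitted by this project:

```python
import math

def seed_baseline(seed_a: int, seed_b: int) -> float:
    """P(team A beats team B) from seed difference alone; lower seed favored."""
    return 1.0 / (1.0 + math.exp(0.18 * (seed_a - seed_b)))

p_1_vs_16 = seed_baseline(1, 16)  # strong favorite
p_8_vs_9 = seed_baseline(8, 9)    # near coin flip
```

A model that only beats this baseline by a few points is still valuable: those extra points come from games the seed line gets wrong.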

What is walk-forward cross-validation?

Instead of randomly splitting data for validation, we simulate real deployment: train on all seasons up to year N, test on year N+1, then train up to N+1 and test on N+2, etc. This prevents data leakage (using future tournament data to predict past results) and gives realistic performance estimates.
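In code form, the splitting logic looks roughly like this. It's a sketch of the idea, not the repo's actual implementation, and `min_train` is an assumed knob for the minimum number of training seasons:

```python
def walk_forward_splits(seasons, min_train=3):
    """Yield (train_seasons, test_season) pairs, training only on the past."""
    seasons = sorted(seasons)
    for i in range(min_train, len(seasons)):
        yield seasons[:i], seasons[i]

# Each test season only ever sees models trained on earlier seasons.
splits = list(walk_forward_splits([2018, 2019, 2020, 2021, 2022]))
```

Averaging the per-season metrics from these splits is what produces the "Accuracy: 0.671 ± 0.012"-style summary shown in Step 4.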

Can I modify the model to improve it?

Yes! The best places to improve are:
(1) src/feature_engineering.py – add more features (conference strength, strength of schedule, tempo stats, etc.)
(2) src/model.py – change model hyperparameters or add new models to the ensemble
(3) src/train.py – change the training strategy (e.g., tune hyperparameters with Optuna)