🔬 Model Details

Technical documentation of the ML models, feature engineering, and training methodology.

Model Performance

Metrics are cross-validated with walk-forward evaluation: each fold trains only on seasons earlier than the one it is scored on, so no future data leaks into training.
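As a rough illustration of walk-forward evaluation, the sketch below generates chronological train/test splits by season. The function name `walk_forward_splits` and the `min_train` parameter are hypothetical and are not taken from `src/evaluate.py`.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    seasons: List[int], min_train: int = 3
) -> Iterator[Tuple[List[int], int]]:
    """Yield (train_seasons, test_season) pairs in chronological order.

    Each test season is preceded only by earlier seasons, so a model
    fitted on the training seasons never sees future games.
    """
    ordered = sorted(set(seasons))
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


splits = list(walk_forward_splits([2015, 2016, 2017, 2018, 2019], min_train=3))
for train, test in splits:
    print(f"train on {train[0]}-{train[-1]}, test on {test}")
```

The training window here grows with every fold; a fixed-width sliding window would be an equally valid variant.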

๐Ÿ€ Men's Model Metrics

๐Ÿ€ Women's Model Metrics

Feature Engineering

28 differential features computed by comparing team statistics: team1_stat - team2_stat.
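A minimal sketch of the `team1_stat - team2_stat` pattern, assuming each team's season statistics are available as a plain dictionary. The helper name `differential_features`, the `_diff` suffix, and the stat keys are illustrative only and may not match `src/feature_engineering.py`.

```python
from typing import Dict


def differential_features(
    team1: Dict[str, float], team2: Dict[str, float]
) -> Dict[str, float]:
    """Return team1_stat - team2_stat for every stat both teams share."""
    shared = sorted(team1.keys() & team2.keys())
    return {f"{name}_diff": team1[name] - team2[name] for name in shared}


# Toy season averages, not real data.
team1 = {"win_pct": 0.80, "points_for": 78.5}
team2 = {"win_pct": 0.65, "points_for": 72.0}

diffs = differential_features(team1, team2)
swapped = differential_features(team2, team1)
print(diffs)
```

Swapping the two teams negates every feature, which is why a differential encoding pairs naturally with predicting the probability that team1 wins.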

📊 Statistical Features (differential)

  • Win Percentage: season win rate
  • Points For / Against: offensive and defensive scoring averages
  • Point Differential: average margin of victory or defeat
  • Shooting Efficiency: FG%, 3P%, and FT% (all three shooting categories)
  • Rebounds (Off + Def): rebounding prowess
  • Assists, Turnovers, Steals, Blocks, Fouls: full box score statistics

🎯 Contextual Features

  • Seed Difference: tournament seed differential (most predictive!)
  • Higher Seed Flag: binary, is team1 the higher seed?
  • Massey Rankings: average/best ordinal rank across multiple systems (NET, KPI, KenPom, SAG)
  • Offensive Efficiency: derived, points scored per game differential
  • Net Efficiency: offensive and defensive efficiency combined
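The two seed features can be sketched as follows. Seed strings in the Kaggle seed files look like `W01` or `X16a` (region letter, two-digit seed, optional play-in suffix). The function names and the sign convention (team1 minus team2, so a negative value means team1 is the better seed) are assumptions for illustration.

```python
from typing import Dict


def seed_number(seed: str) -> int:
    """Extract the numeric seed from a string such as 'W01' or 'X16a'."""
    return int(seed[1:3])


def seed_features(seed1: str, seed2: str) -> Dict[str, int]:
    """Seed differential and a flag for whether team1 is the higher seed.

    A "higher" seed has the lower number: a 1 seed outranks a 16 seed.
    """
    s1, s2 = seed_number(seed1), seed_number(seed2)
    return {"seed_diff": s1 - s2, "higher_seed": int(s1 < s2)}


features = seed_features("W01", "X16a")
print(features)
```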

Model Hyperparameters

Configuration of each component in the ensemble.

🚀 XGBoost (40% weight)

Gradient boosted trees with regularization. Excels at capturing non-linear interactions.

  • n_estimators: 300
  • max_depth: 4
  • learning_rate: 0.05
  • subsample: 0.8
  • colsample: 0.8
  • reg_alpha: 0.1

⚡ LightGBM (40% weight)

Leaf-wise tree growth for efficiency. Handles ordinal ranking features naturally.

  • n_estimators: 300
  • max_depth: 4
  • learning_rate: 0.05
  • subsample: 0.8
  • colsample: 0.8
  • min_data: 10

📉 Logistic Regression (20% weight)

Calibrated linear baseline. Provides stable probability estimates and helps guard against overfitting by the tree models.

  • C: 1.0
  • penalty: L2
  • max_iter: 1000
  • scaler: Standard
  • imputer: Median
  • calibration: Sigmoid
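The 40/40/20 weights can be read as a weighted average of the three models' win probabilities. The sketch below assumes a simple arithmetic blend; the source does not state exactly how `src/model.py` combines the outputs, and the names `WEIGHTS` and `blend` are hypothetical.

```python
from typing import Dict

# Ensemble weights as listed in this section.
WEIGHTS = {"xgboost": 0.4, "lightgbm": 0.4, "logistic": 0.2}


def blend(probs: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted arithmetic mean of per-model win probabilities."""
    total = sum(weights.values())
    return sum(weights[name] * probs[name] for name in weights) / total


# Toy per-model probabilities that team1 wins.
p = blend({"xgboost": 0.70, "lightgbm": 0.66, "logistic": 0.60})
print(round(p, 3))
```

Dividing by the weight total keeps the result a valid probability even if the weights are later changed so that they no longer sum to 1.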

Using Real Kaggle Data

For production predictions, use real NCAA basketball data from Kaggle. See the full guide at Getting Started.

📥 Data Setup

# 1. Get Kaggle API credentials
# Go to: https://www.kaggle.com/settings/account -> Create New Token
# Place kaggle.json at ~/.kaggle/kaggle.json (chmod 600 on Linux/Mac)

# 2. Download NCAA data (~500 MB)
python scripts/download_data.py

# Key files downloaded to data/raw/:
#   MRegularSeasonDetailedResults.csv  - All regular season game results (men's)
#   MNCAATourneyDetailedResults.csv    - Tournament results (training labels)
#   MSeeds.csv                         - Tournament seedings by year
#   MMasseyOrdinals.csv                - Rankings from 30+ systems
#   W*.csv                             - Corresponding women's files

# 3. Retrain with real data (~2-5 minutes)
python -m src.train --data-dir data/raw

# 4. Backtest to verify performance
python -m src.evaluate --data-dir data/raw

# 5. Generate 2026 predictions
python -m src.predict --data-dir data/raw

Expected improvement: synthetic data yields ~55% accuracy, while real Kaggle data yields 65–75%. Historical Kaggle leaderboards show top models at 73–75% on tournament games. A seed-only heuristic baseline achieves ~69%, so the ML model should beat that with full data.
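To make the seed-only baseline concrete, here is a small sketch that always picks the better (lower-numbered) seed and measures how often that pick is right. The game list is toy data, and skipping equal-seed matchups is an assumption of this sketch, not necessarily how the ~69% figure was computed.

```python
from typing import Iterable, Tuple


def seed_baseline_accuracy(games: Iterable[Tuple[int, int]]) -> float:
    """Accuracy of always picking the lower seed number.

    games holds (winner_seed, loser_seed) pairs. Matchups between equal
    seeds are skipped because the heuristic has no pick for them.
    """
    decided = [(w, l) for w, l in games if w != l]
    if not decided:
        raise ValueError("no games with differing seeds")
    correct = sum(1 for w, l in decided if w < l)
    return correct / len(decided)


# Toy results as (winner_seed, loser_seed); not real tournament data.
toy_games = [(1, 16), (5, 12), (11, 6), (2, 2), (3, 14)]
acc = seed_baseline_accuracy(toy_games)
print(f"seed-only accuracy: {acc:.2f}")
```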

How to Improve the Model

Ideas for increasing prediction accuracy beyond the baseline ensemble.

🧪 More Features

Edit src/feature_engineering.py to add:

  • Strength of schedule
  • Conference strength rating
  • Tempo and pace statistics
  • Home/away game splits
  • Recent form (last 10 games)
  • Head-to-head historical record

โš™๏ธ Better Hyperparameters

Edit src/model.py and tune with Optuna or scikit-learn's GridSearchCV:

  • Increase n_estimators (500+)
  • Tune learning_rate (0.01–0.1)
  • Adjust subsample ratio
  • Change ensemble weights
  • Add CatBoost or Neural Net
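One of the ideas above, changing the ensemble weights, can be explored with a small grid search that minimizes log loss. This is a dependency-free sketch rather than the Optuna or GridSearchCV route; `search_weights` and its arguments are hypothetical, and the probabilities are toy values.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple


def log_loss(y_true: Sequence[int], y_prob: Sequence[float], eps: float = 1e-15) -> float:
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def search_weights(
    y_true: Sequence[int], model_probs: List[Sequence[float]], step: float = 0.1
) -> Tuple[float, Tuple[float, float, float]]:
    """Try every weight triple on a grid that sums to 1; return (loss, weights)."""
    n = round(1 / step)
    best = None
    for i, j in product(range(n + 1), repeat=2):
        if i + j > n:
            continue
        w = (i / n, j / n, (n - i - j) / n)
        blended = [
            sum(wk * probs[g] for wk, probs in zip(w, model_probs))
            for g in range(len(y_true))
        ]
        loss = log_loss(y_true, blended)
        if best is None or loss < best[0]:
            best = (loss, w)
    return best


# Toy out-of-fold predictions from three models.
y = [1, 0, 1, 0]
probs = [[0.9, 0.1, 0.9, 0.1], [0.5, 0.5, 0.5, 0.5], [0.4, 0.6, 0.4, 0.6]]
best_loss, best_w = search_weights(y, probs)
print(best_w, round(best_loss, 4))
```

Weights should be tuned on out-of-fold predictions from the walk-forward splits; tuning on training predictions would favor whichever model overfits most.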

📊 Better Training Strategy

Edit src/train.py to improve the CV strategy:

  • Weight recent seasons more
  • Add tournament-specific features
  • Use calibrated probabilities
  • Optimize for bracket score (not accuracy)
  • Train a meta-model on ensemble outputs
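Weighting recent seasons more heavily can be done with per-row sample weights that decay with age. The sketch below uses exponential decay with a half-life in seasons; the function name and the default half-life are assumptions, not values from `src/train.py`.

```python
from typing import List, Sequence


def recency_weights(seasons: Sequence[int], half_life: float = 5.0) -> List[float]:
    """Exponential-decay weights: a game half_life seasons old counts half as much."""
    latest = max(seasons)
    return [0.5 ** ((latest - s) / half_life) for s in seasons]


# One entry per training row, giving the season that row belongs to.
weights = recency_weights([2015, 2020, 2025], half_life=5.0)
print(weights)
```

Weights like these can typically be passed through the `sample_weight` argument that the `fit` methods of XGBoost's, LightGBM's, and scikit-learn's classifiers accept.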

Feature Importance

Feature importance averaged across XGBoost and LightGBM.
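XGBoost and LightGBM can report importances on different scales, so one reasonable way to average them is to normalize each model's scores to sum to 1 first. The sketch below assumes that approach; the source does not say whether normalization is applied, and all numbers are toy values.

```python
from typing import Dict, List, Tuple


def normalize(importance: Dict[str, float]) -> Dict[str, float]:
    """Scale importances so they sum to 1."""
    total = sum(importance.values())
    return {name: value / total for name, value in importance.items()}


def average_importance(*importances: Dict[str, float]) -> List[Tuple[str, float]]:
    """Average normalized importances across models, sorted high to low."""
    normed = [normalize(imp) for imp in importances]
    names = set().union(*normed)
    avg = {n: sum(m.get(n, 0.0) for m in normed) / len(normed) for n in names}
    return sorted(avg.items(), key=lambda kv: kv[1], reverse=True)


# Toy importances on deliberately different scales.
xgb_imp = {"seed_diff": 6.0, "win_pct_diff": 3.0, "fg_pct_diff": 1.0}
lgbm_imp = {"seed_diff": 50.0, "win_pct_diff": 30.0, "fg_pct_diff": 20.0}

ranked = average_importance(xgb_imp, lgbm_imp)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```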

Men's Top 10 Features

Women's Top 10 Features

CI/CD Pipeline

Automated GitHub Actions workflows for training, evaluation, and deployment.

🔬 ci.yml: Tests

Runs on every push and PR. Validates data pipelines, feature engineering, model training, and prediction generation. Ensures submission files have correct format (72,390 M / 71,631 W rows).

Triggers: on push, on PR
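The row counts checked in CI match the number of unordered team pairs, n(n-1)/2: 72,390 corresponds to 381 teams and 71,631 to 379 teams. Those team counts are inferred from the arithmetic rather than stated in the source, and the helper names below are hypothetical.

```python
from math import comb


def expected_rows(n_teams: int) -> int:
    """Number of unordered matchups among n_teams teams."""
    return comb(n_teams, 2)


def check_submission(row_count: int, n_teams: int) -> bool:
    """True if a submission has exactly one row per possible matchup."""
    return row_count == expected_rows(n_teams)


print(expected_rows(381), expected_rows(379))
```

A fuller check would also confirm that the matchup IDs are unique and that every prediction lies between 0 and 1.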

🤖 train.yml: Training

Manual trigger or weekly schedule. Downloads latest data, retrains models, runs evaluation, generates predictions, and commits results back to the repo.

Triggers: manual, weekly schedule

๐ŸŒ pages.yml โ€” Deploy

Deploys the docs/ directory to GitHub Pages on every push to main. Copies model metrics and backtest scores to the site's data directory.

Triggers: push to main