Technical documentation of the ML models, feature engineering, and training methodology.
Cross-validated metrics using walk-forward evaluation (no data leakage).
28 differential features computed by comparing team statistics: team1_stat - team2_stat.
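A minimal sketch of how such differential features can be computed, assuming each team's season statistics are available as a dict of per-game averages (the stat names below are illustrative placeholders, not the actual 28 features):

```python
def diff_features(team1_stats, team2_stats):
    """Build team1 - team2 differential features from two stat dicts."""
    shared = team1_stats.keys() & team2_stats.keys()
    return {f"diff_{k}": team1_stats[k] - team2_stats[k] for k in shared}

# Hypothetical season averages for two teams
t1 = {"points_pg": 78.2, "fg_pct": 0.47, "turnovers_pg": 11.3}
t2 = {"points_pg": 71.5, "fg_pct": 0.44, "turnovers_pg": 13.0}
features = diff_features(t1, t2)
```

Because every feature is a difference, swapping the two teams simply flips the sign of each feature, which keeps predictions symmetric.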
Configuration of each component in the ensemble.
- XGBoost: gradient boosted trees with regularization; excels at capturing non-linear interactions.
- LightGBM: leaf-wise tree growth for efficiency; handles ordinal ranking features naturally.
- Logistic regression: calibrated linear baseline; provides stable probability estimates and resists overfitting.
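One simple way the components' outputs can be combined is a weighted average of predicted win probabilities. The sketch below assumes three models (XGBoost, LightGBM, and a logistic baseline) and illustrative weights; the source describes the components but not the blending rule, so treat this as a sketch only:

```python
import numpy as np

def blend_probabilities(prob_lists, weights):
    """Weighted average of per-model win probabilities.

    prob_lists: one sequence of P(team1 wins) per model, all the same length.
    weights: one non-negative weight per model; normalized to sum to 1 here.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    probs = np.vstack([np.asarray(p, dtype=float) for p in prob_lists])
    return w @ probs

# Hypothetical predictions from the three models for two matchups
xgb_p = [0.70, 0.40]
lgb_p = [0.64, 0.44]
lr_p  = [0.58, 0.50]
blended = blend_probabilities([xgb_p, lgb_p, lr_p], weights=[0.4, 0.4, 0.2])
```

Averaging calibrated probabilities (rather than raw scores or class labels) keeps the blended output interpretable as a probability.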
For production predictions, use real NCAA basketball data from Kaggle; see the Getting Started guide for the full walkthrough.
# 1. Get Kaggle API credentials
# Go to: https://www.kaggle.com/settings/account → Create New Token
# Place kaggle.json at ~/.kaggle/kaggle.json (chmod 600 on Linux/Mac)
# 2. Download NCAA data (~500 MB)
python scripts/download_data.py
# Key files downloaded to data/raw/:
# MRegularSeasonDetailedResults.csv → All regular season game results (men's)
# MNCAATourneyDetailedResults.csv → Tournament results (training labels)
# MSeeds.csv → Tournament seedings by year
# MMasseyOrdinals.csv → Rankings from 30+ systems
# W*.csv → Corresponding women's files
# 3. Retrain with real data (~2-5 minutes)
python -m src.train --data-dir data/raw
# 4. Backtest to verify performance
python -m src.evaluate --data-dir data/raw
# 5. Generate 2026 predictions
python -m src.predict --data-dir data/raw
Ideas for increasing prediction accuracy beyond the baseline ensemble.
Edit src/feature_engineering.py to add:
Edit src/model.py and tune with Optuna or scikit-learn's GridSearchCV:
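For example, a minimal GridSearchCV sketch on synthetic data; the estimator and parameter grid below are illustrative (swap in the real XGBoost/LightGBM parameters from src/model.py):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset standing in for the real feature matrix
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Illustrative grid over the regularization strength only
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_log_loss",  # log loss matches probability-forecast evaluation
    cv=5,
)
grid.fit(X, y)
best_C = grid.best_params_["C"]
```

Optuna follows the same pattern but samples the search space adaptively instead of exhaustively enumerating a grid.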
Edit src/train.py to improve the CV strategy:
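The walk-forward idea, training only on seasons strictly before the validation season so no future information leaks into training, can be sketched as a plain split generator (the season range and minimum-training-window value here are illustrative):

```python
def walk_forward_splits(seasons, min_train=3):
    """Yield (train_seasons, test_season) pairs in chronological order.

    Each split trains only on seasons strictly before the test season,
    which is what prevents data leakage in time-ordered evaluation.
    """
    seasons = sorted(seasons)
    for i in range(min_train, len(seasons)):
        yield seasons[:i], seasons[i]

# Illustrative: the first split trains on 2015-2017 and tests on 2018
splits = list(walk_forward_splits(range(2015, 2020)))
```

A generator like this can be passed season-by-season to any fit/evaluate loop, replacing a shuffled K-fold that would mix future games into training.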
Feature importance averaged across XGBoost and LightGBM.
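Averaging importances across models requires putting them on a common scale first, since XGBoost and LightGBM report importances in different units. A sketch that normalizes each model's importance vector to sum to 1 before averaging (the normalization choice is an assumption, not stated in the source):

```python
import numpy as np

def mean_importance(importances_by_model):
    """Normalize each model's importances to sum to 1, then average."""
    normed = [np.asarray(v, dtype=float) / np.sum(v)
              for v in importances_by_model.values()]
    return np.mean(normed, axis=0)

# Hypothetical raw importances for the same three features
avg = mean_importance({
    "xgboost":  [30.0, 50.0, 20.0],    # e.g. gain-based importance
    "lightgbm": [200.0, 500.0, 300.0], # e.g. split-count importance
})
```

Without this normalization step, the model with larger raw magnitudes would dominate the average.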
Automated GitHub Actions workflows for training, evaluation, and deployment.
Runs on every push and PR. Validates data pipelines, feature engineering, model training, and prediction generation. Ensures submission files have correct format (72,390 M / 71,631 W rows).
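The expected row counts correspond to one prediction per unordered pair of team IDs, i.e. n·(n−1)/2 rows, which matches 381 men's and 379 women's teams (the team counts are inferred from the row counts, not stated in the source):

```python
from math import comb

def expected_rows(n_teams):
    """One submission row per unordered pair of teams."""
    return comb(n_teams, 2)

assert expected_rows(381) == 72390   # men's submission file
assert expected_rows(379) == 71631   # women's submission file
```

A check like this in CI catches missing or duplicated matchup rows before a submission file is published.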
Manual trigger or weekly schedule. Downloads latest data, retrains models, runs evaluation, generates predictions, and commits results back to the repo.
Deploys the docs/ directory to GitHub Pages on every push to main. Copies model metrics and backtest scores to the site's data directory.