Flight Price Prediction: Model Selection Report
1. Executive Summary
This report details the systematic evaluation and iterative refinement of machine learning models for the flight price prediction task. The process involved establishing a baseline, tuning multiple advanced models, and conducting a final bake-off.
Initial results pointed to a LightGBM model with suspiciously high accuracy. This prompted a deep-dive investigation that uncovered and corrected a subtle overfitting issue caused by a leaky feature. After this refinement, a final, robust LightGBM model was confirmed as the champion.
The final model balances high performance and stability, with a Cross-Validation RMSE of $9.57 and a Final Test Set RMSE of $7.60. This report documents the full process, from the wide-ranging bake-off to the investigative work that produced a reliable, production-ready model.
2. Initial Bake-Off & The "Too Good to Be True" Result
The first step was to compare our tuned tree-based models against a Linear Regression baseline.
| Model | CV R² Score | CV RMSE | CV RMSE Std Dev (Stability) | CV MAE | Duration |
|---|---|---|---|---|---|
| LGBMRegressor (Tuned v1) | ~1.000 | $1.02 | $0.38 | $0.61 | 2.5 min |
| RandomForestRegressor (Tuned) | 0.999 | $10.50 | $0.11 | $5.38 | 6.9 min |
| XGBoostRegressor (Tuned) | 0.999 | $11.95 | $3.32 | $9.48 | 1.9 min |
| LinearRegression (Base) | 0.986 | $42.64 | $0.18 | $34.32 | 2.2 min |
Analysis and Red Flags
While the initial results were impressive across the board, the LightGBM model's performance was almost perfect (CV RMSE of $1.02, R² of ~1.000). Such high accuracy on a real-world dataset is a significant red flag for potential overfitting or data leakage. Furthermore, this initial model assigned no importance to temporal features, which contradicted our findings from the EDA. This warranted a deeper investigation.
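To make the "CV RMSE" and "CV RMSE Std Dev (Stability)" columns concrete, the following is a minimal sketch of how such figures can be derived: compute one RMSE per fold, then report the mean and the standard deviation across folds. The report does not state the cross-validation configuration, so the fold count, the unshuffled split, the `mean_baseline` stand-in model, and the toy prices are all assumptions made for illustration.

```python
import math
import statistics

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

def cross_validated_rmse(fit_predict, X, y, k=5):
    """Return (mean RMSE, standard deviation of RMSE) across k folds."""
    fold_scores = []
    for train_idx, val_idx in k_fold_indices(len(y), k):
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        X_val = [X[i] for i in val_idx]
        y_val = [y[i] for i in val_idx]
        preds = fit_predict(X_train, y_train, X_val)
        fold_scores.append(rmse(y_val, preds))
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)

def mean_baseline(X_train, y_train, X_val):
    """Hypothetical stand-in model: always predicts the mean training price."""
    mean_price = statistics.mean(y_train)
    return [mean_price] * len(X_val)

# Toy data for illustration only (not the flight dataset).
X = [[d] for d in range(10)]
y = [100.0 + 10.0 * d for d in range(10)]
cv_mean, cv_std = cross_validated_rmse(mean_baseline, X, y, k=5)
print(f"CV RMSE: {cv_mean:.2f} (std dev {cv_std:.2f})")
```

A low standard deviation across folds means the model's error is similar regardless of which slice of data is held out, which is why that column is read as a stability indicator.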
3. Iteration 1: The Overfitting Trap
To simplify and optimize, a new iteration was run with two key changes:
- A new `is_tree_model` parameter was introduced to create a more efficient pipeline for tree-based models, bypassing unnecessary steps like one-hot encoding and scaling.
- Interaction features that showed zero importance in the initial SHAP analysis were removed.
This left a feature set that included route (a combination of origin and destination) and newly added cyclical temporal features.
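As an illustration of the two feature types named above, here is a minimal sketch. It assumes that `route` is a simple concatenation of `from_location` and `to_location` and that the cyclical features use the common sine/cosine encoding; the report confirms neither detail, and the `month` and `weekday` column names and the separator are hypothetical.

```python
import math

def cyclical_encode(value, period):
    """Map a periodic value (e.g. month 1-12 with period 12) onto the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)

def add_engineered_features(row):
    """Return a copy of `row` with `route` and cyclical month/weekday features added."""
    out = dict(row)
    # Assumed construction: origin and destination joined into one categorical value.
    out["route"] = f"{row['from_location']}_{row['to_location']}"
    out["month_sin"], out["month_cos"] = cyclical_encode(row["month"], 12)
    out["weekday_sin"], out["weekday_cos"] = cyclical_encode(row["weekday"], 7)
    return out

# Placeholder row; location codes and column names are illustrative.
example = add_engineered_features(
    {"from_location": "AAA", "to_location": "BBB", "month": 12, "weekday": 3}
)
print(example["route"], example["month_sin"], example["month_cos"])
```

The sine/cosine pair places December next to January on the circle, which a raw integer month (12 vs. 1) does not. A concatenated `route` value, by contrast, gives the model one category per origin-destination pair, which is what makes it easy to memorize.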
3.1. Iteration 1: Results
| Model | CV RMSE | Final Model RMSE (Train+Val) | Overfitting Gap |
|---|---|---|---|
| LightGBM | $7.25 | $0.59 | ~92% |
| XGBoost | $6.47 | $0.72 | ~89% |
- LightGBM's CV RMSE rose from $1.02 to a more plausible $7.25, but both models fit the combined train+validation data far more closely than they score in cross-validation, which indicates severe overfitting.
- This time, however, the `temporal` features received some importance, which had not happened in the initial run.
3.2. Diagnosis: Severe Overfitting
The results were clear: both models were severely overfitting. The error on the combined training and validation data (Final Model RMSE) was an order of magnitude lower than the average error during cross-validation (CV RMSE). This indicates the models were memorizing the training data and failing to generalize.
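The report does not state how the "Overfitting Gap" column is computed. One definition that reproduces the tabulated Iteration 1 values is the relative difference between the CV RMSE and the train+validation RMSE, expressed as a fraction of the CV RMSE. The sketch below assumes that definition; the function name is hypothetical.

```python
def overfitting_gap(cv_rmse, train_rmse):
    """Relative gap between CV error and train+val error, as a fraction of CV error."""
    return (cv_rmse - train_rmse) / cv_rmse

# (CV RMSE, Final Model RMSE) pairs taken from the Iteration 1 results table.
iteration_1 = {"LightGBM": (7.25, 0.59), "XGBoost": (6.47, 0.72)}
for name, (cv, final) in iteration_1.items():
    print(f"{name}: gap = {overfitting_gap(cv, final):.0%}")
# LightGBM: gap = 92%
# XGBoost: gap = 89%
```

A gap near 0% means the model performs about as well on held-out folds as on the data it was fit to; a gap near 100% means almost all of the apparent accuracy disappears on unseen data.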
SHAP analysis of this run revealed that the engineered route feature had an overwhelmingly dominant contribution, dwarfing even time and flight_type. This pointed to route as the primary source of data leakage and overfitting.
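A common way to quantify this kind of dominance is each feature's share of the total mean absolute SHAP value. The sketch below assumes a per-sample SHAP matrix is already available (for example, from a SHAP explainer) and uses made-up numbers purely to show the calculation; they are not the values from this run, and the function name is hypothetical.

```python
def mean_abs_shap_share(shap_rows, feature_names):
    """Return each feature's share of the total mean |SHAP| (shares sum to 1)."""
    n = len(shap_rows)
    mean_abs = [
        sum(abs(row[j]) for row in shap_rows) / n for j in range(len(feature_names))
    ]
    total = sum(mean_abs)
    return {name: value / total for name, value in zip(feature_names, mean_abs)}

# Toy SHAP matrix (rows = samples, columns = features). Illustrative values only.
features = ["route", "time", "flight_type"]
toy_shap = [
    [120.0, -15.0, 10.0],
    [-90.0, 20.0, -5.0],
    [150.0, -10.0, 15.0],
]
shares = mean_abs_shap_share(toy_shap, features)
dominant = max(shares, key=shares.get)
print(dominant, f"{shares[dominant]:.0%}")
```

When one engineered feature takes most of the total, as `route` did here, it is worth checking whether it encodes the target too directly before trusting the model's scores.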
4. Iteration 2: Taming the Model & Finding the True Champion
The clear next step was to remove the leaky route feature, forcing the models to learn from the more fundamental from_location and to_location features.
4.1. Iteration 2: Results
| Model | CV RMSE | Final Model RMSE (Train+Val) | Overfitting Gap | Verdict |
|---|---|---|---|---|
| LightGBM | $9.57 | $7.66 | ~20% | Stable & Reliable |
| XGBoost | $12.08 | $0.90 | ~92% | Still Overfitting |
4.2. Analysis and Final Decision
This iteration was the breakthrough:
- LightGBM is the Champion: Once the `route` feature was removed, the LightGBM model's performance stabilized. The CV RMSE and the final model RMSE are now closely aligned, indicating it generalizes well. The slight remaining gap is expected and healthy.
- XGBoost is Dropped: The XGBoost model, even without the `route` feature, continued to overfit severely. This made it an unreliable candidate for production.
Conclusion: The stabilized LightGBM model from Iteration 2 was declared the definitive champion.
5. Final Champion Model: Performance on Unseen Test Data
The final step was to evaluate the champion LightGBM model on the hold-out test set to confirm its real-world performance.
| Metric | Value |
|---|---|
| R² Score | 0.99956 |
| Root Mean Squared Error (RMSE) | $7.60 |
| Mean Absolute Error (MAE) | $5.50 |
| Median Absolute Error | $4.21 |
| Max Error | $39.94 |
The test set RMSE of $7.60 is in line with the final model's train+validation RMSE of $7.66 and below the cross-validation RMSE of $9.57. This consistency across training, cross-validation, and hold-out data is strong evidence that the model is robust and not overfit.
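For reference, the five metrics in the table above can be computed from hold-out predictions as in the following minimal sketch. It uses the standard definitions of each metric; the function name and the toy prices are illustrative and are not the report's data.

```python
import math
import statistics

def regression_report(y_true, y_pred):
    """Compute R², RMSE, MAE, median absolute error and max error."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    mean_true = statistics.mean(y_true)
    ss_res = sum(e ** 2 for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / len(errors)),
        "mae": statistics.mean(abs_errors),
        "median_ae": statistics.median(abs_errors),
        "max_error": max(abs_errors),
    }

# Toy hold-out prices and predictions, for illustration only.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 305.0, 400.0]
report = regression_report(y_true, y_pred)
for metric, value in report.items():
    print(f"{metric}: {value:.4f}")
```

Reading RMSE alongside the median absolute error and the max error is useful because RMSE is pulled upward by a few large misses, while the median describes the typical prediction.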
Next Steps: Understanding the Champion
The metrics clearly show that our refined LightGBM is the champion model. The next stage of our analysis is to dive deep into its behavior to ensure it has learned logical and robust patterns from the data.