Comprehensive analysis of model performance, robustness, and reliability
Date of Analysis: {{ report_date }}
Models Evaluated: {{ models_evaluated }}
Key Finding: {{ key_finding }}
{{ summary_text }}
This validation report evaluates multiple aspects of model performance and reliability. The key findings from each test category are summarized below.
Best Model: {{ best_model }}
Robustness Index: {{ model_metrics[0].robustness_index | string }}
Finding: The model maintains consistent performance under data perturbations up to {{ perturbation_levels[-2] }}.
Critical Parameters: Learning rate, max depth
Sensitivity: Model performance is most sensitive to learning rate changes
Calibration Error: 0.082
Reliability: Model predictions are well-calibrated with slight overconfidence in mid-range probabilities
Data Shift Resilience: Moderate
Critical Threshold: Performance degrades significantly once the distribution shift reaches 15%
This analysis evaluates the performance stability of multiple models under various levels of data perturbation. A robust model maintains consistent performance when input data contains noise or variations.
Baseline Performance: {{ model.baseline }}
Under 50% Perturbation: {{ model.perturbed }}
Performance Drop: {{ model.drop }}%
Boxplot showing the distribution of performance metrics across multiple perturbation trials. Smaller boxes indicate more consistent performance.
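For readers who want to reproduce this kind of sweep, the sketch below adds Gaussian noise of increasing magnitude to a held-out test set and tracks the resulting accuracy. It assumes a scikit-learn-style classifier; the synthetic dataset, the gradient boosting model, the noise levels, and the simple retention-based robustness index are illustrative placeholders, not the exact procedure behind the figures above.

```python
# Minimal perturbation-robustness sweep: add Gaussian noise of increasing
# magnitude to the test features and track how accuracy degrades.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
baseline = accuracy_score(y_test, model.predict(X_test))

rng = np.random.default_rng(0)
perturbation_levels = [0.1, 0.2, 0.3, 0.4, 0.5]  # noise scale as a fraction of each feature's std dev
scores = []
for level in perturbation_levels:
    noise = rng.normal(scale=level * X_test.std(axis=0), size=X_test.shape)
    scores.append(accuracy_score(y_test, model.predict(X_test + noise)))

# One simple robustness index: mean fraction of baseline performance retained.
robustness_index = float(np.mean([s / baseline for s in scores]))
drop_at_max = 100 * (baseline - scores[-1]) / baseline
print(f"baseline={baseline:.3f}  robustness_index={robustness_index:.3f}  "
      f"drop at 50% perturbation={drop_at_max:.1f}%")
```

Repeating the sweep with several random seeds yields the per-trial distribution summarized in the boxplot above.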
Understanding which features contribute most to model robustness can help prioritize data quality efforts and identify potential vulnerabilities.
Features with higher positive values are more sensitive to perturbation and may represent potential vulnerabilities.
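One way such feature-level sensitivity scores can be obtained is to perturb a single column at a time and record the drop from baseline accuracy. This is a sketch under the same noise-based setup as above; `feature_sensitivity` is a hypothetical helper, and the perturbation scale is an arbitrary choice.

```python
# Perturb one feature at a time and measure the drop from baseline accuracy;
# larger drops mark features whose corruption hurts the model most.
import numpy as np
from sklearn.metrics import accuracy_score

def feature_sensitivity(model, X_test, y_test, scale=0.5, seed=0):
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y_test, model.predict(X_test))
    drops = {}
    for j in range(X_test.shape[1]):
        X_noisy = X_test.copy()
        X_noisy[:, j] += rng.normal(scale=scale * X_test[:, j].std(),
                                    size=X_test.shape[0])
        drops[f"feature_{j}"] = baseline - accuracy_score(
            y_test, model.predict(X_noisy))
    # Largest drops first: the features most sensitive to perturbation.
    return dict(sorted(drops.items(), key=lambda kv: kv[1], reverse=True))
```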
Different types of perturbation can affect models in varying ways. This analysis identifies which types of noise or variation the model is most sensitive to.
This chart compares how the model responds to different perturbation methods at various intensity levels.
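As a sketch of how such a comparison might be set up, the helper below applies three illustrative perturbation types at a fixed intensity: Gaussian noise, randomly missing values filled with column means, and a multiplicative scaling drift. The type names, the imputation choice, and `compare_perturbation_types` itself are assumptions for illustration.

```python
# Compare the model's accuracy under three different perturbation types
# applied at the same nominal intensity.
import numpy as np
from sklearn.metrics import accuracy_score

def compare_perturbation_types(model, X_test, y_test, intensity=0.3, seed=0):
    rng = np.random.default_rng(seed)
    results = {}

    # Additive Gaussian noise proportional to each feature's spread.
    X_noise = X_test + rng.normal(scale=intensity * X_test.std(axis=0),
                                  size=X_test.shape)
    results["gaussian_noise"] = accuracy_score(y_test, model.predict(X_noise))

    # Randomly "lose" a fraction of values and fill them with column means.
    mask = rng.random(X_test.shape) < intensity
    X_missing = np.where(mask, X_test.mean(axis=0), X_test)
    results["missing_values"] = accuracy_score(y_test, model.predict(X_missing))

    # Multiplicative drift, e.g. a sensor rescaling every feature.
    results["scaling_drift"] = accuracy_score(
        y_test, model.predict(X_test * (1 + intensity)))
    return results
```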
Model | Robustness Index | {% for level in perturbation_levels %}Perturb {{ level }} | {% endfor %}
---|---|{% for level in perturbation_levels %}---|{% endfor %}
{{ model.name }} | {{ model.robustness_index }} | {% for score in model.scores %}{{ score }} | {% endfor %}
This section evaluates how well model performance holds up under various data distribution shifts and challenging conditions.
This chart illustrates how model performance changes under different types of distribution shifts, such as temporal, geographical, or demographic variations.
Model's ability to handle changes in feature distributions (covariate shift)
Model's ability to handle changes in class distributions (label shift)
Model's ability to handle changes in feature-target relationships (concept drift)
Combined resilience score across all distribution shift types
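A simple way to probe these resilience scores on a held-out set is to resample it so that either the feature distribution or the class mix changes, and then measure the relative performance drop. The sketch below does this for a covariate shift (re-weighting toward high values of one feature) and a label shift (oversampling one class); the weighting scheme and the `shift_resilience` helper are illustrative assumptions, and concept drift is not simulated here because it requires altering the labels themselves.

```python
# Resample the test set to simulate covariate shift and label shift,
# then report the relative accuracy drop for each.
import numpy as np
from sklearn.metrics import accuracy_score

def shift_resilience(model, X_test, y_test, seed=0):
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y_test, model.predict(X_test))

    # Covariate shift: sample points with probability increasing in feature 0.
    z = (X_test[:, 0] - X_test[:, 0].mean()) / X_test[:, 0].std()
    w = np.exp(2.0 * z)
    idx_cov = rng.choice(len(X_test), size=len(X_test), replace=True,
                         p=w / w.sum())
    cov_score = accuracy_score(y_test[idx_cov], model.predict(X_test[idx_cov]))

    # Label shift: oversample class 1 so the class prior changes.
    w_lab = np.where(y_test == 1, 3.0, 1.0)
    idx_lab = rng.choice(len(X_test), size=len(X_test), replace=True,
                         p=w_lab / w_lab.sum())
    lab_score = accuracy_score(y_test[idx_lab], model.predict(X_test[idx_lab]))

    return {"baseline": baseline,
            "covariate_shift_drop_pct": 100 * (baseline - cov_score) / baseline,
            "label_shift_drop_pct": 100 * (baseline - lab_score) / baseline}
```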
This chart shows how each feature's importance and impact changes under different distribution shifts.
Distribution Shift Type | Critical Threshold | Performance Drop | Affected Features |
---|---|---|---|
Temporal Shift | +6 months | 32% | feature_2, feature_7, feature_9 |
Demographic Shift | 15% change | 45% | feature_1, feature_3 |
Missing Features | 3+ features | 28% | feature_5, feature_8 |
Data Quality Degradation | 20% noise | 38% | feature_4, feature_6 |
This visualization shows how sensitive the model is to adversarial examples with different perturbation magnitudes.
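The report does not specify which attack was used; as one simple, self-contained example, the sketch below runs an FGSM-style sweep against a fitted scikit-learn LogisticRegression, where the gradient of the log-loss with respect to the input has the closed form (p - y) * w. Binary labels in {0, 1} and a plain linear model are assumptions of this sketch; gradient-based attacks on other model types need a differentiable surrogate.

```python
# FGSM-style sensitivity sweep for a fitted LogisticRegression:
# perturb each input in the direction sign((p - y) * w) and watch accuracy fall.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def fgsm_sweep(clf: LogisticRegression, X, y, epsilons=(0.0, 0.05, 0.1, 0.2)):
    w, b = clf.coef_.ravel(), clf.intercept_[0]
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(y = 1)
    grad = (p - y)[:, None] * w[None, :]     # d(log-loss) / dX, per sample
    scores = {}
    for eps in epsilons:
        X_adv = X + eps * np.sign(grad)
        scores[eps] = accuracy_score(y, clf.predict(X_adv))
    return scores  # accuracy at each perturbation magnitude
```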
This section analyzes how model hyperparameters affect overall performance and identifies the most critical parameters to tune.
This chart shows the relative influence of each hyperparameter on model performance. Longer bars indicate higher-impact parameters.
This heatmap visualizes how parameters interact with each other, revealing potential dependencies between hyperparameters.
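One way to produce both the importance bars and the interaction grid is to run a small grid search and then summarize `cv_results_`: the spread of mean CV scores attributable to each parameter gives a crude importance measure, and pivoting the scores for a pair of parameters gives a heatmap-ready matrix. The grid values, the gradient boosting estimator, and the spread-based importance definition below are illustrative assumptions, not the exact method behind this report.

```python
# Crude hyperparameter importance and a two-parameter interaction grid
# derived from GridSearchCV results.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
param_grid = {"learning_rate": [0.01, 0.05, 0.1],
              "max_depth": [3, 5, 7],
              "n_estimators": [100, 200]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3, n_jobs=-1).fit(X, y)
results = pd.DataFrame(search.cv_results_)

# Importance: how much the mean CV score moves as each parameter changes.
importance = {}
for p in param_grid:
    per_value = results.groupby(f"param_{p}")["mean_test_score"].mean()
    importance[p] = per_value.max() - per_value.min()

# Interaction grid for learning_rate x max_depth (ready for a heatmap).
interaction = results.pivot_table(index="param_learning_rate",
                                  columns="param_max_depth",
                                  values="mean_test_score")
print(importance)
print(interaction)
```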
Parameter | Optimal Range | Sensitivity | Recommendation |
---|---|---|---|
Learning Rate | 0.01 - 0.1 | High | Fine-tune within the optimal range |
Max Depth | 4 - 7 | Medium | Balance between complexity and performance |
Min Samples Split | 10 - 30 | Low | Default value is adequate
n_estimators | 100 - 300 | Medium | Higher values improve performance with diminishing returns |
Subsample | 0.7 - 0.9 | Medium | Values below 0.7 reduce model performance significantly |
colsample_bytree | 0.6 - 0.8 | Low | Minimal impact on performance |
reg_alpha | 0.01 - 1.0 | Low | Only important for preventing overfitting |
reg_lambda | 0.5 - 2.0 | Low | Only important for preventing overfitting |
These curves show how model performance changes across different values of key hyperparameters, helping identify optimal settings and sensitivity ranges.
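Curves of this kind can be generated directly with scikit-learn's `validation_curve`; the estimator, parameter name, and value range below are illustrative placeholders.

```python
# Response curve for a single hyperparameter using cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
param_range = np.array([0.005, 0.01, 0.05, 0.1, 0.3])

train_scores, val_scores = validation_curve(
    GradientBoostingClassifier(random_state=0), X, y,
    param_name="learning_rate", param_range=param_range, cv=5)

# A widening gap between the two columns signals overfitting at that setting.
for lr, tr, va in zip(param_range,
                      train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"learning_rate={lr:>6}  train={tr:.3f}  validation={va:.3f}")
```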
This visualization shows the optimization trajectory through the hyperparameter space, helping identify promising regions for further exploration.
This section evaluates how well the model's predicted probabilities correspond to the actual likelihood of correctness, a property known as calibration.
This diagram shows how well the predicted probabilities match the actual frequencies. Perfect calibration would follow the diagonal line.
Average difference between predicted probability and observed accuracy across probability bins
Maximum calibration gap in any single probability bin
Mean squared error between predicted probabilities and actual outcomes (Brier score)
Slope of the reliability curve (1.0 is ideal)
This histogram shows the distribution of predicted probabilities, revealing potential issues with over- or under-confidence.
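For reference, the sketch below computes the binned quantities behind these metrics from a vector of predicted probabilities: the expected calibration error, the largest single-bin gap, and the Brier score (the reliability-diagram points themselves can be obtained with `sklearn.calibration.calibration_curve`). The bin count and the hypothetical `calibration_report` helper are assumptions; the calibration slope would additionally require fitting a line to the reliability-curve points.

```python
# Expected calibration error, maximum single-bin gap, and Brier score
# computed from predicted probabilities with equal-width bins.
import numpy as np
from sklearn.metrics import brier_score_loss

def calibration_report(y_true, y_prob, n_bins=10):
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)

    ece, max_gap = 0.0, 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue
        gap = abs(y_true[in_bin].mean() - y_prob[in_bin].mean())
        ece += in_bin.mean() * gap          # weight each gap by bin population
        max_gap = max(max_gap, gap)

    return {"expected_calibration_error": ece,
            "max_bin_gap": max_gap,
            "brier_score": brier_score_loss(y_true, y_prob)}
```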
This chart shows how calibration quality varies across different feature values, helping identify segments where the model might be poorly calibrated.
Data Segment | Samples | Calibration Error | Confidence | Accuracy |
---|---|---|---|---|
High feature_1 values (>0.75) | 328 | 0.057 | 0.83 | 0.79 |
Low feature_1 values (<0.25) | 412 | 0.124 | 0.67 | 0.54 |
High feature_3 values (>0.8) | 276 | 0.098 | 0.92 | 0.82 |
feature_2 = 1 & feature_5 = 0 | 195 | 0.178 | 0.77 | 0.59 |
Rare combinations (<5% of data) | 87 | 0.223 | 0.81 | 0.58 |
This chart compares different calibration methods (Platt scaling, isotonic regression, temperature scaling) and their impact on model calibration.
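Two of these methods are available directly in scikit-learn through `CalibratedClassifierCV`: Platt scaling (`method="sigmoid"`) and isotonic regression. The sketch below compares them against the uncalibrated model using the Brier score; temperature scaling has no built-in scikit-learn implementation and is omitted here, and the dataset and base estimator are illustrative.

```python
# Compare uncalibrated probabilities with Platt-scaled and isotonic-calibrated
# probabilities, scored by Brier score on a held-out test set.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

base = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("uncalibrated:", brier_score_loss(y_test, base.predict_proba(X_test)[:, 1]))

for method in ("sigmoid", "isotonic"):   # Platt scaling, isotonic regression
    calibrated = CalibratedClassifierCV(
        GradientBoostingClassifier(random_state=0), method=method, cv=5
    ).fit(X_train, y_train)
    print(method, brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1]))
```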