Cross-Dataset Validation of Machine Learning Models for Breast Cancer Prognosis: An Integrative Analysis of METABRIC and TCGA Cohorts

Mohammad Beheshti; Kambiz Bahaaddini; Ali Farzaneh

doi:10.34172/dhtj.12

Digital Health Trends. 2025;2(1): 42-48.

Original Article

Mohammad Beheshti ¹, Kambiz Bahaaddini ², Ali Farzaneh ³*

¹ Cancer Registry and Research Center, University of Missouri, Columbia, Missouri, USA
² Digital Health Team, Australian College of Rural and Remote Medicine, Brisbane, Australia
³ Department of Epidemiology, Erasmus MC University Medical Center, Rotterdam, The Netherlands

*Corresponding Author: Ali Farzaneh, Email: farzanehali78@gmail.com

Abstract

Background: Breast cancer remains the most prevalent malignancy among women worldwide, characterized by substantial heterogeneity in clinical outcomes. Accurate prognostic models are crucial for optimizing treatment decisions and improving survival. Traditional statistical methods, such as the Cox proportional hazards model, often fail to capture nonlinear relationships and high-dimensional genomic interactions. Recent advances in artificial intelligence (AI) and machine learning (ML) offer novel opportunities to integrate clinical and genomic data for improved predictive performance.

Methods: A comparative analysis of multiple prognostic models was conducted using two large-scale datasets: METABRIC (n=1,904) and TCGA-BRCA (n=1,097). Six models were evaluated: Cox proportional hazards (baseline), logistic regression, random forest, support vector machine, XGBoost, and deep neural networks (DNNs). Models were trained using a 70/30 train/test split and optimized through grid search with five-fold cross-validation. Performance metrics included ROC-AUC, F1-score, and concordance index (C-index). External validation was conducted across datasets. Feature importance was assessed using SHAP (SHapley Additive exPlanations) analysis.
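
As an illustration of the two classification metrics named in the Methods, the following is a minimal sketch in plain Python. It is not the authors' code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for the example. ROC-AUC is computed here as the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (hypothetical): 1 = event observed, 0 = no event.
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.4, 0.65, 0.3, 0.5, 0.2]

# Assumed decision threshold of 0.5 for turning scores into class labels.
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

auc = roc_auc(y_true, y_score)
f1 = f1_score(y_true, y_pred)
print(f"ROC-AUC = {auc:.3f}, F1 = {f1:.3f}")
```

ROC-AUC depends only on how the scores rank the cases, whereas F1 depends on the chosen threshold, which is one reason the two are usually reported together.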

Results: XGBoost achieved the highest overall performance, with ROC-AUC scores of 0.85 (METABRIC) and 0.83 (TCGA), followed closely by DNN (ROC-AUC: 0.84 and 0.82, respectively). The baseline Cox model demonstrated lower predictive accuracy (C-index ≈ 0.65). Cross-dataset validation confirmed the robustness of XGBoost and DNN (ROC-AUC 0.78–0.81), both of which outperformed all other models. Risk stratification based on model-derived probabilities significantly separated high- and low-risk groups (log-rank P<0.001). Feature importance analysis identified both clinical factors (tumor size, nodal status, ER/HER2 status) and genomic markers (TP53, ESR1, BRCA1/2, MKI67) as key prognostic predictors.
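
To make the C-index reported for the Cox baseline concrete, the sketch below computes a simplified form of Harrell's concordance index in plain Python. It is an illustrative assumption rather than the authors' implementation: the function name and the toy follow-up data are hypothetical, and pairs with tied event times are ignored for brevity. A pair of patients is comparable when the one with the shorter follow-up had an observed event; it is concordant when that patient also has the higher predicted risk.

```python
def concordance_index(times, events, risks):
    """Simplified Harrell's C-index.

    times  : follow-up time per patient
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk score (higher = worse prognosis)
    Pairs with identical follow-up times are skipped in this sketch.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (hypothetical values): the third patient is censored.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.6, 0.1]

c_index = concordance_index(times, events, risks)
print(f"C-index = {c_index:.2f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a C-index of about 0.65 indicates modest discrimination.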

Conclusion: This study provides strong evidence that AI-driven approaches, particularly XGBoost and DNN, outperform conventional models for breast cancer prognosis by integrating clinical and genomic features. These models demonstrate high predictive accuracy, robust generalizability, and biological interpretability, underscoring their potential to advance personalized treatment strategies. Prospective validation and integration into real-world clinical workflows are essential next steps toward clinical translation.

Keywords: Breast cancer, Prognosis, Machine learning, Deep learning, Genomic data, XGBoost, Survival prediction