Digital Health Trends. 2025;2(1): 42-48.
doi: 10.34172/dhtj.12

Original Article

Cross-Dataset Validation of Machine Learning Models for Breast Cancer Prognosis: An Integrative Analysis of METABRIC and TCGA Cohorts

Mohammad Beheshti 1, Kambiz Bahaaddini 2, Ali Farzaneh 3*

1 Cancer Registry and Research Center, University of Missouri, Columbia, Missouri, USA
2 Digital Health Team, Australian College of Rural and Remote Medicine, Brisbane, Australia
3 Department of Epidemiology, Erasmus MC University Medical Center, Rotterdam, The Netherlands
*Corresponding Author: Ali Farzaneh, Email: farzanehali78@gmail.com

Abstract

Background: Breast cancer remains the most prevalent malignancy among women worldwide and is characterized by substantial heterogeneity in clinical outcomes. Accurate prognostic models are crucial for optimizing treatment decisions and improving survival. Traditional statistical methods, such as the Cox proportional hazards model, often fail to capture nonlinear relationships and high-dimensional genomic interactions. Recent advances in artificial intelligence (AI) and machine learning (ML) offer novel opportunities to integrate clinical and genomic data for improved predictive performance.

Methods: A comparative analysis of multiple prognostic models was conducted using two large-scale datasets: METABRIC (n=1,904) and TCGA-BRCA (n=1,097). Six models were evaluated: Cox proportional hazards (baseline), logistic regression, random forest, support vector machine, XGBoost, and deep neural networks (DNNs). Models were trained using a 70/30 split and optimized through grid search with five-fold cross-validation. Performance metrics included ROC-AUC, F1-score, and concordance index (C-index). External validation was conducted across datasets. Feature importance was assessed using SHAP analysis.
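As an illustration of two of the performance metrics named above, the following minimal sketch computes Harrell's concordance index and a rank-based ROC-AUC in plain Python. It is not the authors' code: the function names `concordance_index` and `roc_auc`, the tie handling (ties in predicted risk count as half), and the toy inputs are all assumptions made for this example.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data (illustrative sketch).

    A pair (i, j) is comparable when subject i has the shorter follow-up
    time and an observed event (events[i] == 1). The pair is concordant
    when subject i also has the higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # assumption: tied risks count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not taken from METABRIC or TCGA)
c_index = concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.3, 0.4, 0.5])
auc = roc_auc([1, 0, 1, 0], [0.8, 0.3, 0.4, 0.6])
```

Both functions use a quadratic pairwise loop, which is easy to verify by hand but slow on cohorts of this size; a library implementation would normally be used in practice.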

Results: XGBoost achieved the highest overall performance, with ROC-AUC scores of 0.85 (METABRIC) and 0.83 (TCGA), followed closely by DNN (ROC-AUC: 0.84 and 0.82, respectively). The traditional Cox models demonstrated lower predictive accuracy (C-index ~ 0.65). Cross-dataset validation confirmed the robustness of XGBoost and DNN (ROC-AUC 0.78–0.81), outperforming all other models. Risk stratification based on model-derived probabilities significantly separated high- and low-risk groups (log-rank P<0.001). Feature importance analysis identified both clinical factors (tumor size, nodal status, ER/HER2 status) and genomic markers (TP53, ESR1, BRCA1/2, MKI67) as key prognostic predictors.
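The log-rank comparison of high- and low-risk groups reported above can be sketched as follows. This is a hedged illustration of a standard two-group log-rank test, not the authors' analysis: the function name `logrank_test`, the toy follow-up times, and the assumption that patients have already been split into two groups (the abstract does not state the probability threshold) are all hypothetical.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi_square, p_value) on 1 d.f.

    events_* use 1 for an observed event and 0 for censoring.
    """
    # Distinct times at which at least one event was observed
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the event count in group A
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("variance is zero; the test is undefined")
    chi_square = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Toy groups (hypothetical): early events in "high risk", late in "low risk"
high_times, high_events = [1, 2, 3, 4, 5], [1, 1, 1, 1, 1]
low_times, low_events = [10, 11, 12, 13, 14], [1, 1, 1, 1, 1]
chi_square, p_value = logrank_test(high_times, high_events, low_times, low_events)
```

With well-separated groups such as these, the statistic is large and the p-value small; identical groups give a statistic of zero.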

Conclusion: This study provides strong evidence that AI-driven approaches, particularly XGBoost and DNN, outperform conventional models for breast cancer prognosis by integrating clinical and genomic features. These models demonstrate high predictive accuracy, robust generalizability, and biological interpretability, underscoring their potential to advance personalized treatment strategies. Prospective validation and integration into real-world clinical workflows are essential next steps toward clinical translation.





Submitted: 22 Aug 2025
Revision: 07 Nov 2025
Accepted: 18 Dec 2025
ePublished: 19 Dec 2025