Abstract
Background: Cardiovascular diseases (CVDs) remain a leading cause of mortality, demanding timely and accurate diagnosis. Traditional clinical assessments are often prone to error, highlighting the need for predictive models that leverage large-scale clinical data, including data mining techniques that can extract patterns from complex medical datasets. This study compared classification-based data mining algorithms for predicting CVDs and evaluated their performance across multiple metrics to identify the most effective predictive model for clinical applications.
Methods: The UCI Heart Disease dataset (270 records with 14 clinical attributes) was used. Data preprocessing involved cleaning, normalization, discretization, and partitioning into training (70%) and testing (30%) sets. Naive Bayes (NB), artificial neural network (ANN), k-nearest neighbors (kNN), support vector machine (SVM), and classification and regression tree (CART) algorithms were implemented using the Orange data mining software. Model performance was evaluated by accuracy, sensitivity (recall), specificity, precision, F-measure, and area under the ROC curve (AUC), using hold-out validation and 5-fold cross-validation. Feature importance and decision rules were extracted from tree-based models for interpretability.
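The evaluation protocol described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: the study used Orange, whereas this sketch substitutes scikit-learn, and it runs on a synthetic stand-in dataset (270 rows, 13 predictors, binary target) rather than the actual heart disease records. All hyperparameters, the random seed, and the `results` dictionary are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset (assumption): 270 rows, 13 predictors.
X, y = make_classification(n_samples=270, n_features=13, n_informative=6,
                           random_state=0)

# 70% / 30% hold-out partition, stratified by class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# The five classifier families named in the Methods; settings are illustrative.
models = {
    "NB": GaussianNB(),
    "ANN": make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                                       random_state=0)),
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVM": make_pipeline(StandardScaler(), SVC(random_state=0)),
    "CART": DecisionTreeClassifier(random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    # Hold-out accuracy: fit on the training split, score on the test split.
    holdout_acc = model.fit(X_train, y_train).score(X_test, y_test)
    # 5-fold cross-validated accuracy over the full dataset.
    cv_acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    results[name] = (holdout_acc, cv_acc)
    print(f"{name}: hold-out accuracy={holdout_acc:.3f}, 5-fold accuracy={cv_acc:.3f}")
```

Because the data here are synthetic, the printed accuracies will not match the figures reported in the Results; the sketch only shows how hold-out and cross-validated estimates are obtained side by side.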
Results: SVM and NB achieved the highest overall predictive performance (SVM: accuracy 84.44%, sensitivity 86.00%, specificity 82.50%, AUC 0.9136; NB: accuracy 84.07%, AUC 0.9133). ANN and kNN demonstrated moderate predictive ability, while CART (accuracy: 78.52%) provided interpretable decision rules. Decision tree (DT) analysis identified thalassemia status, chest pain type, and the number of major vessels colored by fluoroscopy as the most influential attributes. Several clinically interpretable rules were extracted, offering potential guidance for risk assessment. Statistical comparisons indicated no significant difference between SVM and NB performance, suggesting that both models provide reliable predictions.
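To make the relationship between the reported metrics concrete, the sketch below derives accuracy, sensitivity, specificity, precision, and F-measure from the four cells of a binary confusion matrix. The counts are hypothetical: the abstract does not report a confusion matrix, and these values were chosen only because they sum to 270 records and are arithmetically consistent with the SVM percentages quoted above. The function name `binary_metrics` is likewise an illustrative choice.

```python
def binary_metrics(tp, fn, tn, fp):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }

# Hypothetical counts (assumption, not taken from the study): 270 records total.
m = binary_metrics(tp=129, fn=21, tn=99, fp=21)
for key, value in m.items():
    print(f"{key}: {value:.4f}")
```

With these illustrative counts, sensitivity is 129/150, specificity is 99/120, and accuracy is 228/270, which shows how a single confusion matrix determines all three figures at once.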
Conclusion: SVM and NB offer robust predictive capabilities for CVD, outperforming traditional statistical approaches. DT models provide additional interpretability, facilitating clinical understanding and application. These findings underscore the importance of evaluating multiple predictive models on context-specific datasets to identify the best approaches for risk assessment, resource allocation, and quality-of-care improvement, thereby enhancing early detection and supporting evidence-based CVD management.