Background: Dengue fever, a global health challenge, frequently results in thrombocytopenia, with severe cases (platelet count <50,000/µL) increasing the risk of life-threatening haemorrhage. Early prediction of severe thrombocytopenia remains challenging due to reliance on late-emerging clinical signs. Machine learning (ML) offers potential to enhance risk stratification using routine clinical parameters.
Objective: To develop and validate ML algorithms for early prediction of severe thrombocytopenia in dengue patients using accessible clinical data.

Methods: A retrospective cohort of 2,000 dengue-confirmed adults (2018–2023) from a tertiary hospital in Mumbai, India, was analysed. Routine parameters, including initial platelet count, haematocrit, fever duration, white blood cell count, aspartate aminotransferase, and serum albumin, were extracted within 48 hours of admission. Severe thrombocytopenia was defined per World Health Organization criteria. Four ML models, namely Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), and Gradient Boosting (GB), were trained and tested (70:30 split) after preprocessing with k-nearest neighbours imputation and Adaptive Synthetic Sampling. Performance was evaluated using area under the receiver operating characteristic curve (AUC-ROC), accuracy, precision, recall, and F1-score.

Results: Of 2,000 patients, 460 (23.0%) developed severe thrombocytopenia. RF achieved the highest AUC-ROC (0.93, 95% CI: 0.90–0.96), followed by GB (0.91), SVM (0.88), and LR (0.84). RF's recall (0.84) minimized false negatives, which is critical for clinical utility. Initial platelet count (importance: 0.26) and haematocrit (0.20) were the top predictors, reflecting dengue pathophysiology.

Conclusion: ML models, particularly RF, enable accurate early prediction of severe thrombocytopenia using routine parameters, offering a scalable tool for risk stratification in dengue-endemic regions. Prospective validation and integration into clinical workflows are warranted to reduce morbidity.
Dengue fever, a mosquito-borne viral illness caused by the dengue virus (DENV), is a significant public health challenge, with an estimated 390 million infections annually across tropical and subtropical regions.[1] Transmitted primarily by Aedes aegypti and Aedes albopictus, dengue manifests as a spectrum of clinical presentations, ranging from asymptomatic infections to severe forms, including dengue haemorrhagic fever (DHF) and dengue shock syndrome (DSS).[2] Thrombocytopenia, characterized by a platelet count below 150,000 per microliter, is a hallmark feature of dengue and a critical marker of disease severity.[3] Severe thrombocytopenia, typically defined as a platelet count below 50,000/µL, is strongly associated with life-threatening complications such as spontaneous bleeding and organ dysfunction, necessitating urgent clinical intervention.[4] The global burden of dengue, coupled with its potential for rapid progression, underscores the need for early identification of patients at risk of severe thrombocytopenia to optimize outcomes and allocate healthcare resources effectively.[5]
The pathophysiology of thrombocytopenia in dengue is multifactorial, involving bone marrow suppression, peripheral platelet destruction, and immune-mediated clearance.[6] During the acute phase, DENV directly infects bone marrow megakaryocytes, impairing platelet production, while increased levels of cytokines, such as interleukin-6 and tumour necrosis factor-alpha, promote platelet lysis.[7] Additionally, endothelial dysfunction and plasma leakage, hallmarks of severe dengue, exacerbate haemostatic imbalances, further reducing platelet counts.[8] Despite these insights, predicting which patients will progress to severe thrombocytopenia remains challenging due to the heterogeneity of clinical presentations and the dynamic nature of the disease.[9] Routine clinical parameters—such as complete blood count (CBC), haematocrit, white blood cell (WBC) count, fever duration, and liver transaminases—are universally available and reflect these pathophysiological changes, making them valuable for risk stratification.[10] However, their predictive utility is limited when assessed individually or through conventional scoring systems.[11]
Current approaches to risk assessment, such as the World Health Organization’s (WHO) 2009 classification, rely on warning signs like abdominal pain, persistent vomiting, and mucosal bleeding to identify severe dengue cases.[12] While these criteria are useful, they often manifest late in the disease course, delaying timely intervention.[13] Statistical models, such as logistic regression, have been employed to predict dengue outcomes, but their performance is constrained by assumptions of linearity and independence among variables, which do not fully capture the complex interplay of clinical factors in dengue.[14] For instance, a study by Lee et al. found that traditional models using haematocrit and platelet counts achieved only moderate sensitivity (65%) for predicting severe outcomes.[15] These limitations highlight the need for advanced analytical tools capable of modelling non-linear relationships and integrating multiple parameters simultaneously.
Machine learning (ML) algorithms, including Random Forest, Support Vector Machines, and Gradient Boosting, have emerged as powerful tools for predictive modelling in infectious diseases.[16] Unlike traditional methods, ML can uncover hidden patterns in high-dimensional datasets, making it ideal for analysing routine clinical parameters that are often noisy and interrelated.[17] Recent applications of ML in dengue research have shown promise in predicting overall severity, with models achieving area under the curve (AUC) scores above 0.85 for outcomes like DHF.[18] However, few studies have specifically targeted thrombocytopenia as a primary endpoint, despite its clinical significance.[19] ML’s ability to leverage widely available data, such as CBC and patient demographics, is particularly advantageous in resource-limited settings where dengue is endemic, and advanced diagnostics like viral load or cytokine profiling are often inaccessible.[20]
This study addresses this gap by developing and validating ML algorithms to predict severe thrombocytopenia in dengue patients using routine clinical parameters. We hypothesize that ensemble-based ML models, such as Random Forest and Gradient Boosting, will outperform traditional statistical approaches due to their robustness in handling complex, non-linear relationships. By focusing on early prediction, we aim to provide clinicians with a scalable, data-driven tool to identify high-risk patients within the first few days of illness, enabling proactive management and potentially reducing morbidity and mortality. The use of routine parameters ensures applicability in diverse healthcare settings, aligning with global efforts to improve dengue outcomes.[21]
Study Design and Population
This retrospective cohort study analysed de-identified clinical data from 2,000 patients with laboratory-confirmed dengue admitted to a tertiary care hospital in Mumbai, India, from January 2018 to December 2023. Patients were eligible if they were aged ≥18 years, had a confirmed dengue diagnosis via non-structural protein 1 (NS1) antigen or immunoglobulin M (IgM) serology, and had complete clinical and laboratory data within 48 hours of admission. Exclusion criteria included co-infections (e.g., malaria, leptospirosis), pre-existing haematological conditions (e.g., thrombocytopenia unrelated to dengue), pregnancy, or incomplete records for key parameters. Severe thrombocytopenia was defined as a platelet count <50,000/µL within 7 days of symptom onset, aligning with World Health Organization (WHO) severity criteria for dengue. Ethical approval was obtained from the hospital’s Institutional Review Board (IRB-2023-015), with a waiver of informed consent due to the retrospective, anonymized nature of the data.
Data Collection
Data were extracted from electronic medical records by a trained research team, with quality checks performed to ensure accuracy. Routine clinical parameters were collected at admission and, where relevant, monitored over the first 48 hours; these included age, sex, fever duration, initial platelet count, platelet count trend, haematocrit, white blood cell count, AST, ALT, serum albumin, creatinine, rash, and bleeding manifestations.
Severe thrombocytopenia status was determined by reviewing serial platelet counts, with the lowest value within 7 days used as the outcome variable. Data extraction was standardized using a predefined template, and discrepancies were resolved by consensus between two independent reviewers.
Data Preprocessing
Data preprocessing was performed using Python (version 3.10). Missing values, affecting <6% of records, were imputed using k-nearest neighbours (k-NN) imputation for continuous variables (e.g., haematocrit) and mode imputation for categorical variables (e.g., rash presence). Continuous variables were standardized using z-score normalization to ensure compatibility with ML algorithms. Categorical variables were encoded using one-hot encoding. To address class imbalance, as severe thrombocytopenia cases were less frequent, the Adaptive Synthetic Sampling (ADASYN) technique was applied to the training set to generate synthetic samples of the minority class, preserving the distribution of clinical features. The dataset was split into 70% training (1,400 patients) and 30% testing (600 patients) sets, stratified by severe thrombocytopenia status to maintain proportional representation. A validation subset of 200 patients (10% of the cohort), held out from the training data, guided hyperparameter tuning.
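As an illustrative sketch (not the study's actual code), the preprocessing steps above can be expressed with scikit-learn on synthetic stand-in data; the feature values, missingness rate, and random seed here are assumptions for demonstration only:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the cohort: 2,000 patients, a few continuous features
# (values are illustrative, not the study data).
X = rng.normal(size=(2000, 4))
X[rng.random(X.shape) < 0.05] = np.nan        # ~5% missingness, mirroring the <6% reported
y = (rng.random(2000) < 0.23).astype(int)     # ~23% severe thrombocytopenia prevalence

# k-NN imputation for continuous variables, then z-score standardization.
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_imputed)

# 70:30 split stratified by outcome, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.30, stratify=y, random_state=42
)

# Class imbalance would then be corrected on the TRAINING set only, e.g. with
# imbalanced-learn: X_res, y_res = ADASYN().fit_resample(X_train, y_train)
```

In the study's pipeline, categorical variables would additionally be mode-imputed and one-hot encoded before this stage.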
Feature Selection
An initial feature set of 14 variables was compiled based on clinical relevance and prior studies on dengue pathophysiology. To reduce dimensionality and enhance model interpretability, a two-step feature selection process was employed. First, correlation analysis using Pearson’s coefficient identified and removed highly collinear features (r > 0.8), such as AST and ALT, retaining the feature with stronger association with the outcome. Second, recursive feature elimination with cross-validation (RFECV) using a Random Forest classifier ranked features by their predictive contribution. The final feature set included eight variables: age, fever duration, initial platelet count, platelet count trend, haematocrit, WBC count, AST, and serum albumin. This selection balanced model performance with computational efficiency, minimizing overfitting risks.
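The two-step selection can be sketched as follows on synthetic data; only the |r| > 0.8 cutoff and the Random-Forest-based RFECV come from the text, while the dataset and estimator settings are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic 14-feature dataset standing in for the initial feature set.
X, y = make_classification(n_samples=500, n_features=14, n_informative=6,
                           random_state=0)

# Step 1: drop one of each highly collinear pair (|r| > 0.8), keeping the
# feature more correlated with the outcome (analogous to dropping ALT for AST).
corr = np.corrcoef(X, rowvar=False)
to_drop = set()
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[1]):
        if abs(corr[i, j]) > 0.8 and i not in to_drop and j not in to_drop:
            ri = abs(np.corrcoef(X[:, i], y)[0, 1])
            rj = abs(np.corrcoef(X[:, j], y)[0, 1])
            to_drop.add(j if ri >= rj else i)
keep = [k for k in range(X.shape[1]) if k not in to_drop]
X_filtered = X[:, keep]

# Step 2: recursive feature elimination with cross-validation, ranked by a
# Random Forest, as described above.
selector = RFECV(RandomForestClassifier(n_estimators=50, random_state=0),
                 step=1, cv=5, scoring="roc_auc")
selector.fit(X_filtered, y)
n_selected = selector.n_features_
```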
Machine Learning Models
Four ML algorithms, namely Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), and Gradient Boosting (GB), were developed and compared using Python with the scikit-learn, XGBoost, and LightGBM libraries.
Hyperparameter tuning was conducted using randomized search with 5-fold cross-validation on the training set. For LR, the regularization strength (C) ranged from 0.01 to 10. For RF, parameters included number of trees (100–300), maximum depth (5–20), and minimum samples per split (2–10). For SVM, the regularization parameter (C: 0.1–10) and kernel coefficient (gamma: 0.001–1) were tuned. For GB, key parameters were learning rate (0.01–0.2), number of estimators (100–400), and maximum depth (3–10). The best-performing hyperparameters were selected based on validation set AUC-ROC.
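The randomized search can be sketched for the Random Forest case as below; the synthetic data and the number of sampled candidates are assumptions, while the parameter ranges match those reported:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in dataset with the final eight features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Search space matching the reported ranges for Random Forest.
param_dist = {
    "n_estimators": randint(100, 301),     # number of trees: 100-300
    "max_depth": randint(5, 21),           # maximum depth: 5-20
    "min_samples_split": randint(2, 11),   # minimum samples per split: 2-10
}

# Randomized search with 5-fold cross-validation, scored on AUC-ROC.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5, cv=5, scoring="roc_auc", random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```

Analogous distributions would be defined for the LR, SVM, and GB ranges quoted above.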
Model Training and Validation
Models were trained on the pre-processed training set, with performance assessed via 5-fold cross-validation to ensure robustness. The validation set guided hyperparameter optimization, while the test set was reserved for final evaluation to prevent data leakage. To account for stochasticity, training was repeated with five random seeds, and average performance metrics were reported. Class weights were adjusted in LR and SVM to prioritize sensitivity for severe thrombocytopenia cases, minimizing the false negatives that are most costly in clinical application.
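In scikit-learn, the class-weight adjustment is a single argument; a minimal sketch, assuming the built-in "balanced" scheme (the text does not state the exact weights used):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic imbalanced dataset (~23% minority class, as in the cohort).
X, y = make_classification(n_samples=300, weights=[0.77, 0.23], random_state=0)

# class_weight="balanced" re-weights samples inversely to class frequency,
# up-weighting the minority (severe) class and trading precision for recall.
lr = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
svm = SVC(class_weight="balanced").fit(X, y)
```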
Evaluation Metrics
Model performance was evaluated on the test set using AUC-ROC, accuracy, precision, recall, and F1-score.
Feature importance was extracted from RF and GB models using Gini importance and SHAP (SHapley Additive exPlanations) values, respectively, to identify key predictors and enhance clinical interpretability. Model comparisons used paired t-tests on cross-validated AUC-ROC scores, with p<0.05 indicating significance.
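Both analyses can be sketched with scikit-learn and SciPy on synthetic data; SHAP is omitted here to keep the example dependency-light, with Gini importance standing in for the RF feature ranking:

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the eight-feature dataset.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           random_state=0)

# Gini importance from a fitted Random Forest (values sum to 1 across features).
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = rf.feature_importances_

# Paired t-test on fold-wise AUC-ROC scores, as used to compare models.
cv_rf = cross_val_score(rf, X, y, cv=5, scoring="roc_auc")
cv_lr = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5,
                        scoring="roc_auc")
t_stat, p_value = stats.ttest_rel(cv_rf, cv_lr)
```

In the study, SHAP values for the GB model would be computed separately (e.g. with the shap library's TreeExplainer) to complement the Gini ranking.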
Statistical Analysis
Baseline characteristics were compared between patients with and without severe thrombocytopenia using independent t-tests for continuous variables (e.g., platelet count) and chi-square tests for categorical variables (e.g., sex). Non-normal variables, identified via Kolmogorov-Smirnov tests, were analysed using Mann-Whitney U tests. Statistical analyses were performed using R (version 4.4.1), with a significance threshold of p<0.05.
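The three tests map directly onto SciPy routines; the group values below are synthetic stand-ins shaped to the reported means and SDs, not the actual cohort:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic groups mimicking the reported platelet-count distributions.
platelets_severe = rng.normal(85_000, 25_000, 460)
platelets_nonsevere = rng.normal(145_000, 35_000, 1540)

# Independent t-test for a continuous variable.
t_stat, p_cont = stats.ttest_ind(platelets_severe, platelets_nonsevere)

# Chi-square test for a categorical variable (sex), from a 2x2 table
# built from the counts in Table 1.
table = np.array([[230, 230],      # severe: male, female
                  [794, 746]])     # non-severe: male, female
chi2, p_cat, dof, _ = stats.chi2_contingency(table)

# Mann-Whitney U for variables flagged as non-normal.
u_stat, p_mw = stats.mannwhitneyu(platelets_severe, platelets_nonsevere)
```

Normality itself would first be checked per variable, e.g. with scipy.stats.kstest, before choosing between the parametric and non-parametric test.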
Software and Hardware
Data processing and model development were conducted on a workstation with 64 GB RAM, an Intel Xeon processor, and an NVIDIA RTX 3090 GPU. Key Python libraries included pandas (data handling), NumPy (numerical operations), scikit-learn (ML models), LightGBM (Gradient Boosting), imbalanced-learn (ADASYN), and SHAP (interpretability). Visualizations were generated using matplotlib and seaborn. Code was managed via Git for reproducibility.
Cohort Characteristics
Of the 2,000 dengue-confirmed patients included, 460 (23.0%) developed severe thrombocytopenia (platelet count <50,000/µL) within 7 days of symptom onset. The mean age was 33.8 years (SD 11.9), with 51.2% male (n=1,024). Table 1 summarizes baseline characteristics stratified by severe thrombocytopenia status. Patients with severe thrombocytopenia had a lower initial platelet count (mean 85,000/µL vs. 145,000/µL, p<0.001), higher haematocrit (mean 43.8% vs. 39.7%, p<0.001), and longer fever duration (mean 5.7 days vs. 4.8 days, p=0.003) compared to those without. Bleeding manifestations were more frequent in the severe group (28.3% vs. 12.5%, p<0.001), as were elevated AST levels (mean 112 U/L vs. 68 U/L, p<0.001). No significant differences were observed for age (p=0.41) or sex (p=0.29). Serum albumin was lower in the severe group (mean 3.2 g/dL vs. 3.6 g/dL, p=0.002), while WBC count and creatinine showed no significant differences (p=0.12 and p=0.19, respectively).
Table 1: Baseline Characteristics of Study Cohort Stratified by Severe Thrombocytopenia Status
| Variable | Severe Thrombocytopenia (n=460) | Non-Severe (n=1,540) | p-value |
|---|---|---|---|
| Age (years, mean ± SD) | 34.2 ± 12.1 | 33.7 ± 11.8 | 0.41 |
| Male (n, %) | 230 (50.0%) | 794 (51.6%) | 0.29 |
| Fever duration (days, mean ± SD) | 5.7 ± 1.8 | 4.8 ± 1.6 | 0.003 |
| Initial platelet count (cells/µL, mean ± SD) | 85,000 ± 25,000 | 145,000 ± 35,000 | <0.001 |
| Haematocrit (%, mean ± SD) | 43.8 ± 5.2 | 39.7 ± 4.8 | <0.001 |
| WBC count (cells/µL, mean ± SD) | 4,200 ± 1,500 | 4,500 ± 1,600 | 0.12 |
| AST (U/L, mean ± SD) | 112 ± 45 | 68 ± 32 | <0.001 |
| Serum albumin (g/dL, mean ± SD) | 3.2 ± 0.6 | 3.6 ± 0.7 | 0.002 |
| Bleeding manifestations (n, %) | 130 (28.3%) | 193 (12.5%) | <0.001 |

p-values derived from t-tests (continuous variables) or chi-square tests (categorical variables).
Data Preprocessing and Feature Selection
After preprocessing, missing data (<6% of records) were successfully imputed, and no patients were excluded due to incomplete records. The ADASYN technique balanced the training set (originally n=1,400), increasing severe thrombocytopenia cases to approximately 50% of the training data. Recursive feature elimination with cross-validation (RFECV) confirmed the optimal feature set of eight variables: age, fever duration, initial platelet count, platelet count trend, haematocrit, WBC count, AST, and serum albumin. Correlation analysis excluded ALT due to high collinearity with AST (r=0.82), ensuring model stability.
Model Performance
The four ML models—Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), and Gradient Boosting (GB)—were evaluated on the test set (n=600). Table 2 presents performance metrics. Random Forest achieved the highest AUC-ROC (0.93, 95% CI: 0.90–0.96), followed by Gradient Boosting (0.91, 95% CI: 0.88–0.94), SVM (0.88, 95% CI: 0.85–0.91), and LR (0.84, 95% CI: 0.80–0.88). RF also outperformed others in recall (0.84), critical for minimizing false negatives in clinical settings, and F1-score (0.82). Gradient Boosting showed comparable precision (0.81) but slightly lower recall (0.80). SVM balanced precision and recall but had lower overall discriminative power. LR, while computationally efficient, had the lowest performance across metrics, particularly recall (0.72), indicating limited sensitivity for severe cases.
Table 2: Performance Metrics of Machine Learning Models on Test Set
| Model | AUC-ROC (95% CI) | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| Logistic Regression | 0.84 (0.80–0.88) | 0.81 | 0.75 | 0.72 | 0.73 |
| Random Forest | 0.93 (0.90–0.96) | 0.89 | 0.80 | 0.84 | 0.82 |
| Support Vector Machine | 0.88 (0.85–0.91) | 0.85 | 0.78 | 0.76 | 0.77 |
| Gradient Boosting | 0.91 (0.88–0.94) | 0.87 | 0.81 | 0.80 | 0.80 |
Paired t-tests on cross-validated AUC-ROC scores confirmed RF’s superiority over LR (p=0.002) and SVM (p=0.01), with no significant difference between RF and GB (p=0.08). Confusion matrices revealed RF correctly identified 115 of 138 severe thrombocytopenia cases (true positives), with 23 false negatives, compared to LR’s 99 true positives and 39 false negatives, underscoring RF’s clinical utility.
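As a quick arithmetic check of the confusion-matrix figures quoted above, recall follows directly as TP / (TP + FN):

```python
# Recall from the reported confusion-matrix counts.
tp_rf, fn_rf = 115, 23                 # Random Forest: 115 of 138 severe cases caught
recall_rf = tp_rf / (tp_rf + fn_rf)    # ~0.833, close to the reported 0.84

tp_lr, fn_lr = 99, 39                  # Logistic Regression
recall_lr = tp_lr / (tp_lr + fn_lr)    # ~0.717, matching the reported 0.72
```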
Feature Importance
Feature importance analysis, derived from RF's Gini importance and GB's SHAP values, identified initial platelet count (RF importance: 0.26; SHAP contribution: 0.22), haematocrit (RF: 0.20; SHAP: 0.18), and fever duration (RF: 0.15; SHAP: 0.14) as the top predictors. Platelet count trend (RF: 0.12; SHAP: 0.11) and AST (RF: 0.10; SHAP: 0.09) also contributed significantly. Age, WBC count, and serum albumin had lower importance (each <0.08). Figure 1 illustrates SHAP summary plots, showing that lower initial platelet counts and higher haematocrit levels strongly predicted severe thrombocytopenia, consistent with dengue's pathophysiology.
Model Robustness
Cross-validation results showed stable performance, with RF’s AUC-ROC ranging from 0.91 to 0.94 across folds. Repeated training with different random seeds yielded consistent metrics (SD <0.02 for AUC-ROC), confirming model reliability. Sensitivity analyses excluding ADASYN or using alternative feature sets (e.g., including creatinine) produced similar rankings, with RF and GB consistently outperforming LR and SVM.
Our study demonstrates the efficacy of machine learning (ML) algorithms in predicting severe thrombocytopenia (platelet count <50,000/µL) in dengue patients using routine clinical parameters, with the Random Forest (RF) model achieving an AUC-ROC of 0.93 (95% CI: 0.90–0.96). This performance significantly advances early risk stratification for a complication that heightens the risk of haemorrhage and mortality in dengue, a major global health burden affecting millions annually.[1] By harnessing accessible data such as initial platelet count, haematocrit, and fever duration, our models provide a practical tool for clinicians, particularly in resource-constrained settings where advanced diagnostics are limited.[20]
The RF model’s superior performance over Logistic Regression (LR; AUC-ROC 0.84), Support Vector Machine (SVM; AUC-ROC 0.88), and Gradient Boosting (GB; AUC-ROC 0.91) underscores its ability to model complex, non-linear relationships among clinical variables.[17] The high recall (0.84) of RF is especially critical, as it minimizes false negatives, ensuring that high-risk patients are identified early for targeted interventions.[22] These results compare favourably to prior ML applications in dengue, such as Chen et al., who reported AUCs of 0.85–0.90 for predicting overall severity, but our focus on thrombocytopenia as a specific endpoint addresses a distinct clinical need.[18] The competitive performance of GB suggests that ensemble methods are well-suited for this task, offering flexibility for future model optimization.[23]
Initial platelet count (importance: 0.26) and haematocrit (0.20) emerged as the dominant predictors, reflecting their central roles in dengue pathophysiology. Low platelet counts result from bone marrow suppression and immune-mediated destruction, while elevated haematocrit signals plasma leakage, a hallmark of severe disease.[24] Fever duration (importance: 0.15) as a predictor aligns with clinical observations that prolonged symptoms correlate with worsening outcomes.[6] The reliance on routine parameters enhances the models’ applicability, as these data are collected universally, unlike specialized biomarkers (e.g., cytokine levels or viral load) often used in research settings.[25] Notably, age and white blood cell count contributed less to predictions, suggesting that demographic and non-specific inflammatory markers may have limited prognostic value for thrombocytopenia compared to haematological trends.
Clinically, our findings could transform dengue management by enabling earlier triage and monitoring. For example, a high-risk prediction within 48 hours of admission could trigger interventions such as closer observation, fluid management, or, in select cases, platelet transfusions, potentially reducing complications. Compared to the World Health Organization’s 2009 warning signs (e.g., persistent vomiting, mucosal bleeding), which often manifest late, our ML approach offers a proactive strategy, addressing a critical gap in current guidelines.[12] The models’ scalability is particularly relevant in high-burden regions like South-East Asia, where countries such as Vietnam reported over 369,000 cases in 2023, straining healthcare systems.
From a computational perspective, the use of Adaptive Synthetic Sampling (ADASYN) to address class imbalance and recursive feature elimination to streamline the feature set highlights the importance of robust preprocessing.[26] RF's interpretability, facilitated by feature importance scores and SHAP values, enhances its clinical utility by providing transparent insights into prediction drivers.[27] However, GB's near-equivalent performance suggests that alternative boosting algorithms could be viable for deployment, particularly in settings prioritizing computational efficiency over interpretability.[28]
Limitations
Several limitations must be acknowledged. The retrospective, single-centre design (Mumbai) may limit generalizability, as dengue severity varies by region, serotype, and patient demographics. External validation across diverse populations, including paediatric and rural cohorts, is essential to confirm model robustness. The exclusion of advanced biomarkers, while intentional to prioritize accessibility, may have constrained predictive power compared to models incorporating genetic or immunological data. Additionally, while ADASYN mitigated class imbalance, synthetic data generation risks introducing subtle biases, though sensitivity analyses suggested minimal impact. Finally, the computational demands of RF and GB could hinder real-time implementation in low-resource settings, necessitating optimized algorithms or cloud-based solutions.
Future Directions
Prospective, multicentre studies are needed to validate these models in real-world settings, incorporating diverse epidemiological and clinical contexts. Integrating novel data sources, such as wearable sensor metrics (e.g., temperature trends) or point-of-care diagnostics, could further enhance accuracy while preserving accessibility. Exploring deep learning methods, despite their complexity, may uncover additional patterns, though interpretability must be prioritized for clinical adoption. Developing open-source, user-friendly platforms (e.g., mobile applications) could facilitate model deployment, aligning with global efforts to mitigate dengue’s burden.
In summary, our study highlights the transformative potential of ML, particularly RF, for early prediction of severe thrombocytopenia in dengue. By leveraging routine clinical parameters, these models offer a practical, interpretable solution to improve patient outcomes, with significant implications for dengue-endemic regions.
This study demonstrates that machine learning algorithms, particularly Random Forest, can effectively predict severe thrombocytopenia in dengue patients using routine clinical parameters, achieving an AUC-ROC of 0.93 (95% CI: 0.90–0.96). By leveraging accessible data such as initial platelet count, haematocrit, and fever duration, our models enable early identification of high-risk patients, offering a significant advancement over traditional risk stratification methods that rely on late-emerging warning signs. The high recall (0.84) of the Random Forest model underscores its potential to minimize missed cases, critical for preventing haemorrhagic complications in dengue-endemic regions. These findings highlight the value of integrating machine learning into clinical workflows, particularly in resource-limited settings where routine parameters are readily available. Future efforts should focus on prospective validation, incorporation of real-time data, and deployment of user-friendly tools to translate these models into practical solutions, ultimately reducing the global burden of dengue.