Discrepancies in Stroke Distribution and Dataset Origin in Machine Learning for Stroke

Lohit Velagapudi, Nikolaos Mouchtouris, Michael P. Baldassari, David Nauheim, Omaditya Khanna, Fadi Al Saiegh, Nabeel Herial, M. Reid Gooch, Stavropoula Tjoumakaris, Robert H. Rosenwasser, Pascal Jabbour

Research output: Contribution to journal › Article › peer-review

4 Scopus citations


Background: Machine learning algorithms depend on accurate and representative training datasets in order to become valuable clinical tools that generalize widely to a varied population. Aims: We aim to review machine learning uses in the stroke literature, assess the geographic distribution of the datasets and patient cohorts used to train these models, and compare it to the geographic distribution of stroke to evaluate for disparities. Summary of review: 582 studies were identified on initial searching of the PubMed database. Title and abstract screening excluded 489 papers, and 106 full texts were assessed. Of these 106 studies, 79 were excluded because they used cohorts from outside the United States or were review articles or editorials. 27 studies were thus included in this analysis. Of the 27 studies included, 7 (25.9%) used patient data from California, 6 (22.2%) were multicenter, 3 (11.1%) used data from Massachusetts, 2 (7.4%) each from Illinois, Missouri, and New York, and 1 (3.7%) each from South Carolina, Washington, West Virginia, and Wisconsin. 1 (3.7%) study used data from both Utah and Texas. These were qualitatively compared to a CDC study showing the highest stroke prevalence in Mississippi (4.3%), followed by Oklahoma (3.4%), Washington D.C. (3.4%), Louisiana (3.3%), and Alabama (3.2%), while the prevalence in California was 2.6%. Conclusions: A strong disconnect exists between the datasets and patient cohorts used to train machine learning algorithms in clinical research and the geographic distribution of stroke in the populations where clinical tools using these algorithms will be implemented. To reduce bias and increase generalizability and accuracy, future machine learning studies should use datasets drawn from a varied patient population that reflects the unequal distribution of stroke risk factors; this would greatly benefit the usability of these tools and help ensure accuracy on a nationwide scale.
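The percentages and the qualitative comparison in the abstract can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the authors' method: the counts and prevalence figures are copied from the abstract, while the names `study_counts`, `cdc_prevalence`, `study_shares`, and `unrepresented` are hypothetical and introduced only for illustration. It also assumes that a state counts as represented only when a study names it as its sole origin, so multicenter studies and the combined Utah and Texas study are kept as their own categories.

```python
# Study counts by dataset origin, copied from the abstract (27 studies total).
study_counts = {
    "California": 7,
    "Multicenter": 6,
    "Massachusetts": 3,
    "Illinois": 2,
    "Missouri": 2,
    "New York": 2,
    "South Carolina": 1,
    "Washington": 1,
    "West Virginia": 1,
    "Wisconsin": 1,
    "Utah and Texas": 1,  # one study drawing on both states
}

# Stroke prevalence (%) from the CDC study cited in the abstract.
cdc_prevalence = {
    "Mississippi": 4.3,
    "Oklahoma": 3.4,
    "Washington D.C.": 3.4,
    "Louisiana": 3.3,
    "Alabama": 3.2,
    "California": 2.6,
}


def study_shares(counts):
    """Return each origin's share of all included studies, in percent."""
    total = sum(counts.values())
    return {origin: round(100 * n / total, 1) for origin, n in counts.items()}


def unrepresented(prevalence, counts):
    """List states with a reported prevalence but no single-state study."""
    return [state for state in prevalence if state not in counts]


if __name__ == "__main__":
    for origin, share in study_shares(study_counts).items():
        print(f"{origin}: {share}% of included studies")
    print("High-prevalence states without a single-state study:",
          unrepresented(cdc_prevalence, study_counts))
```

Under these assumptions, none of the five highest-prevalence states listed in the abstract appears as the sole origin of an included study, which is the disconnect the conclusions describe. Multicenter cohorts may still include patients from those states, so this check is only as fine-grained as the abstract's own categories.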

Original language: English (US)
Article number: 105832
Journal: Journal of Stroke and Cerebrovascular Diseases
Issue number: 7
State: Published - Jul 2021
Externally published: Yes


Keywords

  • Bias
  • Epidemiology
  • Machine learning
  • Stroke

ASJC Scopus subject areas

  • Surgery
  • Rehabilitation
  • Clinical Neurology
  • Cardiology and Cardiovascular Medicine

