TY - JOUR
T1 - Discrepancies in Stroke Distribution and Dataset Origin in Machine Learning for Stroke
AU - Velagapudi, Lohit
AU - Mouchtouris, Nikolaos
AU - Baldassari, Michael P.
AU - Nauheim, David
AU - Khanna, Omaditya
AU - Saiegh, Fadi Al
AU - Herial, Nabeel
AU - Gooch, M. Reid
AU - Tjoumakaris, Stavropoula
AU - Rosenwasser, Robert H.
AU - Jabbour, Pascal
N1 - Publisher Copyright: © 2021 Elsevier Inc.
PY - 2021/7
Y1 - 2021/7
N2 - Background: Machine learning algorithms depend on accurate and representative training datasets to become valuable clinical tools that generalize to a varied population. Aims: We aimed to review machine learning applications in the stroke literature, assess the geographic distribution of the datasets and patient cohorts used to train these models, and compare that distribution with the distribution of stroke to evaluate for disparities. Summary of review: An initial search of the PubMed database identified 582 studies. After title and abstract screening, which excluded 489 papers, 106 full texts were assessed. Of these 106 studies, 79 were excluded because they used cohorts from outside the United States or were review articles or editorials, leaving 27 studies for analysis. Of the 27 included studies, 7 (25.9%) used patient data from California, 6 (22.2%) were multicenter, 3 (11.1%) used data from Massachusetts, 2 (7.4%) each used data from Illinois, Missouri, and New York, and 1 (3.7%) each used data from South Carolina, Washington, West Virginia, and Wisconsin; 1 (3.7%) study used data from Utah and Texas. These counts were qualitatively compared with a CDC study reporting the highest stroke prevalence in Mississippi (4.3%), followed by Oklahoma (3.4%), Washington D.C. (3.4%), Louisiana (3.3%), and Alabama (3.2%), whereas the prevalence in California was 2.6%. Conclusions: A strong disconnect exists between the datasets and patient cohorts used to train machine learning algorithms in clinical research and the distribution of stroke in the populations where clinical tools built on these algorithms will be implemented. To reduce bias and increase generalizability and accuracy in future machine learning studies, datasets drawn from a varied patient population that reflects the unequal distribution of stroke risk factors would improve the usability of these tools and support accuracy on a nationwide scale.
KW - Bias
KW - Epidemiology
KW - Machine learning
KW - Stroke
UR - http://www.scopus.com/inward/record.url?scp=85104922118&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85104922118&partnerID=8YFLogxK
U2 - 10.1016/j.jstrokecerebrovasdis.2021.105832
DO - 10.1016/j.jstrokecerebrovasdis.2021.105832
M3 - Article
C2 - 33940363
AN - SCOPUS:85104922118
SN - 1052-3057
VL - 30
JO - Journal of Stroke and Cerebrovascular Diseases
JF - Journal of Stroke and Cerebrovascular Diseases
IS - 7
M1 - 105832
ER -