TY - JOUR
T1 - Denormalize and delimit
T2 - International Conference on Computational Science, ICCS 2016
AU - Bokov, Alex F.
AU - Manuel, Laura
AU - Cheng, Catherine
AU - Bos, Angela
AU - Tirado-Ramos, Alfredo
N1 - Publisher Copyright: © The Authors. Published by Elsevier B.V.
PY - 2016
Y1 - 2016
N2 - There are many legitimate reasons why standards for formatting of biomedical research data are lengthy and complex (Souza, Kush, & Evans, 2007). However, the common scenario of a biostatistician simply needing to import a given dataset into their statistical software is at best underserved by these standards. Statisticians are forced to act as amateur database administrators to pivot and join their data into a usable form before they can even begin the work that they specialize in doing. Or worse, they find their choice of statistical tools dictated not by their own experience and skills, but by remote standards bodies or inertial administrative choices. This may limit academic freedom. If the formats in question require the use of one proprietary software package, it also raises concerns about vendor lock-in (DeLano, 2005) and stewardship of public resources. The logistics and transparency of data sharing can be made more tractable by an appreciation of the differences between structural, semantic, and syntactic levels of data interoperability. The semantic level is legitimately a complex problem. Here we make the case that, for the limited purpose of statistical analysis, a simplifying assumption can be made about the structural level: the needs of a large number of statistical models can often be met with a modified variant of the first normal form or 1NF (Codd, 1979). Once data is merged into one such table, the syntactic level becomes a solved problem, with many text-based formats available and robustly supported by virtually all statistical software without the need for any custom or third-party client-side add-ons. We implemented our denormalization approach in DataFinisher, an open-source server-side add-on for i2b2 (Murphy et al., 2009), which we use at our site to enable self-service pulls of de-identified data by researchers.
AB - There are many legitimate reasons why standards for formatting of biomedical research data are lengthy and complex (Souza, Kush, & Evans, 2007). However, the common scenario of a biostatistician simply needing to import a given dataset into their statistical software is at best underserved by these standards. Statisticians are forced to act as amateur database administrators to pivot and join their data into a usable form before they can even begin the work that they specialize in doing. Or worse, they find their choice of statistical tools dictated not by their own experience and skills, but by remote standards bodies or inertial administrative choices. This may limit academic freedom. If the formats in question require the use of one proprietary software package, it also raises concerns about vendor lock-in (DeLano, 2005) and stewardship of public resources. The logistics and transparency of data sharing can be made more tractable by an appreciation of the differences between structural, semantic, and syntactic levels of data interoperability. The semantic level is legitimately a complex problem. Here we make the case that, for the limited purpose of statistical analysis, a simplifying assumption can be made about the structural level: the needs of a large number of statistical models can often be met with a modified variant of the first normal form or 1NF (Codd, 1979). Once data is merged into one such table, the syntactic level becomes a solved problem, with many text-based formats available and robustly supported by virtually all statistical software without the need for any custom or third-party client-side add-ons. We implemented our denormalization approach in DataFinisher, an open-source server-side add-on for i2b2 (Murphy et al., 2009), which we use at our site to enable self-service pulls of de-identified data by researchers.
KW - Data extraction
KW - Data formats
KW - Data transformation
KW - Electronic health records
KW - Health services research
KW - Relational databases
UR - http://www.scopus.com/inward/record.url?scp=84978477095&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84978477095&partnerID=8YFLogxK
U2 - 10.1016/j.procs.2016.05.403
DO - 10.1016/j.procs.2016.05.403
M3 - Conference article
AN - SCOPUS:84978477095
SN - 1877-0509
VL - 80
SP - 1033
EP - 1041
JO - Procedia Computer Science
JF - Procedia Computer Science
Y2 - 6 June 2016 through 8 June 2016
ER -