Denormalize and delimit: How not to make data extraction for analysis more complex than necessary

Alex F. Bokov, Laura Manuel, Catherine Cheng, Angela Bos, Alfredo Tirado-Ramos

Research output: Conference article, peer-reviewed

3 Citations (Scopus)


There are many legitimate reasons why standards for formatting of biomedical research data are lengthy and complex (Souza, Kush, & Evans, 2007). However, the common scenario of a biostatistician simply needing to import a given dataset into their statistical software is at best underserved by these standards. Statisticians are forced to act as amateur database administrators, pivoting and joining their data into a usable form before they can even begin the work they specialize in. Worse, they may find their choice of statistical tools dictated not by their own experience and skills, but by remote standards bodies or inertial administrative choices. This may limit academic freedom. If the formats in question require the use of one proprietary software package, it also raises concerns about vendor lock-in (DeLano, 2005) and stewardship of public resources. The logistics and transparency of data sharing can be made more tractable by an appreciation of the differences between the structural, semantic, and syntactic levels of data interoperability. The semantic level is legitimately a complex problem. Here we make the case that, for the limited purpose of statistical analysis, a simplifying assumption can be made about the structural level: the needs of a large number of statistical models can often be met with a modified variant of the first normal form, or 1NF (Codd, 1979). Once data is merged into one such table, the syntactic level becomes a solved problem, with many text-based formats available and robustly supported by virtually all statistical software without the need for any custom or third-party client-side add-ons. We implemented our denormalization approach in DataFinisher, an open source server-side add-on for i2b2 (Murphy et al., 2009), which we use at our site to enable self-service pulls of de-identified data by researchers.
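
The "denormalize and delimit" idea in the abstract can be illustrated with a minimal sketch. This is not DataFinisher's code: the function names, the column names, the sample facts, and the choice of one row per patient visit are all hypothetical, chosen only to show long-format observations being pivoted into a single flat table and then written as delimited text.

```python
import csv
import io


def denormalize(facts):
    """Pivot long-format facts into one row per (patient, visit).

    `facts` is an iterable of (patient_id, visit_date, concept, value)
    tuples. This layout is an assumption made for illustration only.
    """
    facts = list(facts)
    # One output column per distinct concept, in a stable order.
    concepts = sorted({concept for _, _, concept, _ in facts})
    rows = {}
    for patient_id, visit_date, concept, value in facts:
        row = rows.setdefault((patient_id, visit_date), {})
        row[concept] = value  # last value wins if a concept repeats in a visit
    header = ["patient_id", "visit_date"] + concepts
    table = [
        [pid, date] + [rows[(pid, date)].get(c, "") for c in concepts]
        for pid, date in sorted(rows)
    ]
    return header, table


def to_delimited(header, table, delimiter="\t"):
    """Serialize the flat table as delimited text (tab-separated by default)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(table)
    return buf.getvalue()


# Hypothetical sample data: three observations across two visits.
facts = [
    ("p1", "2016-01-05", "sbp", 120),
    ("p1", "2016-01-05", "weight_kg", 80.5),
    ("p2", "2016-02-10", "sbp", 135),
]
header, table = denormalize(facts)
text = to_delimited(header, table)
```

A file written this way can be read by the standard import routine of most statistical packages; missing observations appear as empty fields, so no client-side add-on is needed.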

Original language: English (US)
Pages (from-to): 1033-1041
Number of pages: 9
Journal: Procedia Computer Science
Status: Published - 2016
Event: International Conference on Computational Science, ICCS 2016 - San Diego, United States
Duration: Jun 6 2016 to Jun 8 2016

ASJC Scopus subject areas

  • General Computer Science

