START: A system for flexible analysis of hundreds of genomic signal tracks in few lines of SQL-like queries

Xinjie Zhu, Qiang Zhang, Eric Dun Ho, Ken Hung On Yu, Chris Liu, Hui-ming Huang, Alfred Sze Lok Cheng, Ben Kao, Eric Lo, Kevin Y. Yip

Research output: Contribution to journal, Article

3 Citations (Scopus)

Abstract

Background: A genomic signal track is a set of genomic intervals associated with values of various types, such as measurements from high-throughput experiments. Analysis of signal tracks requires complex computational methods, which often make the analysts focus too much on the detailed computational steps rather than on their biological questions.

Results: Here we propose Signal Track Query Language (STQL) for simple analysis of signal tracks. It is a Structured Query Language (SQL)-like declarative language, which means one only specifies what computations need to be done but not how these computations are to be carried out. STQL provides a rich set of constructs for manipulating genomic intervals and their values. To run STQL queries, we have developed the Signal Track Analytical Research Tool (START, http://yiplab.cse.cuhk.edu.hk/start/), a system that includes a Web-based user interface and a back-end execution system. The user interface helps users select data from our database of around 10,000 commonly-used public signal tracks, manage their own tracks, and construct, store and share STQL queries. The back-end system automatically translates STQL queries into optimized low-level programs and runs them on a computer cluster in parallel. We use STQL to perform 14 representative analytical tasks. By repeating these analyses using bedtools, Galaxy and custom Python scripts, we show that the STQL solution is usually the simplest, and the parallel execution achieves significant speed-up with large data files. Finally, we describe how a biologist with minimal formal training in computer programming self-learned STQL to analyze DNA methylation data we produced from 60 pairs of hepatocellular carcinoma (HCC) samples.

Conclusions: Overall, STQL and START provide a generic way for analyzing a large number of genomic signal tracks in parallel easily.
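To make the abstract's definition concrete, the following is a minimal Python sketch of a signal track as a list of intervals with values, together with one hand-written overlap computation of the kind a declarative language would leave to the back end. It is not STQL and not START's implementation; the names `Interval`, `overlaps`, and `mean_overlapping_value`, the half-open coordinate convention, and the sample data are all assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Interval(NamedTuple):
    """One entry of a signal track (hypothetical representation)."""
    chrom: str
    start: int    # 0-based, inclusive (assumed BED-style convention)
    end: int      # exclusive
    value: float  # e.g. a measurement from a high-throughput experiment


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def mean_overlapping_value(
    regions: List[Interval], signal: List[Interval]
) -> Dict[Tuple[str, int, int], Optional[float]]:
    """For each region, average the values of the signal intervals overlapping it.

    This naive nested loop spells out the "how" explicitly; a declarative
    query would state only the "what" and leave optimization to the system.
    """
    result: Dict[Tuple[str, int, int], Optional[float]] = {}
    for r in regions:
        hits = [s.value for s in signal if overlaps(r, s)]
        result[(r.chrom, r.start, r.end)] = sum(hits) / len(hits) if hits else None
    return result


# Made-up example data, for illustration only.
regions = [Interval("chr1", 100, 200, 0.0), Interval("chr1", 500, 600, 0.0)]
signal = [
    Interval("chr1", 150, 160, 2.0),
    Interval("chr1", 190, 250, 4.0),
    Interval("chr2", 100, 200, 9.0),
]
summary = mean_overlapping_value(regions, signal)
```

Here the first region overlaps two signal intervals on chr1 and the second overlaps none, so `summary` maps the first region to their mean and the second to `None`.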

Original language: English (US)
Article number: 749
Journal: BMC Genomics
Volume: 18
Issue number: 1
DOI: 10.1186/s12864-017-4071-1
State: Published - Sep 22 2017

Keywords

  • Data analysis
  • Human genomics
  • Signal tracks

ASJC Scopus subject areas

  • Biotechnology
  • Genetics

Cite this

START: A system for flexible analysis of hundreds of genomic signal tracks in few lines of SQL-like queries. / Zhu, Xinjie; Zhang, Qiang; Ho, Eric Dun; Yu, Ken Hung On; Liu, Chris; Huang, Hui-ming; Cheng, Alfred Sze Lok; Kao, Ben; Lo, Eric; Yip, Kevin Y.

In: BMC Genomics, Vol. 18, No. 1, 749, 22.09.2017.

@article{44b1352a09a94511b026981e61700d6e,
title = "START: A system for flexible analysis of hundreds of genomic signal tracks in few lines of SQL-like queries",
keywords = "Data analysis, Human genomics, Signal tracks",
author = "Xinjie Zhu and Qiang Zhang and Ho, {Eric Dun} and Yu, {Ken Hung On} and Chris Liu and Hui-ming Huang and Cheng, {Alfred Sze Lok} and Ben Kao and Eric Lo and Yip, {Kevin Y.}",
year = "2017",
month = "9",
day = "22",
doi = "10.1186/s12864-017-4071-1",
language = "English (US)",
volume = "18",
journal = "BMC Genomics",
issn = "1471-2164",
publisher = "BioMed Central",
number = "1",

}
