Inference from clustering with application to gene-expression microarrays

Edward R. Dougherty, Junior Barrera, Marcel Brun, Seungchan Kim, Roberto M. Cesar, Yidong Chen, Michael Bittner, Jeffrey M. Trent

Research output: Contribution to journal › Article

129 Citations (Scopus)

Abstract

There are many algorithms to cluster sample data points based on nearness or a similarity measure. Often the implication is that points in different clusters come from different underlying classes, whereas those in the same cluster come from the same class. Stochastically, the underlying classes represent different random processes. The inference is that clusters represent a partition of the sample points according to which process they belong. This paper discusses a model-based clustering toolbox that evaluates cluster accuracy. Each random process is modeled as its mean plus independent noise, sample points are generated, the points are clustered, and the clustering error is the number of points clustered incorrectly according to the generating random processes. Various clustering algorithms are evaluated based on process variance and the key issue of the rate at which algorithmic performance improves with increasing numbers of experimental replications. The model means can be selected by hand to test the separability of expected types of biological expression patterns. Alternatively, the model can be seeded by real data to test the expected precision of that output or the extent of improvement in precision that replication could provide. In the latter case, a clustering algorithm is used to form clusters, and the model is seeded with the means and variances of these clusters. Other algorithms are then tested relative to the seeding algorithm. Results are averaged over various seeds. Output includes error tables and graphs, confusion matrices, principal-component plots, and validation measures. Five algorithms are studied in detail: K-means, fuzzy C-means, self-organizing maps, hierarchical Euclidean-distance-based and correlation-based clustering. The toolbox is applied to gene-expression clustering based on cDNA microarrays using real data. Expression profile graphics are generated and error analysis is displayed within the context of these profile graphics. A large amount of generated output is available over the web.
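
The evaluation loop described in the abstract (model each process as a mean plus independent noise, generate points, cluster them, count the misclustered points) can be illustrated with a minimal sketch. This is not the authors' toolbox. The function names, the Gaussian noise, the averaging of replicates, the plain K-means routine, and the use of the best cluster-to-class matching when counting errors are all assumptions made for illustration, and the model means are hypothetical hand-picked profiles.

```python
import itertools
import random

def generate(means, sigma, replicates, points_per_class, rng):
    """Each point is its class mean plus independent Gaussian noise (assumed),
    averaged over the given number of replicates (assumed use of replication)."""
    points, labels = [], []
    for label, mean in enumerate(means):
        for _ in range(points_per_class):
            reps = [[m + rng.gauss(0.0, sigma) for m in mean]
                    for _ in range(replicates)]
            points.append([sum(col) / replicates for col in zip(*reps)])
            labels.append(label)
    return points, labels

def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, iterations=30):
    """Plain K-means with random initial centers; returns a cluster index per point."""
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iterations):
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def clustering_error(true_labels, assign, k):
    """Number of misclustered points under the best cluster-to-class matching
    (the matching rule is an assumption, not taken from the paper)."""
    best = len(true_labels)
    for perm in itertools.permutations(range(k)):
        errors = sum(1 for t, a in zip(true_labels, assign) if perm[a] != t)
        best = min(best, errors)
    return best

def mean_error(means, sigma, replicates, trials=20, points_per_class=30, seed=0):
    """Average clustering error over several simulated data sets."""
    rng = random.Random(seed)
    k = len(means)
    total = 0
    for _ in range(trials):
        points, labels = generate(means, sigma, replicates, points_per_class, rng)
        total += clustering_error(labels, kmeans(points, k, rng), k)
    return total / trials

# Hypothetical hand-picked expression profiles, one per underlying process.
model_means = [[0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 2.0]]
for r in (1, 2, 4, 8):
    print(r, mean_error(model_means, sigma=1.5, replicates=r))
```

Printing the mean error for increasing replicate counts is one way to look at the rate at which performance improves with replication. Other clustering algorithms could be compared by swapping them in for `kmeans`.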

Original language: English (US)
Pages (from-to): 105-126
Number of pages: 22
Journal: Journal of Computational Biology
Volume: 9
Issue number: 1
DOIs: https://doi.org/10.1089/10665270252833217
State: Published - 2002
Externally published: Yes

Fingerprint

Microarrays
Gene expression
Microarray
Gene Expression
Cluster Analysis
Random processes
Clustering
Clustering algorithms
Random process
Sample point
Self organizing maps
Replication
Clustering Algorithm
Output
Error analysis
Seed
CDNA Microarray
Model-based Clustering
Oligonucleotide Array Sequence Analysis
Fuzzy C-means

Keywords

  • Clustering
  • Gene expression
  • Microarray

ASJC Scopus subject areas

  • Molecular Biology
  • Genetics

Cite this

Dougherty, E. R., Barrera, J., Brun, M., Kim, S., Cesar, R. M., Chen, Y., ... Trent, J. M. (2002). Inference from clustering with application to gene-expression microarrays. Journal of Computational Biology, 9(1), 105-126. https://doi.org/10.1089/10665270252833217

Inference from clustering with application to gene-expression microarrays. / Dougherty, Edward R.; Barrera, Junior; Brun, Marcel; Kim, Seungchan; Cesar, Roberto M.; Chen, Yidong; Bittner, Michael; Trent, Jeffrey M.

In: Journal of Computational Biology, Vol. 9, No. 1, 2002, p. 105-126.

Dougherty, ER, Barrera, J, Brun, M, Kim, S, Cesar, RM, Chen, Y, Bittner, M & Trent, JM 2002, 'Inference from clustering with application to gene-expression microarrays', Journal of Computational Biology, vol. 9, no. 1, pp. 105-126. https://doi.org/10.1089/10665270252833217
Dougherty, Edward R. ; Barrera, Junior ; Brun, Marcel ; Kim, Seungchan ; Cesar, Roberto M. ; Chen, Yidong ; Bittner, Michael ; Trent, Jeffrey M. / Inference from clustering with application to gene-expression microarrays. In: Journal of Computational Biology. 2002 ; Vol. 9, No. 1. pp. 105-126.
@article{8d94825059cf40048323cdbf06af8cdf,
title = "Inference from clustering with application to gene-expression microarrays",
keywords = "Clustering, Gene expression, Microarray",
author = "Dougherty, {Edward R.} and Junior Barrera and Marcel Brun and Seungchan Kim and Cesar, {Roberto M.} and Yidong Chen and Michael Bittner and Trent, {Jeffrey M.}",
year = "2002",
doi = "10.1089/10665270252833217",
language = "English (US)",
volume = "9",
pages = "105--126",
journal = "Journal of Computational Biology",
issn = "1066-5277",
publisher = "Mary Ann Liebert Inc.",
number = "1",

}

TY - JOUR

T1 - Inference from clustering with application to gene-expression microarrays

AU - Dougherty, Edward R.

AU - Barrera, Junior

AU - Brun, Marcel

AU - Kim, Seungchan

AU - Cesar, Roberto M.

AU - Chen, Yidong

AU - Bittner, Michael

AU - Trent, Jeffrey M.

PY - 2002

Y1 - 2002

KW - Clustering

KW - Gene expression

KW - Microarray

UR - http://www.scopus.com/inward/record.url?scp=0036207548&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0036207548&partnerID=8YFLogxK

U2 - 10.1089/10665270252833217

DO - 10.1089/10665270252833217

M3 - Article

C2 - 11911797

AN - SCOPUS:0036207548

VL - 9

SP - 105

EP - 126

JO - Journal of Computational Biology

JF - Journal of Computational Biology

SN - 1066-5277

IS - 1

ER -