|Published (Last):||6 February 2013|
|PDF File Size:||2.82 Mb|
|ePub File Size:||12.14 Mb|
|Price:||Free* [*Free Regsitration Required]|
However, the results from the application of standard clustering methods to genes are limited. This limitation is imposed by the existence of a number of experimental conditions where the activity of genes is uncorrelated.
A similar limitation exists when clustering of conditions is performed. For this reason, a number of algorithms that perform simultaneous clustering on the row and column dimensions of the data matrix has been proposed.
The goal is to find submatrices, that is, subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition. In this paper, we refer to this class of algorithms as biclustering. Biclustering is also referred in the literature as coclustering and direct clustering, among others names, and has also been used in fields such as information retrieval and data mining.
In this comprehensive survey, we analyze a large number of existing approaches to biclustering, and classify them in accordance with the type of biclusters they can find, the patterns of biclusters that are discovered, the methods used to perform the search, the approaches used to evaluate the solution, and the target applications. Index Terms—Biclustering, simultaneous clustering, coclustering, subspace clustering, bidimensional clustering, direct clustering, block clustering, two-way clustering, two-mode clustering, two-sided clustering, microarray data analysis, biological data analysis, gene expression data.
Grouping of genes according to their expression samples conditions . The samples may correspond to under multiple conditions. Classification of a new gene, given the expression of In other cases, the samples may have come from different other genes, with known classification. Grouping of conditions based on the expression of a different individuals. Simply visualizing this kind of data, number of genes. Classification of a new sample, given the expression expression data, is challenging and extracting biologically of the genes under that experimental condition.
Usually, gene expression data is arranged in a data Clustering techniques can be used to group either genes or conditions and, therefore, to pursue directly objectives 1 matrix, where each gene corresponds to one row and each and 3 above and, indirectly, objectives 2 and 4.
However, condition to one column. Each element of this matrix applying clustering algorithms to gene expression data runs represents the expression level of a gene under a specific into a significant difficulty.
Many activation patterns are condition, and is represented by a real number, which is common to a group of genes only under specific experi- usually the logarithm of the relative abundance of the mental conditions. In fact, our general understanding of mRNA of the gene under the specific condition.
Gene cellular processes leads us to expect subsets of genes to be expression matrices have been extensively analyzed in two coregulated and coexpressed only under certain experi- dimensions: the gene dimension and the condition dimen- mental conditions, but to behave almost independently sion.
These analysis correspond, respectively, to analyze the under other conditions. Discovering such local expression patterns may be the key to uncovering many genetic expression patterns of genes by comparing the rows in the pathways that are not apparent otherwise. It is therefore matrix, and to analyze the expression patterns of samples highly desirable to move beyond the clustering paradigm, by comparing the columns in the matrix.
It refers to a ID, Lisbon, Portugal. E-mail: smadeira di. E-mail: aml inesc-id. Names such as coclustering, bidimensional clustering, and Manuscript received 29 Jan. One of tcbb computer. Gene Expression Data Matrix What is then the difference between clustering and biclustering? Why and when should we use biclustering instead of clustering?
Clustering can be applied to either the rows or the columns of the data matrix, separately. Biclustering, on the other hand, performs clustering in these two dimensions simultaneously. This means that clustering derives a global model while biclustering produces a local model.
When clustering algorithms are used, each gene in a given gene cluster is defined using all the rows X and set of columns Y , where the element aij conditions. Similarly, each condition in a condition cluster corresponds to a value representing the relation between is characterized by the activity of all the genes that belong to row i and column j. Such a matrix A, with n rows and m it. We can then conclude that, unlike columns J. Therefore, behavior across the set of all columns.
Only a small set of the genes participates in a cellular defined as a k by m submatrix of the matrix A. Similarly, a cluster of columns is a subset of columns that 2. An interesting cellular process is active only in a exhibit similar behavior across the set of all rows. A column subset of the conditions. A cluster of genes should be defined with respect to behavior across a subset of columns, and vice versa. The only a subset of the conditions. The first characteristic is the geneity.
The exact characteristics of homogeneity vary from sheer complexity of gene regulation processes that require approach to approach, and will be studied in Section 3. The second characteristic is the 2. An interesting connection between data matrices and graph theory can be established. A data matrix can be viewed as a weighted bipartite graph. Table 1 illustrates the as a weighted bipartite graph where each node ni 2 L arrangement of a gene expression matrix.
The edge between node ni and nj has weight aij , deal with gene expression matrices. However, there are denoting the element of the matrix in the intersection many other applications for biclustering. For this reason, we between row i and column j and the strength of the will consider the general case of a data matrix, A, with set of activation level, in the case of gene expression matrices.
Examples of different types of biclusters. This connection between matrices and graph theory leads to. The specific algorithm used to identify each biclus- very interesting approaches to the analysis of expression ter.
Section 5 shows that some proposals use greedy data based on graph algorithms. The domain of application of each algorithm. Although the complexity of the biclustering problem may Biclustering applications range from a number of depend on the exact problem formulation and, specifically, microarray data analysis tasks to more exotic on the merit function used to evaluate the quality of a given applications like recommendation systems, direct bicluster, almost all interesting variants of this problem are marketing and elections analysis.
Applications of NP-complete. In its simplest form the data matrix A is a biclustering with special emphasis on biological data binary matrix, where every element aij is either 0 or 1. When this is the case, a bicluster corresponds to a biclique in the corresponding bipartite graph. More complex cases, where concerns the identification of the type of biclusters the the actual numeric values in the matrix A are taken into algorithm is able to find. We identified four major classes: account to compute the quality of a bicluster, have a 1.
Biclusters with constant values. Biclusters with constant values on rows or columns. Biclusters with coherent values. Biclusters with coherent evolutions. Given this, the large majority of the algorithms use heuristic approaches to identify biclusters in many The first three classes analyze directly the numeric cases preceded by a normalization step that is applied to the values in the data matrix and try to find subsets of rows data matrix in order to make more evident the patterns of and subsets of columns with similar behaviors.
These interest. Some of them avoid heuristics but exhibit an behaviors can be observed on the rows, on the columns, or exponential worst case runtime. The fourth class aims to find coherent behaviors 2. Given the already extensive literature on biclustering As such, biclusters with coherent evolutions view the algorithms, it is important to structure the analysis to be elements in the data matrix as symbols. These symbols presented. To achieve this, we classified the surveyed can be purely nominal, as in Figs.
The type of biclusters they can find. The bicluster normal value, as in Fig. This analysis is presented in Section 3. A bicluster with constant. The way multiple biclusters are treated and the values in the rows identifies a subset of genes with similar bicluster structure produced. Some algorithms find expression values across a subset of conditions, allowing only one bicluster, others find nonoverlapping the expression levels to differ from gene to gene.
Similarly, biclusters, others, more general, extract multiple, a bicluster with constant columns identifies a subset of overlapping biclusters. This dimension is studied in conditions within which a subset of genes present similar Section 4. However, one can be these algorithms depends, then, on the distance or interested in identifying more complex relations between similarity measure used by the one-way clustering algo- the genes and the conditions by looking directly at the rithms.
These algorithms will be considered in Sections 3. As such, a bicluster 3. On the other hand, identifying a bicluster with We will now introduce some notation used in the remaining coherent evolutions may be helpful if one is interested in of the section.
We denote same or opposite effects on a subset of genes. An elements in the bicluster: example of a constant bicluster is presented in Fig. These algorithms are studied in Section 3.
Biclustering algorithms for biological data analysis: a survey
Skip to Main Content. A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity. Use of this web site signifies your agreement to the terms and conditions. Personal Sign In.
Skip to search form Skip to main content You are currently offline. Some features of the site may not work correctly. DOI: Madeira and Arlindo L. Madeira , Arlindo L.
Biclustering Algorithms for Biological Data Analysis: A Survey