E-mail: {achtert,boehm,kriegel,kroegerp,zimek}@dbs.ifi.lmu.de

Abstract

In high-dimensional data, clusters often exist only in arbitrarily oriented subspaces of the feature space. In addition, these so-called correlation clusters may have complex relationships between each other. For example, a correlation cluster in a 1-D subspace (forming a line) may be enclosed within one or even several correlation clusters in 2-D superspaces (forming planes). In general, such relationships can be seen as a complex hierarchy that allows multiple inclusions, i.e. clusters may be embedded in several super-clusters rather than only in one. Obviously, uncovering the hierarchical relationships between the detected correlation clusters is an important information gain. Since existing approaches cannot detect such complex hierarchical relationships among correlation clusters, we propose the algorithm ERiC to tackle this problem and to visualize the result by means of a graph-based representation. In our experimental evaluation, we show that ERiC finds more information than state-of-the-art correlation clustering methods and outperforms existing competitors in terms of efficiency.

1. Introduction

In high-dimensional data, meaningful clusters are usually based only on a subset of all dimensions. Subspace clustering (or projected clustering) is a well-known approach to find λ-dimensional subspaces of a d-dimensional data space (λ < d) in which certain sets of points cluster well. Despite great efforts in developing subspace clustering methods, all existing approaches suffer from certain drawbacks. Each approach is based on certain heuristics, because the optimal solution would require a time complexity at least exponential in the number of dimensions d. An even more challenging problem is to find clusters in arbitrarily oriented subspaces. Such clusters appear as sets of points located near a common hyperplane (of arbitrary dimensionality λ) in a d-dimensional data space. Since these hyperplanes correspond to linear dependencies among several attributes (and thus the corresponding attributes are correlated), the concept of knowledge discovery in databases addressing this problem is known as correlation clustering [10]. Correlation clustering groups the data set into subsets called correlation clusters such that the objects in the same correlation cluster are all associated to a common hyperplane of arbitrary dimensionality. A correlation cluster associated to a λ-dimensional hyperplane is referred to as a λ-dimensional correlation cluster. The dimensionality of the hyperplane associated to a correlation cluster is called the correlation dimensionality.

A good example of a successful application of correlation clustering is recommendation systems. In target marketing, it is important to find homogeneous groups of users with similar ratings in subsets of the attributes. In addition, it is interesting to find groups of users with correlated affinities. This knowledge can help companies to predict customer behavior and thus develop future marketing plans. A second application of correlation clustering is metabolic screening. The collected data usually contain the concentrations of certain metabolites in blood samples of thousands of patients.
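The notion of correlation dimensionality can be illustrated with a small PCA sketch: points lying near a λ-dimensional hyperplane yield λ dominant eigenvalues of their covariance matrix. The function name, the synthetic data, and the 85% explained-variance threshold `alpha` below are illustrative choices for this sketch, not details of any specific algorithm discussed in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D correlation cluster in 3-D space: points near the line
# x2 = 2*x1, x3 = -x1, plus small Gaussian noise.
t = rng.uniform(-5, 5, size=200)
points = np.column_stack([t, 2 * t, -t]) + rng.normal(scale=0.05, size=(200, 3))

def correlation_dimensionality(X, alpha=0.85):
    """Estimate the correlation dimensionality of a point set:
    the smallest number of principal components whose eigenvalues
    explain at least a fraction `alpha` of the total variance."""
    centered = X - X.mean(axis=0)
    cov = centered.T @ centered / len(X)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]  # descending order
    explained = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(explained, alpha) + 1)

print(correlation_dimensionality(points))  # → 1 (points lie near a line)
```

With the same function, points scattered near a 2-D plane in 3-D space would yield two dominant eigenvalues and hence a correlation dimensionality of 2.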
In such data sets, it is important to find homogeneous groups of patients with correlated metabolite concentrations indicating a common metabolic disease. This is an important step towards understanding metabolic or genetic disorders and designing individual drugs. Another prominent application for correlation clustering is the analysis of gene expression data. Gene expression data contain the expression levels of thousands of genes, indicating how active the genes are, across a set of samples. A common task is to find clusters of co-regulated genes, i.e. clusters of genes that share a common linear dependency within a subset of their features.

The first approach that can detect correlation clusters is ORCLUS [6], which integrates PCA into k-means clustering. The algorithm 4C [10] integrates PCA into a density-based clustering algorithm. These approaches can be seen as flat approaches in the following sense: a correlation cluster C1 with dimensionality λ1 may be embedded in (and therefore

In Proc. 19th International Conference on Scientific and Statistical Database Management (SSDBM 2007), Banff, Canada