Figure 1. Simple (a) and complex (b) hierarchical relationships among correlation clusters. (Both (a) and (b) depict a sample data set.)

may be part of) another correlation cluster C2 with dimensionality λ2 > λ1. In general, there may be a kind of hierarchy among correlation clusters that are embedded into higher dimensional correlation clusters. Since ORCLUS and 4C cannot detect such hierarchies, the algorithm HiCO [4] was proposed tackling correlation clustering as a hierarchical problem, i.e. exploring the information of correlation clusters of lower correlation dimensionality that together form a larger correlation cluster of higher correlation dimensionality. Although it is represented by the same models (dendrogram/reachability diagram), the resulting hierarchy is different from the hierarchies computed by traditional hierarchical clustering algorithms such as Single-Link or OPTICS [8]. The hierarchy among correlation clusters reflects the relationships among the subspaces in which these correlation clusters are embedded rather than spatial vicinity or density. As a simple illustration consider the data set depicted in Figure 1(a): Two lines, i.e. 1-D correlation clusters, are embedded within a plane, i.e. a 2-D correlation cluster. The resulting hierarchy consists of the two 1-D clusters as leaf nodes of the hierarchy tree, both having the 2-D correlation cluster as parent node. HiCO aims at generating a tree-based representation of the correlation cluster hierarchy.

However, it may not always be appropriate to reflect the hierarchy of correlation clusters as a tree. A correlation cluster may be embedded in several correlation clusters of higher dimensionality, resulting in a hierarchy with multiple inclusions (similar to the concept of multiple inheritance in software engineering). Consider e.g. the data set depicted in Figure 1(b): One of the 1-D correlation clusters is the intersection of two 2-D correlation clusters, i.e. it is embedded within two clusters of higher dimensionality.
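The correlation dimensionality that drives this hierarchy can be made concrete with a small sketch. The following Python snippet (illustrative only, not ERiC itself; the synthetic data and the variance threshold alpha are assumptions of this sketch) generates a 1-D and a 2-D correlation cluster embedded in 3-D space and estimates each cluster's correlation dimensionality from the eigenvalues of its covariance matrix, as done in PCA-based approaches:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-D correlation cluster: points along a line in 3-D (plus small noise).
t = rng.uniform(-1, 1, size=(200, 1))
line = t * np.array([1.0, 2.0, 0.0]) + rng.normal(scale=0.01, size=(200, 3))

# 2-D correlation cluster: points on a plane in 3-D (plus small noise).
uv = rng.uniform(-1, 1, size=(200, 2))
plane = uv @ np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 1.0]]) + rng.normal(scale=0.01, size=(200, 3))

def correlation_dimensionality(points, alpha=0.85):
    """Smallest number of covariance eigenvalues that together
    explain at least an alpha fraction of the total variance."""
    cov = np.cov(points, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # descending
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(ratios, alpha) + 1)

print(correlation_dimensionality(line))   # 1
print(correlation_dimensionality(plane))  # 2
```

A hierarchy as in Figure 1(a) then corresponds to clusters of dimensionality 1 whose points also lie (up to noise) in the 2-D subspace of a parent cluster.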
Those multiple inclusions can only be represented by a graph-based visualization approach which is beyond the capabilities of previous methods such as HiCO. In this paper, we propose a new algorithm called ERiC (Exploring Relationships among Correlation clusters) to completely uncover any complex hierarchical relationships of correlation clusters in high dimensional data sets, including not only single inclusions (like HiCO) but also multiple inclusions. In addition, ERiC provides a clear visualization of these complex relationships by means of a graph-based representation.

The remainder of the paper is organized as follows. We discuss the related work in Section 2. In Section 3, we describe the notion of correlation clusters based on PCA in more detail in preparation to handle correlation clusters formally. Section
4 describes our new approach. An experimental evaluation is presented in Section 5. Section
6 concludes the paper.

2. Related Work

In recent years, several subspace clustering algorithms have been proposed [7, 5, 19, 16, 9] to uncover clusters in axis-parallel projections of the original data set. Some approaches even provide information regarding the hierarchical relationships among different subspace clusters [1, 2]. However, subspace clustering algorithms are not able to capture local data correlations and find clusters of correlated objects since the principal axes of correlated data are arbitrarily oriented.

Pattern-based clustering methods [21, 22, 17, 18] aim at grouping objects that exhibit a similar trend in a subset of attributes into clusters rather than grouping objects with low distance. This problem is also known as co-clustering or biclustering [15]. In contrast to correlation clustering, pattern-based clustering limits itself to a very special form of correlation where all attributes are positively correlated. It does not include negative correlations or correlations where one attribute is determined by a linear combination of two or more other attributes. Thus, biclustering or pattern-based clustering can be regarded as a special case of correlation clustering, as more extensively discussed in [10].

The expectation maximization (EM) algorithm [12] is one of the first clustering algorithms that can generally detect correlation clusters. The EM algorithm tries to model the data distribution of a data set using a mixture of non-axis-parallel Gaussian distributions. However, the EM algorithm cannot distinguish between correlation clusters and full-dimensional clusters without any correlation. In addition, it favors clusters of spherical shape and requires the user to specify the number of clusters in advance. As a main drawback, EM clustering is very sensitive to noise.
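The EM behavior described above can be reproduced with scikit-learn's GaussianMixture, a standard EM implementation (the synthetic data set and the choice k = 2 are assumptions of this sketch, not part of the cited work): full covariance matrices let EM fit arbitrarily oriented Gaussians, and thus correlation clusters, but the number of clusters must be fixed in advance.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Two 1-D correlation clusters in 2-D: points along two
# differently oriented, well-separated lines (plus small noise).
t = rng.uniform(-1, 1, size=300)
c1 = np.c_[t, 2.0 * t] + rng.normal(scale=0.05, size=(300, 2))    # along (1, 2)
c2 = np.c_[t + 6.0, -t] + rng.normal(scale=0.05, size=(300, 2))   # along (1, -1)
X = np.vstack([c1, c2])

# EM with full covariance matrices fits arbitrarily oriented Gaussians,
# but the number of components k must be specified up front.
gm = GaussianMixture(n_components=2, covariance_type="full",
                     random_state=0).fit(X)
labels = gm.predict(X)
```

Note that nothing in the fitted model itself indicates whether a recovered component is a genuine correlation cluster or a full-dimensional cluster; that distinction would again require inspecting the eigenvalue spectrum of each component's covariance matrix.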
ORCLUS [6] is a k-means-style correlation clustering algorithm and, thus, can be seen as a specialization of EM that detects only correlation clusters. The correlation clusters are allowed to exist in arbitrarily oriented subspaces represented by a set of eigenvectors. Like most k-means-based approaches, ORCLUS favors correlation clusters of spherical shape and requires the user to specify the number of clusters in advance. If the user'