The cluster map is an intuitive way to group search results by topic. It helps users understand the structure of online coverage and other large document collections. The visual representation arranges documents by their semantic similarity, using different clustering methods in combination with a force-directed layout algorithm. The system assigns each document to a specific cluster, which acts as a local gravity point. The largest node of a cluster rests at its center and attracts the other nodes that belong to this cluster.
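The gravity-point idea can be illustrated with a minimal sketch. It assumes a hypothetical `Node` record and a hypothetical `strength` parameter, and it covers only the attraction toward the largest node of each cluster. A complete force-directed layout would typically add repulsion between nodes as well; this is not the system's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    x: float
    y: float
    size: float   # hypothetical: e.g. derived from the reach of the source
    cluster: int  # cluster id assigned by the clustering step


def attract_to_cluster_anchors(nodes, strength=0.1):
    """One layout step: pull every node toward the largest node of its cluster."""
    # Pick the largest node per cluster as the local gravity point
    anchors = {}
    for node in nodes:
        best = anchors.get(node.cluster)
        if best is None or node.size > best.size:
            anchors[node.cluster] = node

    # Snapshot anchor positions so the update order does not matter
    targets = {cluster: (a.x, a.y) for cluster, a in anchors.items()}

    for node in nodes:
        if node is anchors[node.cluster]:
            continue  # the anchor itself stays in place
        tx, ty = targets[node.cluster]
        node.x += (tx - node.x) * strength
        node.y += (ty - node.y) * strength
    return nodes
```

Calling the function repeatedly moves the members of a cluster closer to its anchor with every step, while the anchors themselves remain fixed.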
Cluster Map Layout
The Cluster Map highlights each group of similar documents with a convex hull shape that visually holds its nodes together. The size of this shape is dynamic and depends on the number of nodes it contains. Each node represents one of the documents returned by the search function and varies in size and color:
- Node size is proportional to the reach of the document’s media source (a CNN.com article, for example, is rendered larger than a report published on a local community site).
- Node color reflects normalized document sentiment, ranging from red (negative) to grey (neutral) and green (positive). The saturation depends on the degree of polarity – vivid colors indicate emotional articles, while lower saturation indicates more factual coverage.
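The two visual encodings above can be sketched as follows. The function names, the radius limits and the brightness value are illustrative assumptions. So is the choice to scale the node area (rather than the radius) with reach, and the sentiment range of -1 to 1.

```python
import colorsys
import math


def node_radius(reach, max_reach, max_radius=20.0, min_radius=2.0):
    """Scale the node area with reach, so the radius grows with sqrt(reach)."""
    if max_reach <= 0:
        return min_radius
    share = max(0.0, min(1.0, reach / max_reach))
    return max(min_radius, max_radius * math.sqrt(share))


def sentiment_color(sentiment):
    """Map a sentiment value in [-1, 1] to a hex color between red, grey and green."""
    s = max(-1.0, min(1.0, sentiment))
    hue = 0.0 if s < 0 else 1.0 / 3.0  # red for negative, green for positive
    saturation = abs(s)                # stronger polarity gives a more vivid color
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, 0.6)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
```

With these assumptions, a neutral document is drawn in plain grey, and a source with a quarter of the maximum reach gets half the maximum radius.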
Three keyword labels per cluster describe its contents. The system renders the hull shapes of nodes and keyword clusters with reduced opacity to decrease the visual load and increase the labels’ readability. The labels are computed from the keywords of the documents within the cluster, taking into account the reach of the documents’ sources.
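The exact label computation is not specified here. One plausible reading, shown purely as an assumption, is to score each keyword by the summed reach of the documents that contain it and keep the top three. The dictionary fields `keywords` and `reach` are hypothetical names.

```python
from collections import defaultdict


def cluster_labels(documents, count=3):
    """Return the `count` keywords with the highest reach-weighted score."""
    scores = defaultdict(float)
    for doc in documents:
        # Count each keyword once per document, weighted by the source's reach
        for keyword in set(doc["keywords"]):
            scores[keyword] += doc["reach"]
    # Sort by descending score; break ties alphabetically for a stable result
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [keyword for keyword, _ in ranked[:count]]
```

In this sketch a keyword used by a single high-reach source can outrank one that appears in several low-reach documents, which matches the idea that reach influences the labels.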
Interactive Cluster Map Features
- Hovering over a cluster hides its keywords and highlights its shape and nodes through higher opacity. Node colors within a highlighted cluster become more vivid.
- Clicking on a cluster triggers a new search, narrowing down the set of results to documents within the selected cluster.
- Hovering over a single node highlights this node with an orange stroke and shows a tooltip with document keywords and the favicon of the source.
Keyword clustering tools have to balance accuracy and scalability. Common methods include K-means, which divides the collection of documents into a fixed number of clusters and assigns each document to the cluster with the nearest centroid. In contrast to K-means, agglomerative hierarchical clustering is deterministic (a given set of documents always results in the same layout) and uses an iterative “bottom up” approach to pair clusters into a tree-like structure. The story detection algorithm of webLyzard is another clustering approach that uses time slices and is particularly suited for very large document collections.
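The difference between the two common methods can be sketched on plain 2D points. This is a simplified illustration and not the system's implementation: it assumes Euclidean distance, random initial centroids for K-means and single linkage for the agglomerative variant. The latter stops once `k` clusters remain instead of building the full tree.

```python
import math
import random


def kmeans(points, k, seed=None, iterations=20):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random start, so results depend on the seed
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def agglomerative(points, k):
    """Bottom-up single-linkage clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                gap = min(math.dist(points[i], points[j])
                          for i in clusters[a] for j in clusters[b])
                if best is None or gap < best[0]:
                    best = (gap, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # merge the closest pair
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

The agglomerative function involves no randomness, so the same input always yields the same labels. The K-means sketch is only reproducible when a fixed `seed` is passed, and different seeds may lead to different groupings.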
- Jain, A.K. (2010). Data Clustering: 50 Years Beyond K-means, Pattern Recognition Letters, 31(8): 651-666.
- Syed, K.A.A., Kröll, M., Sabol, V., Scharl, A. and Gindl, S. (2012). Incremental and Scalable Computation of Dynamic Topography Information Landscapes, Journal of Multimedia Processing and Technologies, 3(1): 49-65.