What is a Cluster Map?
Search queries often return an overwhelming number of online documents. A cluster map is an intuitive way to group search results by topic. By identifying similar documents, it helps to better understand the structure of online coverage and other large document collections. The visual representation of the cluster map arranges documents by their semantic similarity, using clustering methods in combination with a force-directed layout algorithm. The system assigns each document to a specific cluster, which acts as a local gravity point. The largest node rests at the center and attracts other nodes that belong to this cluster.
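The idea of clusters acting as local gravity points can be illustrated with a small force simulation. This is only a minimal sketch under stated assumptions, not the layout algorithm used by the system: the function `cluster_layout`, the fixed `centers` dictionary, the force constants, and the sample nodes are all hypothetical. The largest node of each cluster is pinned to the cluster's center, while the remaining nodes are pulled toward that center and pushed away from each other.

```python
import math
import random

def cluster_layout(nodes, centers, iterations=200, spring=0.1,
                   repulsion=0.05, max_step=0.1, seed=0):
    """Return {node_id: (x, y)}; nodes are dicts with id, cluster, size."""
    rng = random.Random(seed)

    # The largest node of each cluster rests at the cluster's gravity point.
    largest = {}
    for n in nodes:
        best = largest.get(n["cluster"])
        if best is None or n["size"] > best["size"]:
            largest[n["cluster"]] = n
    pinned_ids = {n["id"] for n in largest.values()}

    pos = {}
    for n in nodes:
        cx, cy = centers[n["cluster"]]
        if n["id"] in pinned_ids:
            pos[n["id"]] = [cx, cy]
        else:
            # Assumption: other nodes start at a small random offset.
            pos[n["id"]] = [cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1)]

    for _ in range(iterations):
        moves = {}
        for n in nodes:
            if n["id"] in pinned_ids:
                continue
            x, y = pos[n["id"]]
            cx, cy = centers[n["cluster"]]
            # Spring-like attraction toward the cluster's gravity point.
            fx, fy = spring * (cx - x), spring * (cy - y)
            # Repulsion from every other node reduces overlap.
            for other in nodes:
                if other is n:
                    continue
                dx = x - pos[other["id"]][0]
                dy = y - pos[other["id"]][1]
                dist_sq = dx * dx + dy * dy + 1e-9
                fx += repulsion * dx / dist_sq
                fy += repulsion * dy / dist_sq
            # Cap the step length so the simulation stays stable.
            length = math.hypot(fx, fy)
            scale = min(1.0, max_step / length) if length > 0 else 0.0
            moves[n["id"]] = (fx * scale, fy * scale)
        for node_id, (mx, my) in moves.items():
            pos[node_id][0] += mx
            pos[node_id][1] += my
    return {node_id: tuple(p) for node_id, p in pos.items()}

# Hypothetical sample data: two clusters with three documents each.
nodes = [
    {"id": "a1", "cluster": "A", "size": 9},
    {"id": "a2", "cluster": "A", "size": 3},
    {"id": "a3", "cluster": "A", "size": 1},
    {"id": "b1", "cluster": "B", "size": 8},
    {"id": "b2", "cluster": "B", "size": 2},
    {"id": "b3", "cluster": "B", "size": 2},
]
centers = {"A": (0.0, 0.0), "B": (10.0, 0.0)}
positions = cluster_layout(nodes, centers)
```

In this sketch the cluster centers are given as input. A real implementation would also have to position the clusters themselves, for example by the similarity between them.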
Cluster Map Layout
The visualization highlights groups of similar documents by a convex hull shape that visually holds its nodes together. The size of this shape is dynamic and depends on the number of contained nodes. Each of the nodes, variable in size and color, represents one of the documents returned by the search function:
- Node size is proportional to the reach of the document’s media source (a CNN.com article, for example, is rendered larger than a report published on a local community site).
- Node color indicates either the selected checkboxes from the left sidebar or the normalized document sentiment, ranging from red (negative) to grey (neutral) and green (positive). The saturation depends on the degree of polarity: vivid colors indicate emotional articles, while lower saturation indicates more factual coverage.
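The two visual encodings above can be sketched as simple mapping functions. The following is an illustrative sketch, not the system's actual rendering code. It assumes that sentiment is normalized to the range [-1, 1], that "size" refers to the node's area, and that reach is compared against a known maximum; the names `node_radius` and `sentiment_color` are hypothetical.

```python
import colorsys
import math

def node_radius(reach, max_reach, max_radius=20.0):
    """Radius chosen so that the node's area is proportional to reach."""
    if max_reach <= 0:
        return 0.0
    share = min(max(reach / max_reach, 0.0), 1.0)
    return max_radius * math.sqrt(share)

def sentiment_color(sentiment):
    """Map a sentiment in [-1, 1] to a hex color.

    Negative values are red, positive values are green, and the
    saturation grows with polarity, so neutral documents appear grey.
    """
    s = max(-1.0, min(1.0, sentiment))
    hue = 0.0 if s < 0 else 1.0 / 3.0  # red or green
    r, g, b = colorsys.hls_to_rgb(hue, 0.5, abs(s))
    return "#{:02x}{:02x}{:02x}".format(
        round(r * 255), round(g * 255), round(b * 255)
    )

# A source with four times the reach gets twice the radius.
large = node_radius(100, max_reach=100)
small = node_radius(25, max_reach=100)
neutral = sentiment_color(0.0)
```

Mapping reach to area rather than radius is a design choice: it keeps large sources from dominating the map visually.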
Three keyword labels per cluster describe its contents. The system renders the hull shapes of nodes with reduced opacity to decrease the visual load and increase the labels’ readability. The computation of these labels is based on the document keywords within the cluster. This process considers the reach of the documents’ sources to optimize the selection.
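One plausible way to combine document keywords with source reach is to let each document vote for its keywords with a weight equal to its reach. The text does not describe the actual scoring, so the sketch below is an assumption; `cluster_labels` and the sample documents are hypothetical.

```python
from collections import defaultdict

def cluster_labels(documents, top_n=3):
    """Pick the top_n keywords of a cluster, weighted by source reach."""
    scores = defaultdict(float)
    for doc in documents:
        # Count each keyword once per document, weighted by reach.
        for keyword in set(doc["keywords"]):
            scores[keyword] += doc["reach"]
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda k: (-scores[k], k))
    return ranked[:top_n]

# Hypothetical documents of one cluster.
documents = [
    {"keywords": ["election", "poll", "debate"], "reach": 0.9},
    {"keywords": ["election", "poll"], "reach": 0.4},
    {"keywords": ["election", "turnout"], "reach": 0.1},
    {"keywords": ["weather"], "reach": 0.05},
]
labels = cluster_labels(documents)
```

With this weighting, a keyword from a single high-reach source can outrank one that appears in several low-reach sources.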
Interactive Cluster Map Features
- Hovering over a document cluster hides its keywords and highlights its shape and nodes through higher opacity. Node colors within a highlighted cluster become more vivid.
- Clicking on a cluster triggers a new search, narrowing down the set of results to documents within the selected cluster.
- Hovering over a single node highlights this node with an orange stroke and shows a tooltip with document keywords and the favicon of the source.
Keyword clustering tools have to balance accuracy and scalability. Common methods include the Louvain method for community detection as well as K-means, which divides the collection of documents into a fixed number of clusters and assigns each document to the cluster with the nearest centroid. Because K-means usually starts from randomly chosen centroids, repeated runs can produce different results. Agglomerative hierarchical clustering, in contrast, is deterministic: a given set of documents always results in the same layout. It uses an iterative “bottom-up” approach that merges clusters pairwise into a tree-like structure. The story detection algorithm of webLyzard pursues a similar approach. It uses time slices and is particularly suited for the real-time clustering of very large document collections.
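The difference between the two approaches can be seen in a small, self-contained sketch. It is illustrative only and is not the webLyzard story detection algorithm. It assumes that documents are already represented as numeric vectors and uses average linkage, which is one of several possible linkage criteria. The K-means variant takes an explicit `seed` because its initial centroids are drawn at random, whereas the agglomerative variant has no random step.

```python
import math
import random

def kmeans(points, k, seed, iterations=20):
    """Plain K-means; the result can depend on the random seed."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        # Assign each point to the cluster with the nearest centroid.
        for idx, p in enumerate(points):
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(idx)
        # Move each centroid to the mean of its members.
        for c, members in enumerate(groups):
            if members:
                centroids[c] = tuple(
                    sum(points[i][d] for i in members) / len(members)
                    for d in range(len(points[0]))
                )
    return sorted(sorted(g) for g in groups)

def agglomerative(points, n_clusters):
    """Bottom-up clustering with average linkage; no randomness involved."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        # Find the closest pair of clusters and merge them.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical 2-D document vectors forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
hierarchical = agglomerative(points, 2)
partition = kmeans(points, 2, seed=1)
```

The pairwise search makes this agglomerative sketch far too slow for large collections, which illustrates the trade-off between accuracy and scalability.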
- Jain, A.K. (2010). Data Clustering: 50 Years Beyond K-means, Pattern Recognition Letters, 31(8): 651-666.
- Syed, K.A.A., Kröll, M., Sabol, V., Scharl, A. and Gindl, S. (2012). Incremental and Scalable Computation of Dynamic Topography Information Landscapes, Journal of Multimedia Processing and Technologies, 3(1): 49-65.