archives.gandrew.com
All old news here. Move along.

Tag Clusters

July 8th 2006

With the Tag Clusters project I was looking at how a hierarchy or faceted system can be automatically generated from a set of tags or a folksonomy*. The project is composed of

I've written about this twice on my blog.

*I consider a folksonomy to be a collaborative set of tags, rather than just personal tagging.

Groundwork

First up some basic definitions

As relatively new ideas, these definitions are not used consistently across the literature.  My definitions are designed to make analysing the concepts easier, but for a better understanding, Thomas Vander Wal's definition is probably the most authoritative and wikipedia 's definition probably represents common usage of the terms.

Tags and Sets

This chapter begins with discussing the mathematical foundations for tagsonomies using sets. Then goes on to explores a simple property of tag maps - Tag Similarity. This property can be used to find containment relations and tag synonymns, generate hierarchies from tag maps, and to generate facet-like clusters of tags.

Partitioning

So the aim of this chapter is to talk about partitioning tags into independent groups.

Eg.
Set A - 1,2,3 
Set B -   2,3 
Set C - 1,  3 
Set D -       4,5 
Set E -         5,6,7 

The set ABC is disjoint with DE so we can partition this data into two sets ABC and DE. By representing the tagsonomy as a graph we can see that searching for disconnected graphs performs the same function.

Real world application

On a real world tagsonomy this is unlikely to yield many partitions (on my delicious tag set it yields three partitions with two of them being trivial).

Improvement

To improve the algorithm we need to define "connectedness" (connectivity?) and treat graphs with low connectedness as independant. Connectedness is probably strongly related to similarity.

Connectivity

A k-connected graph is a graph where any two vertices can be connected by a path of length k.

Tag Maps and Sets

Most of this chapter is based on the idea of Tag Sets. In a tag set each tag forms a set of resources. Thus given the following taggings.

Resource A, Tags: code snippets, php, dom, xml Resource B, Tags: code_snippets, javascript, dom, xml Resource C, Tags: code_snippets, javascript Resource D, Tags: code_snippets, php We would have the following sets

Considering tag maps as sets,allows us to use set notation, and concepts from set theory such as Unions, Intersection, Negation, Containment (subsets) and ordering. For an introduction to these topics see wikipedia/set, wikipedia/Algebra of sets, and wikipedia/Order Theory

Intersection

Union

Negation

Containment

Immediately we can see that there is some order in this previously flat tag space. All the resources, and all the other tag sets are completely contained within code_snippets, and the php and javascript sets don't overlap.

Partial Ordering with Containment

Tag Similarity

The tag similarity relation expands on the idea of containment but also considers that the tagging is imperfect noisy data. The algorithm evolves from first defining set equality as

A = B iff A [is contained in] B and B [is contained in] A

Then examing the containment relation

A [is contained in] B iff for each resource, i, in A, is in B
or
A [is contained in] B iff |A| < |A n B|

To get the follwing definition of equality

A = B iff |A| < |A n B|and |B| < |A n B|
or
A = B iff |A| = |B| = |A n B|
or
A = B iff |A|/|A n B| = 1 and |B|/|A n B| = 1

The ratio |A|/|A n B| is defined as the similarity of A to B, S(A, B). This relation is not symmetric ie. S(A, B) &neq; S(B, A). Some useful proerties of similarity:

Similarity, Containment and Equality algorithms

Fuzzy containment and equality

We can relax our defnitions of equality, disjointedness and containment

to

Tags and Graphs

A graph is a collection of nodes (vertices) and edges (arcs). An edge links two nodes, if the the direction of the link matters, the graph is a directed-graph or digraph. If the edges of a graph convery extra information the graph is called a labelled graph. It is easy to see that a tagsonomy describes a graph where each tag is a node, each resource is a node, and each tagging is an edge connecting a resource to a tag (or vice-versa). This link is directional so the resource is a digraph, and for now we do not need to add information to the links so the graph is unlabelled.

Tag Graphs

Graphs of tagging are useful, but to explore relationships more deeply it would be nice to have higher-level graphs. Two second-order graphs can be created for resources and tags. A tag graph is a graph where the nodes are tags and an edge exists between two nodes if there is a resource that is tagged with both tags. A tag graph can easily be generated by walking through the tags in a TagMap and adding an edge to each of the adjacent tags if one hasn't already been added.

Limited Tag Graphs

As the real-world example shows tag-graphs convey too much information to be truly useful. One way of limiting the information is too simply focus the graph on a limited set of tags, perhaps just showing one tag and its cotags, or showing the relationship between a set of cotags.

Similarity Graphs

The main problem with the tag-graph is that much of the information is noise, it is too easy for tags to become linked. To prune the links we can reuse the tag similarity measure developed previously. The tag graph we previously defined is equivalent to the graph {T1 -> T2 | similairty(T1, T2) > 0} by generalising this and increasing the parameter from 0 to an arbitrary constant sim_max.

Visualisation

Cascading Lists

Treemaps

About Treemap

See University of Maryland's Treemap History for more information.

Treemaps and Tagmaps

At the heart of treemaps is an algorithm for displaying a weighted list in a two dimensional view, with areas representing the weights. This breaks an important rule of graphic design - that 1 dimensional values should not be represented as 2-dimension structures. This rule is important because people find it much more difficult to compare areas than lengths. For displaying tags this is less important because usually the users do not care about exactly comparing the size of groups. The main problem with treemaps is that the labels do not always fit in the boxes, which makes them practially useless for navigation.

Treemaps and Tag Hierarchies

The techniques for making tags into hierarchies may be combined with Treemaps to get better/different results. I'll look into this one day.