Tag Clusters
With the Tag Clusters project I was looking at how a hierarchy or faceted system can be automatically generated from a set of tags or a folksonomy*. The project is composed of
- Tag Maps - this is a javascript api for dealing with sets of tagged resources
- Delicious Tag Map - this converts the contents of delicious's all posts api call to a Tag Map
- MinimumSpanningSet - A minimum spanning set algorithm
- MinimalSpanningSet - A close approximation to a minimum spanning set algorithm
- TagSimilarity = A similarity algorithm for Tag Maps
- HashMap - a javascript hash map used by Tag Map
Groundwork
First up some basic definitions
- Tagsonomy - This is a set of tagged resources, each resource has one or more "tags" which are not limited to any preset vocabulary.
- Folksonomy - A collective tagsonomy, usually the aggregate of a group of users personal tagsonomies.
- Tag Map - The structure used to represent a tagsonomy.
- Weighted Tag Map - The structure used to represent a folksonomy.
As relatively new ideas, these definitions are not used consistently across the literature. My definitions are designed to make analysing the concepts easier, but for a better understanding, Thomas Vander Wal's definition is probably the most authoritative and wikipedia 's definition probably represents common usage of the terms.
Tags and Sets
This chapter begins with discussing the mathematical foundations for tagsonomies using sets. Then goes on to explores a simple property of tag maps - Tag Similarity. This property can be used to find containment relations and tag synonymns, generate hierarchies from tag maps, and to generate facet-like clusters of tags.
Partitioning
So the aim of this chapter is to talk about partitioning tags into independent groups.
Eg.Set A - 1,2,3 Set B - 2,3 Set C - 1, 3 Set D - 4,5 Set E - 5,6,7
The set ABC is disjoint with DE so we can partition this data into two sets ABC and DE. By representing the tagsonomy as a graph we can see that searching for disconnected graphs performs the same function.
Real world application
On a real world tagsonomy this is unlikely to yield many partitions (on my delicious tag set it yields three partitions with two of them being trivial).
Improvement
To improve the algorithm we need to define "connectedness" (connectivity?) and treat graphs with low connectedness as independant. Connectedness is probably strongly related to similarity.
Connectivity
A k-connected graph is a graph where any two vertices can be connected by a path of length k.
Tag Maps and Sets
Most of this chapter is based on the idea of Tag Sets. In a tag set each tag forms a set of resources. Thus given the following taggings.
Resource A, Tags: code snippets, php, dom, xml
Resource B, Tags: code_snippets, javascript, dom, xml
Resource C, Tags: code_snippets, javascript
Resource D, Tags: code_snippets, php
We would have the following sets
Considering tag maps as sets,allows us to use set notation, and concepts from set theory such as Unions, Intersection, Negation, Containment (subsets) and ordering. For an introduction to these topics see wikipedia/set, wikipedia/Algebra of sets, and wikipedia/Order Theory
Intersection
Union
Negation
Containment
Immediately we can see that there is some order in this previously flat tag space. All the resources, and all the other tag sets are completely contained within code_snippets, and the php and javascript sets don't overlap.Partial Ordering with Containment
Tag Similarity
The tag similarity relation expands on the idea of containment but also considers that the tagging is imperfect noisy data. The algorithm evolves from first defining set equality as
A = B iff A [is contained in] B and B [is contained in] A
Then examing the containment relation
A [is contained in] B iff for each resource, i, in A, is in B
or
A [is contained in] B iff |A| < |A n B|
To get the follwing definition of equality
A = B iff |A| < |A n B|and |B| < |A n B|
or
A = B iff |A| = |B| = |A n B|
or
A = B iff |A|/|A n B| = 1 and |B|/|A n B| = 1
The ratio |A|/|A n B| is defined as the similarity of A to B, S(A, B). This relation is not symmetric ie. S(A, B) &neq; S(B, A). Some useful proerties of similarity:
- S(A, B) = 0 : A n B = {}
- S(A, B) >= 1 : A [contains] B
- S(A, B) = S(B, A) = 1 : A = B
Similarity, Containment and Equality algorithms
Fuzzy containment and equality
We can relax our defnitions of equality, disjointedness and containment
- S(A, B) = 0 : A n B = {}
- S(A, B) >= 1 : A [contains] B
- S(A, B) = S(B, A) = 1 : A = B
to
- S(A, B) ≈ 0 : A n B ≈ {}
- S(A, B) > ≈ 1 : A [approximately contains] B
- S(A, B) ≈ 1, S(B, A) ≈ 1 : A ≈ B
Tags and Graphs
A graph is a collection of nodes (vertices) and edges (arcs). An edge links two nodes, if the the direction of the link matters, the graph is a directed-graph or digraph. If the edges of a graph convery extra information the graph is called a labelled graph. It is easy to see that a tagsonomy describes a graph where each tag is a node, each resource is a node, and each tagging is an edge connecting a resource to a tag (or vice-versa). This link is directional so the resource is a digraph, and for now we do not need to add information to the links so the graph is unlabelled.- TagIncidenceList - an incidence list representation of a TagMap
- TagAdjacencyList - an adjacency list representation of a TagMap
- TagDot - a producer of dot files for visualising graphs with Graphviz
Tag Graphs
Graphs of tagging are useful, but to explore relationships more deeply it would be nice to have higher-level graphs. Two second-order graphs can be created for resources and tags. A tag graph is a graph where the nodes are tags and an edge exists between two nodes if there is a resource that is tagged with both tags. A tag graph can easily be generated by walking through the tags in a TagMap and adding an edge to each of the adjacent tags if one hasn't already been added.- Simple tag graph
- Delicious tag graph