The Long Tail of Tags
I made an observation the other day that led me to another, and then another. Perhaps these are entirely obvious to you, but I hadn't previously made the connection between the tags I use, their frequency in my tagcloud and Chris Anderson's 'Long Tail' theory. A quick search of the web didn't turn up anything specific to this topic, so I thought I'd share what I found.
First, let's start with the classic tagcloud. Here's a pic of all the tags I've used at del.icio.us:
As per the standard tagcloud visual representation, the size of each tag represents the relative frequency of the tags I've used - the larger the size of the tag, the more I have used that tag relative to another tag in my 'tagcloud'. Sized tagclouds can be a helpful navigational device, providing a view into the distribution of 'interest' about things. So if you look at my tagcloud, you can get a feel of what interests me.
What I had a hunch about, and confirmed via the graphing, is that my interests seem to follow the classic Long Tail / powercurve.
My Long Tail of Tags
On to the data...
- I have tagged 681 'articles' (URLs)
- I have used 386 tags
- The most used tag is 'RSS'.
- I've tagged 139 'articles' with the 'RSS' tag, around 20% of the 681 articles (or around 36% if measured against the 386 distinct tags)
- I've used the 'Atom' tag 4 times, or roughly 1% (btw, I expect to tag more stuff with Atom as my interest in that topic is on the increase)
- I have tagged 25 articles with the tag 'microformats'
So, I threw my del.icio.us tag data into a spreadsheet - all the tags I have used and their frequency (shown in the tagcloud above) - then sorted the list in descending order (most used tags at the top), charted it and added a logarithmic trendline. This is what I saw:
Each tag is listed along the horizontal axis and its frequency is represented along the vertical, so the most used tags are on the left (the 'RSS' tag starts the series).
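For anyone who would rather script this than use a spreadsheet, here is a minimal sketch of the same steps: sort the tags by descending frequency, then fit a logarithmic trendline of the form count = a + b * ln(rank) by ordinary least squares. The function names are my own invention, and only the 'RSS', 'microformats' and 'Atom' counts come from the data above; the other counts are made up for illustration.

```python
import math

def rank_tags(tag_counts):
    """Return (tag, count) pairs sorted from most used to least used."""
    return sorted(tag_counts.items(), key=lambda kv: kv[1], reverse=True)

def fit_log_trendline(counts):
    """Least-squares fit of count = a + b * ln(rank), with rank starting at 1.

    `counts` must already be sorted in descending order and hold at least
    two values.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    n = len(counts)
    mean_x = sum(xs) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, counts))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# 'RSS', 'microformats' and 'Atom' are counts from the post;
# the 'example-*' tags are hypothetical filler.
tag_counts = {
    "RSS": 139,
    "microformats": 25,
    "Atom": 4,
    "example-a": 12,
    "example-b": 2,
    "example-c": 1,
}

ranked = rank_tags(tag_counts)
a, b = fit_log_trendline([count for _, count in ranked])
print(ranked[0], "slope of trendline:", round(b, 2))
```

A negative slope b means frequency falls as rank increases, which is the shape the chart shows.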
Lo and behold, the Long Tail appears once more!
This is more than a variation on the theme of the Long Tails of language (Zipf's observation that the frequency of words used in the English language followed a powerlaw distribution) and words - this is the Long Tail of my interests as represented by tags.
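For reference, Zipf's observation is usually written as a powerlaw in rank. This is the textbook form rather than anything fitted to my own data, and the symbols (f for frequency, r for rank, s for the exponent, C for a constant) are just the conventional ones:

```latex
f(r) \propto \frac{1}{r^{s}}, \qquad s \approx 1
\quad\Longrightarrow\quad
\log f(r) = \log C - s \log r
```

The second form is why a powerlaw shows up as a roughly straight line when both axes are logarithmic.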
I tag stuff of interest to me > my tags express my interests > the distribution of my tags expresses the distribution of my interests > My mind is a powerlaw!
And if you tag a lot, yours probably is too...try it.
Laws of the Long Tail of Tags
So based on the above, I propose the first two Laws of the Long Tail of Tags:
1. the frequency of tags used by any user who is not required to follow a formalized taxonomy will follow a Long Tail powercurve distribution
2. any tag tagged by a user has a >80% chance of being in the 'head' of that user's tagcloud Long Tail
Kind of obvious if you think about it, I suppose. But it hadn't occurred to me until I thought about tags in Long Tail terms.
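One way to check the second law against your own tags is to measure what share of all your tag uses falls on your most-used tags. This is only a sketch: I never defined where the 'head' ends, so treating the top 20% of tags as the head is an assumption, and `head_share` and the sample counts are hypothetical.

```python
def head_share(counts, head_fraction=0.2):
    """Fraction of all tag uses that land on the most-used tags.

    `head_fraction` is the assumed size of the 'head' as a share of the
    distinct tags (0.2 means the top 20% of tags).
    """
    ordered = sorted(counts, reverse=True)
    head_size = max(1, round(len(ordered) * head_fraction))
    return sum(ordered[:head_size]) / sum(ordered)

# Hypothetical counts for ten tags, most of them used only once.
example_counts = [139, 25, 12, 4, 2, 1, 1, 1, 1, 1]
share = head_share(example_counts)
print(f"{share:.0%} of tag uses are in the head")
```

If the law holds for a given tagcloud, the printed share should come out above 80%; a perfectly flat tagcloud would give a share equal to `head_fraction`.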
Now for another Long Tail in tagspace...
Let's take the Long Tail article published in Wired. Around 1,500 users have bookmarked the article in del.icio.us using all sorts of tags. I looked at how often each tag was used by the users who tagged the article, to see if there was a powerlaw there too.
Unfortunately I haven't found a way to find all the tags used by all users for the article - I can only get the top 25 (the limit defined by del.icio.us...if anyone knows how to get the rest of the tags please let me know):
I graphed the above data and included some hypothetical data:
If there is a Long Tail here too, and I'm sure there is, what would that mean? And how does an item's tag distribution relate to the behavior of the taggers' own tagclouds?
We already know that in the process of lots of people tagging stuff, a collective agreement emerges about how things should be tagged. The popular tags used to categorize an article live at the 'head of the tail'. We can also assume the tags that appear in the 'tail' of the Long Tail show how an article means different things to different people. But what of their relevance in terms of 'importance' to those taggers?
Looking through the data relating to the entire bookmarking history of the Wired article by all users on del.icio.us, there are tags at the 'tail' that are not listed in the top 25 tags. Examples are
I chose to follow the link of one of the users who tagged the article with the 'collaboration' tag and went to their tagspace on del.icio.us. And there it was...The 'Collaboration' tag was the most used tag by the user called David Kato, almost the only one who tagged the article with the 'collaboration' tag. So I threw David's tagcloud into my spreadsheet...
Below is David's Long Tail of Tags. I'll point out here that he has tagged 168 items, using 72 tags - so it's not a large data set and we're therefore not seeing a very smooth curve here. However, I propose that over time his tag distribution will look more like the classic Long Tail shape we're looking for:
So what does that mean? Again, this may be quite obvious to you, but it seems pretty interesting to me. What it says to me at least is that these Long Tails of individual minds are strongly and potentially algorithmically correlated to the Long Tails of taggers' collective efforts.
Looking at the tagging data in this way (and without any use of fancy algorithms) we can see the inherent potential of using tagging as a basis for collaborative filtering and recommendation systems. Based on the simple and unscientific analysis I've done here, it appears that the world of tagging holds related Long Tail networks everywhere.
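To make the recommendation idea a bit more concrete, here is a rough sketch of how two tag distributions could be compared. I haven't done this analysis myself; cosine similarity is simply one common choice I'm assuming for illustration, and the function name and all the counts below are hypothetical.

```python
import math

def cosine_similarity(tags_a, tags_b):
    """Cosine similarity between two {tag: count} dictionaries.

    Returns a value between 0.0 (no tags in common) and 1.0 (identical
    proportions of tags).
    """
    shared = set(tags_a) & set(tags_b)
    dot = sum(tags_a[tag] * tags_b[tag] for tag in shared)
    norm_a = math.sqrt(sum(v * v for v in tags_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tags_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical tagclouds for two users and one article.
user_one = {"collaboration": 30, "wiki": 10, "longtail": 2}
user_two = {"collaboration": 12, "wiki": 5, "rss": 1}
article = {"longtail": 400, "economics": 150, "collaboration": 3}

print("user one vs user two:", round(cosine_similarity(user_one, user_two), 3))
print("user one vs article: ", round(cosine_similarity(user_one, article), 3))
```

Users whose tagclouds score close to each other could be used as a source of recommendations for one another, and a user's similarity to an article's tag distribution hints at whether the article sits in their 'head' or their 'tail'.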
In other words, tagware = natural Recommendation Networks