I made an observation the other day that then led me to another and then another. Perhaps these are entirely obvious to you, but I hadn't previously made the connection between the tags I use, their frequency in my tagcloud and Chris Anderson's 'the Long Tail' theory. Doing a quick search on the web, I haven't found anything specific to this topic, so I thought I'd share what I found.
First, let's start with the classic tagcloud. Here's a pic of all the tags I've used at del.icio.us:
As per the standard tagcloud visual representation, the size of each tag represents the relative frequency of the tags I've used - the larger the size of the tag, the more I have used that tag relative to another tag in my 'tagcloud'. Sized tagclouds can be a helpful navigational device, providing a view into the distribution of 'interest' about things. So if you look at my tagcloud, you can get a feel of what interests me.
What I had a hunch about, and confirmed via the graphing, is that my interests seem to follow the classic Long Tail / powercurve.
My Long Tail of Tags
On to the data...
- I have tagged 681 'articles' (URLs)
- I have used 386 tags
- The most used tag is 'RSS'.
- I've tagged 139 'articles' with the 'RSS' tag, around 36%.
- I've used the 'Atom' tag 4 times, or 1% (btw, I expect to tag more stuff with Atom as my interest in that topic is on the increase)
- I have tagged 25 articles with the tag 'microformats'
So, I threw my del.icio.us tag data into a spreadsheet - all the tags I have used and their frequency (shown in the tagcloud above) - then sorted the list in descending order (most used tags at the top), charted it and added a logarithmic trendline. This is what I saw:
Each tag is listed along the horizontal axis and their frequency is represented along the vertical, so the tags most used are on the left ('RSS' tag starts the series).
Lo and behold, the Long Tail appears once more!
This is more than a variation on the theme of the Long Tails of language (Zipf's observation that the frequency of words used in the English language followed a powerlaw distribution) and words - this is the Long Tail of my interests as represented by tags.
I tag stuff of interest to me > my tags express my interests > the distribution of my tags expresses the distribution of my interests > My mind is a powerlaw!
And if you tag a lot, yours probably is too...try it.
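If you'd rather try it without a spreadsheet, the steps described above (count each tag, sort by descending frequency, fit a logarithmic trendline) can be sketched in a few lines of Python. This is only a rough sketch: the function names and the sample tags are made up for illustration, and the trendline is an ordinary least-squares fit of count = a * ln(rank) + b, which I'm assuming is roughly what a spreadsheet's logarithmic trendline does.

```python
import math
from collections import Counter


def rank_frequency(tags):
    """Return (tag, count) pairs, most used tag first."""
    return Counter(tags).most_common()


def log_trendline(counts):
    """Least-squares fit of count = a * ln(rank) + b, with rank starting at 1.

    Assumes at least two counts, listed in descending order.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, counts))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


if __name__ == "__main__":
    # Hypothetical sample: one entry per time a tag was applied to a bookmark.
    sample = ["rss"] * 6 + ["microformats"] * 3 + ["atom"] * 2 + ["opml"]
    ranked = rank_frequency(sample)
    a, b = log_trendline([count for _, count in ranked])
    for tag, count in ranked:
        print(f"{tag:15} {count}")
    print(f"trendline: count = {a:.2f} * ln(rank) + {b:.2f}")
```

A negative slope `a` means frequency falls away as you move right along the ranked tags, which is the shape the chart shows. It doesn't prove a powerlaw on its own, but it's a quick first look.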
Laws of the Long Tail of Tags
So based on the above, I propose the first two Laws of the Long Tail of Tags:
1. the frequency of tags used by any user who is not required to follow a formalized taxonomy will follow a Long Tail powercurve distribution
2. any tag tagged by a user has an >80% chance of being in that user's 'head' of their tagcloud Long Tail
Kind of obvious if you think about it, I suppose. But it hadn't occurred to me until I thought about tags in Long Tail terms.
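The second law is easy enough to check against your own data. Here's a minimal sketch, assuming the 'head' means the top 20% of distinct tags by frequency (that cut-off is my assumption for illustration, as is the function name). It returns the share of all tag uses that land in the head.

```python
def head_share(counts, head_fraction=0.2):
    """Share of all tag uses that fall on the most used tags.

    counts: how many times each distinct tag was used.
    head_fraction: assumed size of the 'head' as a fraction of distinct tags.
    """
    if not counts or sum(counts) == 0:
        raise ValueError("need at least one tag with a non-zero count")
    ranked = sorted(counts, reverse=True)
    # Always keep at least one tag in the head, even for tiny tagclouds.
    head_size = max(1, int(len(ranked) * head_fraction))
    return sum(ranked[:head_size]) / sum(ranked)


if __name__ == "__main__":
    # Hypothetical frequencies for ten tags.
    print(head_share([40, 30, 10, 5, 5, 4, 3, 1, 1, 1]))
```

If the law holds for a given tagger, this number should come out above 0.8 for their tagcloud.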
Now for another Long Tail in tagspace...
Let's take the Long Tail article published in Wired. Around 1,500 users have bookmarked the article in del.icio.us using all sorts of tags. I looked at the tags used by all the users who tagged the article to see if there was a powerlaw there too.
Unfortunately I haven't found a way to find all the tags used by all users for the article - I can only get the top 25 (the limit defined by del.icio.us...if anyone knows how to get the rest of the tags please let me know):
I graphed the above data and included some hypothetical data:
If there is a Long Tail here too, and I'm sure there is, what would that mean? And how does an item's tag distribution relate to the behavior of the taggers' own tagclouds?
We already know that in the process of lots of people tagging stuff, a collective agreement emerges about how things should be tagged. The popular tags used to categorize an article live at the 'head of the tail'. We can also assume the tags that appear in the 'tail' of the Long Tail itself show how an article means different things to different people. But what of their relevance in terms of 'importance' to those taggers?
Looking through the data relating to the entire bookmarking history of the Wired article by all users on del.icio.us, there are tags at the 'tail' that are not listed in the top 25 tags. Examples are
I chose to follow the link of one of the users who tagged the article with the 'collaboration' tag and went to their tagspace on del.icio.us. And there it was...The 'Collaboration' tag was the most used tag by the user called David Kato, almost the only one who tagged the article with the 'collaboration' tag. So I threw David's tagcloud into my spreadsheet...
Below is David's Long Tail of Tags. I'll point out here that he has tagged 168 items, using 72 tags - so it's not a large data set and we're not seeing a very smooth curve here. However, I propose that over time his tag distribution will look more like the classic Long Tail shape we're looking for:
So what does that mean? Again, this may be quite obvious to you, but it seems pretty interesting. What it says to me at least is that these Long Tails of individual minds are strongly and potentially algorithmically correlated to the Long Tails of taggers' collective efforts.
Looking at the tagging data in this way (and without any use of fancy algorithms) we can see the inherent potential of using tagging as a basis for collaborative filtering and recommendation systems. Based on the simple and unscientific analysis I've done here, it appears that the world of tagging holds related Long Tail networks everywhere.
In other words, tagware = natural Recommendation Networks
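To make the recommendation idea a bit more concrete, here's a rough sketch of one common building block for collaborative filtering: comparing two taggers by the cosine similarity of their tag counts. This isn't something I ran on the data above. The profiles and names below are hypothetical, and cosine similarity is just one plausible measure among several.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two {tag: count} profiles (0.0 to 1.0)."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[tag] * profile_b[tag] for tag in shared)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target, others):
    """Names of the other taggers, closest tag profile first."""
    return sorted(
        others,
        key=lambda name: cosine_similarity(target, others[name]),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical tag profiles: tag -> number of times used.
    me = {"rss": 10, "microformats": 3, "atom": 1}
    others = {
        "tagger_a": {"rss": 8, "atom": 2},
        "tagger_b": {"collaboration": 9, "wiki": 4},
    }
    print(most_similar(me, others))
```

Taggers whose heads overlap with yours float to the top, and what they've bookmarked that you haven't becomes a candidate recommendation.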
Other Tag related posts of mine (on my old blog):
Jason Kolb has been writing a great series of posts called 'Reinventing the Internet'.
I've been bookmarking and sharing some of these posts via Del.icio.us (and if you're sub'd to me, you would have seen these in my feed). Dipping in and out of these since the first post of his series, they seem to be getting better with each post.
In Jason's first 'Reinventing the Internet' intro post, he starts off with the assertion that:
"If somebody wants to know something about me, I point them to www.jasonkolb.com to find out about me, or to my personal site if it's on a personal level. Everyone I know tells people to find them via their MySpace account, LinkedIn account, or blog. Or, people who still don't have an account on a social network of some type (they will) give out their email address."
As Jason points out in his second post 'A domain name in every pot', companies bet their existence, brand, success and ability to be trusted on this very premise - the domain rules. So, Jason asks, why not for you and me?
And then a quick reminder:
"owning your own domain name is like owning the title to your car. Otherwise, MySpace, LinkedIn, your blog provider, or your email provider owns the title to your online identity."
I think somewhere along the line of my reading the series, Jason kicked me into action as I recently moved my blog to my new domain. Come to think of it, I'm amazed that I hadn't done this years earlier. I've been playing on the web for 12+ years, 10 of those years professionally. It's taken me some time, yes, but now I'm here, wow - it feels good!
And so on to the fundamental question Jason begins to tackle in his series:
- should a blog at a domain name that you own be the epitome of an online presence?
Well, a blog today, and something else tomorrow. The point he makes is that your domain is yours (as long as you keep paying the rent, that is - Jason has another idea on that permarent issue.)
If the answer to Jason's question is 'yes', then what does it mean? What does it enable and why does it matter?
In the next few posts, Jason describes an architecture involving personal servers, URIs as unique personal online addresses and distributed applications, that will allow everyone to:
"eventually have their own personal server hosted at their own personal domain, and those servers will be able to talk to each other and collaborate with each other.
...be a node on an open source peer to peer social network."
It is a fascinating idea and it opens up some interesting scenarios (I'll get to those in another post). There are two key advancements he has discussed so far that would enable this vision:
We'll explore the 'internet as a database' idea further in another post (a topic close to my heart), but for now I'm going to stick with the ID question.
As his posts unfolded, I wondered how he saw his ID vision fitting, if at all, with CardSpace - formerly Infocard, the identity metasystem effort led by Kim Cameron.
Today, Jason posted an 'interlude post' responding to some of the feedback he's received on his series so far, and he called out CardSpace specifically. Bottom line is that Jason believes there is no fit. Jason writes of CardSpace:
"The alternative to this are identity metadata schemes like CardSpace. These assume, however, that you will still have pieces of your online identity scattered amongst various providers, which is precisely what I want to get away from. Consider this statement from the CardSpace information page:
"Different kinds of digital identities will always be necessary—no single identity will suffice... No single organization can unilaterally impose a solution."
Basically what I'm saying in this series of posts is that I completely disagree with this statement. The individual himself should be the single source of online identity. There IS a single organization that can unilaterally impose a solution, and that's the individual. Power to the people ;) "
Jason and Kim (and others in the community working with Kim) agree on the 'power to the people' mantra. I've spoken to Kim, met him and heard him present a couple of times on this and it's a prominent theme in CardSpace (hey, he even blogged me!). I realize Jason has at least looked into CardSpace - he quoted from the Seven Laws of Identity - but I'd encourage him to find out more on what CardSpace has to offer in helping him achieve his vision.
I'd like to highlight two other quotes from Seven Laws of Identity. For the uninitiated, think of these Seven Laws as a base set of requirements that any ID system must meet:
"1. User Control and Consent
No one is as pivotal to the success of the identity metasystem as the individual who uses it. The system must first of all appeal by means of convenience and simplicity. But to endure, it must earn the user’s trust above all.
Earning this trust requires a holistic commitment. The system must be designed to put the user in control of what digital identities are used, and what information is released.
The system must also protect the user against deception, verifying the identity of any parties who ask for information. Should the user decide to supply identity information, there must be no doubt that it goes to the right place. And the system needs mechanisms to make the user aware of the purposes for which any information is being collected.
The system must inform the user when he or she has selected an identity provider able to track Internet behavior."
Back to Jason's objections, I think the following is another key concept to point out with the identity metasystem - the need to support multiple identity providers and systems.
"5. Pluralism of Operators and Technologies
A universal identity system must channel and enable the inter-working of multiple identity technologies run by multiple identity providers.
So when it comes to digital identity, it is not only a matter of having identity providers run by different parties (including individuals themselves), but of having identity systems that offer different (and potentially contradictory) features."
(My bold). Does this mean that a universal identity system proposes or requires the use of a gazillion different usernames / passwords? No, precisely the opposite in fact. However, the metasystem design accepts a heterogeneous internet as a fact of life (you know, Utopia is a very hard thing to come by, if not impossible - I've tried...).
So, should Jason try to solve today's identity nightmare by trying to get everyone to use his one system, or should he try to solve what he really cares about by using a common layer above the various ID systems, including his, that abstracts the differences between these systems (various UIs, behaviors, etc) away from the user? You know that the banks / merchants / services ain't going to replace / swap out their ID systems for years, if not decades or at all.
Instead of being asked to replace their systems, they could just adopt an additional (not replacing) protocol that we can all agree on and that provides a single common UI / ID experience for the users, and go from there. That is what we want for users - a better experience, right? But to get there, we need to accept that:
"The universal identity metasystem must not be another monolith. It must be polycentric (federation implies this) and also polymorphic (existing in different forms). This will allow the identity ecology to emerge, evolve, and self-organize."
The last point is what allows us all to win. In other words, if Jason's system works, and it works well, it will interop with any other system that also uses the universal identity metasystem. If his works really well and the populace likes it, then Jason's solution could become the system of choice for the majority of internet users, if that is how it turned out to be. But without at least an initial level of interoperability between his and the multitude of other systems (that users will want to use via their personal servers), the chances of mass adoption of Jason's vision / solution are vanishingly small compared to the alternative route.
As I see it, in the ID space there is no downside to playing with the others. You can have your cake and eat it. I really think Kim and Jason can and should have a discussion on this.
Nick Malik has written up a cheerful post providing advice on how to kill an app. The context is within the Enterprise, where 'app fights' between IT departments and business units happen all the time, often resulting in maimed, if not mortally wounded, egos, bits and projects.
Nick provides the following line of attack as an example of a Jujitsu-esque maneuver designed to stun the opposition into submission:
"...I'd consider things like: scalability against maximum, throughput against maximum, downtime inside SLA, downtime outside SLA, and Number of people-hours needed for each function point of change request submitted in the past two years as a measure of maintenance costs."
The most savage example I've seen of IT shenanigans in the real world is the 'security and compliance audit' play - an ambitious, yet highly effective ruse that's very hard to combat once momentum is achieved. Note: the following is overkill if you are only trying to kill off a single competing business application / effort.
It goes roughly like this:
- Provide a senior exec with proof points showing that without an overly-centralized app development and IT management organization you should expect the development of 'insecure and non-compliant' IT applications
- Remind the exec that the 'current lack of control' of IT chaos across their org presents unknown and unacceptable business risks to said organization and that it's their *** on the line if anything goes wrong, anywhere.
- Develop your own made-for-purpose definitions of 'insecure and non-compliant' IT applications that would ensure that no system known to mankind could possibly pass your audit
- Propose to lead a project (and receive funding for) an audit of all IT applications across the whole organization - don't forget to ask for explicit senior executive mandate (otherwise known as 'carte blanche')
- Present early results back to the senior executive team. Show that your audit has already turned up a number of IT applications (any two apps will do - shoot at will) that have been proven to be highly 'insecure and non-compliant' (according to your definitions - but don't remind them of that), and that these apps alone present unknown and unacceptable business risks (i.e. the execs' arses)
- Write your own check and plan early retirement
- Choke company to death with IT centralization, then leave
The Seattle Times has reported an interesting bit of news...
"Brian Valentine, who herded the past three versions of Microsoft's flagship operating-system software toward completion, left the company Friday to take a senior position at Amazon.com"
Valentine's Microsoft.com exec page confirms this:
"Brian Valentine left Microsoft in September 2006.
Valentine was senior vice president of Microsoft's Windows Core Operating System Division (COSD), responsible for development of the Windows operating system and driving engineering excellence within the Windows operating system and across platforms.
Valentine joined Microsoft in 1987 as an engineering manager in the LAN Manager group and then spent most of the next 12 years working on Microsoft Mail and Microsoft Exchange Server, eventually managing the Exchange and BackOffice family product units. He was put in charge of Windows in December 1998."
Scott Guthrie has announced the release of V1.0 of the IronPython project for .NET, available for download from CodePlex.
Check out the screencast that Jim Hugunin (lead architect for IronPython) and Jon Udell recorded to demo a bunch of the language's features. Jim has a great post describing the background and goals:
"The more time I spent working on IronPython and with the CLR, the more excited I became about its potential to finally deliver on the vision of a single common platform for a broad range of languages. At that same time, I was invited to come out to Microsoft to present IronPython and to talk with members of the CLR team about technical issues that I was running into. I had a great time that day working through these issues with a group of really smart people who all had a deep understanding of virtual machines and language implementation. After much reflection, I decided to join the CLR team at Microsoft where I could work with the platform to make it an even better target for dynamic languages and be able to have interesting technical discussions like that every day."
More dynamic languages on .NET, that's where this is all heading...Scott Guthrie:
"Going forward, you are going to see even more dynamic languages appear on .NET, and a bunch of cool new scenarios become enabled."
"From a strategic perspective, Microsoft now has a stake in the ground. It aims to make dynamic languages, in the managed environment of the .NET Common Language Runtime, safe for the enterprise. Sun has shown some interest in doing the same for dynamic languages on the Java Virtual Machine, but not much, which is ironic given that Jim Hugunin started working on JPython -- now Jython, the Java equivalent to IronPython -- nine years ago."
Jim Hugunin underscores this point:
"Shipping IronPython 1.0 isn't the end of the road, but rather the beginning. Not only will we continue to drive IronPython forward but we're also looking at the bigger picture to make all dynamic languages deeply integrated with the .NET platform and with technologies and products built on top of it. I'm excited about how far we've come, but even more excited by what the future holds!"
I agree, this is a key milestone for the development of .NET. Next? Well, projects such as Ruby.NET and RubyCLR show the potential in this space...read George Lawton's article on these efforts.
Over the weekend Garrett Rodgers noticed a new set of domains registered by Google that include the word 'archive', leading him to speculate on some possible new services in the near future, something along the lines of the WayBackMachine:
"If I am on the wrong track with the web archive, another possibility for a service named "Google Archive Search" would be one where you could search for historic articles from things like newspapers and magazines. Also don't forget the deal that was made with the Associated Press — it could have something to do with these domains also."
Philipp Lenssen picked up on this too.
According to the BBC, it looks like Garrett got it right with his second guess as you can now use Google News Archive Search to search digitised newspaper articles and more recent online content, spanning, wait for it: the last 200 years:
"People using the search are shown results from both free and subscription-based news outlets.
Partners in the project include the websites of US newspaper the New York Times and the Guardian from the UK.
Other sources include news aggregators, websites which collect and display news stories from multiple sources."
I've had a play and it is impressive. A search for pre-1939 mentions of Winston Churchill and Hitler provides results including a link to this Time Magazine article from June 1935 reporting on proceedings in Britain's Parliament:
"The Lords: Spent most of the week hotly debating the blank check His Majesty's Government gave to Germany to violate the Treaty of Versailles in return for Adolf Hitler's promise to keep his navy at 35% of Britain's (TIME, June 24)."
This is the Timeline view of the same search:
I'm going to play some more...
"While the interface is similar to Google News, the new layout is focused on time. The key intervals for a search are marked with an arrow, and there's also a timeline view that shows the most interesting news from each computer-generated interval."
"Google would not state how many publishers were taking part in the new service…but has announced a number of partners including WSJ, NYT, WaPo, Time, Guardian Unlimited, Factiva, Lexis-Nexis, HighBeam Research and Thomson Gale."