Some background
Most media or content producers today find that their metadata management systems are outdated and have not kept pace with the growth of the ‘folksonomy’ concept that emerged from the Web 2.0 movement. This is not a technical article about taxonomy, and I don’t claim to be a professional data expert. I am simply going to outline some of the ways I have encountered to structure data, based on my experience working with news and blog aggregation. Comment from people who are experts is more than welcome.
The way I see it, there is a dichotomy between vertical, hierarchical category structures with high levels of normalisation and horizontal, unstructured tagging.
Vertical
The first approach comes from classic database design, where you have a table of categories and subcategories that are joined via keys to your content items. This is a one-to-many relationship (one category, many content items) and makes for highly efficient data that conforms to fairly rigid structures. These types of categories do not change often and, when a change is made, it affects a lot of content at once.
The relationships between categories are generally exclusive parent-child relationships: a subcategory sits under one parent only.
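As a rough illustration of the vertical approach, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample data are all assumptions made up for the example, not the schema of any particular CMS.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
# Note: SQLite only enforces REFERENCES when PRAGMA foreign_keys is on;
# here the clauses simply document the intended relationships.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL UNIQUE,
        parent_id INTEGER REFERENCES category(id)  -- NULL for a top-level category
    );
    CREATE TABLE story (
        id          INTEGER PRIMARY KEY,
        headline    TEXT NOT NULL,
        category_id INTEGER NOT NULL REFERENCES category(id)
    );
""")
conn.executemany(
    "INSERT INTO category (id, name, parent_id) VALUES (?, ?, ?)",
    [(1, "Sport", None), (2, "Cricket", 1), (3, "Rugby", 1)],
)
conn.executemany(
    "INSERT INTO story (headline, category_id) VALUES (?, ?)",
    [("Test series opens", 2), ("Squad announced", 2), ("Cup final preview", 3)],
)

# Renaming one category row changes how every story under it is labelled.
conn.execute("UPDATE category SET name = 'Cricket news' WHERE id = 2")

rows = conn.execute("""
    SELECT story.headline, category.name, parent.name
    FROM story
    JOIN category ON category.id = story.category_id
    LEFT JOIN category AS parent ON parent.id = category.parent_id
    ORDER BY story.id
""").fetchall()
for headline, category, parent in rows:
    print(f"{parent} > {category}: {headline}")
```

Each story points at exactly one category, and each category points at no more than one parent, which is where the rigidity (and the efficiency) comes from.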
Horizontal
As site owners started opening up their taxonomies to the hordes, the rules were thrown out the window. Spelling and normalisation took a back seat because the prominence of tags was now determined by sheer volume and the wisdom of the crowds. The number of tags per content object is generally unlimited and completely unstructured, essentially like keywords.
If you consider the changes in the recent release of WordPress, as an example, you will see that there is now support for both tags and categories when you create a new blog post. On Blogger there are only tags. In some other systems there are only categories or sections.
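To make the "prominence by volume" idea concrete, here is a small sketch that counts free-form tags across a handful of posts. The tag lists are invented sample data, and counting with `collections.Counter` is just one simple way to do it.

```python
from collections import Counter

# Hypothetical free-form tags typed by different users, one list per post.
tags_per_post = [
    ["politics", "elections"],
    ["Politics", "election"],
    ["politics", "budget"],
    ["elections", "politics"],
]

# Count every tag exactly as it was typed; no normalisation is applied.
counts = Counter(tag for tags in tags_per_post for tag in tags)

# The most-used tags float to the top purely through volume.
top_tags = counts.most_common(2)
print(top_tags)
```

Because nothing is normalised, "politics" and "Politics" (or "elections" and "election") are counted as separate tags. The crowd's most common spelling wins, and the variants linger in the long tail.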
During the redevelopment of the Mail & Guardian Online, I had several very interesting discussions with the editorial staff about the logic of tagging. For some bloggers a tag is just a keyword, and not much thought is given to how items are tagged in the long term. But when a team generates 50 to 100 new stories a day, the way stories get tagged suddenly becomes a whole lot more relevant.
Based on these discussions, I have put together a list of several approaches to tagging and categorisation and my thoughts on the pros and cons of each.
Categories/sections only
| Typical data structure | Pros | Cons |
| --- | --- | --- |
| A table of unique category/section names joined to the story table as a foreign key or via a lookup table for multiple joins | | |
Tags only
| Typical data structure | Pros | Cons |
| --- | --- | --- |
| A table of tag names, many repeated, joined to the story table as a foreign key or via a lookup table for multiple joins | | |
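The lookup-table variant of the tags-only structure might look something like the sketch below, again using Python's sqlite3 module. The `story`, `tag` and `story_tag` names are hypothetical, and the helper function is only there to show the join.

```python
import sqlite3

# Hypothetical schema: a lookup table lets one story carry many tags
# and one tag appear on many stories.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE story (id INTEGER PRIMARY KEY, headline TEXT NOT NULL);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE story_tag (
        story_id INTEGER NOT NULL REFERENCES story(id),
        tag_id   INTEGER NOT NULL REFERENCES tag(id),
        PRIMARY KEY (story_id, tag_id)
    );
""")
conn.executemany("INSERT INTO story (id, headline) VALUES (?, ?)",
                 [(1, "Budget speech"), (2, "Election date set")])
conn.executemany("INSERT INTO tag (id, name) VALUES (?, ?)",
                 [(1, "politics"), (2, "budget"), (3, "elections")])
conn.executemany("INSERT INTO story_tag (story_id, tag_id) VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1), (2, 3)])

def stories_tagged(name):
    """Return headlines of all stories carrying the given tag name."""
    rows = conn.execute("""
        SELECT story.headline
        FROM story
        JOIN story_tag ON story_tag.story_id = story.id
        JOIN tag ON tag.id = story_tag.tag_id
        WHERE tag.name = ?
        ORDER BY story.id
    """, (name,))
    return [headline for (headline,) in rows]

print(stories_tagged("politics"))
```

The sketch leaves `tag.name` without a uniqueness constraint, so repeated or misspelled names can pile up. That matches the "many repeated" description above, and is a choice you may well want to reverse.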
Grouped tags
| Typical data structure | Pros | Cons |
| --- | --- | --- |
| A few tables of different types of tag names, many repeated, joined to the story table as a foreign key or via a lookup table for multiple joins. Example: people tags, place tags, thing tags. | | |
The OpenCalais service, which we used at the Mail & Guardian, works according to the grouped tags structure, and this helped us a lot. The major consideration with tagging this way is that, in theory, you would be introducing several new fields into your CMS, one for each type of tag. This can add up over time and slow down your publishing.
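A grouped-tags record can be sketched as one list per tag type, mirroring the one-field-per-type idea. The kinds below follow the people/place/thing example, and the function name and sample tags are invented for illustration. This is not how OpenCalais itself represents its output.

```python
# Hypothetical tag groups, following the people/place/thing example.
TAG_KINDS = ("person", "place", "thing")

def group_tags(tagged):
    """Group (kind, name) pairs into one list per tag kind, much as a CMS
    might store them in one field per kind. Duplicates are dropped."""
    groups = {kind: [] for kind in TAG_KINDS}
    for kind, name in tagged:
        if kind not in groups:
            raise ValueError(f"unknown tag kind: {kind!r}")
        if name not in groups[kind]:
            groups[kind].append(name)
    return groups

story_tags = group_tags([
    ("person", "Nelson Mandela"),
    ("place", "Johannesburg"),
    ("thing", "Constitution"),
    ("place", "Johannesburg"),  # repeated on purpose; kept only once
])
print(story_tags)
```

Every new kind added to `TAG_KINDS` is effectively another field for editors to fill in, which is the publishing cost mentioned above.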
Autocomplete
The problems with spelling and word order when content administrators tag stories can be solved, to an extent, by using autocomplete form fields. This means someone starts typing and the field suggests tags that already exist for the user to select. Without autocomplete the Web would be a much messier place, and it’s a great example of the subtle impact AJAX has had on publishing.
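The server side of such an autocomplete field can be very small. The sketch below assumes a simple case-insensitive prefix match over the existing tags. The function name, the `limit` parameter and the sample tags are all hypothetical, and a real site would more likely query its tag table.

```python
def suggest_tags(prefix, existing_tags, limit=5):
    """Return up to `limit` existing tags that start with `prefix`,
    ignoring case and surrounding whitespace."""
    needle = prefix.strip().casefold()
    if not needle:
        return []  # nothing typed yet, so nothing to suggest
    matches = [tag for tag in existing_tags if tag.casefold().startswith(needle)]
    return sorted(matches, key=str.casefold)[:limit]

# Hypothetical tags already in use on the site.
existing = ["Elections", "Economy", "Education", "Eskom", "Politics"]

print(suggest_tags("e", existing, limit=3))
print(suggest_tags(" El", existing))
```

Suggesting the existing spelling first nudges people towards reusing a tag instead of minting a near-duplicate.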
Tagging and microformats
It’s always good to play nicely with everyone else, so read this article about Microformats and tagging. In a nutshell, just add rel="tag" as an attribute in your tag hyperlinks. This may prove useful some time in the near future.
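If your templates build tag links in code, adding the attribute is a one-line change. The sketch below is a hypothetical helper: the `tag_link` name and the `example.com` base URL are placeholders. It puts the tag in the last path segment of the link, which is where the rel-tag microformat expects to find it.

```python
from html import escape
from urllib.parse import quote

def tag_link(tag, base_url="https://example.com/tags/"):
    """Render a tag as a hyperlink carrying rel="tag".

    base_url is a placeholder; the tag becomes the last path segment.
    """
    href = base_url + quote(tag, safe="")  # percent-encode spaces, slashes, etc.
    return f'<a href="{escape(href)}" rel="tag">{escape(tag)}</a>'

print(tag_link("south africa"))
```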
