
Strategies for tagging large volumes of content

Posted in Uncategorized on Tuesday, September 16th, 2008 by Vincent Maher. Tags: publishing, tagging

Some background
Most media or content producers today find that their systems for metadata management are outdated and have not kept pace with the growth of the ‘folksonomy’ concept that emerged from the Web 2.0 movement. This is not a technical article about taxonomy, and I don’t claim to be a professional data expert. I am simply going to outline some of the ways I have encountered to structure data, based on my experience working with news and blog aggregation. Comment is more than welcome from people who are experts.

The way I see it, there is a dichotomy between vertical, hierarchical category structures with high levels of normalisation and horizontal, unstructured tagging.

Vertical
The first approach comes from classic database design, where you have a table of categories and subcategories that are joined via keys to your content items. This is a one-to-many relationship and makes for highly efficient data that conforms to fairly rigid structures. These categories do not change often and, when a change is made, it affects a lot of content at once.

The relationships between categories are generally parent-child relationships of exclusivity.
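
As a rough illustration of the parent-child structure described above, here is a minimal Python sketch. The `Category` class, the `CATEGORIES` rows and the `category_path` helper are all hypothetical names invented for this example, and the section names are made up rather than taken from any real site.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Category:
    """One row of a hypothetical category table."""
    id: int
    name: str
    parent_id: Optional[int]  # None marks a top-level section


# Illustrative rows only; these section names are assumptions.
CATEGORIES = {
    c.id: c
    for c in [
        Category(1, "News", None),
        Category(2, "Sport", None),
        Category(3, "Politics", 1),
        Category(4, "Elections", 3),
    ]
}


def category_path(category_id: int) -> list[str]:
    """Walk parent links up to the root and return names from root to leaf."""
    path = []
    current = CATEGORIES.get(category_id)
    while current is not None:
        path.append(current.name)
        if current.parent_id is None:
            break
        current = CATEGORIES.get(current.parent_id)
    return list(reversed(path))


print(" > ".join(category_path(4)))  # News > Politics > Elections
```

Each category has exactly one parent, which is what makes the structure exclusive: "Elections" can sit under "Politics" or somewhere else, but not in two places at once.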

Horizontal
As site owners started opening up their taxonomies to the hordes, the rules were thrown out the window. Spelling and normalisation took a back seat because the prominence of tags was now determined by sheer volume and the wisdom of the crowds. The number of tags per content object is generally unlimited and the tags are completely unstructured, essentially like keywords.

If you consider the changes in the recent release of WordPress, as an example, you will see that there is now support for both tags and categories when you create a new blog post. On Blogger there are only tags. In some other systems there are only categories or sections.

During the redevelopment of the Mail & Guardian Online, I had several very interesting discussions with the editorial staff about the logic of tagging. For some bloggers a tag is just a keyword and not much thought is given to how items are tagged in the long term, but in the case where a team generates 50 to 100 new stories a day the way stories get tagged suddenly becomes a whole lot more relevant.

Based on these discussions, I have put together a list of several approaches to tagging and categorisation, with my thoughts on the pros and cons of each.

Categories/sections only

Typical data structure: A table of unique category/section names joined to the story table as a foreign key or via a lookup table for multiple joins

Pros:
  • Rigid and well-structured list of options
  • Centralised control
  • Enforces good spelling
  • Efficient from a database perspective
  • Low maintenance
  • Great for structured browsing

Cons:
  • Hierarchical, which doesn’t always suit your requirements
  • Useless for tag clouds
  • Closed to users
  • Terms tend to be quite abstract
  • Splitting and changing names is difficult and messy
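
The categories-only structure could look something like the following sketch, which uses Python's built-in `sqlite3` module with an in-memory database. The table names, column names, sample rows and the two helper functions are assumptions made for illustration, not the schema of any particular CMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE            -- one agreed spelling per section
    );
    CREATE TABLE story (
        id          INTEGER PRIMARY KEY,
        headline    TEXT NOT NULL,
        category_id INTEGER NOT NULL REFERENCES category(id)
    );
""")
# Made-up sample data.
conn.executemany("INSERT INTO category (id, name) VALUES (?, ?)",
                 [(1, "Business"), (2, "Sport")])
conn.executemany("INSERT INTO story (headline, category_id) VALUES (?, ?)",
                 [("Rates held steady", 1), ("Cup final preview", 2), ("Markets rally", 1)])


def stories_in(category_name):
    """Structured browsing: list the headlines filed under one section."""
    rows = conn.execute(
        """SELECT s.headline
           FROM story s JOIN category c ON c.id = s.category_id
           WHERE c.name = ?
           ORDER BY s.id""",
        (category_name,),
    )
    return [row[0] for row in rows]


def rename_category(old_name, new_name):
    """Renaming one category row re-labels every story that points at it."""
    conn.execute("UPDATE category SET name = ? WHERE name = ?", (new_name, old_name))


print(stories_in("Business"))
rename_category("Business", "Economy")
print(stories_in("Economy"))
```

The rename touches a single row yet changes how every story in that section is labelled, which is the "affects a lot of content at once" behaviour described earlier. Splitting a section in two is harder, because each story then has to be reassigned individually.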

Tags only

Typical data structure: A table of tag names, many repeated, joined to the story table as a foreign key or via a lookup table for multiple joins

Pros:
  • Lateral connections
  • More descriptive
  • Works great in tag clouds
  • Can be filtered by date range
  • The one-to-one relationship means splitting and changing is easy
  • Great for discovery

Cons:
  • Messy on the spelling
  • Requires maintenance/curatorship
  • Very inefficient in the db
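
To make the tags-only structure more concrete, here is a small sketch using Python's built-in `sqlite3` module. The `story` and `story_tag` tables, their columns, the sample rows and the `tag_cloud` and `retag` helpers are hypothetical, chosen only to show a date-filtered tag cloud and a per-story tag change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE story (
        id        INTEGER PRIMARY KEY,
        headline  TEXT NOT NULL,
        published TEXT NOT NULL              -- ISO date, e.g. 2008-09-16
    );
    CREATE TABLE story_tag (
        story_id INTEGER NOT NULL REFERENCES story(id),
        tag      TEXT NOT NULL               -- free text, repeated on every story that uses it
    );
""")
# Made-up sample data.
conn.executemany("INSERT INTO story VALUES (?, ?, ?)", [
    (1, "Budget speech", "2008-09-01"),
    (2, "Rate cut", "2008-09-10"),
    (3, "Cup final", "2008-09-15"),
])
conn.executemany("INSERT INTO story_tag VALUES (?, ?)", [
    (1, "economy"), (1, "budget"),
    (2, "economy"), (2, "interest rates"),
    (3, "soccer"),
])


def tag_cloud(start, end):
    """Count tag usage for stories published between two ISO dates (inclusive)."""
    rows = conn.execute(
        """SELECT t.tag, COUNT(*) AS uses
           FROM story_tag t JOIN story s ON s.id = t.story_id
           WHERE s.published BETWEEN ? AND ?
           GROUP BY t.tag
           ORDER BY uses DESC, t.tag""",
        (start, end),
    )
    return rows.fetchall()


def retag(story_id, old_tag, new_tag):
    """Change the tag on one story without touching any other story."""
    conn.execute(
        "UPDATE story_tag SET tag = ? WHERE story_id = ? AND tag = ?",
        (new_tag, story_id, old_tag),
    )


print(tag_cloud("2008-09-01", "2008-09-12"))
# [('economy', 2), ('budget', 1), ('interest rates', 1)]
```

Because the tag text is stored on every row, a misspelling such as "econmy" simply becomes a new tag. That is where the maintenance cost comes from, and the repeated strings are why this layout is less efficient than a table of unique names.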

Grouped tags

Typical data structure: A few tables of different types of tag names, many repeated, joined to the story table as a foreign key or via a lookup table for multiple joins. Example: people tags, place tags, thing tags.

Pros:
  • Easier to generate grouped tag clouds
  • Lateral connections
  • More descriptive
  • Works great in tag clouds
  • Can be filtered by date range
  • The one-to-one relationship means splitting and changing is easy
  • Great for discovery

Cons:
  • Messy on the spelling
  • Requires maintenance/curatorship
  • Very inefficient in the db

The OpenCalais service, which we used at the Mail & Guardian, works according to the grouped tags structure and this helped us a lot. The major consideration when it comes to doing the tagging this way is that, in theory, you would be introducing several new fields into your CMS, one for each type of tag. This can add up over time and slow down your publishing.
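
A grouped tag cloud can be sketched in a few lines of Python. As a simplification, this example keeps all taggings in one list with a group label on each entry instead of using separate tables per tag type. The group names, sample tags and the `grouped_tag_cloud` function are assumptions for illustration and do not reflect the OpenCalais output format.

```python
from collections import Counter, defaultdict

# (story_id, group, tag) with made-up sample data.
TAGGINGS = [
    (1, "people", "Jane Doe"),
    (1, "places", "Cape Town"),
    (1, "things", "budget"),
    (2, "people", "Jane Doe"),
    (2, "places", "Durban"),
    (3, "places", "Cape Town"),
    (3, "things", "rugby"),
]


def grouped_tag_cloud(taggings):
    """Return {group: [(tag, uses), ...]} with the most used tags first."""
    clouds = defaultdict(Counter)
    for _story_id, group, tag in taggings:
        clouds[group][tag] += 1
    return {group: counts.most_common() for group, counts in clouds.items()}


for group, cloud in grouped_tag_cloud(TAGGINGS).items():
    print(group, cloud)
```

Keeping the group alongside each tag means a "people" cloud and a "places" cloud can be rendered separately from the same data.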

Autocomplete
The problems with spelling and word order for tagging by content administrators can be solved, to an extent, by using autocomplete form fields. This means someone starts typing and the field suggests tags that already exist for the user to select. Without autocomplete the Web would be a much messier place and it’s a great example of the subtle impact AJAX has had on publishing.
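
The server side of such an autocomplete field could be as simple as the sketch below: given what the user has typed so far, return the existing tags that start with it. The `normalise` and `suggest` functions and the sample tag list are hypothetical, and a real system would more likely query the tag table than scan a list.

```python
def normalise(tag: str) -> str:
    """Collapse whitespace and ignore case so 'cape  town' matches 'Cape Town'."""
    return " ".join(tag.split()).casefold()


def suggest(prefix, existing_tags, limit=5):
    """Return up to `limit` existing tags that start with the typed prefix."""
    needle = normalise(prefix)
    if not needle:
        return []  # nothing typed yet, so nothing to suggest
    matches = [tag for tag in existing_tags if normalise(tag).startswith(needle)]
    return sorted(matches, key=normalise)[:limit]


# Made-up list of tags already in the system.
EXISTING_TAGS = ["Cape Town", "Capital markets", "Cricket", "Economy", "Elections"]

print(suggest("ca", EXISTING_TAGS))    # ['Cape Town', 'Capital markets']
print(suggest("  EL", EXISTING_TAGS))  # ['Elections']
```

Because the suggestion returns the stored spelling rather than what was typed, editors who pick from the list reuse the existing tag instead of creating a near-duplicate.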

Tagging and microformats
It’s always good to play nicely with everyone else, so read this article about Microformats and tagging. In a nutshell, just add rel="tag" as an attribute in your tag hyperlinks. This may prove useful some time in the near future.
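
As a small illustration, tag links with the rel="tag" attribute could be generated like this. The `tag_link` function and the `/tag/` URL prefix are assumptions made for the example. The rel-tag microformat reads the tag from the last segment of the link's URL, so the sketch puts the encoded tag name there.

```python
from html import escape
from urllib.parse import quote


def tag_link(tag: str, base_url: str = "/tag/") -> str:
    """Build a tag hyperlink whose last URL segment is the tag itself."""
    href = base_url + quote(tag, safe="")  # percent-encode spaces, slashes, etc.
    return f'<a href="{escape(href)}" rel="tag">{escape(tag)}</a>'


print(tag_link("interest rates"))
# <a href="/tag/interest%20rates" rel="tag">interest rates</a>
```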

 


Vincent Maher

  • the short bio
    Vincent Maher is the portfolio manager for social media at Vodacom, South Africa's largest mobile telecommunications company. His flagship product is The Grid, a fast-growing location-based social network and instant messaging platform. Previously he was the strategist at the Mail & Guardian Online and co-founder of Amatomu.com, the South African blog aggregator and analytics system. Before that he was Director of the New Media Lab at the Rhodes University School of Journalism & Media Studies, the managing director of Digital Commerce and a multimedia director at VWV Interactive.

    He has worked in the online media industry since 1996, has presented papers at many international conferences and specializes in profitable innovation in emerging markets.

© Copyright Vincent Maher. All rights reserved.