Dealing with legacy content

The biggest problem faced by large organisations with numerous content providers is legacy content. How do you deal with ageing content on a website with little in the way of central control?

At Headscape we work with a lot of organisations who have content heavy websites updated by large numbers of content providers. In many cases these sites have little in the way of central editorial control and so quickly become bloated with huge amounts of legacy content.

This amount of content creates some serious problems.

The problem with legacy content

With a lot of different people adding content to the website but few considering whether old content needs to be removed, it is not unusual for some sites to have hundreds of thousands of pages. This creates two distinct problems: the ‘needle in a haystack’ scenario and out-of-date content.

A needle in a haystack

With so much content on the website it becomes increasingly hard for users to find what they are looking for. Navigation becomes verbose and difficult to use. Searches return so many results that the chances of a user finding what they want are significantly reduced. In short, users are left trying to find the proverbial needle in a haystack.

Needle in a haystack

Image provided courtesy of Shutterstock (Timothy Boomer)

Out of date content

With so much content on the website it is hard to ensure that everything is up to date. Old news stories and event listings long since past are only the tip of the iceberg. There is also content that is no longer accurate or now presents the organisation in the wrong light.

Although in theory each content provider should be responsible for ensuring that their own content is up-to-date, this simply doesn’t work in practice. People leave, are too busy or simply forget to check the relevance of content regularly.

In an ideal world there would be a team of central editors checking pages on a regular basis to ensure the content is still relevant. However, there are rarely the resources to do so. Even when there is a central editorial team, they are normally too busy checking new content to worry about what is already online.

Website showing out of date event

The other problem central editorial teams face is that when they suggest removing content they encounter political objections. Many content providers are defensive about their content even if they do not maintain it properly. They don’t like the idea of others telling them what they can and cannot have online.

The solution proposed by many content strategists would be a complete audit of the site. However, this involves checking every single page, which is just not practical in most cases. It also doesn’t solve the problem of politics. What is required is an automated solution.

An automated solution

An automated solution is good for two reasons. First, it doesn’t require anybody manually checking all of the pages. Second, it doesn’t require one person telling another that their content is going to be taken down. The whole thing just happens. People are much more likely to agree to an automated policy for content control than they are to being singled out as somebody who hasn’t maintained their content properly.

So how would this automated approach work in practice?

Automated review points

Essentially, a review of a particular webpage would occur when certain criteria are met. This review could happen automatically or manually depending on your preference. However, in either case it requires that your content management system be able to identify pages that have reached a certain age (or a certain time since they were last reviewed). In most cases this is something that already exists in a CMS or could easily be added.
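As a rough sketch, identifying pages due for review could look something like this. The page records and field names here are hypothetical, not taken from any particular CMS:

```python
from datetime import datetime, timedelta

# Hypothetical page records as a CMS might store them; the field
# names ("url", "last_reviewed") are illustrative only.
pages = [
    {"url": "/about", "last_reviewed": datetime(2024, 1, 10)},
    {"url": "/events/2019-dinner", "last_reviewed": datetime(2019, 5, 2)},
]

REVIEW_INTERVAL = timedelta(days=365)  # review anything over a year old


def pages_due_for_review(pages, now=None):
    """Return pages whose last review is older than the review interval."""
    now = now or datetime.now()
    return [p for p in pages if now - p["last_reviewed"] > REVIEW_INTERVAL]
```

The interval is the policy knob: a news-heavy site might shorten it to months, while a reference site could stretch it to several years.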

An alternative to time-based review points would be traffic-based ones. These are designed to remove content that is not really used, rather than content that is out of date. This review point would be triggered if the traffic to a page falls below a certain threshold over a given period. That would indicate that the page is of little interest and is simply making it more difficult for the majority of people to find what they are after.
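A traffic-based trigger might be sketched like this, assuming monthly view counts pulled from an analytics tool. The figures and the threshold are purely illustrative:

```python
# Hypothetical monthly page-view counts, as might be pulled from an
# analytics API. The URLs and numbers are made up for illustration.
monthly_views = {
    "/popular-guide": [1200, 1150, 1300],
    "/old-factsheet": [4, 2, 1],
}

TRAFFIC_THRESHOLD = 10  # average monthly views below this triggers a review


def low_traffic_pages(views, threshold=TRAFFIC_THRESHOLD):
    """Return URLs whose average monthly traffic falls below the threshold."""
    return [url for url, counts in views.items()
            if sum(counts) / len(counts) < threshold]
```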

Image of the word policy being highlighted with pen

Image provided courtesy of Shutterstock (Aaron Amat)

This is a lesson Microsoft had to learn with its support pages. They had support pages for every conceivable issue. However, instead of helping users most of this content just cluttered up the site and made it harder for users to find what they really wanted. In the end they removed less frequented pages and their customer satisfaction shot up.

How often you choose to review pages, and how low you set the traffic trigger, is entirely up to you. This will depend on how often your site and organisation change and how much you want to ask of your content providers.

When a page is identified for review an email is sent out to the owner of this page (either manually or automatically) asking them to check the page. Ideally this should simply involve the content provider logging into the CMS and editing the page in question. A simple check box saying that the page is up-to-date is all that is required. If that is not possible a reply by email saying that the page is up-to-date would be just as good.
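The notification step could be sketched like this; the wording, addresses and helper name are my own illustration rather than any particular system’s API:

```python
from email.message import EmailMessage

def build_review_email(owner_email, page_url, deadline):
    """Build a review-request email for a page owner (illustrative wording)."""
    msg = EmailMessage()
    msg["To"] = owner_email
    msg["Subject"] = f"Please review {page_url}"
    msg.set_content(
        f"The page {page_url} is due for review. Please log in to the CMS "
        f"and tick the 'this page is up-to-date' box before {deadline}, "
        "or the page will be marked for cleanup."
    )
    return msg
```

Sending it via your mail server (e.g. with `smtplib`) is left out; the point is that the message states both the action required and the consequence of inaction.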

Sample email

If the content provider fails to identify the page as up-to-date within a set time period, this triggers a cleanup event (see below). Notice the default here. On most websites the defaults are organised so that if the content provider does nothing, the content remains online. This approach turns that on its head: no action leads to content being marked for cleanup.
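The flipped default can be expressed as a simple decision rule: unless the owner actively confirms the page, inaction escalates it to cleanup. This is a minimal sketch with hypothetical field names:

```python
from datetime import datetime, timedelta

REVIEW_GRACE_PERIOD = timedelta(days=30)  # illustrative grace period


def next_state(page, now):
    """Decide a page's fate after a review request.

    Inaction is the cleanup trigger: only an active confirmation
    keeps the page live once the grace period has passed.
    """
    if page.get("confirmed_up_to_date"):
        return "live"
    if now - page["review_requested"] > REVIEW_GRACE_PERIOD:
        return "cleanup"
    return "awaiting_review"
```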

What happens when a cleanup is triggered?

How you choose to handle the cleanup of webpages is up to you. However, here is my recommended process:

Mark the page as being old content

The first step would be to mark the content as old and potentially out of date. This can be done by automatically inserting a banner at the head of the main content telling the user that this content is potentially out of date. Below is an example of how this might look.
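A crude sketch of inserting such a banner, assuming the page template wraps its content in a `<main>` element (a real CMS would do this in its templating layer rather than by string manipulation):

```python
# Illustrative banner markup; the class name is an assumption.
BANNER = ('<div class="legacy-warning">This page has not been reviewed '
          'recently and may be out of date.</div>')


def inject_banner(html):
    """Insert the warning banner at the head of the main content.

    Naive string insertion for illustration only: it assumes the
    template contains a literal <main> tag.
    """
    marker = "<main>"
    if marker in html:
        return html.replace(marker, marker + BANNER, 1)
    return BANNER + html  # fall back to prepending the banner
```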

Example notification banner

You might wish to also send an email update to the content owner of that page saying that the page has been marked as out of date.

Remove the page from the site’s navigation

If the content provider still hasn’t checked the page after a set period, you might then choose to trigger a further event that removes the page from the navigational structure of the site. This will reduce the clutter users need to navigate through to find the page they want. However, those who really want to access these pages can still find them via search.

Remove the page from the search results

Of course there is also the option to prevent pages from being returned in search results. It can be hard to find the right page when searching a large site simply because of the amount of content being returned. If a piece of content is out of date, it makes sense not to return it in the search results.
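Excluding flagged pages from search could be as simple as filtering them out of whatever feed your search index consumes. This sketch assumes a hypothetical `legacy` flag on each page record:

```python
def searchable_pages(pages):
    """Filter legacy-flagged pages out of the search index feed.

    The "legacy" flag is an assumed field set by the cleanup process;
    pages without it (or with it set to False) remain searchable.
    """
    return [p for p in pages if not p.get("legacy")]
```

For an external crawler such as Google, the equivalent step would be emitting a `noindex` robots meta tag on the flagged pages.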

The Dell search results page showing 22000 results

This effectively orphans the page but keeps it online. You may wonder what the point of this is. Surely you would be better off deleting the page entirely?

Delete the page altogether

There are mixed opinions about deleting content entirely. On the surface it seems like the most logical thing to do. If content is horribly out of date or is rarely visited what is the point of it being online?

As I see it there is no harm in keeping it online if it is clearly labelled as out of date and it no longer prevents users from finding content they really want. However, removing it can be damaging.

For a start, there may be third-party links to that page, not to mention hard-coded links within your own website. The last thing you want to present a user with is a ‘page not found’ error.

The only time I would recommend removing a page entirely is when the user can be automatically redirected to an alternative page that serves their needs better.
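A minimal sketch of that behaviour: consult a redirect map for deleted pages before falling back to a 404. The URLs here are illustrative:

```python
# Hypothetical map of deleted pages to their nearest live equivalent.
redirects = {
    "/events/2019-dinner": "/events",
}


def resolve(path):
    """Return (status, location) for a requested path.

    Deleted pages with a mapped alternative get a permanent redirect;
    anything else unknown falls through to a 404.
    """
    if path in redirects:
        return 301, redirects[path]
    return 404, None
```

Using a 301 (permanent) redirect also tells search engines to transfer the old page’s standing to its replacement, rather than simply dropping it.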


I am not suggesting that this approach is perfect. There is nothing stopping a content provider just checking the ‘this page is up-to-date’ box without properly reviewing the content. However, it does put the onus on the content provider to take action. This should automatically remove huge amounts of content from the site without your having to battle with each content provider individually.

  • I was reading through the article waiting to see if you covered the potential 404 issue and got to it!

    You could also suggest presenting a 404 (with subsequent meta refresh) on removed pages to direct the user up a tier, perhaps back to a category level to find something else which might appeal.

    From an SEO point of view, removing content would not be considered great, but the user must come first. I have experienced Microsoft’s legacy content before now!

  • Absolutely. Don’t delete, archive! This is something Gmail should have taught everyone. Deleting pages breaks links and reduces the historical worth of your material.

    Take the example of the dinner page: yes, people won’t book for a dinner years later, but that page still has value: it provides a source for people inside and outside the organisation to find out what went on before. The person in the company running next year’s dinner needs to find out what time it finished? They check the website.

    And if someone arrives at that page from a link or from Google, a friendly banner at the top saying something like “This page is about an event that happened in the past: this year’s dinner will be on such-and-such a day – see this page for more details” will be more useful than a generic 404.

    I put up an example recently on my blog of the value of “legacy” content: a page from a Channel 4 microsite about a fairly worthless TV show. Because they short-sightedly decided to delete the page (for what? To save a few kilobytes?) we lose a very small data point from the historical record. See

    Take a look at the British Museum site: once a special exhibition is over, they no longer provide a way of finding that content. This beggars belief: if people are willing to pay money to walk round an exhibition of ancient Egyptian mummies one week, why would they not want to be able to read about them on the website after the exhibition has finished? The website is being treated as ephemera, when in fact it is probably the most useful public facing thing the museum has – more people are able to visit the website than could ever walk around the museum.

  • Thanks for all the great information and the good explanation…

    Many regards from Germany,

    • Great article Paul, we are currently struggling with similar issues and your points are well taken.
      The idea of keeping info public in an archived state rather than taking it down is good in theory, but in some cases the risk outweighs the benefit. Example: we produce advice as factsheets for farmers. I don’t know that a 10-year-old factsheet giving advice on the application of a chemical that has now been deemed harmful or deregistered is good archived content. The real risk to someone’s livelihood of presenting out-of-date info, even with a disclaimer, is too high.
      PDFs are also a bit tricky. I love the idea of adding a banner to old content (automatically), but if you have old PDF documents, which many large organisations do, you can’t add those automatically. Luckily PDFs already have an intrinsic “point in time” feel to them, so (I believe) people are a bit more likely to check the date on the PDF before using them.

  • Ryan Griffith

    What a good read Paul. Now you have me thinking how to work some web services magic in Cascade Server to run some automated auditing on our content. :)

    I like how this coincides with one of your older posts about dealing with dated content. I liked the concept of removing pages from navigation and un-publishing to see if anyone even notices. Like you mentioned though, the problem is not really the removal, it’s who should be the one to determine what to do with the content. I’ve been dealing with content that is in that gray area of keep or remove, it’s a tough one.

  • me

    The picture of the orange highlighter after “Automated Review Points” says SEXCHANGE above the highlighter.


  • m

    „There is nothing stopping a content provider just checking the ‘this page is up-to-date’ box without properly reviewing the content.“

    that is most likely what is going to happen since that is the standard reaction to automated mails and notifications. 980% chance that these are coming when you have other stuff to do.

    also, long comments go unread that’s why i stop here :)

    • The most likely reaction in my experience is to ignore it. If they do that the content will be unpublished. You could always put measures in place to prevent this of course. Even something as simple as the wording of the checkbox might help e.g. “I (insert name) confirm that all of the content contained on this page is both up-to-date and reflects [insert organisation] in a good light” This kind of legal phrasing will make people take it more seriously.

  • m

    i mean 90%. please correct. thanks.

  • m

    “The most likely reaction in my experience is to ignore it. If they do that the content will be unpublished.”

    o.k. must have overlooked this.

    “You could always put measures in place to prevent this of course. Even something as simple as the wording of the checkbox might help e.g. “I (insert name) confirm that all of the content contained on this page is both up-to-date and reflects [insert organisation] in a good light” This kind of legal phrasing will make people take it more seriously.”

    that is true.

    however, the owner needs to really take care of his pages and all of their content which in my experience is – unfortunately – often enough not the case. true for most ‘small to middle sized’ clients. here’s hope that this will change :)

  • As much as how social media has evolved, do you think that people still read legacy post or contents?

    I somehow doubt that only the latest and newest content should ever be posted on FB or tweeted. Legacy content was once useful content and might nonetheless still be applicable.

  • Tom,
    Keeping content up so people in the organisation can use it for reference just seems to put their needs ahead of those of the site’s target audience.

    There’s any number of ways the information you mention could be retrieved without having pages live on the web – through the CMS, through shared documents, through appointments in a calendar. Why should site users have to wade through out-of-date content even if it is clearly labelled? It’s still a chore to open pages from a search, discover they refer to years ago, close and go hunting again.

    More broadly, in my experience, however you structure it, lazy and uncommitted content ‘owners’ will find a way to blame someone else if they don’t keep their content up to date and it is retired as a result.

    You can produce the policy, show them the email that went to them to remind them, demonstrate the page hasn’t been updated for however long and the worst offenders will reply: I did reply to that email/ I never received it/ I edited the page with acres of fabulous new content only yesterday, for sure, and you must have lost it, what have you done, the page was great and you’ve ruined it by somehow switching it back to old content etc etc.

    Jaded, I know…

  • I am new to your site and just discovered this article about content.  My question is, do you see a need to dig up old legacy files in DTP such as Quark, InDesign and Microsoft Publisher?  If so, how would you go about doing something like this? 

    • Anonymous

      I’m not entirely sure what you mean Patrick. My article is referring to content already online, in which case there would be no need to go back to the source.