The Duplicate Content Problem

By | July 6, 2008

Duplicate content is one of the banes of the Internet. If you publish a blog or website and your material is generally good, there’s a high likelihood that someone, somewhere, will copy what you write and paste it onto their own site. First of all, let’s differentiate between duplicate content and syndicated content. The two are similar, yet different.

Simply put, duplicate content is content that substantially matches content found elsewhere, while syndicated content is content that the originator has permitted others to reproduce, most often via an RSS feed, sometimes with certain rules to be followed. Make sure you know the difference. The originator of the content needs to give explicit permission for it to be reproduced. By default, you should regard everything as copyrighted. Reproducing material is permissible only under a few circumstances, such as when the material is:

  • Public domain material
  • Government material
  • GNU licensed
  • Creative Commons licensed (attribution to the original author required)
  • Explicitly syndicated content (permission given by the originator)
  • In a few cases, applicable under Fair Use

According to Google, duplicate content can be both external and internal. That is, if Google detects the same content on other pages of your own site, that counts as duplicate content, just as the same content found on other sites does. Last month, Google tried to clear the air again about duplicate content with this post. But it left many people skeptical about whether Google is doing anything effective to battle copycats. Duplicate content only clogs up the Web with useless junk, and is regarded as spam. Yet the issue is not black and white. News reports and music lyrics, for example, can hardly be regarded as duplicate content, can they? It would be silly to apply the same rules to all kinds of content.
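To make "the same content" a little more concrete, here is a minimal Python sketch of one common way to measure how similar two texts are: split each into overlapping three-word "shingles" and compare the sets with the Jaccard index. This is only an illustration of the general idea, not a description of how Google actually detects duplicates; the function names and the choice of three-word shingles are my own assumptions.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of overlapping k-word windows ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 3) -> float:
    """Share of shingles the two texts have in common, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are trivially identical
    return len(sa & sb) / len(sa | sb)


original = "Duplicate content is one of the banes of the Internet"
near_copy = "Duplicate content is one of the banes of the Web"
unrelated = "My cat prefers to sleep on the warm laptop keyboard"

print(jaccard_similarity(original, near_copy))
print(jaccard_similarity(original, unrelated))
```

A score near 1.0 suggests a near copy, and a score near 0.0 suggests unrelated text. Where to draw the line in between is a judgment call, which is part of why the problem is not black and white.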

For WordPress blogs, duplicate content has always been an issue, because WordPress by default lists the same content on the archive and category pages. But the problem is easily fixed by telling Googlebot not to index those pages. If you haven’t got it yet, go get the Duplicate Content Cure plugin, which should solve the problem (thanks to Badi Jones).
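The usual way to tell a search engine not to index a page is a robots meta tag with `noindex` in the page head. The Python sketch below shows the decision logic only: listing pages get `noindex,follow`, while individual posts stay indexable. The URL patterns are assumptions about a typical WordPress permalink setup, and I am not claiming this is what the plugin does internally.

```python
import re

# Hypothetical URL patterns for listing pages that repeat post content.
# Adjust these to match your own permalink structure.
LISTING_PATTERNS = [
    re.compile(r"^/category/"),         # category archives
    re.compile(r"^/tag/"),              # tag archives
    re.compile(r"^/\d{4}/(\d{2}/)?$"),  # date archives: /2008/ or /2008/07/
    re.compile(r"^/page/\d+/?$"),       # paginated home page
]


def robots_meta(path: str) -> str:
    """Return the robots meta tag to place in the <head> for this URL path."""
    if any(pattern.search(path) for pattern in LISTING_PATTERNS):
        # Keep the page out of the index but still let the bot follow links.
        return '<meta name="robots" content="noindex,follow">'
    return '<meta name="robots" content="index,follow">'


print(robots_meta("/category/seo/"))
print(robots_meta("/2008/07/"))
print(robots_meta("/2008/07/06/the-duplicate-content-problem/"))
```

Using `follow` alongside `noindex` means the bot can still reach your individual posts through the archive pages, even though the archive pages themselves stay out of the index.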

The biggest problem, though, is the blatant copying of your pages by some parties to repost on their own websites. They do this either manually or by employing scraper bots. You can get a list of bots to exclude in your robots.txt file, but you can’t do much if the copycat manually copies your content. This is an external problem with no prevention, and maybe only half-baked cures. It is really up to the search engines to constantly improve themselves to battle duplicate content.
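As a rough illustration of excluding a bot in robots.txt, the Python sketch below uses the standard library's `urllib.robotparser` to check what a small robots.txt would allow. The bot name `ExampleScraperBot` and the example.com URL are made up for the example. Keep in mind that robots.txt is only a request: a badly behaved scraper can simply ignore it.

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt. "ExampleScraperBot" is a hypothetical bot name;
# substitute the user agents from whatever exclusion list you use.
ROBOTS_TXT = """\
User-agent: ExampleScraperBot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Check whether the sample robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


url = "http://example.com/2008/07/06/some-post/"
print(is_allowed("ExampleScraperBot", url))
print(is_allowed("Googlebot", url))
```

The named bot is denied everything, while every other crawler falls through to the `*` rule, where an empty `Disallow:` means nothing is blocked.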

I’ve had my stuff copied by others, and it’s not flattering at all, especially when the copycats never leave any attribution behind, let alone ask for permission. Illegal copying of content is very common, and constantly on the rise all over the Web. SEOmoz has an article on duplicate content that is worth a read.

Copycats only devalue the Web by their actions, because the originators of the content may no longer be so inclined to put up their hard work only to have it copied. And the problem is worsened by the fact that many copycats employ the latest SEO techniques to try to get their content ranked well in the search engines, often to the detriment of the original sites (which are often small or medium sites with good content, but poor SEO).

When quality content disappears and the Web becomes one giant mass of low-quality content, with identical twins floating around … let’s hope it doesn’t end up that way. This is still one of the major challenges of the Web. The statement from Google that it is “quite good at identifying the originators of content” is far from reassuring, as there are many honest webmasters who can attest to the constant problems they face from slick copycats – right up to this moment.

Finally, if all else fails – change your content.
