Friday, January 12, 2007

Duplicate website content and how to fix it

Here I'll present a practical way to avoid the duplicate content penalty.

When is this penalty applied?
This kind of penalty is applied by search engines such as Google when two identical versions of your site's content are detected.

How can your website become a victim of such a penalty?
Modern content management systems (CMS) and community forums offer numerous ways of managing new content, but because of their deep structure their URLs become very long, so search engines are unable to fully spider the site.
The solution for webmasters was to rewrite the old URL, so the index.php?mode=look_article&article_id=12 URL now becomes just article-12.html. As a first step this serves its purpose, but if left like this both URLs are going to get indexed. Looking through the eyes of a search engine, we'll see the same content in two instances, and of course the duplicate filter is raised:
First instance: index.php?mode=look_article&article_id=12

Second instance: article-12.html
Easy solution
The solution uses the PHP language and the Apache .htaccess file.
First off we'll rewrite our URLs so they are search-friendly. Let's assume that we have to redirect index.php?mode=look_article&article_id=... to article-....html

Create an empty .htaccess file and place the following in it. First, edit the code and fill in your website address. If you don't have a subdomain, erase the subdomain part as well.
RewriteEngine on

# Map the friendly URL article-12.html internally to the real script,
# adding rw=on so the PHP code later knows the rewritten URL was used
RewriteRule article-([0-9]+)\.html    http://www.yourwebsite/subdomain/index.php?mode=look_article&article_id=$1&rw=on

# 301-redirect direct requests for index.php to the site root
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /subdomain/index\.php\ HTTP/
RewriteRule index\.php http://www.yourwebsite/subdomain/ [R=301,L]

# 301-redirect the www host to the non-www host
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} ^www\.yourwebsite\.subdomain [NC]
RewriteRule ^(.*)$ http://yourwebsite/subdomain/$1 [R=301,L]

Explanation:
  • RewriteRule article-([0-9]+)\.html http://www.yourwebsite/subdomain/index.php?mode=look_article&article_id=$1&rw=on
    This line allows article-12.html to be loaded internally as index.php?mode=look_article&article_id=12.
    The variable &rw=on is important for the later PHP code, so don't forget to include it.
  • RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /subdomain/index\.php\ HTTP/
    RewriteRule index\.php http://www.yourwebsite/subdomain/ [R=301,L]
    These lines prevent index.php from being treated as a separate page (which would dilute your website's PR) and transfer all the PR from index.php to your domain via a 301 redirect.
  • RewriteCond %{HTTP_HOST} .
    RewriteCond %{HTTP_HOST} ^www\.yourwebsite\.subdomain [NC]
    RewriteRule ^(.*)$ http://yourwebsite/subdomain/$1 [R=301,L]
    This avoids duplicate URLs such as the www and non-www versions and transfers all requests and PR to the non-www site. You can verify the redirects with the sketch below.
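
To confirm the rules behave as expected, here is a minimal verification sketch (the host and path are only the placeholders used above, not real addresses). It requests the www version of a rewritten page and prints the status line the server answers with:

<?php
// Quick check: the www host should answer with a 301 pointing to the non-www URL
$headers = get_headers('http://www.yourwebsite/subdomain/article-12.html');
echo $headers[0]; // expected: "HTTP/1.1 301 Moved Permanently"
?>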

Then create a file header.php and include it in your website before all other files:

Put there:

<?php
// index,follow only when the URL was generated by the rewrite rule (rw=on)
$rw = isset($_GET['rw']) ? $_GET['rw'] : '';
if ($rw == "on") { echo "<meta content=\"index,follow\" name=\"robots\" />"; }
else { echo "<meta content=\"noindex,nofollow\" name=\"robots\" />"; }
?>

This tells the search engine to index only the pages that have the rw flag set to on - that is, the rewritten article-12.html style pages.
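
As a minimal usage sketch (the surrounding page markup is just an assumed example, not part of your actual CMS), a page would include header.php inside its head section so the robots meta tag is printed before the rest of the content:

<html>
<head>
<title>Article</title>
<?php include 'header.php'; // prints index,follow only when the URL carries rw=on ?>
</head>
<body>
<?php /* ... article output ... */ ?>
</body>
</html>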

Of course, if you have access to the robots.txt file in your root domain, you can simply disallow the look_article script there and you are done:
User-agent: *

Disallow: /look_article.php



Notes: For those using a CMS - check whether your pages are still accessible using different parameters in the URL.
Example: you've deleted an article with id=17, but the empty template is still accessible and returns a 200 OK header status code - this will surely be recognized as thin content by Google.
Solution:
1. Find those empty pages and give them a 404 Not Found header status code (see the sketch after this list):

header("Status: 404 Not Found");


2. Create an error404.html file explaining that the user is trying to access a non-existent page.

3. Then add the custom 404 error page in your .htaccess file:
ErrorDocument 404 /your_domain_name/error404.html

This way the search engine spider won't penalize your template for displaying empty information - it will now see those pages as 404 not-found documents.
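
Here is the sketch referred to above. It assumes a hypothetical get_article() helper that returns null for a deleted id, and shows how an article script could send the 404 status together with the custom error page instead of rendering an empty template:

<?php
// Sketch only - get_article() is a hypothetical data-access helper
$id = isset($_GET['article_id']) ? (int) $_GET['article_id'] : 0;
$article = get_article($id);

if ($article === null) {
    header("HTTP/1.1 404 Not Found"); // real 404 instead of an empty 200 OK template
    include 'error404.html';          // the same page set via ErrorDocument in .htaccess
    exit;
}
?>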

The next step involves cleaning up already indexed duplicate website content in order to regain the search engine's trust.

Above is a sample screenshot taken from the Google Search Console Keywords report. You may ask why we need it when we can use Firefox's integrated 'Show keyword density' function.
Well, one benefit is that this report shows specific keyword significance across your pages. Let me explain what this means:
Suppose that you are optimizing content for the keyword 'cars'. It's normal practice to repeat 'cars' 2-3 times, style it in bold, etc. Everything's fine as long as you do it naturally. The moment you overstuff your page with this keyword, it will get penalized and lose its current ranking in the Google SERPs, so you have to be careful with such repetitions.

Moreover, in the report you can see the overall website keyword significance. Because Google likes thematic websites, it is really important for these keywords to reflect your website's purpose or theme. Otherwise you're just targeting the wrong visitors and shouldn't be puzzled by a high abandonment rate.

But enough with the theory - now let's discuss how you can fix things up:

Check every individual webpage's keyword density online via Webconfs and correct (reduce) words that are used more than 2% of the time. Again, this percentage depends on the keyword competition in your niche, so tolerable levels can vary up and down.
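
If you want a rough local check instead of the online tool, here is a small sketch (my own helper, not part of any CMS) that estimates how often a single keyword appears relative to the total word count of a page:

<?php
// Rough keyword-density estimate: strip markup, count words, return the keyword's share in percent
function keyword_density($html, $keyword) {
    $words = str_word_count(strtolower(strip_tags($html)), 1); // array of all words
    if (count($words) === 0) { return 0.0; }
    $hits = count(array_keys($words, strtolower($keyword)));
    return 100 * $hits / count($words);
}

$density = keyword_density(file_get_contents('article-12.html'), 'cars');
echo round($density, 2) . '%'; // anything far above ~2% is worth reducing
?>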

Add the 'canonical' tag to all your website pages:
<link rel="canonical" href="http://www.example.com/your_preferred_webpage_url.html" />
(and make sure to specify the URL that you really prefer!). This will tell the search engine which URL is your legitimate webpage. More info: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
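
For the article pages produced by the rewrite rules above, the canonical tag can be printed dynamically. This is only a sketch that assumes the article id is available in $_GET and that the rewritten article-ID.html form is your preferred URL:

<?php
// Emit the canonical link pointing at the preferred, rewritten URL form
$article_id = isset($_GET['article_id']) ? (int) $_GET['article_id'] : 0;
$canonical  = 'http://www.example.com/article-' . $article_id . '.html';
echo '<link rel="canonical" href="' . htmlspecialchars($canonical) . '" />';
?>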



Blogger users can add the canonical tag with the following code in the head section of the Template:
<b:if cond='data:blog.pageType == "item"'>
<link expr:href='data:blog.url' rel='canonical'/>
</b:if>
(it will remove the parameters appended at the end of the URL such as http://nevyan.blogspot.com/2016/12/test.html?showComment=1242753180000
and specify the original authority page: http://nevyan.blogspot.com/2016/12/test.html )

Next, to prevent duplicate references of your archive (e.g. .../2009_01_01_archive.html) and label pages (e.g. /search/label/...) from getting indexed, just add:
<b:if cond='data:blog.pageType == "archive"'>
<meta content='noindex,follow' name='robots'/>
</b:if>
<b:if cond='data:blog.pageType == "index"'>
<b:if cond='data:blog.url != data:blog.homepageUrl'>
<meta content='noindex,follow' name='robots'/>
</b:if>
</b:if>

To prevent indexing of the mobile versions (duplicates) of the original pages:
    <b:if cond="data:blog.isMobile">
<meta content='noindex,nofollow' name='robots'/>
</b:if>

And a working solution blocking even the /search/ label pages from indexing, allowing only the homepage and posts to be indexed:
    <b:if cond='data:blog.pageType == "index" and data:blog.url != data:blog.homepageUrl'>
<meta content='noindex,follow' name='robots'/>
</b:if>
