Duplicate and Near Duplicate Content

Every Search Engine Optimization effort should include a programme for inspecting and altering website copy. Think about it: a search engine lists sites based purely upon relevancy. This means your site should give the search engine what it wants, in the way it wants it. One of the main ways a search engine judges relevancy is by analysing the content of a page to discover what that page should be relevant for.

Google™ and Content Storage and Analysis

All search engines would find it difficult, if not impossible, to store a complete representation of every page of content that they know of. As a solution, they use a reference system for the content instead. In Google this is done through the hashing of content. In simple terms, a hash is a reference: a short value that can be used many times to represent a specific piece of content. This helps them to save on memory and makes the analysis of data more efficient.

Think of the Google index as a data set that contains a predetermined set of hash values. Each hash value contains specific references to words and sections of web content. This means that hash values will reference each other.

Example (these are not the true hash values used; they are merely exemplars):

  • Sentence: 'Duplicate Content is stolen content.' would be translated into hash values.
  • Word hashes: '#54a #Fr674a ... #3a5Bee #ju2hgA' ('...' represents a stop word that is unlikely to be stored). These could then be translated into a single hash value that represents the composite values.
  • Sentence hash: '#bc564357ba' is the hash representation of the sentence.

NB: It has been suggested that Google stores information as hash values in base 62, i.e. hash values drawn from the set (a-z, A-Z, 0-9), where 26 + 26 + 10 = 62. To our knowledge this is an idea rather than something Google has disclosed.
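To make the base 62 idea concrete, here is a minimal sketch of how a value could be written using 62 symbols. It assumes the alphabet a-z, A-Z, 0-9 in that order; the function names `to_base62` and `word_hash`, the use of SHA-1 and the six-character length are our own illustrative choices and say nothing about what Google actually does.

```python
import hashlib
import string

# Assumed alphabet: a-z, A-Z, 0-9 (26 + 26 + 10 = 62 symbols).
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits


def to_base62(number: int) -> str:
    """Encode a non-negative integer using the 62-symbol alphabet."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number > 0:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def word_hash(word: str, length: int = 6) -> str:
    """Hypothetical word hash: SHA-1 digest, base-62 encoded, then truncated."""
    digest = hashlib.sha1(word.lower().encode("utf-8")).digest()
    return to_base62(int.from_bytes(digest, "big"))[:length]
```

Calling `word_hash("Search")` and `word_hash("search")` gives the same short string, because the sketch lowercases the word before hashing it.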

This means that Google will tear your web documents apart and reduce your content to hash values. Every word can be hashed, and the word hashes combined into a larger, more representative hash for the sentence. Sentence hashes can then be converted into a hash for a paragraph, and paragraph hashes into a hash for the page. By now your page of content has been reduced to a single value. It is our thought that this hash value may be analogous to the document IDs used by Google.

Google now has a simply referenced piece of information. If, when they analyse your page, they reduce it to a value that is exactly the same as that of a pre-existing piece of content, whether that be an entire page or a smaller unit such as a paragraph tag, it indicates that you have placed duplicate content upon your site. The same is true in reverse: if a competitor steals your content, they should not gain any benefit from it.
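The word, sentence, paragraph and page roll-up described above can be sketched as follows. This is only an illustration under our own assumptions: SHA-256 stands in for whatever hash is really used, the stop word list is a tiny made-up sample, and the sentence splitting is deliberately crude.

```python
import hashlib
import re
from typing import List

STOP_WORDS = {"is", "a", "an", "the", "of", "and", "to"}  # illustrative only


def h(text: str) -> str:
    """Short hex digest; SHA-256 is a stand-in for the real hash function."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:10]


def sentence_hash(sentence: str) -> str:
    """Hash each non-stop word, then hash the sequence of word hashes."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    word_hashes = [h(w) for w in words if w not in STOP_WORDS]
    return h(" ".join(word_hashes))


def paragraph_hash(paragraph: str) -> str:
    """Combine the hashes of the sentences in a paragraph into one hash."""
    sentences = [s for s in re.split(r"[.!?]+", paragraph) if s.strip()]
    return h(" ".join(sentence_hash(s) for s in sentences))


def page_hash(paragraphs: List[str]) -> str:
    """Combine the paragraph hashes into a single value for the page."""
    return h(" ".join(paragraph_hash(p) for p in paragraphs))
```

Because the stop word 'is' is dropped, `sentence_hash("Duplicate Content is stolen content.")` equals `sentence_hash("duplicate content stolen content")`, while changing any stored word changes the hash at every level above it.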
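A minimal sketch of this exact-match check follows. The `ContentIndex` class and `fingerprint` function are hypothetical names of our own, and SHA-256 over lowercased, whitespace-collapsed text is an assumption made purely for illustration.

```python
import hashlib
from typing import Dict, Optional


def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class ContentIndex:
    """Toy index mapping a content fingerprint to the first URL seen with it."""

    def __init__(self) -> None:
        self._first_seen: Dict[str, str] = {}

    def add(self, url: str, text: str) -> Optional[str]:
        """Record the text; return the earlier URL if it is an exact duplicate."""
        key = fingerprint(text)
        original = self._first_seen.get(key)
        if original is None:
            self._first_seen[key] = url
            return None
        if original == url:
            return None  # the same page crawled again, not a copy
        return original
```

In this sketch, whichever page is indexed first is treated as the original, so a later copy on another URL is the one flagged as the duplicate.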

Google and Near Duplicate Content

By referencing sections and pages of content against a semantically defined copy of that content, it would be possible to compare content to see whether it is near duplicate content, that is, content that is not exactly the same but is similar enough to be considered as such.

Every word has semantic relations, that is, other words exist that are closely related to it. This information is stored in what are known as the 'terminal nodes' of the Google Latent Semantic Index. Each terminal node represents a word and contains the related words and the strength of the relevancy between them.

Example :

  • Word: 'Search'
  • Hash representation of the word 'Search' = '#66554s'
  • Hash representation for the semantic terminal node for word 'Search' = '#67gfvh7g'
  • '#67gfvh7g' = a set of words including 'search, searching, listings, database, finder'

So by storing a semantic representation of the text it is possible to compare the similarity of text in sentences, paragraphs and pages more easily than was possible before the advent of the semantic index.
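As a rough sketch of such a comparison, the code below replaces each word with the ID of its semantic node and then measures how much two texts overlap. The node table reuses the exemplar ID from the example above, and Jaccard similarity is a measure we have chosen for illustration; we are not suggesting it is the measure Google uses.

```python
import re
from typing import Dict, Set

# Hypothetical semantic table: each related word maps to one shared node ID.
SEMANTIC_NODES: Dict[str, str] = {
    word: "#67gfvh7g"
    for word in ("search", "searching", "listings", "database", "finder")
}


def semantic_signature(text: str) -> Set[str]:
    """Replace each word with its node ID when known, else keep the word."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {SEMANTIC_NODES.get(word, word) for word in words}


def similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity of the two semantic signatures (0.0 to 1.0)."""
    a, b = semantic_signature(text_a), semantic_signature(text_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

With this table, 'a site search tool' and 'a site finder tool' score 1.0 even though the wording differs, which is the sense in which rewritten copy can still count as near duplicate. A real system would also need to decide on a threshold above which two texts are treated as the same.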

Why does Google care about duplicate and near duplicate content?

Search engines try their best to return relevant results for every individual query. Content is a major factor when assessing the relevancy of a web document. If two documents have the same or highly similar content, they will have the same content-based relevancy score. This would spoil the results: all the high-ranking web documents could have the same content, offering search engine users little of use after the first result.

What will Google do to a site for having duplicate or near duplicate content?

Google will definitely apply a penalty to a web document whose content duplicates, or nearly duplicates, that of another web document. This content-based penalty is likely to be a points-score penalty.

Conclusion

Stay away from duplicate content and near duplicate content. By all means get an SEO copywriter to edit, refine or create content for your site, but remember that even the best SEO copywriter may not have expertise in your business area. Often it is best to supply them with a draft of the page and ask them to edit it. The best solution is to write your own content: you will know that it is unique and original, and so you will not encounter the penalties associated with duplicate and near duplicate content.



Creative Commons License
This work is licensed under a Creative Commons Attribution-No Derivative Works 2.5 License.