Noise Filtering and Block-Level Segmentation of Web Pages

Search engines have been applying block-level methods to the semantic interpretation of pages, both on individual webpages and across groups of webpages within a website, for quite some time now. This shouldn't surprise anyone familiar with object-oriented programming.

However, many people in the search industry have largely ignored this fact. They still assume that keyword-rich boilerplate navigation, side link blocks, and elements that change only slightly from page to page, such as database-driven title and headline tags generated on the fly, make a difference in comparative rankings.

Even the, well, lamers such as Live Search (Bing's predecessor) have been applying noise filtering and page segmentation in their crawlers since as early as 2008.

What this means is that they identify certain parts of an individual webpage as being more important and relevant for contextual analysis than others. Site-wide blocks, such as top-level navigation and footer content, would likely be classified as static and unchanging, and consequently worthless for interpreting a webpage's meaning and for the ranking methodologies applied to it.

This could be a visual example of their crawl and what specific blocks they fetch and store:

 


 
Essentially, search engines love fresh data that carries unique value. If you've ever explored Google Patents you'll know how often the word "fresh" is repeated throughout them.

Given that their primary functions are to bring order to the web and to display relevant search results, and given the fact that their storage and indexing spaces are limited, they only want to grab what they determine to be unique within the web.

Careful thought about body copy and, for the purposes of interlinking, contextually embedded links is going to be essential, since a search engine bot, even on a deep crawl, will ultimately walk away with only the unique information contained on your webpage.

Just think of how you interpret web pages. Do you ever pay attention to the unchanging elements of a webpage when browsing a website? It's doubtful that a search engine does either. Every boilerplate element (that is, unchanging content repeated across many pages) carries a weight.

How a search engine weighs these elements on a webpage depends on how often they are repeated. Microsoft defined them as blocks, and each block was literally a piece of content that carried with it a unique weight (such as an advertisement, a link list, or a piece of content).
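To make the idea of repetition-based weighting concrete, here is a minimal sketch. It is not Microsoft's method or any search engine's actual formula; the function name `block_weights`, the linear weighting, and the sample pages are all assumptions made for illustration. It simply gives a block a lower weight the more pages it appears on.

```python
from collections import Counter


def block_weights(pages):
    """Weight each block by how rarely it repeats across a site.

    `pages` maps a URL to the list of text blocks found on that page.
    Returns a dict mapping block text to a weight between 0.0 and 1.0.
    The linear formula is an illustrative assumption, not a known
    search engine algorithm.
    """
    total = len(pages)
    # Count how many distinct pages each block appears on.
    seen_on = Counter()
    for blocks in pages.values():
        for block in set(blocks):
            seen_on[block] += 1
    if total <= 1:
        # With a single page there is nothing to compare against.
        return {block: 1.0 for block in seen_on}
    # Appears on one page -> 1.0; appears on every page -> 0.0.
    return {
        block: (total - count) / (total - 1)
        for block, count in seen_on.items()
    }


# Hypothetical three-page site.
sample_pages = {
    "/": ["Home | About | Contact", "Welcome to our directory"],
    "/texas/": ["Home | About | Contact", "Popular cities", "Lawyers in Texas"],
    "/ohio/": ["Home | About | Contact", "Popular cities", "Lawyers in Ohio"],
}
weights = block_weights(sample_pages)
```

In this sample the site-wide navigation block ends up with a weight of 0.0, the "Popular cities" block shared by two pages gets 0.5, and the copy unique to each page gets 1.0.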

Some blocks only change infrequently, while others change frequently, and others are absolutely unique to the page itself. Paying careful attention to these blocks and how you are defining them for both users and search engines is a critical and often ignored technique in ranking webpages.

Additionally, search engines use categorical hierarchy to further understand websites. Search engines are coded in object-oriented languages, and therefore at their deepest fundamental core, they simply love relational parameters, such as child and parent relationships of webpages occurring at both the link and URL path levels.

Many websites are constructed with a site architecture that silos content by category. An example of this could be a lawyer directory that allows users to search by both state and profession. The state category would lead to additional filtering, such as cities and state-level professions. The category silo would segment further, into specific occupations and categories within states (this would be an example of a merge between state and profession). It might look something like this:


Now, with a logical and well-constructed hierarchy we can use this to our advantage. Search engines might interpret a state page as being the parent of a city page simply by means of hierarchy. So you don't necessarily need to worry about stressing the relative importance of "Texas" on a "Dallas" page by means of on-site elements. Search engines aren't idiots. Trust me on this: they can pick it up naturally.
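The parent and child relationship described above can be read straight off a URL path. The sketch below is only an illustration of that idea, not a description of how any crawler actually works; the helper names `parent_url` and `ancestors` and the `example.com` URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def parent_url(url):
    """Return the URL one path level up, or None for the site root."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if not path:
        return None  # the root page has no parent
    parent_path = posixpath.dirname(path)
    if not parent_path.endswith("/"):
        parent_path += "/"
    return f"{parts.scheme}://{parts.netloc}{parent_path}"


def ancestors(url):
    """Return every ancestor of `url`, nearest parent first."""
    chain = []
    current = parent_url(url)
    while current is not None:
        chain.append(current)
        current = parent_url(current)
    return chain


# Hypothetical city page in a State -> City silo.
dallas_ancestors = ancestors("https://example.com/texas/dallas/")
```

Here the Dallas page resolves to the Texas page as its parent and the site root as its grandparent, without a single on-page mention of "Texas".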

A merge becomes somewhat complex for both SEOs and search engines. Carefully defining a path via the URL structure might involve a throwback of the user via the path (for instance, "flipping" them back to State -> Category rather than Category -> State when state-level information is selected from a category page).

In some cases this could be handled with a build-out of two paths and an added canonical tag for specific search path preference.
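As a rough sketch of the two-path approach, both URL orderings can exist for users while each page declares the same preferred URL in a `rel="canonical"` link element. The base URL, the slugs, the choice of State -> Category as the preferred order, and the helper names below are assumptions for illustration only.

```python
from html import escape

BASE = "https://example.com"  # hypothetical site


def merge_paths(state, category):
    """Return both URL orderings for a state/category merge page."""
    state_first = f"{BASE}/{state}/{category}/"
    category_first = f"{BASE}/{category}/{state}/"
    return state_first, category_first


def canonical_tag(preferred_url):
    """Build the canonical link element to place in the page <head>."""
    return f'<link rel="canonical" href="{escape(preferred_url, quote=True)}">'


# Both versions of the page would emit the same tag, pointing at the
# State -> Category path (assumed here to be the preferred one).
state_first, category_first = merge_paths("texas", "family-law")
tag = canonical_tag(state_first)
```

Whichever path the user arrives by, the search engine is told to treat the State -> Category URL as the one to index.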

Given the fact that the most recent patent relating to block-style analysis of a webpage was published in 2008 and was employed in an outdated search engine such as MSN, we can only conclude that search engines have become even more advanced and sophisticated in their crawling tactics since then.

And given the proliferation of the web graph in the past few years, they are likely filtering out even more noise than they used to.