Web Graphing and Link Models Applied to Search

The butterfly effect postulates that the flapping of a butterfly's wings in Brazil could create a meteorological disturbance that eventually sets off a tornado in Texas. The theory, like many other things in life, rests on a sensitive dependence on the initial conditions of a given situation.

Applied to marketing, we can see that very slight forces, such as the standardization of the MP3 format for digital audio compression in the early 1990s, could have led to the regeneration of the Apple brand, which followed closely behind the introduction of the iPod in 2001.

The web map of the internet, and the authority the search engines assign within it, can follow much the same pattern. On one end of the spectrum we have what TrustRank proponents call seed sites: sites with extremely high editorial value and brand recognition (examples could include NASA and Time Magazine). The links from these sites often flow to lower quality websites, which in turn link to even lower quality websites.

Web spam theory related to the TrustRank model has stated that on average it takes 4 link hops to go from a seed site to spam. However, because of the initial seed links, these spam sites can still inherit PageRank and/or trust in the eyes of the search engines, and could be scored similarly to higher quality sites in query spaces.
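To make the hop idea concrete, here is a minimal sketch in Python. It is not the actual TrustRank algorithm (which is a seed-biased variant of PageRank); it only measures how many link hops separate each page from a seed and applies an assumed decay of 0.5 per hop. The domain names, the `trust_by_hops` function, and the decay value are all hypothetical.

```python
from collections import deque

def trust_by_hops(graph, seeds, decay=0.5):
    """Breadth-first walk outward from the seed sites.

    Returns {page: (hops_from_nearest_seed, trust)}, where trust is
    decay ** hops. The decay factor is an illustrative assumption.
    """
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in graph.get(page, ()):
            if target not in hops:  # first visit is the shortest path in BFS
                hops[target] = hops[page] + 1
                queue.append(target)
    return {page: (d, decay ** d) for page, d in hops.items()}

# Hypothetical chain: a seed site that reaches spam in 4 hops.
graph = {
    "seed.example": ["news.example"],
    "news.example": ["blog.example"],
    "blog.example": ["directory.example"],
    "directory.example": ["spam.example"],
    "spam.example": [],
}
scores = trust_by_hops(graph, ["seed.example"])
print(scores["spam.example"])  # (4, 0.0625)
```

Even in this toy version the spam page ends up with a non-zero score purely because of where the chain started, which is the point of the paragraph above.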

On the other end of the spectrum, and far more prevalent in practice, are the lower end spam sites that plague the internet. These are the WPMU/Blogger/Tumblr/Squidoo/Weebly/etc. blog parasites, the automatic comments, and web 2.0 account spam. Black hat link pyramids typically link these low end properties to micro-sites on shared webhosting accounts, which in turn link to more trusted and developed White Hat websites.

[Figure: Example of a Black Hat Link Pyramid]

Because of the sheer volume of outbound links on the lower end of this model, the White Hat sites have the potential to amass absurd amounts of raw PageRank. Using this implementation, a black hat SEO can usually game a search engine into scoring his top-tiered website on the strength of the PageRank flowing into it.
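The accumulation can be illustrated with a textbook PageRank power iteration over a toy pyramid. This is a sketch under stated assumptions, not how any search engine actually scores pages: the page names, the pyramid shape (30 spam pages, 3 micro-sites, 1 top-tier site), and the 0.85 damping factor are all chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over {page: [outbound targets]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Pages with no outbound links spread their rank evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical pyramid: 30 spam pages -> 3 micro-sites -> 1 top-tier site.
links = {}
for i in range(30):
    links[f"spam{i}"] = [f"micro{i % 3}"]
for j in range(3):
    links[f"micro{j}"] = ["money"]
links["money"] = []

rank = pagerank(links)
for name in ("spam0", "micro0", "money"):
    print(name, round(rank[name], 4))
```

In this toy graph the top-tier page collects roughly 30% of all rank despite having only three direct inbound links, because every page beneath it funnels its score upward.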

Both of these examples, and especially the second, show how huge portions of the internet are highly sensitive to the initial linking conditions (for our purposes, high quality seed sites and low quality spam sites). To the best of my knowledge, this flaw has yet to be corrected by any major search engine and is a major factor in black hat linking strategy.

Even Mike Grehan's Filthy Linking Rich principle, which states that link-rich websites attract more links than link-poor websites, is highly dependent on the initial conditions of the websites in question.

One solution to this problem would be to account for how links accrue over time. Google does take into consideration the timelines of link appearances and disappearances (according to the famous Historical Data patent), which at least attempts to combat some of these problems.

I would recommend, though, that they weight this factor significantly higher in their scoring than they presently do. Implemented alone, however, this solution would present some problems for news websites (which are treated under entirely different algorithmic procedures to begin with) and for legitimate new websites.
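As a rough illustration of what looking at link timelines could mean, the sketch below counts newly seen inbound links per month and flags months that exceed a multiple of the typical month. This is purely hypothetical and not taken from the patent: the `burst_months` function, the factor of 5, and the sample dates are assumptions.

```python
from collections import Counter
from datetime import date
from statistics import median

def burst_months(first_seen, factor=5.0):
    """Flag (year, month) pairs whose new-link count exceeds
    factor * the median monthly count.

    Months with no new links are not part of the median, which is a
    simplification of this sketch.
    """
    per_month = Counter((d.year, d.month) for d in first_seen)
    typical = median(per_month.values())
    return sorted(m for m, count in per_month.items() if count > factor * typical)

# Hypothetical site: 2 new links a month for half a year, then 28 in July.
dates = [date(2010, m, 1) for m in range(1, 7) for _ in range(2)]
dates += [date(2010, 7, day) for day in range(1, 29)]

print(burst_months(dates))  # [(2010, 7)]
```

A check like this would also flag a news story that goes viral or a legitimate site launch, which is exactly the caveat raised above.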

Jon Kleinberg once proposed that if you look at how many authorities (seed sites) link to an expert website, and how many expert websites (articles) link to authorities, you could find some pretty high quality web pages. The tighter the connections between these two groups, the greater the likelihood of having an authoritative and relevant set of web pages.
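The mutual reinforcement described here resembles the hubs-and-authorities (HITS) iteration, which can be sketched in a few lines. In this sketch the "expert" pages play the role of hubs, which is my reading rather than something stated above, and the page names and iteration count are hypothetical.

```python
def hits(links, iterations=20):
    """Hubs-and-authorities iteration over {page: [outbound targets]}.

    A page's authority score is the sum of the hub scores linking to it;
    a page's hub score is the sum of the authority scores it links to.
    Both vectors are rescaled to unit length after every update.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}

        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# Hypothetical graph: two experts citing the same two authorities,
# plus an isolated page pointing at a spam page.
links = {
    "expert1": ["auth1", "auth2"],
    "expert2": ["auth1", "auth2"],
    "loner": ["spam"],
}
hub, auth = hits(links)
print(round(auth["auth1"], 3), round(auth["spam"], 3))
```

The tightly connected expert/authority cluster ends up with nearly all of the score, while the isolated spam link fades toward zero, which matches the intuition that tighter connections signal a more authoritative set of pages.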

I know that Mike Martinez has stated before that Ask incorporated this model into its algorithm (though I truthfully can't find any confirmation of it), but I've yet to see any evidence of Google doing the same. I would be willing to bet that this method of capturing expert documents would eliminate much of the unsightliness of the web as we see it today.

In any case, if the search engines and web-spam czar Matt Cutts want to get serious about countering spam on the web, they've got to come up with a better way of measuring initial conditions than they currently have. I've noticed, especially in the last couple of years, a significant deterioration of search engine results pages due to spam, which is a real shame, because I can't seem to find what I'm actually searching for on a search engine anymore.