Imagine a city. The city has many buildings, each building has many houses, and each house has many rooms. Every building is connected to the others through a road network. Suppose you want to visit Mike’s house. Knowing where he lives is not enough; you also need the way there. You would probably ask the city authorities for his address. The civic authorities, looking at their citizens’ database, will tell you the road on which his building stands. Once you have located the building, they can tell you where his home is inside it: which floor it is on, his house number, and so on.
Have you ever wondered how the city authorities know all this? The way Google indexes millions of websites is actually quite similar, except that Google looks up web addresses rather than houses.
Let’s look at how the city authorities, or in this case Google, work.
- The Simple Crawl
- Processing Links
- Robots.txt and Sitemap
- Using 404 or 410 to Remove Pages
- Removing Pages With Noindex
Every search engine has crawlers: programs that roam the internet looking for interlinked web pages. After finding each page, a crawler feeds it back into the search engine’s database.
Going back to my example, crawlers are like searchers sent out by the civic authorities to visit every road in the city and look for every building, every house in that building, and the name on each house (like Mike’s home). After finding each house, the searcher records that information in the citizens’ database.
Technically speaking, a crawl scheduling system de-duplicates pages and orders them by significance so they can be indexed later. While the crawler is on a page, it accumulates a list of all the pages that page links to. Internal links may be followed to other pages on the same site. External links are placed into a database for later.
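To make the scheduling idea concrete, here is a minimal sketch of a crawl queue. It is not how Google’s crawler is built; the `TOY_WEB` dictionary, the `crawl` function, and the use of link depth as a stand-in for “significance” are all assumptions made for illustration, and nothing here touches the real network.

```python
import heapq
from urllib.parse import urlparse

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "https://example.com/": [
        "https://example.com/about",
        "https://example.com/blog",
        "https://other.example.org/",
    ],
    "https://example.com/about": ["https://example.com/"],
    "https://example.com/blog": [
        "https://example.com/about",
        "https://partner.example.net/page",
    ],
}


def crawl(seed, web):
    """Visit internal pages in priority order; set external links aside."""
    site = urlparse(seed).netloc
    frontier = [(0, seed)]   # min-heap of (depth, url): shallower pages first
    seen = {seed}            # de-duplication: each URL is scheduled only once
    crawled = []
    external = set()         # external links are stored for later, not followed

    while frontier:
        depth, url = heapq.heappop(frontier)
        crawled.append(url)
        for link in web.get(url, []):
            if urlparse(link).netloc != site:
                external.add(link)
            elif link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (depth + 1, link))

    return crawled, sorted(external)


crawled, external = crawl("https://example.com/", TOY_WEB)
```

In this toy run the home page is visited first, then the two internal pages it links to, while the two external URLs end up in the “for later” list. A real scheduler would use a much richer priority signal than depth.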
Things were going fine for the city authorities, but as the city grew, there was no longer just one person named Mike; there were more than a hundred. A search for “Mike” in the citizens’ database returned over a hundred results. To cope with this, the authorities decided to rank the results for Mike. For example, the Mike who works for the city administration was placed first, while the Mike who is a burglar was placed last.
In similar terms, once the link graph has been processed, the search engine pulls all the links from the database and connects them, rating each one according to its authority. The rating, say out of 10, might be 10 for an authoritative link such as Forbes.com, 5 for a mediocre link, and 1 for a spam link.
Soon, deciding a person’s authority became a headache, and citizens started pointing fingers. So the authorities decided: if a highly authoritative person vouches for an ordinary citizen, that citizen will rank higher whenever his first name is searched, and vice versa.
Similarly, if a site rated 7 links to a site rated 5, the former passes link value (often called “link juice”) to the latter. The latter benefits in terms of search rank as a result. The inverse is also true.
If, out of 10 links pointing to a website, 4 are rated in the 7-10 range and the rest are rated 1-3, that website will rank better than one with 5 links rated 8 and the rest spam.
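The idea that authority flows along links can be sketched with a simplified PageRank-style calculation. This is a textbook simplification and an assumption on my part, not the actual rating formula any search engine uses today. The `link_scores` function, the `TOY_LINKS` graph, and the damping value of 0.85 are illustrative choices.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Simplified PageRank: a page's score is shared among the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    score = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * score[page] / len(targets)
                for target in targets:
                    new[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for target in pages:
                    new[target] += damping * score[page] / n
        score = new

    return score


# Hypothetical graph: "hub" is endorsed by three pages, "lonely" by none.
TOY_LINKS = {
    "fan1": ["hub"],
    "fan2": ["hub"],
    "fan3": ["hub"],
    "hub": ["a"],
    "lonely": ["b"],
}

scores = link_scores(TOY_LINKS)
```

Pages `a` and `b` each have exactly one inbound link, but `a` ends up with the higher score because its link comes from the well-endorsed `hub`. That is the same effect as the authoritative citizen vouching for an ordinary one.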
The city’s citizens are responsible. Suppose a building in the city is under renovation. There is no point in a searcher going through it now, so the residents decide to block access to the building until the renovation is finished. The searcher can still see the building from outside but cannot enter it or check whether the names on the houses inside have changed.
Robots.txt has a similar function. It tells the search engine not to crawl a page. The search engine can still see links pointing to that page and count them. It won’t be able to see what pages that page links to, but it will be able to add link value metrics for the page, which affects the domain as a whole. A sitemap is more like a list posted beside a building’s gate, naming the residents.
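As a rough illustration, Python’s standard library ships a robots.txt parser that a polite crawler could use to check whether a URL is off limits. The file contents, the `/renovation/` path, and the `ExampleBot` user agent below are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one blocked directory plus a sitemap pointer.
ROBOTS_TXT = """\
User-agent: *
Disallow: /renovation/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from text, no network request


def allowed(url, agent="ExampleBot"):
    """Return True if the given crawler may fetch the URL."""
    return parser.can_fetch(agent, url)


blocked_page = allowed("https://example.com/renovation/flat-12")
open_page = allowed("https://example.com/lobby")
sitemaps = parser.site_maps()  # available in Python 3.8 and later
```

Here `blocked_page` is `False` and `open_page` is `True`: the crawler knows the building exists but stays out of it, while the sitemap line tells it where to find the list of residents.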
The renovation went horribly wrong, and a few of the houses were abandoned. Their residents, being responsible citizens, put up an “abandoned” sign near their gates so that nobody, especially the civic searcher, would waste time looking inside.
With a 404, a website gives a clear message that the page is not there anymore. There is only one conclusive way to halt the flow of link value at the end point: deleting the page. A 410 is more absolute than a 404, and both will ultimately cause the page to be dropped from the index.
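One way to picture the difference is a small decision function that an indexer might run after re-fetching a page. The policy below (drop at once on 410, drop after a recheck on 404) is purely an assumption for illustration. Real search engines do not publish their exact rules, and `index_action` is a hypothetical name.

```python
from http import HTTPStatus


def index_action(status):
    """Map an HTTP status code to a hypothetical indexing decision."""
    if status == HTTPStatus.GONE:          # 410: the page is gone for good
        return "drop"
    if status == HTTPStatus.NOT_FOUND:     # 404: missing, but maybe temporarily
        return "drop_after_recheck"
    if 200 <= status < 300:                # success: the page is still there
        return "keep"
    return "retry_later"                   # anything else: try again later


decisions = {code: index_action(code) for code in (200, 404, 410, 503)}
```

Both the 404 and the 410 lead to removal in the end; the sketch only models the 410 as the more immediate of the two.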
Noindex is more definitive than robots.txt, but less so than a 404. A noindex directive does not block access to the page: the search engine can still fetch it, but is then told to leave it out of the index. The search engine can still rate links pointing to that page.
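A noindex directive usually lives in a robots meta tag in the page’s HTML, so the crawler has to fetch the page before it can see it. Here is a minimal sketch of that check using Python’s built-in HTML parser. The class and function names are my own, and the sample markup is invented.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma separated, e.g. "noindex, follow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def is_indexable(html):
    """Return False if the page asks to be left out of the index."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives


hidden = is_indexable(
    '<html><head><meta name="robots" content="noindex, follow"></head></html>'
)
visible = is_indexable("<html><head><title>Home</title></head></html>")
```

The first page yields `False` and the second `True`. Note that the page had to be downloaded and parsed before the directive could be obeyed, which is exactly why noindex and robots.txt behave differently.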
The way a search engine works is remarkably similar to human society. Everything is interconnected to form a network.