HOW TO: Control How Search Engines Crawl Your Website
Posted by Shakir M. on 23 August 2019 11:10 AM


In order for your website to be found by other people, search engine crawlers, also sometimes referred to as bots or spiders, will crawl your website and save its contents, looking for updated text and links to update their search indexes.

 

Control search engine crawlers with a robots.txt file.

Website owners can instruct search engines on how they should crawl a website, by using a robots.txt file. When a search engine crawls a website, it requests the robots.txt file first and then follows the rules within.

 

How to create/edit the robots.txt file?

The robots.txt file needs to be at the root of your site folder. If your domain is domain.com, the file should be found in your website root directory:

/home/cpaneluser/public_html/robots.txt
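Because the file sits in the website root, crawlers always request it at /robots.txt on the domain, whatever page they started from. A minimal Python sketch of that mapping follows; the helper name robots_url and the sample page URL are made up for illustration.

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    # The leading slash makes urljoin resolve against the site root,
    # discarding whatever path page_url had.
    return urljoin(page_url, "/robots.txt")


# Hypothetical page on the example domain used in this article.
print(robots_url("https://domain.com/blog/post.html"))
# prints https://domain.com/robots.txt
```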

 

As per the screenshot:

 


 

 

 

Search engine User-agents

The most common rule you’d use in a robots.txt file is based on the User-agent of the search engine crawler.

Search engine crawlers use a User-agent to identify themselves when crawling. Here are some common examples:

 

Top 3 search engine User-agents:

 

Googlebot; By Google

Yahoo! Slurp; By Yahoo

bingbot; By Microsoft

 

Common search engine User-agents blocked:

AhrefsBot

Baiduspider

Ezooms

MJ12bot

YandexBot

 

Search engine crawler access via robots.txt file

There are quite a few options when it comes to controlling how your site is crawled with the robots.txt file.

The User-agent: rule specifies which User-agent the rule applies to, and * is a wildcard matching any User-agent.

Disallow: sets the files or folders that are not allowed to be crawled.
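To make the structure of these two rules concrete, here is a rough Python sketch that groups Disallow: paths under the User-agent: lines that precede them. It is a simplified illustration only, not how any particular search engine parses the file; the function name parse_groups and the sample rules are assumptions made for this example.

```python
def parse_groups(text: str) -> dict[str, list[str]]:
    """Map each User-agent to the list of paths it is disallowed from."""
    groups: dict[str, list[str]] = {}
    current: list[str] = []  # user-agents of the group being read
    in_rules = False         # True once the group has a rule line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent after rules starts a new group
                current, in_rules = [], False
            current.append(value)
            groups.setdefault(value, [])
        elif field == "disallow":
            in_rules = True
            if value:  # an empty "Disallow:" blocks nothing
                for agent in current:
                    groups[agent].append(value)
    return groups


sample = """\
User-agent: *
Disallow: /tmp/
Disallow: /private/
"""
print(parse_groups(sample))
```

For the sample above, the result maps the wildcard * to the two blocked folders.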

 

HOW TO: 

 

Set a crawl delay for all search engines:

If you had 1,000 pages on your website, a search engine could potentially index your entire site in a few minutes.

However, this could cause high system resource usage with all of those pages loaded in a short time period.

 

A Crawl-delay: of 30 seconds would allow crawlers to index your entire 1,000-page website in about 8.3 hours (1,000 pages × 30 seconds = 30,000 seconds).

 

You can set the Crawl-delay: for all search engines at once with:

User-agent: *
Crawl-delay: 30
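As a quick check of the arithmetic, the sketch below reads the delay back with Python's standard urllib.robotparser module and works out the total crawl time. The crawler name ExampleBot is hypothetical. Keep in mind that Crawl-delay: is not honored by every crawler, so treat the figure as an estimate for bots that respect it.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Crawl-delay: 30
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "ExampleBot" is a made-up crawler name; it falls under the * group.
delay = parser.crawl_delay("ExampleBot")
pages = 1000
hours = pages * delay / 3600  # seconds to hours
print(f"{pages} pages at {delay}s each: {hours:.1f} hours")
# prints 1000 pages at 30s each: 8.3 hours
```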

 


Allow all search engines to crawl website:

By default search engines should be able to crawl your website, but you can also specify they are allowed with:

User-agent: *
Disallow:

 


Disallow all search engines from crawling website:

You can disallow all search engines from crawling your website with these rules:

User-agent: *
Disallow: /
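The difference between an empty Disallow: and Disallow: / is easy to miss, so here is a small Python sketch that tests both rule sets with the standard urllib.robotparser module. The helper is_allowed and the sample URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


allow_all = "User-agent: *\nDisallow:\n"    # empty value: nothing blocked
block_all = "User-agent: *\nDisallow: /\n"  # "/" matches every path

url = "https://domain.com/store.htm"  # hypothetical page
print(is_allowed(allow_all, "Googlebot", url))  # True
print(is_allowed(block_all, "Googlebot", url))  # False
```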

 


Disallow one particular search engine from crawling website:

You can disallow just one specific search engine from crawling your website with these rules:

User-agent: Baiduspider
Disallow: /
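A short Python sketch, again using the standard urllib.robotparser module, shows that these rules affect only the named crawler. The sample URL is hypothetical, and the check passes the crawler's short name rather than a full browser-style User-agent string.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Baiduspider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

url = "https://domain.com/index.htm"  # hypothetical page
baidu_ok = parser.can_fetch("Baiduspider", url)
google_ok = parser.can_fetch("Googlebot", url)
print(baidu_ok)   # False: the group names this crawler
print(google_ok)  # True: no group applies, so crawling is allowed
```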

 


Disallow all search engines from particular folders:

If we had a few directories like /cgi-bin/, /private/, and /tmp/ that we didn't want bots to crawl, we could use this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /tmp/
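To see which URLs the folder rules actually cover, the following Python sketch checks a few made-up paths with the standard urllib.robotparser module. The crawler name ExampleBot and the paths are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical paths on the site, checked for a hypothetical crawler.
paths = ["/private/report.htm", "/tmp/cache.dat", "/private", "/index.htm"]
results = {path: parser.can_fetch("ExampleBot", path) for path in paths}
for path, allowed in results.items():
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

Everything inside the listed folders is blocked. The path /private without a trailing slash is still allowed, because the rule only matches paths that begin with /private/.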

 


Disallow all search engines from particular files:

If we had files like contactus.htm, index.htm, and store.htm that we didn't want bots to crawl, we could use this:

User-agent: *
Disallow: /contactus.htm
Disallow: /index.htm
Disallow: /store.htm
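One thing worth knowing about file rules is that Disallow: matches by path prefix. The Python sketch below, using the standard urllib.robotparser module, illustrates this with a few made-up paths; ExampleBot is a hypothetical crawler name.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /contactus.htm
Disallow: /index.htm
Disallow: /store.htm
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical paths, checked for a hypothetical crawler.
paths = ["/index.htm", "/index.html", "/about.htm", "/"]
results = {path: parser.can_fetch("ExampleBot", path) for path in paths}
for path, allowed in results.items():
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

Here /index.html is blocked as well, since it starts with /index.htm, while the bare / address stays crawlable even if the server answers it with index.htm.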

 


Disallow all search engines but one:

If we only wanted to allow Googlebot access to our /private/ directory and disallow all other bots we could use:

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow:

 

When Googlebot reads our robots.txt file, it follows the group that names it instead of the * group, so it will see it is not disallowed from crawling any directories.
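The two-group behavior can be checked with a small Python sketch based on the standard urllib.robotparser module. The sample URL and the choice of bingbot as the "other" crawler are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

url = "https://domain.com/private/report.htm"  # hypothetical page
google_ok = parser.can_fetch("Googlebot", url)
bing_ok = parser.can_fetch("bingbot", url)
print(google_ok)  # True: the Googlebot group has an empty Disallow:
print(bing_ok)    # False: bingbot falls back to the * group
```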

Copyright © 1998 - 2018 Shinjiru International Inc. All Rights Reserved.