The robots.txt file is an important concept to learn in order to understand how crawling works.
What is the Robots.txt file?
The robots.txt file implements the Robots Exclusion Protocol (REP). By placing this file on your website, you can prevent crawlers from crawling specific webpages of your site.
When a crawler visits your website, it first checks for this file. If the file is not present, the crawler will crawl all of the crawlable content on your website.
Syntax for Robots.txt files:
User-agent: user agent name (required)
Disallow: URL string not to be crawled
Meaning of the technical terms used in the robots.txt syntax:
User-agent: Specifies the name of the user agent, i.e. the bot (for example: Googlebot, Bingbot, etc.)
Disallow: Specifies the URL path or file you do not want crawled
Allow (only for Googlebot): Lets the crawler crawl a subfolder even though its parent folder is disallowed
Crawl-delay: The time a crawler should wait between requests (note that Googlebot ignores this directive)
Sitemap: Tells the crawler the location of the sitemap for the site
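Putting these directives together, a complete robots.txt file might look like the sketch below. The domain, folder names, and sitemap path are placeholders for illustration:

```
# Rules for Bing's crawler: skip the /private/ folder and wait
# 10 seconds between requests (Googlebot ignores Crawl-delay)
User-agent: Bingbot
Disallow: /private/
Crawl-delay: 10

# Rules for Google's crawler: skip /private/, but allow one subfolder
User-agent: Googlebot
Disallow: /private/
Allow: /private/public-docs/

# Location of the XML sitemap
Sitemap: https://www.example.com/sitemap.xml
```

Lines starting with # are comments and are ignored by crawlers.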
Now let’s look at some examples to better understand the robots.txt file:
Case 1: Allow all crawlers to crawl the entire website (this is also the default behavior when no robots.txt file exists)
Case 2: Disallow crawlers from crawling the services section of our website, www.example.com/services
Case 3: Block all crawlers from crawling the entire website (be careful: this can remove your site from search results)
Case 4: Block a specific crawler from crawling your website
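The four cases above can be sketched as follows. Each snippet would be its own complete robots.txt file; www.example.com and Googlebot are used here only as illustrations:

```
# Case 1: allow all crawlers to crawl everything
User-agent: *
Disallow:

# Case 2: block all crawlers from the services section
User-agent: *
Disallow: /services

# Case 3: block all crawlers from the entire website
User-agent: *
Disallow: /

# Case 4: block one specific crawler (here Googlebot) from the whole site
User-agent: Googlebot
Disallow: /
```

Note the difference between an empty Disallow: (Case 1), which blocks nothing, and Disallow: / (Case 3), which blocks everything.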
Where to insert the robots.txt file:
After creating this file (a plain text file saved with a .txt extension, for example in Notepad), you must upload it to the root directory of your website, so that it is reachable at yourdomain.com/robots.txt. Crawlers look for the robots.txt file only in the root directory; if they don’t find it there, they assume that the website does not have one.
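As a quick sanity check before uploading, you can test robots.txt rules with Python’s standard library. The sketch below assumes an illustrative file that disallows /services for all crawlers; the domain and paths are placeholders:

```python
from urllib import robotparser

# Illustrative robots.txt content: block /services for all crawlers
rules = """\
User-agent: *
Disallow: /services
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)  # parse the rules directly instead of fetching them

# Check which URLs a crawler identifying as "Googlebot" may fetch
print(parser.can_fetch("Googlebot", "https://www.example.com/services"))  # False
print(parser.can_fetch("Googlebot", "https://www.example.com/about"))     # True
```

Since there is no Googlebot-specific group in these rules, the `User-agent: *` group applies to it.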