The Guide to Understanding Robots.txt and Protecting Your Website

What is robots.txt and how does it work?

A robots.txt file is a plain text file that tells web crawlers (also called bots or spiders) which pages on your website they can and cannot access. It’s a core part of the Robots Exclusion Protocol (REP), which defines how search engine crawlers should behave when visiting a site.

The file lives in the root directory of your website (yourdomain.com/robots.txt) and provides directives such as:

  • Allow: explicitly grants bots access to a page or directory (often used to carve out an exception inside a disallowed directory)
  • Disallow: blocks bots from accessing a page or directory
  • Crawl-delay: asks bots to wait a set number of seconds between requests (helpful for managing server load, though not every crawler honors it; Bing does, while Googlebot ignores it)
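
For instance, a short combined example (the /private/ directory and page name below are placeholders, purely for illustration) might look like this:

User-agent: *
Allow: /private/overview.html
Disallow: /private/
Crawl-delay: 5

This asks every bot to stay out of /private/ except for one page, and to wait five seconds between requests.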
Note: A robots.txt file is a directive, not an enforcement tool. Ethical bots (like Googlebot, Bingbot, etc.) will respect it, but malicious crawlers often ignore it. That’s why robots.txt is best used alongside other protections like firewalls, security plugins, and server rules.

Identifying the bots causing issues

If you suspect that certain bots are putting strain on your server or crawling where they shouldn’t, you can identify them through analytics tools.

In cPanel, the Awstats tool provides a breakdown of robot traffic:

  1. Log in to cPanel and open Awstats.
  2. Choose your domain (if using HTTPS, select the (SSL) option).
  3. Scroll to “Robots/Spiders visitors (Top 25)” to see which bots are hitting your site most often.

Once identified, you can adjust your robots.txt to slow them down or block them entirely.
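
For example, if Awstats shows one particular crawler hitting your site far more than the rest, you can add a group for it by name ("ExampleBot" below is a placeholder; use the exact robot name reported in your stats):

# Ask one specific crawler to slow down
User-agent: ExampleBot
Crawl-delay: 30

# Or block that crawler entirely instead:
# User-agent: ExampleBot
# Disallow: /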

Protecting your website with robots.txt

Here’s how to set up and manage your robots.txt file:

  1. Log in to cPanel.
  2. Open File Manager and navigate to the public_html directory.
  3. Create a new file named robots.txt.
  4. Edit the file and add the directives you want crawlers to follow.

For example, a basic block for all bots looks like this:

User-agent: *
Disallow: /

This prevents all crawlers from accessing your entire site. Most site owners don’t want this (it stops search engines from crawling your content, which will quickly hurt your visibility in search results), but it’s useful temporarily if you’re working on a private or staging environment.
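
More often you will only want to keep bots out of specific areas rather than the whole site; the same Disallow directive works on individual directories (the paths below are just examples, so adjust them to your own site’s structure):

User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/
Disallow: /staging/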

A good 2025 default template

For most websites, you’ll want to allow major search engines to crawl your site, but slow them down slightly to avoid server overload, while blocking unknown or unnecessary bots.

Here’s a practical template:

# Crawl Delay for major spiders
User-agent: Mediapartners-Google
Crawl-delay: 10
Disallow:

User-agent: Googlebot
Crawl-delay: 10
Disallow:

User-agent: Adsbot-Google
Crawl-delay: 10
Disallow:

User-agent: Googlebot-Image
Crawl-delay: 10
Disallow:

User-agent: Googlebot-Mobile
Crawl-delay: 10
Disallow:

User-agent: MSNBot
Crawl-delay: 10
Disallow:

User-agent: bingbot
Crawl-delay: 10
Disallow:

User-agent: Slurp
Crawl-delay: 10
Disallow:

User-agent: Yahoo! Slurp
Crawl-delay: 10
Disallow:

# Block all other spiders
User-agent: *
Disallow: /

This setup keeps your site open to Google, Bing, and Yahoo while unknown bots are blocked. Keep in mind that Bing and Yahoo honor Crawl-delay and will pace their requests accordingly, while Googlebot ignores Crawl-delay (it still follows the Allow/Disallow rules) and manages its own crawl rate automatically.

Helpful tools

If you don’t want to write your robots.txt file by hand, you can use one of the many free online robots.txt generators and paste the result into your file via File Manager.

Key Takeaways for 2025

  • Robots.txt remains essential for search engine communication but isn’t a security tool.
  • Use Crawl-delay to manage server load without blocking good bots.
  • Block unwanted bots to reduce wasted bandwidth.
  • Always test your robots.txt file before deploying changes (see the sketch below).
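
As a minimal sketch of that last point, you can sanity-check a draft of the file locally with Python’s built-in urllib.robotparser before uploading it (the file path and URLs below are assumptions for illustration only):

from urllib.robotparser import RobotFileParser

# Load a local draft of the file (adjust the path to wherever you saved it)
parser = RobotFileParser()
with open("robots.txt") as f:
    parser.parse(f.read().splitlines())

# With the template above, Googlebot is allowed and sees the 10-second delay...
print(parser.can_fetch("Googlebot", "/some-page.html"))      # expected: True
print(parser.crawl_delay("Googlebot"))                       # expected: 10

# ...while an unlisted bot falls under "User-agent: *" and is blocked.
print(parser.can_fetch("SomeRandomBot", "/some-page.html"))  # expected: False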

For more details, check out Google’s official robots.txt documentation.