robots.txt

TL;DR: robots.txt is a file that tells search engines like Google which pages to crawl and which to skip. It's the bouncer at your website's front door.

What is robots.txt?

robots.txt is a plain text file that sits at the root of your website (yoursite.com/robots.txt). It gives instructions to search engine crawlers about which pages they're allowed to visit and which they should skip.

It's the first file crawlers check when they arrive at your site. Before Google looks at a single page, it reads your robots.txt to understand the ground rules.

Why it matters for your rankings

robots.txt is a blunt instrument - but a critical one.

Protecting private areas. Your admin panel, API endpoints, staging pages, and internal tools shouldn't be in Google's index. robots.txt keeps crawlers away from these areas.

Saving crawl budget. Google allocates limited time to crawl each site. If crawlers waste time on utility pages (search results, filtered views, print versions), your important content gets crawled less frequently.

Preventing accidental deindexing. This is the scary one. A single line - Disallow: / - tells all crawlers to stay away from your entire site. It happens more often than you'd think, especially during site migrations or when a staging robots.txt accidentally goes to production.

What happens when robots.txt goes wrong:

  • Block everything: your entire site disappears from Google within days
  • Block CSS/JS: Google can't render your pages and may rank them lower
  • Block important sections: product pages or blog posts vanish from search
  • Missing robots.txt: not a disaster - crawlers treat a missing file as "allow everything" - but they won't find your sitemap reference
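One way to guard against these failure modes is a deploy-time sanity check. Here's a minimal sketch using Python's stdlib urllib.robotparser (the probe URL and helper name are illustrative assumptions, not part of any particular toolchain):

```python
from urllib.robotparser import RobotFileParser

def site_is_crawlable(robots_txt: str, probe_url: str = "https://example.com/") -> bool:
    """Return True if a generic crawler may fetch the probe URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", probe_url)

# A staging file that accidentally shipped to production blocks everything:
print(site_is_crawlable("User-Agent: *\nDisallow: /"))      # False
# Blocking only utility routes leaves the homepage crawlable:
print(site_is_crawlable("User-Agent: *\nDisallow: /api/"))  # True
```

Running a check like this in your deploy pipeline catches the "staging Disallow: / went live" mistake before Google sees it.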

How it actually works

A basic robots.txt file:

User-Agent: *
Allow: /
Disallow: /api/
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml

Breaking this down:

  • User-Agent: * - these rules apply to all crawlers
  • Allow: / - crawl everything by default
  • Disallow: /api/ - except the API routes
  • Sitemap: - here's where to find the sitemap
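You can verify rules like these programmatically. A minimal sketch with Python's stdlib urllib.robotparser (the file contents are inlined; the redundant Allow: / is omitted because crawling is allowed by default, and this parser matches rules in order rather than by longest path, as Google does):

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, inlined for illustration.
rules = """\
User-Agent: *
Disallow: /api/
Disallow: /admin/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Regular pages are crawlable; /api/ routes are not.
print(parser.can_fetch("*", "https://example.com/blog/post"))  # True
print(parser.can_fetch("*", "https://example.com/api/users"))  # False
```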

You can also target specific crawlers:

User-Agent: GPTBot
Disallow: /

User-Agent: Googlebot
Allow: /

This blocks OpenAI's crawler but allows Google. Some sites use this to prevent AI training on their content while staying in search results.
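The same stdlib parser can confirm the per-crawler behavior (a sketch; the page URL is an assumption):

```python
from urllib.robotparser import RobotFileParser

# The crawler-specific rules from above.
rules = """\
User-Agent: GPTBot
Disallow: /

User-Agent: Googlebot
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# GPTBot is blocked everywhere; Googlebot is allowed.
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))  # True
```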

Important nuances:

  • robots.txt is a request, not an enforcement mechanism. Well-behaved crawlers respect it; malicious ones ignore it.
  • Disallow prevents crawling, not indexing. If other sites link to a blocked page, Google may still index the URL (just without content). Use noindex meta tags for true deindexing.
  • robots.txt is public. Anyone can read it at yoursite.com/robots.txt. Don't put sensitive URL patterns there.
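The noindex directive mentioned above is an HTML meta tag placed in the page's head. Note the page must stay crawlable - if robots.txt blocks it, Google never sees the tag:

```html
<meta name="robots" content="noindex">
```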

Common mistakes:

  • Leaving the staging Disallow: / after launching
  • Blocking CSS and JavaScript files (breaks Google's rendering)
  • Not including a Sitemap reference
  • Using robots.txt to hide pages instead of noindex (they still appear in search as URL-only results)

How Webentity handles this

Webentity generates robots.txt at build time with sensible defaults: allow everything, block API routes, and link to the sitemap. The file is type-safe (generated from code, not a static text file), so you can't accidentally break it with a typo.

During development, staging deployments use a different robots.txt that blocks all crawlers - preventing Google from indexing your work-in-progress. When you deploy to production, the production rules take over automatically.