
What Is Robots.txt?

Robots.txt is a plain text file placed at the root of a website (e.g. https://example.com/robots.txt) that provides instructions to web crawlers about which pages they can and cannot access. It follows the Robots Exclusion Protocol and is the first thing most crawlers check before crawling a site. While not a security measure (it's publicly readable), robots.txt is a crawl budget management tool — directing search engine resources toward your most important pages.

Robots.txt syntax

A robots.txt file consists of one or more user-agent blocks. `User-agent` names the crawler the following rules apply to (use `*` for all crawlers). `Disallow` blocks a path prefix. `Allow` permits a path that would otherwise be blocked by a broader `Disallow`. A `Sitemap: [url]` line, conventionally placed at the end of the file, references your XML sitemap and is a best practice for discoverability. Example:

```
User-agent: *
Disallow: /admin
Disallow: /api

Sitemap: https://example.com/sitemap.xml
```
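Rules can be sanity-checked before deployment with Python's standard-library `urllib.robotparser`, which parses a robots.txt file and answers per-URL access questions. A minimal sketch using the example rules above (the crawler name `MyBot` is hypothetical):

```python
from urllib import robotparser

# The example robots.txt from above, supplied as lines
# instead of being fetched over HTTP.
rules = """\
User-agent: *
Disallow: /admin
Disallow: /api

Sitemap: https://example.com/sitemap.xml
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Disallow matches by path prefix, so anything under /admin is blocked.
print(rp.can_fetch("MyBot", "https://example.com/admin/settings"))  # False
print(rp.can_fetch("MyBot", "https://example.com/pricing"))         # True
```

In production you would point `RobotFileParser.set_url()` at the live file and call `read()`; parsing inline as above is useful for testing edits before they ship.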

What to disallow

Common paths to disallow: /admin, /wp-admin (WordPress admin), /api routes not meant for public consumption, /staging, thank-you pages, duplicate filtered/sorted views, and search results pages with no unique content. Do NOT disallow pages you want indexed — a common mistake is blocking CSS or JS files Google needs to render your pages.
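The `Allow` directive handles the common case where a blocked directory contains one file that must stay reachable. For example, WordPress sites typically block the admin area but keep `admin-ajax.php` open, since front-end features depend on it (paths shown are the standard WordPress ones):

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```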

AI crawlers and robots.txt

Since 2023, AI companies have released their own crawlers: GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google AI training), and others. By default, these are allowed unless explicitly blocked. Allowing AI crawlers can lead to citations in AI assistants, which is valuable organic exposure. Block them only if you have legal or content reasons to prevent AI training on your content.
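If you do decide to opt out, each crawler gets its own user-agent block with a site-wide `Disallow`. A sketch using the user-agent tokens these vendors publish:

```
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```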

Robots.txt & Canvas Builder

canvasbuilder.co's robots.txt explicitly allows all major AI crawlers including GPTBot, ClaudeBot, and PerplexityBot — maximising citation potential in AI assistants and search tools.

Try Canvas Builder →

Frequently Asked Questions

Does robots.txt block pages from being indexed?
Not directly — robots.txt prevents crawling, not indexing. A disallowed page can still be indexed if Google finds links to it. To prevent indexing, use a noindex meta tag on the page itself.
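A minimal sketch of the meta tag approach; note the page must remain crawlable (not disallowed in robots.txt) for search engines to see the tag at all:

```
<meta name="robots" content="noindex">
```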
Does canvasbuilder.co allow AI crawlers?
Yes — canvasbuilder.co's robots.txt explicitly allows GPTBot, ClaudeBot, Google-Extended, PerplexityBot, and other AI crawlers. This maximises visibility in AI search and citation tools.