Robots.txt basics

The robots.txt file is used to direct search engines to the content you want them to see and to keep them away from the content you don't want them to see. It can also be used to block specific search engines from crawling your site altogether.

Robots.txt Basics

To make a robots.txt file, all you need to do is open Notepad (or any other text editor) and save the file as robots.txt. Crawlers only look for the file in the root directory of your site (for example http://www.example.com/robots.txt), so that is where it needs to go; a robots.txt file placed in a subdirectory is ignored. The rules follow a standard hierarchy, i.e. a rule for a directory applies to everything underneath it unless otherwise stated.
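Since crawlers fetch the file from one fixed location at the root of the host, you can work out where the robots.txt for any page lives. Here is a minimal Python sketch of that; the function name and the example.com URLs are placeholders chosen for illustration.

```python
from urllib.parse import urljoin

def robots_url(page_url: str) -> str:
    """Return the robots.txt location that governs the given page URL."""
    # An absolute path replaces the page's own path (and query string),
    # leaving only the scheme and host of the original URL.
    return urljoin(page_url, "/robots.txt")

if __name__ == "__main__":
    print(robots_url("http://www.example.com/blog/2007/05/some-post/?ref=feed"))
```

Whatever page on the site you pass in, the result points at the same root-level file.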

There are numerous parameters you can use, but the two main ones are ‘User-agent’ and ‘Disallow’. User-agent lets you specify which crawlers a set of rules applies to. This can be useful if, for example, you find that someone is using software to rip content from your blog, although only software that chooses to obey robots.txt will be stopped. Disallow lets you specify which directories and files must not be crawled.

Here are some basic examples of what you can put in the file to control those pesky little spiders!

Allow all crawlers
User-agent: *
Disallow:

Disallow all crawlers
User-agent: *
Disallow: /

Disallow Google from your blog
User-agent: Googlebot
Disallow: /

Disallow search engines from your WordPress admin area and cgi-bin
User-agent: *
Disallow: /wp-admin/
Disallow: /cgi-bin/

Block the evilbot from your admin area
User-agent: Evilbot
Disallow: /admin/
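If you want to check rules like these before uploading them, one option is Python's standard-library `urllib.robotparser`. The sketch below parses the Evilbot example from a string and asks whether a few URLs may be fetched. The example.com URLs and the `SomeOtherBot` name are placeholders, and the `can_fetch` helper is just a convenience wrapper written for this post.

```python
from urllib.robotparser import RobotFileParser

# The "Block the evilbot from your admin area" rules from above.
RULES = """\
User-agent: Evilbot
Disallow: /admin/
"""

def can_fetch(rules: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)

if __name__ == "__main__":
    checks = [
        ("Evilbot", "http://www.example.com/admin/login"),
        ("Evilbot", "http://www.example.com/blog/"),
        ("SomeOtherBot", "http://www.example.com/admin/login"),
    ]
    for agent, url in checks:
        verdict = "allowed" if can_fetch(RULES, agent, url) else "blocked"
        print(f"{agent:13} {url} -> {verdict}")
```

Only Evilbot is kept out of /admin/; any crawler without a matching User-agent section is unaffected.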

More Parameters

You can also use the parameter ‘Allow’. For example, say you want to stop Google spidering your images directory apart from one image.

Allow one image to be crawled but not the directory
User-agent: Googlebot
Disallow: /images/
Allow: /images/example.gif
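When both a Disallow and an Allow rule match the same URL, Google resolves the conflict by taking the most specific (longest) matching rule, with Allow winning a tie. The Python sketch below illustrates that idea for plain path prefixes only (no wildcards); the `is_allowed` function and the rule-list format are made up for this illustration. Note that Python's own `urllib.robotparser` applies rules in file order instead, so it would not agree with this sketch for the example above.

```python
def is_allowed(rules, path):
    """Decide whether `path` may be crawled.

    `rules` is a list of (directive, prefix) pairs, where directive is
    "Allow" or "Disallow". The longest matching prefix wins, and an
    Allow rule beats a Disallow rule of the same length.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value matches nothing
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed

# The Googlebot rules from the example above.
IMAGE_RULES = [
    ("Disallow", "/images/"),
    ("Allow", "/images/example.gif"),
]

if __name__ == "__main__":
    for path in ["/images/example.gif", "/images/other.gif", "/blog/"]:
        print(path, "->", "allowed" if is_allowed(IMAGE_RULES, path) else "blocked")
```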

Another parameter that you can use with most major search crawlers is ‘Sitemap’.

Show crawlers where your sitemap is
User-agent: *
Disallow:
Sitemap: http://www.bloggingtips.com/sitemap.xml
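As a quick check that the Sitemap line is being picked up, Python's `urllib.robotparser` (version 3.8 or later) can list the sitemaps it finds. This is only a sketch run against the example text above; `find_sitemaps` is a helper name invented here.

```python
from urllib.robotparser import RobotFileParser

# The sitemap example from above.
RULES = """\
User-agent: *
Disallow:
Sitemap: http://www.bloggingtips.com/sitemap.xml
"""

def find_sitemaps(rules: str) -> list:
    """Return the Sitemap URLs declared in robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []

if __name__ == "__main__":
    print(find_sitemaps(RULES))
```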

A WordPress example

Here is an example robots.txt file from a WordPress site.

User-agent: *
# disallow all files in these directories
Disallow: /cgi-bin/
Disallow: /stats/
Disallow: /dh_
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /contact/
Disallow: /wp-content/b
Disallow: /wp-content/p
Disallow: /wp-content/themes/askapache/4
Disallow: /wp-content/themes/askapache/c
Disallow: /wp-content/themes/askapache/d
Disallow: /wp-content/themes/askapache/f
Disallow: /wp-content/themes/askapache/h
Disallow: /wp-content/themes/askapache/in
Disallow: /wp-content/themes/askapache/p
Disallow: /wp-content/themes/askapache/s
Disallow: /trackback/
Disallow: /*?*
Disallow: */trackback/

User-agent: Googlebot
# disallow all files ending with these extensions
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.cgi$
Disallow: /*.xhtml$
Disallow: /*.php*
Disallow: */trackback*
Disallow: /*?*
Disallow: /z/
Disallow: /wp-*
Allow: /wp-content/uploads/

# allow google image bot to search all images
User-agent: Googlebot-Image
Allow: /*

# allow adsense bot on entire site
User-agent: Mediapartners-Google*
Disallow: /*?*
Allow: /about/
Allow: /contact/
Allow: /wp-content/
Allow: /tag/
Allow: /*.php$
Allow: /*.js$

# disallow archiving site
User-agent: ia_archiver
Disallow: /

# disable duggmirror
User-agent: duggmirror
Disallow: /
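The `*` and `$` characters in rules such as `Disallow: /*.php$` are wildcard extensions understood by Googlebot and other major crawlers: `*` stands for any run of characters and a trailing `$` anchors the rule to the end of the URL. Here is a rough Python sketch of that matching, handy for testing a pattern against a few paths. The helper names are hypothetical, and real crawlers may differ in the details.

```python
import re

def pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Without a trailing '$' the pattern only has to match a prefix.
    return re.compile(body + (r"\Z" if anchored else ""))

def matches(pattern: str, path: str) -> bool:
    """Report whether `path` is covered by the robots.txt `pattern`."""
    return pattern_to_regex(pattern).match(path) is not None

if __name__ == "__main__":
    samples = [
        ("/*.php$", "/index.php"),
        ("/*.php$", "/index.php?page=2"),
        ("/*?*", "/blog/?p=1"),
        ("/wp-*", "/wp-login.php"),
    ]
    for pattern, path in samples:
        print(f"{pattern:10} {path:20} {matches(pattern, path)}")
```

So `/*.php$` covers `/index.php` but not `/index.php?page=2`, while `/*?*` covers any URL with a query string.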

Summary

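Robots.txt is very easy to use and can be very useful for keeping well-behaved crawlers out of your private and admin areas. Bear in mind that the file is publicly readable and is only a request, so it is no substitute for password protection. If you are having trouble with it please let me know :)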

Kevin Muldoon is a webmaster and blogger who lives in Central Scotland. His current project is WordPress Mods, a blog which focuses on WordPress themes, plugins, tutorials, news and modifications, and useful resources such as 101 Places To Find Images For Your Blog Posts.

3 comments
  • Posted by Stephen Welton on 14th May 2007

    I might wait to spend some time on this one when you get back.

    I have been wondering what the buzz was about this and I’m sure it will become important to me as my site begins to grow to prevent the nasty crawlers out there that can slide by your website.

    I have seen a few sites that show up in google and they don’t index pretty.

  • Posted by AskApache on 17th May 2007

    Original Article at AskApache.

  • Posted by Internet Shopping on 8th Jun 2008

    Kevin, this is the Terrific post by you, wonderful.

    I got every thing which i want to know about robots.txt at one place and that's your informative post. many thanks for it.