Robots.txt basics

By Kevin Muldoon

The robots.txt file tells search engine crawlers which content on your site you want them to crawl and which content you don't. It can also be used to ask specific crawlers to stay away from your site altogether. Compliance is voluntary: well-behaved crawlers follow the rules, but nothing forces a badly behaved one to obey them.

Robots.txt Basics

To make a robots.txt file, all you need to do is open Notepad (or any other text editor) and save the file as robots.txt. Place it in the root directory of your website: crawlers only request /robots.txt at the top level of a host, so a copy in a subdirectory is ignored. The rules in the file apply to every directory underneath the root unless you state otherwise.
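
As a rough illustration of the root-directory case, the sketch below works out where a crawler would look for the file given any page URL. The helper name `robots_url` and the sample page path are assumptions made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the root-level robots.txt URL for the host serving page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page path, used only for illustration.
location = robots_url("http://www.bloggingtips.com/some/post/?page=2")
print(location)
```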

There are several parameters you can use, but the two main ones are ‘User-agent’ and ‘Disallow’. User-agent names the crawler that the rules following it apply to (* means every crawler). This can be useful if you find that someone is using software to rip content from your blog, for example, although only crawlers that honour robots.txt will obey. Disallow lists the directories and files that the crawler should not crawl.

Here are some basic examples of what you can put in the file to control those pesky little spiders!

Allow all crawlers
User-agent: *
Disallow:

Disallow all crawlers
User-agent: *
Disallow: /

Disallow Google from your blog
User-agent: Googlebot
Disallow: /

Disallow search engines from your WordPress admin area and cgi-bin
User-agent: *
Disallow: /wp-admin/
Disallow: /cgi-bin/

Block the evilbot from your admin area
User-agent: Evilbot
Disallow: /admin/
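
If you want to check rules like these before uploading them, Python's standard `urllib.robotparser` module can evaluate them locally. This is only a sketch: the helper `allowed`, the bot names "AnyBot" and "OtherBot", and the sample paths are assumptions for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, agent, path):
    """Return True if `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


block_all = "User-agent: *\nDisallow: /"
block_google = "User-agent: Googlebot\nDisallow: /"
block_admin = "User-agent: *\nDisallow: /wp-admin/\nDisallow: /cgi-bin/"

print(allowed(block_all, "AnyBot", "/some-post/"))
print(allowed(block_google, "Googlebot", "/"))
print(allowed(block_google, "OtherBot", "/"))
print(allowed(block_admin, "AnyBot", "/wp-admin/options.php"))
```

An empty `Disallow:` line blocks nothing, which is why blocking a crawler from the whole site needs `Disallow: /`.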

More Parameters

You can also use the parameter ‘Allow’. For example, say you want to stop Google from spidering your images directory apart from one image.

Allow one image to be crawled but not the directory
User-agent: Googlebot
Disallow: /images/
Allow: /images/example.gif
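
Crawlers differ in how they resolve an Allow that overlaps a Disallow. Google documents that the most specific (longest) matching rule wins, while some older parsers simply apply the first rule that matches. The sketch below illustrates the longest-match approach only; `is_allowed` and the rule-tuple format are hypothetical, invented for this illustration, and it handles plain path prefixes without wildcards.

```python
def is_allowed(path, rules):
    """Longest-match check. `rules` is a list of (directive, prefix) pairs,
    where directive is "allow" or "disallow" (a made-up format for this sketch)."""
    best = None
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty prefix or a non-matching rule has no effect
        longer = best is None or len(prefix) > len(best[1])
        tie_for_allow = (
            best is not None
            and len(prefix) == len(best[1])
            and directive == "allow"
        )
        if longer or tie_for_allow:
            best = (directive, prefix)
    # No matching rule means the path may be crawled.
    return best is None or best[0] == "allow"


rules = [("disallow", "/images/"), ("allow", "/images/example.gif")]
print(is_allowed("/images/example.gif", rules))
print(is_allowed("/images/other.gif", rules))
```

If you need to support first-match parsers as well, listing the Allow line before the Disallow line gives the same result under both approaches.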

Another parameter that most major search crawlers support is ‘Sitemap’. It takes the full URL of your sitemap and is not tied to any particular User-agent group.

Show crawlers where your sitemap is
User-agent: *
Disallow:
Sitemap: http://www.bloggingtips.com/sitemap.xml
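
For a quick local check that a Sitemap line is picked up, Python's standard `urllib.robotparser` exposes the listed sitemaps through `site_maps()` (available from Python 3.8). This is a minimal sketch using the same three lines as the example above; the bot name "AnyBot" is an assumption.

```python
from urllib.robotparser import RobotFileParser

lines = [
    "User-agent: *",
    "Disallow:",
    "Sitemap: http://www.bloggingtips.com/sitemap.xml",
]

parser = RobotFileParser()
parser.parse(lines)

# site_maps() returns a list of sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)

# The empty Disallow line still leaves the whole site crawlable.
crawlable = parser.can_fetch("AnyBot", "/any/page/")
print(crawlable)
```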

A WordPress example

Here is an example robots.txt file for WordPress.

User-agent: *
# disallow all files in these directories
Disallow: /cgi-bin/
Disallow: /stats/
Disallow: /dh_
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /contact/
Disallow: /wp-content/b
Disallow: /wp-content/p
Disallow: /wp-content/themes/askapache/4
Disallow: /wp-content/themes/askapache/c
Disallow: /wp-content/themes/askapache/d
Disallow: /wp-content/themes/askapache/f
Disallow: /wp-content/themes/askapache/h
Disallow: /wp-content/themes/askapache/in
Disallow: /wp-content/themes/askapache/p
Disallow: /wp-content/themes/askapache/s
Disallow: /trackback/
Disallow: /*?*
Disallow: */trackback/

User-agent: Googlebot
# disallow all files ending with these extensions
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.cgi$
Disallow: /*.xhtml$
Disallow: /*.php*
Disallow: */trackback*
Disallow: /*?*
Disallow: /z/
Disallow: /wp-*
Allow: /wp-content/uploads/

# allow google image bot to search all images
User-agent: Googlebot-Image
Allow: /*

# allow adsense bot on entire site
User-agent: Mediapartners-Google*
Disallow: /*?*
Allow: /about/
Allow: /contact/
Allow: /wp-content/
Allow: /tag/
Allow: /*.php$
Allow: /*.js$

# disallow archiving site
User-agent: ia_archiver
Disallow: /

# disable duggmirror
User-agent: duggmirror
Disallow: /
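
The * and $ characters in the example above are pattern-matching extensions: Google and other major crawlers support them, but not every parser does (Python's `urllib.robotparser`, for instance, treats paths as plain prefixes). The sketch below shows one way such a pattern might be matched, assuming * means "any run of characters" and a trailing $ means "end of the URL". The function name `pattern_matches` and the sample paths are made up for this illustration.

```python
import re


def pattern_matches(pattern, path):
    """Return True if a robots.txt path pattern matches the start of `path`."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" where a "*" appeared.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, which mirrors prefix matching.
    return re.match(regex, path) is not None


print(pattern_matches("/*.php$", "/index.php"))
print(pattern_matches("/*.php$", "/index.php?x=1"))
print(pattern_matches("/*?*", "/page?id=3"))
print(pattern_matches("/wp-*", "/wp-login.php"))
```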

Summary

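Robots.txt is very easy to use and can be very useful for keeping well-behaved crawlers out of your private and admin areas. The file is publicly readable and is not a form of access control, so anything sensitive still needs a password. If you are having trouble with it, please let me know :)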

Written by Kevin Muldoon from Blog Themes Club
Posted on May 14th, 2007 and filed under Search Engine Optimisation

3 Responses to “Robots.txt basics”

  1. I might wait to spend some time on this one when you get back.

    I have been wondering what the buzz was about this and I’m sure it will become important to me as my site begins to grow to prevent the nasty crawlers out there that can slide by your website.

    I have seen a few sites that show up in google and they don’t index pretty.

  2. Kevin, this is a terrific post, wonderful.
    I got everything I wanted to know about robots.txt in one place, and that's your informative post. Many thanks for it.
