Vcoderz Community - View Single Post

god · 08-06-2007

I thought I'd give you guys an idea about the robots.txt file!
As we all know, the Google crawler (the bot that fetches pages into Google's index, so you can search them) follows every link it can reach, so it often ends up indexing pages the site owner never meant people to find, like an /admin/ folder or an install script. It can't get into a folder that is properly .htaccess'd (password protected) any more than you can; it only sees what is publicly reachable.
Now website owners don't want the contents of these directories turning up in a Google search or in the Google cache, so they should tell Google to stay out of them. How? In the site's root directory, they make a file called robots.txt,
for example http://www.********.com/robots.txt
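Since the file always sits at the root of the host, you can work out its address from any page on the site. A rough Python sketch of that idea (the function name `robots_url` and the example.com address are made up for illustration, they are not from any real site):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that page_url belongs to."""
    parts = urlsplit(page_url)
    # Keep only scheme and host; robots.txt lives at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL, used only as an example.
print(robots_url("http://www.example.com/forum/showthread.php?t=1"))
```

Whatever page you start from, the path and query string are thrown away and replaced by /robots.txt.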
the robots.txt file looks like this:

Code:

# everything after a "#" is a comment and is ignored
# you can write anything here !
# ammouna
# "User-agent" says which crawler the rules below apply to; "*" means all crawlers
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /cache/
Disallow: /class/
Disallow: /images/
Disallow: /include/
Disallow: /install/
Disallow: /kernel/
Disallow: /language/
Disallow: /templates_c/
Disallow: /themes/
Disallow: /uploads/

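so well-behaved crawlers like Google's won't crawl these directories. That does not make them secure, though! robots.txt is only a request: anyone can open it, and a bot or a person can simply ignore it. The password protection (.htaccess) is what actually keeps people out.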
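If you want to see how a crawler that respects the file reads these rules, Python's standard library has a parser for it. A small sketch, assuming a shortened copy of the rules above; the helper name `is_allowed`, the "Googlebot" agent string and the example.com URLs are just for illustration:

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the rules from the post.
RULES = """\
# everything after a "#" is a comment
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /uploads/
"""


def is_allowed(rules_text, user_agent, url):
    """Return True if a crawler that obeys rules_text may fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(RULES, "Googlebot", "http://www.example.com/tmp/file.txt"))
print(is_allowed(RULES, "Googlebot", "http://www.example.com/index.html"))
```

The first call should report that /tmp/ is off limits and the second that the home page is fine. Keep in mind this only tells you what a polite crawler would do; nothing stops a browser or a rude bot from requesting the URL anyway.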
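Now, how can you actually use this info? It depends.

Anyone can open a site's robots.txt, so it shows you which directories the owner wants kept out of search results. That can be quite useful, and sometimes meaningless!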
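To give an idea of how little effort it takes to read the list out of the file, here is a rough Python sketch that pulls every Disallow path out of a robots.txt text. The function name `disallowed_paths` and the sample text are made up for the example, and it ignores which User-agent group a rule belongs to:

```python
SAMPLE = """\
# sample rules, made up for this example
User-agent: *
Disallow: /cgi-bin/   # scripts
Disallow: /tmp/
Disallow:
"""


def disallowed_paths(robots_text):
    """Collect the path of every non-empty Disallow line in robots_text."""
    paths = []
    for raw_line in robots_text.splitlines():
        # Drop comments, then surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        # An empty Disallow means "nothing is blocked", so skip it.
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


print(disallowed_paths(SAMPLE))
```

Whether the paths it prints lead anywhere interesting depends entirely on the site; if the directories are password protected, knowing their names gets you nothing.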

the most famous robots.txt file on the net is ..... uh ... I forgot it lol, I'll update it later..

ya, I found it

here it is: www.whitehouse.gov/robots.txt
check it out, it's OK to open it :P

Last edited by god; 08-06-2007 at 01:16 PM.