DNN Blog
Posted by:
Anthony Glenwright
Saturday, May 19, 2007
I discovered something interesting this week using the Google webmaster tools and thought I'd share it. I had a look at the diagnostic tools (many of these are only available if you "verify" your site), and Google listed several hundred URLs (mostly forum posts) that it had not indexed, with an explanation that it did not index the pages because it did not find a robots.txt file for my site.
A robots.txt file is used to provide instructions to web "robots" like the one that Google uses to index web sites. The absence of a robots.txt is generally meant to tell robots to go ahead and index everything, but from the results I saw in Google diagnostics, it appears that you will get a better indexing result from Google by explicitly including a robots.txt file in your site.
Here's my robots.txt: I ended up adding all the sub-folders in DNN except for /Portals, even though many of them would never get linked to anyway. (For example, including /bin is a bit of overkill, since no page would ever have a link to its contents).
User-agent: *
Disallow: /Admin/
Disallow: /App_Browser/
Disallow: /App_Code/
Disallow: /App_Data/
Disallow: /App_GlobalResources/
Disallow: /bin/
Disallow: /Components/
Disallow: /Config/
Disallow: /Controls/
Disallow: /DesktopModules/
Disallow: /Documentation/
Disallow: /Install/
Disallow: /js/
Disallow: /Providers/
Disallow: /Resources/
A simpler alternative, which explicitly tells robots they may index everything, would be:
User-agent: *
Disallow:
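To see what a set of rules like these actually permits, they can be fed to a parser. The following is a minimal sketch using Python's standard urllib.robotparser module. The host www.example.com and the sample paths are made up for illustration, only a few of the Disallow lines are repeated, and Google's own crawler may interpret a file differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the folder-by-folder rules (assumption: three lines
# are enough to illustrate the behaviour).
DNN_RULES = """\
User-agent: *
Disallow: /Admin/
Disallow: /bin/
Disallow: /DesktopModules/
"""

# The "allow everything" file: an empty Disallow value blocks nothing.
ALLOW_ALL_RULES = """\
User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    site = "http://www.example.com"  # hypothetical host
    for path in ("/Admin/Users.aspx", "/Portals/0/logo.gif", "/Default.aspx"):
        print(path,
              is_allowed(DNN_RULES, site + path),
              is_allowed(ALLOW_ALL_RULES, site + path))
```

Matching is by path prefix and is case-sensitive in this parser, so /Admin/ and /admin/ are treated as different folders.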
I couldn't find any documentation to back this observation up, but after I added the robots.txt file above and waited a couple of days, the errors disappeared from my Google diagnostics page, and the pages were indexed. If anyone else out there is using the Google webmaster tools and can check to see if their results match mine, I'd appreciate it if you'd let me know by posting a blog comment.
14 comment(s) so far...
Re: A simple SEO tip - create a robots.txt
http://www.robotstxt.org/
By Sébastien on
Sunday, May 20, 2007
|
Re: A simple SEO tip - create a robots.txt
Where did you place the robots.txt file, in your root web folder? I wonder if it would make sense to start including this file in distributions?
By mattchristenson on
Sunday, May 20, 2007
|
Re: A simple SEO tip - create a robots.txt
If there's no web page within a directory that should be excluded from indexing, then don't add such a directory to the robots.txt file! This makes the robots.txt file smaller (faster load, less bandwidth) and hides those directories from robots/spiders and hackers!
The robots.txt file should only contain web pages that should be excluded from indexing, or directories that only contain those web pages.
That means that a robots.txt file such as
User-agent: *
Disallow: /Admin/
Disallow: /App_Browser/
Disallow: /App_Code/
Disallow: /App_Data/
Disallow: /App_GlobalResources/
Disallow: /bin/
Disallow: /Components/
Disallow: /Config/
Disallow: /Controls/
Disallow: /DesktopModules/
Disallow: /Documentation/
Disallow: /Install/
Disallow: /js/
Disallow: /Providers/
Disallow: /Resources/
makes no sense, because there's nothing in these directories a search engine has access to (no links) or will index (binaries). If I'm wrong, please let me know. :) In order to prevent files from being read anonymously (robots/spiders/humans), just remove anonymous read permissions in your web server.
Anyway, it would make more sense to include a file such as ErrorPage.aspx, for example: Disallow: /ErrorPage.aspx
However, the robots.txt file only prevents web pages from being indexed as long as the search engine obeys the robots.txt file!
A good way to inform search engines about pages that are available for crawling is to use a sitemap file. This standard has been adopted by Google, MSN, Yahoo, and others.
More about sitemaps can be found at www.sitemaps.org
More about the robots.txt exclusion protocol can be found at www.robotstxt.org
More about file types indexed by search engines can be found at http://www.netmechanic.com/news/vol5/promo_no10.htm
Finally, it doesn't generally affect the indexing process negatively if there's no robots.txt file. However, the robots.txt file makes it easier for search engines to skip files that shouldn't be indexed! Therefore, the crawling process is faster, resulting in more indexed pages!
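For readers who have not seen one, the sitemap file mentioned above is a small XML document listing page URLs. Here is a rough sketch of building one with Python's standard library. The example URLs are hypothetical, and only the required loc element from the sitemaps.org format is written; this is not how any particular CMS generates its sitemap.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as a string) with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" when serializing.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    pages = [  # hypothetical pages
        "http://www.example.com/Home/tabid/36/Default.aspx",
        "http://www.example.com/Forums/tabid/40/Default.aspx",
    ]
    print(build_sitemap(pages))
```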
By deanman1 on
Sunday, May 20, 2007
|
Re: A simple SEO tip - create a robots.txt
I would tend to agree with Anthony and include the disallowed folders as he has. My only question would be about desktopmodules. I would think that 99% of what is in there you would want disallowed, but are there any cases where modules load themselves in that URL?
By daxdavis on
Monday, May 21, 2007
|
Re: A simple SEO tip - create a robots.txt
Sitemap has been adopted by DNN from 4.5.0 (and is automatically generated). Check out http://www.dotnetnuke.com/sitemap.aspx for instance. If you have 4.5.0 or above, try adapting the URL to your website and prepare to be pleasantly surprised.
Interesting thing about setting permissions: I never thought about that, and it sounds like a very good idea. Unfortunately my hosting provider is not letting me set permissions (according to IE7 FTP it is not supported), so I'm emailing them about this.
By NukeAlexS on
Tuesday, May 22, 2007
|
Re: A simple SEO tip - create a robots.txt
Ah, two things to add to this. a) Protecting a folder causes a grey login/password box to appear in the browser. Shouldn't be a problem though I guess (or could it be?). b) Don't the .ascx files need to be read by the anonymous user, or does the ASP.NET process handle this? (I would guess the latter is true.)
Anyway my hosting provider won't let me change the permissions at all :(
By NukeAlexS on
Tuesday, May 22, 2007
|
Re: A simple SEO tip - create a robots.txt
FYI: A robots.txt file goes in the root folder of your website.
I'd also like to note that I don't agree with deanman1's comment about removing read permissions; this would likely disrupt your website.
But it makes sense to include disallow lines for errorpage.aspx and rss.aspx.
The new sitemaps feature is great; for many users this will be a complete solution for "publishing" their site to Google.
This blog is about the "other" pages on your site - that is, blog posts and forum posts - that don't appear in a sitemap.
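On the root folder point above: robots fetch the file from /robots.txt at the top of the host, whatever page they are about to crawl. A small sketch of that mapping in Python (the function name robots_url is just for illustration):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL a crawler would request for `page_url`."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace path, query and fragment with /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("http://www.dotnetnuke.com/sitemap.aspx"))
    # prints http://www.dotnetnuke.com/robots.txt
```

So the file has to be reachable at that web address; where it sits on disk depends on how the site's root is mapped by the web server.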
By anthony.glenwright on
Tuesday, May 22, 2007
|
Re: A simple SEO tip - create a robots.txt
Also wanted to comment on "Finally, it doesn't generally affect the indexing process negatively if there's no robots.txt file".
This was my understanding also. But my observations in the Google webmaster tools as noted in the blog indicate otherwise. That's what the blog is about!
By anthony.glenwright on
Tuesday, May 22, 2007
|
Re: A simple SEO tip - create a robots.txt
Yes you are right, creating a robots.txt file is very important for your SEO. I would just like to add that you should use robots.txt to exclude the login, register, terms and privacy controls.
Why? - Look at the terms control, the text is 99% identical to the terms of dotnetnuke.com and thousands of other sites using dotnetnuke. So search engines will think that you have copied from dotnetnuke.com and penalise you accordingly.
There is a big article on our website www.bestwebsites.co.nz about this issue.
JK.
By jk_nzd on
Thursday, May 31, 2007
|
Re: A simple SEO tip - create a robots.txt
I'm kind of new to the DNN community. We're using Google sitemaps for all our sites with the portal/subportal structure. How do I specify a robots.txt for a specific subportal? Is there a module that's available?
Thanks...
By astonishresults on
Tuesday, October 06, 2009
|
Re: A simple SEO tip - create a robots.txt
Anthony -
I coordinate the Southern California DotNetNuke Users Group that meets monthly (www.socaldug.org). We do virtual presentations via MS Live Meeting. Would you be able to "meet" with us (virtually) the 2nd Wed. of any month in 2008? We start at 5:30 pm Pacific time so we can include the East Coast. Whatever you would like to present would be interesting to our group, I'm sure, especially the Documents module. Please let me know at dma@dmcma.com.
Dave McMullen
By dma111 on
Tuesday, October 06, 2009
|
Re: A simple SEO tip - create a robots.txt
Not to beat a dead horse too much on this robots.txt, and so late, but where I'm confused is the privacy/terms controls in relation to multiple portals.
If you look on your/my main page, you see the links to privacy and terms in the default skin. These links are rewritten for each portal in the format: http://www.dnnreactor.com/Home/tabid/36/ctl/Privacy/Default.aspx which really, to effectively prevent indexing, would need to resemble: /home/tabid/*/ctl/privacy/default.aspx or something. But wildcards in the disallow like this example do not work. And the tab id number where I put * changes for each and every portal.
So, since a link is being written on the fly, how would you block this for every child portal? Would I put an entry in the robots for each child-portal?
Anyhow I'll start a forum thread for this since it's probably more appropriate there.
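One way to handle the "entry for each child portal" idea without relying on wildcards is to generate the lines from a list of pages. This is only a sketch: the second page name and tab id are invented, and the Terms URL is assumed to follow the same pattern as the Privacy URL quoted above. Some crawlers, Google among them, document * wildcard support as an extension, so the generated list may be unnecessary for those.

```python
def privacy_terms_rules(pages, controls=("Privacy", "Terms")):
    """Build robots.txt text blocking the given controls on each page.

    `pages` is an iterable of (page_name, tab_id) pairs, one per portal
    home page. The URL pattern is an assumption based on the Privacy
    link format quoted in the comment above.
    """
    lines = ["User-agent: *"]
    for page_name, tab_id in pages:
        for ctl in controls:
            lines.append(
                f"Disallow: /{page_name}/tabid/{tab_id}/ctl/{ctl}/Default.aspx"
            )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # ("Home", 36) comes from the URL above; ("ChildHome", 58) is hypothetical.
    print(privacy_terms_rules([("Home", 36), ("ChildHome", 58)]), end="")
```

Keep in mind that a robots.txt file applies per host name, so child portals served from their own domain would each need their own file at that domain's root.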
By chicagojsh on
Tuesday, October 06, 2009
|
Where is the "root folder"?
I'm confused about what the "root folder" is for a DNN portal for purposes of placing the robots.txt file. If I place it as www.domainname.com/robots.txt it sure appears to me that engines like Google are NOT seeing it. The literal path to the portal is, of course, something like: www.rootdomain.com/httpdocs/Portals/15/robots.txt.
So, where does the robots.txt file go?
Thanks,
Ken
By kflorian on
Tuesday, October 06, 2009
|
Re: A simple SEO tip - create a robots.txt
I've got just a single portal on my website. Based on the pages that are delivered, there isn't any content on my site that I want to have captured other than the pages that DNN delivers from its database. Since I am using the friendly URL naming convention, I can just exclude the querystring version of the same pages (thereby avoiding duplicates, and the register, login, privacy and terms pages) using the following robots.txt:
User-agent: *
Disallow: /Default.aspx?
Disallow: /Terms/Default.aspx
Disallow: /Privacy/Default.aspx
My belief is that nothing else is needed in version 4.6 to 4.9 of DNN. If you want to include blog pages and other items that the standard sitemap.aspx doesn't include, then read the article found at www.codeproject.com/KB/aspnet/DNNGoogleSiteMapProvider.aspx
I found that without this robots.txt file, Google's reference to my site was to the register page -- not what I had intended.
Also, don't forget to categorize your site according to Google's directions to the Open Directory Project, found at www.dmoz.org/add.html.
By b b on
Tuesday, October 06, 2009
|