Monday, April 12, 2004

Beware GoogleBot

I get notifications via email whenever anything changes on the FlexWiki wiki. It's one of the coolest features of FlexWiki, IMO, because I like for notifications to arrive via email/RSS (they're the same to me because of NewsGator). Anyway, it's often my habit to review the changes that have been made to the wiki while going through my morning mail.


Today, I saw something annoying. Someone had gone through and changed a massive number of pages on the wiki, including some of the pages that I'm responsible for maintaining. I went to the wiki and found that my pages had been rolled back to previous versions. It was the work of only a few seconds to restore them to their correct condition, but I was bothered that someone had changed them, whether out of malice or out of ignorance. All the more so since it looked like they'd done the same to dozens of other pages.


While I was there, I noticed that David Ornstein, the spiritual leader of the FlexWiki effort, had made some changes to the site layout using his new and ultra-cool WikiTalk engine. Hoping that error and not vandalism was at the root of the problem, I threw a question out on the FlexWiki mailing list to ask whether the new layout might be involved.


Not too much later, Tommy Williams, another FlexWiki contributor and a really bright guy, came up with a theory. If he's right, it's a doozy!


Tommy noticed that the IP address of the offender is owned by Google. So he figures that the GoogleBot came in and visited the “restore this page to a previous version” link that's on every page in the new layout (it used to be in a dropdown). And of course, we don't have any “hey Google, don't follow this link” magic on those pages. Although obviously we need some, even if that's not what happened in this particular case.
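
For what it's worth, the usual magic seems to be either a robots.txt file or a robots META tag in the HEAD of the page in question. Something like this should keep a well-behaved crawler from indexing a page or following any of its links:

    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

On a wiki you'd presumably only want that on the version-history and action pages, not on the content pages you actually want indexed, and it only helps with crawlers that bother to honor it.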


Take this as a warning. If you have any unprotected links to “delete” or “order” or “trigger webmaster's ejection seat” functionality on your webpages, and you're assuming that no human would mistakenly click on them, remember that not all visitors to your website are human.
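
The fix, it turns out, is pretty simple: move the destructive action from a plain link to a button that POSTs a form, ideally with a confirmation prompt so a human has to say yes (that's what we ended up doing - see the comments). A rough sketch of the idea, with a made-up RestoreVersion.aspx URL and made-up field names rather than FlexWiki's actual markup:

    <form method="post" action="RestoreVersion.aspx">
      <input type="hidden" name="topic" value="SomeTopic" />
      <input type="hidden" name="version" value="42" />
      <input type="submit" value="Restore this version"
             onclick="return confirm('Really restore this topic to the older version?');" />
    </form>

A crawler will cheerfully follow an HREF, but it isn't going to fill in and submit a form or answer a JavaScript confirm(). And the HTTP spec says GET is supposed to be safe and idempotent anyway, so destructive actions really belong behind a POST.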

9 comments:

  1. Hmm... Yeah, that would be a pretty big doozie... I wonder if Google has a way of letting you label links that you don't want them to follow. I assume the simplest way to solve this would be to have a JavaScript confirmation dialog verify that you want to revert to the previous page before it actually sends you off.

  2. There are definitely ways to tell search engines to not visit certain pages - we're far from the first people to have this problem - but I have no idea what they are. Something about a robots.txt file. I'm not sure that would work for us, since we could wind up installing anywhere in a site's hierarchy.

    The dialog box is a pretty darn good idea, actually.

  3. I am not sure whether you want the page indexed at all, but there is a metadata tag that disallows indexing, which Google honors [1]. See [2] for more information. And, as you mentioned, the robots.txt file works as well (there's a sketch after the comments). Google plays nice.

    [1] <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
    [2] http://www.google.com/webmasters/3.html#B3

  4. Googlebot follows HREF links and SRC links, like a link checker would.

    I wouldn't want a link checker to be able to easily delete the data in my site or Wiki.

    Solution: change the 'action' links to buttons.

    Most Wiki save pages and blog comment save pages do the same, for a reason.

  5. Yeah, that's what we did: changed it to a button.

  6. This sort of thing is exactly why the HTTP RFC says that GET should be "Safe and Idempotent". You should never put a destructive action behind a simple link.

  7. Also, some people mentioned using a META tag or robots.txt to prevent a spider from hitting a destructive link. This works for Google, but there are lots of spiders out there that don't respect robots.txt and don't parse META tags. About a year ago, Mark Pilgrim documented about 50 badly behaved robots that had hit his site. Sure, you can ban them after the fact, but that's after you've had to undo the destruction.

  8. Maybe I'm missing something, but don't most of these "destructive" actions require proper authentication and authorization? What I'm wondering is how in the world these robots are able to grant themselves proper authentication and authorization to do such things in the first place.

  9. Never mind. I just found out that these wikis by nature allow anonymous editing - I didn't know what a wiki was *doh*. I don't understand the concept of a wiki and why it requires anonymous editing, but here's one solution that's used everywhere to combat these robots! I think...

    Require the visitor to enter the text that is displayed in an image on the page. You know what I'm talking about. You can see this in action on many sites, including Network Solutions when trying to do a whois lookup. I guess that doesn't change the fact that you still can't have a simple hyperlink - it's a bad idea anyway - but you can have a relatively small form that submits via GET. Or a simple bit of JavaScript can pick up the entry before following the link, without a form.

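A couple of the comments mention robots.txt, so here's a minimal sketch of that approach. The path is hypothetical, since FlexWiki's URLs will vary by installation, and as comment 7 points out it only helps with spiders that honor it:

    # robots.txt, served from the root of the site (paths are hypothetical)
    User-agent: *
    Disallow: /wiki/RestoreVersion.aspx

The file has to live at the root of the site, which is exactly the wrinkle comment 2 raises: the wiki might be installed several levels down and have no control over it.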