Block spam bots and evil web scrapers

Posted by plattapuss on November 12th, 2008

I noticed a huge CPU load on one server today. A quick look at netstat showed that most of the current traffic was coming from someone in Africa. I cross-referenced the IP address from netstat with the access logs of the various sites on the server and saw that it had a User-Agent of 'DTS Agent'. A quick Google search showed this to be a scraper for email contacts.

It was time to add a little something to reduce these spam bots' ability to see my sites. Adding the following code to an .htaccess file in the root of the site structure helps to curb evil, good-for-nothing spam bots:

[Last Updated: April 7th 2009]

CODE:
  RewriteEngine On

  # Return 403 Forbidden when the User-Agent contains any of these (case-insensitive)
  RewriteCond %{HTTP:User-Agent} (?:Alexibot|Art-Online|asterias|BackDoorbot|Black.Hole|BlackWidow|BlowFish|botALot|BuiltbotTough|Bullseye|BunnySlippers|Cegbfeieh|Cheesebot|CherryPicker|ChinaClaw|CopyRightCheck|cosmos|Crescent|Custo|DISCo|DittoSpyder|DownloadsDemon|eCatch|EirGrabber|EmailCollector|EmailSiphon|EmailWolf|EroCrawler|ExpresssWebPictures|ExtractorPro|EyeNetIE|FlashGet|Foobot|FrontPage|GetRight|GetWeb!|Go-Ahead-Got-It|Go!Zilla|GrabNet|Grafula|Harvest|hloader|HMView|httplib|HTTrack|humanlinks|ImagesStripper|ImagesSucker|IndysLibrary|InfonaviRobot|InterGET|Internet\sNinja|Jennybot|JetCar|JOC\sWeb\sSpider|Kenjin.Spider|Keyword.Density|larbin|LeechFTP|Lexibot|libWeb/clsHTTP|LinkextractorPro|LinkScan/8.1a.Unix|LinkWalker|lwp-trivial|Mass\sDownloader|Mata.Hari|Microsoft.URL|MIDown\stool|MIIxpc|Mister.PiX|Mister\sPiX|moget|Mozilla/3.Mozilla/2.01|Mozilla.*NEWT|Navroad|NearSite|NetAnts|NetMechanic|NetSpider|Net\sVampire|NetZIP|NICErsPRO|NPbot|Octopus|Offline.Explorer|Offline\sExplorer|Offline\sNavigator|Openfind|Pagerabber|Papa\sFoto|pavuk|pcBrowser|Program\sShareware\s1|ProPowerbot/2.14|ProWebWalker|psbot/0.1|QueryN.Metasearch|ReGet|RepoMonkey|RMA|SiteSnagger|SlySearch|SmartDownload|Spankbot|spanner|Superbot|SuperHTTP|Surfbot|suzuran|Szukacz/1.4|tAkeOut|Teleport|Teleport\sPro|Telesoft|The.Intraformant|TheNomad|TightTwatbot|Titan|toCrawl/UrlDispatcher|True_Robot|turingos|Turnitinbot/1.5|URLy.Warning|VCI|VoidEYE|WebAuto|WebBandit|WebCopier|WebEMailExtrac.*|WebEnhancer|WebFetch|WebGo\sIS|Web.Image.Collector|Web\sImage\sCollector|WebLeacher|WebmasterWorldForumbot|WebReaper|WebSauger|Website\seXtractor|Website.Quester|Website\sQuester|Webster.Pro|WebStripper|Web\sSucker|WebWhacker|WebZip|Wget|Widow|[Ww]eb[Bb]andit|WWW-Collector-E|WWWOFFLE|Xaldon\sWebSpider|Xenu's|Zeus|DTS\sAgent) [NC]
  RewriteRule .* - [F]
  ErrorDocument 403 /403.html

  # IF THE UA CONTAINS ANY OF THESE, SET THE HTTP_SAFE_BADBOT VARIABLE
  # (the leading .* lets the name match anywhere in the string, not only at the start)
  SetEnvIfNoCase ^User-Agent$ .*(aesop_com_spiderman|backweb|bandit|batchftp|bigfoot) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(black.?hole|blackwidow|blowfish|botalot|buddy|builtbottough|bullseye) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(cheesebot|cherrypicker|chinaclaw|collector|copier|copyrightcheck) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(cosmos|crescent|diibot|dittospyder|dragonfly) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(drip|easydl|ebingbong|ecatch|eirgrabber|emailcollector|emailsiphon) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(emailwolf|erocrawler|exabot|eyenetie|filehound|flashget|flunky) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(frontpage|getright|getweb|go.?zilla|go-ahead-got-it|gotit|grabnet) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(grafula|harvest|hloader|hmview|httrack|humanlinks|ilsebot) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(infonavirobot|infotekies|interget|iria|jennybot|jetcar) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(justview|jyxobot|kenjin|keyword|larbin|leechftp|lexibot|lftp) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(likse|linkscan|linkwalker|lnspiderguy|magnet|mag-net|markwatch) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(mata.?hari|memo|microsoft.?url|midown.?tool|miixpc|mirror|missigua) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(mister.?pix|moget|mozilla.?newt|nameprotect|navroad|backdoorbot|nearsite) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(net.?vampire|netants|netmechanic|netspider|nextgensearchbot) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(attach|nicerspro|nimblecrawler|npbot|octopus|offline.?explorer) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(offline.?navigator|openfind|outfoxbot|pagegrabber|pavuk) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(pcbrowser|php.?version.?tracker|pockey|propowerbot|prowebwalker) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(psbot|pump|queryn|recorder|realdownload|reaper|true_robot) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(repomonkey|internetseer|sitesnagger|siphon|slysearch|smartdownload) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(snake|snapbot|snoopy|sogou|spacebison|spankbot|spanner|sqworm|superbot) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(superhttp|surfbot|asterias|suzuran|szukacz|takeout|teleport) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(telesoft|the.?intraformant|thenomad|tighttwatbot|titan|urldispatcher) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(turingos|urly.?warning|vacuum|voideye|whacker) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(widow|wisenutbot|wwwoffle|xaldon|xenu|zeus|zyborg|anonymouse) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*web(zip|emaile|enhancer|fetch|go.?is|auto|bandit|clip|copier|reaper|sauger|site.?quester|whack) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(craftbot|download|extract|stripper|sucker|ninja|clshttp|webspider|leacher|collector|grabber|webpictures) HTTP_SAFE_BADBOT
  SetEnvIfNoCase ^User-Agent$ .*(dts\sagent) HTTP_SAFE_BADBOT
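
One thing to watch: the SetEnvIfNoCase lines only set the HTTP_SAFE_BADBOT variable. On their own they do not refuse anything, so something still has to act on that variable. A minimal sketch of how that might look follows. The IfModule wrappers are my own addition, not part of the original listing, and they assume the server lets .htaccess override access control (AllowOverride Limit on Apache 2.2, AuthConfig on 2.4):

```apache
# Sketch only: deny any request that was flagged with HTTP_SAFE_BADBOT.

# Apache 2.4 and later (mod_authz_core is present)
<IfModule mod_authz_core.c>
    <RequireAll>
        Require all granted
        Require not env HTTP_SAFE_BADBOT
    </RequireAll>
</IfModule>

# Apache 2.2 (no mod_authz_core, classic Order/Allow/Deny)
<IfModule !mod_authz_core.c>
    Order Allow,Deny
    Allow from all
    Deny from env=HTTP_SAFE_BADBOT
</IfModule>
```

Either branch should answer flagged requests with a 403, the same status the RewriteRule returns. Test it on a staging copy first, since a mistake in an access rule can lock out every visitor.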

This code is a cleaned-up version of the code found here and here. Look through all the User-Agents before you deploy it, and make sure you are not blocking a visitor or a piece of software that you would like to keep. The list above is far from complete, but it is a good start.
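
If there is a tool in the list that you do want to let through, you do not have to edit the long patterns. A rough sketch of two ways to carve out an exception follows. Xenu (a link checker) is only an arbitrary example here, and the first pattern is shortened for readability:

```apache
# Sketch only: let "Xenu" through even though the lists above match it.

# 1) mod_rewrite: add a negated condition after the main one.
#    RewriteCond lines are ANDed by default, so both must hold to reach [F].
RewriteCond %{HTTP:User-Agent} (?:Wget|Widow|Xenu's|Zeus|DTS\sAgent) [NC]
RewriteCond %{HTTP:User-Agent} !Xenu [NC]
RewriteRule .* - [F]

# 2) SetEnvIf: a later line with !VARIABLE removes the flag again.
#    This line has to come after all the lines that set HTTP_SAFE_BADBOT.
SetEnvIfNoCase ^User-Agent$ .*xenu !HTTP_SAFE_BADBOT
```

Keep in mind that a User-Agent header is trivial to fake, so an exception like this also lets through any bot that claims to be the tool you allowed.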

For a fairly up-to-date list of User-Agents, there is a useful User-Agent database here.



Reader Comments

Nice cleaned-up version, plattapuss. Very smart code.