Block spam bots and evil web scrapers
Today I noticed a huge CPU load on one server. A quick look at netstat showed that most of the current traffic was coming from someone in Africa. I cross-referenced the IP address from netstat with the access logs of the various sites on the server and saw that it was sending a User-Agent of 'DTS Agent'. A quick Google search showed this to be a scraper for email contacts.
It was time to add a little something to reduce these spam bots' ability to see my sites. Adding the following code to an .htaccess file in the root of the site structure helps to curb evil, good-for-nothing spam bots:
[Last Updated: April 7th 2009]
RewriteEngine On
RewriteCond %{HTTP:User-Agent} (?:Alexibot|Art-Online|asterias|BackDoorbot|Black.Hole|BlackWidow|BlowFish|botALot|BuiltbotTough|Bullseye|BunnySlippers|Cegbfeieh|Cheesebot|CherryPicker|ChinaClaw|CopyRightCheck|cosmos|Crescent|Custo|DISCo|DittoSpyder|Download\sDemon|eCatch|EirGrabber|EmailCollector|EmailSiphon|EmailWolf|EroCrawler|Express\sWebPictures|ExtractorPro|EyeNetIE|FlashGet|Foobot|FrontPage|GetRight|GetWeb!|Go-Ahead-Got-It|Go!Zilla|GrabNet|Grafula|Harvest|hloader|HMView|httplib|HTTrack|humanlinks|Image\sStripper|Image\sSucker|Indy\sLibrary|InfonaviRobot|InterGET|Internet\sNinja|Jennybot|JetCar|JOC\sWeb\sSpider|Kenjin.Spider|Keyword.Density|larbin|LeechFTP|Lexibot|libWeb/clsHTTP|LinkextractorPro|LinkScan/8.1a.Unix|LinkWalker|lwp-trivial|Mass\sDownloader|Mata.Hari|Microsoft.URL|MIDown\stool|MIIxpc|Mister.PiX|Mister\sPiX|moget|Mozilla/3.Mozilla/2.01|Mozilla.*NEWT|Navroad|NearSite|NetAnts|NetMechanic|NetSpider|Net\sVampire|NetZIP|NICErsPRO|NPbot|Octopus|Offline.Explorer|Offline\sExplorer|Offline\sNavigator|Openfind|PageGrabber|Papa\sFoto|pavuk|pcBrowser|Program\sShareware\s1|ProPowerbot/2.14|ProWebWalker|psbot/0.1|QueryN.Metasearch|ReGet|RepoMonkey|RMA|SiteSnagger|SlySearch|SmartDownload|Spankbot|spanner|Superbot|SuperHTTP|Surfbot|suzuran|Szukacz/1.4|tAkeOut|Teleport|Teleport\sPro|Telesoft|The.Intraformant|TheNomad|TightTwatbot|Titan|toCrawl/UrlDispatcher|True_Robot|turingos|Turnitinbot/1.5|URLy.Warning|VCI|VoidEYE|WebAuto|WebBandit|WebCopier|WebEMailExtrac.*|WebEnhancer|WebFetch|WebGo\sIS|Web.Image.Collector|Web\sImage\sCollector|WebLeacher|WebmasterWorldForumbot|WebReaper|WebSauger|Website\seXtractor|Website.Quester|Website\sQuester|Webster.Pro|WebStripper|Web\sSucker|WebWhacker|WebZip|Wget|Widow|WWW-Collector-E|WWWOFFLE|Xaldon\sWebSpider|Xenu's|Zeus|DTS\sAgent) [NC]
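# Let the error page itself through; otherwise a blocked agent is also refused /403.html
RewriteCond %{REQUEST_URI} !^/403\.html$
RewriteRule .* - [F]
ErrorDocument 403 /403.html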
# IF THE UA CONTAINS ANY OF THESE (the leading .* lets each pattern match anywhere in the string)
SetEnvIfNoCase ^User-Agent$ .*(aesop_com_spiderman|backweb|bandit|batchftp|bigfoot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(black.?hole|blackwidow|blowfish|botalot|buddy|builtbottough|bullseye) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(cheesebot|cherrypicker|chinaclaw|collector|copier|copyrightcheck) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(cosmos|crescent|diibot|dittospyder|dragonfly) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(drip|easydl|ebingbong|ecatch|eirgrabber|emailcollector|emailsiphon) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(emailwolf|erocrawler|exabot|eyenetie|filehound|flashget|flunky) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(frontpage|getright|getweb|go.?zilla|go-ahead-got-it|gotit|grabnet) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(grafula|harvest|hloader|hmview|httrack|humanlinks|ilsebot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(infonavirobot|infotekies|interget|iria|jennybot|jetcar) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(justview|jyxobot|kenjin|keyword|larbin|leechftp|lexibot|lftp) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(likse|linkscan|linkwalker|lnspiderguy|magnet|mag-net|markwatch) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(mata.?hari|memo|microsoft.?url|midown.?tool|miixpc|mirror|missigua) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(mister.?pix|moget|mozilla.?newt|nameprotect|navroad|backdoorbot|nearsite) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(net.?vampire|netants|netmechanic|netspider|nextgensearchbot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(attach|nicerspro|nimblecrawler|npbot|octopus|offline.?explorer) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(offline.?navigator|openfind|outfoxbot|pagegrabber|pavuk) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(pcbrowser|php.?version.?tracker|pockey|propowerbot|prowebwalker) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(psbot|pump|queryn|recorder|realdownload|reaper|true_robot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(repomonkey|internetseer|sitesnagger|siphon|slysearch|smartdownload) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(snake|snapbot|snoopy|sogou|spacebison|spankbot|spanner|sqworm|superbot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(superhttp|surfbot|asterias|suzuran|szukacz|takeout|teleport) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(telesoft|the.?intraformant|thenomad|tighttwatbot|titan|urldispatcher) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(turingos|urly.?warning|vacuum|voideye|whacker) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(widow|wisenutbot|wwwoffle|xaldon|xenu|zeus|zyborg|anonymouse) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*web(zip|emaile|enhancer|fetch|go.?is|auto|bandit|clip|copier|reaper|sauger|site.?quester|whack) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(craftbot|download|extract|stripper|sucker|ninja|clshttp|webspider|leacher|collector|grabber|webpictures) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(dts\sagent) HTTP_SAFE_BADBOT
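One thing to watch: the SetEnvIfNoCase lines only set the HTTP_SAFE_BADBOT environment variable. On their own they do not refuse anything, so an access rule has to act on that variable. A minimal sketch of such a rule follows, assuming Apache 2.2-style access control and that the host allows these directives in .htaccess (AllowOverride Limit). The commented alternative assumes Apache 2.4 with mod_authz_core. Neither form comes from the original listing.

```apache
# Assumption: Apache 2.2 (mod_authz_host). On 2.4 these three lines
# only work if mod_access_compat is loaded.
Order Allow,Deny
Allow from all
Deny from env=HTTP_SAFE_BADBOT

# Assumption: Apache 2.4 equivalent (mod_authz_core). Use instead of the above.
# <RequireAll>
#     Require all granted
#     Require not env HTTP_SAFE_BADBOT
# </RequireAll>
```

With either form in place, any request whose User-Agent matched one of the SetEnvIfNoCase patterns gets a 403, the same result as the RewriteRule section.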
This code is a cleaned-up version of the code found here and here. Note that you really should look through all the User Agents and make sure you are not blocking a visitor or a piece of software that you would like to keep. The list above is far from complete, but it is a good start.
For a fairly up-to-date list of User Agents, there is a useful User Agent database here.
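If you do find an agent in the list that you want to keep, the simplest fix is to delete its name from the patterns. Where that is awkward, an explicit exemption can work too. The sketch below uses Wget purely as an illustrative example of a tool to keep, and the block lists are shortened stand-ins for the full ones above.

```apache
# Sketch only: "Wget" is a hypothetical choice; the lists here are shortened.

# mod_rewrite version: consecutive RewriteCond lines are ANDed, so a
# negated condition ahead of the block list exempts that agent.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !Wget [NC]
RewriteCond %{HTTP_USER_AGENT} (?:HTTrack|WebZip|Wget|DTS\sAgent) [NC]
RewriteRule .* - [F]

# SetEnvIf version: a later line with !VARNAME unsets the flag again,
# so it must come after the lines that set it.
SetEnvIfNoCase User-Agent (wget|httrack|dts\sagent) HTTP_SAFE_BADBOT
SetEnvIfNoCase User-Agent wget !HTTP_SAFE_BADBOT
```

The exemption keeps the original list intact, which makes it easier to merge in a newer copy of the list later.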


Nice cleaned-up version plattapuss.. Very smart code.