{"id":2244,"date":"2012-05-27T18:04:22","date_gmt":"2012-05-28T01:04:22","guid":{"rendered":"http:\/\/www.sheer.us\/weblogs\/?p=2244"},"modified":"2012-05-27T18:04:22","modified_gmt":"2012-05-28T01:04:22","slug":"www-80legs-com-stopping-them-technical","status":"publish","type":"post","link":"http:\/\/www.sheer.us\/weblogs\/uncategorized\/www-80legs-com-stopping-them-technical","title":{"rendered":"www.80legs.com &#8211; stopping them &#8211; technical"},"content":{"rendered":"<p>Recently, one of the sites I admin for came under what I would refer to as a DDoS attack by http:\/\/www.80legs.com\/webcrawler.html.<\/p>\n<p>This claims to be an ordinary web spider, but it does some things that other web spiders don&#8217;t:<\/p>\n<p>1) It makes between 20 and 100 connections to the server, from different IP addresses<br \/>\n2) It makes requests as fast as the server will answer<\/p>\n<p>Now, for a web server with flat files, this is fine. But this particular web server had very complex database content that involved a lot of joins and multiple queries to build each page. It runs on a fairly powerful box &#8211; four of them, actually &#8211; but it still wasn&#8217;t up for 100 connections querying as fast as it would respond. I think probably most database-driven sites would have some problems with this.<\/p>\n<p>As 80legs points out on their web site, blocking them by IP will not work because they are a distributed engine spanning thousands of IPs. Kind of like a botnet. And their indexing is user-driven: that is, you can pay them to index a particular site for you. Good way to mess with your competitors. \ud83d\ude09<\/p>\n<p>Anyway, my solution was simple and elegant. We already use haproxy to distribute load among the web servers, so I just pulled out the &#8216;tarpit&#8217; and wrote a quick regex. For those of you not familiar with haproxy, it&#8217;s a single-threaded, non-blocking daemon (Oh, I love those! Just like ew-too!) 
that proxies web requests to servers, automatically adjusts when servers go down, and has a bunch of neat features. It&#8217;s free software, and it has worked extremely well for us.<\/p>\n<p>Anyway, I stuck the following in haproxy.cfg:<\/p>\n<p>reqitarpit ^User-Agent: .*www.80legs.com.*<\/p>\n<p>Goodbye, 80legs. Have fun hanging out in 30-second-delay-for-any-request land \ud83d\ude09<\/p>\n<p>For those of you who haven&#8217;t set up haproxy before, it&#8217;s pretty trivial. It can run on the same box as your web server and just attach to a different interface (i.e. bind the web server to localhost and haproxy to the outside interface) or a different machine, or whatever. It&#8217;s a very lightweight load, as STNB (single-threaded, non-blocking) things tend to be.<\/p>\n<p>Random factoid for those of you not familiar with ew-too &#8211; the reason ew-too was written STNB is that it was originally designed to run on university computers, and be such a light load that the administrators never noticed it &#8211; on machines that were the equivalent of a 486. With a hundred people or more connected. STNB is a very clever approach for situations that it works for.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Recently, one of the sites I admin for came under what I would refer to as a DDoS attack by http:\/\/www.80legs.com\/webcrawler.html. 
This claims to be an ordinary web spider, but it does some things that other web spiders don&#8217;t: 1) It makes between 20 and 100 connections to the server, from different IP addresses 2) [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"http:\/\/www.sheer.us\/weblogs\/wp-json\/wp\/v2\/posts\/2244"}],"collection":[{"href":"http:\/\/www.sheer.us\/weblogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.sheer.us\/weblogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.sheer.us\/weblogs\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/www.sheer.us\/weblogs\/wp-json\/wp\/v2\/comments?post=2244"}],"version-history":[{"count":1,"href":"http:\/\/www.sheer.us\/weblogs\/wp-json\/wp\/v2\/posts\/2244\/revisions"}],"predecessor-version":[{"id":2245,"href":"http:\/\/www.sheer.us\/weblogs\/wp-json\/wp\/v2\/posts\/2244\/revisions\/2245"}],"wp:attachment":[{"href":"http:\/\/www.sheer.us\/weblogs\/wp-json\/wp\/v2\/media?parent=2244"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.sheer.us\/weblogs\/wp-json\/wp\/v2\/categories?post=2244"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.sheer.us\/weblogs\/wp-json\/wp\/v2\/tags?post=2244"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}