Python Web Crawler

Spiders and webs

Web spiders can be very useful to a website administrator. In addition to indexing your sites, you can use one to load-test the server, and an intelligent crawler can even simulate normal, moderate, and high levels of web traffic, letting you benchmark your website and server performance.

While any web technology has potentially malicious applications, the web crawler is useful enough that having one handy is a good idea.  Here is my implementation, using Python's multiprocessing, urllib, and socket modules.

The most difficult part of this endeavor was managing the asynchronous aspects of multiprocessing. I started with multiprocessing.Pipe for inter-process communication but ended up switching to multiprocessing.Queue for its flexibility.

The sending and receiving queues are created first.  After initialization, the number of processes set by the -p | --procs flag are started and appended to the jobs list.  The main process then sleeps for the time specified by --maxtime (defaults to 20s).  Once time is up (or after an unhandled exception), the jobs are killed via the kill_jobs() function.
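To make that flow concrete, here is a minimal sketch of the spawn/sleep/kill cycle. Aside from the jobs list and kill_jobs(), which the script actually uses, the names below (worker, kill_q, stats_q) are illustrative rather than taken from PokeyCrawl.py:

import time
import multiprocessing as mp

def worker(kill_q, stats_q):
    crawled = 0
    while kill_q.empty():          # run until a kill signal arrives
        # ... fetch and parse the next URL here ...
        crawled += 1
        time.sleep(0.25)           # the -s | --speed delay between "clicks"
    stats_q.put({'crawled': crawled})

def kill_jobs(jobs, kill_q):
    for _ in jobs:
        kill_q.put('STOP')         # one signal per worker
    for job in jobs:
        job.join(2)
        if job.is_alive():
            job.terminate()        # hard kill if the worker ignores the signal

if __name__ == '__main__':
    kill_q, stats_q = mp.Queue(), mp.Queue()    # sending / receiving queues
    jobs = []
    for _ in range(2):                          # -p | --procs
        p = mp.Process(target=worker, args=(kill_q, stats_q))
        p.start()
        jobs.append(p)
    time.sleep(20)                              # --maxtime (defaults to 20s)
    kill_jobs(jobs, kill_q)
    # stats_q now holds one stats dict per worker, to be drained afterward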

Collecting the results gave me some trouble: the blocking behavior of Pipe was preventing data from being returned.  Using two queues, one to send kill signals and one to receive stats, may be overkill, but it has worked in every case so far without a noticeable performance hit at this level of execution.

* Note: the 'Empty' exception is imported from Python's standard Queue module, not from multiprocessing.Queue: 'from Queue import Empty'
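Continuing the sketch above, draining the stats queue without blocking looks roughly like this (again, the names are illustrative, not the actual PokeyCrawl.py source):

from Queue import Empty   # Python 3: from queue import Empty

def drain(stats_q):
    # Pull every queued stats dict without blocking the main process.
    results = []
    while True:
        try:
            results.append(stats_q.get(block=False))
        except Empty:
            break
    return results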

count_beans collects the stats returned by each process and aggregates them.  Handling this in the main process alongside multiprocessing may lead to slight discrepancies in the counts, but the multiprocessing.Process class lacks a callback mechanism for better statistical polling (unlike multiprocessing.Pool's apply_async and friends).
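As a rough idea of what that aggregation can look like (the real count_beans in PokeyCrawl.py may differ in its fields and structure):

def count_beans(results):
    # Combine the per-process stats dicts into one set of totals.
    totals = {'crawled': 0, 'errors': 0}
    for stats in results:                 # one dict per worker process
        for key, value in stats.items():
            totals[key] = totals.get(key, 0) + value
    return totals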

PokeyCrawl.py

Usage

While there are a number of options to customize this web crawler's behavior, the most useful are the following (a sample invocation appears after both lists):

  • -r | --report – prints a post-execution summary regardless of how the crawl ends.  The report contains a count of the links visited, the time taken to visit each link, and certain error statistics.
  • -s | --speed – sets the delay between "clicking" links; lowering it increases the crawl rate, simulating a higher visitor load.
  • -v | --vary – varies the user-agent string.  Currently this is limited to a handful of UA strings since the tool is intended for load testing only; included are a couple of mobile browser agents and at least three desktop UA strings.
  • --silent – silences the per-URL "crawl" messages for quieter execution.  Maintenance-type messages (e.g. job spawned, job killed, time up) are still shown.
  • --maxtime – the maximum execution time; jobs are terminated once this time has elapsed and statistics are gathered.  Due to some nuances in how the URL history is kept, there is some variance in accuracy: links may be listed as crawled even though they had not finished loading, and those are not reflected in the count.

Lesser-used options:

  • --verbose – if present, full header information is printed for each request.  Recommended with only one process (this may be strictly enforced in future versions).
  • --robots – follows robots.txt directives (or tries to; this is experimental).
  • --gz – loads gzip-compressed content.
  • --ua – specifies a user-agent string.
  • -p | --procs – use with -s | --speed to control the rate at which sites are crawled.  The maximum is 5 processes, which is a heavy load as-is; adapt it for more if you need to.
  • -d | --debug – raises errors and provides other feedback during execution; recommended with --silent to suppress the crawl messages.
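A hypothetical invocation, assuming the target URL is passed as a positional argument and the flags take the values shown (the actual argument layout of PokeyCrawl.py may differ):

python PokeyCrawl.py http://example.com -r -p 2 -s 0.5 --maxtime 30 --silent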

