Crawler Python API

Getting started with Crawler is easy. The main class you need to care about is crawler.main.Crawler.

crawler.main

Main Module

class crawler.main.Crawler(url, delay, ignore)

Main Crawler object.

Example:

c = Crawler('http://example.com')
c.crawl()
Parameters:
  • url – The URL to start crawling from.
  • delay – Number of seconds to wait between requests.
  • ignore – A list of regular expressions for paths to ignore.
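
For example, to wait two seconds between requests and skip blog pages (the argument values here are purely illustrative):

c = Crawler('http://example.com', delay=2, ignore=['blog/$'])
c.crawl()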
crawl()

Crawl the URL set up in the crawler.

This is the main entry point, and will block while it runs.
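Since crawl() blocks, a long-running crawl can be wrapped so it is easy to stop from the terminal. A usage sketch, not part of the API:

c = Crawler('http://example.com')
try:
    c.crawl()  # Blocks until every reachable URL has been visited
except KeyboardInterrupt:
    print('Crawl stopped early')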

get(url)

Get a specific URL, log its response, and return its content.

Parameters:
  • url – The fully qualified URL to retrieve.
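
get() can also be called on its own when you only need a single page. This sketch assumes the returned content is the response body:

c = Crawler('http://example.com')
body = c.get('http://example.com/blog/')  # Logs the response, returns its content
print(len(body))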
crawler.main.run_main()

A small wrapper used for running the crawler as a CLI script.
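
A minimal way to invoke it, assuming only the module path documented above:

from crawler.main import run_main

if __name__ == '__main__':
    run_main()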

crawler.utils

crawler.utils.should_ignore(ignore_list, url)

Returns True if the URL should be ignored.

Parameters:
  • ignore_list – The list of regular expressions to ignore.
  • url – The fully qualified URL to compare against.
>>> should_ignore(['blog/$'], 'http://ericholscher.com/blog/')
True
>>> should_ignore(['home'], 'http://ericholscher.com/blog/')
False
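
The matching behaviour shown above can be expressed with re.search over each pattern; this is a sketch of those semantics, not necessarily the library's actual implementation:

import re

def should_ignore(ignore_list, url):
    # Ignore the URL if any pattern in the list matches it anywhere.
    return any(re.search(pattern, url) for pattern in ignore_list)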
crawler.utils.log(url, status)

Log information about a response to the console.

Parameters:
  • url – The URL that was retrieved.
  • status – The HTTP status code of the response.
>>> log('http://ericholscher.com/blog/', 200)
OK: 200 http://ericholscher.com/blog/
>>> log('http://ericholscher.com/blog/', 500)
ERR: 500 http://ericholscher.com/blog/
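
The OK/ERR prefix follows from the status code. A minimal sketch consistent with the output above; treating everything below 400 as a success is an assumption, since the examples only show 200 and 500:

def log(url, status):
    # Statuses below 400 are successes; everything else is an error.
    prefix = 'OK' if status < 400 else 'ERR'
    print('{0}: {1} {2}'.format(prefix, status, url))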