SiteCrawlerQuick module

Crawl a list of URLs asynchronously.

class SiteCrawlerQuick.SiteCrawlerQuick(urls=None, target_loc=None, target_scheme=None, conn_limit=None, verbosity=50)

Bases: object

Crawl the site in an asynchronous fashion.

Parameters
  • urls (list) – A list of URLs to crawl.

  • target_loc (str, optional) – The location (hostname) to target if different from that defined in the sitemap.

  • target_scheme (str, optional) – The scheme (http/https) to use if different from that defined in the sitemap.

  • conn_limit (int, optional) – The maximum number of connections to use.

  • verbosity (int, optional) – The verbosity setting for output, given as a Python logging level; the default of 50 is logging.CRITICAL. (see: https://docs.python.org/3/library/logging.html#logging-levels)

urls

A list of URLs to crawl.

Type

list

target_loc

The location (hostname) to target if different from that defined in the sitemap.

Type

str

target_scheme

The scheme (http/https) to use if different from that defined in the sitemap.

Type

str

conn_limit

The maximum number of connections to use.

Type

int

verbosity

The verbosity setting for output. (see: https://docs.python.org/3/library/logging.html#logging-levels)

Type

int

async bound_request(sem, url, session, spinner)

Bind the request to the semaphore pool.

Parameters
  • sem (obj) – a semaphore object that manages an internal counter for the connection limit

  • url (str) – the URL to be requested by the crawler

  • session (obj) – an aiohttp ClientSession
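
The semaphore binding described above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: fake_request, crawl, and the tracker dictionary are hypothetical stand-ins, the HTTP call is replaced by a short sleep, and the spinner argument is left out.

```python
import asyncio


async def fake_request(url, tracker):
    # Hypothetical stand-in for the real HTTP request; it records how many
    # calls are in flight at once so the limit can be observed.
    tracker["active"] += 1
    tracker["peak"] = max(tracker["peak"], tracker["active"])
    await asyncio.sleep(0.01)
    tracker["active"] -= 1
    return {url: 200}


async def bound_request(sem, url, tracker):
    # Wait for a free slot in the semaphore before issuing the request.
    async with sem:
        return await fake_request(url, tracker)


async def crawl(urls, conn_limit):
    # The semaphore's internal counter starts at conn_limit.
    sem = asyncio.Semaphore(conn_limit)
    tracker = {"active": 0, "peak": 0}
    tasks = (bound_request(sem, url, tracker) for url in urls)
    results = await asyncio.gather(*tasks)
    return results, tracker["peak"]
```

Calling asyncio.run(crawl(urls, conn_limit=3)) returns the per-URL results in input order, together with the highest number of requests that were in flight at the same time, which never exceeds conn_limit.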

change_url_location(urls=[]) → list

Change the netloc of each URL to a user-defined value.

Parameters

urls (list, optional) – The URLs to change. Defaults to [].

Returns

A list of dictionaries containing modified URLs as their keys.

Return type

list
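
One way to swap the netloc (and optionally the scheme) of a URL is with urllib.parse from the standard library. The sketch below is an assumption about the general technique, not the method's actual code: retarget_url and retarget_urls are hypothetical helpers, and they return plain URL strings rather than the list of dictionaries documented above, since the dictionary values are not specified here.

```python
from urllib.parse import urlsplit, urlunsplit


def retarget_url(url, target_loc=None, target_scheme=None):
    # Split the URL into (scheme, netloc, path, query, fragment).
    parts = urlsplit(url)
    # Keep the original component wherever no override is given.
    return urlunsplit((
        target_scheme or parts.scheme,
        target_loc or parts.netloc,
        parts.path,
        parts.query,
        parts.fragment,
    ))


def retarget_urls(urls, target_loc=None, target_scheme=None):
    # Apply the same overrides to every URL, preserving order.
    return [retarget_url(url, target_loc, target_scheme) for url in urls]
```

The path, query string, and fragment are carried over unchanged, so only the host and scheme differ between the input and output URLs.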

async crawl_sites() → list

Asynchronously crawl a list of URLs.

Returns

A list of dictionaries, each with a crawled URL as its key and the response code as its value.

Example:

[
    {
        'https://www.javierayala.com/page1': 200
    },
    {
        'https://www.javierayala.com/page2': 200
    }
]

Return type

list
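
A result in the shape documented above can be flattened and summarised as follows. This is a sketch of downstream handling only; summarize is a hypothetical helper that is not part of the module, and treating codes of 400 and above as failures is an assumption.

```python
from collections import Counter


def summarize(results):
    # Merge the single-entry dictionaries into one URL -> status mapping.
    statuses = {}
    for entry in results:
        statuses.update(entry)
    # Count how often each response code occurred.
    counts = Counter(statuses.values())
    # Assumption: client and server errors (>= 400) count as failures.
    failures = {url: code for url, code in statuses.items() if code >= 400}
    return counts, failures
```

The returned counter gives a quick overview per status code, and the failures dictionary lists the URLs worth checking again.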

get_urls() → list

Getter for the provided URLs.

Returns

A list of dictionaries with the URLs to crawl as their keys.

Return type

list

async request(url, session, spinner) → dict

Asynchronously request a URL from a web server.

Parameters
  • url (str) – the URL to be requested by the crawler.

  • session (obj) – an aiohttp ClientSession

Returns

The URL as the key and its response code as the value.

Return type

dict
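
The return shape can be illustrated with a small sketch that follows the aiohttp pattern of using session.get(url) as an async context manager and reading response.status. To keep the example free of third-party dependencies, FakeSession and FakeResponse are hypothetical stand-ins for an aiohttp ClientSession and its response; the spinner argument is omitted, and the real method may handle errors and redirects differently.

```python
import asyncio


class FakeResponse:
    # Hypothetical stand-in exposing the `status` attribute of a response.
    def __init__(self, status):
        self.status = status

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False


class FakeSession:
    # Hypothetical stand-in mimicking `session.get(url)` on a client session.
    def __init__(self, statuses):
        self._statuses = statuses

    def get(self, url):
        # Unknown URLs are answered with 404 in this sketch.
        return FakeResponse(self._statuses.get(url, 404))


async def request(url, session):
    # Issue the request and map the URL to the response status code.
    async with session.get(url) as response:
        return {url: response.status}
```

With a real aiohttp ClientSession passed in place of FakeSession, the same request coroutine would return the live status code for the URL.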