SiteCrawlerQuick module¶
Crawl a list of URLs asynchronously.
-
class SiteCrawlerQuick.SiteCrawlerQuick(urls=None, target_loc=None, target_scheme=None, conn_limit=None, verbosity=50)¶
Bases: object
Crawl the site in an asynchronous fashion.
- Parameters
  - urls (list) – A list of URLs to crawl.
  - target_loc (str, optional) – The location (hostname) to target if different from that defined in the sitemap.
  - target_scheme (str, optional) – The scheme (http/https) to use if different from that defined in the sitemap.
  - conn_limit (int, optional) – The maximum number of connections to use.
  - verbosity (int, optional) – The verbosity setting for output (see: https://docs.python.org/3/library/logging.html#logging-levels).
-
urls¶
A list of URLs to crawl.
- Type: list
-
target_loc¶
The location (hostname) to target if different from that defined in the sitemap.
- Type: str
-
target_scheme¶
The scheme (http/https) to use if different from that defined in the sitemap.
- Type: str
-
conn_limit¶
The maximum number of connections to use.
- Type: int
-
verbosity¶
The verbosity setting for output (see: https://docs.python.org/3/library/logging.html#logging-levels).
- Type: int
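The verbosity parameter reuses the numeric values of Python's standard logging levels, so the default of 50 corresponds to logging.CRITICAL (the quietest setting) and lower values produce more output. A quick check:

```python
import logging

# The standard logging levels and their numeric values; verbosity=50
# (the class default) equals logging.CRITICAL, the quietest setting.
levels = {name: getattr(logging, name)
          for name in ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")}
print(levels)  # {'DEBUG': 10, 'INFO': 20, 'WARNING': 30, 'ERROR': 40, 'CRITICAL': 50}
```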
-
async bound_request(sem, url, session, spinner)¶
Bind the request to the semaphore pool.
- Parameters
  - sem (obj) – a semaphore object to manage an internal counter for the connection limit
  - url (str) – the URL to be requested by the crawler
  - session (obj) – an aiohttp ClientSession
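The semaphore pattern behind bound_request can be sketched as follows. This is an illustrative stand-alone example, not the module's actual code: fake_request stands in for the real aiohttp call, so no network I/O takes place.

```python
import asyncio

async def fake_request(url):
    # Stand-in for the real aiohttp request; returns {url: status}.
    await asyncio.sleep(0)
    return {url: 200}

async def bound_request(sem, url):
    # The semaphore caps how many requests run concurrently,
    # playing the role of conn_limit in SiteCrawlerQuick.
    async with sem:
        return await fake_request(url)

async def crawl(urls, conn_limit=2):
    sem = asyncio.Semaphore(conn_limit)
    return await asyncio.gather(*(bound_request(sem, u) for u in urls))

results = asyncio.run(crawl(["https://example.com/a", "https://example.com/b"]))
print(results)  # [{'https://example.com/a': 200}, {'https://example.com/b': 200}]
```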
-
change_url_location(urls=[]) → list¶
Change the netloc in the URL to something user-defined.
- Parameters
  - urls (list, optional) – URLs that we want to change, by default []
- Returns
A list of dictionaries containing the modified URLs as their keys.
- Return type
list
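The netloc swap can be illustrated with urllib.parse. This stand-alone sketch mirrors the target_loc/target_scheme behaviour described above; using None as each URL key's placeholder value is an assumption, since the method documents only the keys.

```python
from urllib.parse import urlsplit, urlunsplit

def change_url_location(urls, target_loc=None, target_scheme=None):
    # Swap the netloc (and optionally the scheme) in each URL,
    # keeping the path, query, and fragment intact.
    changed = []
    for url in urls:
        parts = urlsplit(url)
        netloc = target_loc or parts.netloc
        scheme = target_scheme or parts.scheme
        new_url = urlunsplit((scheme, netloc, parts.path, parts.query, parts.fragment))
        changed.append({new_url: None})  # value is a placeholder (assumption)
    return changed

result = change_url_location(
    ["https://www.javierayala.com/page1"],
    target_loc="staging.example.com",
    target_scheme="http",
)
print(result)  # [{'http://staging.example.com/page1': None}]
```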
-
async crawl_sites() → list¶
Asynchronously crawl a list of URLs.
- Returns
A list of dictionaries containing the URLs crawled as their keys and their response codes as the values. Example:
[ { 'https://www.javierayala.com/page1': 200 }, { 'https://www.javierayala.com/page2': 200 } ]
- Return type
list
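The documented return shape can be reproduced with a self-contained sketch. RESPONSES below is an in-memory stand-in for real HTTP responses, so no network access is needed; the request and crawl_sites names here are illustrative, not the module's implementation.

```python
import asyncio

# In-memory stand-in for web-server responses (illustrative data).
RESPONSES = {
    "https://www.javierayala.com/page1": 200,
    "https://www.javierayala.com/page2": 200,
}

async def request(url):
    # Each request resolves to a single-entry dict: {url: status_code}.
    await asyncio.sleep(0)
    return {url: RESPONSES[url]}

async def crawl_sites(urls):
    # gather preserves input order, yielding the documented list shape.
    return list(await asyncio.gather(*(request(u) for u in urls)))

results = asyncio.run(crawl_sites(list(RESPONSES)))
print(results)
# [{'https://www.javierayala.com/page1': 200}, {'https://www.javierayala.com/page2': 200}]
```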
-
get_urls() → list¶
Getter for the provided URLs.
- Returns
A list of dictionaries with the URLs to crawl as their keys.
- Return type
list
-
async request(url, session, spinner) → dict¶
Asynchronously request a URL from a web server.
- Parameters
  - url (str) – the URL to be requested by the crawler.
  - session (obj) – an aiohttp ClientSession
- Returns
The URL as the key and its response code as the value from the crawler.
- Return type
dict