- auzieman's Blog
Requests Spider
Requests Spider is a simple Python script that crawls the links found in a list of URLs and returns the response times for each link. It uses the Requests library to fetch the HTML for each URL and BeautifulSoup to parse the HTML and extract the links.
The repository has more recent information: DTLab AuzieTek GIT.
A recent version of the code follows.
import json
import os
import sys
import time
from datetime import datetime

import requests
from bs4 import BeautifulSoup


def crawl_url(url):
    """
    Crawls the given URL and returns the response time (in ms),
    any internal links found, and the HTTP status code.
    """
    start_time = datetime.now()
    status_code = None
    soup = None
    try:
        response = requests.get(url, timeout=10)
        status_code = response.status_code
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
    except requests.RequestException as exc:
        print(f"Error fetching {url}: {exc}")
    internal_links = []
    if soup:
        for link in soup.find_all('a'):
            href = link.get('href')
            # Keep only relative (internal) links
            if href and not href.startswith('http'):
                internal_links.append(href)
    response_time = (datetime.now() - start_time).total_seconds() * 1000
    return response_time, internal_links, status_code


def main():
    script_dir = os.environ.get('REQUESTS_SPIDER_DIR', '/requests_spider/')
    interval = int(os.environ.get('REQUEST_SPIDER_INTERVAL', '10'))
    forks = int(os.environ.get('REQUEST_SPIDER_FORKS', '1'))

    while True:
        urls_file = os.path.join(script_dir, 'urls.json')
        if not os.path.isfile(urls_file):
            print(f"Error: {urls_file} not found.")
            sys.exit(1)

        with open(urls_file) as f:
            urls = json.load(f)

        runs = []
        for url in urls:
            for i in range(1, forks + 1):
                time.sleep(0.5)  # add a delay to be polite
                response_time, internal_links, status_code = crawl_url(url)
                runs.append({
                    "url": url,
                    "fork": i,
                    "start_time": datetime.now().timestamp(),
                    "end_time": None,
                    "exit_status": status_code,
                    "response_time": response_time,
                    "internal_links": internal_links,
                })

        time.sleep(interval)

        for run in runs:
            if run["end_time"] is not None:
                continue
            run["end_time"] = datetime.now().timestamp()

        # Print the results
        for run in runs:
            print(
                f"URL: {run['url']}\tFork: {run['fork']}\t"
                f"Start Time: {datetime.fromtimestamp(run['start_time'])}\t"
                f"End Time: {datetime.fromtimestamp(run['end_time'])}\t"
                f"Response Time: {run['response_time']}ms\t"
                f"Internal Links: {run['internal_links']}"
            )


if __name__ == '__main__':
    main()
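The script expects urls.json to be a JSON array of URL strings. A minimal example (the hostnames here are placeholders, not from the repository) might look like:

```json
[
    "https://example.com/",
    "https://example.org/about"
]
```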
Getting Started
Prerequisites
To run the Requests Spider script, you will need:
- Python 3.x
- Requests library
- BeautifulSoup library
Installation
- Clone the repository:
git clone https://github.com/your-username/requests_spider.git
- Install the required libraries:
pip install -r requirements.txt
Usage
The Requests Spider script can be run from the command line using the following syntax:
python requests_spider.py [-h] [-f FORKS] [-d DELAY] [-v] url_file
where:
- url_file is the path to the JSON file containing the URLs to crawl
- -h shows the help message and exits
- -f FORKS sets the number of forks to use (default is 1)
- -d DELAY sets the delay between each loop in seconds (default is 10)
- -v enables verbose mode, which displays additional output
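The code listing above reads its settings from environment variables, so the flags described here imply an argparse front-end. A minimal sketch of what that parser could look like (the function name and flag wiring are assumptions, not the repository's actual code):

```python
import argparse


def build_parser():
    # Hypothetical parser matching the documented flags; the listing above
    # actually reads REQUEST_SPIDER_* environment variables instead.
    parser = argparse.ArgumentParser(description="Requests Spider")
    parser.add_argument("url_file",
                        help="path to the JSON file containing the URLs to crawl")
    parser.add_argument("-f", "--forks", type=int, default=1,
                        help="number of forks to use (default: 1)")
    parser.add_argument("-d", "--delay", type=int, default=10,
                        help="delay between each loop in seconds (default: 10)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="display additional output")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-f", "2", "-d", "5", "urls.json"])
    print(args.forks, args.delay, args.url_file)
```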
Example
To run the Requests Spider script with 2 forks and a delay of 5 seconds between loops, use the following command:
python requests_spider.py -f 2 -d 5 urls.json
License
This project is licensed under the MIT License - see the LICENSE.md file for details.
Acknowledgments
- Requests library: https://requests.readthedocs.io/
- BeautifulSoup library: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
The Dockerfile included in this repository could be configured to run this job in the background.
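A minimal Dockerfile along those lines (file names, paths, and the base image are assumptions, not the repository's actual configuration) might be:

```dockerfile
# Hypothetical sketch; adjust names and paths to match the repository.
FROM python:3.11-slim
WORKDIR /requests_spider
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY requests_spider.py urls.json ./
ENV REQUESTS_SPIDER_DIR=/requests_spider/
CMD ["python", "requests_spider.py"]
```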