One of the most significant challenges with large language models today is that their general knowledge is frozen at a training cut-off date, which often leaves them blind to new technologies. For instance, if you ask a standard LLM about a new AI framework like Pydantic-AI, it will likely have no idea what you’re talking about. Even models with web-searching capabilities provide only bare-bones information.
However, what if you could feed the entire framework documentation into the LLM’s knowledge base? If you do that and ask the same question, the answer you receive is suddenly spot-on. This is the power of Retrieval-Augmented Generation (RAG), a method for feeding curated, external knowledge into an LLM to make it an expert on a specific topic. This could be an AI agent framework, your e-commerce store’s product catalog, or anything else you can imagine.
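To make the retrieval step concrete, here is a toy sketch of the pattern in plain Python: score stored documents against a question, keep the best matches, and paste them into the prompt. The word-overlap scorer is a stand-in for the embedding search a real RAG system would use, and the document strings are invented for illustration.

```python
def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    # Rank documents by how many words they share with the question
    # (a real system would use vector similarity instead)
    q_words = set(question.lower().split())
    return sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )[:k]

def build_prompt(question: str, documents: list[str]) -> str:
    # Stuff the best-matching documents into the prompt as context
    context = "\n\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"

docs = [
    "Pydantic-AI is a Python agent framework built by the Pydantic team.",
    "Our store sells hiking boots in sizes 36 to 47.",
]
print(build_prompt("What is Pydantic-AI?", docs))
```

The LLM then answers from the supplied context rather than from its stale training data.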
The primary bottleneck is the curation step. Ingesting an entire website into a knowledge base can be incredibly difficult and slow. How can you accomplish this quickly, before 2027 arrives and AI has taken over the world anyway?
Introducing Crawl4AI
This is where Crawl4AI comes in. Crawl4AI is an open-source web crawling framework designed specifically to scrape websites and format the output perfectly for LLM consumption. It masterfully solves the common problems associated with website scraping systems, which are often slow, overly complicated, and resource-intensive. Crawl4AI is the opposite: intuitive, incredibly fast, easy to set up, and extremely memory-efficient.
In this article, I’ll demonstrate how to use Crawl4AI to scrape any website for an LLM in mere seconds. We’ll even explore a RAG AI agent built to be an expert on the Pydantic-AI framework, using knowledge curated entirely by Crawl4AI.
Why Use Crawl4AI?
When you extract raw HTML from a website, it’s a chaotic mess. It’s difficult for a human to parse, and a good rule of thumb is that if it’s hard for a human to understand, it’s even harder for an LLM.
Crawl4AI’s most crucial function is transforming this ugly HTML into clean, human-readable Markdown. This text-based format is ideal for feeding into a large language model for RAG. It achieves this with remarkable efficiency and speed, handling complex underlying tasks like proxy and session management automatically.
Furthermore, it’s completely open-source and easy to deploy, with a Docker option available for even simpler setup. Another valuable feature is its ability to remove irrelevant content. Raw HTML is littered with script tags and redundant information. Crawl4AI filters this out, ensuring the final output contains only what you care about for your knowledge base.
Getting Started with Crawl4AI
Getting started is incredibly simple. First, you install the crawl4ai Python package and then run a setup command to install Playwright, the underlying tool Crawl4AI uses for its browser automation.
pip install crawl4ai
python -m playwright install
Playwright is a fantastic open-source tool for web testing and automation, making it an excellent choice for the web-scraping functionality here.
Once installed, you can scrape a page with a simple script. Here’s an example that scrapes the homepage of the Pydantic-AI documentation.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Start a browser-backed crawler; the context manager handles cleanup
    async with AsyncWebCrawler() as crawler:
        # Crawl a single page and convert it to markdown
        result = await crawler.arun(url="https://ai.pydantic.dev/")
        # Print the markdown content
        if result.success and result.markdown:
            print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
Running this script takes only a few seconds to print the entire page’s content as clean markdown in the terminal. While the markdown syntax might look a little busy to us, it’s a perfect format for an LLM to understand, especially when compared to the raw HTML page source.
Scraping an Entire Website
A single page is a good start, but for a truly useful knowledge base, we need to ingest every page of the Pydantic-AI documentation. To do this, we need an efficient and scalable way to extract all the necessary URLs.
Manually copying and pasting each URL is inefficient and not scalable. A much better solution is to use a sitemap. Most websites provide a sitemap.xml file at their root domain (e.g., https://example.com/sitemap.xml). This file provides a structured list of all the pages on the site. The Pydantic-AI documentation has one, and we can use it to programmatically fetch all the page URLs.
A Note on Web Scraping Ethics
Before you start scraping, it’s crucial to consider the ethics. Most websites specify their scraping rules in a robots.txt file (e.g., https://example.com/robots.txt). This file tells you which parts of the site you are allowed or disallowed to crawl. For example, GitHub’s robots.txt asks that you contact them before crawling. Always check this file to ensure you are scraping ethically and responsibly.
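Python’s standard library can check these rules for you. The sketch below parses an inlined robots.txt with urllib.robotparser; in a real crawler you would point RobotFileParser at the site’s live robots.txt URL (via set_url and read) instead of a hard-coded string, and the rules shown here are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    # Parse the robots.txt rules and ask whether this agent may fetch the URL
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(rules, "my-crawler", "https://example.com/docs/"))      # True
print(is_allowed(rules, "my-crawler", "https://example.com/private/x"))  # False
```

Running a check like this before each crawl keeps your scraper on the right side of a site’s stated policy.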
Efficient Multi-URL Crawling
With our list of URLs from the sitemap, our next goal is to crawl them efficiently. A naive approach would be to loop through the URLs and run the crawler for each one. However, this is highly inefficient as it spins up a brand-new browser instance for every single URL.
Crawl4AI’s documentation provides a much better approach: using the same browser session for all pages. This significantly speeds up the process. The following script demonstrates how to pull URLs from the sitemap and crawl them sequentially within a single browser session.
import asyncio
import xml.etree.ElementTree as ET

import requests
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

# Fetch the sitemap and extract every page URL from it
def get_urls_from_sitemap(sitemap_url):
    response = requests.get(sitemap_url)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [elem.text for elem in root.findall(".//sm:loc", namespace)]

async def main():
    all_urls = get_urls_from_sitemap("https://ai.pydantic.dev/sitemap.xml")

    # One browser instance is started once and reused for every page
    browser_config = BrowserConfig(headless=True)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        for url in all_urls:
            result = await crawler.arun(url=url, config=run_config)
            if result.success:
                print(f"Successfully crawled {url}. Content length: {len(result.markdown)}")
            else:
                print(f"Failed to crawl {url}: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
Running this script is incredibly fast. It processes each page in seconds, confirming the success and content length for each URL. At this point, we already have a rapid way to get the markdown for the entire Pydantic-AI documentation, ready for a vector database.
Pushing Performance with Parallel Processing
We can make this even faster. Although the previous script was quick, it still processed each URL sequentially. Crawl4AI allows for parallel processing, enabling us to visit multiple pages at the same time.
The framework lets you visit multiple URLs in parallel within a single browser instance. The following script is adapted from the official example to process URLs in batches.
import asyncio
import os
import xml.etree.ElementTree as ET

import psutil
import requests
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

# (get_urls_from_sitemap is the same as in the previous script)
def get_urls_from_sitemap(sitemap_url):
    response = requests.get(sitemap_url)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [elem.text for elem in root.findall(".//sm:loc", namespace)]

async def main():
    process = psutil.Process(os.getpid())
    print(f"RAM usage before: {process.memory_info().rss / (1024 * 1024):.2f} MB")

    all_urls = get_urls_from_sitemap("https://ai.pydantic.dev/sitemap.xml")
    browser_config = BrowserConfig(headless=True)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    batch_size = 10
    success_count = 0
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Crawl the URLs 10 at a time inside the single shared browser
        for i in range(0, len(all_urls), batch_size):
            batch = all_urls[i : i + batch_size]
            tasks = [crawler.arun(url=url, config=run_config) for url in batch]
            results = await asyncio.gather(*tasks, return_exceptions=True)
            success_count += sum(
                1 for r in results
                if not isinstance(r, BaseException) and r.success
            )

    print(f"Crawled {success_count} of {len(all_urls)} pages.")
    print(f"RAM usage after: {process.memory_info().rss / (1024 * 1024):.2f} MB")

if __name__ == "__main__":
    asyncio.run(main())
Here, pages are crawled in batches of 10, meaning the crawler visits 10 pages simultaneously. When this script runs, you can observe its incredible memory efficiency. Even with a full browser running in the background and visiting 10 pages at a time, memory usage remains remarkably low, often around 120 MB. This batch processing dramatically speeds up the entire operation, especially for sites with hundreds or thousands of pages.
A Practical Application: The RAG AI Agent
With this powerful scraping process, I built a full RAG AI agent that is an expert on Pydantic-AI. I used the exact method described to pull all the documentation, place it into a Postgres database with the pgvector extension, and build an agent around it.
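The ingestion side of that pipeline needs the scraped markdown split into pieces small enough to embed. Below is a minimal sketch of just the chunking step; the embedding call and the database insert are omitted, and the size parameters are illustrative defaults, not values from the agent described above.

```python
def chunk_markdown(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    # Split text into fixed-size chunks that overlap, so content near a
    # boundary appears in two chunks and is never lost to the retriever.
    # Requires overlap < chunk_size, or the loop would never advance.
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        start += step
    return chunks
```

Each chunk would then be embedded and stored alongside its source URL so the agent can cite the page it drew an answer from.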
When I ask this specialized agent a question that a general LLM would fail on, like “What are the supported models?”, it provides a perfect answer and even links to the relevant documentation pages for reference.
I can also ask for complex code examples, such as the “weather agent example” from the documentation. The agent quickly searches its knowledge base and returns the complete, accurate code.
Conclusion
Crawl4AI provides a bulletproof, lightning-fast way to scrape any site and transform it into a knowledge base for your LLM. This is invaluable no matter your use case, as there is almost always a time and place to bring external web data into your AI applications. In my mind, this makes Crawl4AI a game-changer.
While there are many ways to bring knowledge into an LLM, including manual curation or advanced concepts like K-RAG, scraping data from a site remains one of the most common and effective ways to make an AI agent an expert on something you care about.