Agentic Web Crawling

In my last blog, I covered my experience building a data import agent. The goal of that project was to be able to import data from any source, but as I built it out, I ran into a lot of challenges crawling websites for data. In this post, I'll walk you through my experiences.

Defining the Task

When extracting data from the web, there are three kinds of actions: scraping, crawling, and searching. Each action is a superset of the previous one. Scraping is the act of extracting data from a single web page. Crawling is the act of scraping multiple web pages by following links from one page to another. Searching is the act of using a search engine to find relevant web pages and then crawling those pages for data. In this post, I'll focus on scraping and crawling.

The Simplest Crawler

The simplest crawler is given a starting URL and a goal: it is told to extract data from that page, along with links to other pages that might contain relevant data. I won't go into details on extracting the crawl details from the prompt since I covered that in my last blog. In this case, we are extracting an object that looks like this:

from pydantic import BaseModel

class CrawlSourceDetails(BaseModel):
    starting_page: str
    data_to_extract: str
    max_pages: int = 20
    max_depth: int = 1

class PromptExtractionResult(BaseModel):
    source: CrawlSourceDetails
    error: str = ""
    summary: str = ""

As you can see, our CrawlSourceDetails object defines some guardrails to prevent our agent from crawling the entirety of the internet. max_pages caps the total number of pages crawled, including the starting page. max_depth caps the link depth from the starting page. For example, if page A links to pages B and C, and page B links to pages D and E, then starting from page A with a max_depth of 1 would only allow crawling pages B and C, while a max_depth of 2 would allow crawling pages D and E as well.
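
To make those guardrails concrete, here is a minimal sketch of a breadth-first crawl that enforces both limits. The fetch_links helper is hypothetical; it stands in for whatever extracts the links from a page.

def crawl(starting_page: str, max_pages: int, max_depth: int, fetch_links):
    # Breadth-first traversal: frontier holds the pages at the current depth.
    frontier = [starting_page]
    visited: set[str] = set()
    for depth in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            if len(visited) >= max_pages:
                return visited
            if url in visited:
                continue
            visited.add(url)
            # Only gather links if we are allowed to go one level deeper.
            if depth < max_depth:
                next_frontier.extend(fetch_links(url))
        frontier = next_frontier
    return visited

We will start with the prompt: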

From the url "https://en.wikipedia.org/wiki/List_of_cocktails", crawl up to 15 pages to extract cocktail recipes.

From which we can extract:

{
  "source": {
    "starting_page": "https://en.wikipedia.org/wiki/List_of_cocktails",
    "data_to_extract": "cocktail recipes",
    "max_pages": 15
  },
  "summary": "Crawl up to 50 pages starting from the Wikipedia List of Cocktails to extract cocktail recipes."    
}

Now that we have extracted our crawl source details from the prompt, we can start crawling.

from typing import Any, Optional

class ExtractedItem(BaseModel):
    entity_type: str
    fields: dict[str, Any]

class CrawlOutput(BaseModel):
    url: str
    urls_to_follow: list[str] = []
    extracted_items: list[ExtractedItem] = []
    error: Optional[str] = None


simple_crawling_agent = Agent(
    name="Simple Web Crawler Agent",
    instructions="""
You are a web crawling agent that extracts links and data from web pages based on the job requirements.
You should crawl the provided URL extracting all relevant data and links as per the job description.
""",
    output_type=CrawlOutput,
    tools=[read_url]
)

We start by defining our output format and then our simple crawling agent. Next we'll implement run_simple_crawler, which can be called recursively up to max_depth times. Each call to run_simple_crawler receives a list of pages to crawl; when the depth is 0, that list contains only the starting page. As it crawls each page, it extracts data and builds the list of links to follow at the next depth level.

import sys

async def run_simple_crawler(conn, sess, data_to_extract, current_pages, depth, pages, max_depth, max_pages):
    pages_visited = 0
    extracted_items = []
    urls_to_follow = []
    for i, current_page in enumerate(current_pages):
        try:
            prompt = f"""
The data to extract is: {data_to_extract}
The current page to crawl is: {current_page}
Extract links: {'True' if depth + 1 <= max_depth else 'False'}
"""

            output = await simple_crawling_agent.run(conn, prompt=prompt, session=sess)

            if output.error is not None and len(output.error) > 0:
                print(f"Error crawling {current_page}: {output.error}")
                sys.exit(1)

            pages_visited += 1
            extracted_items.extend(output.extracted_items)
            urls_to_follow.extend(output.urls_to_follow)

            # pages holds the count of pages visited at shallower depths, so this
            # budget check covers the whole crawl, not just the current depth level.
            if pages + pages_visited >= max_pages:
                break

        except Exception as e:
            print(f"Error running simple crawler: {e}")
            sys.exit(1)

    # Recurse while the next depth level is still within max_depth and the page budget.
    if depth + 1 <= max_depth and pages + pages_visited < max_pages and len(urls_to_follow) > 0:
        recursive_pages_visited, recursive_extracted_items = await run_simple_crawler(conn, sess, data_to_extract, urls_to_follow, depth + 1, pages + pages_visited, max_depth, max_pages)
        pages_visited += recursive_pages_visited
        extracted_items.extend(recursive_extracted_items)

    return pages_visited, extracted_items
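
To tie it together, here is a minimal sketch of how run_simple_crawler might be kicked off from the prompt-extraction result. The extract_crawl_details helper is a hypothetical stand-in for the prompt-extraction agent from my last post, and the conn and sess setup is elided.

async def import_from_web(conn, sess, user_prompt: str):
    # Hypothetical helper: runs the prompt-extraction agent and returns a PromptExtractionResult.
    result = await extract_crawl_details(conn, sess, user_prompt)
    source = result.source

    pages_visited, extracted_items = await run_simple_crawler(
        conn,
        sess,
        source.data_to_extract,
        [source.starting_page],  # depth 0: only the starting page
        0,                       # current depth
        0,                       # pages visited so far
        source.max_depth,
        source.max_pages,
    )
    print(f"Visited {pages_visited} pages and extracted {len(extracted_items)} items")
    return extracted_items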

Our first execution of the simple_crawling_agent returns the following output:

{
  "url": "https://en.wikipedia.org/wiki/List_of_cocktails",
  "urls_to_follow": [
    "https://en.wikipedia.org/wiki/Category:Cocktails",
    "https://en.wikipedia.org/wiki/List_of_cocktail_families",
    "https://en.wikipedia.org/wiki/List_of_IBA_official_cocktails"
  ],
  "extracted_items": [
    {
      "entity_type": "Potential Cocktail Category",
      "fields": {
        "name": "Absinthe"
      }
    },
    {
      "entity_type": "Potential Cocktail Category",
      "fields": {
        "name": "Beer"
      }
    },
    {
      "entity_type": "Potential Cocktail Category",
      "fields": {
        "name": "Brandy"
      }
    },
    {
      "entity_type": "Potential Cocktail Category",
      "fields": {
        "name": "Cachaça"
      }
    },
    {
      "entity_type": "Potential Cocktail Category",
      "fields": {
        "name": "Gin"
      }
    },
    {
      "entity_type": "Potential Cocktail Category",
      "fields": {
        "name": "Mezcal"
      }
    },
    {
      "entity_type": "Potential Cocktail Category",
      "fields": {
        "name": "Rum"
      }
    },
    {
      "entity_type": "Potential Cocktail Category",
      "fields": {
        "name": "Sake"
      }
    },
    {
      "entity_type": "Potential Cocktail Category",
      "fields": {
        "name": "Tequila"
      }
    },
    {
      "entity_type": "Potential Cocktail Category",
      "fields": {
        "name": "Vodka"
      }
    },
    {
      "entity_type": "Potential Cocktail Category",
      "fields": {
        "name": "Whisky"
      }
    },
    {
      "entity_type": "Potential Cocktail Category",
      "fields": {
        "name": "Wines"
      }
    },
    {
      "entity_type": "Potential Cocktail Category",
      "fields": {
        "name": "Flavored liqueurs"
      }
    },
    {
      "entity_type": "Potential Cocktail Category",
      "fields": {
        "name": "Less common spirits"
      }
    },
    {
      "entity_type": "Potential Cocktail Category",
      "fields": {
        "name": "Other"
      }
    }
  ],
  "error": null
}

Looking at the output above, we can see that our simple crawler had numerous issues. If we visit this page in a browser, we see links to more than a hundred cocktail recipes; our crawler only extracted three links. In addition to missing most of the links, if we visit the links that were extracted, we find that our OpenAI model hallucinated one of the three URLs. There is no page located at https://en.wikipedia.org/wiki/List_of_cocktail_families, and a simple search of the page's source code shows that this link does not exist on the page. Finally, our extracted items are not cocktail recipes, but cocktail categories. Though this is not what we wanted, the model didn't hallucinate the data, and items with the "Potential Cocktail Category" entity type can simply be ignored when processing the results.
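
Since a quick search of the page source is enough to spot a hallucinated URL, that check is easy to automate. Here is a minimal sketch, assuming the requests and BeautifulSoup libraries, that filters an agent's urls_to_follow down to links that actually appear on the page:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def filter_hallucinated_links(page_url: str, candidate_urls: list[str]) -> list[str]:
    # Collect the absolute form of every link that really exists on the page.
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    real_links = {urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True)}

    # Keep only the candidates the page actually links to.
    return [url for url in candidate_urls if urljoin(page_url, url) in real_links]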

To attempt to get all the links on the page, I will split the crawling into two steps. The first step will extract all the links on the page, and the second will extract the data from each link.

class LinkExtractionOutput(BaseModel):
    url: str
    urls_to_follow: list[str] = []
    error: Optional[str] = None

link_extracting_agent = Agent(
    name="Link Extracting Agent",
    instructions="""
You are a web crawling agent that extracts links to pages containing relevant data based on the job requirements.
You should crawl the provided URL extracting all relevant links as per the job description. You should scrape
the entire page and find all links that may contain relevant data.
""",
    output_type=LinkExtractionOutput,
    tools=[read_url]
)

class ScrapingOutput(BaseModel):
    url: str
    extracted_items: list[ExtractedItem] = []
    error: Optional[str] = None

scraping_agent = Agent(
    name="Web Scraping Agent",
    instructions="""
You are a web scraping agent that extracts relevant data from web pages based on the job requirements.
You should scrape the provided URL extracting all relevant data as per the job description.
""",
    output_type=ScrapingOutput,
    tools=[read_url]
)

With our agents defined, we can now implement our two-part crawler. The logic is more or less the same as our simple crawler; the only difference is that link extraction and data scraping are done in separate LLM calls.

async def run_two_part_crawler(conn, sess, data_to_extract, current_pages, depth, pages, max_depth, max_pages):
    print(f"Running two-part crawler at depth {depth} with a max_depth of {max_depth} on pages: {current_pages}")

    pages_visited = 0
    extracted_items = []
    urls_to_follow = []
    for i, current_page in enumerate(current_pages):
        prompt = f"""
The data to extract is: {data_to_extract}
The current page to crawl is: {current_page}
"""
        print(prompt)
        # Step 1: extract links, but only if we are allowed to crawl one level deeper.
        if depth + 1 <= max_depth:
            try:
                output = await link_extracting_agent.run(conn, prompt=prompt, session=sess)

                if output.error is not None and len(output.error) > 0:
                    print(f"Error scraping {current_page} for links: {output.error}")
                    sys.exit(1)

                urls_to_follow.extend(output.urls_to_follow)

            except Exception as e:
                print(f"Error scraping {current_page} for links: {e}")
                sys.exit(1)

        # Step 2: scrape the page for data.
        try:
            output = await scraping_agent.run(conn, prompt=prompt, session=sess)

            if output.error is not None and len(output.error) > 0:
                print(f"Error crawling {current_page}: {output.error}")
                sys.exit(1)

            pages_visited += 1
            extracted_items.extend(output.extracted_items)

        except Exception as e:
            print(f"Error crawling {current_page}: {e}")
            sys.exit(1)

        # Budget check covers pages visited at shallower depths as well.
        if pages + pages_visited >= max_pages:
            break

    if depth + 1 <= max_depth and pages + pages_visited < max_pages and len(urls_to_follow) > 0:
        recursive_pages_visited, recursive_extracted_items = await run_two_part_crawler(conn, sess, data_to_extract, urls_to_follow, depth + 1, pages + pages_visited, max_depth, max_pages)
        pages_visited += recursive_pages_visited
        extracted_items.extend(recursive_extracted_items)

    return pages_visited, extracted_items

If we look at the results of the first step of our two-part crawler, we see the following:

{
  "url": "https://en.wikipedia.org/wiki/List_of_cocktails",
  "urls_to_follow": [
    "/wiki/List_of_cocktails_(A%E2%80%93C)",
    "/wiki/List_of_cocktails_(D%E2%80%93F)",
    "/wiki/List_of_cocktails_(G%E2%80%93I)",
    "/wiki/List_of_cocktails_(J%E2%80%93L)",
    "/wiki/List_of_cocktails_(M%E2%80%93O)",
    "/wiki/List_of_cocktails_(P%E2%80%93R)",
    "/wiki/List_of_cocktails_(S)",
    "/wiki/List_of_cocktails_(T%E2%80%93Z)"
  ],
  "error": null
}

We can see that our links this time are completely hallucinated. None of these links exist on the page, nor anywhere else on Wikipedia. This is a major problem with using LLMs for web crawling. At times, LLMs feel like pure magic, accomplishing tasks that seem impossible without any task-specific code. Other times, they completely fail at simple tasks. I played with a lot of different agent instructions and prompts to try to get the model to extract real links from the page, but I was never able to get it to work reliably.

Converting to Markdown

I had a hunch that the model was struggling to extract links from the raw HTML. HTML's nested structure forces you to keep a lot of context in memory to correctly match where tags start and end, and it is bloated with content that is irrelevant to data extraction: style tags, scripts, comments, id and class attributes, and so on. To test this theory, I gave the model a tool that retrieves pages as Markdown instead of raw HTML.

import threading

md_file_cache = {}
md_file_cache_lock = threading.Lock()

@function_tool
def read_url_as_markdown(url: str, start: int, num_bytes: int = -1, cache: bool = True):
    """
    Fetches content from a URL, optionally reading a portion of the content from a specified start position for a given number of bytes.

    Args:
        url (str): The URL to fetch content from. If the content is HTML, it will be converted to markdown.
        start (int): The starting byte position to read from.
        num_bytes (int): The number of bytes to read. If -1, reads the entire content.
        cache (bool): Whether to cache the content fetched from the URL to avoid repeated network requests when streaming portions of the same file through multiple calls. If the value is False, the content will be fetched from the URL even if the URL was fetched and cached previously.

    Returns:
        The requested portion of the content fetched from the URL, converted to markdown.
    """
    
    with md_file_cache_lock:
        if cache and url in md_file_cache:
            file_contents = md_file_cache[url]
        else:
            file_contents = remote_html_to_markdown(url)

            if cache:
                md_file_cache[url] = file_contents

    if num_bytes == -1:
        return file_contents[start:]
    else:
        return file_contents[start:start+num_bytes]
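
The tool leans on a remote_html_to_markdown helper that isn't shown above. A minimal sketch of what it might look like, assuming the requests and html2text libraries (any HTML-to-Markdown converter would do):

import requests
import html2text

def remote_html_to_markdown(url: str) -> str:
    # Fetch the page and hand the HTML to an HTML-to-Markdown converter.
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    converter = html2text.HTML2Text()
    converter.ignore_images = True  # drop image markup we don't need for extraction
    converter.body_width = 0        # don't hard-wrap lines
    return converter.handle(response.text)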

Providing the model with this tool instead of the read_url tool, I started receiving the following error:

Error getting response: Error code: 429 - {
    'error': {
        'message': 'Request too large for gpt-4.1 in organization org-2xNsxYsUqCbvX0vmrOo8QOXx on tokens per min (TPM): Limit 30000, Requested 55620. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.',
        'type': 'tokens', 
        'param': None, 
        'code': 'rate_limit_exceeded'
    }
}. (request_id: req_102a627368fe4cd8969f191234e45020)

Adding some debug output to our previous read_url function and to our new read_url_as_markdown function shows that we are sending much less data to the model when using Markdown (454,036 bytes for the HTML vs. 168,828 bytes for the Markdown). Based on this and the error message, it seems likely that the problem was the number of output tokens, not the number of input tokens. I tested this theory by simply requesting a maximum of 30 links.
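
One way to apply that cap is directly in the prompt; the wording below is illustrative rather than the exact prompt I used:

        prompt = f"""
The data to extract is: {data_to_extract}
The current page to crawl is: {current_page}
Return at most 30 links in urls_to_follow.
"""

With that change, I was able to get real links extracted from the page: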

{
  "url": "https://en.wikipedia.org/wiki/List_of_cocktails",
  "urls_to_follow": [
    "https://en.wikipedia.org/wiki/List_of_cocktails_(alphabetical)",
    "https://en.wikipedia.org/wiki/List_of_cocktails_(alphabetical)",
    "https://en.wikipedia.org/wiki/List_of_cocktails_(by_primary_alcohol)",
    "https://en.wikipedia.org/wiki/Aviation_(cocktail)",
    "https://en.wikipedia.org/wiki/Bellini_(cocktail)",
    "https://en.wikipedia.org/wiki/Black_Russian",
    "https://en.wikipedia.org/wiki/Bloody_Mary_(cocktail)",
    "https://en.wikipedia.org/wiki/Blue_Lagoon_(cocktail)",
    "https://en.wikipedia.org/wiki/Boulevardier",
    "https://en.wikipedia.org/wiki/Bramble_(cocktail)",
    "https://en.wikipedia.org/wiki/Bramble_(cocktail)",
    "https://en.wikipedia.org/wiki/Buck's_Fizz",
    "https://en.wikipedia.org/wiki/Caipirinha",
    "https://en.wikipedia.org/wiki/Cape_Codder_(cocktail)",
    "https://en.wikipedia.org/wiki/Champagne_cocktail",
    "https://en.wikipedia.org/wiki/Cosmopolitan_(cocktail)",
    "https://en.wikipedia.org/wiki/Cuba_Libre",
    "https://en.wikipedia.org/wiki/Daiquiri",
    "https://en.wikipedia.org/wiki/Dirty_Martini",
    "https://en.wikipedia.org/wiki/Espresso_martini",
    "https://en.wikipedia.org/wiki/French_75_(cocktail)",
    "https://en.wikipedia.org/wiki/Gibson_(cocktail)",
    "https://en.wikipedia.org/wiki/Gimlet_(cocktail)",
    "https://en.wikipedia.org/wiki/Grasshopper_(cocktail)",
    "https://en.wikipedia.org/wiki/Hanky_Panky_(cocktail)",
    "https://en.wikipedia.org/wiki/Hemingway_Special",
    "https://en.wikipedia.org/wiki/Horse's_Neck",
    "https://en.wikipedia.org/wiki/Irish_Coffee",
    "https://en.wikipedia.org/wiki/Kir_(cocktail)",
    "https://en.wikipedia.org/wiki/Long_Island_iced_tea"
  ],
  "error": null
}

I continued along this path, adjusting the agent instructions to process the Markdown in chunks, but I was never able to extract links with the same level of quality.
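
Chunking can also be driven from outside the agent: slice the cached Markdown yourself and ask the link extractor about one window at a time. The sketch below takes that route; the chunk size and prompt wording are illustrative choices, not settings I landed on.

CHUNK_SIZE = 20_000  # characters of Markdown per call

async def extract_links_in_chunks(conn, sess, data_to_extract, url):
    markdown = remote_html_to_markdown(url)
    urls_to_follow = []
    for offset in range(0, len(markdown), CHUNK_SIZE):
        chunk = markdown[offset:offset + CHUNK_SIZE]
        prompt = f"""
The data to extract is: {data_to_extract}
The page being crawled is: {url}
Below is a portion of the page converted to Markdown. Extract links from this portion only.

{chunk}
"""
        output = await link_extracting_agent.run(conn, prompt=prompt, session=sess)
        if not output.error:
            urls_to_follow.extend(output.urls_to_follow)
    # De-duplicate while preserving order.
    return list(dict.fromkeys(urls_to_follow))

I am continuing to experiment with chunked link extraction, but for now I will turn to improving the quality of the data extracted using the same Markdown approach.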

Improving Data Extraction

Let's start by seeing how well the Markdown approach works for data extraction. I will modify the prompt to test a single page.

From the url "https://en.wikipedia.org/wiki/Negroni" extract details of the cocktail and it's recipe. The max_depth of
crawling is 0 and the max_pages to crawl is 1.

From this prompt we extract the following data:

{
  "url": "https://en.wikipedia.org/wiki/Negroni",
  "extracted_items": [
    {
      "entity_type": "Cocktail",
      "fields": {
        "Name": "Negroni",
        "Type": "Cocktail / Apéritif",
        "Standard Drinkware": "Old fashioned glass (rocks glass)",
        "Standard Garnish": "Orange slice (sometimes orange or lemon peel)",
        "Served": "On the rocks (poured over ice)",
        "Base Spirit": "Gin, sweet red vermouth, Campari",
        "Preparation": "Build in glass over ice, garnish and serve.",
        "Ingredients": [
          "30 mL gin",
          "30 mL sweet red vermouth",
          "30 mL Campari"
        ],
        "IBA Recipe Notes": "Build over ice in an old-fashioned glass and garnish with a slice of orange. Common variations use orange or lemon peel as garnish; can also be stirred and served straight up or on the rocks.",
        "Recipe (Modern Standard)": "Equal parts gin, sweet vermouth, and Campari, built over ice, typically garnished with a slice of orange.",
        "Historical Background": "The Negroni, now a classic cocktail and apéritif, was first documented in Italy in the late 1940s and rose to popularity in the 1950s. The basic recipe of equal parts gin, vermouth rosso, and Campari is first recorded in French cocktail books of the late 1920s. Early recipes sometimes differ from the modern standard, and the drink developed from American- and Italian-style cocktails, such as the Milano-Torino and Americano. The Negroni is famously associated with Count Camillo Negroni, who is said to have asked a bartender in Florence in 1919 to strengthen his Americano by adding gin instead of soda water, with the orange garnish added to signify the difference."
      }
    }
  ],
  "error": null
}

The data looks pretty good, but running the extraction multiple times returns different results. Sometimes it returns separate entities for the cocktail and its recipe; other times it returns just the cocktail with recipe fields. We can update our prompt with more specific instructions to try to get more consistent results.

From the url "https://en.wikipedia.org/wiki/Negroni" extract details of the cocktail and its recipe. The max_depth of
crawling is 0 and the max_pages to crawl is 1. From every page crawled look for cocktail recipes return cocktail recipes
in the following format:

class Cocktail:
    class Ingredient:
        class Amount:
            quantity: float  # floating point quantity of the ingredient
            unit: str  # Unit of measurement (e.g., ml, oz, tsp)
            
        name: str  # Name of the ingredient
        amount: Amount
    
    entity_type:str = "cocktail"
    description: str  # A text description of the cocktail and its history when available.
    instructions: str  # A text description of how to prepare the cocktail.
    ingredients: List[Ingredient]  # A list of ingredients used in the cocktail.
    garnish: str  # Description of the garnish used for the cocktail.

Now that we've updated our prompt, we need to make sure to include the optional schema field in our prompt extraction output object.

class PromptExtractionResult(BaseModel):
    sources: list[CrawlSourceDetails] = []
    error: str = ""
    summary: str = ""
    schema: Optional[str] = None

And then pass it through to the function that runs our crawler:

async def run_two_part_crawler(conn, sess, data_to_extract, schema, current_pages, depth, pages, max_depth, max_pages):

And then in our prompt:

        prompt = f"""
The data to extract is: {data_to_extract}
The current page to crawl is: {current_page}
"""

        if schema is not None:
            prompt += f"The destination schema is: {schema}\n"

With these changes, we get consistent results that match our desired output format:

{
  "url": "https://en.wikipedia.org/wiki/Negroni",
  "extracted_items": [
    {
      "entity_type": "cocktail",
      "fields": {
        "description": "The Negroni is a popular Italian cocktail, made of one part gin, one part vermouth rosso (red, semi-sweet), and one part Campari, garnished with orange peel. It is considered an apéritif and is known for its bitter and spirit-forward flavor profile. The drink is a descendant of the Americano and is said to have been invented in Florence, Italy, at Caffè Casoni in 1919, when Count Camillo Negroni asked the bartender to strengthen his favorite cocktail, the Americano, by replacing the soda water with gin.",
        "instructions": "Stir the gin, vermouth, and Campari together with ice in a mixing glass. Strain into a rocks glass filled with ice. Garnish with orange peel.",
        "ingredients": [
          {
            "name": "gin",
            "amount": {
              "quantity": 1,
              "unit": "oz"
            }
          },
          {
            "name": "vermouth rosso",
            "amount": {
              "quantity": 1,
              "unit": "oz"
            }
          },
          {
            "name": "Campari",
            "amount": {
              "quantity": 1,
              "unit": "oz"
            }
          }
        ],
        "garnish": "Orange peel"
      }
    }
  ],
  "error": null
}

Conclusion

This approach is promising, but there is still a lot of work to be done. Link extraction in particular is still a challenge. By converting to Markdown, we are getting less hallucinated data, but the model still struggles to extract all the links on a page, even when we iterate over the page in chunks. If you'd like to share your experience with agent development, or are interested in playing with these agents yourself, please reach out to us on Discord.
