langchain-crawlbase

LangChain primitives backed by the Crawlbase Crawling API.

Three drop-in classes:

  • CrawlbaseLoader — a BaseLoader that fetches a list of URLs and returns clean Markdown Documents ready for chunking.
  • CrawlbaseTool — a BaseTool that lets an LLM agent fetch a live web page mid-conversation.
  • CrawlbaseRetriever — a BaseRetriever that fetches a fixed set of seed URLs and filters by query.

All three return GitHub-flavored Markdown via Crawlbase's format=md parameter, so you skip the HTML-stripping step entirely.
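As a rough illustration of what a format=md request might look like at the HTTP level, the sketch below builds a request URL with the standard library only and makes no network call. The endpoint `https://api.crawlbase.com/`, the `url` query parameter, and the helper name `build_request_url` are assumptions made for illustration, not code taken from this package; check the Crawlbase documentation before relying on them.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed Crawling API endpoint; verify against the official Crawlbase docs.
API_ENDPOINT = "https://api.crawlbase.com/"


def build_request_url(token: str, target_url: str) -> str:
    """Return a request URL asking for Markdown output (hypothetical helper)."""
    query = {"token": token, "url": target_url, "format": "md"}
    # urlencode percent-encodes the target URL so it travels as one query value.
    return f"{API_ENDPOINT}?{urlencode(query)}"


request_url = build_request_url("your_token", "https://example.com/a?b=1")
print(request_url)

# Decode the query string again to confirm the target URL round-trips intact.
params = parse_qs(urlsplit(request_url).query)
print(params["url"], params["format"])
```

The three classes handle this step for you; the sketch is only meant to show where format=md fits in a request.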

Installation

pip install langchain-crawlbase

Setup

Get a token from your Crawlbase dashboard. Use your normal token for static pages, or your JavaScript token for SPA / JS-rendered pages — Crawlbase routes the request automatically based on which token you send.

export CRAWLBASE_TOKEN=your_token

Usage

Document loader

import os
from langchain_crawlbase import CrawlbaseLoader

loader = CrawlbaseLoader(
    token=os.environ["CRAWLBASE_TOKEN"],
    urls=[
        "https://en.wikipedia.org/wiki/Large_language_model",
        "https://en.wikipedia.org/wiki/Retrieval-augmented_generation",
    ],
)
docs = loader.load()
print(docs[0].page_content[:500])
print(docs[0].metadata)  # {'source': '...', 'pc_status': 200, ...}

For SPA pages, pass your JavaScript token instead (read here from a separate CRAWLBASE_JS_TOKEN environment variable that you export yourself). The interface is the same:

loader = CrawlbaseLoader(
    token=os.environ["CRAWLBASE_JS_TOKEN"],
    urls=["https://some-spa-site.com/page"],
)
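Because the loader returns Markdown, one simple way to chunk a document is to split on headings. The following is a minimal standard-library sketch, not part of this package: `split_markdown_by_heading` is a hypothetical helper and the sample text stands in for `docs[0].page_content`. In a real pipeline a LangChain text splitter would likely be a better fit.

```python
import re


def split_markdown_by_heading(text: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown at ATX headings, then cap each piece at max_chars."""
    # Zero-width split before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        # Slice oversized sections so no chunk exceeds max_chars;
        # empty sections produce no chunks at all.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks


# Sample text standing in for docs[0].page_content.
sample = "# Title\nIntro text.\n\n## Details\nMore text here."
chunks = split_markdown_by_heading(sample, max_chars=40)
for chunk in chunks:
    print(repr(chunk))
```

Note that this naive split also fires on lines starting with `# ` inside code blocks, and the fixed-width slicing can cut a word in half, so treat it as a starting point only.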

Agent tool

import os
from langchain_crawlbase import CrawlbaseTool

tool = CrawlbaseTool(token=os.environ["CRAWLBASE_TOKEN"])

# Use directly:
markdown = tool.invoke({"url": "https://example.com"})

# Or bind to an LLM:
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-opus-4-7")
agent_llm = llm.bind_tools([tool])

Retriever

import os
from langchain_crawlbase import CrawlbaseRetriever

retriever = CrawlbaseRetriever(
    token=os.environ["CRAWLBASE_TOKEN"],
    urls=[
        "https://crawlbase.com/docs/crawling-api",
        "https://crawlbase.com/docs/crawling-api#parameters",
    ],
)
docs = retriever.invoke("how do I render JavaScript pages")

In v0.1 the retriever filters the fetched pages with simple substring matching. For semantic retrieval, pair CrawlbaseLoader with a vector store of your choice.
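To make the limits of substring matching concrete, here is a small sketch of such a filter. It is an assumption about the general technique, not the package's actual implementation: the `Page` class is a stand-in for a LangChain Document, `filter_by_substring` is a hypothetical helper, and the case-insensitive comparison is a choice made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Stand-in for a fetched document; not the LangChain Document class."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_substring(pages: list[Page], query: str) -> list[Page]:
    """Keep pages whose text contains the query, ignoring case."""
    needle = query.casefold()
    return [p for p in pages if needle in p.page_content.casefold()]


pages = [
    Page("Use the JavaScript token to render SPA pages.", {"source": "a"}),
    Page("Pricing and rate limits.", {"source": "b"}),
]

# A short keyword matches the first page only.
hits = filter_by_substring(pages, "javascript")
print([p.metadata["source"] for p in hits])

# A full-sentence query matches nothing unless that exact phrase appears.
print(filter_by_substring(pages, "how do I render JavaScript pages"))
```

Under these assumptions, short keyword queries work best with a substring filter, while natural-language questions are better served by embedding-based retrieval.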

Extra Crawlbase parameters

Pass any Crawlbase API parameter via extra_params:

import os
from langchain_crawlbase import CrawlbaseLoader

loader = CrawlbaseLoader(
    token=os.environ["CRAWLBASE_TOKEN"],
    urls=["https://example.com"],
    extra_params={"country": "US", "device": "mobile"},
)

Development

pip install -e ".[dev]"
pytest tests/unit
ruff check .

Integration tests are gated on CRAWLBASE_TOKEN:

CRAWLBASE_TOKEN=xxx pytest tests/integration

License

MIT — © Crawlbase Team. Contact: support@crawlbase.com
