langchain-crawlbase

LangChain primitives backed by the Crawlbase Crawling API.

Three drop-in classes:

CrawlbaseLoader: a BaseLoader that fetches a list of URLs and returns clean Markdown Documents ready for chunking.
CrawlbaseTool: a BaseTool that lets an LLM agent fetch a live web page mid-conversation.
CrawlbaseRetriever: a BaseRetriever that fetches a fixed set of seed URLs and returns the documents that match the query.

All three return GitHub-flavored Markdown via Crawlbase's format=md parameter, so you skip the HTML-stripping step entirely.

Installation

pip install langchain-crawlbase

Setup

Get a token from your Crawlbase dashboard. Use your normal token for static pages, or your JavaScript token for SPA / JS-rendered pages — Crawlbase routes the request automatically based on which token you send.

export CRAWLBASE_TOKEN=your_token
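
If you keep both tokens in the environment, a small helper can choose between them per request. The sketch below is only an illustration: `pick_token` and its `needs_js` flag are hypothetical names, not part of langchain-crawlbase, and `CRAWLBASE_JS_TOKEN` is simply the variable name used in the loader example further down.

```python
import os


def pick_token(needs_js: bool = False) -> str:
    """Return the Crawlbase token for a request (hypothetical helper)."""
    # JS-rendered pages use the JavaScript token, static pages the normal one.
    var = "CRAWLBASE_JS_TOKEN" if needs_js else "CRAWLBASE_TOKEN"
    token = os.environ.get(var)
    if not token:
        raise RuntimeError(f"{var} is not set; export it before running")
    return token
```

Failing early with a clear message avoids sending a request with an empty token.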

Usage

Document loader

import os
from langchain_crawlbase import CrawlbaseLoader

loader = CrawlbaseLoader(
    token=os.environ["CRAWLBASE_TOKEN"],
    urls=[
        "https://en.wikipedia.org/wiki/Large_language_model",
        "https://en.wikipedia.org/wiki/Retrieval-augmented_generation",
    ],
)
docs = loader.load()
print(docs[0].page_content[:500])
print(docs[0].metadata)  # {'source': '...', 'pc_status': 200, ...}

For SPA pages, pass your JavaScript token instead (read here from CRAWLBASE_JS_TOKEN, which you need to export separately). The interface is the same:

loader = CrawlbaseLoader(
    token=os.environ["CRAWLBASE_JS_TOKEN"],
    urls=["https://some-spa-site.com/page"],
)
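
Since the loader returns Markdown, one simple way to chunk it is to split at headings and cap the size of each piece. The function below is a minimal stdlib-only sketch, not something shipped with langchain-crawlbase; the name `split_markdown_by_heading` and the `max_chars` default are assumptions chosen for illustration.

```python
import re


def split_markdown_by_heading(text: str, max_chars: int = 2000) -> list[str]:
    """Split Markdown at ATX headings, then cap each piece at max_chars."""
    # Zero-width split just before any line starting with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        # Cut oversized sections into fixed-size windows; empty ones yield nothing.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks
```

With the docs loaded above, this could be applied as `[c for d in docs for c in split_markdown_by_heading(d.page_content)]`. LangChain's own text splitters are a more complete alternative if you need overlap or heading metadata.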

Agent tool

import os
from langchain_crawlbase import CrawlbaseTool

tool = CrawlbaseTool(token=os.environ["CRAWLBASE_TOKEN"])

# Use directly:
markdown = tool.invoke({"url": "https://example.com"})

# Or bind to an LLM:
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-opus-4-7")
agent_llm = llm.bind_tools([tool])
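
Binding the tool only lets the model request a fetch; your code still has to run the request. The sketch below shows one way to do that, assuming tool calls shaped like the entries of LangChain's `AIMessage.tool_calls` (dicts with `name`, `args`, and `id`). `run_tool_calls` is a hypothetical helper, not part of this package.

```python
def run_tool_calls(tool_calls, tools):
    """Execute each requested tool call and collect the results.

    tool_calls: dicts like {"name": ..., "args": {...}, "id": ...}.
    tools:      objects exposing .name and .invoke(args), e.g. a CrawlbaseTool.
    """
    by_name = {t.name: t for t in tools}
    results = []
    for call in tool_calls:
        tool = by_name.get(call["name"])
        if tool is None:
            # Report unknown tools instead of raising, so the model can recover.
            content = f"unknown tool: {call['name']}"
        else:
            content = tool.invoke(call["args"])
        results.append({"tool_call_id": call["id"], "content": content})
    return results
```

With the objects above, this could be called as `run_tool_calls(agent_llm.invoke("...").tool_calls, [tool])`. Each result would then typically be wrapped in a `ToolMessage` and sent back to the model.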

Retriever

import os
from langchain_crawlbase import CrawlbaseRetriever

retriever = CrawlbaseRetriever(
    token=os.environ["CRAWLBASE_TOKEN"],
    urls=[
        "https://crawlbase.com/docs/crawling-api",
        "https://crawlbase.com/docs/crawling-api#parameters",
    ],
)
docs = retriever.invoke("how do I render JavaScript pages")

In v0.1 the retriever filters the fetched pages with simple substring matching. For semantic retrieval, pair CrawlbaseLoader with a vector store of your choice.
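
To give a feel for what substring matching means in practice, here is one plausible reading of it as a standalone sketch. It is not the retriever's actual implementation: the function name, the `{"url", "text"}` page shape, and the `min_term_len` cutoff are all assumptions made for illustration.

```python
def filter_by_substring(pages, query, min_term_len=3):
    """Keep pages containing at least one query term, best matches first."""
    # Drop very short terms such as "I" or "do", which match almost any text.
    terms = [t.lower() for t in query.split() if len(t) >= min_term_len]
    scored = []
    for page in pages:
        text = page["text"].lower()
        hits = sum(term in text for term in terms)
        if hits:
            scored.append((hits, page))
    # Stable sort: pages with equal hit counts keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [page for _, page in scored]
```

Matching like this has no notion of synonyms or word order, which is why a vector store is the better fit for semantic queries.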

Extra Crawlbase parameters

Pass any Crawlbase API parameter via extra_params:

import os
from langchain_crawlbase import CrawlbaseLoader

loader = CrawlbaseLoader(
    token=os.environ["CRAWLBASE_TOKEN"],
    urls=["https://example.com"],
    extra_params={"country": "US", "device": "mobile"},
)
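
For orientation, the sketch below shows how extra parameters could end up in the request's query string alongside format=md. It is an assumption about the general shape of the request, not the package's internals: `build_request_url` is a hypothetical helper, and the endpoint constant and the rule that extra parameters may override the defaults are guesses.

```python
from urllib.parse import urlencode

# Assumed Crawling API endpoint; check the Crawlbase docs for your account.
API_ENDPOINT = "https://api.crawlbase.com/"


def build_request_url(token, url, extra_params=None):
    """Build a request URL with format=md plus any extra parameters."""
    params = {"token": token, "url": url, "format": "md"}
    # Extra parameters are merged last, so they can override the defaults.
    params.update(extra_params or {})
    # urlencode percent-encodes the target URL so it survives as one value.
    return f"{API_ENDPOINT}?{urlencode(params)}"
```

Calling `build_request_url("t", "https://example.com", {"country": "US"})` appends `country=US` after the default parameters.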

Development

pip install -e ".[dev]"
pytest tests/unit
ruff check .

Integration tests are gated on CRAWLBASE_TOKEN:

CRAWLBASE_TOKEN=xxx pytest tests/integration
