DocStrange API + LangChain: Build High-Quality Document RAG in Minutes

Convert PDFs and images into LLM-ready Markdown with the DocStrange API, then index and query with LangChain.
Cleaner input → better embeddings → smarter answers.

Looking for the UI? Try docstrange.nanonets.com.
API endpoint: https://extraction-api.nanonets.com/api/v1/extract/sync

Step 0 — Install Required Packages

Before you begin the tutorial, install the required Python libraries:

pip install requests langchain-community chromadb
pip install langchain-openai langchain-text-splitters
pip install langchain-huggingface langchain-chroma

Run these commands in your terminal or notebook to ensure all dependencies are installed.

Step 1 — Extract Markdown with the DocStrange API

Minimal Python to convert a PDF/image into LLM-ready Markdown:

import requests

url = "https://extraction-api.nanonets.com/api/v1/extract/sync"

API_KEY = "YOUR_API_KEY"  # your DocStrange API key
headers = {"Authorization": f"Bearer {API_KEY}"}

# Only output_format is needed here; other fields (file_url, file_base64,
# custom_instructions, prompt_mode, json_options, csv_options,
# include_metadata) are optional and can be omitted.
payload = {"output_format": "markdown"}

# Upload the document and run the extraction
with open("example-file", "rb") as fh:
    files = {"file": ("example-file", fh)}
    resp = requests.post(url, data=payload, files=files, headers=headers)

resp.raise_for_status()

# Pull the Markdown out of the response and save it to disk
markdown = resp.json()["result"]["markdown"]["content"]
with open("annual_report.md", "w", encoding="utf-8") as f:
    f.write(markdown)

print("✅ Extracted and saved: annual_report.md")
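The extraction path used in the script assumes the sync endpoint returns JSON shaped roughly like the sketch below (an illustration of the assumed structure, not the full response, which may carry additional fields):

```python
# Illustrative (assumed) shape of the sync endpoint's JSON response
sample_response = {
    "result": {
        "markdown": {
            "content": "# Annual Report 2023\n\n| Quarter | Revenue |\n|---|---|\n| Q1 | $1.2M |"
        }
    }
}

# Same access path as in the script above
markdown = sample_response["result"]["markdown"]["content"]
print(markdown.splitlines()[0])  # the first Markdown heading
```

If the extraction fails, `resp.raise_for_status()` in the script will surface the HTTP error before this path is ever read.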

Tip: For table-heavy financial PDFs, set output_format="markdown-financial-docs".

Step 2 — Index with LangChain (ChromaDB)


from pathlib import Path
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma


# Load the Markdown file produced by DocStrange
text = Path("annual_report.md").read_text(encoding="utf-8")
docs = [Document(page_content=text, metadata={"source": "annual_report.pdf"})]

# Split for embeddings
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Create embeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Build Chroma vector index (persistent)
persist_dir = "indexes/annual_report_chroma"
collection_name = "annual_report_2023"

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=persist_dir,
    collection_name=collection_name,
)

print("✅ Built Chroma index from extracted Markdown")
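To build intuition for the chunk_size and chunk_overlap parameters above, here is a minimal plain-Python sliding-window chunker. This is a simplified sketch of the overlap idea, not what RecursiveCharacterTextSplitter actually does (it prefers splitting on paragraph and sentence boundaries first):

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, overlap: int = 200):
    """Naive character-window chunking: each chunk starts
    (chunk_size - overlap) characters after the previous one,
    so neighbouring chunks share `overlap` characters of context."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "x" * 2500
chunks = sliding_window_chunks(doc)
print([len(c) for c in chunks])  # → [1000, 1000, 900]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which keeps embeddings coherent.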

Step 3 — Query with Retrieval-Augmented Generation (RAG)


# ---------- RAG query pipeline ----------
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

QUESTION = "Summarize key revenue insights from the 2023 section."

# Reopen the persisted Chroma store
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vs = Chroma(
    embedding_function=embeddings,
    persist_directory=persist_dir,
    collection_name=collection_name,
)

retriever = vs.as_retriever(search_kwargs={"k": 5})

prompt = ChatPromptTemplate.from_template("""
You are a financial analyst. Use the provided context to answer the question.

Question: {question}

Context:
{context}

Answer:
""")

llm = ChatOpenAI(model="gpt-4.1", api_key="YOUR_OPENAI_API_KEY")

def format_docs(docs):
    return "\n\n---\n\n".join(d.page_content for d in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke(QUESTION))
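To see what the chain above actually sends to the model, here is a plain-Python illustration of the template fill, using str.format in place of ChatPromptTemplate and two hypothetical retrieved chunks; the "---" separator matches format_docs above:

```python
TEMPLATE = """You are a financial analyst. Use the provided context to answer the question.

Question: {question}

Context:
{context}

Answer:"""

# Hypothetical chunks as the retriever might return them
retrieved = [
    "Revenue grew 12% year-over-year in 2023.",
    "Q4 revenue was driven by subscription renewals.",
]
context = "\n\n---\n\n".join(retrieved)

final_prompt = TEMPLATE.format(
    question="Summarize key revenue insights from the 2023 section.",
    context=context,
)
print(final_prompt)
```

Everything the LLM knows about your document arrives through that {context} slot, which is why clean Markdown extraction upstream matters so much.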


FAQ

What is the DocStrange document extraction API?

DocStrange is an AI-powered document extraction API by Nanonets that converts PDFs and images into structured, LLM-ready Markdown/JSON/CSV/HTML suitable for LangChain and other RAG frameworks.

How does DocStrange improve LangChain RAG quality?

By preserving tables, headings, and layout, the API produces cleaner input for chunking and embeddings, which improves retrieval quality and final LLM answers.
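Because the headings survive extraction, you can split on document structure instead of raw character counts. A stdlib sketch of the idea (LangChain's MarkdownHeaderTextSplitter does this properly, with configurable header levels and metadata):

```python
import re

def split_on_headings(markdown: str):
    """Split Markdown into (heading, body) sections at lines starting with '#'."""
    sections, heading, body = [], None, []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6} ", line):
            if heading is not None or body:
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    sections.append((heading, "\n".join(body).strip()))
    return sections

md = "# Revenue\nUp 12% in 2023.\n## Q4\nDriven by renewals."
print(split_on_headings(md))  # → [('Revenue', 'Up 12% in 2023.'), ('Q4', 'Driven by renewals.')]
```

Structure-aware chunks like these tend to retrieve better than arbitrary character windows because each chunk is a self-contained topic.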

Which output formats are best for LLMs?

markdown is ideal for general RAG. For table-dense documents, use markdown-financial-docs. Use json when you need programmatic post-processing.

Do I need to run anything locally?

No — the API is hosted. Just send your files to the endpoint and receive structured output.