
Document Processing Profiles
New feature — Bundle document extraction methods and chunking strategies into profiles. Create multiple profiles and select per Knowledge Base (KB).
What is a Profile?
Previously, only one global extraction/chunking setting was possible. With profiles, you can define a different bundle of settings per use case and pick the appropriate profile for each KB.
Profile Management
In the Document Processing Profiles section at the top of the Documents settings page, create and manage profiles.
Create profile
Click + New Profile to open the profile creation modal.
| Setting | Description | Default |
|---|---|---|
| Name | Profile identifier name | — |
| Content Extraction Engine | Method for extracting text from documents | Default |
| PDF Image Extraction | Whether to extract text from PDF images | OFF |
| Text Splitter | Chunking strategy (fixed-size / semantic) | Character |
| Chunk Size | Size of each chunk | 1000 |
| Chunk Overlap | Overlap between chunks | 100 |
| Advanced Settings | Table preservation, context preservation, per-engine details | — |
Set default profile
Click Set as Default in the profile list to sync that profile’s settings as the global default. KBs without a specified profile are processed with this default.
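
As an illustration of the fallback rule, here is a minimal sketch of how a KB's profile might be resolved. The `ProcessingProfile` class, its field names, and `resolve_profile` are hypothetical names chosen for this example, not the product's actual data model; the defaults are copied from the profile settings table. Falling back to the default when a KB names an unknown profile is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProcessingProfile:
    """Hypothetical profile record; defaults mirror the profile settings table."""
    name: str
    extraction_engine: str = "Default"
    pdf_image_extraction: bool = False
    text_splitter: str = "Character"
    chunk_size: int = 1000
    chunk_overlap: int = 100


def resolve_profile(
    kb_profile_name: Optional[str],
    profiles: dict[str, ProcessingProfile],
    default_name: str,
) -> ProcessingProfile:
    """Return the KB's own profile, or the default profile when none is set."""
    # Assumption: an unknown profile name is treated like "no profile specified".
    if kb_profile_name is not None and kb_profile_name in profiles:
        return profiles[kb_profile_name]
    return profiles[default_name]


profiles = {
    "general": ProcessingProfile(name="general"),
    "scanned-pdf": ProcessingProfile(
        name="scanned-pdf", extraction_engine="LLM Vision", chunk_size=2000
    ),
}

# A KB without a profile is processed with the default ("general" here).
print(resolve_profile(None, profiles, default_name="general"))
```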
Content Extraction
Pick the engine that extracts text from documents.
| Engine | Strengths | Best For | Additional Settings |
|---|---|---|---|
| Default | Built-in text extractor | Simple text documents | PDF Extract Images toggle |
| Tika | Apache Tika server-based | Diverse file format support | Server URL (default: http://tika:9998) |
| Docling | Advanced document processing engine | Documents with complex layouts | Server URL (default: http://docling:5001) |
| Document Intelligence | Azure AI Document Intelligence | Azure environments, accurate OCR | Endpoint + API Key |
| Document AI | Google Cloud Document AI | Tables, forms, multi-column layouts | Project ID + Processor ID + Location |
| Mistral OCR | Mistral-based OCR | Image-based documents | Mistral API Key |
| LLM Vision | Vision LLM-based extraction | Complex layouts, text in charts/images | Vision model ID + prompt (optional) |
LLM Vision Extraction
New feature — A method that sends page images directly to a Vision-capable LLM (GPT-4o, Claude, etc.) for extraction as markdown text.
- Convert the PDF to per-page images (around 300 DPI)
- Send each page image to Vision LLM in parallel
- LLM returns markdown preserving table/list/title structure
- Auto-correct broken sentences at adjacent page boundaries
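
The parallel dispatch and page-boundary steps above can be sketched roughly as follows. This is only an illustration: `call_vision_llm` is a hypothetical callable standing in for the real model request, and the boundary repair in `join_pages` is a naive heuristic assumed for the example, not the product's actual correction logic.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def extract_pages(
    page_images: Sequence[bytes],
    call_vision_llm: Callable[[bytes], str],
    max_workers: int = 4,
) -> list[str]:
    """Send every page image to the model in parallel; results keep page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, even if calls finish out of order.
        return list(pool.map(call_vision_llm, page_images))


def join_pages(pages: Sequence[str]) -> str:
    """Naive boundary repair: if a page ends mid-sentence, continue on the same line."""
    out = ""
    for text in pages:
        text = text.strip()
        if not text:
            continue
        if not out:
            out = text
        elif out[-1] in ".!?:|" or text[0] in "#|-":
            out += "\n\n" + text  # previous page ended cleanly, or a new block starts
        else:
            out += " " + text  # the sentence was cut by the page break
    return out


# Usage with a fake model that simply decodes the bytes it is given.
fake_llm = lambda image: image.decode("utf-8")
pages = extract_pages([b"The quick brown", b"fox jumps.", b"# Next section"], fake_llm)
print(join_pages(pages))
```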
| Setting | Description |
|---|---|
| Vision Model | Pick a Vision-supporting model registered in the system (e.g., gpt-4o) |
| Extraction Prompt | Custom prompt (default prompt used if empty) |
LLM Vision requires the PyMuPDF package. Without it, document uploads using this engine will fail. Verify PyMuPDF installation in the deployment environment.
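
One way to verify the installation is a quick import check run inside the deployment environment. The helper name `pymupdf_available` is made up for this sketch. PyMuPDF has traditionally been imported as `fitz`, and newer releases also expose a `pymupdf` module, so both names are checked.

```python
import importlib.util


def pymupdf_available() -> bool:
    """Return True if PyMuPDF can be found under either of its module names."""
    return any(
        importlib.util.find_spec(name) is not None for name in ("pymupdf", "fitz")
    )


print("PyMuPDF installed:", pymupdf_available())
```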
Text Splitting
Configure how extracted text is split into search-friendly chunks.
| Setting | Default | Description |
|---|---|---|
| Text Splitter | Character | Character: split by char count. Token: split by Tiktoken tokens. Semantic: meaning-based split |
| Chunk Size | 1000 | Chunk size (chars or tokens). Reference value in Semantic mode |
| Chunk Overlap | 100 | Overlap between chunks (preserves context flow) |
How should I pick the Chunk Size?
| Size | Pro | Con | Recommended |
|---|---|---|---|
| Small (≤ 500) | Precise search | Context may be cut | FAQ, short paragraphs |
| Medium (1000) | Balanced performance | — | Most cases (default) |
| Large (≥ 2000) | Wider context preserved | Lower search precision | Long narrative documents |
Why is Chunk Overlap needed?
When sentences are cut at chunk boundaries, related content may be missed during search. Overlap shares some text between adjacent chunks to preserve context. The default 100 is appropriate in most cases.
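
To make the effect of overlap concrete, here is a simplified character splitter. It is a sketch only: the function name is hypothetical, and a real splitter would normally prefer to break at separators such as paragraphs or sentences rather than at exact character offsets.

```python
def split_by_characters(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 100
) -> list[str]:
    """Split text into fixed-size chunks whose edges share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# With size 4 and overlap 1, each chunk repeats the last character of the previous one.
print(split_by_characters("abcdefghij", chunk_size=4, chunk_overlap=1))
```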
Enabling Bypass Embedding and Retrieval skips text splitting and embedding, injecting the entire document directly into the LLM context. Use only for small documents — large documents may exceed token limits.
Semantic Chunking
New feature — A method that splits chunks based on inter-sentence meaning similarity instead of fixed size.
| | Fixed Size (Character/Token) | Semantic |
|---|---|---|
| Split criterion | Character / token count | Meaning similarity change |
| Pro | Fast and predictable | Splits at topic boundaries, improves search accuracy |
| Con | May cut mid-sentence/paragraph | Requires embedding engine, longer processing |
| Recommended | General documents | Long documents with frequent topic shifts |
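
The idea of splitting where meaning similarity changes can be sketched as below. This is a toy illustration under stated assumptions: `toy_embed` uses plain word counts in place of the configured embedding engine, and the fixed `threshold` is an arbitrary value chosen for the example. The product's actual boundary rule is not described in this page.

```python
import math
import re
from collections import Counter
from typing import Callable, Sequence


def toy_embed(sentence: str) -> Counter:
    """Stand-in embedding: word counts. A real setup would call the embedding engine."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: Sequence[str],
    embed: Callable[[str], Counter] = toy_embed,
    threshold: float = 0.2,
) -> list[str]:
    """Start a new chunk whenever adjacent sentences are less similar than threshold."""
    chunks: list[str] = []
    current: list[str] = []
    previous = None
    for sentence in sentences:
        vector = embed(sentence)
        if current and cosine(previous, vector) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        previous = vector
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = [
    "Cats sleep most of the day.",
    "Cats also sleep at night.",
    "Invoices are due within thirty days.",
    "Late invoices are charged a fee.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```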
Table Preservation
New feature — Preserve tables (HTML/markdown format) in documents intact instead of splitting.
- Auto-detect tables in the document (HTML <table>, markdown |...|)
- Standard chunking on text portions only
- Tables are attached intact to the closest text chunk
- Tables exceeding chunk size are split row-by-row while preserving the header
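
The last step, splitting an oversized table row by row while repeating the header, might look like the sketch below. It handles markdown tables only and assumes the first two lines are the header row and the `|---|` separator. The function name and the length accounting are illustrative choices, not the product's implementation.

```python
def split_markdown_table(table: str, chunk_size: int) -> list[str]:
    """Split a markdown table into parts that each repeat the header rows."""
    lines = [line for line in table.strip().splitlines() if line.strip()]
    text = "\n".join(lines)
    if len(text) <= chunk_size:
        return [text]  # small enough to keep intact

    header, rows = lines[:2], lines[2:]  # assumed: header row + separator row
    header_len = sum(len(line) + 1 for line in header)  # +1 for each newline

    parts: list[str] = []
    current: list[str] = []
    current_len = header_len
    for row in rows:
        row_len = len(row) + 1
        if current and current_len + row_len > chunk_size:
            parts.append("\n".join(header + current))
            current, current_len = [], header_len
        current.append(row)
        current_len += row_len
    if current:
        parts.append("\n".join(header + current))
    return parts


table = """| id | name |
|---|---|
| 1 | alpha |
| 2 | beta |
| 3 | gamma |
| 4 | delta |"""

for part in split_markdown_table(table, chunk_size=55):
    print(part, end="\n\n")
```

A single row longer than `chunk_size` is still emitted as its own part here, since a row cannot be divided without breaking the table.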
Contextual Chunking
New feature: For each chunk, the LLM generates a short summary of where the chunk sits within the whole document, and that summary is prepended to the chunk. This implements Anthropic’s Contextual Retrieval technique.
- After chunking completes, call LLM for each chunk
- Generate a summary of “where this chunk is in the entire document and its context”
- Prepend the summary to the chunk before vectorization
| Setting | Description |
|---|---|
| Contextual Chunking | Activate/deactivate toggle |
| Model | LLM model used for context summarization |
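
The prepend step can be illustrated with a small sketch. The `summarize` callable is a hypothetical stand-in for the call to the configured model, and `fake_summarize` only reports a character position so that the example runs without any LLM.

```python
from typing import Callable, Sequence


def contextualize(
    document: str,
    chunks: Sequence[str],
    summarize: Callable[[str, str], str],
) -> list[str]:
    """Prepend a context note to each chunk before it is vectorized."""
    return [f"{summarize(document, chunk)}\n\n{chunk}" for chunk in chunks]


def fake_summarize(document: str, chunk: str) -> str:
    """Placeholder for the LLM call; a real summary would describe the chunk's context."""
    position = document.find(chunk)
    return f"[Context: starts at character {position} of {len(document)}]"


document = "Intro text. Pricing details."
chunks = ["Intro text.", "Pricing details."]
for enriched in contextualize(document, chunks, fake_summarize):
    print(enriched, end="\n---\n")
```

Because the model is called once per chunk, processing time and cost grow with the number of chunks.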
Embedding Settings
Configure the embedding engine and model that converts documents to vectors.
| Setting | Default | Description |
|---|---|---|
| Embedding Engine | SentenceTransformers | Service for vector conversion |
| Embedding Model | all-MiniLM-L6-v2 | Model name to use |
| Batch Size | 32 | Documents processed per batch (only shown for Ollama, OpenAI, Azure) |
| Embedding Dimension | 0 (auto) | Vector dimension count. 0 uses model default |
Supported embedding engines:
- SentenceTransformers
- OpenAI
- Azure OpenAI
- Ollama
- Vertex AI
- Gemini
SentenceTransformers is an open-source embedding engine that runs locally.
| Setting | Description |
|---|---|
| Model | HuggingFace model name (e.g., sentence-transformers/all-MiniLM-L6-v2) |
File Upload Limits
| Setting | Default | Description |
|---|---|---|
| Max file size | Unlimited | Single file upload max size (in MB) |
| Max files | Unlimited | Files usable simultaneously per chat |
| Allowed file extensions | All allowed | Comma-separated (e.g., pdf, docx, txt). Empty = all extensions allowed |
| PDF conversion extensions | Disabled | Extensions to convert to PDF via LibreOffice (e.g., pptx, docx, xlsx). LibreOffice installation required |
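
As a rough illustration of how the size and extension limits combine, consider the sketch below. The function names are hypothetical, extension matching is assumed to be case-insensitive, and 1 MB is assumed to mean 1024 × 1024 bytes. None of these details are confirmed by this page.

```python
from pathlib import Path
from typing import Optional


def parse_extensions(setting: str) -> set[str]:
    """Turn 'pdf, docx, txt' into {'pdf', 'docx', 'txt'}; empty input gives an empty set."""
    return {
        part.strip().lower().lstrip(".")
        for part in setting.split(",")
        if part.strip()
    }


def is_upload_allowed(
    filename: str,
    size_bytes: int,
    allowed_extensions: str = "",
    max_size_mb: Optional[float] = None,
) -> bool:
    """Apply the max file size and allowed extension rules to one file."""
    if max_size_mb is not None and size_bytes > max_size_mb * 1024 * 1024:
        return False
    allowed = parse_extensions(allowed_extensions)
    if not allowed:
        return True  # empty setting means all extensions are allowed
    return Path(filename).suffix.lower().lstrip(".") in allowed


print(is_upload_allowed("report.PDF", 10_000, "pdf, docx, txt"))
print(is_upload_allowed("setup.exe", 10_000, "pdf, docx, txt"))
```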
Question Generation
A feature that pre-generates “questions a user might ask” for each chunk via LLM to improve search accuracy.
| Setting | Default | Description |
|---|---|---|
| Enabled | OFF | Turns question generation on or off |
| Model | — | LLM for question generation |
| Max questions per chunk | 10 | Range 1–20 |
| Question vector weight | 0.5 | 0.0 (content only) ~ 1.0 (questions only). 0.5 = equal weight |
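
One plausible reading of the weight is a linear blend between the content match and the question match. The sketch below assumes the blend is applied to similarity scores; the product may instead combine the vectors themselves, so treat `blended_score` as an illustration of the 0.0 to 1.0 scale rather than the actual formula.

```python
def blended_score(
    content_score: float, question_score: float, weight: float = 0.5
) -> float:
    """Linear blend: weight 0.0 uses content only, 1.0 uses questions only."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0.0 and 1.0")
    return (1.0 - weight) * content_score + weight * question_score


# The same chunk scored with three different weights.
for weight in (0.0, 0.5, 1.0):
    print(weight, blended_score(content_score=0.8, question_score=0.4, weight=weight))
```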
Cloud Storage Integration
Import documents into Knowledge Bases from external cloud storage. This setting only provides toggles; the actual auth info must be set via environment variables.
| Storage | Toggle | Required Environment Variables |
|---|---|---|
| Google Drive | ON/OFF | GOOGLE_DRIVE_CLIENT_ID, GOOGLE_DRIVE_API_KEY |
| OneDrive | ON/OFF | ONEDRIVE_CLIENT_ID |
| SharePoint | ON/OFF | ONEDRIVE_CLIENT_ID_BUSINESS, SHAREPOINT_TENANT_ID, SHAREPOINT_SITE_URL |
Environment variables must be set to enable toggles. Enabling a toggle shows the cloud source in the Knowledge Base’s Add Content menu.
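
Before enabling a toggle, a small check like the following can report which variables are still unset. The variable names come from the table above; the `missing_env` helper and the `REQUIRED_ENV` mapping are names made up for this sketch.

```python
import os
from typing import Mapping

# Required variables per storage, as listed in the table above.
REQUIRED_ENV = {
    "Google Drive": ("GOOGLE_DRIVE_CLIENT_ID", "GOOGLE_DRIVE_API_KEY"),
    "OneDrive": ("ONEDRIVE_CLIENT_ID",),
    "SharePoint": (
        "ONEDRIVE_CLIENT_ID_BUSINESS",
        "SHAREPOINT_TENANT_ID",
        "SHAREPOINT_SITE_URL",
    ),
}


def missing_env(storage: str, env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or empty for one storage."""
    return [name for name in REQUIRED_ENV[storage] if not env.get(name)]


for storage in REQUIRED_ENV:
    missing = missing_env(storage)
    status = "ready" if not missing else "missing " + ", ".join(missing)
    print(f"{storage}: {status}")
```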
Reindexing and Reset
The actions below are irreversible. Always confirm before executing.
| Action | Description | Impact Scope |
|---|---|---|
| Reindex Knowledge Bases | Rebuild vector indexes for all KBs | All Knowledge Bases (takes time) |
| Reset Vector Storage | Delete all data in the Vector DB | Deletes all vectors → reindexing required |
| Reset Upload Directory | Delete all files uploaded to the server | Deletes original files |
When Reindexing is Needed
- When you’ve changed the embedding model
- When you’ve changed the search engine (Vector DB)
- When you’ve changed chunk size/overlap
- When document processing has issues
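
The first three conditions in the list above can be expressed as a simple comparison of settings before and after a change. The setting keys used here (`embedding_model`, `vector_db`, `chunk_size`, `chunk_overlap`) are hypothetical names for this sketch, not the product's configuration keys.

```python
# Hypothetical setting keys whose change makes existing vectors stale.
REINDEX_KEYS = ("embedding_model", "vector_db", "chunk_size", "chunk_overlap")


def reindex_reasons(old: dict, new: dict) -> list[str]:
    """Return the changed settings that call for a reindex (empty list if none)."""
    return [key for key in REINDEX_KEYS if old.get(key) != new.get(key)]


old = {"embedding_model": "all-MiniLM-L6-v2", "chunk_size": 1000, "chunk_overlap": 100}
new = {"embedding_model": "all-MiniLM-L6-v2", "chunk_size": 500, "chunk_overlap": 100}
print(reindex_reasons(old, new))
```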
Related Pages
- Knowledge Base: Create Knowledge Bases and manage documents, and pick a profile per KB
- Search Engine: Vector DB and search parameter settings
- Dynamic Filters: Knowledge Base metadata filter settings
