The settings on this page determine how well Knowledge Bases (KBSphere) process documents. Properly configuring content extraction engines, text splitting, and embedding models is essential for RAG search accuracy. Configure them in Admin > Settings > Documents.
[Image: Documents settings tab]

Document Processing Profiles

New feature — Bundle document extraction methods and chunking strategies into profiles. Create multiple profiles and select per Knowledge Base (KB).

What is a Profile?

Previously, only one global extraction/chunking setting was possible. With profiles, define different setting bundles per use case and pick the appropriate profile for each KB.
Examples:
 • "Default Extraction" → default engine + fixed-size chunking
 • "High-Precision OCR" → Document Intelligence + table preservation
 • "LLM Vision + Context" → Vision LLM extraction + semantic chunking + Contextual Chunking

Profile Management

In the Document Processing Profiles section at the top of the Documents settings page, create and manage profiles.
1. Create profile

Click + New Profile to open the profile creation modal.
| Setting | Description | Default |
| --- | --- | --- |
| Name | Profile identifier name | |
| Content Extraction Engine | Method for extracting text from documents | Default |
| PDF Image Extraction | Whether to extract text from PDF images | OFF |
| Text Splitter | Chunking strategy (fixed-size / semantic) | Character |
| Chunk Size | Size of each chunk | 1000 |
| Chunk Overlap | Overlap between chunks | 100 |
| Advanced Settings | Table preservation, context preservation, per-engine details | |
2. Set default profile

Click Set as Default in the profile list to sync that profile’s settings as the global default. KBs without a specified profile are processed with this default.
3. Pick profile in KB

In Workspace > Knowledge Base edit screen, pick the document processing profile. If unset, the default profile applies.
The default profile can’t be deleted. Set another profile as default before deleting.
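
The fallback and deletion rules above can be sketched in a few lines. This is a minimal illustration, not the product's implementation: the `Profile` class, its field names, and both helper functions are hypothetical, and only the default values are taken from the profile table.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Profile:
    """Hypothetical profile record; defaults mirror the profile settings table."""
    name: str
    engine: str = "Default"
    pdf_image_extraction: bool = False
    text_splitter: str = "Character"
    chunk_size: int = 1000
    chunk_overlap: int = 100


def resolve_profile(kb_profile_name: Optional[str],
                    profiles: Dict[str, Profile],
                    default_name: str) -> Profile:
    """Return the KB's own profile, or the default profile when none is set."""
    if kb_profile_name is None:
        return profiles[default_name]
    return profiles[kb_profile_name]


def delete_profile(name: str,
                   profiles: Dict[str, Profile],
                   default_name: str) -> Dict[str, Profile]:
    """Return a copy of the profiles without `name`; refuse to drop the default."""
    if name == default_name:
        raise ValueError("Set another profile as default before deleting this one.")
    return {key: value for key, value in profiles.items() if key != name}
```

A KB with no profile selected resolves to whichever profile is currently marked as default, so changing the default changes how those KBs are processed.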

Content Extraction

Pick the engine that extracts text from documents.
| Engine | Strengths | Best For | Additional Settings |
| --- | --- | --- | --- |
| Default | Built-in text extractor | Simple text documents | PDF Extract Images toggle |
| Tika | Apache Tika server-based | Diverse file format support | Server URL (default: http://tika:9998) |
| Docling | Advanced document processing engine | Documents with complex layouts | Server URL (default: http://docling:5001) |
| Document Intelligence | Azure AI Document Intelligence | Azure environments, accurate OCR | Endpoint + API Key |
| Document AI | Google Cloud Document AI | Tables, forms, multi-column layouts | Project ID + Processor ID + Location |
| Mistral OCR | Mistral-based OCR | Image-based documents | Mistral API Key |
| LLM Vision | Vision LLM-based extraction | Complex layouts, text in charts/images | Vision model ID + prompt (optional) |
The default engine is sufficient for most text-based documents (PDF, DOCX, TXT). For mostly scanned documents or image-heavy PDFs, Document Intelligence or Mistral OCR is recommended.

LLM Vision Extraction

New feature — A method that sends page images directly to a Vision-capable LLM (GPT-4o, Claude, etc.) for extraction as markdown text.
Compared to traditional OCR engines, this provides higher accuracy for complex layouts, text within images, and chart descriptions. Behavior:
  1. Convert the PDF to per-page images (approximately 300 DPI)
  2. Send each page image to Vision LLM in parallel
  3. LLM returns markdown preserving table/list/title structure
  4. Auto-correct broken sentences at adjacent page boundaries
| Setting | Description |
| --- | --- |
| Vision Model | Pick a Vision-supporting model registered in the system (e.g., gpt-4o) |
| Extraction Prompt | Custom prompt (default prompt used if empty) |
LLM Vision makes roughly two LLM calls per page. A PDF with N pages needs about 2N − 1 calls (N extractions plus N − 1 boundary corrections), so a 10-page PDF → ~19 calls. Consider processing time and cost for large documents.
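
The call count in the note above can be estimated before uploading a large document. The sketch below assumes one extraction call per page and one boundary-correction call per pair of adjacent pages, which matches the 10-page example; the function name is hypothetical.

```python
def estimate_llm_vision_calls(page_count: int) -> int:
    """Estimate LLM calls for a PDF: one per page plus one per page boundary."""
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    if page_count == 0:
        return 0
    extraction_calls = page_count
    boundary_calls = page_count - 1  # assumption: one correction per adjacent page pair
    return extraction_calls + boundary_calls


print(estimate_llm_vision_calls(10))  # 19, as in the 10-page example
```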
LLM Vision requires the PyMuPDF package. Without it, document uploads using this engine will fail. Verify PyMuPDF installation in the deployment environment.
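
One way to verify the PyMuPDF requirement in a deployment is to check whether the module can be found without importing it. This is a sketch using only the standard library; the helper names are hypothetical. PyMuPDF has long been importable as `fitz`, and recent releases also provide the `pymupdf` module name, so both are checked.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located on this interpreter."""
    return importlib.util.find_spec(name) is not None


def pymupdf_available() -> bool:
    """Check both import names under which PyMuPDF may be installed."""
    return module_available("pymupdf") or module_available("fitz")


if not pymupdf_available():
    print("PyMuPDF not found: uploads using the LLM Vision engine are expected to fail.")
```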

Text Splitting

Configure how extracted text is split into search-friendly chunks.
| Setting | Default | Description |
| --- | --- | --- |
| Text Splitter | Character | Character: split by char count. Token: split by Tiktoken tokens. Semantic: meaning-based split |
| Chunk Size | 1000 | Chunk size (chars or tokens). Reference value in Semantic mode |
| Chunk Overlap | 100 | Overlap between chunks (preserves context flow) |

| Chunk Size | Pro | Con | Recommended |
| --- | --- | --- | --- |
| Small (≤ 500) | Precise search | Context may be cut | FAQ, short paragraphs |
| Medium (1000) | Balanced performance | | Most cases (default) |
| Large (≥ 2000) | Wider context preserved | Lower search precision | Long narrative documents |
When sentences are cut at chunk boundaries, related content may be missed during search. Overlap shares some text between adjacent chunks to preserve context. The default 100 is appropriate in most cases.
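
To make the effect of overlap concrete, here is a minimal sliding-window splitter. It is a sketch under a simplifying assumption: it cuts purely by character position, whereas the actual Character splitter may also respect separators such as paragraph breaks. The function name is hypothetical.

```python
from typing import List


def split_by_characters(text: str, chunk_size: int = 1000,
                        chunk_overlap: int = 100) -> List[str]:
    """Split text into fixed-size chunks that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(split_by_characters("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk begins with the last `chunk_overlap` characters of the previous one, so a sentence cut at a boundary still appears partly in both chunks.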
Enabling Bypass Embedding and Retrieval skips text splitting and embedding, injecting the entire document directly into the LLM context. Use only for small documents — large documents may exceed token limits.

Semantic Chunking

New feature — A method that splits chunks based on inter-sentence meaning similarity instead of fixed size.
When Text Splitter is set to Semantic, sentences are converted to embedding vectors, and chunk boundaries are auto-determined where similarity between adjacent sentences drops sharply.
| | Fixed Size (Character/Token) | Semantic |
| --- | --- | --- |
| Split criterion | Character / token count | Meaning similarity change |
| Pro | Fast and predictable | Splits at topic boundaries, improves search accuracy |
| Con | May cut mid-sentence/paragraph | Requires embedding engine, longer processing |
| Recommended | General documents | Long documents with frequent topic shifts |
Semantic chunking requires an embedding engine to be configured. Without one, picking Semantic causes errors.
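
The boundary detection described in this section can be sketched as follows. This is an illustration only: it assumes a fixed cosine-similarity threshold as the definition of a "sharp drop" (a real implementation might use a percentile or a relative change instead), and `embed` is a caller-supplied stand-in for the configured embedding engine.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: List[str],
                    embed: Callable[[str], Vector],
                    threshold: float = 0.5) -> List[str]:
    """Start a new chunk wherever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    vectors = [embed(sentence) for sentence in sentences]
    groups = [[sentences[0]]]
    for previous, current, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_similarity(previous, current) < threshold:
            groups.append([sentence])    # similarity dropped: topic boundary
        else:
            groups[-1].append(sentence)  # same topic: extend the current chunk
    return [" ".join(group) for group in groups]
```

Because every sentence must be embedded before any boundary can be found, processing takes longer than fixed-size splitting, and no splitting is possible without an embedding engine.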

Table Preservation

New feature — Preserve tables (HTML/markdown format) in documents intact instead of splitting.
Standard chunking can split tables across multiple chunks, losing row/column information. With Table Preservation enabled:
  1. Auto-detect tables in the document (HTML <table>, markdown |...|)
  2. Standard chunking on text portions only
  3. Tables are attached intact to the closest text chunk
  4. Tables exceeding chunk size are split row-by-row while preserving the header
Controlled by the Preserve Tables option in profile advanced settings (default: enabled).
Especially effective for table-heavy financial reports and technical specifications. Search accuracy for specific cell values within tables is greatly improved.
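
Step 4 above, splitting an oversized table row by row while repeating the header, can be sketched for markdown tables like this. The function name and the `max_chars` limit are hypothetical, and the sketch assumes the table starts with a header row followed by a separator row.

```python
from typing import List


def split_markdown_table(table: str, max_chars: int) -> List[str]:
    """Split a markdown table into pieces that each repeat the header rows."""
    lines = [line for line in table.strip().splitlines() if line.strip()]
    header_text = "\n".join(lines[:2])  # header row + separator row
    rows = lines[2:]
    pieces = []
    current = []
    for row in rows:
        candidate = "\n".join([header_text, *current, row])
        if current and len(candidate) > max_chars:
            # Adding this row would exceed the limit: close the current piece.
            pieces.append("\n".join([header_text, *current]))
            current = [row]
        else:
            current.append(row)
    if current or not pieces:
        pieces.append("\n".join([header_text, *current]))
    return pieces
```

A single row that is longer than `max_chars` on its own is still emitted as one piece, since rows are never cut in the middle.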

Contextual Chunking

New feature: For each chunk, an LLM generates a short summary of where the chunk sits within the entire document, and that summary is prepended to the chunk. This implements Anthropic’s Contextual Retrieval technique.
In later chunks of long documents, earlier context is lost, causing search matching failures. With contextual preservation enabled:
  1. After chunking completes, call LLM for each chunk
  2. Generate a summary of “where this chunk is in the entire document and its context”
  3. Prepend the summary to the chunk before vectorization
[Before contextual preservation]
  Chunk: "In conclusion, the initial hypothesis was confirmed."
  → Can't tell what "the initial hypothesis" refers to → search fails

[After contextual preservation]
  Chunk: "This chunk is the conclusion of the A Hypothesis Verification Report,
  containing the final judgment integrating 50 experiment results.

  In conclusion, the initial hypothesis was confirmed."
  → Includes context → search succeeds
| Setting | Description |
| --- | --- |
| Contextual Chunking | Activate/deactivate toggle |
| Model | LLM model used for context summarization |
One LLM call is made per chunk, so a document with 100 chunks → 100 LLM calls. Bulk uploading large documents can greatly increase API costs. Use selectively for small or important documents.
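
The three steps of this feature reduce to a short loop. The sketch below is an assumption-laden illustration: `summarize` is a hypothetical callable standing in for the configured LLM, and the blank line between summary and chunk follows the before/after example in this section.

```python
from typing import Callable, List


def add_context(document: str,
                chunks: List[str],
                summarize: Callable[[str, str], str]) -> List[str]:
    """Prepend an LLM-written context summary to every chunk.

    `summarize(document, chunk)` is called once per chunk, which is why
    the number of LLM calls equals the number of chunks.
    """
    contextualized = []
    for chunk in chunks:
        context = summarize(document, chunk).strip()
        contextualized.append(f"{context}\n\n{chunk}")
    return contextualized
```

The returned strings, not the original chunks, are what would be passed to the embedding engine.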

Embedding Settings

Configure the embedding engine and model that converts documents to vectors.
| Setting | Default | Description |
| --- | --- | --- |
| Embedding Engine | SentenceTransformers | Service for vector conversion |
| Embedding Model | all-MiniLM-L6-v2 | Model name to use |
| Batch Size | 32 | Documents processed per batch (only shown for Ollama, OpenAI, Azure) |
| Embedding Dimension | 0 (auto) | Vector dimension count. 0 uses model default |
Changing the embedding model makes existing document vectors incompatible. After changing the model, all Knowledge Bases must be reindexed.
Supported Engines:
SentenceTransformers: open-source embedding engine running locally.
| Setting | Description |
| --- | --- |
| Model | HuggingFace model name (e.g., sentence-transformers/all-MiniLM-L6-v2) |

File Upload Limits

| Setting | Default | Description |
| --- | --- | --- |
| Max file size | Unlimited | Single file upload max size (in MB) |
| Max files | Unlimited | Files usable simultaneously per chat |
| Allowed file extensions | All allowed | Comma-separated (e.g., pdf, docx, txt). Empty = all extensions allowed |
| PDF conversion extensions | Disabled | Extensions to convert to PDF via LibreOffice (e.g., pptx, docx, xlsx). LibreOffice installation required |
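
How the extension and size limits combine can be illustrated with a small check. This is a sketch, not the product's validation code: the function names are hypothetical, extension matching is assumed to be case-insensitive, and 1 MB is assumed to mean 1024 × 1024 bytes.

```python
from pathlib import Path
from typing import Optional, Set


def parse_allowed_extensions(setting: str) -> Set[str]:
    """Turn 'pdf, docx, txt' into {'pdf', 'docx', 'txt'}; empty string -> empty set."""
    return {part.strip().lower().lstrip(".")
            for part in setting.split(",") if part.strip()}


def is_upload_allowed(filename: str, size_bytes: int,
                      allowed_extensions: str = "",
                      max_size_mb: Optional[float] = None) -> bool:
    """Apply the extension list and size limit; empty or None means no limit."""
    allowed = parse_allowed_extensions(allowed_extensions)
    extension = Path(filename).suffix.lower().lstrip(".")
    if allowed and extension not in allowed:
        return False
    if max_size_mb is not None and size_bytes > max_size_mb * 1024 * 1024:
        return False
    return True
```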

Question Generation

A feature that pre-generates “questions a user might ask” for each chunk via LLM to improve search accuracy.
| Setting | Default | Description |
| --- | --- | --- |
| Enabled | OFF | Toggle on |
| Model | | LLM for question generation |
| Max questions per chunk | 10 | Range 1–20 |
| Question vector weight | 0.5 | 0.0 (content only) to 1.0 (questions only). 0.5 = equal weight |
Enabling adds LLM calls during document processing. Increases processing time and cost — use only when search accuracy is insufficient.
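
One plausible reading of the question vector weight is a linear blend between the chunk's content vector and the vector of its generated questions. The sketch below shows that interpretation only; how the product actually applies the weight is not stated here, and the function name is hypothetical.

```python
from typing import List, Sequence


def blend_vectors(content: Sequence[float], question: Sequence[float],
                  weight: float = 0.5) -> List[float]:
    """Linear blend: weight 0.0 keeps content only, 1.0 keeps questions only."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0.0 and 1.0")
    if len(content) != len(question):
        raise ValueError("vectors must have the same dimension")
    return [(1.0 - weight) * c + weight * q for c, q in zip(content, question)]


print(blend_vectors([1.0, 0.0], [0.0, 1.0], weight=0.5))  # [0.5, 0.5]
```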

Cloud Storage Integration

Import documents into Knowledge Bases from external cloud storage. This setting only provides toggles — actual auth info must be set via environment variables.
| Storage | Toggle | Required Environment Variables |
| --- | --- | --- |
| Google Drive | ON/OFF | GOOGLE_DRIVE_CLIENT_ID, GOOGLE_DRIVE_API_KEY |
| OneDrive | ON/OFF | ONEDRIVE_CLIENT_ID |
| SharePoint | ON/OFF | ONEDRIVE_CLIENT_ID_BUSINESS, SHAREPOINT_TENANT_ID, SHAREPOINT_SITE_URL |
Environment variables must be set to enable toggles. Enabling a toggle shows the cloud source in the Knowledge Base’s Add Content menu.
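
Before enabling a toggle, the required variables can be checked from a shell or a short script. The variable names below come from the table above; the helper function and the `REQUIRED_ENV` mapping are hypothetical conveniences.

```python
import os
from typing import List, Mapping

REQUIRED_ENV = {
    "Google Drive": ("GOOGLE_DRIVE_CLIENT_ID", "GOOGLE_DRIVE_API_KEY"),
    "OneDrive": ("ONEDRIVE_CLIENT_ID",),
    "SharePoint": ("ONEDRIVE_CLIENT_ID_BUSINESS", "SHAREPOINT_TENANT_ID",
                   "SHAREPOINT_SITE_URL"),
}


def missing_env_vars(storage: str, env: Mapping[str, str] = os.environ) -> List[str]:
    """List the required variables that are unset or empty for a storage type."""
    return [name for name in REQUIRED_ENV[storage] if not env.get(name)]


for storage in REQUIRED_ENV:
    missing = missing_env_vars(storage)
    status = "ready" if not missing else "missing " + ", ".join(missing)
    print(f"{storage}: {status}")
```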

Reindexing and Reset

The actions below are irreversible. Always confirm before executing.
| Action | Description | Impact Scope |
| --- | --- | --- |
| Reindex Knowledge Bases | Rebuild vector indexes for all KBs | All Knowledge Bases (takes time) |
| Reset Vector Storage | Delete all data in the Vector DB | Deletes all vectors → reindexing required |
| Reset Upload Directory | Delete all files uploaded to the server | Deletes original files |

When Reindexing is Needed

  • When you’ve changed the embedding model
  • When you’ve changed the search engine (Vector DB)
  • When you’ve changed chunk size/overlap
  • When document processing has issues
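
The first three conditions in the list above are settings changes, so they can be detected by comparing the configuration before and after an edit. This is a minimal sketch; the dictionary keys are hypothetical names for the corresponding settings, not identifiers used by the product.

```python
from typing import Any, Dict

# Hypothetical keys for the settings whose change calls for reindexing.
REINDEX_KEYS = ("embedding_model", "vector_db", "chunk_size", "chunk_overlap")


def needs_reindex(old: Dict[str, Any], new: Dict[str, Any]) -> bool:
    """Return True if any reindex-relevant setting differs between configs."""
    return any(old.get(key) != new.get(key) for key in REINDEX_KEYS)
```

Changes to unrelated settings, such as upload limits, leave the result `False`.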

Knowledge Base

Create Knowledge Bases and manage documents — pick profile per KB

Search Engine

Vector DB and search parameter settings

Dynamic Filters

Knowledge Base metadata filter settings