Knowledge Ingestion

This guide covers best practices for ingesting documents into Cortex’s knowledge store, including chunking strategies, metadata usage, and collection organization.

The knowledge store provides vector-indexed document storage for RAG pipelines. When you ingest a document, Cortex:

  1. Chunks the document into smaller segments
  2. Embeds each chunk using the Iris SDK (calls your configured provider API)
  3. Indexes chunks for vector and full-text search
  4. Stores metadata for filtering and attribution
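
The four steps above can be sketched in a few lines of Python. This is only an illustration of the flow, not Cortex's implementation: `toy_embed`, `Chunk`, and `ingest` are hypothetical names, and the hash-based "embedding" stands in for the real provider call.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)


def toy_embed(text: str, dims: int = 8) -> list[float]:
    """Stand-in for a real embedding call: hash bytes scaled to [0, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:dims]]


def ingest(title: str, content: str, metadata: dict, max_words: int = 50) -> list[Chunk]:
    words = content.split()
    # Step 1: chunk into fixed-size word windows (placeholder for a real strategy).
    pieces = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    chunks = []
    for position, piece in enumerate(pieces):
        # Step 2: embed each chunk.
        vector = toy_embed(piece)
        # Step 4: keep metadata next to the chunk for filtering and attribution.
        chunk_meta = {**metadata, "title": title, "position": position}
        chunks.append(Chunk(piece, vector, chunk_meta))
    # Step 3: a real store would now add these chunks to its vector and
    # full-text indexes; this sketch simply returns them.
    return chunks
```

Each returned chunk carries its own copy of the document metadata, which is what makes per-chunk filtering possible at search time.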

Collections group related documents. Think of them as folders or categories.

cortex knowledge create-collection \
--name product-docs \
--description "Product documentation and guides"

Common ways to organize collections:

  • By domain: engineering-docs, support-articles, legal-contracts
  • By source: github-issues, slack-exports, confluence-pages
  • By version: v2-docs, v3-docs for versioned documentation

Ingest a single file:

cortex knowledge ingest \
--collection product-docs \
--title "Getting Started Guide" \
--file getting-started.md

Ingest every matching file in a directory:

cortex knowledge ingest-dir \
--collection product-docs \
--dir ./docs \
--pattern "*.md" \
--recursive

Attach metadata for filtering and attribution:

cortex knowledge ingest \
--collection support-articles \
--title "Password Reset FAQ" \
--file password-reset.md \
--metadata '{"category": "authentication", "author": "support-team", "last_updated": "2024-01-15"}'
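
When the command is generated from a script, hand-writing the JSON string invites quoting mistakes. A minimal sketch, assuming you assemble the argument list yourself: `build_ingest_command` is a hypothetical helper, and it uses only the flags shown above.

```python
import json
import shlex


def build_ingest_command(collection: str, title: str, file: str, metadata: dict) -> list[str]:
    """Return the argument list for the ingest command shown above."""
    return [
        "cortex", "knowledge", "ingest",
        "--collection", collection,
        "--title", title,
        "--file", file,
        # json.dumps guarantees valid JSON (double quotes, escaped characters).
        "--metadata", json.dumps(metadata),
    ]


args = build_ingest_command(
    "support-articles",
    "Password Reset FAQ",
    "password-reset.md",
    {"category": "authentication", "author": "support-team", "last_updated": "2024-01-15"},
)
# shlex.join quotes each argument so the result can be pasted into a shell.
print(shlex.join(args))
```

Passing `args` to `subprocess.run` as a list avoids shell quoting entirely.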

Chunking determines how documents are split for embedding. The right strategy depends on your content type.

The fixed strategy splits by token count. It is simple but may break mid-sentence.

cortex knowledge ingest \
--collection docs \
--title "API Reference" \
--file api.md \
--chunk-strategy fixed \
--chunk-max-tokens 500 \
--chunk-overlap 50

Best for: Structured content, code documentation, tables
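
To see how the maximum size and the overlap interact in fixed chunking, here is a minimal sketch. It assumes the input is already tokenized; `chunk_fixed` is a hypothetical name, and Cortex's own tokenizer and edge-case handling may differ.

```python
def chunk_fixed(tokens: list[str], max_tokens: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of max_tokens that share `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

With 500 tokens per chunk and an overlap of 50, each window advances by 450 tokens, so a 1,200-token document yields three chunks.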

The sentence strategy splits on sentence boundaries, which preserves semantic units.

cortex knowledge ingest \
--collection docs \
--title "User Guide" \
--file guide.md \
--chunk-strategy sentence \
--chunk-max-tokens 512

Best for: Prose, FAQs, narrative documentation
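
Sentence chunking can be pictured as packing whole sentences into a chunk until the budget is reached. The sketch below makes two assumptions: a naive regex stands in for a real sentence splitter, and whitespace-separated words stand in for tokens. `chunk_sentences` is a hypothetical name.

```python
import re


def chunk_sentences(text: str, max_tokens: int) -> list[str]:
    """Pack whole sentences into chunks of at most max_tokens words."""
    # Naive splitter: break after ., !, or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A sentence longer than the budget is kept whole in this sketch, so a chunk can exceed `max_tokens` rather than break mid-sentence.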

The paragraph strategy splits on paragraph breaks, which preserves topic coherence.

cortex knowledge ingest \
--collection docs \
--title "Blog Post" \
--file post.md \
--chunk-strategy paragraph \
--chunk-max-tokens 1024

Best for: Long-form content, articles, reports
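
Paragraph chunking needs a policy for paragraphs that exceed the budget on their own. The sketch below assumes blank lines separate paragraphs, counts words instead of tokens, and hard-splits oversized paragraphs. `chunk_paragraphs` is a hypothetical name, and that fallback is an assumption rather than documented Cortex behavior.

```python
import re


def chunk_paragraphs(text: str, max_tokens: int) -> list[str]:
    """Group whole paragraphs into chunks of at most max_tokens words."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    buffer: list[str] = []
    buffer_len = 0

    def flush() -> None:
        nonlocal buffer, buffer_len
        if buffer:
            chunks.append("\n\n".join(buffer))
            buffer, buffer_len = [], 0

    for para in paragraphs:
        words = para.split()
        if len(words) > max_tokens:
            # Oversized paragraph: emit what is buffered, then hard-split it.
            flush()
            for i in range(0, len(words), max_tokens):
                chunks.append(" ".join(words[i:i + max_tokens]))
            continue
        if buffer_len + len(words) > max_tokens:
            flush()
        buffer.append(para)
        buffer_len += len(words)
    flush()
    return chunks
```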

The semantic strategy uses embeddings to detect topic boundaries. It is the most content-aware strategy but also the slowest.

cortex knowledge ingest \
--collection docs \
--title "Research Paper" \
--file paper.md \
--chunk-strategy semantic \
--chunk-max-tokens 1024

Best for: Mixed content, research documents, complex technical docs
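
One way to picture boundary detection is to compare each sentence with the previous one and start a new chunk when similarity drops. In the sketch below, bag-of-words counts stand in for real embeddings and the 0.2 threshold is arbitrary. All names are hypothetical, and the actual boundary rule Cortex uses is not described here.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def chunk_semantic(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk whenever a sentence is dissimilar to the previous one."""
    chunks: list[list[str]] = []
    for sentence in sentences:
        if chunks and cosine(bow(chunks[-1][-1]), bow(sentence)) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
    return chunks
```

The extra similarity computation for every adjacent pair is the reason this strategy costs more than the others. A production version would also enforce the maximum token limit, which this sketch omits.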

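Recommended settings by content type:

| Content Type | Strategy | Max Tokens | Overlap |
| --- | --- | --- | --- |
| API docs | fixed | 400-600 | 50 |
| FAQs | sentence | 300-500 | 30 |
| Tutorials | paragraph | 800-1200 | 100 |
| Research | semantic | 1000-1500 | 150 |

Supported file types:

| Extension | Handling |
| --- | --- |
| .md | Markdown parsed, headings preserved |
| .txt | Plain text |
| .html | HTML stripped, text extracted |
| .pdf | Text extracted (requires PDF support) |
| .json | Structured data, keys become metadata |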

When searching, filter by metadata:

cortex knowledge search "authentication" \
--collection support-articles \
--metadata-filter '{"category": "authentication"}'
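
Conceptually, a metadata filter narrows the candidate set before or after ranking. The sketch below assumes exact-match semantics, where every key in the filter must be present with an equal value. `matches_filter` is a hypothetical name, and whether Cortex supports richer operators is not covered here.

```python
def matches_filter(metadata: dict, metadata_filter: dict) -> bool:
    """True when every filter key is present in metadata with an equal value."""
    return all(
        key in metadata and metadata[key] == value
        for key, value in metadata_filter.items()
    )


documents = [
    {"title": "Password Reset FAQ", "metadata": {"category": "authentication", "author": "support-team"}},
    {"title": "Invoice FAQ", "metadata": {"category": "billing", "author": "support-team"}},
]
hits = [d["title"] for d in documents if matches_filter(d["metadata"], {"category": "authentication"})]
print(hits)
```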

Common metadata fields:

| Field | Description |
| --- | --- |
| source | Original document location |
| author | Document author |
| version | Document version |
| category | Topic category |
| tags | Array of tags |
| created_at | Creation date |
| updated_at | Last update date |

For directories where file paths encode metadata:

docs/v2/authentication/oauth.md
# Extract: version=v2, category=authentication

cortex knowledge ingest-dir \
--collection product-docs \
--dir ./docs \
--pattern "**/*.md" \
--recursive \
--extract-path-metadata
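
The mapping from path to metadata in the example above can be sketched as follows. It assumes a `<root>/<version>/<category>/<file>` layout. `extract_path_metadata` is a hypothetical helper, and the exact rules behind `--extract-path-metadata` are not documented in this guide.

```python
from pathlib import PurePosixPath


def extract_path_metadata(path: str, root: str = "docs") -> dict:
    """Derive version and category from <root>/<version>/<category>/<file>."""
    try:
        parts = PurePosixPath(path).relative_to(root).parts
    except ValueError:
        return {}  # path is not under the root directory
    if len(parts) < 3:
        return {}  # not enough directory levels to infer both fields
    return {"version": parts[0], "category": parts[1]}


print(extract_path_metadata("docs/v2/authentication/oauth.md"))
```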

Re-ingest documents that have changed:

# Only ingest files modified since last sync
cortex knowledge ingest-dir \
--collection product-docs \
--dir ./docs \
--pattern "*.md" \
--incremental
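
Incremental ingestion depends on knowing which files changed since the previous run. This guide does not say whether Cortex compares timestamps, hashes, or something else. The sketch below uses content hashes as one plausible approach, with `changed_files` and the `seen` state dictionary as hypothetical names.

```python
import hashlib
from pathlib import Path


def changed_files(directory: Path, pattern: str, seen: dict[str, str]) -> list[Path]:
    """Return files whose content hash differs from the recorded one.

    `seen` maps file paths to the hash recorded on the previous run and is
    updated in place, so it can be persisted between syncs.
    """
    changed = []
    for path in sorted(directory.glob(pattern)):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        key = str(path)
        if seen.get(key) != digest:
            changed.append(path)
            seen[key] = digest
    return changed
```

Hashing catches edits even when timestamps are unreliable, for example after a fresh git checkout, at the cost of reading every file.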

Hybrid mode combines vector similarity with keyword matching:

cortex knowledge search "how to configure OAuth" \
--collection product-docs \
--mode hybrid \
--limit 5
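
A common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). Whether Cortex uses RRF or a weighted score blend is not stated here, so treat this as an illustration of the general idea. `reciprocal_rank_fusion` and the constant `k = 60` are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list earn the largest share.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for stability.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
print(reciprocal_rank_fusion([vector_hits, keyword_hits])[:5])
```

Documents that appear in both lists rise to the top, which is why hybrid search tends to be a reasonable default when queries mix natural language with exact terms.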

Vector mode ranks by pure semantic similarity:

cortex knowledge search "authentication setup process" \
--mode vector

Full-text search (fts) mode is keyword-based with BM25 ranking:

cortex knowledge search "OAuth2 redirect_uri" \
--mode fts
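
BM25 rewards documents that contain rare query terms and normalizes for document length. A minimal sketch of the standard formula follows, assuming whitespace tokenization and the common defaults `k1 = 1.5` and `b = 0.75`. Cortex's tokenizer and parameters may differ.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs if n_docs else 0.0
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Because scoring is driven by exact term matches, this mode suits identifiers such as `redirect_uri` that embeddings may blur together with related concepts.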

View statistics for a collection:

cortex knowledge stats --collection product-docs

Output:

Collection: product-docs
Documents: 145
Chunks: 2,340
Total Tokens: 890,000
Last Ingested: 2024-01-15T10:30:00Z
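
Derived figures such as the average chunk size are a quick sanity check on a chunking configuration. The sketch below parses the sample output shown above. It assumes the `Key: value` layout is stable, which this guide does not guarantee, and `parse_stats` and `to_int` are hypothetical helpers.

```python
def parse_stats(output: str) -> dict[str, str]:
    """Parse 'Key: value' lines into a dictionary."""
    stats = {}
    for line in output.splitlines():
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


def to_int(value: str) -> int:
    """Convert a number with thousands separators, such as '2,340'."""
    return int(value.replace(",", ""))


sample = """Collection: product-docs
Documents: 145
Chunks: 2,340
Total Tokens: 890,000
Last Ingested: 2024-01-15T10:30:00Z"""

stats = parse_stats(sample)
chunks_per_doc = to_int(stats["Chunks"]) / to_int(stats["Documents"])
tokens_per_chunk = to_int(stats["Total Tokens"]) / to_int(stats["Chunks"])
print(f"{chunks_per_doc:.1f} chunks per document, {tokens_per_chunk:.0f} tokens per chunk")
```

For the sample above this works out to roughly 16 chunks per document and about 380 tokens per chunk.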

If you change embedding models, re-embed the collection:

cortex knowledge reembed --collection product-docs

Delete a single document or an entire collection:

# By document ID
cortex knowledge delete --document-id doc_abc123
# By collection
cortex knowledge delete --collection old-docs

Via MCP, use knowledge_ingest:

{
  "tool": "knowledge_ingest",
  "arguments": {
    "collection": "product-docs",
    "title": "API Reference",
    "content": "# Authentication\n\nTo authenticate...",
    "chunk_strategy": "paragraph",
    "metadata": {
      "category": "api",
      "version": "2.0"
    }
  }
}

And knowledge_search:

{
  "tool": "knowledge_search",
  "arguments": {
    "query": "how to authenticate",
    "collection": "product-docs",
    "mode": "hybrid",
    "limit": 5
  }
}
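
If you generate these tool calls from code, building the payload as a dictionary and serializing it avoids hand-escaping strings. The sketch below reproduces the search call above. `knowledge_search_call` is a hypothetical helper, and the assumption that the MCP tool accepts the same three modes as the CLI is not confirmed by this guide. How the payload is sent depends on your MCP client.

```python
import json

# Mode names taken from the CLI examples; assumed to apply to the MCP tool too.
SEARCH_MODES = {"hybrid", "vector", "fts"}


def knowledge_search_call(query: str, collection: str, mode: str = "hybrid", limit: int = 5) -> str:
    """Serialize a knowledge_search tool call as JSON."""
    if mode not in SEARCH_MODES:
        raise ValueError(f"unknown mode: {mode}")
    payload = {
        "tool": "knowledge_search",
        "arguments": {
            "query": query,
            "collection": collection,
            "mode": mode,
            "limit": limit,
        },
    }
    return json.dumps(payload, indent=2)


print(knowledge_search_call("how to authenticate", "product-docs"))
```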