Knowledge Ingestion Guide

This guide covers best practices for ingesting documents into Cortex’s knowledge store, including chunking strategies, metadata usage, and collection organization.

Overview

The knowledge store provides vector-indexed document storage for RAG pipelines. When you ingest a document, Cortex:

Chunks the document into smaller segments
Embeds each chunk using the Iris SDK, which calls your configured provider API
Indexes chunks for vector and full-text search
Stores metadata for filtering and attribution

Collections

Collections group related documents. Think of them as folders or categories.

Creating Collections

cortex knowledge create-collection \
  --name product-docs \
  --description "Product documentation and guides"

Collection Best Practices

Common ways to organize collections:

By domain: engineering-docs, support-articles, legal-contracts
By source: github-issues, slack-exports, confluence-pages
By version: v2-docs, v3-docs for versioned documentation

Ingesting Documents

Single Document

cortex knowledge ingest \
  --collection product-docs \
  --title "Getting Started Guide" \
  --file getting-started.md

Directory Ingestion

cortex knowledge ingest-dir \
  --collection product-docs \
  --dir ./docs \
  --pattern "*.md" \
  --recursive

With Metadata

Attach metadata for filtering and attribution:

cortex knowledge ingest \
  --collection support-articles \
  --title "Password Reset FAQ" \
  --file password-reset.md \
  --metadata '{"category": "authentication", "author": "support-team", "last_updated": "2024-01-15"}'

Chunking Strategies

Chunking determines how documents are split for embedding. The right strategy depends on your content type.

Fixed Chunking

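Splits by token count. Simple, but a chunk can end mid-sentence; the overlap setting repeats tokens across adjacent chunks to preserve some context at each boundary.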

cortex knowledge ingest \
  --collection docs \
  --title "API Reference" \
  --file api.md \
  --chunk-strategy fixed \
  --chunk-max-tokens 500 \
  --chunk-overlap 50

Best for: Structured content, code documentation, tables
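
To make the interaction between the maximum size and the overlap concrete, here is a minimal sketch of fixed-size chunking. It is not Cortex's implementation: the function name chunk_fixed is hypothetical, and whitespace-separated words stand in for model tokens, which a real tokenizer would count differently.

```python
def chunk_fixed(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of at most max_tokens tokens.

    Each window repeats the last `overlap` tokens of the previous one.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")

    # Assumption: whitespace words approximate tokens for illustration only.
    tokens = text.split()
    step = max_tokens - overlap  # how far the window advances each time

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```

With max_tokens 500 and overlap 50, each window advances by 450 tokens, so a larger overlap means more chunks (and more embedding calls) for the same document.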

Sentence Chunking

Splits on sentence boundaries. Preserves semantic units.

cortex knowledge ingest \
  --collection docs \
  --title "User Guide" \
  --file guide.md \
  --chunk-strategy sentence \
  --chunk-max-tokens 512

Best for: Prose, FAQs, narrative documentation
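
As a rough illustration of splitting on sentence boundaries, the sketch below packs whole sentences into a chunk until the token budget would be exceeded. The name chunk_sentences, the regex-based sentence splitter, and the word-count token estimate are all assumptions made for the example; Cortex's own splitter may behave differently.

```python
import re


def chunk_sentences(text: str, max_tokens: int = 512) -> list[str]:
    """Group whole sentences into chunks of at most max_tokens tokens."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())  # assumption: words approximate tokens
        # Close the current chunk if this sentence would overflow it.
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

In this sketch a single sentence longer than max_tokens is kept whole rather than cut, which trades a strict size limit for intact sentences.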

Paragraph Chunking

Splits on paragraph breaks. Preserves topic coherence.

cortex knowledge ingest \
  --collection docs \
  --title "Blog Post" \
  --file post.md \
  --chunk-strategy paragraph \
  --chunk-max-tokens 1024

Best for: Long-form content, articles, reports

Semantic Chunking

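Uses embeddings to detect topic boundaries. The most adaptive of the four strategies, but also the slowest, because embeddings are needed during chunking as well as indexing.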

cortex knowledge ingest \
  --collection docs \
  --title "Research Paper" \
  --file paper.md \
  --chunk-strategy semantic \
  --chunk-max-tokens 1024

Best for: Mixed content, research documents, complex technical docs
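
One common way to detect topic boundaries with embeddings is to compare adjacent sentences and start a new chunk when their similarity drops. The sketch below shows that idea only; it is not necessarily how Cortex does it. The embed function is a toy word-count vector standing in for a real embedding model (it does not call the Iris SDK), and the threshold value is an arbitrary assumption.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    # Toy stand-in for a real embedding model: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk_semantic(sentences: list[str], threshold: float = 0.2,
                   max_tokens: int = 1024) -> list[list[str]]:
    """Start a new chunk when adjacent sentences are dissimilar
    or the token budget would be exceeded."""
    chunks: list[list[str]] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())  # assumption: words approximate tokens
        if current:
            topic_shift = cosine(embed(current[-1]), embed(sentence)) < threshold
            if topic_shift or count + n > max_tokens:
                chunks.append(current)
                current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(current)
    return chunks
```

Every sentence needs an embedding before a single chunk is emitted, which is why this strategy costs more than the others.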

Chunk Size Guidelines

Content Type	Strategy	Max Tokens	Overlap
API docs	`fixed`	400-600	50
FAQs	`sentence`	300-500	30
Tutorials	`paragraph`	800-1200	100
Research	`semantic`	1000-1500	150
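
If ingestion is scripted, the guidelines above can be kept in one lookup table. The sketch below turns a row into the chunking flags used elsewhere in this guide. The content-type keys and the chunk_flags helper are hypothetical, starting from the midpoint of each range is only one reasonable default, and this guide only shows --chunk-overlap together with the fixed strategy, so passing it with other strategies is an assumption.

```python
# Rows mirror the guideline table: strategy, (min, max) tokens, overlap.
GUIDELINES = {
    "api-docs": ("fixed", (400, 600), 50),
    "faqs": ("sentence", (300, 500), 30),
    "tutorials": ("paragraph", (800, 1200), 100),
    "research": ("semantic", (1000, 1500), 150),
}


def chunk_flags(content_type: str) -> list[str]:
    """Return CLI chunking flags for a content type (keys are hypothetical)."""
    strategy, (low, high), overlap = GUIDELINES[content_type]
    max_tokens = (low + high) // 2  # assumption: start from the midpoint
    return [
        "--chunk-strategy", strategy,
        "--chunk-max-tokens", str(max_tokens),
        "--chunk-overlap", str(overlap),
    ]
```

The returned list can be appended to a cortex knowledge ingest argument list, for example when launching the command with subprocess.run.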

Supported File Types

Extension	Handling
`.md`	Markdown parsed, headings preserved
`.txt`	Plain text
`.html`	HTML stripped, text extracted
`.pdf`	Text extracted (requires PDF support)
`.json`	Structured data, keys become metadata

Metadata Filtering

When searching, filter by metadata:

cortex knowledge search "authentication" \
  --collection support-articles \
  --metadata-filter '{"category": "authentication"}'

Common Metadata Fields

Field	Description
`source`	Original document location
`author`	Document author
`version`	Document version
`category`	Topic category
`tags`	Array of tags
`created_at`	Creation date
`updated_at`	Last update date
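
The filter semantics are not spelled out in this guide. A plausible reading, sketched below, is that every key in the filter must match the chunk's metadata exactly, with array fields such as tags matching when they contain the requested value. Both rules, and the function name matches_filter, are assumptions made for illustration.

```python
def matches_filter(metadata: dict, metadata_filter: dict) -> bool:
    """Return True if metadata satisfies every key in metadata_filter."""
    for key, wanted in metadata_filter.items():
        if key not in metadata:
            return False
        actual = metadata[key]
        if isinstance(actual, list) and not isinstance(wanted, list):
            # Assumption: a scalar filter matches an array field by membership.
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True
```

For example, a chunk with category "authentication" and tags ["oauth", "sso"] would pass the filter {"category": "authentication", "tags": "sso"} under these rules.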

Bulk Ingestion Patterns

Directory with Metadata Extraction

For directories where file paths encode metadata (here, path segments that supply a version and a category):

# Extract: version=v2, category=authentication

cortex knowledge ingest-dir \
  --collection product-docs \
  --dir ./docs \
  --pattern "**/*.md" \
  --recursive \
  --extract-path-metadata
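
The exact path convention that --extract-path-metadata expects is not shown here. As an illustration only, the sketch below assumes a layout of docs/<version>/<category>/<file>, which would produce the version and category values in the comment above. The layout, the file name used in the example, and the path_metadata helper are all hypothetical.

```python
from pathlib import PurePosixPath


def path_metadata(path: str, root: str = "docs") -> dict:
    """Derive metadata from a path, assuming <root>/<version>/<category>/<file>."""
    parts = PurePosixPath(path).relative_to(root).parts
    if len(parts) < 3:
        return {}  # not deep enough to carry both a version and a category
    return {"version": parts[0], "category": parts[1]}


# Hypothetical file name; only the directory segments matter here.
print(path_metadata("docs/v2/authentication/oauth.md"))
```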

Incremental Updates

Re-ingest documents that have changed:

# Only ingest files modified since last sync
cortex knowledge ingest-dir \
  --collection product-docs \
  --dir ./docs \
  --pattern "*.md" \
  --incremental
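
How Cortex records the last sync is not described in this guide. One simple way to approximate "modified since last sync" is to compare file modification times against a stored timestamp, as in the sketch below. The changed_since helper and the idea of storing last_sync as a Unix timestamp are assumptions; a content hash per file is a common alternative when modification times are unreliable.

```python
from pathlib import Path


def changed_since(root: str, last_sync: float, pattern: str = "*.md") -> list[Path]:
    """List files directly under root matching pattern, modified after last_sync.

    last_sync is a Unix timestamp (assumption: stored by the caller).
    """
    return sorted(
        p for p in Path(root).glob(pattern)
        if p.is_file() and p.stat().st_mtime > last_sync
    )
```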

Searching Ingested Content

Hybrid Search

Combines vector similarity with keyword matching:

cortex knowledge search "how to configure OAuth" \
  --collection product-docs \
  --mode hybrid \
  --limit 5
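
This guide does not say how the two result lists are merged. Reciprocal rank fusion (RRF) is one widely used technique for combining a vector ranking with a keyword ranking, and the sketch below shows it as an illustration, not as Cortex's actual scoring. The function name and the chunk IDs in the example are hypothetical; k = 60 is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60,
                           limit: int = 5) -> list[str]:
    """Merge several ranked lists of chunk IDs into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes more for higher-ranked items.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for stable output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:limit]


vector_hits = ["a", "b", "c"]   # hypothetical IDs from vector search
keyword_hits = ["b", "d", "a"]  # hypothetical IDs from keyword search
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Chunks that appear in both lists rise to the top, which is the practical benefit of hybrid mode for queries that mix natural language with exact terms.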

Vector-Only Search

Pure semantic similarity:

cortex knowledge search "authentication setup process" \
  --mode vector

Full-Text Search

Keyword-based with BM25 ranking:

cortex knowledge search "OAuth2 redirect_uri" \
  --mode fts
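
For readers unfamiliar with BM25, the sketch below scores a few strings with the standard Okapi BM25 formula. It is a teaching example under simplifying assumptions: tokens are lowercase whitespace-separated words, k1 and b use common default values, and the function name bm25_scores is hypothetical. Cortex's index will tokenize and tune differently.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5,
                b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    if n == 0:
        return []
    avg_len = (sum(len(t) for t in tokenized) / n) or 1.0

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)  # docs containing term
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            f = tf[term]
            # Term frequency saturates via k1; b normalizes for length.
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores
```

Rare terms such as redirect_uri receive a higher inverse document frequency than common ones, which is why full-text mode suits exact identifiers.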

Maintenance

View Statistics

cortex knowledge stats --collection product-docs

Output:

Collection: product-docs
Documents: 145
Chunks: 2,340
Total Tokens: 890,000
Last Ingested: 2024-01-15T10:30:00Z

Re-embed Collection

If you change embedding models, re-embed the collection so that stored vectors match the new model:

cortex knowledge reembed --collection product-docs

Delete Documents

# By document ID
cortex knowledge delete --document-id doc_abc123

# By collection
cortex knowledge delete --collection old-docs

MCP Tool Usage

Over MCP, call the knowledge_ingest tool to add a document:

{
  "tool": "knowledge_ingest",
  "arguments": {
    "collection": "product-docs",
    "title": "API Reference",
    "content": "# Authentication\n\nTo authenticate...",
    "chunk_strategy": "paragraph",
    "metadata": {
      "category": "api",
      "version": "2.0"
    }
  }
}

Use the knowledge_search tool to query ingested content:

{
  "tool": "knowledge_search",
  "arguments": {
    "query": "how to authenticate",
    "collection": "product-docs",
    "mode": "hybrid",
    "limit": 5
  }
}
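
MCP clients normally send tool invocations as JSON-RPC 2.0 requests with the method tools/call. Assuming the Cortex MCP server follows that standard envelope, the sketch below wraps a tool call in the shape shown above into such a request. The to_tools_call helper is hypothetical, and most MCP client libraries build this envelope for you.

```python
import json


def to_tools_call(call: dict, request_id: int = 1) -> str:
    """Wrap {"tool": ..., "arguments": ...} in an MCP tools/call request."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": call["tool"], "arguments": call["arguments"]},
    }
    return json.dumps(request)


search = {
    "tool": "knowledge_search",
    "arguments": {
        "query": "how to authenticate",
        "collection": "product-docs",
        "mode": "hybrid",
        "limit": 5,
    },
}
print(to_tools_call(search))
```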

Next Steps

Entity Extraction - Auto-extract entities from ingested content
MCP Tools Reference - Complete MCP tool documentation
CLI Reference - All CLI commands