Knowledge Ingestion Guide
This guide covers best practices for ingesting documents into Cortex’s knowledge store, including chunking strategies, metadata usage, and collection organization.
Overview
The knowledge store provides vector-indexed document storage for RAG pipelines. When you ingest a document, Cortex:
- Chunks the document into smaller segments
- Embeds each chunk using the Iris SDK (calls your configured provider API)
- Indexes chunks for vector and full-text search
- Stores metadata for filtering and attribution
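
The four steps above can be sketched in a few lines of Python. This is an illustrative toy and not Cortex's implementation: `embed` is a stand-in that counts words into buckets rather than calling the Iris SDK, words stand in for tokens, and every name here (`Chunk`, `chunk`, `embed`, `ingest`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    doc_title: str
    text: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)


def chunk(text: str, max_words: int = 50) -> list[str]:
    """Step 1: split the document into segments (words stand in for tokens)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str, dims: int = 8) -> list[float]:
    """Step 2: placeholder embedding; a real pipeline would call the provider API."""
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[sum(map(ord, word)) % dims] += 1.0
    return vec


def ingest(store: list[Chunk], title: str, text: str, metadata: dict) -> int:
    """Steps 3 and 4: keep each chunk's vector alongside its metadata."""
    pieces = chunk(text)
    for piece in pieces:
        store.append(Chunk(title, piece, embed(piece), dict(metadata)))
    return len(pieces)


store: list[Chunk] = []
count = ingest(store, "Getting Started Guide", "word " * 120, {"category": "onboarding"})
print(count, "chunks stored")
```

Keeping the metadata on every chunk, rather than only on the parent document, is what lets a search result be filtered and attributed without a second lookup.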
Collections
Collections group related documents. Think of them as folders or categories.
Creating Collections

    cortex knowledge create-collection \
      --name product-docs \
      --description "Product documentation and guides"

Collection Best Practices
- By domain: engineering-docs, support-articles, legal-contracts
- By source: github-issues, slack-exports, confluence-pages
- By version: v2-docs, v3-docs for versioned documentation
Ingesting Documents
Single Document

    cortex knowledge ingest \
      --collection product-docs \
      --title "Getting Started Guide" \
      --file getting-started.md

Directory Ingestion

    cortex knowledge ingest-dir \
      --collection product-docs \
      --dir ./docs \
      --pattern "*.md" \
      --recursive

With Metadata
Attach metadata for filtering and attribution:

    cortex knowledge ingest \
      --collection support-articles \
      --title "Password Reset FAQ" \
      --file password-reset.md \
      --metadata '{"category": "authentication", "author": "support-team", "last_updated": "2024-01-15"}'

Chunking Strategies
Chunking determines how documents are split for embedding. The right strategy depends on your content type.
Fixed Chunking
Splits by token count. Simple, but chunks may break mid-sentence.

    cortex knowledge ingest \
      --collection docs \
      --title "API Reference" \
      --file api.md \
      --chunk-strategy fixed \
      --chunk-max-tokens 500 \
      --chunk-overlap 50

Best for: Structured content, code documentation, tables
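
To make the interaction between a token limit and an overlap concrete, here is a minimal sketch of a sliding-window splitter. It assumes overlap means that each chunk repeats the last tokens of the previous one, which the guide does not spell out, and it takes a pre-tokenized list instead of using a real tokenizer. The function name `fixed_chunks` is hypothetical.

```python
def fixed_chunks(tokens: list[str], max_tokens: int = 500, overlap: int = 50) -> list[list[str]]:
    """Slide a window of max_tokens over the input, stepping by max_tokens - overlap."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the window already reached the end of the document
    return chunks


tokens = [f"t{i}" for i in range(1200)]
chunks = fixed_chunks(tokens, max_tokens=500, overlap=50)
print([len(c) for c in chunks])  # [500, 500, 300]
```

With these settings the window advances 450 tokens at a time, so the last 50 tokens of one chunk reappear at the start of the next.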
Sentence Chunking
Splits on sentence boundaries. Preserves semantic units.

    cortex knowledge ingest \
      --collection docs \
      --title "User Guide" \
      --file guide.md \
      --chunk-strategy sentence \
      --chunk-max-tokens 512

Best for: Prose, FAQs, narrative documentation
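
One plausible way to honor both sentence boundaries and a token budget is to pack whole sentences greedily until the next one would exceed the limit. The sketch below assumes that behavior; it uses a simple punctuation regex as the sentence splitter and counts words instead of tokens, so it only approximates what a real chunker does. `sentence_chunks` is a hypothetical name.

```python
import re


def sentence_chunks(text: str, max_tokens: int = 512) -> list[str]:
    """Pack whole sentences into chunks without exceeding max_tokens (words as tokens)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)  # a single oversized sentence is kept whole
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Open settings. Click Reset password. Check your email! Did it arrive? Follow the link."
result = sentence_chunks(text, max_tokens=6)
for piece in result:
    print(piece)
```

No sentence is ever cut in two here, which is the property that distinguishes this strategy from fixed chunking.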
Paragraph Chunking
Splits on paragraph breaks. Preserves topic coherence.

    cortex knowledge ingest \
      --collection docs \
      --title "Blog Post" \
      --file post.md \
      --chunk-strategy paragraph \
      --chunk-max-tokens 1024

Best for: Long-form content, articles, reports
Semantic Chunking
Uses embeddings to detect topic boundaries. Most intelligent but slowest.

    cortex knowledge ingest \
      --collection docs \
      --title "Research Paper" \
      --file paper.md \
      --chunk-strategy semantic \
      --chunk-max-tokens 1024

Best for: Mixed content, research documents, complex technical docs
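
The idea of detecting topic boundaries from embeddings can be illustrated with a toy: compare each sentence to the previous one and start a new chunk when their similarity drops below a threshold. The sketch below substitutes bag-of-words counts for real embeddings and ignores the token limit, so it shows the shape of the technique only. The names, the threshold value, and the sample sentences are all assumptions.

```python
import math
import re
from collections import Counter


def bow(sentence: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk whenever adjacent sentences fall below the similarity threshold."""
    chunks = [[sentences[0]]] if sentences else []
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(bow(prev), bow(cur)) < threshold:
            chunks.append([cur])
        else:
            chunks[-1].append(cur)
    return chunks


sentences = [
    "OAuth tokens expire after one hour.",
    "Refresh tokens renew expired OAuth tokens.",
    "Billing invoices are sent monthly.",
    "Invoices list monthly billing totals.",
]
groups = semantic_chunks(sentences)
print(len(groups), "chunks")
```

Because every sentence has to be embedded before any boundary can be chosen, this approach does more work per document than the other strategies, which is consistent with it being the slowest.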
Chunk Size Guidelines
| Content Type | Strategy | Max Tokens | Overlap |
|---|---|---|---|
| API docs | fixed | 400-600 | 50 |
| FAQs | sentence | 300-500 | 30 |
| Tutorials | paragraph | 800-1200 | 100 |
| Research | semantic | 1000-1500 | 150 |
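
A rough way to see what these settings cost is to estimate how many chunks a document will produce. The arithmetic below assumes a sliding window that advances by the maximum size minus the overlap, which matches fixed chunking most closely; the other strategies split on natural boundaries and will deviate from it. `estimate_chunks` is a hypothetical helper, and the example token counts are made up.

```python
import math


def estimate_chunks(total_tokens: int, max_tokens: int, overlap: int) -> int:
    """Estimate the chunk count for a window that advances by max_tokens - overlap."""
    if total_tokens <= 0:
        return 0
    if total_tokens <= max_tokens:
        return 1
    step = max_tokens - overlap
    return 1 + math.ceil((total_tokens - max_tokens) / step)


# A 10,000-token document under two rows of the table above.
api_docs = estimate_chunks(10_000, max_tokens=500, overlap=50)
research = estimate_chunks(10_000, max_tokens=1200, overlap=150)
print(api_docs, research)
```

Smaller chunks mean more embedding calls and more index entries per document, so the number is worth checking before a bulk ingest.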
Supported File Types
| Extension | Handling |
|---|---|
| .md | Markdown parsed, headings preserved |
| .txt | Plain text |
| .html | HTML stripped, text extracted |
| .pdf | Text extracted (requires PDF support) |
| .json | Structured data, keys become metadata |
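
As an illustration of what stripping HTML and extracting text can involve, the sketch below uses Python's standard `html.parser` module to collect visible text and skip `script` and `style` contents. It is not how Cortex handles .html files internally; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = "<h1>Password Reset</h1><p>Open settings and choose Reset.</p><style>p{color:red}</style>"
print(html_to_text(page))
```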
Metadata Filtering
When searching, filter by metadata:

    cortex knowledge search "authentication" \
      --collection support-articles \
      --metadata-filter '{"category": "authentication"}'

Common Metadata Fields
| Field | Description |
|---|---|
| source | Original document location |
| author | Document author |
| version | Document version |
| category | Topic category |
| tags | Array of tags |
| created_at | Creation date |
| updated_at | Last update date |
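
Hand-writing JSON inside shell quotes is error-prone, so one option is to build the metadata in code and serialize it. The sketch below does that with the fields from the table and then models a filter as exact matching on every key. That matching rule is an assumption, since the guide does not define filter semantics, and the `source` path and `matches` helper are hypothetical.

```python
import json

# Metadata using the common fields; the source path is a made-up example.
metadata = {
    "source": "docs/support/password-reset.md",
    "author": "support-team",
    "category": "authentication",
    "tags": ["passwords", "faq"],
    "updated_at": "2024-01-15",
}

# A string suitable for passing as the --metadata argument.
metadata_arg = json.dumps(metadata)
print(metadata_arg)


def matches(doc_metadata: dict, metadata_filter: dict) -> bool:
    """Assumed semantics: every filter key must be present with an equal value."""
    return all(
        key in doc_metadata and doc_metadata[key] == value
        for key, value in metadata_filter.items()
    )


print(matches(metadata, {"category": "authentication"}))
print(matches(metadata, {"category": "billing"}))
```

Using consistent field names and value formats across documents matters more than which fields are chosen, because a filter can only match values that were written the same way at ingest time.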
Bulk Ingestion Patterns
Directory with Metadata Extraction
For directories where file paths encode metadata:

    # Extract: version=v2, category=authentication
    cortex knowledge ingest-dir \
      --collection product-docs \
      --dir ./docs \
      --pattern "**/*.md" \
      --recursive \
      --extract-path-metadata

Incremental Updates
Re-ingest documents that have changed:

    # Only ingest files modified since last sync
    cortex knowledge ingest-dir \
      --collection product-docs \
      --dir ./docs \
      --pattern "*.md" \
      --incremental

Searching Ingested Content
Hybrid Search
Combines vector similarity with keyword matching:

    cortex knowledge search "how to configure OAuth" \
      --collection product-docs \
      --mode hybrid \
      --limit 5

Vector-Only Search
Pure semantic similarity:

    cortex knowledge search "authentication setup process" \
      --mode vector

Full-Text Search
Keyword-based with BM25 ranking:

    cortex knowledge search "OAuth2 redirect_uri" \
      --mode fts

Maintenance
View Statistics

    cortex knowledge stats --collection product-docs

Output:

    Collection: product-docs
    Documents: 145
    Chunks: 2,340
    Total Tokens: 890,000
    Last Ingested: 2024-01-15T10:30:00Z

Re-embed Collection
If you change embedding models:

    cortex knowledge reembed --collection product-docs

Delete Documents

    # By document ID
    cortex knowledge delete --document-id doc_abc123

    # By collection
    cortex knowledge delete --collection old-docs

MCP Tool Usage
Via MCP, use knowledge_ingest:

    {
      "tool": "knowledge_ingest",
      "arguments": {
        "collection": "product-docs",
        "title": "API Reference",
        "content": "# Authentication\n\nTo authenticate...",
        "chunk_strategy": "paragraph",
        "metadata": {
          "category": "api",
          "version": "2.0"
        }
      }
    }

And knowledge_search:

    {
      "tool": "knowledge_search",
      "arguments": {
        "query": "how to authenticate",
        "collection": "product-docs",
        "mode": "hybrid",
        "limit": 5
      }
    }

Next Steps
- Entity Extraction - Auto-extract entities from ingested content
- MCP Tools Reference - Complete MCP tool documentation
- CLI Reference - All CLI commands