Knowledge Base Service

The KnowledgeBaseService provides comprehensive capabilities to interact with the knowledge base, including file upload/download, content search, and metadata filtering. A Content represents a file of any type stored in the knowledge base.

Initialization:

#initialize_kb_service_standalone
kb_service = KnowledgeBaseService.from_settings()

Core Capabilities:

- Upload & Download: Store and retrieve files securely
- Search: Find content using semantic (vector), keyword, or hybrid search
- Metadata Filtering: Use smart rules to narrow search results
- Chat Integration: Attach files to chat messages for user access

Content Upload

For security, prefer uploading from memory to avoid disk-based information leakage. This method takes raw bytes and uploads them directly to the knowledge base without creating intermediate files on disk.

content_bytes = b"Your file content here"
content = kb_service.upload_content_from_bytes(
    content=content_bytes,
    content_name="document.txt",
    mime_type="text/plain",
    scope_id=scope_id,
    metadata={"category": "documentation", "version": "1.0"}
)
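
The bytes passed as `content` can come from any in-memory source. As a minimal sketch, assuming you generate a JSON or CSV payload yourself, the following builds both without touching the disk. The variable names and sample data are illustrative only.

```python
import csv
import io
import json

# Build a JSON payload entirely in memory.
report = {"title": "Quarterly summary", "items": [1, 2, 3]}
json_bytes = json.dumps(report).encode("utf-8")

# Build a CSV in memory: write text into a StringIO buffer, then encode it.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["name", "score"])
writer.writerow(["alice", 0.9])
csv_bytes = buffer.getvalue().encode("utf-8")

# Either value could then be passed as `content=` to
# kb_service.upload_content_from_bytes(...), with a matching mime_type
# such as "application/json" or "text/csv".
print(len(json_bytes), len(csv_bytes))
```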

Upload from File

When you must upload from disk (e.g., when working with large files or when the content is already saved locally), use upload_content():

- skip_ingestion: Controls whether ingestion (processing the content for search) is skipped. Set to False to ingest the content and make it searchable via vector/keyword search, or True if you only need to store the file without indexing it.

Use cases:

- Files generated by external libraries that write to disk
- Batch uploads of existing files

#kb_service_upload_from_file
# Configure ingestion settings
content = kb_service.upload_content(
    path_to_content=str(file_path),
    content_name=Path(file_path).name,
    mime_type="text/plain",
    scope_id=scope_id,
    skip_ingestion=False,  # Process the content for search
    metadata={"department": "legal", "classification": "confidential"}
)

Make Uploaded Document Available to User

When you generate or process a file that should be shown to the user in the chat interface, you need to:

1. Upload the content to the knowledge base
2. Create a ContentReference linking to the uploaded content
3. Attach the reference to an assistant message

This makes the file appear as a downloadable attachment in the chat.

uploaded_content = kb_service.upload_content(
    path_to_content=str(output_filepath),
    content_name=output_filepath.name,
    # guess_type() returns None for unknown extensions, so fall back to a generic type
    mime_type=mimetypes.guess_type(output_filepath)[0] or "application/octet-stream",
    chat_id=payload.chat_id,
    skip_ingestion=skip_ingestion,  # Usually True for generated files
)

reference = ContentReference(
    id=uploaded_content.id,  # ID of the content uploaded above
    sequence_number=1,
    message_id=message_id,
    name=filename,
    source=payload.name,
    source_id=chat_id,
    url=f"unique://content/{uploaded_content.id}",  # Special URL format for content
)

self.chat_service.modify_assistant_message(
    content="Please find the translated document below in the references.",
    references=[reference],
    set_completed_at=True,
)

Common use cases:

- Returning generated reports, summaries, or translations
- Providing processed/converted files (e.g., PDF to Word)
- Making analysis results available for download
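
When deriving mime_type from a generated file's name, note that the standard library's `mimetypes.guess_type()` returns `None` as the type for unrecognized extensions. The following is a small sketch of a fallback helper. The helper name `guess_mime_type` and the default value are illustrative choices, not part of the toolkit.

```python
import mimetypes
from pathlib import Path


def guess_mime_type(path: Path, default: str = "application/octet-stream") -> str:
    """Return the guessed MIME type for `path`, or `default` if it is unknown."""
    mime_type, _encoding = mimetypes.guess_type(path.name)
    return mime_type or default


# A known extension resolves normally; an unknown one falls back to the default.
print(guess_mime_type(Path("notes.txt")))
print(guess_mime_type(Path("export.zzzunknown")))
```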

Full Examples Upload
# %%
from pathlib import Path

from dotenv import dotenv_values

from unique_toolkit import (
    KnowledgeBaseService,
)

kb_service = KnowledgeBaseService.from_settings()
demo_env_vars = dotenv_values(Path(__file__).parent / "demo.env")
scope_id = demo_env_vars.get("UNIQUE_SCOPE_ID") or "unknown"
content_bytes = b"Your file content here"
content = kb_service.upload_content_from_bytes(
    content=content_bytes,
    content_name="document.txt",
    mime_type="text/plain",
    scope_id=scope_id,
    metadata={"category": "documentation", "version": "1.0"},
)
# %%
from pathlib import Path

from dotenv import dotenv_values

from unique_toolkit import (
    KnowledgeBaseService,
)

kb_service = KnowledgeBaseService.from_settings()
demo_env_vars = dotenv_values(Path(__file__).parent / "demo.env")
scope_id = demo_env_vars.get("UNIQUE_SCOPE_ID") or "unknown"
file_path = Path(__file__).parent / "test.txt"
# Configure ingestion settings
content = kb_service.upload_content(
    path_to_content=str(file_path),
    content_name=Path(file_path).name,
    mime_type="text/plain",
    scope_id=scope_id,
    skip_ingestion=False,  # Process the content for search
    metadata={"department": "legal", "classification": "confidential"},
)

Content Download

Prefer downloading to memory for security. This approach avoids leaving sensitive data on disk and suits most use cases where you can process the content directly in memory.

How it works:

1. download_content_to_bytes() retrieves the file content as raw bytes
2. Use io.BytesIO() to create a file-like object in memory that many libraries can read from
3. Process the content directly without touching the filesystem

Common use cases:

- Reading text files
- Processing images with PIL/Pillow
- Parsing JSON/XML/CSV data
- Any operation where the library supports file-like objects or byte streams

#kb_service_download_bytes
# Download content as bytes
content_bytes = kb_service.download_content_to_bytes(
    content_id=content_id or "unknown",
)

# Process in memory
text = ""
with io.BytesIO(content_bytes) as file_like:
    text = file_like.read().decode("utf-8")

print(text)
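
The same in-memory pattern extends to structured data. The sketch below parses JSON and CSV from byte strings that stand in for the result of download_content_to_bytes(). The sample payloads are made up for illustration.

```python
import csv
import io
import json

# Stand-ins for bytes returned by download_content_to_bytes()
json_bytes = b'{"title": "Manual", "pages": 12}'
csv_bytes = b"name,score\nalice,0.9\nbob,0.7\n"

# JSON: json.load() accepts a binary file-like object directly.
with io.BytesIO(json_bytes) as file_like:
    document = json.load(file_like)

# CSV: the csv module expects text, so wrap the byte stream in a decoder.
with io.TextIOWrapper(
    io.BytesIO(csv_bytes), encoding="utf-8", newline=""
) as text_stream:
    rows = list(csv.DictReader(text_stream))

print(document["title"], len(rows))
```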

Download to Temporary File

When you need a file on disk, use secure temporary directories. This is necessary when:

- A library requires a file path and cannot work with file-like objects or bytes
- You need to pass the file to an external command-line tool
- The file format requires random access (seeking) not available with streams

Important security practices:

1. Always use tempfile.mkdtemp() to create a secure, random temporary directory
2. Use a try/finally block to ensure cleanup happens even if an error occurs
3. Delete both the file and the temporary directory when done

#kb_service_download_file
# Download to secure temporary file

filename = "my_testfile.txt"
temp_file_path = kb_service.download_content_to_file(
    content_id=content_id,
    output_filename=filename,
    output_dir_path=Path(tempfile.mkdtemp())  # Use secure temp directory
)

try:
    # Process the file
    with open(temp_file_path, 'rb') as file:
        text = file.read().decode("utf-8")
        print(text)
finally:
    # Always clean up temporary files
    if temp_file_path.exists():
        temp_file_path.unlink()
    # Clean up the temporary directory
    temp_file_path.parent.rmdir()
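
As an alternative sketch, `tempfile.TemporaryDirectory` from the standard library removes the directory and everything in it when its `with` block exits, including on errors. The function `fake_download` below is a hypothetical stand-in for kb_service.download_content_to_file() so that the example runs without a knowledge base.

```python
import tempfile
from pathlib import Path


def fake_download(output_dir: Path, filename: str) -> Path:
    """Hypothetical stand-in for kb_service.download_content_to_file()."""
    target = output_dir / filename
    target.write_bytes(b"example content")
    return target


with tempfile.TemporaryDirectory() as temp_dir:
    temp_file_path = fake_download(Path(temp_dir), "my_testfile.txt")
    text = temp_file_path.read_text(encoding="utf-8")

# The file and its directory no longer exist once the with-block has exited.
print(text, temp_file_path.exists())
```

This keeps the cleanup logic in one place, at the cost of having to finish all processing inside the `with` block.
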

Full Examples Download
# %%
import io
from pathlib import Path

from dotenv import dotenv_values

from unique_toolkit import (
    KnowledgeBaseService,
)

kb_service = KnowledgeBaseService.from_settings()
demo_env_vars = dotenv_values(Path(__file__).parent / "demo.env")
content_id = demo_env_vars.get("UNIQUE_CONTENT_ID") or "unknown"
# Download content as bytes
content_bytes = kb_service.download_content_to_bytes(
    content_id=content_id or "unknown",
)

# Process in memory
text = ""
with io.BytesIO(content_bytes) as file_like:
    text = file_like.read().decode("utf-8")

print(text)
# %%
import tempfile
from pathlib import Path

from dotenv import dotenv_values

from unique_toolkit import (
    KnowledgeBaseService,
)

kb_service = KnowledgeBaseService.from_settings()
demo_env_vars = dotenv_values(Path(__file__).parent / "demo.env")
content_id = demo_env_vars.get("UNIQUE_CONTENT_ID") or "unknown"
# Download to secure temporary file

filename = "my_testfile.txt"
temp_file_path = kb_service.download_content_to_file(
    content_id=content_id,
    output_filename=filename,
    output_dir_path=Path(tempfile.mkdtemp()),  # Use secure temp directory
)

try:
    # Process the file
    with open(temp_file_path, "rb") as file:
        text = file.read().decode("utf-8")
        print(text)
finally:
    # Always clean up temporary files
    if temp_file_path.exists():
        temp_file_path.unlink()
    # Clean up the temporary directory
    temp_file_path.parent.rmdir()

Content Deletion

Permanently removes content from the knowledge base. This operation:

- Deletes the file from storage
- Removes all indexed chunks from the vector database

#kb_service_delete_content
kb_service.delete_content(
    content_id=content.id
)

Full Examples Content Deletion
# %%
from pathlib import Path

from dotenv import dotenv_values

from unique_toolkit import (
    KnowledgeBaseService,
)

kb_service = KnowledgeBaseService.from_settings()
demo_env_vars = dotenv_values(Path(__file__).parent / "demo.env")
scope_id = demo_env_vars.get("UNIQUE_SCOPE_ID") or "unknown"
content_bytes = b"Your file content here"
content = kb_service.upload_content_from_bytes(
    content=content_bytes,
    content_name="document.txt",
    mime_type="text/plain",
    scope_id=scope_id,
    metadata={"category": "documentation", "version": "1.0"},
)
kb_service.delete_content(content_id=content.id)

Semantic Search (Vector-Based)

Use vector search for semantic similarity matching. This search method understands the meaning of your query and finds conceptually similar content, even if the exact words don't match.

How it works:

- Your search string is converted to a vector embedding
- The system finds content chunks with similar embeddings
- Results are ranked by semantic similarity

Parameters:

- search_string: Your natural language query
- search_type: Set to ContentSearchType.VECTOR for semantic search
- limit: Maximum number of chunks to return
- score_threshold: Minimum similarity score (0.0 to 1.0). Higher values = stricter matching
- scope_ids: Optional list of folder IDs to restrict search scope

Best for:

- Natural language queries
- Finding conceptually related content
- When exact keyword matching isn't necessary
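
To make the roles of limit and score_threshold concrete, here is a toy sketch of threshold-filtered similarity ranking. It assumes cosine similarity and uses hand-written three-dimensional vectors in place of real embeddings. The knowledge base's actual embedding model and scoring are not described here and may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vector_search(query_vec, chunks, limit, score_threshold):
    """Score every chunk, drop those below the threshold, return the top `limit`."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:limit]


# Toy "embeddings": (chunk text, vector)
chunks = [
    ("wizard school", [0.9, 0.1, 0.0]),
    ("magic wand", [0.7, 0.3, 0.1]),
    ("tax form", [0.0, 0.1, 0.9]),
]

results = vector_search([1.0, 0.0, 0.0], chunks, limit=10, score_threshold=0.7)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

Raising score_threshold shrinks the result set before limit is applied, which is why a strict threshold can return fewer chunks than the limit allows.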

Combined Search (Hybrid)

Combine semantic and keyword search for best results. This approach provides the most comprehensive results by leveraging both search methods.

How it works:

- Performs both vector (semantic) and keyword (full-text) search in parallel
- Merges and ranks results using a hybrid scoring algorithm
- Returns the most relevant matches from both search types

Recommended as the default search type for most use cases.
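
The hybrid scoring algorithm used by the service is not specified here. As an illustration of how two ranked lists can be merged, the sketch below uses reciprocal rank fusion (RRF), a common general-purpose technique. It is a swapped-in example, not the knowledge base's own method, and the chunk IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists: each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


vector_hits = ["c1", "c2", "c3"]   # ranked by semantic similarity
keyword_hits = ["c3", "c1", "c4"]  # ranked by full-text relevance

merged = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(merged)
```

Chunks that appear in both lists accumulate score from each, so they tend to rise above chunks found by only one search type.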

Search for complete content files (not chunks) by metadata. This is useful when you want to find whole files rather than text snippets.

Difference from chunk search:

- search_content_chunks(): Returns text snippets from within files
- search_contents(): Returns complete file metadata objects

Use cases:

- Listing all files in a folder
- Finding files by title, creation date, or custom metadata
- Getting files uploaded to a specific chat
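
As a rough mental model of a metadata filter such as `{"title": {"contains": "manual"}}`, the sketch below applies a `contains` condition to a local list of records. This is only an illustration: the real filter is evaluated by the knowledge base, and details such as case sensitivity are assumptions here. The helper `matches` and the sample records are hypothetical.

```python
def matches(record: dict, where: dict) -> bool:
    """Return True if every `contains` condition in `where` holds for `record`."""
    for field, condition in where.items():
        value = str(record.get(field, ""))
        needle = condition.get("contains")
        # Assumption: plain, case-sensitive substring match
        if needle is not None and needle not in value:
            return False
    return True


files = [
    {"title": "user-manual.pdf"},
    {"title": "invoice-2024.pdf"},
    {"title": "service manual v2.docx"},
]
where = {"title": {"contains": "manual"}}

found = [record["title"] for record in files if matches(record, where)]
print(found)
```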

Full Examples

Full Examples Content Search
# %%
from pathlib import Path

from dotenv import dotenv_values

from unique_toolkit import (
    KnowledgeBaseService,
)
from unique_toolkit.content.schemas import (
    ContentSearchType,
)

kb_service = KnowledgeBaseService.from_settings()
demo_env_vars = dotenv_values(Path(__file__).parent / "demo.env")
scope_id = demo_env_vars.get("UNIQUE_SCOPE_ID") or "unknown"
# Search for content using vector similarity
content_chunks = kb_service.search_content_chunks(
    search_string="Harry Potter",
    search_type=ContentSearchType.VECTOR,
    limit=10,
    score_threshold=0.7,  # Only return results with high similarity
    scope_ids=[scope_id],
)

print(f"Found {len(content_chunks)} relevant chunks")
for i, chunk in enumerate(content_chunks[:3]):
    print(f"  {i + 1}. {chunk.text[:100]}...")
# %%
from pathlib import Path

from dotenv import dotenv_values

from unique_toolkit import (
    KnowledgeBaseService,
)
from unique_toolkit.content.schemas import (
    ContentSearchType,
)

kb_service = KnowledgeBaseService.from_settings()
demo_env_vars = dotenv_values(Path(__file__).parent / "demo.env")
scope_id = demo_env_vars.get("UNIQUE_SCOPE_ID") or "unknown"
# Combined semantic and keyword search for best results
content_chunks = kb_service.search_content_chunks(
    search_string="Harry Potter",
    search_type=ContentSearchType.COMBINED,
    limit=15,
    search_language="english",
    scope_ids=[scope_id],  # Limit to specific scopes if configured
)

print(f"Combined search found {len(content_chunks)} chunks")
# %%
from pathlib import Path

from dotenv import dotenv_values

from unique_toolkit import (
    KnowledgeBaseService,
)

kb_service = KnowledgeBaseService.from_settings()
demo_env_vars = dotenv_values(Path(__file__).parent / "demo.env")
scope_id = demo_env_vars.get("UNIQUE_SCOPE_ID") or "unknown"
# Search for specific content files
contents = kb_service.search_contents(
    where={"title": {"contains": "manual"}},
)

Best Practices

Security Considerations

  1. Prefer Memory Operations: Always prefer download_content_to_bytes() and upload_content_from_bytes() to avoid disk-based information leakage.

  2. Temporary File Cleanup: When using temporary files, always clean them up:

    import shutil
    import tempfile

    temp_dir = tempfile.mkdtemp()
    try:
        # Your file operations
        pass
    finally:
        # Remove the temp directory and everything in it
        shutil.rmtree(temp_dir)
    

  3. Secure File Names: Use random names for temporary files to prevent information leakage through file names.
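
A minimal sketch of the third point, assuming you only need to preserve the file extension: replace the original stem with random hex from the standard library's `secrets` module. The helper name `random_temp_name` and the sample filename are illustrative.

```python
import secrets
from pathlib import Path


def random_temp_name(original_name: str) -> str:
    """Keep only the extension; replace the stem with 32 random hex characters."""
    suffix = Path(original_name).suffix
    return f"{secrets.token_hex(16)}{suffix}"


# The descriptive original name never reaches the filesystem.
name = random_temp_name("salary_review_jane_doe.pdf")
print(name)
```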