Run Semantic Search Directly on Apache Iceberg Tables with Oracle AI Database 26ai


Tired of duplicating massive datasets just to add vector search capabilities? With Oracle AI Database 26ai, you can run similarity search directly on your existing Apache Iceberg tables in object storage. There is no data copying and no extra ETL pipeline to maintain, and the data stays under its existing governance.

This matters most for data lakes that keep Iceberg tables as Parquet files in cloud object storage (OCI Object Storage, S3, etc.).

Why This Matters

  • Avoid massive data duplication and sync issues
  • Keep data in its original governed location
  • Query Iceberg + Oracle tables together in the same SQL
  • Create fast vector indexes without moving the source data
  • Works great for RAG, semantic search, and recommendation systems

Step-by-Step: Query Iceberg Vectors in Minutes

1. Create External Table over Iceberg

CREATE TABLE ext_iceberg_vectors (
    id           VARCHAR2(100),
    content      CLOB,
    embedding    VECTOR(1024, FLOAT32)   -- match your embedding dimension
)
ORGANIZATION EXTERNAL
(
    TYPE ORACLE_BIGDATA
    DEFAULT DIRECTORY DATA_PUMP_DIR
    ACCESS PARAMETERS
    (
        com.oracle.bigdata.credential.name=OCI_CRED
        com.oracle.bigdata.fileformat=parquet
        com.oracle.bigdata.access_protocol=iceberg
    )
    LOCATION ('iceberg:https://objectstorage.<region>.oraclecloud.com/.../metadata/v1.metadata.json')
)
REJECT LIMIT UNLIMITED;
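
The DDL above refers to a credential named OCI_CRED, which has to exist before the table is queried. Here is a minimal sketch of creating it, assuming the DBMS_CLOUD package is available in your database (it ships with Autonomous Database; other deployments may need to install it first). The user name and token are placeholders, not real values:

```sql
-- Sketch only: creates the credential referenced by the external table.
-- '<oci_user_name>' and '<auth_token>' are placeholders you must replace.
-- For S3-compatible storage, the same two fields typically carry the
-- access key ID and the secret access key instead.
BEGIN
  DBMS_CLOUD.CREATE_CREDENTIAL(
    credential_name => 'OCI_CRED',
    username        => '<oci_user_name>',
    password        => '<auth_token>'
  );
END;
/
```

Storing the secret in a credential object keeps it out of the table DDL, so the DDL can be shared or versioned without exposing the token.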

2. Run Similarity Search (with on-the-fly embedding)

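-- embedding_model is the name of an embedding model already loaded into the database.
-- VECTOR_DISTANCE returns a distance, so smaller values mean closer matches.
SELECT id,
       content,
       VECTOR_DISTANCE(embedding,
                       VECTOR_EMBEDDING(embedding_model USING :search_query AS data),
                       COSINE) AS distance
FROM   ext_iceberg_vectors
ORDER  BY distance
FETCH FIRST 10 ROWS ONLY;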
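
The query above assumes a model called embedding_model is already loaded in the database. One way to get there is to import an ONNX file with DBMS_VECTOR.LOAD_ONNX_MODEL. This is a rough sketch, not a definitive recipe: the file name my_embedding_model.onnx is hypothetical, the directory is simply the one reused from step 1, and the tensor names in the metadata JSON ("input", "embedding") depend on how your model was exported.

```sql
-- Sketch only: load an ONNX embedding model under the name used in step 2.
-- 'my_embedding_model.onnx' is a hypothetical file placed in DATA_PUMP_DIR.
-- The metadata JSON must match your model's actual input/output tensor names.
BEGIN
  DBMS_VECTOR.LOAD_ONNX_MODEL(
    'DATA_PUMP_DIR',
    'my_embedding_model.onnx',
    'EMBEDDING_MODEL',
    JSON('{"function" : "embedding",
           "embeddingOutput" : "embedding",
           "input" : {"input" : ["DATA"]}}')
  );
END;
/

-- Check the model's output dimension before querying; it should match
-- the dimension declared on the embedding column in step 1.
SELECT VECTOR_DIMENSION_COUNT(
         VECTOR_EMBEDDING(embedding_model USING 'hello world' AS data)
       ) AS dims
FROM   dual;
```

If the reported dimension differs from the one in the table definition, fix the column definition or pick a different model before going further.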

3. Speed It Up with Vector Index

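-- NEIGHBOR PARTITIONS builds an IVF index.
-- The distance metric should match the one used in your queries.
CREATE VECTOR INDEX iceberg_vec_idx
ON ext_iceberg_vectors(embedding)
ORGANIZATION NEIGHBOR PARTITIONS
DISTANCE COSINE
WITH TARGET ACCURACY 95;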
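
Creating the index is only half of the speed-up; the query also has to ask for approximate results. Here is a hedged sketch of the step 2 query rewritten with FETCH APPROX, assuming the index uses cosine distance (the default when none is specified) and the same embedding_model and :search_query bind as before:

```sql
-- Sketch only: approximate top-10 search that is eligible to use the vector index.
-- The metric here (COSINE) should match the metric the index was built with.
-- WITH TARGET ACCURACY is optional and overrides the index default for this query.
SELECT id,
       content
FROM   ext_iceberg_vectors
ORDER  BY VECTOR_DISTANCE(embedding,
                          VECTOR_EMBEDDING(embedding_model USING :search_query AS data),
                          COSINE)
FETCH APPROX FIRST 10 ROWS ONLY WITH TARGET ACCURACY 90;
```

Whether the optimizer actually picks the index depends on your data and statistics, so it is worth checking the execution plan rather than assuming it.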

Best Practices for Production

  • Use credential objects for secure access to object storage
  • Match vector dimension and type exactly with your embedding model
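  • Create an IVF (NEIGHBOR PARTITIONS) or HNSW (INMEMORY NEIGHBOR GRAPH) index for large Iceberg tables; HNSW is held in memory, so size the vector pool accordingly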
  • Combine with Oracle tables in the same query for hybrid search
  • Suits air-gapped environments: embeddings are generated in-database from an imported ONNX model, so no external API call is needed
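
To make the hybrid search point concrete, here is a small sketch that joins the Iceberg-backed external table to a regular Oracle table and filters on relational columns before ranking by distance. The doc_catalog table and its doc_id, title, and status columns are hypothetical and exist only for illustration:

```sql
-- Sketch only: hybrid query across an Oracle table and the Iceberg external table.
-- doc_catalog (doc_id, title, status) is a hypothetical Oracle table.
SELECT v.id,
       c.title,
       VECTOR_DISTANCE(v.embedding,
                       VECTOR_EMBEDDING(embedding_model USING :search_query AS data),
                       COSINE) AS distance
FROM   ext_iceberg_vectors v
JOIN   doc_catalog c
       ON c.doc_id = v.id
WHERE  c.status = 'PUBLISHED'      -- relational filter applied alongside similarity
ORDER  BY distance
FETCH FIRST 10 ROWS ONLY;
```

The relational predicate narrows the candidates while the vector distance handles ranking, all in one statement.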

Real-World Use Cases

  • Semantic search over data lake documents
  • RAG applications using Iceberg as the knowledge base
  • Real-time recommendations without data movement
  • Unified analytics across structured + unstructured data

Conclusion

Oracle AI Database 26ai + Apache Iceberg gives you the best of both worlds: the governance and scale of a modern data lake with the powerful, familiar vector search capabilities of Oracle.

No more unnecessary data copies. Just point, index, and query to run fast semantic search on your existing Iceberg tables.


