Efficient Hybrid Search in Vector Database

Authors

Zhilin GAO (zhilin.gao@connect.polyu.hk)

Zhejun HE (zhejun.he@connect.polyu.hk)

Dr Ken YIU (csmlyiu@comp.polyu.edu.hk)

Introduction

This project is inspired by a research project conducted at Alibaba. With recent advances in machine learning and representation learning, querying and analyzing either structured or unstructured data alone has become comparatively easy. In practice, however, queries in analytical databases usually involve both. Searching structured data (i.e., related attributes) and unstructured data (i.e., feature vectors) together in an efficient, hybrid manner remains a challenge in industry. This research project investigates and proposes a better solution to this problem, and we developed a demo system.

Research output

Technical Report.

Quantization-based methods have been adopted by Alibaba and Milvus, and we conducted a related study. Study: [link]

Our proposed methods

We took advantage of Approximate Nearest Neighbor Search (ANNS) and implemented both VP-tree and Product Quantization (PQ) methods to speed up hybrid search. For the VP-tree, instead of treating vector search and attribute filtering as distinct stages, we proposed applying both criteria simultaneously. Our research shows that this concurrent hybrid search method on the VP-tree outperforms the pre-query and post-query methods.
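The idea of applying both criteria in a single traversal can be sketched as follows. This is a minimal illustration under our own assumptions, not the implementation from the technical report: the names `Node`, `build`, `hybrid_search`, and `predicate` are hypothetical, the distance is Euclidean, and the search is exact. Every visited vantage point is used for pruning, but it enters the Top-K heap only if its attribute passes the filter.

```python
import heapq
import math
import statistics


class Node:
    def __init__(self, index, radius, inside, outside):
        self.index = index      # position of the vantage point in the dataset
        self.radius = radius    # median distance from the vantage point to the rest
        self.inside = inside    # subtree of points with distance <= radius
        self.outside = outside  # subtree of points with distance > radius


def build(vectors, indices, rng):
    """Recursively build a VP-tree over the given dataset positions."""
    indices = list(indices)
    if not indices:
        return None
    vp = indices.pop(rng.randrange(len(indices)))
    if not indices:
        return Node(vp, 0.0, None, None)
    dists = [math.dist(vectors[i], vectors[vp]) for i in indices]
    radius = statistics.median(dists)
    inside = [i for i, d in zip(indices, dists) if d <= radius]
    outside = [i for i, d in zip(indices, dists) if d > radius]
    return Node(vp, radius, build(vectors, inside, rng), build(vectors, outside, rng))


def hybrid_search(root, vectors, attrs, query, predicate, k):
    """Return the k nearest (distance, index) pairs whose attribute passes predicate."""
    heap = []  # max-heap emulated with negated distances

    def tau():
        # Current pruning radius: distance of the k-th best matching point so far.
        return -heap[0][0] if len(heap) == k else math.inf

    def visit(node):
        if node is None:
            return
        d = math.dist(vectors[node.index], query)
        # The attribute filter is checked during the traversal itself.
        if predicate(attrs[node.index]):
            if len(heap) < k:
                heapq.heappush(heap, (-d, node.index))
            elif d < tau():
                heapq.heapreplace(heap, (-d, node.index))
        # Triangle-inequality pruning, using only matching points for tau.
        if d <= node.radius:
            visit(node.inside)
            if d + tau() > node.radius:
                visit(node.outside)
        else:
            visit(node.outside)
            if d - tau() <= node.radius:
                visit(node.inside)

    visit(root)
    return sorted((-neg_d, i) for neg_d, i in heap)
```

Because the pruning radius is derived only from points that satisfy the filter, the traversal never discards a subtree that could still hold a qualifying neighbor, which is what distinguishes this sketch from filtering strictly before or after the vector search.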

Regarding PQ, we proposed a pre-query method that performs attribute filtering before the distance between the query vector and each candidate's PQ code is calculated. Importantly, pre-query filtering does not compromise the Top-K retrieval result. We observe that filtering on attributes before the distance calculation significantly improves overall search efficiency in terms of speed and resource utilization.
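A rough sketch of pre-query filtering over PQ codes is given below. It is an illustration under stated assumptions rather than the system's actual code: all function names are hypothetical, and the codebooks are sampled from the data instead of being trained with k-means, purely to keep the example short. The reference function `exhaustive_then_filter` scores every code first and filters afterwards, so the two results can be compared.

```python
import heapq


def split(vec, m):
    """Split a vector into m equal-length subvectors."""
    step = len(vec) // m
    return [vec[s * step:(s + 1) * step] for s in range(m)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_codebooks(vectors, m, n_centroids, rng):
    # Assumption: centroids are sampled data subvectors; real PQ would use k-means.
    books = []
    for s in range(m):
        subvectors = [split(v, m)[s] for v in vectors]
        books.append(rng.sample(subvectors, n_centroids))
    return books


def encode(vec, books):
    """Map each subvector to the id of its nearest centroid."""
    return tuple(
        min(range(len(book)), key=lambda c: sq_dist(sub, book[c]))
        for sub, book in zip(split(vec, len(books)), books)
    )


def distance_table(query, books):
    """Precompute squared distances from each query subvector to every centroid."""
    return [
        [sq_dist(sub, centroid) for centroid in book]
        for sub, book in zip(split(query, len(books)), books)
    ]


def adc(code, table):
    """Asymmetric distance: sum of table lookups, one per subspace."""
    return sum(table[s][c] for s, c in enumerate(code))


def pre_filter_top_k(codes, attrs, table, predicate, k):
    # Attribute filter first: distances are computed only for matching rows.
    matching = (i for i, a in enumerate(attrs) if predicate(a))
    return heapq.nsmallest(k, ((adc(codes[i], table), i) for i in matching))


def exhaustive_then_filter(codes, attrs, table, predicate, k):
    # Reference: score every code, then drop rows that fail the filter.
    scored = [(adc(code, table), i) for i, code in enumerate(codes)]
    return heapq.nsmallest(k, [(d, i) for d, i in scored if predicate(attrs[i])])
```

Both functions rank the same set of matching rows by the same approximate distance, so they return identical Top-K lists; the pre-filter variant simply skips the table lookups for rows that fail the attribute predicate.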

Demo system

Video Link (This demo system is used only for testing various hybrid search methods with VP-tree and PQ, where text and image size are treated as structured data.)