CHAMELEON
Adaptive Indexing and Processing of High-dimensional Vector Data


The Chameleon project is a research project funded by the Hellenic Foundation for Research and Innovation (HFRI). The funding instrument aims to support faculty members and researchers, and the principal investigator of Chameleon is Christos Doulkeridis. The project is hosted at the Department of Digital Systems at the University of Piraeus.

Find Out More

Description of Work


Several research challenges are associated with efficiently indexing and processing high-dimensional data. However, existing works focus mainly on the data distribution and on updates of data, while remaining largely oblivious to the query workload and to shifts in the query distribution. This is partly because taking the query workload into account results in hard optimization problems, which require either (a) costly exact algorithms that exhibit poor runtime performance, or (b) heuristic algorithms that work well only for specific instances of input datasets.

Consequently, this project identifies a set of research and technological challenges that must be addressed effectively in order to enhance the indexing and processing of high-dimensional vector data:

1. Optimization objectives for workload-aware partitioning

The problem under study is how to divide the vectors into partitions so as to minimize the processing cost of a given workload. Given a dataset of vectors organized in a set of partitions P, we define the cost C(q) of an individual query vector q as the weighted sum of the time required to find which partitions must be examined for q plus the cost of examining the objects in these matched partitions. The problem can then be formalized as finding the optimal partitioning, such that the total cost (the sum of C(q) over all queries) is minimized for a given query workload. As this problem is NP-hard, our approach will try to learn the optimal partitioning by applying a learning-based method that predicts the induced query cost for a given partitioning (i.e., grouping of vectors); by applying the same method recursively, we aim to eventually obtain the desired workload-aware partitioning. Essentially, our method aims to predict, via machine learning, which partitioning is the best one.
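The cost model above can be sketched in a few lines of Python. This is a minimal illustration only: the partition-matching rule (the query ball intersecting a partition's bounding ball), the routing weight `w_route`, and all function names are simplifying assumptions for the sketch, not the project's actual model.

```python
import math

def dist(a, b):
    """Euclidean distance between two vectors given as lists of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(part):
    """Component-wise mean of the vectors in a partition."""
    d = len(part[0])
    return [sum(v[i] for v in part) / len(part) for i in range(d)]

def query_cost(q, partitions, radius, w_route=0.1):
    """C(q): weighted routing cost plus the cost of scanning matched partitions."""
    # Routing: one centroid distance check per partition.
    routing = w_route * len(partitions)
    # Scanning: every vector in a partition that may contain results.
    scan = 0
    for part in partitions:
        c = centroid(part)
        part_radius = max(dist(c, v) for v in part)
        if dist(q, c) <= radius + part_radius:  # partition may intersect the query ball
            scan += len(part)
    return routing + scan

def workload_cost(queries, partitions, radius):
    """Total cost of a workload: the sum of C(q) over all queries."""
    return sum(query_cost(q, partitions, radius) for q in queries)
```

Under this toy model, a partitioning that separates the data into tight clusters lets queries skip distant partitions entirely, so its workload cost is lower than that of a single monolithic partition.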

2. Adaptive indexing techniques for high-dimensional vectors

While various indexing techniques have been applied before, our focus is on clustering-based index construction. In a high-dimensional setting, this is typically achieved by means of pivot selection techniques. A set of pivots (also known as vantage points) is used to pre-compute distances to vectors, and then (at query time) many vectors can be pruned based on the triangle inequality. While many algorithms for pivot selection have been studied and compared against each other, the following research questions remain open: (a) how to dynamically and adaptively select pivots based on the query workload, (b) how to provide theoretical guarantees on the maximum number of distance comparisons for arbitrary data distributions, and (c) how to make pivot selection techniques applicable to dense high-dimensional vectors. Moreover, in the case of multi-metric search, where a data object is represented as a vector in different metric spaces, it is challenging to find the optimal indexing approach that combines the individual indexes (e.g., trade-offs between hierarchical index composition and hybrid indexing).
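The pivot-based pruning described above can be illustrated with a small sketch: distances from every object to each pivot are precomputed, and at query time the triangle-inequality lower bound |d(q,p) − d(p,o)| ≤ d(q,o) is used to skip objects without computing d(q,o). The class and method names are illustrative, not part of the project's codebase.

```python
import math

def dist(a, b):
    """Euclidean distance between two vectors given as lists of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class PivotIndex:
    """Precomputes object-to-pivot distances and prunes candidates
    at query time via the triangle inequality."""

    def __init__(self, objects, pivots):
        self.objects = objects
        self.pivots = pivots
        # One row of precomputed pivot distances per object.
        self.table = [[dist(o, p) for p in pivots] for o in objects]

    def range_search(self, q, radius):
        """Return all objects within `radius` of q, plus the pruned count."""
        q_to_p = [dist(q, p) for p in self.pivots]
        results, pruned = [], 0
        for o, row in zip(self.objects, self.table):
            # |d(q,p) - d(p,o)| is a lower bound on d(q,o) for every pivot p,
            # so if it exceeds the radius for any pivot, o cannot qualify.
            if any(abs(qp - op) > radius for qp, op in zip(q_to_p, row)):
                pruned += 1
                continue
            if dist(q, o) <= radius:  # exact check only for surviving candidates
                results.append(o)
        return results, pruned
```

The result set is identical to a brute-force scan; the pruning only saves exact distance computations, which is where workload-aware pivot choice pays off.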

3. Online learning for adaptive workload-aware query optimization

How can we adaptively minimize the processing cost of a query workload (i.e., a sequence of queries)? Viewed through the lens of online learning algorithms, this requires committing to a “solution” before the next query in the workload sequence is processed, based on feedback received from the processing of previous queries. Such a “solution” could be a particular partitioning of the vector data. Upon receiving feedback on the performance of this partitioning for the current query of the workload, the online learning algorithm could modify it appropriately so as to cater for the upcoming query. The question is then whether an online learning algorithm can be designed that provably converges to an optimal (or near-optimal) partitioning of the vector data for the query workload. A similar problem can be stated for the pivot selection indexing technique: an online learning algorithm can maintain a pivot subset for the vector data that it modifies in anticipation of the next query of the workload sequence, based on performance feedback for the pivot subset on past queries. Again, the aim would be provable convergence to an optimal (or near-optimal) pivot subset for the query workload.
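One classical online learning template that fits this setting is Hedge (multiplicative weights) over a fixed pool of candidate pivot subsets. The sketch below is a generic illustration under two simplifying assumptions stated here, not the project's algorithm: full-information feedback (the cost of every candidate is observable after each query) and per-query costs normalized to [0, 1].

```python
import math
import random

def hedge_pivot_selection(candidates, workload, cost_fn, eta=0.5):
    """Hedge (multiplicative weights) over candidate pivot subsets.

    Each round, the algorithm commits to one candidate before the next
    query arrives, observes per-candidate costs in [0, 1] (a simplifying
    full-information assumption), and down-weights costly candidates so
    the distribution concentrates on subsets that serve the workload well.
    Returns the best candidate and the realized cumulative cost.
    """
    weights = [1.0] * len(candidates)
    rng = random.Random(0)  # fixed seed keeps the sketch reproducible
    realized = 0.0
    for q in workload:
        # Commit before seeing this query's outcome (the online setting).
        pick = rng.choices(range(len(candidates)), weights=weights)[0]
        costs = [cost_fn(q, c) for c in candidates]  # feedback after processing
        realized += costs[pick]
        for i, c in enumerate(costs):
            weights[i] *= math.exp(-eta * c)  # exponential down-weighting
    best = max(range(len(candidates)), key=weights.__getitem__)
    return candidates[best], realized
```

Hedge guarantees regret sublinear in the workload length against the best fixed candidate in hindsight; the open question in the project is whether similar guarantees carry over to structured solution spaces such as partitionings.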

4. Approximate query processing algorithms

In high-dimensional spaces, it is often infeasible to retrieve the exact nearest neighbours of the query vector due to the “curse of dimensionality”. Instead, approximate nearest neighbour (ANN) search is employed: given a query vector q and an approximation ratio c, the objective is to retrieve k vectors $$\{o_1,o_2,\ldots,o_k\}$$ sorted in ascending distance to q, such that each vector $$o_i$$ satisfies $$d(q,o_i)\leq c\cdot d(q,o_i^*),$$ where $$o_i^*$$ is the i-th nearest neighbour of q. Recent research has focused on worst-case guarantees for approximate nearest neighbour search. On the other hand, several recently proposed ANN algorithms exhibit good performance characteristics when evaluated on specific datasets. The research question that our project tries to answer is whether we can devise a query processing algorithm that is empirically fast while also providing worst-case query time and approximation guarantees.
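The c-approximation condition above can be checked directly against an exact baseline; the helper names in this small sketch are illustrative.

```python
import math

def dist(a, b):
    """Euclidean distance between two vectors given as lists of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_knn(q, data, k):
    """Brute-force k nearest neighbours of q, sorted by ascending distance."""
    return sorted(data, key=lambda o: dist(q, o))[:k]

def satisfies_c_approx(q, approx_result, data, k, c):
    """Check the c-approximate kNN condition: d(q, o_i) <= c * d(q, o_i*)
    for every position i, where o_i* is the exact i-th nearest neighbour."""
    exact = exact_knn(q, data, k)
    return all(dist(q, o) <= c * dist(q, o_star)
               for o, o_star in zip(approx_result, exact))
```

A result that swaps in the third-nearest point for the second-nearest still satisfies the guarantee for a sufficiently large c, but fails it once c drops below the actual distance ratio.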

Objectives


  • Propose online learning algorithms that can be applied to a query workload to identify shifts in the query distribution.

  • Formulate optimization problems for discovering optimal partitioning of vector-based datasets with respect to a query workload, with theoretically provable worst-case performance guarantees.

  • Discover novel adaptive data partitioning methods for vector-based data, which consider both the data distribution as well as query workload, aiming at optimized data partitions that minimize the cost of data access.

  • Design adaptive index structures for vector data in metric and multi-metric spaces, capitalizing on dynamic data partitions.

  • Devise efficient query processing algorithms for high-dimensional vector data following well-established methodologies, including filter-and-refinement and branch-and-bound, aiming at fast delivery of approximate query results with approximation guarantees.

People


Christos Doulkeridis

Principal Investigator, Professor

Orestis Telelis

Assistant Professor

Georgios M. Santipantakis

Postdoctoral Researcher

Dimitris Petratos

PhD Candidate

xyz

PhD Candidate

Kjetil Nørvåg

Professor


Results

With respect to research and scientific impact, access methods and novel algorithms for approximate nearest neighbour search are at the heart of vector data management systems. The results of Chameleon are expected to advance the research frontier in this field and fuel the development of new AI-driven applications. Improved processing techniques for vector data will lead to less time-consuming model training, faster inference times, and the ability to handle larger training sets on the same hardware. Most importantly, by designing new index structures and more efficient processing algorithms, our research will have a profound impact on the field and be of interest to many researchers.

Impact


Every day we use digital voice assistants, driver aids, shopping recommendations, suggestions on what news to read, etc. All these (and many more) AI-driven applications will benefit from our research, thus delivering higher-quality services to more people. Additionally, the results of this project are expected to lead to more cost-efficient management (indexing, processing, querying) and to reduce the computational complexity of processing. As operations such as similarity search over dense vector-based representations become prevalent in more and more applications, reducing the computational cost will lead to more energy-efficient and green data operations, in compliance with the European Green Deal and the target of an 11.7% reduction in energy consumption by 2030. In this respect, the research results of Chameleon are expected to also benefit society at large.

Publications


Journals

  1. Authors: Title. Journal

Conferences & Workshops

  1. Authors. Title. Citation

Technical Reports

  1. Authors. Title. Technical Report.

Newsletters


Newsletter #1
Newsletter #2

Brochure


Project brochure

Prototype


Prototype

Videos


TBA_1
TBA_2

Contact Us


For more information, please contact Christos Doulkeridis.
For more information about the research group and the department, please visit the respective home page: Department of Digital Systems