Chameleon is a research project funded by the Hellenic Foundation for Research and Innovation (HFRI), under an instrument that supports faculty members and researchers. The principal investigator of Chameleon is Christos Doulkeridis, and the project is hosted at the Department of Digital Systems of the University of Piraeus.
Several research challenges are associated with the efficient support of indexing and processing high-dimensional data. However, existing works focus mainly on the distribution of the data and on data updates, while remaining largely oblivious to the query workload and to shifts in the query distribution. This is partly because taking the query workload into account leads to hard optimization problems, which require either (a) costly exact algorithms that exhibit poor runtime performance, or (b) heuristic algorithms that work well only for specific instances of input datasets.
Consequently, this project identifies a set of research and technological challenges that need to be addressed effectively in order to enhance the indexing and processing of high-dimensional vector data:
The problem under study is how to divide the vectors into partitions, aiming to minimize the processing cost of a given workload. Given a dataset of vectors organized in a set of partitions P, we define the cost C(q) of an individual query vector q as the weighted sum of (a) the time required to determine which partitions need to be examined for q and (b) the cost of examining the objects in these matched partitions. The problem can then be formalized as finding the partitioning that minimizes the total cost C(Q) of a given query workload Q, defined as the sum of the costs of its individual queries. As this problem is NP-hard, our approach will try to learn the optimal partitioning: a learning-based method predicts the query cost induced by a given partitioning (i.e., grouping of vectors), and by applying this method recursively we aim to eventually obtain the desired workload-aware partitioning. In essence, our method uses machine learning to predict which partitioning is the best one.
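To make this concrete, the following minimal sketch illustrates how a learned cost model could rank candidate partitionings. Everything in it is an illustrative assumption rather than the project's actual design: the k-means candidates, the probe-two-partitions cost proxy standing in for the real measured cost C(Q), the handful of hand-picked features, and the choice of a gradient-boosted regressor.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 32))    # toy vector dataset
queries = rng.normal(size=(100, 32))  # toy query workload

def make_partitioning(k, seed):
    """Candidate partitioning: k-means clustering with a given k and seed."""
    labels = KMeans(n_clusters=k, random_state=seed, n_init=4).fit_predict(data)
    return [data[labels == i] for i in range(k)]

def measured_cost(parts):
    """Stand-in for the real cost C(Q): number of vectors scanned when each
    query probes its two closest partitions (routing cost ignored here)."""
    centroids = np.array([p.mean(axis=0) for p in parts])
    d = np.linalg.norm(queries[:, None, :] - centroids[None, :, :], axis=2)
    probed = np.argsort(d, axis=1)[:, :2]
    sizes = np.array([len(p) for p in parts])
    return float(sizes[probed].sum())

def featurize(parts):
    """Hypothetical features of a partitioning w.r.t. the query workload."""
    sizes = np.array([len(p) for p in parts], dtype=float)
    centroids = np.array([p.mean(axis=0) for p in parts])
    d = np.linalg.norm(queries[:, None, :] - centroids[None, :, :], axis=2)
    return [len(parts), sizes.std(), sizes.max(), d.min(axis=1).mean()]

candidates = [make_partitioning(k, s) for k in (4, 8, 16, 32) for s in (0, 1)]
X = np.array([featurize(P) for P in candidates])
y = np.array([measured_cost(P) for P in candidates])

model = GradientBoostingRegressor().fit(X, y)        # learn the cost model
best = candidates[int(np.argmin(model.predict(X)))]  # cheapest predicted cost
print(f"chosen partitioning has {len(best)} partitions")
```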
While various indexing techniques have been applied before, our focus is on clustering-based
index construction. In a high-dimensional setting, this is typically achieved by means of
pivot selection techniques. A set of pivots (also known as vantage points) is used to
pre-compute distances to vectors, and then (at query time) it is possible to prune many
vectors based on the triangle inequality. While many algorithms for pivot selection have
been studied and compared against each other, the following research questions still remain
open: (a) how to dynamically and adaptively select pivots based on the query workload, (b)
how to provide theoretical guarantees on the maximum number of distance comparisons for
arbitrary data distributions, (c) how to make pivot selection techniques applicable in the
context of dense high-dimensional vectors. Moreover, in the case of multi-metric search, where
a data object is represented as a vector in different metric spaces, it is challenging to
find the optimal indexing approach that combines the individual indexes (e.g., trade-offs
between hierarchical index composition and hybrid indexing).
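For background, the sketch below shows the classical pivot-based pruning scheme that these questions build on: distances to pivots are precomputed at index-construction time, and at query time the triangle inequality yields a lower bound on d(q, o) that lets many vectors be skipped without a distance computation. The random pivot choice is a deliberately naive placeholder for the adaptive, workload-aware strategies the project investigates.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(5000, 16))
pivots = data[rng.choice(len(data), size=8, replace=False)]  # naive pivot choice

# Index construction: precompute distances from every vector to every pivot.
data_to_piv = np.linalg.norm(data[:, None, :] - pivots[None, :, :], axis=2)

def range_query(q, radius):
    """Range search with pivot pruning: the triangle inequality gives
    d(q, o) >= |d(q, p) - d(o, p)| for every pivot p, so any o whose best
    lower bound exceeds the radius is pruned without a distance computation."""
    q_to_piv = np.linalg.norm(pivots - q, axis=1)
    lower = np.abs(data_to_piv - q_to_piv).max(axis=1)  # tightest lower bound
    cand = np.flatnonzero(lower <= radius)              # survivors of pruning
    dists = np.linalg.norm(data[cand] - q, axis=1)      # verify candidates
    return cand[dists <= radius], len(cand)

q = rng.normal(size=16)
matches, verified = range_query(q, radius=4.5)
print(f"verified {verified} of {len(data)} vectors, found {len(matches)} matches")
```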
How can we adaptively minimize the processing cost of a query workload (i.e., sequence of
queries)? When this question is viewed through the lens of online learning algorithms, it
would require committing to a “solution” before the next query in the workload sequence is
processed, based on feedback received from the processing of previous queries. Such a
“solution” could be a particular partitioning of the vector data. Upon reception of feedback
for the performance of this partitioning for the current query of the workload, the online
learning algorithm could modify it appropriately so as to cater for the upcoming query. The
question is then whether an online learning algorithm can be designed that provably
converges to an optimum (or a near-optimum) partitioning of the vector data, for the query
workload. A similar problem can be stated for the pivot selection indexing technique: an online learning algorithm maintains a pivot subset for the vector data and modifies it in anticipation of the next query of the workload sequence, based on performance feedback of the pivot subset on past queries. Again, the aim would be provable convergence to an optimum (or near-optimum) pivot subset for the query workload.
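As an illustration of this setting, here is a minimal sketch of a classical online learning algorithm (Hedge, also known as multiplicative weights) that commits to one candidate "solution" per query and reweights based on cost feedback. The candidate pool, the toy cost function, and the learning rate are hypothetical placeholders; the sketch also assumes full-information feedback (costs of all candidates are observed), whereas feedback for only the chosen candidate would call for a bandit variant such as EXP3.

```python
import numpy as np

rng = np.random.default_rng(2)

def hedge(candidates, query_stream, cost_fn, eta=0.5):
    """Hedge / multiplicative weights over a fixed pool of candidate
    'solutions' (e.g., partitionings or pivot subsets). Before each query
    we commit to one candidate; after processing we observe costs in [0, 1]
    and downweight costly candidates, which bounds the regret against the
    best fixed candidate in hindsight by O(sqrt(T))."""
    w = np.ones(len(candidates))
    for q in query_stream:
        i = rng.choice(len(candidates), p=w / w.sum())         # commit
        costs = np.array([cost_fn(c, q) for c in candidates])  # feedback
        yield candidates[i], costs[i]
        w *= np.exp(-eta * costs)                              # reweight

# Toy demo: three "partitionings"; the second is cheapest on this workload.
base_cost = np.array([0.8, 0.3, 0.6])
cost_fn = lambda c, q: float(np.clip(base_cost[c] + 0.1 * q, 0.0, 1.0))
stream = rng.normal(size=200)
total = sum(cost for _, cost in hedge([0, 1, 2], stream, cost_fn))
print(f"total cost over 200 queries: {total:.1f}")
```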
In high-dimensional spaces, it is often infeasible to retrieve the exact nearest neighbours of the query vector due to the “curse of dimensionality”. Instead, approximate nearest neighbour (ANN) search is employed: given a query vector q and an approximation ratio c, the objective is to retrieve k vectors $\{o_1, o_2, \ldots, o_k\}$, sorted in ascending distance to q, such that each vector $o_i$ satisfies $d(q, o_i) \leq c \cdot d(q, o_i^*)$, where $o_i^*$ is the i-th nearest neighbour of q. Recent research has focused on worst-case guarantees for approximate nearest neighbour search, while several recently proposed ANN algorithms exhibit excellent empirical performance on specific datasets. The research question our project tries to answer is whether we can devise a query processing algorithm that is empirically fast while also offering worst-case query time and approximation guarantees.
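The guarantee in this definition is directly checkable against a brute-force scan. The sketch below computes the per-rank ratios d(q, o_i) / d(q, o_i^*) for a returned result set; the dataset, the query, and the randomly sampled stand-in for an ANN result are all illustrative.

```python
import numpy as np

def approximation_ratios(q, data, result_ids, k):
    """Per-rank check of the c-ANN guarantee: compare the distance of the
    i-th returned vector o_i against the i-th true nearest neighbour o_i*.
    The result is c-approximate iff every ratio is at most c."""
    d_all = np.linalg.norm(data - q, axis=1)
    exact = np.sort(d_all)[:k]               # d(q, o_i*), i = 1..k
    approx = np.sort(d_all[result_ids])[:k]  # d(q, o_i) in ascending order
    return approx / np.maximum(exact, 1e-12)

rng = np.random.default_rng(3)
data = rng.normal(size=(10000, 64))
q = rng.normal(size=64)
result = rng.choice(len(data), size=1000, replace=False)  # stand-in ANN result
ratios = approximation_ratios(q, data, result, k=10)
print(f"worst per-rank ratio c = {ratios.max():.3f}")
```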
With respect to the research and scientific impact, access methods and novel algorithms for approximate nearest neighbor search are at the heart of vector data management systems. The results of Chameleon are expected to advance the research frontier in this field and fuel the development of new AI-driven applications. Improved processing techniques for vector data will lead to less time-consuming model training, faster inference, the ability to handle larger training sets on the same hardware, and more. Most importantly, by designing new index structures and more efficient processing algorithms, our research will have a profound impact on the research field, and we expect our work to attract broad interest from the research community.
Every day we use digital voice assistants, driver aids, shopping recommendations, suggestions on what news to read, and more. All these (and many other) AI-driven applications will benefit from our research, thus delivering higher-quality services to more people. Additionally, the results of this project are expected to enable more cost-efficient management of vector data (indexing, processing, querying) and to reduce the computational cost of processing. As operations such as similarity search over dense vector representations become prevalent in more and more applications, reducing their computational cost will lead to more energy-efficient and green data operations, in line with the European Green Deal and the EU target of an 11.7% reduction in energy consumption by 2030. In this respect, the research results of Chameleon are expected to also benefit society at large.
For more information, please contact Christos Doulkeridis. For details about the research group and the department, please visit the respective home pages: Department of Digital Systems