Posted to commits@pinot.apache.org by "Aravind-Suresh (via GitHub)" <gi...@apache.org> on 2023/06/15 11:57:34 UTC

[GitHub] [pinot] Aravind-Suresh opened a new issue, #10919: Vector embeddings support in Pinot

Aravind-Suresh opened a new issue, #10919:
URL: https://github.com/apache/pinot/issues/10919

   Creating this issue to initiate discussions about supporting vector embeddings in Pinot.
   
   This [write-up](https://docs.google.com/document/d/1aiXPbwK4rU_YdfMPt3K752SuCMy8KQehqM4ltPg9juE/edit) collates some initial thoughts about this. It isn't a design doc; we'll work on the design doc once we have high-level alignment.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] siddharthteotia commented on issue #10919: Vector embeddings support in Pinot

Posted by "siddharthteotia (via GitHub)" <gi...@apache.org>.

siddharthteotia commented on issue #10919:
URL: https://github.com/apache/pinot/issues/10919#issuecomment-1592971805

   Glad to see there are others thinking about this as well. 
   
   I had recently created a short internal proposal on why a case can be made for vector storage and indexing in Pinot. 
   
   I think the first thing we need to do is get alignment / consensus within the community that it makes sense to do vector search in Pinot.
   
   This is the internal description and business justification we created. @jasperjiaguo can add more info.
   
   **Description**
   
   Vector embeddings are numerical, coordinate-based representations in a multi-dimensional space, typically produced by training a machine learning model. For example, training an LLM on text can produce billions of vector embeddings, which are distilled representations of the text / words in the training data. The goal is to build optimal storage, indexing and query execution capabilities for this kind of data.
   
   **Benefit / Use Case**
   
   This can be a crucial foundation for AI systems that leverage high-performance similarity indexing and analytics on vector embeddings for recommendation, image matching, pattern recognition, anomaly detection, etc.
   
   Specifically, in the case of LLMs and prompt engineering pipelines, vector storage, indexing and querying can be used to store and query domain-specific facts (that were created during training, e.g. neural network learning), which can then be fed into NLP models, chatbots, conversational prompts, etc.



[GitHub] [pinot] xiangfu0 commented on issue #10919: Vector embeddings support in Pinot

Posted by "xiangfu0 (via GitHub)" <gi...@apache.org>.

xiangfu0 commented on issue #10919:
URL: https://github.com/apache/pinot/issues/10919#issuecomment-1593802400

   cc: @kkrugler 



[GitHub] [pinot] Aravind-Suresh commented on issue #10919: Vector embeddings support in Pinot

Posted by "Aravind-Suresh (via GitHub)" <gi...@apache.org>.

Aravind-Suresh commented on issue #10919:
URL: https://github.com/apache/pinot/issues/10919#issuecomment-1593475379

   Thanks for the inputs @siddharthteotia @jasperjiaguo - yes, given the high dimensionality of the embeddings (OpenAI-davinci embeddings are >12k in dimensions), it's practical to use approximate algorithms.
   
   In addition to recommendation systems and vector-search based prompts, there are also applications in semantic search and clustering (grouping of related issues or text).
   
   We recently tried powering automated Q&A via vector-search (using vector search based prompts) and it achieves good precision on unstructured data input as well (we used langchain here - https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/chroma.html)
   
   Given that new features are being powered via embeddings (Glean's AI powered enterprise search is one recent example - https://www.glean.com/blog/unlocking-the-power-of-vector-search-in-enterprise), it would be good to evaluate how Pinot can support this in a real-time setup.
   
   Looking forward to the collaboration here!



[GitHub] [pinot] abhioncbr commented on issue #10919: Vector embeddings support in Pinot

Posted by "abhioncbr (via GitHub)" <gi...@apache.org>.

abhioncbr commented on issue #10919:
URL: https://github.com/apache/pinot/issues/10919#issuecomment-1593068427

   This is interesting. +1



[GitHub] [pinot] jasperjiaguo commented on issue #10919: Vector embeddings support in Pinot

Posted by "jasperjiaguo (via GitHub)" <gi...@apache.org>.

jasperjiaguo commented on issue #10919:
URL: https://github.com/apache/pinot/issues/10919#issuecomment-1593539838

   @Aravind-Suresh Exactly. I've also been using [llama_index](https://github.com/jerryjliu/llama_index) and langchain with the ChatGPT APIs. I think one usability addition to this feature may be to integrate the Pinot vector store with these Python packages, or to provide similarly powerful Python libs. Here is a list of the vector stores llama_index supports: https://gpt-index.readthedocs.io/en/latest/how_to/integrations/vector_stores.html .



Re: [I] Vector embeddings support in Pinot [pinot]

Posted by "PeterCorless (via GitHub)" <gi...@apache.org>.

PeterCorless commented on issue #10919:
URL: https://github.com/apache/pinot/issues/10919#issuecomment-2057576972

   See [PR#11977](https://github.com/apache/pinot/pull/11977)



[GitHub] [pinot] xiangfu0 commented on issue #10919: Vector embeddings support in Pinot

Posted by "xiangfu0 (via GitHub)" <gi...@apache.org>.

xiangfu0 commented on issue #10919:
URL: https://github.com/apache/pinot/issues/10919#issuecomment-1593813274

   Here are some takes from my side:
   High-level principles:
   - CPU solution
   - KNN search has to be a distributed solution
   - The minimal search space is considered to be one segment (10-100MM rows/points)
   - Pluggable index structure along with the search algorithm
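
A minimal sketch of the "distributed, per-segment" principle above: each segment returns its own top-k and a merge step keeps the global top-k. The names `segment_top_k` and `broker_merge` are hypothetical and are not Pinot APIs; the per-segment scan is exact here only for illustration, and a pluggable index would take its place.

```python
import heapq
import math

def segment_top_k(segment, query, k):
    """Top-k inside one segment (brute force here; an index would replace the scan)."""
    scored = ((math.dist(vec, query), doc_id) for doc_id, vec in segment.items())
    return heapq.nsmallest(k, scored)

def broker_merge(per_segment_results, k):
    """Merge the per-segment top-k lists into one global top-k."""
    all_hits = (hit for hits in per_segment_results for hit in hits)
    return heapq.nsmallest(k, all_hits)

# Two toy segments, each mapping a doc id to a 2-d vector.
segments = [
    {"a": (0.0, 0.0), "b": (5.0, 5.0)},
    {"c": (1.0, 0.0), "d": (9.0, 9.0)},
]
query = (0.0, 0.0)

partials = [segment_top_k(seg, query, k=2) for seg in segments]
top2 = broker_merge(partials, k=2)
print([doc_id for _, doc_id in top2])
```

Because every segment returns at least its own k best hits, the merged result is the exact global top-k whenever the per-segment search is exact; with an approximate per-segment index, recall is bounded by the per-segment recall.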
   
   Considering that the doc count in one segment is usually < 10MM, I think any of the current **billion-scale** approaches is sufficient for us.
   
   In terms of implementation, here is an example using SPTAG (https://github.com/microsoft/SPTAG); the paper is: https://arxiv.org/pdf/2111.08566.pdf
   
   During the index build phase, we need to build an SPTAG index per segment, using hierarchical balanced clustering to generate a set of regions (centroids).
   We can configure the two parameters below:
   - Number of regions, expressed as the percentage of total points that are centroids. From the paper, 16% is best for search performance and memory usage
   - Number of replicas, i.e. how many of the closest clusters each vector is assigned to. A larger number means better recall, but search requires more resources and has longer latency. From the paper, 8 is best to balance perf and latency. Need to use the RNG algorithm to avoid highly similar posting lists for close regions
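
To make the two build parameters concrete, here is a rough sketch, not SPTAG's actual implementation. It assumes evenly spaced points as centroids in place of hierarchical balanced clustering, and it skips the RNG pruning step entirely; `build_posting_lists`, `centroid_ratio` and `replicas` are hypothetical names.

```python
import math

def build_posting_lists(vectors, centroid_ratio=0.16, replicas=8):
    """Pick centroids and assign each vector to its `replicas` closest centroids."""
    n_centroids = max(1, round(len(vectors) * centroid_ratio))
    step = len(vectors) / n_centroids
    # Stand-in for hierarchical balanced clustering: evenly spaced points.
    centroids = [vectors[int(i * step)] for i in range(n_centroids)]

    postings = {c: [] for c in range(n_centroids)}
    for vec_id, vec in enumerate(vectors):
        by_distance = sorted(range(n_centroids),
                             key=lambda c: math.dist(vec, centroids[c]))
        # A vector is replicated into several posting lists to improve recall.
        for c in by_distance[:replicas]:
            postings[c].append(vec_id)
    return centroids, postings

# 25 points on a line; 16% of them become centroids, 2 replicas per vector.
vectors = [(float(i), 0.0) for i in range(25)]
centroids, postings = build_posting_lists(vectors, centroid_ratio=0.16, replicas=2)
print(len(centroids), sum(len(p) for p in postings.values()))
```

The total posting-list size grows linearly with the replica count, which is the resource cost mentioned above.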
   
   During the query phase:
   The kNN search functionality should be configurable with:
   - k (required), which is how many results to fetch
   - t (optional), a percentage used to include more regions in the search based on their distance to the closest centroids; this will increase the recall rate while still keeping resource usage low
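
The query parameters can be sketched as follows. This is one possible reading of `t`, an assumption rather than what the paper or SPTAG specifies: a region is probed if its centroid distance is within `(1 + t)` times the distance to the closest centroid. The function `knn_search` and the tiny hand-built index are hypothetical.

```python
import heapq
import math

def knn_search(query, centroids, postings, vectors, k, t=0.0):
    """Return ids of the k nearest candidates found in the probed regions."""
    centroid_dists = [math.dist(query, c) for c in centroids]
    closest = min(centroid_dists)
    # Probe the closest region plus any region within the t tolerance.
    probe = [i for i, d in enumerate(centroid_dists) if d <= closest * (1.0 + t)]
    candidates = {vec_id for i in probe for vec_id in postings[i]}
    return heapq.nsmallest(k, candidates,
                           key=lambda v: math.dist(query, vectors[v]))

# Hand-built index: two regions with two vectors each.
vectors = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
centroids = [(0.5, 0.0), (4.5, 0.0)]
postings = {0: [0, 1], 1: [2, 3]}
query = (2.2, 0.0)

narrow = knn_search(query, centroids, postings, vectors, k=2, t=0.0)
wide = knn_search(query, centroids, postings, vectors, k=2, t=0.5)
print(narrow, wide)
```

With `t=0.0` only the closest region is scanned and vector 2, a true neighbour, is missed; with `t=0.5` the second region is also probed and it is found, at the cost of scanning more candidates.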



[GitHub] [pinot] walterddr commented on issue #10919: Vector embeddings support in Pinot

Posted by "walterddr (via GitHub)" <gi...@apache.org>.

walterddr commented on issue #10919:
URL: https://github.com/apache/pinot/issues/10919#issuecomment-1608653404

   CPU solutions only make sense in certain scenarios IMO, and I am not sure if those fit here.
   - Q: Can it perform significantly better in specific use cases, for example ANNS use cases where the setup/GPU I/O overhead outweighs the batch performance on the GPU?
   - Q: Can we use an algorithm that doesn't depend on product quantization (or any that is specifically designed to leverage the large parallelism of the GPU), for example a graph search algo that performs well?
       - This also echoes back to Q1 b/c most likely these branching algorithms are not good for batching.
   - Q: Would it be significantly cheaper while still maintaining equal performance? And are there use cases like that (for example, ad-hoc exploration of a dataset before it is massively scaled up to the point where a GPU is justified)?
   
   Specifically for Pinot, I know that most vector databases leverage an "inverted index" mechanism to speed up the ANNS algorithm. I don't think that's identical to the inverted index we have in Pinot, but we should see if the indexing framework can be used once index-spi is introduced.



[GitHub] [pinot] KKcorps commented on issue #10919: Vector embeddings support in Pinot

Posted by "KKcorps (via GitHub)" <gi...@apache.org>.

KKcorps commented on issue #10919:
URL: https://github.com/apache/pinot/issues/10919#issuecomment-1596320174

   IMO, a CPU-based solution would be too slow for vector search. Currently popular vector embeddings use floating point arrays of length 700 to 1536 for a single object.
   
   Computing similarity across millions of such objects at runtime for indexing is quite compute heavy.
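
A back-of-the-envelope calculation for the cost mentioned above, assuming one million vectors, 1536 float32 dimensions, and an exhaustive scan with one multiply and one add per dimension. These are illustrative assumptions, not measurements.

```python
num_vectors = 1_000_000
dims = 1536
bytes_per_float = 4  # float32

# One multiply and one add per dimension for each dot product.
flops_per_query = 2 * num_vectors * dims
memory_bytes = num_vectors * dims * bytes_per_float

print(f"{flops_per_query / 1e9:.2f} GFLOP per brute-force query")
print(f"{memory_bytes / 1e9:.2f} GB of raw vectors to scan")
```

Under these assumptions every exhaustive query touches roughly 6 GB of data, which is why an index that narrows the candidate set matters regardless of whether the distance computation runs on CPU or GPU.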



Re: [I] Vector embeddings support in Pinot [pinot]

Posted by "hpvd (via GitHub)" <gi...@apache.org>.

hpvd commented on issue #10919:
URL: https://github.com/apache/pinot/issues/10919#issuecomment-2071881575

   Release video: Apache Pinot 1.1 | Overview of Latest Features and Updates https://www.youtube.com/watch?v=wSwPtOajsGY
   It also talks about vector index support: https://www.youtube.com/watch?v=wSwPtOajsGY&t=1m20s



[GitHub] [pinot] siddharthteotia commented on issue #10919: Vector embeddings support in Pinot

Posted by "siddharthteotia (via GitHub)" <gi...@apache.org>.

siddharthteotia commented on issue #10919:
URL: https://github.com/apache/pinot/issues/10919#issuecomment-1592973429

   Would love to collaborate on this.



[GitHub] [pinot] jasperjiaguo commented on issue #10919: Vector embeddings support in Pinot

Posted by "jasperjiaguo (via GitHub)" <gi...@apache.org>.

jasperjiaguo commented on issue #10919:
URL: https://github.com/apache/pinot/issues/10919#issuecomment-1593447601

   Recommendation systems and Large Language Model (LLM) applications often use high-dimensional vector spaces to represent complex data like user profiles or linguistic patterns. Similarity-based vector indexing/search, a crucial element of these systems, identifies 'close' vectors in this space, signifying high similarity. This is commonly achieved by calculating the cosine similarity or Euclidean distance between vectors.
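
For reference, the two measures named above can be written in a few lines of plain Python. The function names are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b; 0.0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

u = (1.0, 0.0)
v = (2.0, 0.0)  # same direction as u, different length
w = (0.0, 3.0)  # orthogonal to u

print(cosine_similarity(u, v), euclidean_distance(u, v))
print(cosine_similarity(u, w), euclidean_distance(u, w))
```

Note that `u` and `v` are maximally similar by cosine but still one unit apart by Euclidean distance: cosine ignores vector length, so the choice of measure depends on whether magnitude carries meaning in the embedding.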
   
   For instance, (1) in recommendation systems, items similar to a user's past interests are identified and suggested. (2) Meanwhile, in LLM applications, instead of submitting a customer’s prompt directly to the model, the question is first routed to the vector database (which can be considered the memory of the LLM), which retrieves the top 10 or 15 most relevant documents for that query. The vector database then bundles those supporting documents with the user’s original question and submits the full package as the knowledge context prompt to the LLM, which returns a more relevant answer. (https://mlops.community/combine-and-query-multiple-documents-with-llm/, https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/MilvusIndexDemo.html)
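
The retrieve-then-prompt flow in (2) can be sketched like this. Everything here is a toy assumption: the in-memory `store`, the hand-written 2-d "embeddings", and the `retrieve` / `build_prompt` helpers are hypothetical and do not correspond to any real library's API; a real setup would call an embedding model and a vector store.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(query_vec, store, top_k):
    """Return the texts of the top_k documents most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["embedding"]),
                    reverse=True)
    return [item["text"] for item in ranked[:top_k]]

def build_prompt(question, documents):
    """Bundle the retrieved documents with the original question."""
    context = "\n".join(f"- {doc}" for doc in documents)
    return f"Context:\n{context}\n\nQuestion: {question}"

store = [
    {"text": "Pinot stores data in segments.", "embedding": (0.9, 0.1)},
    {"text": "Bananas are yellow.", "embedding": (0.0, 1.0)},
    {"text": "Pinot supports real-time ingestion.", "embedding": (0.8, 0.3)},
]
question = "How does Pinot store data?"
question_vec = (1.0, 0.0)  # stand-in for an embedding model's output

docs = retrieve(question_vec, store, top_k=2)
prompt = build_prompt(question, docs)
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM in place of the bare question.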
   
   However, given the potentially vast number of vectors, searching for the most similar ones can be computationally challenging. Therefore, Approximate Nearest Neighbor (ANN) libraries like FAISS, Annoy, or ScaNN are employed to expedite this process by quickly finding the nearest vectors in high-dimensional spaces.
   
   https://github.com/facebookresearch/faiss
   
   https://www.datanami.com/2023/03/27/vector-databases-emerge-to-fill-critical-role-in-ai/
   
   https://github.com/linkedin/venice#read-compute



[GitHub] [pinot] kishoreg commented on issue #10919: Vector embeddings support in Pinot

Posted by "kishoreg (via GitHub)" <gi...@apache.org>.

kishoreg commented on issue #10919:
URL: https://github.com/apache/pinot/issues/10919#issuecomment-1593493167

   cc @KKcorps who is also thinking about it.

