Posted to commits@airflow.apache.org by "utkarsharma2 (via GitHub)" <gi...@apache.org> on 2023/12/24 06:11:43 UTC

[PR] Add WeaviateDocumentIngestOperator [airflow]

utkarsharma2 opened a new pull request, #36402:
URL: https://github.com/apache/airflow/pull/36402

   Exposes `create_or_replace_document_objects` as an operator. It handles a very common scenario: ingesting objects derived from unique documents while keeping up with changes to those documents.
   
   ### Example
   Take the document `https://en.wikipedia.org/wiki/Taj_Mahal`. The entire document is split into smaller chunks because LLMs limit how much data they can handle in a single call.
   
   Assume that the document is converted into two chunks:
   
   ##### Chunk 1:
   The Taj Mahal  'Crown of the Palace' is an ivory-white marble mausoleum on the right bank of the river Yamuna Agra Uttar Pradesh India. It was commissioned in 1631 by the fifth Mughal emperor, Shah Jahan to house the tomb of his beloved wife, Mumtaz Mahal it also houses the tomb of Shah Jahan himself. 
   
   ##### Chunk 2:
   The tomb is the centerpiece of a 17-hectare (42-acre) complex, which includes a mosque and a guest house, and is set in formal gardens bounded on three sides by a crenellated wall.
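
The chunking step described above can be sketched as follows. This is only a minimal illustration, assuming a naive sentence-based strategy with a character budget; the function name `chunk_sentences` and the `max_chars` parameter are hypothetical and are not part of this PR or of the Weaviate provider.

```python
import re


def chunk_sentences(text: str, max_chars: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    Hypothetical helper for illustration only. A single sentence longer than
    max_chars is kept whole rather than being cut mid-sentence.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Changing `max_chars` (or switching to a different splitting strategy) yields a different set of chunks for the same document, so chunk boundaries are not stable across runs with different settings.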
   
   
   #### Changes:
   
   For LLMs to answer questions correctly, they need access to only up-to-date information, which is why we must keep only the latest set of chunks in the database.
   
   Suppose, for example, that we later learn the Taj Mahal was actually commissioned in 1593. The document changes, and the chunking/tokenizing strategy changes as well, so we now have a different set of chunks:
   
   ##### Chunk 1:
   The Taj Mahal  'Crown of the Palace' is an ivory-white marble mausoleum on the right bank of the river Yamuna Agra Uttar Pradesh India.
   
   ##### Chunk 2:
    It was commissioned in 1593 by the fifth Mughal emperor, Shah Jahan to house the tomb of his beloved wife, Mumtaz Mahal it also houses the tomb of Shah Jahan himself. 
   
   ##### Chunk 3:
   The tomb is the centerpiece of a 17-hectare (42-acre) complex, which includes a mosque and a guest house, and is set in formal gardens bounded on three sides by a crenellated wall.
   
   With these new chunks, we have no way of knowing exactly which chunk to replace, because a document can be chunked/tokenized in multiple ways, each of which may split it differently. So our best bet is to drop all the objects belonging to a document and re-create the document's objects entirely.
   
   `WeaviateDocumentIngestOperator` handles these complexities by operating at the document level. It offers an `existing` param with the following possible values:
    
   1. `replace`: replace the existing objects with new objects. This option requires identifying the
       objects belonging to a document, which by default is done using the `document_column` field.
   2. `skip`: skip the existing objects and only add the missing objects of a document.
   3. `error`: raise an error if an attempt is made to create an object belonging to an existing document.
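
The three modes above can be modelled with a small in-memory sketch. It does not call Weaviate or Airflow and is not the provider's implementation: the `ingest` function, the dict-based `store`, and the content-derived `object_id` are all assumptions made here for illustration.

```python
from __future__ import annotations

import hashlib


def object_id(doc: str, text: str) -> str:
    """Hypothetical deterministic ID derived from the document key and chunk text."""
    return hashlib.sha256(f"{doc}\n{text}".encode()).hexdigest()


def ingest(store: dict[str, dict], doc: str, chunks: list[str], existing: str) -> None:
    """Ingest the chunks of one document into an in-memory store (illustration only)."""
    if existing not in {"replace", "skip", "error"}:
        raise ValueError(f"unsupported value for existing: {existing!r}")

    # IDs of all objects already stored for this document.
    current = {oid for oid, obj in store.items() if obj["doc"] == doc}

    if current and existing == "error":
        raise ValueError(f"document {doc!r} already has objects in the store")

    if existing == "replace":
        # Drop every object of the document, then re-create it from the new chunks.
        for oid in current:
            del store[oid]

    for text in chunks:
        oid = object_id(doc, text)
        if oid in store:
            # "skip" mode: this exact object is already present, leave it alone.
            continue
        store[oid] = {"doc": doc, "text": text}
```

In this model, `skip` can leave stale objects behind when a document changes (the outdated chunk stays next to the new ones), while `replace` guarantees that only the latest chunks of the document remain.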
   
   
   
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add WeaviateDocumentIngestOperator [airflow]

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk merged PR #36402:
URL: https://github.com/apache/airflow/pull/36402

