Posted to solr-dev@lucene.apache.org by "Noble Paul (JIRA)" <ji...@apache.org> on 2008/12/02 19:14:44 UTC

[jira] Issue Comment Edited: (SOLR-828) A RequestProcessor to support updates

    [ https://issues.apache.org/jira/browse/SOLR-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652442#action_12652442 ] 

noble.paul edited comment on SOLR-828 at 12/2/08 10:14 AM:
-----------------------------------------------------------

The new {{UpdateProcessor}} (called {{UpdateableIndexProcessor}}) must be inserted before {{RunUpdateProcessor}}; a wiring sketch follows the list below.

* The {{UpdateProcessor}} must add an update method ({{processUpdate()}}, described below).
* The {{AddUpdateCommand}} gets a new boolean field {{append}}. If {{append=true}}, values of multivalued fields are appended; otherwise the old values are removed and the new ones are added.
* The schema must have a {{<uniqueKey>}}
* {{UpdateableIndexProcessor}} registers {{postCommit/postOptimize}} listeners.
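
For concreteness, here is a minimal sketch of how the processor could be wired in, with the factory listed just before {{RunUpdateProcessorFactory}} in the {{updateRequestProcessorChain}} of {{solrconfig.xml}}. {{UpdateableIndexProcessor}} and {{BackupStore}} are the classes proposed here, not existing code.

{code:java}
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// Sketch only: UpdateableIndexProcessor and BackupStore are the classes
// proposed in this issue, not existing code.
public class UpdateableIndexProcessorFactory extends UpdateRequestProcessorFactory {
  private BackupStore backupStore; // hypothetical DB access layer, created at init

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    // 'next' is the rest of the chain (ending in RunUpdateProcessor), so this
    // processor sees every command before the index itself is modified.
    String uniqueKey = req.getSchema().getUniqueKeyField().getName();
    return new UpdateableIndexProcessor(backupStore, uniqueKey, next);
  }
}
{code}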

h1. Implementation
{{UpdateableIndexProcessor}} uses a DB (JDBC / Berkeley DB Java Edition?) to store the data. Each document will be a row in the DB. The uniqueKey of the document will be used as the primary key, and the data will be written as a BLOB into a DB column in {{javabin}} serialized format. The {{javabin}} format in its current form is inefficient, but it is possible to enhance it (SOLR-810).

The schema of the table would be:
* ID : VARCHAR : the primary key of the document, as a string
* DATA : LONGVARBINARY : a {{javabin}}-serialized SolrInputDocument
* STATUS : ENUM (COMMITTED = 0, UNCOMMITTED = 1, UNCOMMITTED_MARKED_FOR_DELETE = 2, COMMITTED_MARKED_FOR_DELETE = 3)
* BOOST : DOUBLE : the document boost
* FIELD_BOOSTS : VARBINARY : {{javabin}}-serialized data with the boost of each field
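
For illustration, the table could be created over plain JDBC roughly as below. The table name, column sizes and the INTEGER encoding of STATUS are assumptions; the exact SQL types vary by database.

{code:java}
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class BackupStoreSchema {
  // Creates the backup table sketched above. STATUS is stored as a plain
  // integer using the four constants listed in the schema description.
  public static void createTable(Connection con) throws SQLException {
    Statement stmt = con.createStatement();
    try {
      stmt.executeUpdate(
          "CREATE TABLE backup_docs (" +
          " ID           VARCHAR(256) PRIMARY KEY," + // uniqueKey as a string
          " DATA         LONGVARBINARY NOT NULL,"   + // javabin SolrInputDocument
          " STATUS       INTEGER NOT NULL,"         + // 0..3 per the enum above
          " BOOST        DOUBLE,"                   +
          " FIELD_BOOSTS VARBINARY(4096)"           + // javabin map of field boosts
          ")");
    } finally {
      stmt.close();
    }
  }
}
{code}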

h1. Implementation of various methods

h2. {{processAdd()}}
{{UpdateableIndexProcessor}} writes the serialized document to the DB with STATUS=UNCOMMITTED, then calls {{processAdd()}} on the next {{UpdateProcessor}}.
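
A rough sketch of the processor and its {{processAdd()}}; {{BackupStore}} and its methods are hypothetical stand-ins for the DB layer.

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.util.JavaBinCodec;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class UpdateableIndexProcessor extends UpdateRequestProcessor {
  private final BackupStore backupStore; // hypothetical DAO over the table above
  private final String uniqueKeyField;

  public UpdateableIndexProcessor(BackupStore store, String uniqueKeyField,
                                  UpdateRequestProcessor next) {
    super(next);
    this.backupStore = store;
    this.uniqueKeyField = uniqueKeyField;
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.solrDoc;
    String id = doc.getFieldValue(uniqueKeyField).toString();

    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    new JavaBinCodec().marshal(doc, bos);   // javabin-serialize the document
    backupStore.upsert(id, bos.toByteArray(), BackupStore.UNCOMMITTED);

    super.processAdd(cmd);                  // pass on towards RunUpdateProcessor
  }
}
{code}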

h2. {{processDelete()}}
{{UpdateableIndexProcessor}} gets a Searcher from the core, finds the documents that match the query, and deletes them from the data table. If it is a delete-by-id, the document with that id is deleted from the data table. Then the next {{UpdateProcessor}} is called.
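
Continuing the same sketch ({{findMatchingIds}} is a hypothetical helper that resolves the delete query to uniqueKey values through a core searcher):

{code:java}
import org.apache.solr.update.DeleteUpdateCommand;

// Continuing the UpdateableIndexProcessor sketch above.
@Override
public void processDelete(DeleteUpdateCommand cmd) throws IOException {
  if (cmd.id != null) {
    backupStore.delete(cmd.id);            // delete-by-id: drop the single row
  } else {
    // Delete-by-query: resolve the query to uniqueKey values with a core
    // searcher (findMatchingIds is a hypothetical helper), then drop each row.
    for (String id : findMatchingIds(cmd.query)) {
      backupStore.delete(id);
    }
  }
  super.processDelete(cmd);                // pass on to the next UpdateProcessor
}
{code}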

h2. {{processCommit()}}
Call next {{UpdateProcessor}}

h2. On {{postCommit/postOptimize}}
{{UpdateableIndexProcessor}} gets all the documents from the data table with STATUS=UNCOMMITTED. If a document is present in the main index, it is marked COMMITTED; otherwise it is deleted from the table, because a deleteByQuery must have removed it.
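
A sketch of that reconciliation step; the {{BackupStore}} calls are hypothetical, while {{SolrIndexSearcher#getFirstMatch}} is existing API that returns -1 when nothing matches.

{code:java}
import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.solr.search.SolrIndexSearcher;

// Sketch of the work done by the postCommit/postOptimize listener.
void reconcile(SolrIndexSearcher searcher) throws IOException {
  for (String id : backupStore.findIdsByStatus(BackupStore.UNCOMMITTED)) {
    if (searcher.getFirstMatch(new Term(uniqueKeyField, id)) != -1) {
      backupStore.setStatus(id, BackupStore.COMMITTED); // it reached the index
    } else {
      backupStore.delete(id); // a deleteByQuery removed it before the commit
    }
  }
}
{code}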

h2. {{processUpdate()}}
{{UpdateableIndexProcessor}} first checks for the document in the data table. If it is present, the stored document is read; if not, the missing fields are read from the main index, and the backup document is prepared from them.

Single-valued fields are taken from the incoming document when present; the others are filled in from the backup document. If {{append=true}}, all the multivalued-field values from the backup document are added to the incoming document; otherwise the backup values are not used for fields that are also present in the incoming document.

{{processAdd()}} is then called on the next {{UpdateProcessor}}.
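
The merging rules could be sketched as follows (method and parameter names are illustrative only; field boosts are left out for brevity):

{code:java}
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SchemaField;

// Merge the backup document into the incoming partial document.
SolrInputDocument merge(SolrInputDocument incoming, SolrInputDocument backup,
                        boolean append, IndexSchema schema) {
  for (String name : backup.getFieldNames()) {
    SchemaField sf = schema.getFieldOrNull(name);
    boolean multiValued = sf != null && sf.multiValued();
    if (!multiValued) {
      // single-valued: the incoming value wins; backup only fills the gaps
      if (incoming.getFieldValue(name) == null) {
        incoming.setField(name, backup.getFieldValue(name));
      }
    } else if (append || incoming.getFieldValue(name) == null) {
      // multivalued: append=true always adds the backup values; append=false
      // uses them only when the incoming document lacks the field entirely
      for (Object v : backup.getFieldValues(name)) {
        incoming.addField(name, v);
      }
    }
  }
  return incoming;
}
{code}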

h2. New {{BackupIndexRequestHandler}} registered automatically at {{/backup}}
This exposes the data present in the backup index. The user must be able to fetch any document by id by invoking {{/backup?id=<value>}} (multiple id values can be sent, e.g. {{id=1&id=2&id=4}}). This lets the user query the backup index and construct the new document himself if he wishes to do so.

h2. Next steps
The datastore can be optimized by not storing the stored fields in the DB. This means that on {{postCommit/postOptimize}} we must read back the data, strip out the fields that are already stored in the index, and write the document back. That can be another iteration.



> A RequestProcessor to support updates
> -------------------------------------
>
>                 Key: SOLR-828
>                 URL: https://issues.apache.org/jira/browse/SOLR-828
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Noble Paul
>             Fix For: 1.4
>
>
> This is same as SOLR-139. A new issue is opened so that the UpdateProcessor approach is highlighted and we can easily focus on that solution. 
