Posted to dev@lucene.apache.org by "Jason Rutherglen (JIRA)" <ji...@apache.org> on 2008/06/26 18:25:44 UTC

[jira] Created: (LUCENE-1317) Add InstantiatedIndexWriter.addIndexes(IndexReader[] readers)

Add InstantiatedIndexWriter.addIndexes(IndexReader[] readers)
-------------------------------------------------------------

                 Key: LUCENE-1317
                 URL: https://issues.apache.org/jira/browse/LUCENE-1317
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/*
            Reporter: Jason Rutherglen


Enable InstantiatedIndexWriter to accept IndexReaders, as IndexWriter does, and merge them into the index.

Karl mentioned:
bq: It's doable. The simplest solution I can think of is to reconstruct all the documents in one single enumeration of the source index and then add them to the writer. I'm however not certain this is the best way nor if InstantiatedIndexWriter is the place for the code.

How would the documents be reconstructed without creating a lot of overhead? InstantiatedIndexWriter seems like the right place, given that it is presumably more efficient to recreate the contents of all the IndexReaders and then commit.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1317) Add InstantiatedIndexWriter.addIndexes(IndexReader[] readers)

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608524#action_12608524 ] 

Jason Rutherglen commented on LUCENE-1317:
------------------------------------------

The problem with this arises when a field is only indexed, without term vector offsets, and not stored. Is there a way to handle these types of fields? The Token equality you mention is handled in the DocumentsWriter code, though without payloads. There may be a better way to do this by reusing some of the SegmentMerger code.

[jira] Commented: (LUCENE-1317) Add InstantiatedIndexWriter.addIndexes(IndexReader[] readers)

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608550#action_12608550 ] 

Karl Wettin commented on LUCENE-1317:
-------------------------------------

bq. The problem with this arises when a field is only indexed, without term vector offsets, and not stored.

Use the term vectors only for the offsets, nothing else. Extract everything else (token text, position increment, payload, etc.) from the inverted index using TermEnum and TermPositions.
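
A rough sketch of that split, in plain Java rather than Lucene code: positions come from the inverted index, offsets are attached only when a term vector supplies them. RebuiltToken and OffsetJoiner are hypothetical names, and the sketch assumes the offsets arrive as an array of {start, end} pairs parallel to the positions array, or as null for a field indexed without offsets.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical token: text and position from the inverted index, offsets from a term vector if present. */
final class RebuiltToken {
    final String text;
    final int position;
    final int startOffset; // -1 when the field has no stored offsets
    final int endOffset;   // -1 when the field has no stored offsets

    RebuiltToken(String text, int position, int startOffset, int endOffset) {
        this.text = text;
        this.position = position;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

final class OffsetJoiner {
    /**
     * Builds the tokens for one term in one document.
     *
     * @param positions positions of the term in the document (from the inverted index)
     * @param offsets   assumed parallel array of {start, end} pairs, or null if no offsets exist
     */
    static List<RebuiltToken> join(String text, int[] positions, int[][] offsets) {
        if (offsets != null && offsets.length != positions.length) {
            throw new IllegalArgumentException("offsets must be parallel to positions");
        }
        List<RebuiltToken> tokens = new ArrayList<>();
        for (int i = 0; i < positions.length; i++) {
            // Fall back to -1 when the field was indexed without term vector offsets.
            int start = (offsets == null) ? -1 : offsets[i][0];
            int end = (offsets == null) ? -1 : offsets[i][1];
            tokens.add(new RebuiltToken(text, positions[i], start, end));
        }
        return tokens;
    }
}
```

With this shape, a field that is only indexed still yields tokens with text and positions; it just carries -1 offsets. Payloads and position increments are left out of the sketch.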

[jira] Commented: (LUCENE-1317) Add InstantiatedIndexWriter.addIndexes(IndexReader[] readers)

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608486#action_12608486 ] 

Jason Rutherglen commented on LUCENE-1317:
------------------------------------------

It looks like a modified version of addDocument will work, one that operates on TokenStreams and Documents manufactured from the IndexReaders. org.apache.lucene.search.highlight.TokenSources can supply the TokenStreams.

[jira] Commented: (LUCENE-1317) Add InstantiatedIndexWriter.addIndexes(IndexReader[] readers)

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608513#action_12608513 ] 

Karl Wettin commented on LUCENE-1317:
-------------------------------------

bq. Can use the org.apache.lucene.search.highlight.TokenSources for the TokenStreams.

TokenSources only handles one document at a time. It is much more efficient to create all documents in a single enumeration of the source reader.

I'm thinking something like this:
 * Load all term vector offsets into a Map</**document number*/ Integer, Map<Term, int[]>>.
 * Create a Document[] with all documents from the source reader.
 * Enumerate all terms and document positions and fill up some sort of token stream factory per field and document: Map</**doc*/ Integer, Map</**field*/ String, Map</**pos*/ Integer, List<Token>>>>. It would be really nice if Tokens that are equal (text, offsets, payload, etc.) were reused, but the cost of the equality check should probably be benchmarked first.
 * Add all documents to an InstantiatedIndexWriter.
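
The regrouping step in the list above can be sketched roughly as follows. This is a minimal illustration in plain Java, not Lucene code: Posting and Uninverter are hypothetical stand-ins for what a single pass over the terms and positions would produce, plain strings stand in for Token objects, and offsets, payloads and Token reuse are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for one (field, term, document, position) entry read from the inverted index. */
final class Posting {
    final String field;
    final String text;
    final int doc;
    final int position;

    Posting(String field, String text, int doc, int position) {
        this.field = field;
        this.text = text;
        this.doc = doc;
        this.position = position;
    }
}

final class Uninverter {
    /**
     * Regroups postings, which arrive in term order, into
     * doc -> field -> position -> token texts at that position.
     * TreeMap keeps documents and positions sorted, so each field can be replayed in order.
     */
    static Map<Integer, Map<String, Map<Integer, List<String>>>> uninvert(Iterable<Posting> postings) {
        Map<Integer, Map<String, Map<Integer, List<String>>>> byDoc = new TreeMap<>();
        for (Posting p : postings) {
            byDoc.computeIfAbsent(p.doc, d -> new TreeMap<>())
                 .computeIfAbsent(p.field, f -> new TreeMap<>())
                 .computeIfAbsent(p.position, pos -> new ArrayList<>())
                 .add(p.text);
        }
        return byDoc;
    }

    /** Flattens one field of one document into token texts in position order. */
    static List<String> tokensInOrder(Map<Integer, List<String>> byPosition) {
        List<String> out = new ArrayList<>();
        for (List<String> atPosition : byPosition.values()) {
            out.addAll(atPosition);
        }
        return out;
    }
}
```

Given postings enumerated in term order (brown, fox, quick) for a document whose body field was "quick brown fox", tokensInOrder returns quick, brown, fox for that field. The nested maps hold the whole source index in memory at once, which is the overhead that would need benchmarking.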


[jira] Assigned: (LUCENE-1317) Add InstantiatedIndexWriter.addIndexes(IndexReader[] readers)

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin reassigned LUCENE-1317:
-----------------------------------

    Assignee: Karl Wettin
