Posted to dev@lucene.apache.org by "Michael Klatt (JIRA)" <ji...@apache.org> on 2007/10/26 00:17:53 UTC

[jira] Created: (LUCENE-1034) Add new API method to retrieve document field data in a batch

Add new API method to retrieve document field data in a batch
-------------------------------------------------------------

                 Key: LUCENE-1034
                 URL: https://issues.apache.org/jira/browse/LUCENE-1034
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Search
    Affects Versions: 2.2
         Environment: JDK 1.5.X, Linux & FreeBSD
            Reporter: Michael Klatt
            Priority: Minor
         Attachments: FieldsReader.java.patch, IndexReader.java.patch, MultiReader.java.patch, SegmentReader.java.patch

I've read in many forums about people who need to retrieve document data for a large number of search results. In our case, we need to retrieve up to 10,000 results (sometimes more) from an index of over 100 million documents (our index is about 65 GB). This can sometimes take a couple of minutes.

In one of my attempts to improve performance, I modified the IndexReader interface to provide a method which looks like:

public Document[] documents(int[] n, FieldSelector fieldSelector);

Instead of retrieving document data one at a time, I would request data for many document numbers in one shot. The idea was to optimize the disk seeks: in the FieldsReader, all the seeks in the indexStream are done first, and then all the seeks in the fieldStream. For a large number of documents, this yielded a 20% speed improvement. That is less than I was hoping for, but I felt it was significant enough to request changes to the IndexReader interface.
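
To illustrate the two-phase access pattern described above, here is a minimal, self-contained sketch. It is not the code from the attached patches and does not use any Lucene classes: BatchFetchSketch, its in-memory "index stream" and "field stream" buffers, and the length-prefixed record layout are all assumptions made up for illustration. The only point is the ordering: every offset lookup is completed before any stored data is read.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-in for a stored-fields reader; not Lucene code. */
final class BatchFetchSketch {

    /** Plays the role of the index stream: one 8-byte offset per document. */
    private final ByteBuffer indexStream;
    /** Plays the role of the field stream: length-prefixed UTF-8 payloads. */
    private final ByteBuffer fieldStream;

    BatchFetchSketch(String[] storedValues) {
        byte[][] encoded = new byte[storedValues.length][];
        int total = 0;
        for (int i = 0; i < storedValues.length; i++) {
            encoded[i] = storedValues[i].getBytes(StandardCharsets.UTF_8);
            total += 4 + encoded[i].length;
        }
        indexStream = ByteBuffer.allocate(8 * storedValues.length);
        fieldStream = ByteBuffer.allocate(total);
        for (byte[] value : encoded) {
            // Record where this document's data starts, then write the data.
            indexStream.putLong(fieldStream.position());
            fieldStream.putInt(value.length);
            fieldStream.put(value);
        }
    }

    /** Batch lookup: all index-stream reads first, then all field-stream reads. */
    String[] documents(int[] docNumbers) {
        // Phase 1: resolve every document number to an offset.
        long[] offsets = new long[docNumbers.length];
        for (int i = 0; i < docNumbers.length; i++) {
            offsets[i] = indexStream.getLong(8 * docNumbers[i]);
        }
        // Phase 2: read the stored data at each offset.
        String[] result = new String[docNumbers.length];
        for (int i = 0; i < docNumbers.length; i++) {
            int pos = (int) offsets[i];
            int length = fieldStream.getInt(pos);
            byte[] data = new byte[length];
            for (int j = 0; j < length; j++) {
                data[j] = fieldStream.get(pos + 4 + j);
            }
            result[i] = new String(data, StandardCharsets.UTF_8);
        }
        return result;
    }

    public static void main(String[] args) {
        BatchFetchSketch store = new BatchFetchSketch(new String[] {"alpha", "beta", "gamma"});
        // prints [gamma, alpha]
        System.out.println(Arrays.toString(store.documents(new int[] {2, 0})));
    }
}
```

With real files, the benefit would come from the two streams no longer being accessed alternately; the in-memory buffers here only show the ordering, not the speedup.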

I'm providing patches for the files that I needed to change for our application. These patches are against the 2.2 release.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1034) Add new API method to retrieve document field data in a batch

Posted by "Michael Klatt (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Klatt updated LUCENE-1034:
----------------------------------

    Attachment: SegmentReader.java.patch



[jira] Commented: (LUCENE-1034) Add new API method to retrieve document field data in a batch

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537917 ] 

Grant Ingersoll commented on LUCENE-1034:
-----------------------------------------

Sounds like a reasonable idea.  In order to get this reviewed, please provide a single patch against trunk.



[jira] Updated: (LUCENE-1034) Add new API method to retrieve document field data in a batch

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1034:
--------------------------------

    Attachment: LUCENE-1034.patch

I've combined the patches into a single file. I like the idea, but right now I don't like the amount of code duplication. Arguably, this could also be moved to the Searcher family, but we could probably live without that. It also still needs a test, but I've lost interest unless the code duplication can be resolved while maintaining the speed gain.



[jira] Updated: (LUCENE-1034) Add new API method to retrieve document field data in a batch

Posted by "Michael Klatt (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Klatt updated LUCENE-1034:
----------------------------------

    Attachment: FieldsReader.java.patch



[jira] Updated: (LUCENE-1034) Add new API method to retrieve document field data in a batch

Posted by "Michael Klatt (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Klatt updated LUCENE-1034:
----------------------------------

    Attachment: IndexReader.java.patch



[jira] Updated: (LUCENE-1034) Add new API method to retrieve document field data in a batch

Posted by "Michael Klatt (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Klatt updated LUCENE-1034:
----------------------------------

    Attachment: MultiReader.java.patch
