Posted to dev@lucene.apache.org by "Michael Busch (JIRA)" <ji...@apache.org> on 2006/07/17 23:49:14 UTC

[jira] Created: (LUCENE-629) Performance improvement for merging stored, compressed fields

Performance improvement for merging stored, compressed fields
-------------------------------------------------------------

                 Key: LUCENE-629
                 URL: http://issues.apache.org/jira/browse/LUCENE-629
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
            Reporter: Michael Busch
            Priority: Minor


Hello everyone,

Currently, merging stored, compressed fields is not optimal: every time a stored, compressed field is merged, the FieldsReader uncompresses the data, so the FieldsWriter has to compress it again when it writes the merged fields data (.fdt) file. This uncompress/compress step is unnecessary and slows down merging significantly.

This patch improves merge performance by avoiding the uncompress/compress step. Here is an overview of the changes I made:
   * Added a new FieldSelectorResult constant named "LOAD_FOR_MERGE" to org.apache.lucene.document.FieldSelectorResult.
   * SegmentMerger now uses a FieldSelector to get stored fields from the FieldsReader. This FieldSelector's accept() method returns the FieldSelectorResult "LOAD_FOR_MERGE" for every field.
   * Added a new inner class to FieldsReader named "FieldForMerge", which extends org.apache.lucene.document.AbstractField. This class holds the field's properties and its data. If a field has the FieldSelectorResult "LOAD_FOR_MERGE", the FieldsReader creates an instance of "FieldForMerge" and does not uncompress the field's data.
   * FieldsWriter checks whether the field it is about to write is an instanceof FieldsReader.FieldForMerge. If so, it does not compress the field data.
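
The changes above can be sketched roughly as follows. This is a simplified model and not the code from the patch: MergeSketch, Load, StoredField, read(), write(), and merge() are hypothetical stand-ins for the reader, the selector result, FieldForMerge, and the writer, and java.util.zip is assumed here purely for illustration of the compression step.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Simplified model of the merge path; all names here are hypothetical. */
final class MergeSketch {

    /** Stand-in for the FieldSelectorResult constants relevant here. */
    enum Load { LOAD, LOAD_FOR_MERGE }

    /** Stand-in for a stored field handed from the reader to the writer. */
    static final class StoredField {
        final byte[] bytes;      // uncompressed for LOAD, still compressed for LOAD_FOR_MERGE
        final boolean forMerge;  // plays the role of "instanceof FieldForMerge"

        StoredField(byte[] bytes, boolean forMerge) {
            this.bytes = bytes;
            this.forMerge = forMerge;
        }
    }

    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] uncompress(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and no more input means the data is truncated or invalid.
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or invalid compressed data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Reader side: skip the uncompress step when loading for a merge. */
    static StoredField read(byte[] onDisk, Load how) {
        return how == Load.LOAD_FOR_MERGE
                ? new StoredField(onDisk, true)
                : new StoredField(uncompress(onDisk), false);
    }

    /** Writer side: skip the compress step for fields loaded for a merge. */
    static byte[] write(StoredField field) {
        return field.forMerge ? field.bytes : compress(field.bytes);
    }

    /** One field passing through a merge: read from the old segment, write to the new one. */
    static byte[] merge(byte[] onDisk, Load how) {
        return write(read(onDisk, how));
    }

    public static void main(String[] args) {
        byte[] onDisk = compress("some stored text, some stored text".getBytes(StandardCharsets.UTF_8));
        byte[] oldWay = merge(onDisk, Load.LOAD);            // uncompress, then compress again
        byte[] newWay = merge(onDisk, Load.LOAD_FOR_MERGE);  // bytes passed through untouched
        System.out.println("same bytes written: " + Arrays.equals(oldWay, newWay));
    }
}
```

In this model both paths write the same bytes; the LOAD_FOR_MERGE path just never calls uncompress() or compress().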


To test the performance, I indexed about 350,000 text files, storing the raw text in a stored, compressed field in the Lucene index, with a merge factor of 10. The final index has a size of 366 MB. After building the index, I optimized it to measure the pure merge performance.

Here are the performance results:

old version:
   * Time for Indexing:  36.7 minutes
   * Time for Optimizing: 4.6 minutes

patched version:
   * Time for Indexing:  20.8 minutes
   * Time for Optimizing: 0.5 minutes

The results show that the index build time improved by about 43%, and the optimizing step is more than 8x faster. 
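
For reference, the two figures follow directly from the timings listed above. A minimal sketch of the arithmetic (MergeTimings and its method names are made up for this illustration):

```java
import java.util.Locale;

/** Recomputes the improvement figures from the reported timings (in minutes). */
final class MergeTimings {

    /** Percentage by which the elapsed time went down. */
    static double reductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    /** How many times faster the patched run is. */
    static double speedup(double before, double after) {
        return before / after;
    }

    public static void main(String[] args) {
        // Indexing: 36.7 min before, 20.8 min after.
        System.out.println(String.format(Locale.ROOT,
                "indexing time reduced by %.1f%%", reductionPercent(36.7, 20.8)));
        // Optimizing: 4.6 min before, 0.5 min after.
        System.out.println(String.format(Locale.ROOT,
                "optimizing is %.1fx faster", speedup(4.6, 0.5)));
    }
}
```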

A diff of the final indexes (old and patched version) shows that they are identical. Furthermore, all JUnit test cases succeeded with the patched version.

Regards,
  Michael Busch

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Resolved: (LUCENE-629) Performance improvement for merging stored, compressed fields

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/LUCENE-629?page=all ]

Otis Gospodnetic resolved LUCENE-629.
-------------------------------------

    Resolution: Fixed

Committed. Many thanks!



[jira] Updated: (LUCENE-629) Performance improvement for merging stored, compressed fields

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/LUCENE-629?page=all ]

Michael Busch updated LUCENE-629:
---------------------------------

    Attachment: optimized_merge_compressed_fields.patch

The patch file for this improvement (based on Lucene Rev. 419199)



[jira] Commented: (LUCENE-629) Performance improvement for merging stored, compressed fields

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-629?page=comments#action_12427733 ] 
            
Otis Gospodnetic commented on LUCENE-629:
-----------------------------------------

This looks fine to me. The patch applied after a bit of persuading, and all unit tests pass.  I'll commit this in a bit.
One question: why the "...ForMerge" names?  Doesn't this patch really address only compressed fields?  Wouldn't it make sense to name things (classes, fields, vars) to indicate that?  Something like "...DontTouchMeImAlreadyCompressed".  Just kidding, but you get the idea.




[jira] Commented: (LUCENE-629) Performance improvement for merging stored, compressed fields

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-629?page=comments#action_12421770 ] 
            
Yonik Seeley commented on LUCENE-629:
-------------------------------------

Thanks Michael, very impressive results!
It might take a little while to review the patch, but the goals and general approach certainly seem correct.

What is the impact (if any) for non-compressed fields?  



[jira] Commented: (LUCENE-629) Performance improvement for merging stored, compressed fields

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-629?page=comments#action_12421920 ] 
            
Michael Busch commented on LUCENE-629:
--------------------------------------

Yonik,

In the original version, the FieldsReader maps the boolean values "compressed", "tokenize", "binary", "storeOffsetWithTermVector", "storePositionWithTermVector", and "storeTermVector" to the parameters Field.Index, Field.Store, and Field.TermVector. When writing, the FieldsWriter maps those parameters back to the boolean values. My patched version avoids these mappings, so we save some if-statements even when merging non-compressed fields. However, the overall merge performance does not benefit significantly in the non-compressed case.

That should be the only impact for non-compressed fields. 
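
To illustrate the round trip described above, here is a minimal sketch. It is a simplified model and not the Lucene code: FlagMappingSketch and its nested types are hypothetical, the enum constants only mimic Field.Store, Field.Index, and Field.TermVector, and the "binary" flag is left out to keep it short.

```java
/** Simplified model of the flag round trip; all names here are hypothetical. */
final class FlagMappingSketch {

    enum Store { YES, COMPRESS }
    enum Index { TOKENIZED, UN_TOKENIZED }
    enum TermVector { NO, YES, WITH_POSITIONS, WITH_OFFSETS, WITH_POSITIONS_OFFSETS }

    /** The boolean values as the reader sees them. */
    static final class Flags {
        final boolean compressed, tokenize, storeTermVector, storePositions, storeOffsets;

        Flags(boolean compressed, boolean tokenize, boolean storeTermVector,
              boolean storePositions, boolean storeOffsets) {
            this.compressed = compressed;
            this.tokenize = tokenize;
            this.storeTermVector = storeTermVector;
            this.storePositions = storePositions;
            this.storeOffsets = storeOffsets;
        }

        boolean sameAs(Flags o) {
            return compressed == o.compressed && tokenize == o.tokenize
                    && storeTermVector == o.storeTermVector
                    && storePositions == o.storePositions && storeOffsets == o.storeOffsets;
        }
    }

    /** The parameter objects a field is constructed with. */
    static final class Params {
        final Store store;
        final Index index;
        final TermVector termVector;

        Params(Store store, Index index, TermVector termVector) {
            this.store = store;
            this.index = index;
            this.termVector = termVector;
        }
    }

    /** Reader side of the old path: booleans to parameters. */
    static Params toParams(Flags f) {
        Store store = f.compressed ? Store.COMPRESS : Store.YES;
        Index index = f.tokenize ? Index.TOKENIZED : Index.UN_TOKENIZED;
        TermVector tv;
        if (!f.storeTermVector) tv = TermVector.NO;
        else if (f.storePositions && f.storeOffsets) tv = TermVector.WITH_POSITIONS_OFFSETS;
        else if (f.storePositions) tv = TermVector.WITH_POSITIONS;
        else if (f.storeOffsets) tv = TermVector.WITH_OFFSETS;
        else tv = TermVector.YES;
        return new Params(store, index, tv);
    }

    /** Writer side of the old path: parameters back to booleans. */
    static Flags toFlags(Params p) {
        boolean positions = p.termVector == TermVector.WITH_POSITIONS
                || p.termVector == TermVector.WITH_POSITIONS_OFFSETS;
        boolean offsets = p.termVector == TermVector.WITH_OFFSETS
                || p.termVector == TermVector.WITH_POSITIONS_OFFSETS;
        return new Flags(p.store == Store.COMPRESS, p.index == Index.TOKENIZED,
                p.termVector != TermVector.NO, positions, offsets);
    }

    /** Old merge path: two mappings per field. */
    static Flags mergeOld(Flags f) {
        return toFlags(toParams(f));
    }

    /** Patched merge path: the booleans are carried through unchanged. */
    static Flags mergePatched(Flags f) {
        return f;
    }

    public static void main(String[] args) {
        Flags f = new Flags(false, true, true, true, false);
        System.out.println("round trip preserves flags: " + mergeOld(f).sameAs(f));
    }
}
```

Both paths end up with the same booleans; the patched one only skips the branching in between, which fits the small gain for non-compressed fields.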

Michael
