Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/12/03 12:58:08 UTC

[GitHub] [lucene] wjp719 opened a new pull request, #11995: enable fully directly copy merge/flush fdt files when index sorting

wjp719 opened a new pull request, #11995:
URL: https://github.com/apache/lucene/pull/11995

   When index sorting is enabled, fdt files need to be decompressed and recompressed according to the new doc ID order. This PR adds a doc ID offset index, so that the original fdt data is copied as-is into the new fdt file and only the doc offset index has to be maintained in the new doc ID order. This works for both flush and merge.
   
   This PR has two benefits:
   1. Currently, with index sorting, all the original uncompressed data must be written to a temp file before flush and then read back at flush time. With this PR, the final fdt file can be written before flush, and only the doc offset index is written at flush time. This reduces IO throughput by 30% in our log scenario.
   2. It improves doc indexing performance by 30% in our log scenario.
   
   The additional overhead is the storage for the new doc offset index files, 1% in our log scenario.
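
   The idea described above can be illustrated with a minimal, Lucene-independent sketch. It is not the code from this PR: the class name `DocOffsetIndexSketch`, the `newToOld` sort map, and the in-memory byte buffer are all assumptions made for illustration. Document bytes stay in arrival order and are never rewritten; only the per-document offset index is permuted into the new doc ID order.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not Lucene's stored fields format.
public final class DocOffsetIndexSketch {
  private final byte[] data;   // documents concatenated in arrival order, never reordered
  private final int[] start;   // start[newDocId]  -> offset of that document in data
  private final int[] length;  // length[newDocId] -> byte length of that document

  /**
   * @param docsInArrivalOrder documents indexed by their original (pre-sort) doc ID
   * @param newToOld           newToOld[newDocId] = original doc ID, i.e. the sort map
   */
  public DocOffsetIndexSketch(String[] docsInArrivalOrder, int[] newToOld) {
    int n = docsInArrivalOrder.length;
    int[] oldStart = new int[n];
    int[] oldLength = new int[n];
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // Write the data once, in arrival order (stands in for the fdt file).
    for (int oldId = 0; oldId < n; oldId++) {
      byte[] bytes = docsInArrivalOrder[oldId].getBytes(StandardCharsets.UTF_8);
      oldStart[oldId] = out.size();
      oldLength[oldId] = bytes.length;
      out.write(bytes, 0, bytes.length);
    }
    data = out.toByteArray();
    // Only the offset index is rearranged into the new doc ID order.
    start = new int[n];
    length = new int[n];
    for (int newId = 0; newId < n; newId++) {
      start[newId] = oldStart[newToOld[newId]];
      length[newId] = oldLength[newToOld[newId]];
    }
  }

  /** Looks up a document by its post-sort doc ID via the offset index. */
  public String document(int newDocId) {
    return new String(data, start[newDocId], length[newDocId], StandardCharsets.UTF_8);
  }

  /** Returns the raw data, which remains in arrival order. */
  public String rawData() {
    return new String(data, StandardCharsets.UTF_8);
  }
}
```

   In this sketch, reading documents in new doc ID order jumps around in the data buffer, since consecutive doc IDs are no longer stored next to each other.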
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] wjp719 commented on pull request #11995: enable fully directly copy merge/flush fdt files when index sorting

Posted by GitBox <gi...@apache.org>.
wjp719 commented on PR #11995:
URL: https://github.com/apache/lucene/pull/11995#issuecomment-1340308711

   > It's probably a good trade-off in your case and maybe something you can do in a custom codec
   
   Thanks for your reply, does that mean I can add a new custom codec in Lucene?




[GitHub] [lucene] wjp719 commented on pull request #11995: enable fully directly copy merge/flush fdt files when index sorting

Posted by GitBox <gi...@apache.org>.
wjp719 commented on PR #11995:
URL: https://github.com/apache/lucene/pull/11995#issuecomment-1336157571

   @jpountz Hi, could you help review this PR? Thanks a lot.




[GitHub] [lucene] wjp719 closed pull request #11995: enable fully directly copy merge/flush fdt files when index sorting

Posted by GitBox <gi...@apache.org>.
wjp719 closed pull request #11995: enable fully directly copy merge/flush fdt files when index sorting  
URL: https://github.com/apache/lucene/pull/11995




[GitHub] [lucene] jpountz commented on pull request #11995: enable fully directly copy merge/flush fdt files when index sorting

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #11995:
URL: https://github.com/apache/lucene/pull/11995#issuecomment-1339962667

   Thanks for the explanation of what this PR does. I'm not comfortable with the fact that with your change, stored fields are no longer stored in doc ID order. It's probably a good trade-off in your case and maybe something you can do in a custom codec, but I don't like doing it in the default codec as it would also mean that users can no longer leverage index sorting to improve data locality within stored fields, and that feeding this stored fields reader into another writer for a merge would trigger lots of random access.
   
   While it certainly wouldn't result in a speedup that is as good, I'll point out that using a merge policy that only merges adjacent segments, like LogByteSizeMergePolicy, should help better take advantage of https://github.com/apache/lucene/pull/134 and get slightly better merging performance when an index sort on timestamp is configured.
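
   A configuration along the lines suggested above might look like the following sketch. It assumes lucene-core 9.x on the classpath and a field named `timestamp` (a hypothetical name) that is indexed as a numeric doc values field; the class name `SortedLogIndexConfig` is also made up for illustration.

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// Hypothetical helper; not part of Lucene or of this PR.
public final class SortedLogIndexConfig {
  public static IndexWriterConfig create() {
    // The no-arg constructor uses StandardAnalyzer.
    IndexWriterConfig config = new IndexWriterConfig();
    // Index sort on the (assumed) "timestamp" field; the field must be
    // added to each document as a NumericDocValuesField for this to work.
    config.setIndexSort(new Sort(new SortField("timestamp", SortField.Type.LONG)));
    // Log merge policies only merge adjacent segments.
    config.setMergePolicy(new LogByteSizeMergePolicy());
    return config;
  }
}
```

   Whether this helps in a given workload depends on how documents arrive; it is most likely to pay off when timestamps are mostly increasing across segments.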




[GitHub] [lucene] jpountz commented on pull request #11995: draft pr

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #11995:
URL: https://github.com/apache/lucene/pull/11995#issuecomment-1340620512

   > Thanks for your reply, does that mean I can add a new custom codec in Lucene?
   
   I was more thinking you could maintain a codec with your own stored fields format in your own codebase and take advantage of the fact that codecs are pluggable in Lucene.
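
   A rough sketch of what such a pluggable codec could look like is shown below. It assumes lucene-core 9.4.x (hence `Lucene94Codec` as the delegate; other versions ship a differently named default codec). `DocOffsetCodec` and `DocOffsetStoredFieldsFormat` are hypothetical names, and the reader and writer are left as placeholders rather than a real implementation.

```java
import java.io.IOException;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.StoredFieldsReader;
import org.apache.lucene.codecs.StoredFieldsWriter;
import org.apache.lucene.codecs.lucene94.Lucene94Codec;
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.index.SegmentInfo;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;

// Hypothetical codec maintained outside of Lucene's codebase.
public final class DocOffsetCodec extends FilterCodec {

  // A public no-arg constructor is needed so the codec can be loaded by name via SPI.
  public DocOffsetCodec() {
    // Delegate every format except stored fields to a concrete codec
    // (assumption: Lucene 9.4.x).
    super("DocOffsetCodec", new Lucene94Codec());
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    return new DocOffsetStoredFieldsFormat();
  }

  // Placeholder format: a real implementation would return its own reader and writer.
  static final class DocOffsetStoredFieldsFormat extends StoredFieldsFormat {
    @Override
    public StoredFieldsReader fieldsReader(
        Directory directory, SegmentInfo si, FieldInfos fn, IOContext context)
        throws IOException {
      throw new UnsupportedOperationException("placeholder: custom reader goes here");
    }

    @Override
    public StoredFieldsWriter fieldsWriter(Directory directory, SegmentInfo si, IOContext context)
        throws IOException {
      throw new UnsupportedOperationException("placeholder: custom writer goes here");
    }
  }
}
```

   To be usable at read time, the codec class would also need to be listed in a `META-INF/services/org.apache.lucene.codecs.Codec` file, and it would be selected for writing with `IndexWriterConfig.setCodec(new DocOffsetCodec())`.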





