Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2008/01/07 15:53:34 UTC

[jira] Created: (LUCENE-1120) Use bulk-byte-copy when merging term vectors

Use bulk-byte-copy when merging term vectors
--------------------------------------------

                 Key: LUCENE-1120
                 URL: https://issues.apache.org/jira/browse/LUCENE-1120
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
            Reporter: Michael McCandless
            Assignee: Michael McCandless
            Priority: Minor


Indexing all of Wikipedia, with term vectors on, under the YourKit
profiler, shows that 26% of the time (!!) was spent merging the
vectors.  This was without offsets & positions, which would make
matters even worse.

Depressingly, merging, even with ConcurrentMergeScheduler, cannot in
fact keep up with the flushing of new segments in this test, and this
is on a strong IO system (Mac Pro with 4 drive RAID 0 array, 4 CPU
cores).

So, just like Robert's idea to merge stored fields with bulk copying
whenever the field name->number mapping is "congruent" (LUCENE-1043),
we can do the same with term vectors.
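The congruence check this relies on can be sketched as follows. This is an illustrative stand-alone sketch, not Lucene's actual code (class and method names are hypothetical): bulk copy is only safe when every field in the incoming segment already maps to the same field number the merged segment will use.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not Lucene's actual API): field i in the incoming
// segment carries number i; the mapping is "congruent" only if the merged
// index assigns that same number to the same field name.
public class FieldNumberCongruence {
    public static boolean isCongruent(Map<String, Integer> mergedFieldNumbers,
                                      List<String> segmentFields) {
        for (int i = 0; i < segmentFields.size(); i++) {
            Integer mergedNumber = mergedFieldNumbers.get(segmentFields.get(i));
            if (mergedNumber == null || mergedNumber != i) {
                return false;  // must decode and re-encode instead of raw copy
            }
        }
        return true;  // safe to copy the segment's bytes verbatim
    }
}
```

When the check fails, merging falls back to the existing field-by-field path.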

It's a little trickier for term vectors because the format doesn't lend
itself to bulk copying: it doesn't directly encode the offset into the
tvf file.

I worked out a patch that changes the tvx format slightly, by storing
the absolute position in the tvf file for the start of each document
into the tvx file, just like it does for tvd now.  This adds an extra
8 bytes (long) in the tvx file, per document.

Then, I removed a vLong (the first "position" stored inside the tvd
file), which makes tvd contents fully position independent (so you can
just copy the bytes).
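Once the tvd/tvf contents are position independent and tvx holds absolute start positions, the merge reduces to an append plus a pointer rebase. A minimal stand-alone sketch (illustrative names; byte arrays stand in for the index files, this is not Lucene's merge code):

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch: append each segment's raw tvd (or tvf) bytes to the
// merged file and rebase its tvx start pointers by the offset where those
// bytes landed. No per-document decoding is needed.
public class TermVectorMergeSketch {
    public static long[] appendSegment(ByteArrayOutputStream mergedFile,
                                       byte[] segmentBytes, long[] segmentStarts) {
        long base = mergedFile.size();                          // where this segment begins
        mergedFile.write(segmentBytes, 0, segmentBytes.length); // bulk byte copy
        long[] rebased = new long[segmentStarts.length];
        for (int i = 0; i < segmentStarts.length; i++) {
            rebased[i] = segmentStarts[i] + base;               // rewritten tvx entries
        }
        return rebased;
    }
}
```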

This nets out to about 7 bytes per document with term vectors enabled
(the new 8-byte long minus the removed vLong, so less for larger
indices), but I think this small increase in index size is acceptable
for the gains in indexing performance?
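The arithmetic behind the ~7 bytes: tvx gains a fixed 8-byte long per document, while the removed tvd vLong took at least 1 byte but grows with the magnitude of the position it encoded, which is why the net cost shrinks for larger indices. A sketch of vLong sizing (7 payload bits per byte plus a continuation bit, matching Lucene's variable-length encoding):

```java
// Sketch: size in bytes of a vLong, which stores 7 bits per byte with a
// continuation bit. The net tvx overhead per document is 8 minus this.
public class VLongSize {
    public static int sizeInBytes(long v) {
        int bytes = 1;
        while ((v & ~0x7FL) != 0) {  // more than 7 bits remain
            v >>>= 7;
            bytes++;
        }
        return bytes;
    }
}
```

A position under 128 costs 1 byte (net +7), while a multi-gigabyte tvf position costs 5 or 6 bytes (net +2 or +3).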

With this change, the time spent merging term vectors dropped from 26%
to 3%.  Of course, this only applies if your documents are "regular".
I think in the future we could have Lucene try hard to assign the same
field number for a given field name, if it had been seen before in the
index...

Merging terms now dominates the merge cost (~20% of overall time
building the Wikipedia index).

I also beefed up the TestBackwardsCompatibility unit test: it now tests
a non-CFS and a CFS index for each of the 1.9, 2.0, 2.1, and 2.2 index
formats, and I added some term vector fields to these indices.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1120) Use bulk-byte-copy when merging term vectors

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556590#action_12556590 ] 

Michael Busch commented on LUCENE-1120:
---------------------------------------

{quote}
Indexing all of Wikipedia, with term vectors on, under the YourKit
profiler, shows that 26% of the time (!!) was spent merging the
vectors. 
{quote}

I wonder how accurate these profiling numbers are? Java profiling slows
down (non-native) method calls, but not I/O operations. Did you measure
the merge time before and after applying this patch without a profiling
tool?

-Michael



[jira] Updated: (LUCENE-1120) Use bulk-byte-copy when merging term vectors

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1120:
---------------------------------------

    Attachment: LUCENE-1120.patch

Attached patch updated to current trunk.  All tests pass.  I plan to
commit after 2.3 is out...

OK I ran a performance test with this patch, indexing the first 200K
docs of Wikipedia using this alg:

  analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
  
  doc.stored = true
  doc.term.vector = true
  doc.term.vector.positions = true
  doc.term.vector.offsets = true
  
  docs.file=/Volumes/External/lucene/wiki.txt
  
  directory=FSDirectory
  
  merge.scheduler=org.apache.lucene.index.SerialMergeScheduler
  
  { "Rounds"
    ResetSystemErase
    { "BuildIndex"
      CreateIndex
      { "AddDocs" AddDoc > : 200000
      CloseIndex
    }
    NewRound
  } : 3
  
  RepSumByPrefRound BuildIndex
  
I used SerialMergeScheduler so that I could measure time saved due to
faster merging.

Without the patch, best of 3 was 509.0 sec; with patch, best of 3 was
448.8 sec = 11.8% overall speedup.




[jira] Commented: (LUCENE-1120) Use bulk-byte-copy when merging term vectors

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556593#action_12556593 ] 

Michael McCandless commented on LUCENE-1120:
--------------------------------------------

{quote}
I wonder how accurate these profiling numbers are? Java profiling slows
down (non-native) method calls, but not I/O operations. Did you measure
the merge time before and after applying this patch without a profiling
tool?
{quote}
Good point -- I haven't measured outside of profiling.  I plan to build a full Wiki index with and without this change to test....



[jira] Resolved: (LUCENE-1120) Use bulk-byte-copy when merging term vectors

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1120.
----------------------------------------

       Resolution: Fixed
    Fix Version/s: 2.4

I just committed this.  Note that this is a [small] change to the index format, so if you use trunk to build an index, 2.3 won't be able to read it!



[jira] Updated: (LUCENE-1120) Use bulk-byte-copy when merging term vectors

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1120:
---------------------------------------

    Attachment: LUCENE-1120.patch

Attached patch.  All tests pass.

(Note that the TestBackwardsCompatibility test will fail if you apply the patch because the new *.zip files I added aren't in the patch).

I think we should commit this for 2.3?  It's a sizable gain in merging
performance.




[jira] Updated: (LUCENE-1120) Use bulk-byte-copy when merging term vectors

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1120:
---------------------------------------

    Attachment: LUCENE-1120.take2.patch

Attaching the right patch this time...



[jira] Commented: (LUCENE-1120) Use bulk-byte-copy when merging term vectors

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556586#action_12556586 ] 

Michael Busch commented on LUCENE-1120:
---------------------------------------

{quote}
I think we should commit this for 2.3? It's a sizable gain in merging
{quote}

Hi Mike,

I'm planning to branch the trunk today. Considering the file format changes, it might be a bit risky to apply this patch last minute. I think we should commit this for 2.4. What do you think?

-Michael



[jira] Commented: (LUCENE-1120) Use bulk-byte-copy when merging term vectors

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556589#action_12556589 ] 

Michael McCandless commented on LUCENE-1120:
--------------------------------------------

{quote}
Considering the file format changes, it might be a bit risky to apply this patch last minute. I think we should commit this for 2.4. What do you think?
{quote}

OK, I agree, it is somewhat risky, so let's wait.  (Though it is a sizable gain in performance!).



[jira] Updated: (LUCENE-1120) Use bulk-byte-copy when merging term vectors

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1120:
---------------------------------------

    Attachment:     (was: LUCENE-1120.patch)
