Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2008/05/10 12:35:55 UTC

[jira] Created: (LUCENE-1283) Factor out ByteSliceWriter from DocumentsWriterFieldData

Factor out ByteSliceWriter from DocumentsWriterFieldData
--------------------------------------------------------

                 Key: LUCENE-1283
                 URL: https://issues.apache.org/jira/browse/LUCENE-1283
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
    Affects Versions: 2.3.1, 2.3
            Reporter: Michael McCandless
            Assignee: Michael McCandless
            Priority: Minor
             Fix For: 2.4
         Attachments: LUCENE-1283.patch

DocumentsWriter uses byte slices into shared byte[]'s to hold the
growing postings data for many different terms in memory.  This is
probably the trickiest (most confusing) part of DocumentsWriter.

Right now it's not cleanly factored out and not easy to separately
test.  In working on this issue:

  http://mail-archives.apache.org/mod_mbox/lucene-java-user/200805.mbox/%3c126142c0805061426n1168421ya5594ef854fae5e4@mail.gmail.com%3e

which eventually turned out to be a bug in Sun JRE's JIT compiler,
I factored out ByteSliceWriter and created a unit test to stress test
the writing & reading of byte slices.  The test just randomly writes N
streams interleaved into shared byte[]'s, then reads them back
verifying the results are correct.
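
For readers unfamiliar with the idea, here is a minimal sketch of
interleaved byte slices and a round-trip stress test in the spirit of
the one described above.  It is not the code from the patch: the class
name ByteSliceSketch, the slice sizes, the single growable byte[] and
the layout (last 4 bytes of each slice reserved for a forwarding
address) are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.Random;

// Simplified illustration only; names and slice sizes are hypothetical, not Lucene's.
class ByteSliceSketch {
    // Total size of a slice at each level; the last 4 bytes hold a forwarding address.
    static final int[] LEVEL_SIZES = {9, 18, 36, 72, 144};

    byte[] pool = new byte[64]; // shared buffer holding every stream's slices
    int used = 0;               // next free offset in the pool

    // Reserve a slice of the given level and return its start offset.
    int newSlice(int level) {
        int size = LEVEL_SIZES[level];
        if (used + size > pool.length) {
            pool = Arrays.copyOf(pool, Math.max(pool.length * 2, used + size));
        }
        int start = used;
        used += size;
        return start;
    }

    void writeInt(int pos, int v) {
        pool[pos] = (byte) (v >>> 24);
        pool[pos + 1] = (byte) (v >>> 16);
        pool[pos + 2] = (byte) (v >>> 8);
        pool[pos + 3] = (byte) v;
    }

    int readInt(int pos) {
        return ((pool[pos] & 0xFF) << 24) | ((pool[pos + 1] & 0xFF) << 16)
             | ((pool[pos + 2] & 0xFF) << 8) | (pool[pos + 3] & 0xFF);
    }

    // Appends bytes to one logical stream, chaining to a larger slice when full.
    final class Writer {
        final int start = newSlice(0);
        int upto = start;
        int limit = start + LEVEL_SIZES[0] - 4;
        int level = 0;
        int length = 0;

        void write(byte b) {
            if (upto == limit) { // slice is full: allocate the next one and link to it
                level = Math.min(level + 1, LEVEL_SIZES.length - 1);
                int next = newSlice(level);
                writeInt(limit, next);
                upto = next;
                limit = next + LEVEL_SIZES[level] - 4;
            }
            pool[upto++] = b;
            length++;
        }
    }

    // Reads one logical stream back by following the forwarding addresses.
    final class Reader {
        int upto, limit, level, remaining;

        Reader(int start, int length) {
            upto = start;
            limit = start + LEVEL_SIZES[0] - 4;
            remaining = length;
        }

        boolean eof() { return remaining == 0; }

        byte read() {
            if (upto == limit) {
                level = Math.min(level + 1, LEVEL_SIZES.length - 1);
                upto = readInt(limit);
                limit = upto + LEVEL_SIZES[level] - 4;
            }
            remaining--;
            return pool[upto++];
        }
    }

    // Randomly interleave writes to numStreams streams, then verify each stream.
    static boolean stressTest(int numStreams, int numWrites, long seed) {
        Random random = new Random(seed);
        ByteSliceSketch slices = new ByteSliceSketch();
        Writer[] writers = new Writer[numStreams];
        ByteArrayOutputStream[] expected = new ByteArrayOutputStream[numStreams];
        for (int i = 0; i < numStreams; i++) {
            writers[i] = slices.new Writer();
            expected[i] = new ByteArrayOutputStream();
        }
        for (int n = 0; n < numWrites; n++) {
            int s = random.nextInt(numStreams);
            byte b = (byte) random.nextInt(256);
            writers[s].write(b);
            expected[s].write(b);
        }
        for (int s = 0; s < numStreams; s++) {
            Reader r = slices.new Reader(writers[s].start, writers[s].length);
            for (byte want : expected[s].toByteArray()) {
                if (r.eof() || r.read() != want) return false;
            }
            if (!r.eof()) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        if (!stressTest(50, 100000, 17L)) throw new AssertionError("round trip mismatch");
        System.out.println("all streams read back correctly");
    }
}
```

Keeping an independent copy of each stream (here a
ByteArrayOutputStream) is what lets the test catch slices that
overwrite each other or a forwarding address that points to the wrong
place.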

I created the stress test to try to find any bugs in that code.  The
test ran fine (no bugs were found) but I think the refactoring is
still very much worthwhile.

I expected the changes to reduce indexing throughput, so I ran a test
indexing the first 200K Wikipedia docs using this alg:

{code}
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker

docs.file=/Volumes/External/lucene/wiki.txt
doc.stored = true
doc.term.vector = true
doc.add.log.step=2000

directory=FSDirectory
autocommit=false
compound=true

ram.flush.mb=256

{ "Rounds"
  ResetSystemErase
  { "BuildIndex"
    - CreateIndex
     { "AddDocs" AddDoc > : 200000
    - CloseIndex
  }
  NewRound
} : 4

RepSumByPrefRound BuildIndex

{code}

On trunk it produces these results:
{code}
Operation   round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
BuildIndex      0        1       200000        791.7      252.63   338,552,096  1,061,814,272
BuildIndex -  - 1 -  -   1 -  -  200000 -  -   793.1 -  - 252.18 - 605,262,080  1,061,814,272
BuildIndex      2        1       200000        794.8      251.63   601,966,528  1,061,814,272
BuildIndex -  - 3 -  -   1 -  -  200000 -  -   782.5 -  - 255.58 - 608,699,712  1,061,814,272
{code}

and with the patch:

{code}
Operation   round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
BuildIndex      0        1       200000        745.0      268.47   338,318,784  1,061,814,272
BuildIndex -  - 1 -  -   1 -  -  200000 -  -   792.7 -  - 252.30 - 605,331,776  1,061,814,272
BuildIndex      2        1       200000        786.7      254.24   602,915,712  1,061,814,272
BuildIndex -  - 3 -  -   1 -  -  200000 -  -   795.3 -  - 251.48 - 602,378,624  1,061,814,272
{code}

So it looks like the performance cost of this change is negligible (in
the noise).
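
As a rough check of that reading, the rec/s columns of the two tables
can be averaged.  The sketch below is only a reader's calculation over
the numbers reported above; the class and method names are
hypothetical, and four rounds per configuration is too few for any
real statistical claim.

```java
// Averages the rec/s values copied from the two result tables above.
class ThroughputCompare {
    static final double[] TRUNK = {791.7, 793.1, 794.8, 782.5};
    static final double[] PATCH = {745.0, 792.7, 786.7, 795.3};

    // Mean of xs[from..], so that the first round can optionally be skipped.
    static double mean(double[] xs, int from) {
        double sum = 0;
        for (int i = from; i < xs.length; i++) {
            sum += xs[i];
        }
        return sum / (xs.length - from);
    }

    public static void main(String[] args) {
        System.out.printf("all rounds:  trunk=%.1f patch=%.1f rec/s%n",
                mean(TRUNK, 0), mean(PATCH, 0));
        System.out.printf("rounds 1..3: trunk=%.1f patch=%.1f rec/s%n",
                mean(TRUNK, 1), mean(PATCH, 1));
    }
}
```

Over all four rounds the means are about 790.5 (trunk) versus 779.9
(patch) rec/s, a gap of roughly 1.3% that comes almost entirely from
round 0 of the patched run.  Over rounds 1 to 3 they are about 790.1
versus 791.6 rec/s.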



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Resolved: (LUCENE-1283) Factor out ByteSliceWriter from DocumentsWriterFieldData

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1283.
----------------------------------------

    Resolution: Fixed



[jira] Updated: (LUCENE-1283) Factor out ByteSliceWriter from DocumentsWriterFieldData

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1283:
---------------------------------------

    Attachment: LUCENE-1283.patch

Attached patch.  I plan to commit in a day or two.
