Posted to dev@lucene.apache.org by "Jason Rutherglen (JIRA)" <ji...@apache.org> on 2009/03/27 20:52:50 UTC

[jira] Created: (LUCENE-1577) Benchmark of different in RAM realtime techniques

Benchmark of different in RAM realtime techniques
-------------------------------------------------

                 Key: LUCENE-1577
                 URL: https://issues.apache.org/jira/browse/LUCENE-1577
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/*
    Affects Versions: 2.4.1
            Reporter: Jason Rutherglen
            Priority: Minor
             Fix For: 2.9


A place to post code that benchmarks the differences in the speed of indexing and searching using different realtime techniques.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1577) Benchmark of different in RAM realtime techniques

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1577:
---------------------------------------

    Fix Version/s:     (was: 2.9)

Moving out.

> Benchmark of different in RAM realtime techniques
> -------------------------------------------------
>
>                 Key: LUCENE-1577
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1577
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/*
>    Affects Versions: 2.4.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1577.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> A place to post code that benchmarks the differences in the speed of indexing and searching using different realtime techniques.


[jira] Commented: (LUCENE-1577) Benchmark of different in RAM realtime techniques

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693414#action_12693414 ] 

Michael McCandless commented on LUCENE-1577:
--------------------------------------------

Are these tests measuring adding a single doc, then searching on it?  What are the numbers you measure in the results (eg 25882 for LuceneRealtimeWriter)?

I think we need a more realistic test for near real-time search, but I'm not sure exactly what that is.

In LUCENE-1516 I've added a benchmark task to periodically open a new near real-time reader from the writer, and then tested it while doing bulk indexing.  But that's not a typical test, I think (normally bulk indexing is done up front, and only a "trickle" of updates to docs are then done for near real-time search).  Maybe we just need an updateDocument task, which randomly picks a doc (identified by a primary-key "docid" field) and replaces it.  Then, benchmark already has the ability to rate-limit how frequently docs are updated.
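
A rough sketch of the "trickle" update loop described above, in plain Java with no Lucene dependency. The class name TrickleUpdater, the replaceDoc hook, and the hand-rolled rate limiter are all hypothetical and only for illustration; in a real task the hook would presumably call IndexWriter.updateDocument with a Term on the "docid" field, and contrib/benchmark's own rate limiting would replace the loop here.

```java
import java.util.Random;
import java.util.concurrent.locks.LockSupport;
import java.util.function.IntConsumer;

class TrickleUpdater {
    private final int numDocs;
    private final Random random;
    // Hypothetical hook: replaces the doc whose primary key is the given docid.
    private final IntConsumer replaceDoc;

    TrickleUpdater(int numDocs, long seed, IntConsumer replaceDoc) {
        this.numDocs = numDocs;
        this.random = new Random(seed);
        this.replaceDoc = replaceDoc;
    }

    /** Picks a random primary key in [0, numDocs) and replaces that doc. */
    int updateOne() {
        int docid = random.nextInt(numDocs);
        replaceDoc.accept(docid);
        return docid;
    }

    /** Performs count updates, issuing at most updatesPerSec updates per second. */
    void run(int count, double updatesPerSec) {
        long intervalNanos = (long) (1_000_000_000.0 / updatesPerSec);
        long next = System.nanoTime();
        for (int i = 0; i < count; i++) {
            long wait;
            // Wait until the next scheduled update time; parkNanos may wake early.
            while ((wait = next - System.nanoTime()) > 0) {
                if (Thread.currentThread().isInterrupted()) {
                    return;
                }
                LockSupport.parkNanos(wait);
            }
            updateOne();
            next += intervalNanos;
        }
    }

    public static void main(String[] args) {
        // Stand-in hook that only logs; a real one would update the index.
        TrickleUpdater updater = new TrickleUpdater(
                50000, 42L, docid -> System.out.println("replace docid=" + docid));
        updater.run(10, 5.0); // 10 updates at 5 updates/sec
    }
}
```

Keeping the update behind a small hook means the same loop could drive any of the realtime techniques being compared.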

> Benchmark of different in RAM realtime techniques
> -------------------------------------------------
>
>                 Key: LUCENE-1577
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1577
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/*
>    Affects Versions: 2.4.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1577.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> A place to post code that benchmarks the differences in the speed of indexing and searching using different realtime techniques.


[jira] Commented: (LUCENE-1577) Benchmark of different in RAM realtime techniques

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742101#action_12742101 ] 

Mark Miller commented on LUCENE-1577:
-------------------------------------

bq. normally bulk indexing is done up front, and only a "trickle" of updates to docs are then done for near real-time search

It really depends, though, I think. I would bet that many users who want real time are dealing with a huge number of updates at given times, and that type of thing seems likely to grow. A lot of the time, I think it could be much more than a trickle.

A lot of installations I have seen have certain times when a lot of documents are coming in (certain times, certain days). Social-networking-type sites likely see a constant stream of updates at most times. Press releases have hotspots for release, newspaper data all comes in at once in the morning, etc.

> Benchmark of different in RAM realtime techniques
> -------------------------------------------------
>
>                 Key: LUCENE-1577
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1577
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/*
>    Affects Versions: 2.4.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1577.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> A place to post code that benchmarks the differences in the speed of indexing and searching using different realtime techniques.


[jira] Updated: (LUCENE-1577) Benchmark of different in RAM realtime techniques

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1577:
-------------------------------------

    Attachment: LUCENE-1577.patch

This patch benchmarks 3 different techniques for RAM-based realtime indexing, where the new document is searchable after each update.  It performs multiple rounds of indexing, takes the fastest round of each of the 3 techniques, and calculates each one's percentage difference from the fastest technique.  The document source is the Wikipedia English XML used by contrib/benchmark.

* RealtimeWriter uses InstantiatedIndex
* LuceneWriter adds documents to an IndexWriter
* LuceneRealtimeWriter creates a RAMDirectory, opens an IndexWriter, adds a document, then closes the writer.

I found it odd that RealtimeWriter is faster than LuceneWriter, so perhaps the benchmark is incorrect somehow.  Otherwise the results look highly promising in that we can implement realtime search with no impact on existing indexing performance.  

Summary of the results:

numRounds:3 docs indexed:50000
lowest of each, percent compared with lowest
RealtimeWriter:7597 dif:0% 
LuceneWriter:12940 dif:70%
LuceneRealtimeWriter:25882 dif:241%
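
The dif column appears to be each technique's lowest time expressed as a percentage above the fastest technique's lowest time, rounded to the nearest whole percent. A minimal sketch of that arithmetic, assuming this reading of the output is right (the class and method names are hypothetical and not taken from the patch):

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

class PercentDiff {
    /** Percent by which time exceeds fastest, rounded to the nearest integer. */
    static long percentOverFastest(long time, long fastest) {
        return Math.round((time - fastest) * 100.0 / fastest);
    }

    public static void main(String[] args) {
        // Lowest time of each technique, copied from the summary above.
        Map<String, Long> lowest = new LinkedHashMap<>();
        lowest.put("RealtimeWriter", 7597L);
        lowest.put("LuceneWriter", 12940L);
        lowest.put("LuceneRealtimeWriter", 25882L);

        long fastest = Collections.min(lowest.values());
        for (Map.Entry<String, Long> e : lowest.entrySet()) {
            long pct = percentOverFastest(e.getValue(), fastest);
            System.out.println(e.getKey() + ":" + e.getValue() + " dif:" + pct + "%");
        }
    }
}
```

With the three times above this gives 0%, 70% and 241%, matching the reported values. Note that 241% only comes out with rounding: truncating 240.69 would give 240%.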



[jira] Commented: (LUCENE-1577) Benchmark of different in RAM realtime techniques

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742084#action_12742084 ] 

Jason Rutherglen commented on LUCENE-1577:
------------------------------------------

We need a benchmark that simply measures the indexing of
1, 5, 10, 100, and 1000 docs + (reopen + query). The first benchmark
can use IW.getReader as is (meaning the newly created segments are
written to disk), and the second can use LUCENE-1313 (which stores
newly created segments in RAM). This way we can accurately say which
method works best and in which situation. The use case
LUCENE-1313 is designed for is updates of fewer than 100 documents.

I'll update LUCENE-1313, and give this a try. 
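
One way the proposed measurement could be structured, as a sketch only. The Technique interface and its two methods are hypothetical stand-ins for "index N docs" and "reopen + query"; the IW.getReader and LUCENE-1313 variants would each implement it. Keeping the lowest time over several rounds is an assumption, borrowed from how the attached patch reports its results.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BatchReopenBenchmark {
    /** Hypothetical abstraction over one realtime technique. */
    interface Technique {
        void indexDocs(int count);
        void reopenAndQuery();
    }

    static final int[] BATCH_SIZES = {1, 5, 10, 100, 1000};

    /** Returns, per batch size, the lowest elapsed nanoseconds over the given rounds. */
    static Map<Integer, Long> run(Technique technique, int rounds) {
        Map<Integer, Long> best = new LinkedHashMap<>();
        for (int size : BATCH_SIZES) {
            long lowest = Long.MAX_VALUE;
            for (int r = 0; r < rounds; r++) {
                long start = System.nanoTime();
                technique.indexDocs(size);   // add the batch
                technique.reopenAndQuery();  // make it searchable, then search
                lowest = Math.min(lowest, System.nanoTime() - start);
            }
            best.put(size, lowest);
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in technique backed by a list; it does no real indexing.
        List<String> docs = new ArrayList<>();
        Technique inMemory = new Technique() {
            public void indexDocs(int count) {
                for (int i = 0; i < count; i++) {
                    docs.add("doc" + docs.size());
                }
            }
            public void reopenAndQuery() {
                docs.contains("doc0");
            }
        };
        run(inMemory, 3).forEach((size, nanos) ->
                System.out.println(size + " docs: " + nanos + " ns"));
    }
}
```

Running both variants through the same loop would give directly comparable numbers at each batch size.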
