Posted to dev@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2007/12/07 14:03:32 UTC
O/S Search Comparisons
Was wondering if people have seen http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf
It has some interesting comparisons. Obviously, the comparison of Lucene
indexing was done with 1.9, so it probably needs to be done again. Just
wondering if people see any opportunities to improve Lucene from
it. I am going to try to contact the authors to see if I can get
what their setup values were (mergeFactor, Analyzer, etc.), as I think
it would be interesting to run the tests again on 2.3.
-Grant
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: O/S Search Comparisons
Posted by Grant Ingersoll <gs...@apache.org>.
My testing experience has shown a mergeFactor of around 100 to be good
for things like Wikipedia, etc. The point about paying the cost back
once optimize is undertaken is an interesting one and may be worth
exploring more. I also wonder how partial optimizes may help.
The Javadocs say:
* Determines how often segment indices are merged by addDocument(). With
* smaller values, less RAM is used while indexing, and searches on
* unoptimized indices are faster, but indexing speed is slower. With larger
* values, more RAM is used during indexing, and while searches on unoptimized
* indices are slower, indexing is faster. Thus larger values (> 10) are best
* for batch index creation, and smaller values (< 10) for indices that are
* interactively maintained.
*
* <p>Note that this method is a convenience method: it
* just calls mergePolicy.setMergeFactor as long as
* mergePolicy is an instance of {@link LogMergePolicy}.
* Otherwise an IllegalArgumentException is thrown.</p>
*
* <p>This must never be less than 2. The default value is 10.
I'd like to append to the last line to say something like:
Empirical testing suggests a maximum value around 100, but this
depends on the collection. Really large values (>>> 100) are
discouraged.
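To make the trade-off in that Javadoc concrete, here is a minimal sketch of level-based merging. It is a toy model, not Lucene's LogMergePolicy: the class name MergeFactorSketch, the method simulate, and the assumption that every flush produces a segment of identical size are all illustrative. It only counts how many segments remain and how many documents merges have rewritten, so it says nothing about RAM, file handles, or search latency.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of level-based segment merging; not Lucene's LogMergePolicy. */
class MergeFactorSketch {

    /** Outcome of one simulated indexing run. */
    static final class Result {
        final int segmentsLeft;  // segments a search on the unoptimized index would visit
        final long docsCopied;   // documents rewritten by merges, a rough proxy for merge I/O

        Result(int segmentsLeft, long docsCopied) {
            this.segmentsLeft = segmentsLeft;
            this.docsCopied = docsCopied;
        }
    }

    /** Simulates {@code flushes} equal-sized flushed segments under the given mergeFactor. */
    static Result simulate(int flushes, int docsPerFlush, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        // countAtLevel.get(i) = number of live segments holding docsPerFlush * mergeFactor^i docs
        List<Integer> countAtLevel = new ArrayList<Integer>();
        long docsCopied = 0;
        for (int f = 0; f < flushes; f++) {
            int level = 0;
            long segmentSize = docsPerFlush;
            while (true) {
                if (countAtLevel.size() == level) {
                    countAtLevel.add(0);
                }
                countAtLevel.set(level, countAtLevel.get(level) + 1);
                if (countAtLevel.get(level) < mergeFactor) {
                    break;
                }
                // mergeFactor equal-sized segments collapse into one segment at the next level
                countAtLevel.set(level, 0);
                segmentSize *= mergeFactor;
                docsCopied += segmentSize;
                level++;
            }
        }
        int segmentsLeft = 0;
        for (int c : countAtLevel) {
            segmentsLeft += c;
        }
        return new Result(segmentsLeft, docsCopied);
    }

    public static void main(String[] args) {
        for (int mf : new int[] {10, 100, 10000}) {
            Result r = simulate(1000, 1, mf);
            System.out.println("mergeFactor=" + mf
                + " segments=" + r.segmentsLeft
                + " docsCopied=" + r.docsCopied);
        }
    }
}
```

With 1000 single-document flushes, the model leaves 1 segment after copying 3000 documents at mergeFactor 10, 10 segments after copying 1000 documents at mergeFactor 100, and 1000 unmerged segments with nothing copied at mergeFactor 10,000. The last case is the configuration the paper's authors reportedly used: indexing looks cheap, but every search (or a later optimize) has to deal with all of those segments.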
On Dec 18, 2007, at 12:10 AM, Doron Cohen wrote:
> On Dec 18, 2007 2:38 AM, Mark Miller <ma...@gmail.com> wrote:
>
>> For the data that I normally work with (short articles), I found that
>> the sweet spot was around 80-120. I actually saw a slight decrease going
>> above that...not sure if that held forever though. That was testing on
>> an earlier release (I think 2.1?). However, if you want to test
>> searching it would seem that you are going to want to optimize the
>> index. I have always found that whatever I save by changing the merge
>> factor is paid back when you optimize. I have not "scientifically"
>> tested this, but found it to be the case in every speed test I ran. This
>> is an interesting thing to me for this test. Do you test with a full
>> optimize for indexing? If you don't, can you really test the search
>> performance with the advantage of a full optimize? So, if you are going
>> to optimize, why mess with the merge factor? It may still play a small
>> role, but at best I think it's a pretty weak lever.
>
>
> I had similar experience - set merge factor to ~maxint and optimized
> at the end, and "felt" like it was the same (never measured though).
> In fact, with the new concurrent merges, I think it should be faster to
> merge on the fly?
>
> (One comment - it is important to set back merge factor to a reasonable
> number before the final optimize, otherwise you hit OutOfMem due to
> so many segments being merged at once.)
>
>
>> - Mark
>>
>> Grant Ingersoll wrote:
>>> I did hear back from the authors. Some of the issues were based on
>>> values chosen for mergeFactor (10,000) I think, but there also seemed
>>> to be some questions about parsing the TREC collection. It was split
>>> out into individual files, as opposed to trying to stream in the
>>> documents like we do with Wikipedia, so I/O overhead may be an issue.
>>> At the time, 1.9.1 did not have much TREC support, so splitting files
>>> is probably the easiest way to do it. Their indexing code was based
>>> off the demo and some LIA reading.
>>>
>>> They thought they would try Lucene again when 2.3 comes out. From our
>>> end, I think we need to improve the docs around mergeFactor. We
>>> generally just say bigger is better, but my understanding is there is
>>> definitely a limit to this (100?? Maybe 1000) so we should probably
>>> suggest that in the docs. And, of course, I think the new
>>> contrib/benchmark has support for reading TREC (although I don't know
>>> if it handles streaming it) such that I think it shouldn't be a
>>> problem this time around.
>>
>
> Yes it does streaming - TREC compressed files are read with GZIPInputStream
> "on demand" - the next doc's text is read and parsed only when the indexer
> requests it, and the indexable document is created directly; no doc files
> are created on disk.
>
>
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: O/S Search Comparisons
Posted by Doron Cohen <cd...@gmail.com>.
On Dec 18, 2007 2:38 AM, Mark Miller <ma...@gmail.com> wrote:
> For the data that I normally work with (short articles), I found that
> the sweet spot was around 80-120. I actually saw a slight decrease going
> above that...not sure if that held forever though. That was testing on
> an earlier release (I think 2.1?). However, if you want to test
> searching it would seem that you are going to want to optimize the
> index. I have always found that whatever I save by changing the merge
> factor is paid back when you optimize. I have not "scientifically"
> tested this, but found it to be the case in every speed test I ran. This
> is an interesting thing to me for this test. Do you test with a full
> optimize for indexing? If you don't, can you really test the search
> performance with the advantage of a full optimize? So, if you are going
> to optimize, why mess with the merge factor? It may still play a small
> role, but at best I think it's a pretty weak lever.
I had similar experience - set merge factor to ~maxint and optimized
at the end, and "felt" like it was the same (never measured though).
In fact, with the new concurrent merges, I think it should be faster to
merge on the fly?
(One comment - it is important to set back merge factor to a reasonable
number before the final optimize, otherwise you hit OutOfMem due to
so many segments being merged at once.)
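The ordering in that comment can be sketched as follows. The Writer interface is a hypothetical stand-in that mirrors only three IndexWriter calls (setMergeFactor, addDocument, optimize) without their checked exceptions, so the block compiles without Lucene on the classpath; the class name BulkLoadSketch and the parameter names are illustrative, and the concrete merge factor values are up to the caller.

```java
/** Sketch of "high merge factor while loading, reasonable one before optimize". */
class BulkLoadSketch {

    /** Hypothetical stand-in for the three IndexWriter calls used below. */
    interface Writer<D> {
        void setMergeFactor(int mergeFactor);
        void addDocument(D doc);
        void optimize();
    }

    static <D> void bulkLoad(Writer<D> writer, Iterable<D> docs,
                             int loadMergeFactor, int optimizeMergeFactor) {
        // Large value: very few merges happen while documents are being added.
        writer.setMergeFactor(loadMergeFactor);
        for (D doc : docs) {
            writer.addDocument(doc);
        }
        // Restore a moderate value first, so the final optimize does not try
        // to merge a huge number of segments in one step.
        writer.setMergeFactor(optimizeMergeFactor);
        writer.optimize();
    }
}
```

A call such as bulkLoad(writer, docs, 1000, 10) would follow the pattern described above; whether it actually beats merging on the fly is exactly the open question in this thread.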
> - Mark
>
> Grant Ingersoll wrote:
> > I did hear back from the authors. Some of the issues were based on
> > values chosen for mergeFactor (10,000) I think, but there also seemed
> > to be some questions about parsing the TREC collection. It was split
> > out into individual files, as opposed to trying to stream in the
> > documents like we do with Wikipedia, so I/O overhead may be an issue.
> > At the time, 1.9.1 did not have much TREC support, so splitting files
> > is probably the easiest way to do it. Their indexing code was based
> > off the demo and some LIA reading.
> >
> > They thought they would try Lucene again when 2.3 comes out. From our
> > end, I think we need to improve the docs around mergeFactor. We
> > generally just say bigger is better, but my understanding is there is
> > definitely a limit to this (100?? Maybe 1000) so we should probably
> > suggest that in the docs. And, of course, I think the new
> > contrib/benchmark has support for reading TREC (although I don't know
> > if it handles streaming it) such that I think it shouldn't be a
> > problem this time around.
>
Yes it does streaming - TREC compressed files are read with GZIPInputStream
"on demand" - the next doc's text is read and parsed only when the indexer
requests it, and the indexable document is created directly; no doc files
are created on disk.
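As an illustration of that on-demand reading, here is a minimal sketch of a lazy iterator over a gzipped TREC-style stream. It is not the contrib/benchmark code: the class name TrecStreamSketch, the gzip helper, the ISO-8859-1 charset, and the assumption that <DOC> and </DOC> each sit on their own line are all illustrative. The point is only that nothing past the current document is decompressed until the caller asks for the next one.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Lazily yields the body of each <DOC>...</DOC> block from a gzipped stream. */
class TrecStreamSketch implements Iterator<String>, Closeable {
    private final BufferedReader reader;
    private String pending;   // look-ahead document, valid only when loaded is true
    private boolean loaded;   // false until a caller actually asks for a document

    TrecStreamSketch(InputStream gzipped) {
        try {
            // Assumed charset; adjust to whatever the collection really uses.
            reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.ISO_8859_1));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        if (!loaded) {
            pending = readDoc();  // decompress and scan only as far as one document
            loaded = true;
        }
        return pending != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        loaded = false;
        String doc = pending;
        pending = null;
        return doc;
    }

    /** Returns the lines between <DOC> and </DOC>, or null at end of stream. */
    private String readDoc() {
        try {
            StringBuilder body = null;
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.equals("<DOC>")) {
                    body = new StringBuilder();
                } else if (trimmed.equals("</DOC>")) {
                    if (body != null) {
                        return body.toString();
                    }
                } else if (body != null) {
                    body.append(line).append('\n');
                }
            }
            return null;  // a document cut off by end of stream is dropped
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }

    /** Test helper: gzip a string in memory. */
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(text.getBytes(StandardCharsets.ISO_8859_1));
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An indexing loop would call next() once per document and build the indexable document from the returned text, so no intermediate per-document files ever touch the disk.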
Re: O/S Search Comparisons
Posted by Mark Miller <ma...@gmail.com>.
For the data that I normally work with (short articles), I found that
the sweet spot was around 80-120. I actually saw a slight decrease going
above that...not sure if that held forever though. That was testing on
an earlier release (I think 2.1?). However, if you want to test
searching it would seem that you are going to want to optimize the
index. I have always found that whatever I save by changing the merge
factor is paid back when you optimize. I have not "scientifically"
tested this, but found it to be the case in every speed test I ran. This
is an interesting thing to me for this test. Do you test with a full
optimize for indexing? If you don't, can you really test the search
performance with the advantage of a full optimize? So, if you are going
to optimize, why mess with the merge factor? It may still play a small
role, but at best I think it's a pretty weak lever.
- Mark
Re: O/S Search Comparisons
Posted by Grant Ingersoll <gs...@apache.org>.
I did hear back from the authors. Some of the issues were based on
values chosen for mergeFactor (10,000) I think, but there also seemed
to be some questions about parsing the TREC collection. It was split
out into individual files, as opposed to trying to stream in the
documents like we do with Wikipedia, so I/O overhead may be an issue.
At the time, 1.9.1 did not have much TREC support, so splitting files
is probably the easiest way to do it. Their indexing code was based
off the demo and some LIA reading.
They thought they would try Lucene again when 2.3 comes out. From our
end, I think we need to improve the docs around mergeFactor. We
generally just say bigger is better, but my understanding is there is
definitely a limit to this (100?? Maybe 1000) so we should probably
suggest that in the docs. And, of course, I think the new
contrib/benchmark has support for reading TREC (although I don't know
if it handles streaming it) such that I think it shouldn't be a problem
this time around.
At any rate, I think we are for the most part doing the right things.
Anyone have any thoughts on advice about an upper bound for mergeFactor?
Cheers,
Grant
On Dec 10, 2007, at 2:54 PM, Mike Klaas wrote:
> On 8-Dec-07, at 10:04 PM, Doron Cohen wrote:
>
>>> +1 I have been thinking about this too. Solr clearly demonstrates
>>> the benefits of this kind of approach, although even it doesn't make
>>> it seamless for users in the sense that they still need to divvy up
>>> the docs on the app side.
>>
>> Would be nice if this layer also took care of searchers/readers
>> refreshing & warming.
>
> Solr has well-tested code that provides all this functionality and
> more (except for automatically spawning extra indexing threads,
> which I agree would be a useful addition). It does heavily depend
> on Java 1.5's java.util.concurrent package, though. Many people seem
> to like using Solr as an embedded library layer on top of Lucene to
> do it all in-process, as well.
>
> -Mike
Re: O/S Search Comparisons
Posted by Mike Klaas <mi...@gmail.com>.
On 8-Dec-07, at 10:04 PM, Doron Cohen wrote:
>> +1 I have been thinking about this too. Solr clearly demonstrates
>> the benefits of this kind of approach, although even it doesn't make
>> it seamless for users in the sense that they still need to divvy up
>> the docs on the app side.
>
> Would be nice if this layer also took care of searchers/readers
> refreshing & warming.
Solr has well-tested code that provides all this functionality and
more (except for automatically spawning extra indexing threads, which
I agree would be a useful addition). It does heavily depend on Java 1.5's
java.util.concurrent package, though. Many people seem to like using
Solr as an embedded library layer on top of Lucene to do it all
in-process, as well.
-Mike
Re: O/S Search Comparisons
Posted by Michael McCandless <lu...@mikemccandless.com>.
Well, at some point the answer is "use Solr". I think Lucene should
stay focused on being a good search library/component, and server
level capabilities should be handled by Solr or the application layer
on top of Lucene.
That said, I still think there is a need for a layer that handles/hides
threading, does search refreshing/warming, etc., on the Lucene side.
Actually, LuceneIndexAccessor (LUCENE-390) is already a step in that
direction.
Mike
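One possible shape for the threading part of such a layer, as floated earlier in this thread (add/updateDocument calls drop into a queue that worker threads drain), is sketched below. It is not LuceneIndexAccessor and not an existing Lucene class: ThreadedAdderSketch, the Sink interface (a stand-in for a thread-safe IndexWriter.addDocument), and finish() are hypothetical names, and the sketch assumes a modern Java with lambdas. Search refreshing and warming are not covered.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Queues documents and lets a fixed pool of threads hand them to a sink. */
class ThreadedAdderSketch<D> {

    /** Hypothetical stand-in for a thread-safe IndexWriter.addDocument. */
    interface Sink<D> {
        void add(D doc) throws Exception;
    }

    private final Sink<D> sink;
    private final ThreadPoolExecutor pool;
    private final AtomicReference<Exception> firstFailure = new AtomicReference<Exception>();

    ThreadedAdderSketch(Sink<D> sink, int threads, int queueCapacity) {
        this.sink = sink;
        // Bounded queue: when it is full, the calling thread runs the add itself,
        // which throttles producers instead of buffering without limit.
        this.pool = new ThreadPoolExecutor(
            threads, threads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<Runnable>(queueCapacity),
            new ThreadPoolExecutor.CallerRunsPolicy());
    }

    /** Returns quickly; the document is added later by a worker thread. */
    void addDocument(final D doc) {
        pool.execute(() -> {
            try {
                sink.add(doc);
            } catch (Exception e) {
                firstFailure.compareAndSet(null, e);  // remember only the first error
            }
        });
    }

    /** Waits until every queued document has been handed to the sink. */
    void finish() {
        pool.shutdown();
        try {
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining the queue", e);
        }
        Exception failure = firstFailure.get();
        if (failure != null) {
            throw new IllegalStateException("a queued add failed", failure);
        }
    }
}
```

The design choice worth noting is that errors surface at finish() rather than at the addDocument call that caused them, which is one of the ways such a wrapper would not feel exactly like calling IndexWriter directly.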
On Dec 9, 2007, at 1:29 AM, robert engels wrote:
> This is along the lines of what I have tried to get the Lucene
> community to adopt for a long time.
>
> If you want to take Lucene to the next level, it needs a "server"
> implementation.
>
> Only with this can you get efficient locks, caching, transactions,
> which leads to more efficient indexing and searching.
>
> IMO, the "shared" storage nature of Lucene is its biggest
> weakness. A lot of changes have been made to improve this, when it
> probably just needs to be dropped. If you have a network, it is
> really no different to communicate with processes rather than storage.
Re: O/S Search Comparisons
Posted by robert engels <re...@ix.netcom.com>.
This is along the lines of what I have tried to get the Lucene
community to adopt for a long time.
If you want to take Lucene to the next level, it needs a "server"
implementation.
Only with this can you get efficient locks, caching, transactions,
which leads to more efficient indexing and searching.
IMO, the "shared" storage nature of Lucene is its biggest weakness.
A lot of changes have been made to improve this, when it probably
just needs to be dropped. If you have a network, it is really no
different to communicate with processes rather than storage.
On Dec 9, 2007, at 12:04 AM, Doron Cohen wrote:
> Grant Ingersoll <gs...@apache.org> wrote on 08/12/2007 16:02:31:
>
>>
>> On Dec 8, 2007, at 4:51 AM, Michael McCandless wrote:
>>
>>>>> Sometimes, when something like this comes up, it gives you the
>>>>> opportunity to take a step back and ask what are the things we
>>>>> really want Lucene to be going forward (the New Year is good for
>>>>> this kind of assessment as well) What are its strengths and
>>>>> weaknesses? What can we improve in the short term and what needs
>>>>> to improve in the longer term? Maybe it's just that time of year
>>>>> to send out your Lucene Wish List... :-)
>>>
>>> +1
>>>
>>> There is still something for us to learn & improve in Lucene, even
>>> if the comparison is necessarily apples/oranges or unfair.
>>>
>>> Lucene was listed as not having "Result Excerpt" which isn't really
>>> fair, though it is true you have to pull in contrib/highlighter to
>>> enable it.
>>
>> Yeah, I noted that mentally, but didn't think it was a big deal since
>> not everyone wants it. The other thing is, some of it comes down to
>> how you structure your content. I think a lot of people use metadata
>> fields to provide enough "summary" info about a document.
>>
>>>
>>>
>>>> Did it crash on the 10 GB? I thought it said that it just took way
>>>> too long (7 times the best or something). Frankly, either case is
>>>> suspect. Last summer I indexed about 5 million docs with a total
>>>> size at the *very* least of 10 GB on my 3 year old desktop. It
>>>> didn't take much more than 8 hours to index and searches were
>>>> still lightning fast. Maybe they forgot to give the JVM more than
>>>> the default amount of RAM <g>
>>>
>>> The paper just said "ht://Dig and Lucene degraded considerably their
>>> indexing time, and we excluded them from the final comparison".
>>>
>>> Maybe Lucene just hit a very large segment merge and the author
>>> incorrectly thought something had gone wrong since the addDocument
>>> call was taking incredibly long? In which case the new default
>>> ConcurrentMergeScheduler should improve that. I would expect Lucene
>>> 2.3 to now have an advantage in that it makes use of concurrency in
>>> the hardware, out of the box, whereas likely other older engines are
>>> single threaded.
>>
>> Yep.
>>
>>>
>>>
>>> I've also thought about creating a simple optional threaded layer on
>>> top of IndexWriter which uses multiple threads to add documents,
>>> under the hood. Such a class would expose all of the methods of
>>> IndexWriter (would feel just like IndexWriter), except calls to
>>> add/updateDocument would drop into a queue which multiple threads
>>> (maintained by this class) would pull from and execute. This would
>>> then let Lucene make use of even more concurrency ... and saves the
>>> "complexity" of application writers having to manage threads above
>>> Lucene.
>>
>> +1 I have been thinking about this too. Solr clearly demonstrates
>> the benefits of this kind of approach, although even it doesn't make
>> it seamless for users in the sense that they still need to divvy up
>> the docs on the app side.
>
> Would be nice if this layer also took care of searchers/readers
> refreshing & warming.
>
>>
>> Here's some of my wishes:
>>
>> 1. Better Demo
>>
>> 2. Alternate scoring algorithms (which implies indexing too) that
>> perform at or near the same level as the current ones
>
> +1
>
>>
>> 3. A way of announcing improvements to Interfaces such that we have
>> better ability to add methods to interfaces, knowing full well it will
>> break some people. Same goes for deprecated. In this day and age of
>> agile programming, it seems a bit restrictive to me that we wait 1+
>> years (the average time between major releases) to remove what we
>> consider to be cruft in our code or add new capabilities to
>> interfaces. I would suggest we announce a deprecated method, version
>> it, mark it to when it is going away (i.e. This will be removed in
>> version 2.6) and then do so in that version. So, if we deprecate
>> something in 2.3, we could, assuming consecutive numbered releases,
>> remove it in 2.5. This would presumably move things up a bit to about
>> the 6 mos. time range. Just a thought... :-)
>>
>> -Grant
Re: O/S Search Comparisons
Posted by Doron Cohen <DO...@il.ibm.com>.
Grant Ingersoll <gs...@apache.org> wrote on 08/12/2007 16:02:31:
>
> On Dec 8, 2007, at 4:51 AM, Michael McCandless wrote:
>
> >>> Sometimes, when something like this comes up, it gives you the
> >>> opportunity to take a step back and ask what are the things we
> >>> really want Lucene to be going forward (the New Year is good for
> >>> this kind of assessment as well) What are its strengths and
> >>> weaknesses? What can we improve in the short term and what needs
> >>> to improve in the longer term? Maybe it's just that time of year
> >>> to send out your Lucene Wish List... :-)
> >
> > +1
> >
> > There is still something for us to learn & improve in Lucene, even
> > if the comparison is necessarily apples/oranges or unfair.
> >
> > Lucene was listed as not having "Result Excerpt" which isn't really
> > fair, though it is true you have to pull in contrib/highlighter to
> > enable it.
>
> Yeah, I noted that mentally, but didn't think it was a big deal since
> not everyone wants it. The other thing is, some of it comes down to
> how you structure your content. I think a lot of people use metadata
> fields to provide enough "summary" info about a document.
>
> >
> >
> >> Did it crash on the 10 GB? I thought it said that it just took way
> >> too long (7 times the best or something). Frankly, either case is
> >> suspect. Last summer I indexed about 5 million docs with a total
> >> size at the *very* least of 10 GB on my 3-year-old desktop. It
> >> didn't take much more than 8 hours to index and searches were
> >> still lightning fast. Maybe they forgot to give the JVM more than
> >> the default amount of RAM <g>
> >
> > The paper just said "ht://Dig and Lucene degraded considerably their
> > indexing time, and we excluded them from the final comparison".
> >
> > Maybe Lucene just hit a very large segment merge and the author
> > incorrectly thought something had gone wrong since the addDocument
> > call was taking incredibly long? In which case the new default
> > ConcurrentMergeScheduler should improve that. I would expect Lucene
> > 2.3 to now have an advantage in that it makes use of concurrency in
> > the hardware, out of the box, whereas likely other older engines are
> > single threaded.
>
> Yep.
>
> >
> >
> > I've also thought about creating a simple optional threaded layer on
> > top of IndexWriter which uses multiple threads to add documents,
> > under the hood. Such a class would expose all of the methods of
> > IndexWriter (would feel just like IndexWriter), except calls to add/
> > updateDocument would drop into a queue which multiple threads
> > (maintained by this class) would pull from and execute. This would
> > then let Lucene make use of even more concurrency ... and saves the
> > "complexity" of application writers having to manage threads above
> > Lucene.
>
> +1 I have been thinking about this too. Solr clearly demonstrates
> the benefits of this kind of approach, although even it doesn't make
> it seamless for users in the sense that they still need to divvy up
> the docs on the app side.
It would be nice if this layer also took care of refreshing and
warming searchers/readers.
>
> Here are some of my wishes:
>
> 1. Better Demo
>
> 2. Alternate scoring algorithms (which implies indexing too) that
> perform at or near the same level as the current ones
+1
>
> 3. A way of announcing improvements to Interfaces such that we have
> better ability to add methods to interfaces, knowing full well it will
> break some people. Same goes for deprecated. In this day and age of
> agile programming, it seems a bit restrictive to me that we wait 1+
> years (the average time between major releases) to remove what we
> consider to be cruft in our code or add new capabilities to
> interfaces. I would suggest we announce a deprecated method, version
> it, mark it to when it is going away (i.e. This will be removed in
> version 2.6) and then do so in that version. So, if we deprecate
> something in 2.3, we could, assuming consecutive numbered releases,
> remove it in 2.5. This would presumably move things up a bit to about
> the 6 mos. time range. Just a thought... :-)
>
> -Grant
Re: O/S Search Comparisons
Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 8, 2007, at 4:51 AM, Michael McCandless wrote:
>>> Sometimes, when something like this comes up, it gives you the
>>> opportunity to take a step back and ask what are the things we
>>> really want Lucene to be going forward (the New Year is good for
>>> this kind of assessment as well). What are its strengths and
>>> weaknesses? What can we improve in the short term and what needs
>>> to improve in the longer term? Maybe it's just that time of year
>>> to send out your Lucene Wish List... :-)
>
> +1
>
> There is still something for us to learn & improve in Lucene, even
> if the comparison is necessarily apples/oranges or unfair.
>
> Lucene was listed as not having "Result Excerpt" which isn't really
> fair, though it is true you have to pull in contrib/highlighter to
> enable it.
Yeah, I noted that mentally, but didn't think it was a big deal since
not everyone wants it. The other thing is, some of it comes down to
how you structure your content. I think a lot of people use metadata
fields to provide enough "summary" info about a document.
>
>
>> Did it crash on the 10 GB? I thought it said that it just took way
>> too long (7 times the best or something). Frankly, either case is
>> suspect. Last summer I indexed about 5 million docs with a total
>> size at the *very* least of 10 GB on my 3-year-old desktop. It
>> didn't take much more than 8 hours to index and searches were
>> still lightning fast. Maybe they forgot to give the JVM more than
>> the default amount of RAM <g>
>
> The paper just said "ht://Dig and Lucene degraded considerably their
> indexing time, and we excluded them from the final comparison".
>
> Maybe Lucene just hit a very large segment merge and the author
> incorrectly thought something had gone wrong since the addDocument
> call was taking incredibly long? In which case the new default
> ConcurrentMergeScheduler should improve that. I would expect Lucene
> 2.3 to now have an advantage in that it makes use of concurrency in
> the hardware, out of the box, whereas likely other older engines are
> single threaded.
Yep.
>
>
> I've also thought about creating a simple optional threaded layer on
> top of IndexWriter which uses multiple threads to add documents,
> under the hood. Such a class would expose all of the methods of
> IndexWriter (would feel just like IndexWriter), except calls to add/
> updateDocument would drop into a queue which multiple threads
> (maintained by this class) would pull from and execute. This would
> then let Lucene make use of even more concurrency ... and saves the
> "complexity" of application writers having to manage threads above
> Lucene.
+1 I have been thinking about this too. Solr clearly demonstrates
the benefits of this kind of approach, although even it doesn't make
it seamless for users in the sense that they still need to divvy up
the docs on the app side.
Here are some of my wishes:

1. Better Demo

2. Alternate scoring algorithms (which implies indexing too) that
perform at or near the same level as the current ones

3. A way of announcing improvements to interfaces such that we have
a better ability to add methods to interfaces, knowing full well it will
break some people. The same goes for deprecations. In this day and age of
agile programming, it seems a bit restrictive to me that we wait 1+
years (the average time between major releases) to remove what we
consider to be cruft in our code or to add new capabilities to
interfaces. I would suggest we announce a deprecated method, version
it, mark when it is going away (e.g. "This will be removed in
version 2.6") and then do so in that version. So, if we deprecate
something in 2.3, we could, assuming consecutively numbered releases,
remove it in 2.5. This would presumably move things up a bit to about
the 6-month time range. Just a thought... :-)
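
To make the deprecation proposal concrete, here is a minimal sketch of what a versioned deprecation notice might look like in Java. The class ExampleWriter and its methods are hypothetical and are not part of Lucene; the version numbers simply mirror the 2.3/2.5 example above.

```java
/** Hypothetical class used only to illustrate a versioned deprecation notice. */
final class ExampleWriter {
    private int bufferSize = 16;

    /**
     * Old spelling of the setter, kept for one more release.
     *
     * @deprecated Deprecated in 2.3; this method will be removed in 2.5.
     *             Use {@link #setBufferSize(int)} instead.
     */
    @Deprecated
    void setBufSz(int size) {
        // Delegate so that old and new callers behave identically until removal.
        setBufferSize(size);
    }

    /** Replacement method that callers should migrate to. */
    void setBufferSize(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.bufferSize = size;
    }

    int getBufferSize() {
        return bufferSize;
    }
}
```

The point of the sketch is that both the "deprecated in" and the "removed in" versions sit in the Javadoc, so a user reading the compiler warning knows how long they have to migrate.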
-Grant
Re: O/S Search Comparisons
Posted by Michael McCandless <lu...@mikemccandless.com>.
>> Sometimes, when something like this comes up, it gives you the
>> opportunity to take a step back and ask what are the things we
>> really want Lucene to be going forward (the New Year is good for
>> this kind of assessment as well). What are its strengths and
>> weaknesses? What can we improve in the short term and what needs
>> to improve in the longer term? Maybe it's just that time of year
>> to send out your Lucene Wish List... :-)
+1
There is still something for us to learn & improve in Lucene, even if
the comparison is necessarily apples/oranges or unfair.
Lucene was listed as not having "Result Excerpt" which isn't really
fair, though it is true you have to pull in contrib/highlighter to
enable it.
> Did it crash on the 10 GB? I thought it said that it just took way
> too long (7 times the best or something). Frankly, either case is
> suspect. Last summer I indexed about 5 million docs with a total
> size at the *very* least of 10 GB on my 3-year-old desktop. It
> didn't take much more than 8 hours to index and searches were
> still lightning fast. Maybe they forgot to give the JVM more than
> the default amount of RAM <g>
The paper just said "ht://Dig and Lucene degraded considerably their
indexing time, and we excluded them from the final comparison".
Maybe Lucene just hit a very large segment merge and the author
incorrectly thought something had gone wrong since the addDocument
call was taking incredibly long? In which case the new default
ConcurrentMergeScheduler should improve that. I would expect Lucene
2.3 to now have an advantage in that it makes use of concurrency in
the hardware, out of the box, whereas likely other older engines are
single threaded.
I've also thought about creating a simple optional threaded layer on
top of IndexWriter which uses multiple threads to add documents,
under the hood. Such a class would expose all of the methods of
IndexWriter (would feel just like IndexWriter), except calls to add/
updateDocument would drop into a queue which multiple threads
(maintained by this class) would pull from and execute. This would
then let Lucene make use of even more concurrency ... and saves the
"complexity" of application writers having to manage threads above
Lucene.
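
A minimal sketch of the threaded layer described above, under stated assumptions: DocSink is a hypothetical stand-in for the single IndexWriter call being wrapped (it is not a Lucene interface), the wrapped object is assumed to be safe to call from several threads, and only addDocument is shown rather than the full IndexWriter surface.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the writer call being wrapped. */
interface DocSink<D> {
    void addDocument(D doc) throws Exception;
}

/** Queues addDocument calls and runs them on a fixed pool of worker threads. */
final class ThreadedDocAdder<D> implements AutoCloseable {
    private final DocSink<D> sink;
    private final ThreadPoolExecutor pool;
    private final AtomicInteger failures = new AtomicInteger();

    ThreadedDocAdder(DocSink<D> sink, int threads, int queueCapacity) {
        this.sink = sink;
        // Bounded queue: when it is full, CallerRunsPolicy makes the calling
        // thread do the work itself, which acts as simple back-pressure.
        this.pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    /** Hands the document to the pool; a worker thread calls the sink. */
    void addDocument(final D doc) {
        pool.execute(() -> {
            try {
                sink.addDocument(doc);
            } catch (Exception e) {
                // A real layer would need to report this to the application.
                failures.incrementAndGet();
            }
        });
    }

    int failureCount() {
        return failures.get();
    }

    /** Blocks until every queued document has been passed to the sink. */
    @Override
    public void close() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
    }
}
```

The application keeps calling addDocument from a single thread and calls close() when done. Error reporting and document ordering are the parts this sketch glosses over, and they are probably where most of the real design work would be.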
It is also possible the collection size is such that the merge cost
was very high (too high), because the LogMergePolicy inadvertently
optimizes every so often. I.e., for certain "unlucky" ranges of
collection sizes (number of documents "just above" maxBufferedDocs *
powers-of-mergeFactor, in log-space) you will indeed see that the
amortized merge cost is far too high. This is because
LogMergePolicy is "pay it forward": it pays up front for continuing
growth of the index, vs. paying as you go, which would be better. I
opened LUCENE-854 for this issue a while back, but it's still open.
E.g., I think KinoSearch's merging doesn't "inadvertently optimize".
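
The "unlucky sizes" effect can be illustrated with a toy model. This is a simplified simulation, not the actual LogMergePolicy code: it assumes every flush produces a segment of exactly maxBufferedDocs documents, that mergeFactor equal-sized segments are merged as soon as they exist, and that merge cost is just the number of documents copied. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a logarithmic merge policy; counts documents copied by merges. */
final class LogMergeSim {

    /** Total documents rewritten by merges while indexing numDocs documents. */
    static long docsCopied(int numDocs, int maxBufferedDocs, int mergeFactor) {
        // segmentsAtLevel.get(k) = number of live segments that each hold
        // maxBufferedDocs * mergeFactor^k documents.
        List<Integer> segmentsAtLevel = new ArrayList<>();
        long copied = 0;
        for (int flushed = maxBufferedDocs; flushed <= numDocs; flushed += maxBufferedDocs) {
            int level = 0;
            long segmentSize = maxBufferedDocs;
            while (true) {
                if (segmentsAtLevel.size() <= level) {
                    segmentsAtLevel.add(0);
                }
                int count = segmentsAtLevel.get(level) + 1;
                if (count < mergeFactor) {
                    segmentsAtLevel.set(level, count);
                    break;
                }
                // mergeFactor segments at this level: merge them into one
                // segment at the next level, copying every document in them.
                segmentsAtLevel.set(level, 0);
                segmentSize *= mergeFactor;
                copied += segmentSize;
                level++;
            }
        }
        return copied;
    }

    public static void main(String[] args) {
        // One flush short of 10 * 10^3 documents, and exactly at it.
        System.out.println(docsCopied(9990, 10, 10));
        System.out.println(docsCopied(10000, 10, 10));
    }
}
```

In this model, with maxBufferedDocs = 10 and mergeFactor = 10, indexing 9,990 documents copies 18,900 documents in merges, while indexing 10,000 copies 30,000: the last flush triggers a cascade that rewrites the whole index into a single segment, which is the "inadvertent optimize". A collection that ends just past such a boundary has paid for growth it never uses.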
>> a) missing something in our defaults setup
I do think we've improved "out of the box defaults" in 2.3, not only
with the speedups to indexing in LUCENE-843, but also changing the
default to flushing at 16 MB instead of every 10 documents. This
ought to be a sizable improvement for users who just rely on Lucene's
defaults (which is presumably the vast majority of users).
Re: O/S Search Comparisons
Posted by Mark Miller <ma...@gmail.com>.
Did it crash on the 10 GB? I thought it said that it just took way too
long (7 times the best or something). Frankly, either case is suspect.
Last summer I indexed about 5 million docs with a total size at the
*very* least of 10 GB on my 3-year-old desktop. It didn't take much more
than 8 hours to index and searches were still lightning fast. Maybe
they forgot to give the JVM more than the default amount of RAM <g>
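
One quick way to rule out the "default amount of RAM" explanation is to print the heap ceiling the JVM is actually running with. This is a small sketch using only the standard Runtime API; the class name HeapCheck is made up for the example.

```java
/** Prints the maximum heap size the running JVM will attempt to use. */
final class HeapCheck {

    /** Maximum heap in megabytes, as reported by the JVM. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```

Running it once as `java HeapCheck` and once with a larger ceiling, e.g. `java -Xmx1g HeapCheck`, shows whether the indexing process was started with more than the default.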
- Mark
Grant Ingersoll wrote:
> All true and good points. Lucene held up quite nicely in the search
> aspect (at least perf. wise) and I generally don't think making these
> kinds of comparisons is all that useful (we call it apples and oranges
> in English :-) ).
>
> What I am trying to get at is if this paper was just about Lucene and
> never mentioned a single other system, what, if anything, can we take
> from it that can help us make Lucene better. I know, for instance,
> from my own personal experience, that 2.3 is somewhere in the range of
> 3-5+ times faster than 2.2 (which I know is faster than 1.9). That
> being said, the paper clearly states that Lucene was not capable of
> doing the WT10g docs because performance degraded too much. Now, I
> know Lucene is pretty darn capable of a lot of things and people are
> using it to do web search, etc. at very large scales (I have
> personally talked w/ people doing it). So, what I worry about is that
> either we are:
> a) missing something in our defaults setup
> b) missing something in our docs and our education efforts, or
> c) missing some capability in our indexing such that it is
> crashing
>
> Now, what is to be done? It may well be nothing, but I just want to
> make sure we are comfortable with that decision or whether it is worth
> asking for a volunteer who has access to the WT10g docs to go have a
> look at it and see what happens. I personally don't have access to
> these docs, otherwise I would try it out. What we don't want to
> happen is for potential supporters/contributors to read that paper and
> say "Lucene isn't for me because of this."
>
> Sometimes, when something like this comes up, it gives you the
> opportunity to take a step back and ask what are the things we really
> want Lucene to be going forward (the New Year is good for this kind of
> assessment as well). What are its strengths and weaknesses? What can
> we improve in the short term and what needs to improve in the longer
> term? Maybe it's just that time of year to send out your Lucene Wish
> List... :-)
>
> Cheers,
> Grant
>
> PS: Samir, any chance of contributing back your ranking algorithms? :-)
>
>
> On Dec 7, 2007, at 5:41 PM, Samir Abdou wrote:
>
>> There is an expression in French, "comparer des pommes et des
>> poires", which literally means "to compare apples and pears". That's
>> what this paper is about. From my point of view, such a comparison
>> would be interesting only if a cross analysis of different criteria
>> (for example, retrieval effectiveness (aka search quality), search
>> time, indexing time, index size, query language, index structure,
>> and so on...) is done. Comparing different systems based on only one
>> criterion is not well-grounded. There is always a kind of trade-off:
>> for example, besides other parameters (ranking algorithm, frequency
>> statistics, document structure, etc.), indexing with zettair is much
>> faster than indexing with lucene, but if we consider searching time,
>> lucene is better than zettair. Why? For many reasons, but probably
>> because zettair doesn't have the complex document structure of
>> lucene, besides the ranking algorithm (Okapi BM25 vs. tf-idf). Some
>> systems compute and store the scores at indexing time, which makes
>> them faster at search time but less flexible if you want to
>> change/implement a new ranking algorithm.
>>
>>>> Still, when a well-respected researcher in the field says Lucene
>>>> didn't do so hot in certain areas,
>>
>> If we consider the search quality, that's simply not true if we know
>> how to implement popular ranking algorithms such as Okapi BM25 (at
>> least) in Lucene. I've been working with Lucene for four years now;
>> all the experiments of my thesis have been done using Lucene (with
>> many adaptations to implement the most recent ranking algorithms,
>> including different language models, divergence from randomness,
>> etc.). I also participated in major IR campaigns (NTCIR, CLEF and
>> TREC) and the results are not bad at all (see
>> http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings5/data/CLIR/NTCIR5-OV-CLIR-KishidaK.pdf
>> for NTCIR-5 or
>> http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings6/NTCIR/NTCIR6-OVERVIEW.pdf
>> for NTCIR-6; for CLEF have a look at
>> http://www.clef-campaign.org/2006/working_notes/workingnotes2006/dinunzioOCLEF2006.pdf,
>> ...); for other information, search the web ;-)
>>
>> Samir
>>
>>
>>> -----Original Message-----
>>> From: Mark Miller [mailto:markrmiller@gmail.com]
>>> Sent: Friday, 7 December 2007 21:01
>>> To: java-dev@lucene.apache.org
>>> Subject: Re: O/S Search Comparisons
>>>
>>> Yes, and even if they did not use the stock defaults, I would bet there
>>> would be complaints about what was done wrong at every turn. This seems
>>> like a very difficult thing to do. How long does it take to fully learn
>>> how to correctly utilize each search engine for the task at hand? I am
>>> sure longer than these busy men could possibly take. It seems that such
>>> a comparison could only be done legitimately if experts for each search
>>> engine set up the indexing/searching processes. Even then the results
>>> seem like they could be difficult to measure... e.g., was each search
>>> engine configured so that it would only break on spaces for indexing
>>> and do nothing else special at all? So many small settings and so much
>>> knowledge are needed to ensure each engine is on level ground...
>>>
>>> I doubt it will ever happen, but some sort of open source search off
>>> would be pretty cool <g>. Then each camp could properly configure their
>>> search engine for each task.
>>>
>>> - Mark
>>>
>>> Mike Klaas wrote:
>>>> There is a good chance that they were using stock indexing defaults,
>>>> based on:
>>>>
>>>> Lucene:
>>>> " In the present work, the simple applications
>>>> bundled with the library were used to index the collection. "
>>>>
>>>> On 7-Dec-07, at 10:27 AM, Grant Ingersoll wrote:
>>>>
>>>>> Yeah, I wasn't too excited over it and I certainly didn't lose any
>>>>> sleep over it, but there are some interesting things of note in
>>>>> there concerning Lucene, including the claim that it fell over on
>>>>> indexing WT10g docs (page 40) and I am always looking for ways to
>>>>> improve things. Overall, I think Lucene held up pretty well in the
>>>>> evaluation, and I know how suspect _any_ evaluation is given the
>>>>> myriad ways of doing search. Still, when a well-respected
>>>>> researcher in the field says Lucene didn't do so hot in certain
>>>>> areas, I don't think we can dismiss them out of hand. So regardless
>>>>> of the tests being right or wrong, it is worth either addressing
>>>>> the failures in Lucene or the failures in the test, such that we
>>>>> make sure we are properly educating our users on how best to use
>>>>> Lucene.
>>>>>
>>>>> I emailed the authors asking for information on how the test was run
>>>>> etc., so we'll see if anything comes of it.
>>>>>
>>>>> On Dec 7, 2007, at 12:04 PM, robert engels wrote:
>>>>>
>>>>>> I wouldn't get too excited over this. Once again, it does not seem
>>>>>> the evaluator understands the nature of GC based systems, and the
>>>>>> memory statistics are quite out of whack. But it is hard to tell
>>>>>> because there is no data on how memory consumption was actually
>>>>>> measured.
>>>>>>
>>>>>> A far better way of measuring memory consumption is to cap the
>>>>>> process at different levels (max ram sizes), and compare the
>>>>>> performance at each level.
>>>>>>
>>>>>> There is also the fact that a process takes memory from disk cache,
>>>>>> and vice versa, that heavily affects search performance, etc.
>>>>>>
>>>>>> Since there is no detailed data (that I could find) about system
>>>>>> configuration, etc. the results are highly suspect.
>>>>>>
>>>>>> There is also no mention of performance on multi-processor systems.
>>>>>> Some systems (like Lucene) pay a penalty to support
>>>>>> multi-processing (both in Java and Lucene), and only realize this
>>>>>> benefit when operating in a multi-processor environment.
>>>>>>
>>>>>> Based on the sheer speed of XMLSearch and Zettair those seem likely
>>>>>> candidates to inspect their design.
>>>>>>
>>>>>> On Dec 7, 2007, at 7:03 AM, Grant Ingersoll wrote:
>>>>>>
>>>>>>> Was wondering if people have seen
>>>>>>> http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf
>>>>>>>
>>>>>>> Has some interesting comparisons. Obviously, the comparison of
>>>>>>> Lucene indexing is done w/ 1.9 so it probably needs to be done
>>>>>>> again. Just wondering if people see any opportunities to improve
>>>>>>> Lucene from it. I am going to try and contact the authors to
>>>>>>> see if I can get what their setup values were (mergeFactor,
>>>>>>> Analyzer, etc.) as I think it would be interesting to run the
>>>>>>> tests again on 2.3.
>>>>>>>
>>>>>>> -Grant
Re: O/S Search Comparisons
Posted by Grant Ingersoll <gs...@apache.org>.
All true and good points. Lucene held up quite nicely in the search
aspect (at least perf. wise) and I generally don't think making these
kinds of comparisons is all that useful (we call it apples and oranges
in English :-) ).
What I am trying to get at is: if this paper were just about Lucene and
never mentioned a single other system, what, if anything, can we take
from it that can help us make Lucene better? I know, for instance,
from my own personal experience, that 2.3 is somewhere in the range of
3-5+ times faster than 2.2 (which I know is faster than 1.9). That
being said, the paper clearly states that Lucene was not capable of
doing the WT10g docs because performance degraded too much. Now, I
know Lucene is pretty darn capable of a lot of things and people are
using it to do web search, etc. at very large scales (I have
personally talked w/ people doing it). So, what I worry about is that
we are either:
a) missing something in our default setup,
b) missing something in our docs and our education efforts, or
c) missing some capability in our indexing such that it is
crashing.
Now, what is to be done? It may well be nothing, but I just want to
make sure we are comfortable with that decision or whether it is worth
asking for a volunteer who has access to the WT10g docs to go have a
look at it and see what happens. I personally don't have access to
these docs, otherwise I would try it out. What we don't want to
happen is for potential supporters/contributors to read that paper and
say "Lucene isn't for me because of this."
Sometimes, when something like this comes up, it gives you the
opportunity to take a step back and ask what are the things we really
want Lucene to be going forward (the New Year is good for this kind of
assessment as well). What are its strengths and weaknesses? What can
we improve in the short term and what needs to improve in the longer
term? Maybe it's just that time of year to send out your Lucene Wish
List... :-)
Cheers,
Grant
PS: Samir, any chance of contributing back your ranking
algorithms? :-)
On Dec 7, 2007, at 5:41 PM, Samir Abdou wrote:
> There is an expression in French, "comparer des pommes et des
> poires", which literally means "to compare apples and pears". That's
> what this paper is about. From my point of view, such a comparison
> would be interesting only if a cross analysis of different criteria
> (for example, retrieval effectiveness (aka search quality), search
> time, indexing time, index size, query language, index structure, and
> so on...) is done. Comparing different systems based on only one
> criterion is not well-grounded. There is always a kind of trade-off:
> for example, besides other parameters (ranking algorithm, frequency
> statistics, document structure, etc.), indexing with Zettair is much
> faster than indexing with Lucene, but if we consider search time,
> Lucene is better than Zettair. Why? There are many reasons, but
> probably Zettair doesn't have the complex document structure of
> Lucene, in addition to the different ranking algorithm (Okapi BM25
> vs. tf-idf). Some systems compute and store the scores at indexing
> time, which makes them faster at search time but less flexible if you
> want to change/implement a new ranking algorithm.
>
>> Still, when a well-respected researcher in the field says Lucene
>> didn't do so hot in certain areas,
>
> If we consider search quality, that's simply not true if we know how
> to implement in Lucene a popular ranking algorithm such as Okapi BM25
> (at least). I've been working with Lucene for four years now, and all
> the experiments for my thesis have been done using Lucene (with many
> adaptations to implement the most recent ranking algorithms, including
> different language models, divergence from randomness, etc.). I also
> participated in major IR campaigns (NTCIR, CLEF and TREC) and the
> results are not bad at all (see
> http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings5/data/CLIR/NTCIR5-OV-CLIR-KishidaK.pdf
> for NTCIR-5 or
> http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings6/NTCIR/NTCIR6-OVERVIEW.pdf
> for NTCIR-6; for CLEF have a look at
> http://www.clef-campaign.org/2006/working_notes/workingnotes2006/dinunzioOCLEF2006.pdf,
> ...). For other information, search the web ;-)
>
> Samir
>
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
RE: O/S Search Comparisons
Posted by Samir Abdou <Sa...@unine.ch>.
There is an expression in French, "comparer des pommes et des
poires", which literally means "to compare apples and pears". That's what
this paper is about. From my point of view, such a comparison would be
interesting only if a cross analysis of different criteria (for example,
retrieval effectiveness (aka search quality), search time, indexing time,
index size, query language, index structure, and so on...) is done.
Comparing different systems based on only one criterion is not
well-grounded. There is always a kind of trade-off: for example, besides
other parameters (ranking algorithm, frequency statistics, document
structure, etc.), indexing with Zettair is much faster than indexing with
Lucene, but if we consider search time, Lucene is better than Zettair. Why?
There are many reasons, but probably Zettair doesn't have the complex
document structure of Lucene, in addition to the different ranking
algorithm (Okapi BM25 vs. tf-idf).
Some systems compute and store the scores at indexing time, which makes them
faster at search time but less flexible if you want to change/implement a
new ranking algorithm.
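To make that trade-off concrete, here is a minimal sketch in plain Java
(not Lucene or Zettair code) that scores one term against one document
with two different formulas from the same raw statistics. The class name
RankingSketch, the constants K1 and B, and the particular tf-idf and BM25
variants are assumptions chosen for illustration; neither engine is
claimed to use exactly these forms. The point is only that if an index
keeps the raw statistics (tf, df, document length), the formula can be
swapped at search time, whereas a score baked in at indexing time cannot.

```java
public class RankingSketch {
    // Commonly used BM25 parameter values; assumed here, not taken from the paper.
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** One common tf-idf variant: sqrt(tf) * (1 + ln(N / (df + 1))). */
    public static double tfIdf(int tf, int df, int numDocs) {
        if (tf == 0) return 0.0;
        double idf = 1.0 + Math.log((double) numDocs / (df + 1));
        return Math.sqrt(tf) * idf;
    }

    /** Okapi BM25 for a single term, using a non-negative idf variant. */
    public static double bm25(int tf, int df, int numDocs, int docLen, double avgDocLen) {
        if (tf == 0) return 0.0;
        double idf = Math.log(1.0 + (numDocs - df + 0.5) / (df + 0.5));
        // Length normalization: longer-than-average documents are penalized.
        double norm = K1 * (1.0 - B + B * docLen / avgDocLen);
        return idf * (tf * (K1 + 1.0)) / (tf + norm);
    }

    public static void main(String[] args) {
        int numDocs = 1000, df = 50, docLen = 120;
        double avgDocLen = 100.0;
        // Same stored statistics, two ranking functions evaluated at search time.
        for (int tf : new int[] {1, 2, 5, 20}) {
            System.out.printf("tf=%d tfidf=%.3f bm25=%.3f%n",
                    tf, tfIdf(tf, df, numDocs), bm25(tf, df, numDocs, docLen, avgDocLen));
        }
    }
}
```

Note how the BM25 term contribution saturates as tf grows while the
tf-idf variant keeps growing with sqrt(tf); an engine that stored only
one of these precomputed values could not produce the other without
re-indexing.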
>> Still, when a well-respected researcher in the field says Lucene didn't do
>> so hot in certain areas,
If we consider search quality, that's simply not true if we know how to
implement in Lucene a popular ranking algorithm such as Okapi BM25 (at least).
I've been working with Lucene for four years now, and all the experiments
for my thesis have been done using Lucene (with many adaptations to
implement the most recent ranking algorithms, including different language
models, divergence from randomness, etc.). I also participated in major IR
campaigns (NTCIR, CLEF and TREC) and the results are not bad at all (see
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings5/data/CLIR/NTCIR5-OV-CLIR-KishidaK.pdf
for NTCIR-5 or
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings6/NTCIR/NTCIR6-OVERVIEW.pdf
for NTCIR-6; for CLEF have a look at
http://www.clef-campaign.org/2006/working_notes/workingnotes2006/dinunzioOCLEF2006.pdf,
...). For other information, search the web ;-)
Samir
> -----Original Message-----
> From: Mark Miller [mailto:markrmiller@gmail.com]
> Sent: Friday, December 7, 2007 21:01
> To: java-dev@lucene.apache.org
> Subject: Re: O/S Search Comparisons
>
> Yes, and even if they did not use the stock defaults, I would bet there
> would be complaints about what was done wrong at every turn. This seems
> like a very difficult thing to do. How long does it take to fully learn
> how to correctly utilize each search engine for the task at hand? I am
> sure longer than these busy men could possibly take. It seems that such
> a comparison could only be done legitimately if experts for each search
> engine set up the indexing/searching processes. Even then the results
> seem like they could be difficult to measure... e.g., was each search
> engine configured so that it would only break on spaces for indexing
> and do nothing else special at all? So many small settings, and so much
> knowledge, are needed to ensure each engine is on level ground...
>
> I doubt it will ever happen, but some sort of open source search off
> would be pretty cool <g>. Then each camp could properly configure their
> search engine for each task.
>
> - Mark
Re: O/S Search Comparisons
Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 7, 2007, at 3:01 PM, Mark Miller wrote:
> Yes, and even if they did not use the stock defaults, I would bet
> there would be complaints about what was done wrong at every turn.
> This seems like a very difficult thing to do. How long does it take
> to fully learn how to correctly utilize each search engine for the
> task at hand? I am sure longer than these busy men could possibly
> take. It seems that such a comparison could only be done
> legitimately if experts for each search engine set up the
> indexing/searching processes. Even then the results seem like they
> could be difficult to measure... e.g., was each search engine
> configured so that it would only break on spaces for indexing and do
> nothing else special at all? So many small settings, and so much
> knowledge, are needed to ensure each engine is on level ground...
This is why I have called on NIST/TREC to open source their
collections. Until then, Lucene and the other O/S search engines will
be reliant on those contributors who have access to them, which is
spotty at best. (And, yes, I know, TREC is not the be all, end all of
IR evaluations, but it is a common ground for doing research.) See
http://www.gossamer-threads.com/lists/lucene/java-dev/52022?search_string=TREC;#52022
-Grant
Re: O/S Search Comparisons
Posted by Mark Miller <ma...@gmail.com>.
Yes, and even if they did not use the stock defaults, I would bet there
would be complaints about what was done wrong at every turn. This seems
like a very difficult thing to do. How long does it take to fully learn
how to correctly utilize each search engine for the task at hand? I am
sure longer than these busy men could possibly take. It seems that such
a comparison could only be done legitimately if experts for each search
engine set up the indexing/searching processes. Even then the results
seem like they could be difficult to measure... e.g., was each search engine
configured so that it would only break on spaces for indexing and do
nothing else special at all? So many small settings, and so much knowledge,
are needed to ensure each engine is on level ground...
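As a small illustration of why that one setting matters, here is a
sketch in plain Java (no search engine involved) that tokenizes the same
text two ways: breaking only on whitespace, and lowercasing plus
splitting on anything that is not a letter or digit. The class
TokenizeSketch and both method names are hypothetical, made up for this
example; real engines each have their own analyzers, and those would
have to be matched by hand to get comparable indexes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenizeSketch {
    /** Break on runs of whitespace only; case and punctuation are kept as-is. */
    public static List<String> whitespaceOnly(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    /** Lowercase, then split on any run of characters that are not letters or digits. */
    public static List<String> lowercaseAlnum(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Lucene's O/S Search-Engine, v2.3!";
        // The two settings produce different term counts and different terms,
        // so index size, indexing time and query matches all shift with them.
        System.out.println(whitespaceOnly(text));
        System.out.println(lowercaseAlnum(text));
    }
}
```

The first method yields four terms for the sample string and the second
yields eight, so two engines indexing "the same" collection under these
two settings are not really indexing the same thing.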
I doubt it will ever happen, but some sort of open source search off
would be pretty cool <g>. Then each camp could properly configure their
search engine for each task.
- Mark
Mike Klaas wrote:
> There is a good chance that they were using stock indexing defaults,
> based on:
>
> Lucene:
> " In the present work, the simple applications
> bundled with the library were used to index the collection. "
Re: O/S Search Comparisons
Posted by Mike Klaas <mi...@gmail.com>.
There is a good chance that they were using stock indexing defaults,
based on:
Lucene:
" In the present work, the simple applications
bundled with the library were used to index the collection. "
On 7-Dec-07, at 10:27 AM, Grant Ingersoll wrote:
> Yeah, I wasn't too excited over it and I certainly didn't lose any
> sleep over it, but there are some interesting things of note in
> there concerning Lucene, including the claim that it fell over on
> indexing WT10g docs (page 40) and I am always looking for ways to
> improve things. Overall, I think Lucene held up pretty well in the
> evaluation, and I know how suspect _any_ evaluation is given the
> myriad ways of doing search. Still, when a well-respected
> researcher in the field says Lucene didn't do so hot in certain
> areas, I don't think we can dismiss them out of hand. So
> regardless of the tests being right or wrong, they are worth either
> addressing the failures in Lucene or the failures in the test such
> that we make sure we are properly educating our users on how best
> to use Lucene.
>
> I emailed the authors asking for information on how the test was
> run etc., so we'll see if anything comes of it.
Re: O/S Search Comparisons
Posted by Grant Ingersoll <gs...@apache.org>.
Yeah, I wasn't too excited over it and I certainly didn't lose any
sleep over it, but there are some interesting things of note in there
concerning Lucene, including the claim that it fell over on indexing
WT10g docs (page 40) and I am always looking for ways to improve
things. Overall, I think Lucene held up pretty well in the
evaluation, and I know how suspect _any_ evaluation is given the
myriad ways of doing search. Still, when a well-respected researcher
in the field says Lucene didn't do so hot in certain areas, I don't
think we can dismiss them out of hand. So regardless of the tests
being right or wrong, they are worth either addressing the failures in
Lucene or the failures in the test such that we make sure we are
properly educating our users on how best to use Lucene.
I emailed the authors asking for information on how the test was run
etc., so we'll see if anything comes of it.
On Dec 7, 2007, at 12:04 PM, robert engels wrote:
> I wouldn't get too excited over this. Once again, it does not seem
> the evaluator understands the nature of GC-based systems, and the
> memory statistics are quite out of whack. But it is hard to tell
> because there is no data on how memory consumption was actually
> measured.
>
> A far better way of measuring memory consumption is to cap the
> process at different levels (max ram sizes), and compare the
> performance at each level.
>
> There is also the fact that a process takes memory from the disk
> cache, and vice versa, which heavily affects search performance, etc.
>
> Since there is no detailed data (that I could find) about system
> configuration, etc. the results are highly suspect.
>
> There is also no mention of performance on multi-processor systems.
> Some systems (like Lucene) pay a penalty to support multi-processing
> (both in Java and Lucene), and only realize this benefit when
> operating in a multi-processor environment.
>
> Based on the sheer speed of XMLSearch and Zettair, those seem like
> good candidates whose design is worth inspecting.
>
> On Dec 7, 2007, at 7:03 AM, Grant Ingersoll wrote:
>
>> Was wondering if people have seen http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf
>>
>> Has some interesting comparisons. Obviously, the comparison of
>> Lucene indexing is done w/ 1.9 so it probably needs to be done
>> again. Just wondering if people see any opportunities to improve
>> Lucene from it. I am going to try and contact the authors to see
>> if I can get what their setup values were (mergeFactor, Analyzer,
>> etc.) as I think it would be interesting to run the tests again on
>> 2.3.
>>
>> -Grant
Re: O/S Search Comparisons
Posted by robert engels <re...@ix.netcom.com>.
I wouldn't get too excited over this. Once again, it does not seem
the evaluator understands the nature of GC-based systems, and the
memory statistics are quite out of whack. But it is hard to tell
because there is no data on how memory consumption was actually
measured.
A far better way of measuring memory consumption is to cap the
process at different levels (max ram sizes), and compare the
performance at each level.
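As a rough sketch of that capped-memory approach (not taken from the
paper or from anyone's actual test setup), a small Java harness could
rerun the same workload under several fixed heap sizes and report the
wall-clock time for each. The class name HeapCapSweep, the list of cap
sizes, and the workload main class passed on the command line are all
hypothetical; -Xms and -Xmx are the standard JVM heap options. Note
that this caps only the Java heap, not the OS disk cache.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical harness: runs one workload under several fixed heap caps.
class HeapCapSweep {

    /** Builds the command line for one run with the heap pinned to heapMb. */
    static List<String> buildCommand(int heapMb, String classPath, String mainClass) {
        List<String> cmd = new ArrayList<>();
        cmd.add(System.getProperty("java.home") + "/bin/java");
        cmd.add("-Xms" + heapMb + "m"); // initial heap == max heap, so the cap is fixed
        cmd.add("-Xmx" + heapMb + "m");
        cmd.add("-cp");
        cmd.add(classPath);
        cmd.add(mainClass);
        return cmd;
    }

    /** Runs the command; returns elapsed milliseconds, or -1 on a non-zero exit. */
    static long timeRun(List<String> cmd) throws IOException, InterruptedException {
        long start = System.nanoTime();
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        int exit = p.waitFor();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        return exit == 0 ? elapsedMs : -1;
    }

    public static void main(String[] args) throws Exception {
        String classPath = args[0]; // classpath of the workload (assumption)
        String mainClass = args[1]; // e.g. an indexing or search benchmark class
        int[] capsMb = {64, 128, 256, 512, 1024}; // illustrative levels only
        for (int cap : capsMb) {
            long ms = timeRun(buildCommand(cap, classPath, mainClass));
            String result = ms < 0 ? "failed (possibly out of memory)" : ms + " ms";
            System.out.println(cap + " MB heap: " + result);
        }
    }
}
```

A run that fails at a low cap is itself a useful data point, which is
why the sketch reports the failure instead of aborting the sweep.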
There is also the fact that a process takes memory from the disk
cache, and vice versa, which heavily affects search performance, etc.
Since there is no detailed data (that I could find) about system
configuration, etc. the results are highly suspect.
There is also no mention of performance on multi-processor systems.
Some systems (like Lucene) pay a penalty to support multi-processing
(both in Java and Lucene), and only realize this benefit when
operating in a multi-processor environment.
Based on the sheer speed of XMLSearch and Zettair, those seem like
good candidates whose design is worth inspecting.
On Dec 7, 2007, at 7:03 AM, Grant Ingersoll wrote:
> Was wondering if people have seen http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf
>
> Has some interesting comparisons. Obviously, the comparison of
> Lucene indexing is done w/ 1.9 so it probably needs to be done
> again. Just wondering if people see any opportunities to improve
> Lucene from it. I am going to try and contact the authors to see
> if I can get what their setup values were (mergeFactor, Analyzer,
> etc.) as I think it would be interesting to run the tests again on
> 2.3.
>
> -Grant