Posted to java-user@lucene.apache.org by hu...@gmail.com on 2006/08/08 08:05:36 UTC

Poor performance "race condition" in FieldSortedHitQueue

Hey all, just want to run an issue that I've recently identified while
looking at some performance issues we are having with our larger
indexes past you all.

Basically what we are seeing is that when there are a number of
concurrent searches being executed over a new IndexSearcher, the quite
expensive ScoreDocComparator generation that is done in the
FieldSortedHitQueue#getCachedComparator method ends up executing
multiple times rather than the ideal case of once. This issue does not
affect the correctness of the searches, only performance.

From my relatively weak understanding of the code, the core of this
issue appears to lie with the FieldCacheImpl#getStringIndex method
which allows multiple concurrent requests to each generate their own
StringIndex rather than allowing the first request to do the
generation and then blocking subsequent requests until the first
request has finished.
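
To make the shape of the problem concrete (a toy sketch only, not
Lucene's actual code - buildStringIndex() here just stands in for the
expensive work): the get and the put are individually safe, but the
check-then-build sequence as a whole is not, so every thread that
arrives before the first one finishes sees a miss and builds its own
copy:

    Map cache = Collections.synchronizedMap(new HashMap());

    Object getCached(Object key) throws IOException {
        Object value = cache.get(key);     // threads A and B both see null...
        if (value == null) {
            value = buildStringIndex(key); // ...so both do the expensive build
            cache.put(key, value);         // last writer wins, work is wasted
        }
        return value;
    }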

Is this a known problem? Should I raise this as an issue, or is this
"expected" behaviour? A solution would naturally require more
synchronization than is currently used, but nothing particularly
complex.

Thanks,

Oliver



Re: Poor performance "race condition" in FieldSortedHitQueue

Posted by Paul Smith <ps...@aconex.com>.
On 09/08/2006, at 12:47 PM, Yonik Seeley wrote:

> The nature of the field cache itself means that the first sort on a
> particular field can take a long, long time.  Synchronization won't
> really help that much.
>

I'm not so sure I agree with that.  If you have, say, 4 threads
concurrently starting a search on a cold index, they will _all_
effectively warm the searcher, chewing up CPU and disk, which
may be better utilised by other threads.  Wouldn't it be better for 1
thread to do the warming while the others block waiting?

The option to warm up the index before making it available to
concurrent searches is effectively the same thing as this.  I would
have thought it would be nicer to have it part of the search
mechanism rather than rely on coders to constantly build that
warming thread into their application.

My 5 Australian cents (currently 3.75 US cents).

Paul Smith

Re: Poor performance "race condition" in FieldSortedHitQueue

Posted by Yonik Seeley <yo...@apache.org>.
On 8/8/06, Oliver Hutchison <oh...@aconex.com> wrote:
> > The nature of the field cache itself means that the first
> > sort on a particular field can take a long, long time.
> > Synchronization won't really help that much.
>
> I think you may be misunderstanding my description (probably because it was
> not clear enough :). The issue is not that the first search is going to take
> a while as this is clearly unavoidable.

Sorry, I understood your problem perfectly, I just wasn't clear on
what I was saying.

My point was that for many uses, even the first-sorted-search delay is
not acceptable (and fixing multiple threads trying to fill the same
cache entry wouldn't solve that).  The warming-in-the-background that
Solr currently uses solves both.

For those who can't warm in the background though, synchronizing
per-fieldcache entry would probably be a good idea.
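
Roughly like this (a sketch only - the names are illustrative, not
Solr's actual API):

    private volatile IndexSearcher live; // the searcher serving queries

    void openAndWarm(Directory dir, Query warmQuery, Sort sort) throws IOException {
        IndexSearcher warming = new IndexSearcher(IndexReader.open(dir));
        warming.search(warmQuery, sort); // fills the FieldCache before going live
        IndexSearcher old = live;
        live = warming;   // new requests now hit the warmed searcher
        if (old != null) {
            old.close();  // unsafe if requests are still in flight; real code
        }                 // needs reference counting or a draining period
    }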

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server



RE: Poor performance "race condition" in FieldSortedHitQueue

Posted by Oliver Hutchison <oh...@aconex.com>.
> The nature of the field cache itself means that the first 
> sort on a particular field can take a long, long time.  
> Synchronization won't really help that much.

I think you may be misunderstanding my description (probably because it was
not clear enough :). The issue is not that the first search is going to take
a while as this is clearly unavoidable. The issue I'm seeing is that when
there are a number of concurrent searches that start executing before the
cache has been populated they *all* end up doing the very expensive
ScoreDocComparator generation rather than just one of them doing the
generation and the rest simply blocking until that one is done. The more
concurrent searches, and the longer the generation takes, the worse the
effect becomes.

> There are two ways around this...
> 2) warm searchers in the background before exposing them to 
> live queries (the approach Solr takes).

This is basically how we are working around this issue: we don't actually
pre-warm the search results, as we don't have a window in which to do this,
but we do synchronize the FieldSortedHitQueue cache generation so it never
gets executed more than once per index reader:

    private final Set<String> primedSortFields = new HashSet<String>();

    protected void primeCache(Sort sort) throws IOException {
        // This synchronized block allows us to be sure that a given sort field
        // is only primed once per searcher, rather than the multiple times
        // Lucene may prime the field if left to its own devices (something we
        // *really* want to avoid for big indexes).
        synchronized (primedSortFields) {
            SortField[] sortFields = sort.getSort();
            for (int i = 0; i < sortFields.length; i++) {
                SortField sortField = sortFields[i];
                if (!primedSortFields.contains(sortField.getField())) {
                    primedSortFields.add(sortField.getField());
                    // Constructing a throwaway FieldSortedHitQueue populates
                    // the FieldCache entry for this field as a side effect.
                    new FieldSortedHitQueue(getIndexReader(),
                            new SortField[] { sortField }, 0);
                }
            }
        }
    }
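
We then call it before every sorted search, roughly like this ("searcher"
being whatever IndexSearcher wraps getIndexReader()):

    primeCache(sort);                         // blocks while another thread primes
    Hits hits = searcher.search(query, sort); // FieldCache is now warm for 'sort'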

Obviously this is not ideal.

Thanks, 

Oliver







Re: Poor performance "race condition" in FieldSortedHitQueue

Posted by Yonik Seeley <yo...@apache.org>.
The nature of the field cache itself means that the first sort on a
particular field can take a long, long time.  Synchronization won't
really help that much.

There are two ways around this...
1) incrementally generate the field cache (hard... not currently
supported by Lucene)
2) warm searchers in the background before exposing them to live
queries (the approach Solr takes).

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


On 8/8/06, hutchiko@gmail.com <hu...@gmail.com> wrote:
> Hey all, just want to run an issue that I've recently identified while
> looking at some performance issues we are having with our larger
> indexes past you all.
>
> Basically what we are seeing is that when there are a number of
> concurrent searches being executed over a new IndexSearcher, the quite
> expensive ScoreDocComparator generation that is done in the
> FieldSortedHitQueue#getCachedComparator method ends up executing
> multiple times rather than the ideal case of once. This issue does not
> affect the correctness of the searches, only performance.
>
> From my relatively weak understanding of the code, the core of this
> issue appears to lie with the FieldCacheImpl#getStringIndex method
> which allows multiple concurrent requests to each generate their own
> StringIndex rather than allowing the first request to do the
> generation and then blocking subsequent requests until the first
> request has finished.
>
> Is this a known problem? Should I raise this as an issue, or is this
> "expected" behaviour? A solution would naturally require more
> synchronization than is currently used but nothing particularly
> complex.
>
> Thanks,
>
> Oliver



Re: Poor performance "race condition" in FieldSortedHitQueue

Posted by Doron Cohen <DO...@il.ibm.com>.
yseeley@gmail.com wrote on 09/08/2006 11:22:12:
> Assuming "field" wasn't being used to synchronize on something else,
> this would still block *all* IndexReaders/Searchers trying to sort on
> that field.
>
> In Solr, it would make the situation worse.  If I had my warmed-up
> IndexSearcher serving live requests, and a new Searcher is opened in
> the background to be warmed, a getStringIndex(warming,"foo") would
> also block all getStringIndex(live,"foo").

Right, this is what I had in mind with "by-field (and by-reader)" (a few
lines further).

- Doron




Re: Poor performance "race condition" in FieldSortedHitQueue

Posted by Yonik Seeley <yo...@apache.org>.
On 8/9/06, Doron Cohen <DO...@il.ibm.com> wrote:
>   public StringIndex getStringIndex (IndexReader reader, String field)
>   throws IOException {
>     field = field.intern();
>     synchronized (field) {  // <----------- line added
>       Object ret = lookup (reader, field, STRING_INDEX, null);
>       if (ret == null) {
>          final int[] retArray = new int[reader.maxDoc()];
>          ... load field to cache ...
>       }

Assuming "field" wasn't being used to synchronize on something else,
this would still block *all* IndexReaders/Searchers trying to sort on
that field.

In Solr, it would make the situation worse.  If I had my warmed-up
IndexSearcher serving live requests, and a new Searcher is opened in
the background to be warmed, a getStringIndex(warming,"foo") would
also block all getStringIndex(live,"foo").
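
To avoid that, the lock would have to be keyed on the (reader, field)
pair rather than on the interned field string - a sketch (illustrative
only, not a tested patch):

    private final Map locks = new HashMap(); // guarded by its own monitor

    private Object lockFor(IndexReader reader, String field) {
        List key = Arrays.asList(new Object[] { reader, field });
        synchronized (locks) {
            Object lock = locks.get(key);
            if (lock == null) {
                lock = new Object();
                locks.put(key, lock); // a real version would evict this entry
            }                         // when the reader is closed
            return lock;
        }
    }

    // ...and then in getStringIndex:
    // synchronized (lockFor(reader, field)) { ...lookup, build on miss... }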

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server



Re: 30 million+ docs on a single server

Posted by Chris Hostetter <ho...@fucit.org>.
: Frustrated is the word :) I have looked at Solr...what I am worried
: about there is this: Solr says it requires an OS that supports hard
: links. Currently Windows does not to my knowledge. Someone seemed to
: make a comment that Windows could be supported...from what I know I
: don't think so. Not a deal breaker per se, but then there is this: I

Solr does not require hardlinks; what the FAQ says is...

  "The Replication features of Solr currently require an OS with the
  ability to create hard links and rsync."

...which means if you want to use the replication system provided with
Solr as is, you need hardlinks and rsync.  Solr is designed with
replication as a very external portion of the system (it's just executing
shell calls specified in a config file) so it should be possible to plug in
a different replication system and use the existing hooks for generating
snapshots on the master and loading snapshots on the slave ... it just
hasn't been a priority.

: have done a lot with the lucene API. I have created a custom query
: language for the lucene query parser. I have changed the standard parser. I
: have made heavy use of Multi-Searchers. I am really tied into the Lucene
: API. I am worried about how easy it will be to integrate that into Solr.

Anything you do at "query time" with the Lucene API can be done in
a SolrRequestHandler (which you write in Java and register in the solr
config file) -- change just a few method calls and you'll get a lot of
great caching features as well.

none of which really addresses the crux of your question....

: Can I index 30 million+ docs that range in size from 2-10kb on a single
: server in a Windows environment (access to a max of about 1.5 gig of
: RAM)? The average search will need to be sorted by field, not relevancy.

I can't say that I've personally built/used a lucene index of 30 million
docs, but I have talked to people who have done it .. they certainly had
some performance issues, but those issues were mainly related to the
volume of queries they got, not so much the size of their index.  That
said: you are seriously hindering yourself with the Windows/RAM limits (the
FieldCache for your sort field (assuming it's an int) alone will be over
100MB), not to mention the fact that your index isn't static, so creating
a new searcher after you've made updates essentially halves the amount of
usable RAM you have to work with, unless you're willing to close one
searcher before you open the new one.
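
For concreteness, the arithmetic behind that estimate (assuming a plain
int per document, as above):

    long maxDoc = 30000000L;            // 30 million docs
    long fieldCacheBytes = maxDoc * 4L; // one 4-byte int per doc
    // = 120,000,000 bytes, i.e. roughly 114MB for a single sort field --
    // and a StringIndex costs more, since it also holds the term values.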

I haven't played with Remote/Multi Searchers, but perhaps you should
open yourself up to the possibility of partitioning your index on
several boxes.



-Hoss




Re: 30 million+ docs on a single server

Posted by Mark Miller <ma...@gmail.com>.
Frustrated is the word :) I have looked at Solr...what I am worried 
about there is this: Solr says it requires an OS that supports hard 
links. Currently Windows does not to my knowledge. Someone seemed to 
make a comment that Windows could be supported...from what I know I 
don't think so. Not a deal breaker per se, but then there is this: I
have done a lot with the lucene API. I have created a custom query
language for the lucene query parser. I have changed the standard parser. I
have made heavy use of Multi-Searchers. I am really tied into the Lucene 
API. I am worried about how easy it will be to integrate that into Solr. 
Perhaps I can just grab the distributed part of Solr but I do not know. 
I have so much to do that worrying about a distributed search seems like 
too big a scope for now. It seemed to me that breaking up the index with 
an RMI searcher was the easiest approach anyway. In the end...I would 
really like to stay on one server. This server will prob have multiple 
procs...should I make sure I incorporate a parallel searcher option?

In the end I am really just hoping for some more insight into this exact 
question:

Can I index 30 million+ docs that range in size from 2-10kb on a single
server in a Windows environment (access to a max of about 1.5 gig of
RAM)? The average search will need to be sorted by field, not relevancy.

Do you think it's possible or a pipe dream? I realize I need to test to
find out...but I am looking for someone with experience to pipe in 
before I get to that point.

Thanks for the response so far...I love the lucene mailing list.

Thanks,
Mark


Ray Tsang wrote:
> i've indexed 80m records and now up to 200m.. it can be done, and could've
> been done better.  like the other said, architecture is important.  have you
> considered looking into solr?  i haven't kept up with it (and many of the
> mailing lists...), but looks very interesting.
>
> ray,
>
> On 8/12/06, Jason Polites <ja...@gmail.com> wrote:
>>
>> Sounds like you're a bit frustrated.  Cheer up, the simple fact is that
>> engineering and business rarely see eye-to-eye.  Just focus on the fact that
>> what you have learnt from the process will help you, and they paid for it ;)
>>
>> On the issue at hand...Lucene should scale to this level, but you need a
>> good architecture behind it.  Google has good indexing tech, but it's their
>> architecture that allows them to spread the index across thousands of
>> servers which really gives it grunt (to the point that they invented their
>> own RAID-style file system).
>>
>> Just think very carefully about the architecture underpinning the index.
>> Lucene is core-tech.  It's up to you to provide the framework to make it
>> hum.
>>
>> On 8/12/06, Mark Miller <ma...@gmail.com> wrote:
>> >
>> > Tomi NA wrote:
>> > > On 8/12/06, Mark Miller <ma...@gmail.com> wrote:
>> > >> I've made a nice little archive application with lucene. I made it to
>> > >> handle our largest need: 2.5 million docs or so on a single server. Now
>> > >> the powers that be say: let's use it for a 30+ million document archive
>> > >> on a single server! (each doc size maybe 10k max...as small as a 1 or
>> > >> 2k) Please tell me why we are in trouble...please tell me why we are
>> > >> not. I have tested up to 2 million docs without much trouble but 30
>> > >> million...the average search will include a sort on a field as
>> > >> well...can I search 30+ million docs with a sort? Man am I worried about
>> > >> that. Maybe the server will have 8 procs and 12 billion gigs of RAM.
>> > >> Maybe. Even still, Tomcat seems to be able to launch with a max of 1.5
>> > >> or 1.6 gig of RAM in Windows. What do you think? 30 million+ sounds like
>> > >> too much of a load to me for a single server. Not that they care what I
>> > >> think...I only wrote the thing (man I hate my job, offer me a new one :)
>> > >> )...please...comments?
>> > >>
>> > >> Cheers,
>> > >>
>> > >> Miserable Mark
>> > >
>> > > I don't really understand what you're so worried about. Either it'll
>> > > work well with the setup you have, or it won't. It's really the size
>> > > of it. ;)
>> > > Seriously, you have a number of relatively cheap possibilities at hand
>> > > to improve search performance: storing the index on a RAID 5 disk
>> > > array will let you read the indices very fast, using multicore CPUs,
>> > > adding memory and even if all that isn't good enough, you can always
>> > > use a small cluster (say, 4 nodes) of very, very inexpensive PCs
>> > > filled with a GB of RAM. You don't have to keep them inside the
>> > > regular UPS/backup/vault-protected area as the indices can always be
>> > > rebuilt (unlike e.g. data in transactional systems) and between 4 of
>> > > them they might cost like an entry-level server.
>> > > I'll let the experts speak now. :)
>> > >
>> > > t.n.a.
>> > >
>> > >
>> > Thanks for the tip...I am not too worried...I am miserable because I
>> > live in Dilbert land, not this particular incident. Spreading to
>> > multiple servers is a possibility but one I want to avoid...I wrote this
>> > app on the side since our current product is crap...it still needs a lot
>> > of work and thinking about distributing lucene at this point is a little
>> > much...I never even have time to work on this project as it is because I
>> > am currently tasked with porting the crap old project to Windows. I need
>> > to do a bunch to shore up what I have. No one cares though...they think
>> > that I have done nothing (or can't understand what I have done) while at
>> > the same time they want to use what I haven't done to do what I made it
>> > for as well as this new super archive of 30 million + docs...in the end
>> > I'll be looking for a new job...still curious about lucene scaling to 30
>> > million docs with a sort on every search though (yes I know the sort is
>> > cached...worries me too though...the sort will be on multiple and
>> > different fields depending on what the searcher wants...uggg...the size
>> > of the caches....)
>> >
>> >
>>
>>
>




Re: 30 million+ docs on a single server

Posted by Ray Tsang <sa...@gmail.com>.
i've indexed 80m records and now up to 200m.. it can be done, and could've
been done better.  like the other said, architecture is important.  have you
considered looking into solr?  i haven't kept up with it (and many of the
mailing lists...), but looks very interesting.

ray,

On 8/12/06, Jason Polites <ja...@gmail.com> wrote:
>
> Sounds like you're a bit frustrated.  Cheer up, the simple fact is that
> engineering and business rarely see eye-to-eye.  Just focus on the fact
> that
> what you have learnt from the process will help you, and they paid for it
> ;)
>
> On the issue at hand...Lucene should scale to this level, but you need a
> good architecture behind it.  Google has good indexing tech, but it's
> their
> architecture that allows them to spread the index across thousands of
> servers which really gives it grunt (to the point that they invented their
> own RAID-style file system).
>
> Just think very carefully about the architecture underpinning the index.
> Lucene is core-tech.  It's up to you to provide the framework to make it
> hum.
>
> On 8/12/06, Mark Miller <ma...@gmail.com> wrote:
> >
> > Tomi NA wrote:
> > > On 8/12/06, Mark Miller <ma...@gmail.com> wrote:
> > >> I've made a nice little archive application with lucene. I made it to
> > >> handle our largest need: 2.5 million docs or so on a single server. Now
> > >> the powers that be say: let's use it for a 30+ million document archive
> > >> on a single server! (each doc size maybe 10k max...as small as a 1 or
> > >> 2k) Please tell me why we are in trouble...please tell me why we are
> > >> not. I have tested up to 2 million docs without much trouble but 30
> > >> million...the average search will include a sort on a field as
> > >> well...can I search 30+ million docs with a sort? Man am I worried about
> > >> that. Maybe the server will have 8 procs and 12 billion gigs of RAM.
> > >> Maybe. Even still, Tomcat seems to be able to launch with a max of 1.5
> > >> or 1.6 gig of RAM in Windows. What do you think? 30 million+ sounds like
> > >> too much of a load to me for a single server. Not that they care what I
> > >> think...I only wrote the thing (man I hate my job, offer me a new one :)
> > >> )...please...comments?
> > >>
> > >> Cheers,
> > >>
> > >> Miserable Mark
> > >
> > > I don't really understand what you're so worried about. Either it'll
> > > work well with the setup you have, or it won't. It's really the size
> > > of it. ;)
> > > Seriously, you have a number of relatively cheap possibilities at hand
> > > to improve search performance: storing the index on a RAID 5 disk
> > > array will let you read the indices very fast, using multicore CPUs,
> > > adding memory and even if all that isn't good enough, you can always
> > > use a small cluster (say, 4 nodes) of very, very inexpensive PCs
> > > filled with a GB of RAM. You don't have to keep them inside the
> > > regular UPS/backup/vault-protected area as the indices can always be
> > > rebuilt (unlike e.g. data in transactional systems) and between 4 of
> > > them they might cost like an entry-level server.
> > > I'll let the experts speak now. :)
> > >
> > > t.n.a.
> > >
> > >
> > >
> > Thanks for the tip...I am not too worried...I am miserable because I
> > live in Dilbert land, not this particular incident. Spreading to
> > multiple servers is a possibility but one I want to avoid...I wrote this
> > app on the side since our current product is crap...it still needs a lot
> > of work and thinking about distributing lucene at this point is a little
> > much...I never even have time to work on this project as it is because I
> > am currently tasked with porting the crap old project to Windows. I need
> > to do a bunch to shore up what I have. No one cares though...they think
> > that I have done nothing (or can't understand what I have done) while at
> > the same time they want to use what I haven't done to do what I made it
> > for as well as this new super archive of 30 million + docs...in the end
> > I'll be looking for a new job...still curious about lucene scaling to 30
> > million docs with a sort on every search though (yes I know the sort is
> > cached...worries me too though...the sort will be on multiple and
> > different fields depending on what the searcher wants...uggg...the size
> > of the caches....)
> >
> >
> >
>
>

Re: 30 million+ docs on a single server

Posted by Jason Polites <ja...@gmail.com>.
Sounds like you're a bit frustrated.  Cheer up, the simple fact is that
engineering and business rarely see eye-to-eye.  Just focus on the fact that
what you have learnt from the process will help you, and they paid for it ;)

On the issue at hand...Lucene should scale to this level, but you need a
good architecture behind it.  Google has good indexing tech, but it's their
architecture that allows them to spread the index across thousands of
servers which really gives it grunt (to the point that they invented their
own RAID-style file system).

Just think very carefully about the architecture underpinning the index.
Lucene is core-tech.  It's up to you to provide the framework to make it
hum.

On 8/12/06, Mark Miller <ma...@gmail.com> wrote:
>
> Tomi NA wrote:
> > On 8/12/06, Mark Miller <ma...@gmail.com> wrote:
> >> I've made a nice little archive application with lucene. I made it to
> >> handle our largest need: 2.5 million docs or so on a single server. Now
> >> the powers that be say: let's use it for a 30+ million document archive
> >> on a single server! (each doc size maybe 10k max...as small as a 1 or
> >> 2k) Please tell me why we are in trouble...please tell me why we are
> >> not. I have tested up to 2 million docs without much trouble but 30
> >> million...the average search will include a sort on a field as
> >> well...can I search 30+ million docs with a sort? Man am I worried about
> >> that. Maybe the server will have 8 procs and 12 billion gigs of RAM.
> >> Maybe. Even still, Tomcat seems to be able to launch with a max of 1.5
> >> or 1.6 gig of RAM in Windows. What do you think? 30 million+ sounds like
> >> too much of a load to me for a single server. Not that they care what I
> >> think...I only wrote the thing (man I hate my job, offer me a new one :)
> >> )...please...comments?
> >>
> >> Cheers,
> >>
> >> Miserable Mark
> >
> > I don't really understand what you're so worried about. Either it'll
> > work well with the setup you have, or it won't. It's really the size
> > of it. ;)
> > Seriously, you have a number of relatively cheap possibilities at hand
> > to improve search performance: storing the index on a RAID 5 disk
> > array will let you read the indices very fast, using multicore CPUs,
> > adding memory and even if all that isn't good enough, you can always
> > use a small cluster (say, 4 nodes) of very, very inexpensive PCs
> > filled with a GB of RAM. You don't have to keep them inside the
> > regular UPS/backup/vault-protected area as the indices can always be
> > rebuilt (unlike e.g. data in transactional systems) and between 4 of
> > them they might cost like an entry-level server.
> > I'll let the experts speak now. :)
> >
> > t.n.a.
> >
> >
> >
> Thanks for the tip...I am not too worried...I am miserable because I
> live in Dilbert land, not this particular incident. Spreading to
> multiple servers is a possibility but one I want to avoid...I wrote this
> app on the side since our current product is crap...it still needs a lot
> of work and thinking about distributing lucene at this point is a little
> > much...I never even have time to work on this project as it is because I
> am currently tasked with porting the crap old project to Windows. I need
> to do a bunch to shore up what I have. No one cares though...they think
> that I have done nothing (or can't understand what I have done) while at
> > the same time they want to use what I haven't done to do what I made it
> for as well as this new super archive of 30 million + docs...in the end
> I'll be looking for a new job...still curious about lucene scaling to 30
> million docs with a sort on every search though (yes I know the sort is
> cached...worries me too though...the sort will be on multiple and
> > different fields depending on what the searcher wants...uggg...the size
> of the caches....)
>
>
>

Re: 30 million+ docs on a single server

Posted by Mark Miller <ma...@gmail.com>.
Thanks for all of the useful info on this topic. You have been very 
enlightening. My RAM requirements were obviously off the mark. Here is
my current understanding of this issue:

A standard 32-bit processor has access to 4GB of RAM. If your CPU 
supports Physical Address Extension (PAE) the OS can access up to 64GB 
of RAM, although a single application is still limited to 4GB. In 
Windows, Address Windowing Extensions (AWE) solves this limitation but 
you must use OS-specific calls to manage your memory. Unix/Linux has
its own memory mapping scheme that accomplishes something similar to AWE.

I have no problem requiring that the server I use be maxed to the
gills. I just want to support Windows as well as Unix/Linux/Sun. I was 
shortsighted in my RAM comments.

As far as Solr: it looks like this will not necessarily solve my problem 
of handling an index so big on a single server--the entire index is 
replicated across each slave searcher--but if load is a factor 
(according to Hoss, load will probably be the big issue, not the size of 
the index) then this distribution is exactly what I need. I think that
the average load I can expect will be pretty low though. Updates to the 
large index will also be relatively rare. 30-200 items added every 
morning with the occasional update scattered throughout the day. There 
will only be a few hundred searchers in total and they will not be 
hammering the system but consulting it here and there throughout the day.

When I said that Solr required hardlinks I was referring to the
replication feature. My fault--replication was the benefit I was seeing
from Solr and I was unclear. I bet Solr's caching helps a lot too and I
will be looking into that. What I did not know was that Windows supports
hard links when using NTFS. Fsutil.exe creates them. This gives me hope
that Cygwin might support them. I do not know if I can use the same cp
-l -r trick, but at least it looks hopeful. I did not realize that
Solr's replication feature was so external and easily changeable.

As far as RMI: splitting the index across several boxes seems relatively
easy (the index can be naturally partitioned without much trouble) but I
would still have to deal with sorting on a single box.  I am not so
worried about that anymore though. Warming a new searcher and tons of
RAM should handle my sorts in an acceptable way.

It seems to me that one way or another I can make this happen.

I am wondering whether I should try to use a RAM directory. While I
am sure it might be faster, what happens when the power goes out? System
reboots? I would have to occasionally flush to disk anyway. Would it be
better to use a normal directory and turn maxBufferedDocs way up?
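
(The knob I mean is something like this - a sketch, assuming the Lucene
2.0-era API, with org.apache.lucene.store.FSDirectory and friends:)

    // Keep the index on disk, but buffer more added documents in RAM
    // before each segment flush:
    IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/path/to/index", false),
            new StandardAnalyzer(), false);
    writer.setMaxBufferedDocs(1000); // tune against the available heap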


Thanks,

- Mark



RE: 30 million+ docs on a single server

Posted by Dejan Nenov <de...@jollyobject.com>.
The important detail here is what you mean by "single server". A high-end
server will work just fine - you want 4GB+ of RAM and the fastest
disk/IO you can get; CPU speed is far less important. A nice Linux software
RAID and 5+ 15K SCSI disks will get you superb performance, at a reasonable
price.



-----Original Message-----
From: Mark Miller [mailto:markrmiller@gmail.com] 
Sent: Friday, August 11, 2006 4:23 PM
To: java-user@lucene.apache.org
Subject: Re: 30 milllion+ docs on a single server

Tomi NA wrote:
> On 8/12/06, Mark Miller <ma...@gmail.com> wrote:
>> I've made a nice little archive application with lucene. I made it to
>> handle our largest need: 2.5 million docs or so on a single server. Now
>> the powers that be say: let's use it for a 30+ million document archive
>> on a single server! (each doc size maybe 10k max...as small as a 1 or
>> 2k) Please tell me why we are in trouble...please tell me why we are
>> not. I have tested up to 2 million docs without much trouble but 30
>> million...the average search will include a sort on a field as
>> well...can I search 30+ million docs with a sort? Man am I worried about
>> that. Maybe the server will have 8 procs and 12 billion gigs of RAM.
>> Maybe. Even still, Tomcat seems to be able to launch with a max of 1.5
>> or 1.6 gig of RAM in Windows. What do you think? 30 million+ sounds like
>> too much of a load to me for a single server. Not that they care what I
>> think...I only wrote the thing (man I hate my job, offer me a new one :)
>> )...please...comments?
>>
>> Cheers,
>>
>> Miserable Mark
>
> I don't really understand what you're so worried about. Either it'll
> work well with the setup you have, or it won't. It's really the size
> of it. ;)
> Seriously, you have a number of relatively cheap possibilities at hand
> to improve search performance: storing the index on a RAID 5 disk
> array will let you read the indices very fast, using multicore CPUs,
> adding memory and even if all that isn't good enough, you can always
> use a small cluster (say, 4 nodes) of very, very inexpensive PCs
> filled with a GB of RAM. You don't have to keep them inside the
> regular UPS/backup/vault-protected area as the indices can always be
> rebuilt (unlike e.g. data in transactional systems) and between 4 of
> them they might cost like an entry-level server.
> I'll let the experts speak now. :)
>
> t.n.a.
>
>
>
Thanks for the tip...I am not too worried...I am miserable because I 
live in Dilbert land, not this particular incident. Spreading to 
multiple servers is a possibility but one I want to avoid...I wrote this 
app on the side since our current product is crap...it still needs a lot 
of work and thinking about distributing lucene at this point is a little 
much...I never even have time to work on this project as it is because I
am currently tasked with porting the crap old project to Windows. I need 
to do a bunch to shore up what I have. No one cares though...they think 
that I have done nothing (or can't understand what I have done) while at 
the same time they want to use what I haven't done to do what I made it
for as well as this new super archive of 30 million + docs...in the end 
I'll be looking for a new job...still curious about lucene scaling to 30 
million docs with a sort on every search though (yes I know the sort is 
cached...worries me too though...the sort will be on multiple and 
different fields depending on what the searcher wants...uggg...the size
of the caches....)






Re: 30 million+ docs on a single server

Posted by Mark Miller <ma...@gmail.com>.
Tomi NA wrote:
> On 8/12/06, Mark Miller <ma...@gmail.com> wrote:
>> I've made a nice little archive application with lucene. I made it to
>> handle our largest need: 2.5 million docs or so on a single server. Now
>> the powers that be say: let's use it for a 30+ million document archive
>> on a single server! (each doc size maybe 10k max...as small as a 1 or
>> 2k) Please tell me why we are in trouble...please tell me why we are
>> not. I have tested up to 2 million docs without much trouble but 30
>> million...the average search will include a sort on a field as
>> well...can I search 30+ million docs with a sort? Man am I worried about
>> that. Maybe the server will have 8 procs and 12 billion gigs of RAM.
>> Maybe. Even still, Tomcat seems to be able to launch with a max of 1.5
>> or 1.6 gig of RAM in Windows. What do you think? 30 million+ sounds like
>> too much of a load to me for a single server. Not that they care what I
>> think...I only wrote the thing (man I hate my job, offer me a new one :)
>> )...please...comments?
>>
>> Cheers,
>>
>> Miserable Mark
>
> I don't really understand what you're so worried about. Either it'll
> work well with the setup you have, or it won't. It's really the size
> of it. ;)
> Seriously, you have a number of relatively cheap possibilities at hand
> to improve search performance: storing the index on a RAID 5 disk
> array will let you read the indices very fast, using multicore CPUs,
> adding memory and even if all that isn't good enough, you can always
> use a small cluster (say, 4 nodes) of very, very inexpensive PCs
> filled with a GB of RAM. You don't have to keep them inside the
> regular UPS/backup/vault-protected area as the indices can always be
> rebuilt (unlike e.g. data in transactional systems) and between 4 of
> them they might cost like an entry-level server.
> I'll let the experts speak now. :)
>
> t.n.a.
>
>
>
Thanks for the tip...I am not too worried...I am miserable because I 
live in Dilbert land, not this particular incident. Spreading to 
multiple servers is a possibility but one I want to avoid...I wrote this 
app on the side since our current product is crap...it still needs a lot 
of work and thinking about distributing lucene at this point is a little 
much...I never even have time to work on this project as it is because I
am currently tasked with porting the crap old project to Windows. I need 
to do a bunch to shore up what I have. No one cares though...they think 
that I have done nothing (or can't understand what I have done) while at 
the same time they want to use what I haven't done to do what I made it
for as well as this new super archive of 30 million + docs...in the end 
I'll be looking for a new job...still curious about lucene scaling to 30 
million docs with a sort on every search though (yes I know the sort is 
cached...worries me too though...the sort will be on multiple and 
different fields depending on what the searcher wants...uggg...the size
of the caches....)



Re: 30 million+ docs on a single server

Posted by Tomi NA <he...@gmail.com>.
On 8/12/06, Mark Miller <ma...@gmail.com> wrote:
> I've made a nice little archive application with lucene. I made it to
> handle our largest need: 2.5 million docs or so on a single server. Now
> the powers that be say: let's use it for a 30+ million document archive
> on a single server! (each doc size maybe 10k max...as small as a 1 or
> 2k) Please tell me why we are in trouble...please tell me why we are
> not. I have tested up to 2 million docs without much trouble but 30
> million...the average search will include a sort on a field as
> well...can I search 30+ million docs with a sort? Man am I worried about
> that. Maybe the server will have 8 procs and 12 billion gigs of RAM.
> Maybe. Even still, Tomcat seems to be able to launch with a max of 1.5
> or 1.6 gig of RAM in Windows. What do you think? 30 million+ sounds like
> too much of a load to me for a single server. Not that they care what I
> think...I only wrote the thing (man I hate my job, offer me a new one :)
> )...please...comments?
>
> Cheers,
>
> Miserable Mark

I don't really understand what you're so worried about. Either it'll
work well with the setup you have, or it won't. It's really the size
of it. ;)
Seriously, you have a number of relatively cheap possibilities at hand
to improve search performance: storing the index on a RAID 5 disk
array will let you read the indices very fast, using multicore CPUs,
adding memory and even if all that isn't good enough, you can always
use a small cluster (say, 4 nodes) of very, very inexpensive PCs
filled with a GB of RAM. You don't have to keep them inside the
regular UPS/backup/vault-protected area as the indices can always be
rebuilt (unlike e.g. data in transactional systems) and between 4 of
them they might cost like an entry-level server.
I'll let the experts speak now. :)

t.n.a.



Re[2]: 30 million+ docs on a single server

Posted by Artem Vasiliev <ar...@gmail.com>.
Hi guys!

I have noticed many questions on the list concerning Lucene sorting
memory consumption and hope my solution can help someone.

I faced a memory/time consumption problem with sorting in Lucene back in
April. With the help of this list's experts I came to a solution which I
like: documents from the sorting set (instead of the field's values from
the whole index) are lazy-cached in a WeakHashMap, so the cached items
are candidates for GC. I haven't created a patch yet (though I'm going
to) but the code is ready to be reused, and it has been used for a while
as part of my open-source project, sharehound
(http://sharehound.sourceforge.net). The code can currently be browsed
via CVS; it is in 4 classes in subdirectories of
http://sharehound.cvs.sourceforge.net/sharehound/jNetCrawler/src/java/org/apache/lucene/.

Both the classes (as part of sharehound.jar) and the sources can also
be downloaded with the latest (1.1.3 alpha) sharehound release zip
file.

The LazyCachingSortFactory class has an example of use in its header
comments; I duplicate it here:

/**
 * Creates a Sort that doesn't use FieldCache and doesn't therefore load all the field values from the index while sorting.
 * Instead it uses CachingIndexReader (via CachingDocFieldComparatorSource) to cache documents from which field values
 * are fetched. If the caller Searcher is based on CachingIndexReader itself its document cache will be used here.
 *
 *
 * An example of use:
  ...
  hits = getQueryHits(query, getSort(listSorting.getSortFieldName(), listSorting.isSortDescending()));
  ...

  public Sort getSort(String sortFieldName, boolean sortDescending) {
    return LazyCachingSortFactory.create(sortFieldName, sortDescending);
  }
  ...
  protected Hits getQueryHits(Query query, Sort sort) throws IOException {
     IndexSearcher indexSearcher = new IndexSearcher(CachingIndexReader.decorateIfNeeded(IndexReader.open(getIndexDir())));
     return indexSearcher.search(query, sort);
  }

*/

This solution does a great job of saving memory and, in the case of
relatively small search result sets, it's almost as fast as the default
implementation. Note that this solution currently supports single-field
sorting only.
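
The core of the idea, reduced to a sketch (these lines are illustrative,
not the actual sharehound classes):

    // Lazily cache per-document sort values with weakly-referenced keys,
    // so entries can be dropped by the GC under memory pressure (which
    // also means an entry may vanish and be reloaded at any time):
    private final Map docValueCache =
        Collections.synchronizedMap(new WeakHashMap());

    String sortValue(IndexReader reader, int doc, String field) throws IOException {
        Integer key = new Integer(doc);
        String value = (String) docValueCache.get(key);
        if (value == null) {
            // Only docs that actually appear in the result set get loaded,
            // instead of the field's values for every doc in the index.
            value = reader.document(doc).get(field);
            docValueCache.put(key, value);
        }
        return value;
    }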

Best regards,
Artem

OG> This is unlikely to work well/fast.  It will depend on the
OG> size of the index (not in terms of the number of docs, but its
OG> physical size), the number of queries/second and desired query
OG> latency.  If you can wait 10 seconds to get a query result and if only a
OG> few queries are hitting the server at any one time, then you may
OG> be Ok.  Having things be up to date with non-relevancy sorting
OG> will be quite tough.  FieldCache will consume some RAM.  Warming
OG> it up will take some number of seconds.  Re-opening an
OG> IndexSearcher after index changes will also cost you a bit of time.

OG> Consider a 64-bit server with more RAM that allowed larger
OG> Java heaps, and try to fit your index into RAM.

OG> Otis

OG> ----- Original Message ----
OG> From: Mark Miller <ma...@gmail.com>
OG> To: java-user@lucene.apache.org
OG> Sent: Saturday, August 12, 2006 7:45:15 PM
OG> Subject: Re: 30 milllion+ docs on a single server

OG> The single server is important because I think it will take a lot of
OG> work to scale it to multiple servers. The index must allow for close to
OG> real-time updates and additions. It must also remain searchable at all
OG> times (other than during the brief period of single updates and
OG> additions). If it is easy to scale this to multiple servers please tell
OG> me how.

OG> - Mark
>> Why is a single server so important?  I can scale horizontally much
>> cheaper
>> than I scale vertically.
>>
>>
>>
>> On 8/11/06, Mark Miller <ma...@gmail.com> wrote:
>>>
>>> I've made a nice little archive application with lucene. I made it to
>>> handle our largest need: 2.5 million docs or so on a single server. Now
>>> the powers that be say: let's use it for a 30+ million document archive
>>> on a single server! (each doc size maybe 10k max...as small as a 1 or
>>> 2k) Please tell me why we are in trouble...please tell me why we are
>>> not. I have tested up to 2 million docs without much trouble but 30
>>> million...the average search will include a sort on a field as
>>> well...can I search 30+ million docs with a sort? Man am I worried about
>>> that. Maybe the server will have 8 procs and 12 billion gigs of RAM.
>>> Maybe. Even still, Tomcat seems to be able to launch with a max of 1.5
>>> or 1.6 gig of RAM in Windows. What do you think? 30 million+ sounds like
>>> too much of a load to me for a single server. Not that they care what I
>>> think...I only wrote the thing (man I hate my job, offer me a new one :)
>>> )...please...comments?
>>>
>>> Cheers,
>>>
>>> Miserable Mark
>>>
>>>
>>>
>>















Re: 30 million+ docs on a single server

Posted by Jeff Rodenburg <je...@gmail.com>.
On 8/12/06, Mark Miller <ma...@gmail.com> wrote:
>
> The single server is important because I think it will take a lot of
> work to scale it to multiple servers. The index must allow for close to
> real-time updates and additions. It must also remain searchable at all
> times (other than during the brief period of single updates and
> additions). If it is easy to scale this to multiple servers please tell
> me how.
>

It can take quite a bit of work to implement a multiple-server index system;
we did it last year, building an operational wrapper around Lucene.  Wish
Solr had been around then.  ;-)

I've done both the Windows and the Linux route.  Windows certainly comes
from a scale-up mentality, though we made it work in a scale-out model.  Our
requirements were the same as yours: near real-time updates & additions,
always-on searchability, etc.  It takes work, but it can be done.  We're
serving searches across 6 different types of indexes, with the indexes
spread across the server farm (no single server has the full composite
index).  Our search availability for this year is damn near 5 nines.  If you
haven't looked at Windows 64-bit, let me save you some time.  You don't gain
as much as you might expect; the point of diminishing returns appears to
have already been reached with Windows Server.  We'll apply a similar strategy
to Solr, in that we'll likely run Solr clusters for our composite index.

The best way to explain "how" is to simply refer you to Solr, from an
operational perspective.  The only thing that Solr doesn't have that we do
is rolling together results from multiple searchers, and that's simply an
out-of-the-box configuration; it's not a major ordeal to change that to meet
our needs.

Hope this helps.

-- j

Re: 30 million+ docs on a single server

Posted by Otis Gospodnetic <ot...@yahoo.com>.
This is unlikely to work well/fast.  It will depend on the size of the
index (not in terms of the number of docs, but its physical size), the
number of queries/second and desired query latency.  If you can wait 10
seconds to get a query result and if only a few queries are hitting the
server at any one time, then you may be Ok.  Having things be up to date
with non-relevancy sorting will be quite tough.  FieldCache will consume
some RAM.  Warming it up will take some number of seconds.  Re-opening an
IndexSearcher after index changes will also cost you a bit of time.

Consider a 64-bit server with more RAM that allowed larger Java heaps, and try to fit your index into RAM.

Otis

----- Original Message ----
From: Mark Miller <ma...@gmail.com>
To: java-user@lucene.apache.org
Sent: Saturday, August 12, 2006 7:45:15 PM
Subject: Re: 30 milllion+ docs on a single server

The single server is important because I think it will take a lot of 
work to scale it to multiple servers. The index must allow for close to 
real-time updates and additions. It must also remain searchable at all 
times (other than during the brief period of single updates and
additions). If it is easy to scale this to multiple servers please tell 
me how.

- Mark
> Why is a single server so important?  I can scale horizontally much 
> cheaper
> than I scale vertically.
>
>
>
> On 8/11/06, Mark Miller <ma...@gmail.com> wrote:
>>
>> I've made a nice little archive application with lucene. I made it to
>> handle our largest need: 2.5 million docs or so on a single server. Now
>> the powers that be say: let's use it for a 30+ million document archive
>> on a single server! (each doc size maybe 10k max...as small as a 1 or
>> 2k) Please tell me why we are in trouble...please tell me why we are
>> not. I have tested up to 2 million docs without much trouble but 30
>> million...the average search will include a sort on a field as
>> well...can I search 30+ million docs with a sort? Man am I worried about
>> that. Maybe the server will have 8 procs and 12 billion gigs of RAM.
>> Maybe. Even still, Tomcat seems to be able to launch with a max of 1.5
>> or 1.6 gig of RAM in Windows. What do you think? 30 million+ sounds like
>> too much of a load to me for a single server. Not that they care what I
>> think...I only wrote the thing (man I hate my job, offer me a new one :)
>> )...please...comments?
>>
>> Cheers,
>>
>> Miserable Mark
>>
>>
>>
>









Re: 30 million+ docs on a single server

Posted by Mark Miller <ma...@gmail.com>.
The single server is important because I think it will take a lot of 
work to scale it to multiple servers. The index must allow for close to 
real-time updates and additions. It must also remain searchable at all 
times (other than during the brief period of single updates and
additions). If it is easy to scale this to multiple servers please tell 
me how.

- Mark
> Why is a single server so important?  I can scale horizontally much 
> cheaper
> than I scale vertically.
>
>
>
> On 8/11/06, Mark Miller <ma...@gmail.com> wrote:
>>
>> I've made a nice little archive application with lucene. I made it to
>> handle our largest need: 2.5 million docs or so on a single server. Now
>> the powers that be say: let's use it for a 30+ million document archive
>> on a single server! (each doc size maybe 10k max...as small as a 1 or
>> 2k) Please tell me why we are in trouble...please tell me why we are
>> not. I have tested up to 2 million docs without much trouble but 30
>> million...the average search will include a sort on a field as
>> well...can I search 30+ million docs with a sort? Man am I worried about
>> that. Maybe the server will have 8 procs and 12 billion gigs of RAM.
>> Maybe. Even still, Tomcat seems to be able to launch with a max of 1.5
>> or 1.6 gig of RAM in Windows. What do you think? 30 million+ sounds like
>> too much of a load to me for a single server. Not that they care what I
>> think...I only wrote the thing (man I hate my job, offer me a new one :)
>> )...please...comments?
>>
>> Cheers,
>>
>> Miserable Mark
>>
>>
>>
>




Re: 30 million+ docs on a single server

Posted by Jeff Rodenburg <je...@gmail.com>.
Why is a single server so important?  I can scale horizontally much cheaper
than I scale vertically.



On 8/11/06, Mark Miller <ma...@gmail.com> wrote:
>
> I've made a nice little archive application with lucene. I made it to
> handle our largest need: 2.5 million docs or so on a single server. Now
> the powers that be say: let's use it for a 30+ million document archive
> on a single server! (each doc size maybe 10k max...as small as a 1 or
> 2k) Please tell me why we are in trouble...please tell me why we are
> not. I have tested up to 2 million docs without much trouble but 30
> million...the average search will include a sort on a field as
> well...can I search 30+ million docs with a sort? Man am I worried about
> that. Maybe the server will have 8 procs and 12 billion gigs of RAM.
> Maybe. Even still, Tomcat seems to be able to launch with a max of 1.5
> or 1.6 gig of RAM in Windows. What do you think? 30 million+ sounds like
> too much of a load to me for a single server. Not that they care what I
> think...I only wrote the thing (man I hate my job, offer me a new one :)
> )...please...comments?
>
> Cheers,
>
> Miserable Mark
>
>
>

30 million+ docs on a single server

Posted by Mark Miller <ma...@gmail.com>.
I've made a nice little archive application with lucene. I made it to 
handle our largest need: 2.5 million docs or so on a single server. Now 
the powers that be say: let's use it for a 30+ million document archive
on a single server! (each doc size maybe 10k max...as small as a 1 or 
2k) Please tell me why we are in trouble...please tell me why we are 
not. I have tested up to 2 million docs without much trouble but 30 
million...the average search will include a sort on a field as 
well...can I search 30+ million docs with a sort? Man am I worried about 
that. Maybe the server will have 8 procs and 12 billion gigs of RAM. 
Maybe. Even so, Tomcat seems to be able to launch with a max of 1.5 
or 1.6 gigs of RAM on Windows. What do you think? 30 million+ sounds like 
too much of a load to me for a single server. Not that they care what I 
think...I only wrote the thing (man I hate my job, offer me a new one :) 
)...please...comments?
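
A rough way to size that worry, assuming the FieldCache.StringIndex
layout Lucene uses for String sorts (one int order entry per document
plus the unique term values):

long maxDoc = 30000000L;
long orderArrayBytes = 4L * maxDoc;   // ~115 MB just for the sort order array
// plus the unique String values themselves, and another ~115 MB for every
// additional int sort field - several hundred MB of sort cache is quite
// plausible, which is exactly where a ~1.5 GB heap cap starts to hurt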

Cheers,

Miserable Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Poor performance "race condition" in FieldSortedHitQueue

Posted by Oliver Hutchison <oh...@aconex.com>.
I've created an issue (LUCENE-651) and attached a patch. Hopefully this will
help you guys with whatever approach you end up using to solve this.

Thanks,

Oliver


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Poor performance "race condition" in FieldSortedHitQueue

Posted by Chris Hostetter <ho...@fucit.org>.
: ... right, thanks, now I see what you mean. In other words, IndexReader
: provides the ability to read/iterate terms and docs, but caching the term
: values per doc is for a higher layer - this way keeping IndexReader simpler
: and more maintainable. So I guess Oliver can continue with the change as he
: proposed it.

if any "heavy" changes were to be made to the way FieldCaches refrences
are handled (ie; being able to specify which FieldCacheImpl you want to
use) it would make more sense to put that in the Searcher API ... but even
then it might make sense to use a WeakHashRef on the IndexReader since
multiple IndexSearchers can refrence the same IndexReader.


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Poor performance "race condition" in FieldSortedHitQueue

Posted by Doron Cohen <DO...@il.ibm.com>.
> On 8/10/06, Doron Cohen <DO...@il.ibm.com> wrote:
> Sorting was introduced to Lucene before my time, so I don't know the
> reasons behind it.  Maybe it was seen as non-optimal or non-core and
> so was kept out of the IndexReader.
>
> I admit, it does feel like the level of abstraction that FieldCache is
> at is higher than that of the IndexReader (the lowest level).

... right, thanks, now I see what you mean. In other words, IndexReader
provides the ability to read/iterate terms and docs, but caching the term
values per doc is for a higher layer - this way keeping IndexReader simpler
and more maintainable. So I guess Oliver can continue with the change as he
proposed it.




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Poor performance "race condition" in FieldSortedHitQueue

Posted by Yonik Seeley <yo...@apache.org>.
On 8/10/06, Doron Cohen <DO...@il.ibm.com> wrote:
> I have one more comment on the cache implementation. It feels
> somewhat wrong to me that a static, system-wide object (FieldCache.DEFAULT) is
> managing the field caching for all the indexReaders in the JVM (possibly of
> different indexes), when in fact there is no
> dependency/relation/cooperation between the different indexReaders,
> cache-wise. It seems cleaner and simpler to have FieldCacheImpl take care of a
> single IndexReader, and so have that cache "belong" to the indexReader.
> This would make the cache implementation simpler. Synchronization would
> only need to be on field values. This way we also get rid of the
> WeakHashMap (which, btw, I never got to fully trust).

Sorting was introduced to Lucene before my time, so I don't know the
reasons behind it.  Maybe it was seen as non-optimal or non-core and
so was kept out of the IndexReader.

I admit, it does feel like the level of abstraction that FieldCache is
at is higher than that of the IndexReader (the lowest level).

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: NPE when sorting on a field that is missing from a doc

Posted by Oliver Hutchison <oh...@aconex.com>.
> i haven't seen it mentioned before ... i'm guessing it is 
> specific to the "explicit Locale" String comparator.

I've created an issue (LUCENE-650) with patch to fix this.

Thanks,

Oliver


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: NPE when sorting on a field that is missing from a doc

Posted by Chris Hostetter <ho...@fucit.org>.
: we have recently noticed that doing a locale sensitive sort on a field that
: is missing from some docs causes an NPE inside the call to Collator#compare
: at FieldSortedHitQueue line 320 (Lucene 2.0 src):

: From looking at the standard String, float and int sorting and reading
: LUCENE-406 I assume this is not expected behavior and that docs that do not
: include the field should be sorted to appear at the start of the list of
: results.

that is correct .. typically "no value" is interpreted as being the
"lowest possible value" (so in a reverse sort, they appear at the end of
the list and not the beginning)

: Is this a known issue? If not I'll raise the issue and create a patch.

i haven't seen it mentioned before ... i'm guessing it is specific to the
"explicit Locale" String comparator.


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


NPE when sorting on a field that is missing from a doc

Posted by Oliver Hutchison <oh...@aconex.com>.
Hi all, 

we have recently noticed that doing a locale sensitive sort on a field that
is missing from some docs causes an NPE inside the call to Collator#compare
at FieldSortedHitQueue line 320 (Lucene 2.0 src):

static ScoreDocComparator comparatorStringLocale (final IndexReader reader,
final String fieldname, final Locale locale)
  throws IOException {
    final Collator collator = Collator.getInstance (locale);
    final String field = fieldname.intern();
    final String[] index = FieldCache.DEFAULT.getStrings (reader, field);
    return new ScoreDocComparator() {

      public final int compare (final ScoreDoc i, final ScoreDoc j) {
		return collator.compare (index[i.doc], index[j.doc]);  // <---- NPE here when one/both values are null
      }


From looking at the standard String, float and int sorting and reading
LUCENE-406 I assume this is not expected behavior and that docs that do not
include the field should be sorted to appear at the start of the list of
results.
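
A null-safe version of that compare would look something like the sketch
below (the actual LUCENE-650 patch is not reproduced here); a missing
value is treated as the lowest possible value, so such docs sort to the
start:

      public final int compare (final ScoreDoc i, final ScoreDoc j) {
        final String s1 = index[i.doc];
        final String s2 = index[j.doc];
        if (s1 == null) return (s2 == null) ? 0 : -1;  // missing value sorts first
        if (s2 == null) return 1;
        return collator.compare (s1, s2);
      }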

Is this a known issue? If not I'll raise the issue and create a patch.

Thanks again,

Oliver



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Poor performance "race condition" in FieldSortedHitQueue

Posted by Oliver Hutchison <oh...@aconex.com>.
> yseeley@gmail.com wrote on 09/08/2006 20:32:20:
> > Heh... interfaces strike again.
> >
> > Well then since we *know* that no one has their own implementation
> > (because they would not have been able to register it), we should be
> > able to safely upgrade the interface to a class (anyone want to supply
> > a patch?)
> >
> > -Yonik
> 
> I'd be happy to supply this patch - unless someone already 
> works on it (Oliver?).

I was intending to do this but perhaps this is not needed given your
following comments.

> I have one more comment on the cache implementation. It feels
> somewhat wrong to me that a static, system-wide object
> (FieldCache.DEFAULT) is managing the field caching for all
> the indexReaders in the JVM (possibly of different indexes),
> when in fact there is no dependency/relation/cooperation
> between the different indexReaders, cache-wise. It seems
> cleaner and simpler to have FieldCacheImpl take care of a 
> single IndexReader, and so have that cache "belong" to the 
> indexReader.
> This would make the cache implementation simpler. 
> Synchronization would only need to be on field values. This 
> way we also get rid of the WeakHashMap (which, btw, I never 
> got to fully trust).

This sounds like a much nicer solution than what I was proposing. I'm still
happy to produce a patch if that would be helpful?

Cheers,

Oliver



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Poor performance "race condition" in FieldSortedHitQueue

Posted by Doron Cohen <DO...@il.ibm.com>.
yseeley@gmail.com wrote on 09/08/2006 20:32:20:
> Heh... interfaces strike again.
>
> Well then since we *know* that no one has their own implementation
> (because they would not have been able to register it), we should be
> able to safely upgrade the interface to a class (anyone want to supply
> a patch?)
>
> -Yonik

I'd be happy to supply this patch - unless someone already works on it
(Oliver?).

I have one more comment on the cache implementation. It feels
somewhat wrong to me that a static, system-wide object (FieldCache.DEFAULT) is
managing the field caching for all the indexReaders in the JVM (possibly of
different indexes), when in fact there is no
dependency/relation/cooperation between the different indexReaders,
cache-wise. It seems cleaner and simpler to have FieldCacheImpl take care of a
single IndexReader, and so have that cache "belong" to the indexReader.
This would make the cache implementation simpler. Synchronization would
only need to be on field values. This way we also get rid of the
WeakHashMap (which, btw, I never got to fully trust).
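
A minimal sketch of that idea (a hypothetical API, not anything in
Lucene; the synchronization here is coarser than Doron proposes, for
brevity):

public class PerReaderFieldCache {

    // field name -> loaded values; this map lives and dies with a single
    // IndexReader, so no WeakHashMap keyed on readers is needed
    private final Map fieldValues = new HashMap();

    public synchronized Object get(String field, ValueLoader loader)
            throws IOException {
        Object value = fieldValues.get(field);
        if (value == null) {
            value = loader.load(field);   // load at most once per field
            fieldValues.put(field, value);
        }
        return value;
    }

    public interface ValueLoader {
        Object load(String field) throws IOException;
    }
}

An IndexReader would then own one such instance, and closing the reader
would drop the cached values with it.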

Regards,
Doron


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Poor performance "race condition" in FieldSortedHitQueue

Posted by Yonik Seeley <yo...@apache.org>.
On 8/9/06, Oliver Hutchison <oh...@aconex.com> wrote:
> > Well, there's FieldCache.DEFAULT
>
> I thought the exact same thing but what I'd forgotten was that all fields on
> an interface are implicitly final.

Heh... interfaces strike again.

Well then since we *know* that no one has their own implementation
(because they would not have been able to register it), we should be
able to safely upgrade the interface to a class (anyone want to supply
a patch?)

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Poor performance "race condition" in FieldSortedHitQueue

Posted by Oliver Hutchison <oh...@aconex.com>.
> Ah, right... I browsed your code a bit too fast.  It looks fine.

Great.

> > On a related note it would be great if there was a way to plug a 
> > custom FieldCache implementation into Lucene, given there is a 
> > FieldCache interface it's a shame there's no way to actually provide
> > an alternative implementation.
> 
> Well, there's FieldCache.DEFAULT

I thought the exact same thing but what I'd forgotten was that all fields on
an interface are implicitly final.

http://java.sun.com/docs/books/jls/third_edition/html/interfaces.html#9.3
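
In other words, interface fields are implicitly public static final, so
the following (with MyFieldCache standing in for a hypothetical custom
implementation) simply won't compile:

    FieldCache.DEFAULT = new MyFieldCache();
    // compile error: cannot assign a value to final variable DEFAULT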

Anyway, thanks for the feedback I'll raise a couple of issues shortly.

Cheers,

Oliver


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Poor performance "race condition" in FieldSortedHitQueue

Posted by Yonik Seeley <yo...@apache.org>.
On 8/9/06, Oliver Hutchison <oh...@aconex.com> wrote:
> Yonik,
>
> > most easily implemented in Java5 via Future.
>
> I didn't use Java5 as I had a feeling that code in Lucene needs to compile
> on Java 1.3, right?

Lucene 2 currently requires Java 1.4

It was really just a side comment - people have implemented these
blocking maps before, and I've seen it done with Java5 concurrency
things like Future - a natural fit.  The way you were going about it
is perfectly fine though.

> > I don't think you need two maps though, right?  just stick a
> > placeholder in the outer map.
>
> I'm using 2 maps mainly because it simplifies the implementation.
> Technically all that is needed is a single map with a key that is a composite
> of index reader and field name however, given that there is also the
> requirement that we only maintain a weak reference to the index reader and
> the associated need to clean up the cache if the reader gets gc'd, it was
> simpler for me to simulate the composite key using the 2 maps.

Ah, right... I browsed your code a bit too fast.  It looks fine.

> On a related note it would be great if there was a way to plug a custom
> FieldCache implementation into Lucene, given there is a FieldCache interface
> it's a shame there's no way to actually provide an alternative
> implementation.

Well, there's FieldCache.DEFAULT

  /** Expert: The cache used internally by sorting and range query classes. */
  public static FieldCache DEFAULT = new FieldCacheImpl();


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Poor performance "race condition" in FieldSortedHitQueue

Posted by Oliver Hutchison <oh...@aconex.com>.
Yonik, 

> most easily implemented in Java5 via Future.

I didn't use Java5 as I had a feeling that code in Lucene needs to compile
on Java 1.3, right? 

> I don't think you need two maps though, right?  just stick a 
> placeholder in the outer map.

I'm using 2 maps mainly because it simplifies the implementation.
Technically all that is needed is a single map with a key that is a composite
of index reader and field name however, given that there is also the
requirement that we only maintain a weak reference to the index reader and
the associated need to clean up the cache if the reader gets gc'd, it was
simpler for me to simulate the composite key using the 2 maps. 

On a related note, it would be great if there were a way to plug a custom
FieldCache implementation into Lucene; given there is a FieldCache interface,
it's a shame there's no way to actually provide an alternative
implementation.

Since there seems to be a decent level of interest in this I'll try to put
together a patch and raise an issue over the next few days. 

Cheers,

Oliver

> On 8/9/06, Oliver Hutchison <oh...@aconex.com> wrote:
> > Otis, Doron, thanks for the feedback.
> >
> > First up I'd just like to say that I totally agree with Doron on this -
> > any attempt to fix this issue needs to use synchronization that is as
> > fine-grained as possible or you'd just be introducing a new bottleneck.
> >
> > In terms of the level of granularity, the workaround I posted in my
> > previous email and the approach suggested by Doron are basically the
> > same (though Doron's code is certainly preferable) and I can certainly
> > say that synchronizing the object creation against the field name does
> > solve the problem.
> >
> > However I have another solution that I'm working on that may be
> > cleaner - by encapsulating the caching logic that is currently spread
> > across FieldCacheImpl and FieldSortedHitQueue it becomes quite easy to
> > implement a more complex but certainly more fine-grained level of
> > synchronization and we don't have to worry about synchronizing against
> > an interned String or using some other trick to synchronize on the field name.
> >
> > I currently have:
> >
> > public abstract class Cache {
> >
> >         private final Map readerCache = new WeakHashMap();
> >
> >         protected Cache() {
> >         }
> >
> >         protected abstract Object createValue(IndexReader reader, Object key)
> >                         throws IOException;
> >
> >         public Object get(IndexReader reader, Object key) throws IOException {
> >                 Map innerCache;
> >                 Object value;
> >                 synchronized (readerCache) {
> >                         innerCache = (Map) readerCache.get(reader);
> >                         // no inner cache, create it
> >                         if (innerCache == null) {
> >                                 innerCache = new HashMap();
> >                                 readerCache.put(reader, innerCache);
> >                                 value = null;
> >                         } else {
> >                                 value = innerCache.get(key);
> >                         }
> >                         if (value == null) {
> >                                 value = new CreationPlaceholder();
> >                                 innerCache.put(key, value);
> >                         }
> >                 }
> >                 if (value instanceof CreationPlaceholder) {
> >                         // must be one of the first threads to request this value,
> >                         // synchronize on the CreationPlaceholder so we don't block
> >                         // any other calls for different values
> >                         CreationPlaceholder ph = (CreationPlaceholder) value;
> >                         synchronized (ph) {
> >                                 // if this thread is the very first one to reach this point
> >                                 // then this test will be true and we should do the creation
> >                                 if (ph.value == null) {
> >                                         ph.value = createValue(reader, key);
> >                                         synchronized (readerCache) {
> >                                                 innerCache.put(key, ph.value);
> >                                         }
> >                                 }
> >                                 return ph.value;
> >                         }
> >                 }
> >                 return value;
> >         }
> >
> >         static final class CreationPlaceholder {
> >                 Object value;
> >         }
> > }
> >
> >
> > class FieldCacheImpl implements FieldCache {
> >
> > ...
> >
> >         public String[] getStrings(IndexReader reader, String field)
> >                         throws IOException {
> >                 return (String[]) stringsCache.get(reader, field);
> >         }
> >
> >         Cache stringsCache = new Cache() {
> >
> >                 protected Object createValue(IndexReader reader, Object fieldKey)
> >                                 throws IOException {
> >                         String field = ((String) fieldKey).intern();
> >
> > ... create String[] ...
> >
> >                         return retArray;
> >                 }
> >         };
> >
> >         public StringIndex getStringIndex(IndexReader reader, String field)
> >                         throws IOException {
> >                 return (StringIndex) stringsIndexCache.get(reader, field);
> >         }
> >
> >         Cache stringsIndexCache = new Cache() {
> >
> >                 protected Object createValue(IndexReader reader, Object fieldKey)
> >                                 throws IOException {
> >                         String field = ((String) fieldKey).intern();
> >
> > ... create StringIndex ...
> >
> >                         return value;
> >                 }
> >         };
> >
> > ... etc
> >
> > }
> >
> > Is this an avenue worth pursuing further? Or are you guys happy to 
> > simply synchronize on the field?
> >
> > Thanks again,
> >
> > Oliver
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Poor performance "race condition" in FieldSortedHitQueue

Posted by Yonik Seeley <yo...@apache.org>.
Definitely the right track Oliver... it's called a blocking map (most
easily implemented in Java5 via Future).  I don't think you need two
maps though, right?  just stick a placeholder in the outer map.
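
For reference, a minimal sketch of that blocking-map idiom using Java 5's
FutureTask (the class and method names are illustrative, and this ignores
the weak-reference-to-reader requirement discussed elsewhere in the
thread):

import java.util.concurrent.*;

public abstract class BlockingCache<K, V> {

    private final ConcurrentMap<K, FutureTask<V>> cache =
            new ConcurrentHashMap<K, FutureTask<V>>();

    /** Computes the value for a missing key; runs at most once per key. */
    protected abstract V createValue(K key) throws Exception;

    public V get(final K key) throws Exception {
        FutureTask<V> task = cache.get(key);
        if (task == null) {
            FutureTask<V> newTask = new FutureTask<V>(new Callable<V>() {
                public V call() throws Exception {
                    return createValue(key);
                }
            });
            task = cache.putIfAbsent(key, newTask);   // the "placeholder"
            if (task == null) {
                task = newTask;
                task.run();          // we won the race, do the computation
            }
        }
        return task.get();           // everyone else blocks here until done
    }
}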

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

On 8/9/06, Oliver Hutchison <oh...@aconex.com> wrote:
> Otis, Doron, thanks for the feedback.
>
> First up I'd just like to say that I totally agree with Doron on this - any
> attempt to fix this issue needs to use synchronization that is as
> fine-grained as possible or you'd just be introducing a new bottleneck.
>
> In terms of the level of granularity, the workaround I posted in my
> previous email and the approach suggested by Doron are basically the same
> (though Doron's code is certainly preferable) and I can certainly say that
> synchronizing the object creation against the field name does solve the
> problem.
>
> However I have another solution that I'm working on that may be cleaner - by
> encapsulating the caching logic that is currently spread across FieldCacheImpl
> and FieldSortedHitQueue it becomes quite easy to implement a more complex
> but certainly more fine-grained level of synchronization and we don't have
> to worry about synchronizing against an interned String or using some other
> trick to synchronize on the field name.
>
> I currently have:
>
> public abstract class Cache {
>
>         private final Map readerCache = new WeakHashMap();
>
>         protected Cache() {
>         }
>
>         protected abstract Object createValue(IndexReader reader, Object
> key)
>                         throws IOException;
>
>         public Object get(IndexReader reader, Object key) throws IOException
> {
>                 Map innerCache;
>                 Object value;
>                 synchronized (readerCache) {
>                         innerCache = (Map) readerCache.get(reader);
>                         // no inner cache, create it
>                         if (innerCache == null) {
>                                 innerCache = new HashMap();
>                                 readerCache.put(reader, innerCache);
>                                 value = null;
>                         } else {
>                                 value = innerCache.get(key);
>                         }
>                         if (value == null) {
>                                 value = new CreationPlaceholder();
>                                 innerCache.put(key, value);
>                         }
>                 }
>                 if (value instanceof CreationPlaceholder) {
>                         // must be one of the first threads to request this
> value,
>                         // synchronize on the CreationPlaceholder so we
> don't block
>                         // any other calls for different values
>                         CreationPlaceholder ph = (CreationPlaceholder)
> value;
>                         synchronized (ph) {
>                                 // if this thread is the very first one to
> reach this point
>                                 // then this test will be true and we should
> do the creation
>                                 if (ph.value == null) {
>                                         ph.value = createValue(reader, key);
>                                         synchronized (readerCache) {
>                                                 innerCache.put(key,
> ph.value);
>                                         }
>                                 }
>                                 return ph.value;
>                         }
>                 }
>                 return value;
>         }
>
>         static final class CreationPlaceholder {
>                 Object value;
>         }
> }
>
>
> class FieldCacheImpl implements FieldCache {
>
> ...
>
>         public String[] getStrings(IndexReader reader, String field)
>                         throws IOException {
>                 return (String[]) stringsCache.get(reader, field);
>         }
>
>         Cache stringsCache = new Cache() {
>
>                 protected Object createValue(IndexReader reader, Object
> fieldKey)
>                                 throws IOException {
>                         String field = ((String) fieldKey).intern();
>
> ... create String[] ...
>
>                         return retArray;
>                 }
>         };
>
>         public StringIndex getStringIndex(IndexReader reader, String field)
>                         throws IOException {
>                 return (StringIndex) stringsIndexCache.get(reader, field);
>         }
>
>         Cache stringsIndexCache = new Cache() {
>
>                 protected Object createValue(IndexReader reader, Object
> fieldKey)
>                                 throws IOException {
>                         String field = ((String) fieldKey).intern();
>
> ... create StringIndex ...
>
>                         return value;
>                 }
>         };
>
> ... etc
>
> }
>
> Is this an avenue worth pursuing further? Or are you guys happy to simply
> synchronize on the field?
>
> Thanks again,
>
> Oliver

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Poor performance "race condition" in FieldSortedHitQueue

Posted by Oliver Hutchison <oh...@aconex.com>.
Otis, Doron, thanks for the feedback. 

First up I'd just like to say that I totally agree with Doron on this - any
attempt to fix this issue needs to use synchronization that is as
fine-grained as possible or you'd just be introducing a new bottleneck.

In terms of the level of granularity, the workaround I posted in my
previous email and the approach suggested by Doron are basically the same
(though Doron's code is certainly preferable) and I can certainly say that
synchronizing the object creation against the field name does solve the
problem. 

However I have another solution that I'm working on that may be cleaner - by
encapsulating the caching logic that is currently spread across FieldCacheImpl
and FieldSortedHitQueue it becomes quite easy to implement a more complex
but certainly more fine-grained level of synchronization and we don't have
to worry about synchronizing against an interned String or using some other
trick to synchronize on the field name. 

I currently have:

public abstract class Cache {

	private final Map readerCache = new WeakHashMap();

	protected Cache() {
	}

	protected abstract Object createValue(IndexReader reader, Object key)
			throws IOException;

	public Object get(IndexReader reader, Object key) throws IOException {
		Map innerCache;
		Object value;
		synchronized (readerCache) {
			innerCache = (Map) readerCache.get(reader);
			// no inner cache, create it
			if (innerCache == null) {
				innerCache = new HashMap();
				readerCache.put(reader, innerCache);
				value = null;
			} else {
				value = innerCache.get(key);
			}
			if (value == null) {
				value = new CreationPlaceholder();
				innerCache.put(key, value);
			}
		}
		if (value instanceof CreationPlaceholder) {
			// must be one of the first threads to request this value,
			// synchronize on the CreationPlaceholder so we don't block
			// any other calls for different values
			CreationPlaceholder ph = (CreationPlaceholder) value;
			synchronized (ph) {
				// if this thread is the very first one to reach this point
				// then this test will be true and we should do the creation
				if (ph.value == null) {
					ph.value = createValue(reader, key);
					synchronized (readerCache) {
						innerCache.put(key, ph.value);
					}
				}
				return ph.value;
			}
		}
		return value;
	}

	static final class CreationPlaceholder {
		Object value;
	}
}


class FieldCacheImpl implements FieldCache {

...

	public String[] getStrings(IndexReader reader, String field)
			throws IOException {
		return (String[]) stringsCache.get(reader, field);
	}

	Cache stringsCache = new Cache() {

		protected Object createValue(IndexReader reader, Object fieldKey)
				throws IOException {
			String field = ((String) fieldKey).intern();

... create String[] ...

			return retArray;
		}
	};

	public StringIndex getStringIndex(IndexReader reader, String field)
			throws IOException {
		return (StringIndex) stringsIndexCache.get(reader, field);
	}

	Cache stringsIndexCache = new Cache() {

		protected Object createValue(IndexReader reader, Object fieldKey)
				throws IOException {
			String field = ((String) fieldKey).intern();

... create StringIndex ...

			return value;
		}
	};

... etc
 
}

Is this an avenue worth pursuing further? Or are you guys happy to simply
synchronize on the field?

Thanks again,

Oliver


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Poor performance "race condition" in FieldSortedHitQueue

Posted by Doron Cohen <DO...@il.ibm.com>.
Hi Otis,

I think that synchronizing the entire method would be overkill - instead
it would be sufficient to synchronize on a "by field" object so that only
when two requests for the same "cold/missing" field are racing would one of
them wait for the other to complete loading that field.  I think there is
no need for a lookup() of field2 to wait while a different field1
is being loaded.  I am not sure if, IO-wise, it makes sense to serialize the
loading of two different fields (i.e. the case where both field1 and field2
are not in the readerCache); I would prefer not to.
One fast way to do this, for testing performance impact in Oliver's test
case, would be to sync on the interned field name, as follows:

  public StringIndex getStringIndex (IndexReader reader, String field)
  throws IOException {
    field = field.intern();
    synchronized (field) {  // <----------- line added
      Object ret = lookup (reader, field, STRING_INDEX, null);
      if (ret == null) {
         final int[] retArray = new int[reader.maxDoc()];
         ... load field to cache ...
      }

This way only requests for (loading) the same field would wait. But for the
real code, it would be better to maintain a by-field (and by-reader) lock
object to avoid messing with a system-wide string - who knows who else
is synchronizing on it...
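
A sketch of such a by-field (and by-reader) lock object (a hypothetical
helper in the Java 1.4 style of the rest of this thread):

  private static final Map readerFieldLocks = new WeakHashMap();

  private static Object lockFor(IndexReader reader, String field) {
    synchronized (readerFieldLocks) {
      Map fieldLocks = (Map) readerFieldLocks.get(reader);
      if (fieldLocks == null) {
        fieldLocks = new HashMap();
        readerFieldLocks.put(reader, fieldLocks);
      }
      Object lock = fieldLocks.get(field);
      if (lock == null) {
        lock = new Object();              // one lock per (reader, field)
        fieldLocks.put(field, lock);
      }
      return lock;
    }
  }

  // usage: synchronized (lockFor(reader, field)) { ... load the field ... }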

Hope this makes sense,
Doron

Otis Gospodnetic <ot...@yahoo.com> wrote on 08/08/2006 21:07:41:

> Hi Oliver,
>
> I think Yonik simply misunderstood you in that earlier email.
> Have you tried modifying that FieldSortedHitQueue class and making
> the appropriate method(s) synchronized?
> It sounds like that would fix the issue. If it does, please let us know.
>
> Otis
>
> ----- Original Message ----
> From: hutchiko@gmail.com
> To: java-user@lucene.apache.org
> Sent: Tuesday, August 8, 2006 2:05:36 AM
> Subject: Poor performance "race condition" in FieldSortedHitQueue
>
> Hey all, just want to run an issue that I've recently identified while
> looking at some performance issues we are having with our larger
> indexes past you all.
>
> Basically what we are seeing is that when there are a number of
> concurrent searches being executed over a new IndexSearcher, the quite
> expensive ScoreDocComparator generation that is done in the
> FieldSortedHitQueue#getCachedComparator method ends up executing
> multiple times rather than the ideal case of once. This issue does not
> affect the correctness of the searches, only performance.
>
> From my relatively weak understanding of the code, the core of this
> issue appears to lie with the FieldCacheImpl#getStringIndex method
> which allows multiple concurrent requests to each generate their own
> StringIndex rather than allowing the first request to do the
> generation and then blocking subsequent requests until the first
> request has finished.
>
> Is this a known problem? Should I raise this as an issue or is this
> "expected" behaviour. A solution would naturally require more
> synchronization than is currently used but nothing particularly
> complex.
>
> Thanks,
>
> Oliver
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Poor performance "race condition" in FieldSortedHitQueue

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Oliver,

I think Yonik simply misunderstood you in that earlier email.
Have you tried modifying that FieldSortedHitQueue class and making the appropriate method(s) synchronized?
It sounds like that would fix the issue. If it does, please let us know.
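
In code, that suggestion amounts to something like the sketch below (the
signature is illustrative, not the exact Lucene 2.0 one). It is simple,
but it would serialize comparator creation across all fields and readers,
which is the concern raised elsewhere in this thread:

static synchronized ScoreDocComparator getCachedComparator(
    IndexReader reader, String field, int type, Locale locale)
    throws IOException {
  // ...the existing cache lookup and comparator creation logic...
  return null; // placeholder body for the sketch
}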

Otis

----- Original Message ----
From: hutchiko@gmail.com
To: java-user@lucene.apache.org
Sent: Tuesday, August 8, 2006 2:05:36 AM
Subject: Poor performance "race condition" in FieldSortedHitQueue

Hey all, just want to run an issue that I've recently identified while
looking at some performance issues we are having with our larger
indexes past you all.

Basically what we are seeing is that when there are a number of
concurrent searches being executed over a new IndexSearcher, the quite
expensive ScoreDocComparator generation that is done in the
FieldSortedHitQueue#getCachedComparator method ends up executing
multiple times rather than the ideal case of once. This issue does not
affect the correctness of the searches, only performance.

From my relatively weak understanding of the code, the core of this
issue appears to lie with the FieldCacheImpl#getStringIndex method
which allows multiple concurrent requests to each generate their own
StringIndex rather than allowing the first request to do the
generation and then blocking subsequent requests until the first
request has finished.

Is this a known problem? Should I raise this as an issue or is this
"expected" behaviour. A solution would naturally require more
synchronization than is currently used but nothing particularly
complex.

Thanks,

Oliver

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org