Posted to solr-user@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2008/01/02 05:53:43 UTC

Re: big perf-difference between solr-server vs. SolrJ req.process(solrserver)

Maybe I'm not following your situation 100%, but it sounded like pulling the values of purely stored fields is the slow part. *Perhaps* using a non-Lucene data store just for the saved fields would be faster.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
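Otis's suggestion above can be sketched in a few lines: the search index returns only the unique IDs of the top-N hits, and the bulky display fields are fetched by ID from an external store. This is a minimal toy model, not code from the thread; the `HashMap` stands in for whatever external store is chosen (an RDBMS, a BDB, etc.), and the hit list stands in for Solr results.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of the search-then-fetch pattern: the index returns IDs only;
// display fields live in an external key-value store keyed by those IDs.
public class ExternalStoreLookup {

    static List<String> fetchDisplayFields(List<String> hitIds,
                                           Map<String, String> externalStore) {
        List<String> display = new ArrayList<String>();
        for (String id : hitIds) {
            // One get per hit; a real store would typically batch these.
            display.add(externalStore.get(id));
        }
        return display;
    }

    public static void main(String[] args) {
        // The map simulates the external store (RDBMS / BDB / ...).
        Map<String, String> store = new HashMap<String, String>();
        store.put("prod-1", "Blue widget, 9.99 EUR");
        store.put("prod-2", "Red widget, 12.50 EUR");

        // IDs as the search index would return them for the top hits.
        List<String> topHits = new ArrayList<String>();
        topHits.add("prod-2");
        System.out.println(fetchDisplayFields(topHits, store));
    }
}
```

The point of the pattern is that the index stays small (search fields only), so the stored-field scanning cost discussed later in this thread disappears from the query path.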

----- Original Message ----
From: Geert-Jan Brits <gb...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Monday, December 31, 2007 8:49:43 AM
Subject: Re: big perf-difference between solr-server vs. SolrJ req.process(solrserver)

Hi Otis,

I don't really see how this would minimize my number of fields.
At the moment I have 1 price field (stored / indexed) and 1 multivalued
field (stored) per product-variant. I have about 2000 product-variants.

I could indeed replace each multivalued field by a single-valued field
with an id pointing to an external store, where I get the needed fields.
However, this would not change the number of fields in my index
(correct?) and thus wouldn't matter for the big scanning time I'm
seeing. Moreover, it wouldn't matter for the query time either, I guess.

Thanks,
Geert-Jan





2007/12/29, Otis Gospodnetic <ot...@yahoo.com>:
>
> Hi Geert-Jan,
>
> Have you considered storing this data in an external data store and
> not a Lucene index?  In other words, use the Lucene index only to index
> the content you need to search.  Then, when you search this index, just
> pull out the single stored fields, the unique ID for each of the top N
> hits, and use those IDs to pull the actual content for display purposes
> from the external store.  This external store could be an RDBMS, an
> ODBMS, a BDB, etc.  I've worked with very large indices where we
> successfully used BDBs for this purpose.
>
> Otis
>
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Geert-Jan Brits <gb...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Thursday, December 27, 2007 11:44:13 AM
> Subject: Re: big perf-difference between solr-server vs. SolrJ
> req.process(solrserver)
>
> Yeah, that makes sense.
> So, all in all, could scanning all the fields and loading the 10 fields
> add up to cost about the same as, or even more than, performing the
> initial query? (Just making sure.)
>
> I am wondering if the following change to the schema would help in this
> case:
>
> current setup:
> It's possible to have up to 2000 product-variants.
> Each product-variant has:
> - 1 price field (stored / indexed)
> - 1 multivalued field which contains product-variant characteristics
>   (stored / not indexed).
>
> This adds up to the 4000 fields described. Moreover, there are some
> fields on the product level, but these would contribute just a tiny bit
> to the overall scanning / loading costs (about 50 -stored and indexed-
> fields in total).
>
> possible new setup (only the changes):
> - index but don't store the price field.
> - store the price as just another one of the product-variant
>   characteristics in the multivalued product-variant field.
>
> As a result this would bring the maximum number of stored fields back
> from about 4050 to about 2050, thereby roughly halving scanning /
> loading costs while leaving the current query costs intact. Indexing
> costs would increase a bit.
>
> Would you expect the same performance gain?
>
> Thanks,
> Geert-Jan
>
> 2007/12/27, Yonik Seeley <yo...@apache.org>:
> >
> > On Dec 27, 2007 11:01 AM, Britske <gb...@gmail.com> wrote:
> > > After inspecting solrconfig.xml I see that I already have enabled
> > > lazy field loading by:
> > > <enableLazyFieldLoading>true</enableLazyFieldLoading> (I guess it
> > > was enabled by default.)
> > >
> > > Since any query returns about 10 fields (which differ from query to
> > > query), would this mean that only these 10 of the roughly 2000-4000
> > > fields are retrieved / loaded?
> >
> > Yes, but that's not the whole story.
> > Lucene stores all of the fields back-to-back with no index (there is
> > no random access to particular stored fields)... so all of the fields
> > must be at least scanned.
> >
> > -Yonik
> >
>
>
>
>
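As an editorial aside: the schema change proposed in the quoted message (index-but-don't-store the price, folding it into the stored characteristics field) could look roughly like this in schema.xml. The dynamic-field names and types below are hypothetical illustrations, not taken from the thread:

```xml
<!-- Hypothetical dynamic fields, one pair per product-variant:
     an indexed-only price field (no longer stored) and a stored-only
     multivalued characteristics field that now also carries the price. -->
<dynamicField name="price_*"   type="sfloat" indexed="true"  stored="false"/>
<dynamicField name="variant_*" type="string" indexed="false" stored="true"
              multiValued="true"/>
```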




Re: big perf-difference between solr-server vs. SolrJ req.process(solrserver)

Posted by Geert-Jan Brits <gb...@gmail.com>.
Hi Otis,

After some thought (I must have been sleeping or something) it seems
that it is indeed possible to remove the 2000 product-variant fields
from the index and store them in an external store. I was doubting this
option before, as I mistakenly thought that I would still need to have
the 2000 stored fields in place to hold the product-variant keys for
accessing the database. However, I have some way of identifying the
product-variants client-side, once Solr returns the products.

This means, however, that the external datastore must have 1 row per
product-variant. With an upper range of about 200,000 products and up to
2000 product-variants per product, this gives a maximum of 400,000,000
product-variant records in the external datastore. I really don't have a
clue about possible performance given these numbers, but it sounds
rather large to me, although it may sound like peanuts to you ;-). The
query would be to return 10 rows based on 10 product-variant ids. Any
rough guesstimates whether this sounds doable? I guess I'm just going to
find out.
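The "10 rows for 10 product-variant ids" query described above is a simple batched primary-key lookup. As a hedged illustration, an RDBMS-backed store might issue a single `IN (...)` query; the table and column names below are invented for the sketch, and a real implementation would use a `PreparedStatement` with placeholders rather than string concatenation:

```java
import java.util.Arrays;
import java.util.List;

// Builds the kind of batched SQL an RDBMS-backed external store might use
// to fetch N product-variant rows in one round trip. Table/column names
// are hypothetical; use a PreparedStatement in real code.
public class BatchQueryBuilder {

    static String buildBatchQuery(List<Long> variantIds) {
        StringBuilder sb = new StringBuilder(
            "SELECT variant_id, characteristics FROM product_variant WHERE variant_id IN (");
        for (int i = 0; i < variantIds.size(); i++) {
            if (i > 0) sb.append(", ");
            sb.append(variantIds.get(i));
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        // Ten ids per page of results, as in the thread's scenario.
        System.out.println(buildBatchQuery(Arrays.asList(101L, 102L, 103L)));
    }
}
```

With the composite key indexed, 10 point lookups against a 400M-row table is a workload databases handle routinely; row count mostly affects storage and index depth, not per-lookup cost.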

Thanks for helping me think out of the box!

Geert-Jan


Re: big perf-difference between solr-server vs. SolrJ req.process(solrserver)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Otis Gospodnetic wrote:
> Maybe I'm not following your situation 100%, but it sounded like
> pulling the values of purely stored fields is the slow part.
> *Perhaps* using a non-Lucene data store just for the saved fields
> would be faster.

For this purpose Nutch uses external files in Hadoop's MapFile format.
MapFiles offer quick search & get by key (using binary search over an
in-memory index of keys).

The benefit of this solution is that the bulky content is decoupled from 
Lucene indexes, and it can be put in a physically different location 
(e.g. a dedicated page content server).
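The lookup mechanism Andrzej describes can be modeled in a few lines. This is a deliberately simplified stand-in, not the Hadoop API: real Nutch uses `org.apache.hadoop.io.MapFile`, where the values live in an on-disk data file and the in-memory key index is sparse, while here two parallel in-memory arrays play both roles.

```java
import java.util.Arrays;

// Toy model of a MapFile-style lookup: keys are kept sorted, the key
// index is binary-searched, and the value at the matching position is
// returned. In a real MapFile the value would then be read from disk.
public class MapFileSketch {
    private final String[] sortedKeys;
    private final String[] values; // values[i] belongs to sortedKeys[i]

    MapFileSketch(String[] sortedKeys, String[] values) {
        this.sortedKeys = sortedKeys;
        this.values = values;
    }

    String get(String key) {
        int pos = Arrays.binarySearch(sortedKeys, key);
        return pos >= 0 ? values[pos] : null; // null when key is absent
    }

    public static void main(String[] args) {
        MapFileSketch mf = new MapFileSketch(
            new String[] {"doc-001", "doc-002", "doc-007"},
            new String[] {"content A", "content B", "content C"});
        System.out.println(mf.get("doc-002"));
    }
}
```

Binary search gives O(log n) lookups, which is why this layout scales to bulky page-content stores decoupled from the Lucene index.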

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com