Posted to solr-user@lucene.apache.org by Pravin Agrawal <Pr...@persistent.co.in> on 2012/11/22 13:53:14 UTC

Performance improvement for solr faceting on large index

Hi All,

We are using solr 3.4 with following schema fields.

<schema.xml>---------------------------------------------------------------------------------------

<fieldType name="autosuggest_text" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="true"/>
                <filter class="solr.PatternReplaceFilterFactory" pattern="^([0-9. ])*$" replacement=""
                    replace="all"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
        </fieldType>

<field name="id" type="string" stored="true" indexed="true"/>
<field name="autoSuggestContent" type="autosuggest_text" stored="true" indexed="true" multiValued="true"/>
        <copyField source="content" dest="autoSuggestContent"/>
        <copyField source="original_title" dest="autoSuggestContent"/>

<field name="content" type="text" stored="true" indexed="true"/>
<field name="original_title" type="text" stored="true" indexed="true"/>
<field name="site" type="site" stored="false" indexed="true"/>

</schema.xml>---------------------------------------------------------------------------------------

The index built on the above schema is distributed across two Solr shards, each holding about 1.2 million documents and about 195GB on disk.

We want to retrieve (site, autoSuggestContent term, frequency of the term) tuples from our main Solr index above. The site field in each document holds the name of the site to which that document belongs. The terms come from the multivalued autoSuggestContent field, which is built from shingles of the page content and title (for example, "apache solr faceting" produces the tokens "apache", "solr", "faceting", "apache solr", "solr faceting" and "apache solr faceting").

As of now, we run a facet query per site to retrieve (term, frequency of term) pairs. Below is a sample query (you may ignore the initial part of the query), followed by a rough SolrJ sketch of the per-site loop for illustration.

http://localhost:8080/solr/select?indent=on&q=*:*&fq=site:www.abc.com&start=0&rows=0&fl=id&qt=dismax&facet=true&facet.field=autoSuggestContent&facet.mincount=25&facet.limit=-1&facet.method=enum&facet.sort=index
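
For illustration, a minimal SolrJ sketch of this per-site loop (the Solr URL and site names are placeholders and the real run covers all ~1500 sites; our actual client code differs, but the request parameters are the same as in the URL above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SiteTermFrequencies {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and site list; the real run iterates over ~1500 sites.
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr");
        String[] sites = { "www.abc.com", "www.xyz.com" };

        for (String site : sites) {
            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery("site:" + site);      // restrict counts to one site
            q.setRows(0);                          // only the facet counts are needed
            q.setFacet(true);
            q.addFacetField("autoSuggestContent");
            q.setFacetMinCount(25);
            q.setFacetLimit(-1);                   // every term above the mincount
            q.set("facet.method", "enum");
            q.set("facet.sort", "index");

            QueryResponse rsp = solr.query(q);
            FacetField ff = rsp.getFacetField("autoSuggestContent");
            for (FacetField.Count c : ff.getValues()) {
                // one (site, term, frequency) tuple per line
                System.out.println(site + "\t" + c.getName() + "\t" + c.getCount());
            }
        }
    }
}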

The problem is that as the index grows, this method has started taking a very long time. It used to take 7 minutes per site with an index of 0.4 million docs, but now takes around 60-90 minutes with an index of 2.5 million docs. At this speed it will take around 5-6 days to process all 1500 sites. We also expect the index to keep growing, with more documents and more sites, so the time to extract this information will only increase further.

Please let us know if there is a better way to extract the (site, term, frequency) information than the current method.

Thanks,
Pravin Agrawal




DISCLAIMER
==========
This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.

Re: Performance improvement for solr faceting on large index

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi Pravin,

Those unigrams... how are you using them?  What are the queries like?
I wonder if it's the (probably) massive number of terms in your index
that's the problem.

When queries are in flight and your CPU is 100% busy, do a few thread dumps
(kill -3 PID) and look where the threads are.  That will point you in the
right direction.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html





RE: Performance improvement for solr faceting on large index

Posted by Pravin Agrawal <Pr...@persistent.co.in>.
Thanks Yuval and Otis for the reply.

Yuval: I tried different combinations of facet.method (fc and enum) and filterCache size, but there was not much improvement in processing time.

Otis: We plan to move this processing out of Solr in the future, but that would be a large code change at this point in time.
I know that outputting unigrams can be expensive, but we need to keep them :(.
The Solr server has 128GB of memory, of which we have assigned 64GB to Solr. We observed that the Solr threads use 100% CPU while a request is being processed.
We are trying to split the index further across 4 shards to reduce the index size per shard.

A few more questions: since we have a large number of unique terms in our index, is facet.method=fc or enum better in that case?
And can a large facet.enum.cache.minDf value help?
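If it could help, we would simply add that parameter to the existing request; the value below is only a guess:

http://localhost:8080/solr/select?indent=on&q=*:*&fq=site:www.abc.com&start=0&rows=0&fl=id&qt=dismax&facet=true&facet.field=autoSuggestContent&facet.mincount=25&facet.limit=-1&facet.method=enum&facet.sort=index&facet.enum.cache.minDf=50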


Thanks,
Pravin Agrawal


Re: Performance improvement for solr faceting on large index

Posted by Yuval Dotan <yu...@gmail.com>.
You could always try the fc facet method and maybe increase the filterCache size.
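
For reference, the filterCache is configured in solrconfig.xml with an entry roughly like this (the sizes are only illustrative and should be tuned to your heap and the number of distinct filters):

<filterCache class="solr.FastLRUCache"
             size="4096"
             initialSize="1024"
             autowarmCount="128"/>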


Re: Performance improvement for solr faceting on large index

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

I don't quite follow what you are trying to do, but it almost sounds
like you may be better off using something other than Solr if all you are
doing is filtering by site and counting something.
I see unigrams in what looks like it could be a big field, and that's a red
flag.
Your index is quite big - how much memory have you got?  Do those queries
produce a lot of disk IO? I have a feeling they do. If so, your shards may
be too large for your hardware.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm