Posted to solr-user@lucene.apache.org by ph...@free.fr on 2015/03/27 11:14:48 UTC

Tweaking SOLR memory and cull facet words

Hi,

my SOLR 5 solrconfig.xml file contains the following lines:

<!-- Faceting defaults -->
       <str name="facet">on</str>
       <str name="facet.field">text</str>
       <str name="facet.mincount">100</str>


where the 'text' field contains thousands of words.

When I start Solr, the search engine takes several minutes to index the words in the 'text' field (although loading the browse template later takes only a few seconds, because the 'text' field has already been indexed).

Here are my questions:

- should I increase SOLR's JVM memory to make initial indexing faster?

e.g., SOLR_JAVA_MEM="-Xms1024m -Xmx204800m" in solr.in.sh

- how can I cull facet words according to certain criteria (length, case, etc.)? For instance, my facets are the following:

    application (22427)
    inytapdf0 (22427)
    pdf (22427)
    the (22334)
    new (22131)
    herald (21983)
    york (21975)
    paris (21780)
    a (21692)
    and (21298)
    of (21288)
    i (21247)
    in (21062)
    to (20918)
    on (20899)
    m (20857)
    by (20733)
    de (20664)
    for (20580)
    at (20417)
    with (20371) 
...

Obviously, words such as "the", "i", "to", "m", etc. should not be indexed. Furthermore, I don't care about common nouns; I am only interested in people and location names.


Many thanks.

Philippe






Re: Tweaking SOLR memory and cull facet words

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/27/2015 8:10 AM, phiroc@free.fr wrote:
>> You must send indexing requests to Solr,
> 
> Are you referring to posting <add>....</add> queries to SOLR, or to something else?
> 
>> If you can set up multiple threads or processes...
> 
> How do you do that?

Yes, I am referring to posting requests to the /update handler.
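
For example, here is a minimal Python sketch of building such an <add> message for the /update handler. The document IDs and field values are invented for illustration, and the POST itself is shown only as a comment since it needs a running Solr (the URL and core name "mycore" are assumptions):

```python
import xml.etree.ElementTree as ET

def build_add_xml(docs):
    """Build an <add><doc><field .../></doc></add> message for the /update handler."""
    add = ET.Element("add")
    for d in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in d.items():
            ET.SubElement(doc, "field", name=name).text = str(value)
    return ET.tostring(add, encoding="unicode")

docs = [{"id": "doc1", "text": "New York Herald, Paris edition"}]
payload = build_add_xml(docs)
print(payload)

# After writing the payload to add.xml, index it with a plain HTTP POST, e.g.:
#   curl "http://localhost:8983/solr/mycore/update?commit=true" \
#        -H "Content-Type: text/xml" --data-binary @add.xml
```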

Since you would be writing the program, making it multithreaded or
multi-process is up to you and depends on the features of the language
you are writing it in.
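
As a rough sketch of the multithreaded approach, in Python: the batch size, worker count, and the send_batch stub are all illustrative; a real indexer would POST each batch to /update instead of just counting documents:

```python
from concurrent.futures import ThreadPoolExecutor

def send_batch(batch):
    """Stand-in for the HTTP POST of one batch of docs to Solr's /update handler."""
    # A real indexer would POST here and check the response status.
    return len(batch)

def index_in_parallel(docs, batch_size=100, workers=4):
    """Split docs into batches and send the batches from several threads."""
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send_batch, batches))

docs = [{"id": str(n)} for n in range(1050)]
print(index_in_parallel(docs))  # 1050 docs sent, in 11 batches
```

How much this helps depends on where the bottleneck is; if the source is slow, more threads will not fix it.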

>> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
> 
> Can you update the stopwords.txt file, and then re-index the documents?
> 
> How?

http://wiki.apache.org/solr/HowToReindex
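
The mechanics of reindexing are covered at that link. Purely to illustrate the effect an updated stopwords.txt would have on the facet list, here is a client-side stand-in for what StopFilterFactory does at index time (the counts are copied from the earlier mail, the stopword set is an example):

```python
# Words one might add to stopwords.txt, based on the facet list in the first mail.
stopwords = {"the", "a", "and", "of", "i", "in", "to", "on", "m", "by",
             "de", "for", "at", "with"}

# A few of the facet counts from the original question.
facet_counts = {"the": 22334, "herald": 21983, "york": 21975,
                "paris": 21780, "a": 21692, "and": 21298}

# After reindexing with the stopword filter, only non-stopword terms remain.
filtered = {t: n for t, n in facet_counts.items() if t not in stopwords}
print(sorted(filtered))  # ['herald', 'paris', 'york']
```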

Thanks,
Shawn


Re: Tweaking SOLR memory and cull facet words

Posted by ph...@free.fr.
Hi Shawn,

> You must send indexing requests to Solr,

Are you referring to posting <add>....</add> queries to SOLR, or to something else?

> If you can set up multiple threads or processes...

How do you do that?

> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory

Can you update the stopwords.txt file, and then re-index the documents?

How?

Many thanks.

Philippe






----- Original message -----
From: "Shawn Heisey" <ap...@elyograg.org>
To: solr-user@lucene.apache.org
Sent: Friday, 27 March 2015 14:38:20
Subject: Re: Tweaking SOLR memory and cull facet words

[...]


Re: Tweaking SOLR memory and cull facet words

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/27/2015 4:14 AM, phiroc@free.fr wrote:
> my SOLR 5 solrconfig.xml file contains the following lines:
> [...]
> When I start SOLR, the search engine takes several minutes to index the words in the 'text' field [...]
> 
> - should I increase SOLR's JVM memory to make initial indexing faster?
> 
> - how can I cull facet words according to certain criteria (length, case, etc.)? [...]
> 
> Obviously, words such as "the", "i", "to", "m", etc. should not be indexed. Furthermore, I don't care about "nouns". I am only interested in people and location names.

Starting Solr does not index anything, unless you are talking about one
of the sidecar indexes for spelling correction or suggestions.  You must
send indexing requests to Solr, and if you are experiencing slow
indexing, chances are that it's because of slowness in obtaining data
from the source, not Solr ... or that you are indexing with a single
thread.  If you can set up multiple threads or processes that are
indexing in parallel, it should go faster.

Thousands of terms are not hard for Solr to handle at all.  When the
number of terms gets into the millions or billions, then it starts
becoming a hard problem.

If you use the stopword filter on the index analysis chain for the field
that you are using for facets, then all the stopwords will be removed
from the facets.  That would change how searches work on the field, so
you will probably want to use copyField to create a new field that you
use for faceting.  There are other filters that can do things you have
mentioned, like LengthFilterFactory:

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
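
A sketch of what that copyField plus a filtered analysis chain could look like in schema.xml (the field and type names, the min/max lengths, and the choice of tokenizer are all assumptions to adapt):

```xml
<!-- A separate, unstored field used only for faceting. -->
<field name="text_facet" type="text_facet_type" indexed="true" stored="false" multiValued="true"/>
<copyField source="text" dest="text_facet"/>

<fieldType name="text_facet_type" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LengthFilterFactory" min="3" max="50"/>
  </analyzer>
</fieldType>
```

The search field 'text' keeps its existing analysis, so queries behave as before; only the facet field changes.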

As far as java heap sizing, trial and error is about the only way to
find the right size.

http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap
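
On that note, the -Xmx204800m from the original mail would request a 200 GB heap; -Xmx2048m (2 GB) was probably intended. A solr.in.sh sketch, where the 2 GB figure is only an illustrative starting point for the trial and error described above:

```shell
# solr.in.sh: fixed-size heap; grow it only if GC logs or OOM errors say so.
SOLR_JAVA_MEM="-Xms2048m -Xmx2048m"
```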

Thanks,
Shawn