Posted to solr-user@lucene.apache.org by Agnieszka Kukałowicz <ag...@usable.pl> on 2012/03/12 16:42:31 UTC

RE: solr 3.5 and indexing performance

Hi guys,

I have hit the same problem with Hunspell.
Running a few tests on about 500,000 documents, I got:

Hunspell from http://code.google.com/p/lucene-hunspell/ with Solr 3.4 -
125 documents per second
Hunspell built from 4.0 trunk - 11 documents per second.

All the tests were run on an 8-core machine with 32 GB RAM and the index
on SSD disks.
For Solr 3.5 I tried changing the JVM heap size, ramBufferSizeMB and
mergeFactor, but the indexing speed stayed at about 10-20 documents per
second.
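
For readers unfamiliar with these settings: in Solr 3.x, ramBufferSizeMB and mergeFactor are set in solrconfig.xml, while the heap size is a JVM option (-Xms/-Xmx). A minimal sketch of where they live, with illustrative values that are assumptions and not the ones tried in these tests:

```xml
<!-- solrconfig.xml, Solr 3.x layout; the values below are illustrative assumptions -->
<indexDefaults>
  <!-- RAM used to buffer added documents before they are flushed to disk -->
  <ramBufferSizeMB>128</ramBufferSizeMB>
  <!-- controls how many segments accumulate before they are merged -->
  <mergeFactor>10</mergeFactor>
</indexDefaults>
```

In Solr 4.0 the same elements are placed under <indexConfig> instead of <indexDefaults>.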

Is it possible that there is a performance bug in Solr 4.0? According
to the previous post, the problem also exists in version 3.5.

Best regards
Agnieszka Kukałowicz


> -----Original Message-----
> From: mizayah [mailto:mizayah@gmail.com]
> Sent: Thursday, February 23, 2012 10:19 AM
> To: solr-user@lucene.apache.org
> Subject: Re: solr 3.5 and indexing performance
>
> OK, I found it.
>
> It's because of Hunspell, which is now bundled with Solr. Somehow, when
> I use it on my own in 3.4, it is a lot faster than the one from 3.5.
>
> I don't know what the differences are, but is there any way I can use
> my old Google Hunspell jar?
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/solr-3-5-and-indexing-performance-tp3766653p3769139.html
> Sent from the Solr - User mailing list archive at Nabble.com.

RE: solr 3.5 and indexing performance

Posted by Agnieszka Kukałowicz <ag...@usable.pl>.
Bug ticket created:
https://issues.apache.org/jira/browse/SOLR-3245

I also ran the test you asked for with an English dictionary.
The results are in the ticket.
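
The thread does not show the configuration used for the English test. Presumably only the dictionary and affix attributes of the Hunspell filter need to change; a sketch, assuming hypothetical file names dict/en_US.dic and dict/en_US.aff and a hypothetical field type name text_en_hunspell:

```xml
<!-- Sketch only: the file names and the field type name are assumptions -->
<fieldType name="text_en_hunspell" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- same filter factory as in the Polish setup, pointed at an English dictionary -->
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="dict/en_US.dic" affix="dict/en_US.aff" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Keeping the rest of the analyzer chain minimal makes it easier to attribute any slowdown to the stemmer itself.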

Agnieszka

> -----Original Message-----
> From: Jan Høydahl [mailto:jan.asf@cominvent.com]
> Sent: Wednesday, March 14, 2012 12:54 AM
> To: solr-user@lucene.apache.org
> Subject: Re: solr 3.5 and indexing performance
>
> Hi,
>
> Thanks a lot for your detailed problem description. It definitely is an
> error. Would you be so kind as to register it as a bug ticket, including
> your descriptions from this email?
> http://wiki.apache.org/solr/HowToContribute#JIRA_tips_.28our_issue.2BAC8-bug_tracker.29
> Also, please attach your Polish Hunspell dictionaries to the issue.
> Then we'll try to reproduce the error.
>
> I wonder if this performance decrease is also seen for English
> dictionaries?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>

Re: solr 3.5 and indexing performance

Posted by Jan Høydahl <ja...@cominvent.com>.
Hi,

Thanks a lot for your detailed problem description. It definitely is an error. Would you be so kind as to register it as a bug ticket, including your descriptions from this email? http://wiki.apache.org/solr/HowToContribute#JIRA_tips_.28our_issue.2BAC8-bug_tracker.29. Also, please attach your Polish Hunspell dictionaries to the issue. Then we'll try to reproduce the error.

I wonder if this performance decrease is also seen for English dictionaries?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 13. mars 2012, at 16:42, Agnieszka Kukałowicz wrote:

> Hi,
> 
> I did some more tests for Hunspell in solr 3.4, 4.0:
> 
> Solr 3.4, full import 489017 documents:
> 
> StempelPolishStemFilterFactory -  2908 seconds, 168 docs/sec
> HunspellStemFilterFactory - 3922 seconds, 125 docs/sec
> 
> Solr 4.0, full import 489017 documents:
> 
> StempelPolishStemFilterFactory - 3016 seconds, 162 docs/sec
> HunspellStemFilterFactory - 44580 seconds (more than 12 hours), 11 docs/sec
> 
> Server specification and Java settings are the same as before.
> 
> Cheers
> Agnieszka


RE: solr 3.5 and indexing performance

Posted by Agnieszka Kukałowicz <ag...@usable.pl>.
Hi,

I ran some more tests with Hunspell in Solr 3.4 and 4.0:

Solr 3.4, full import of 489,017 documents:

StempelPolishStemFilterFactory - 2908 seconds, 168 docs/sec
HunspellStemFilterFactory - 3922 seconds, 125 docs/sec

Solr 4.0, full import of 489,017 documents:

StempelPolishStemFilterFactory - 3016 seconds, 162 docs/sec
HunspellStemFilterFactory - 44580 seconds (more than 12 hours), 11 docs/sec

Server specification and Java settings are the same as before.

Cheers
Agnieszka


> -----Original Message-----
> From: Agnieszka Kukałowicz [mailto:agnieszka.kukalowicz@usable.pl]
> Sent: Tuesday, March 13, 2012 10:39 AM
> To: 'solr-user@lucene.apache.org'
> Subject: RE: solr 3.5 and indexing performance

RE: solr 3.5 and indexing performance

Posted by Agnieszka Kukałowicz <ag...@usable.pl>.
Hi,

Yes, I confirmed that without Hunspell indexing runs at normal speed.
I ran the tests in Solr 4.0 with Hunspell and with the Polish stemmer.
With StempelPolishStemFilterFactory the speed is normal.

My schema is quite simple. For Hunspell I have one text field into which I
copy 14 text fields:

<field name="text" type="text_pl_hunspell" indexed="true" stored="false"
multiValued="true"/>


 <copyField source="field1" dest="text"/>
 <copyField source="field2" dest="text"/>
 <copyField source="field3" dest="text"/>
 <copyField source="field4" dest="text"/>
 <copyField source="field5" dest="text"/>
 <copyField source="field6" dest="text"/>
 <copyField source="field7" dest="text"/>
 <copyField source="field8" dest="text"/>
 <copyField source="field9" dest="text"/>
 <copyField source="field10" dest="text"/>
 <copyField source="field11" dest="text"/>
 <copyField source="field12" dest="text"/>
 <copyField source="field13" dest="text"/>
 <copyField source="field14" dest="text"/>

The "text_pl_hunspell" configuration:

<fieldType name="text_pl_hunspell" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="dict/stopwords_pl.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.HunspellStemFilterFactory"
dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
        <!--filter class="solr.KeywordMarkerFilterFactory"
protected="protwords_pl.txt"/-->
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="dict/stopwords_pl.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.HunspellStemFilterFactory"
dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
        <filter class="solr.KeywordMarkerFilterFactory"
protected="dict/protwords_pl.txt"/>
      </analyzer>
    </fieldType>

I use the Polish dictionary files pl_PL.dic and pl_PL.aff (the files
stopwords_pl.txt, protwords_pl.txt and synonyms_pl.txt are empty). These are
the same files I used with version 3.4.

For the Polish stemmer the only difference is in the definition of the text
field:

<field name="text" type="text_pl" indexed="true" stored="false"
multiValued="true"/>

    <fieldType name="text_pl" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="dict/stopwords_pl.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StempelPolishStemFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
protected="dict/protwords_pl.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="dict/stopwords_pl.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StempelPolishStemFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
protected="dict/protwords_pl.txt"/>
      </analyzer>
    </fieldType>

One document has 23 fields:
- 14 text fields copied into one text field (above) that is only indexed
- 8 other indexed fields (2 strings, 2 tdates, 3 tint, 1 tfloat)
The size of one document is 3-4 kB.
So I think this is not a very complicated schema.
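
As a rough illustration of the layout described above, the copy into the single
indexed field might be declared like this in schema.xml. This is only a sketch:
the source field names "title" and "body" and their attributes are hypothetical
stand-ins for the 14 real text fields, and only the "text" field definition
comes from this message.

```xml
<!-- Hypothetical source fields; the real schema has 14 of these. -->
<field name="title" type="string" indexed="false" stored="true"/>
<field name="body"  type="string" indexed="false" stored="true"/>

<!-- Destination field as quoted earlier in this message. -->
<field name="text" type="text_pl" indexed="true" stored="false"
       multiValued="true"/>

<!-- One copyField per source field; the copied value is analyzed
     with the analyzer chain of the destination type (text_pl). -->
<copyField source="title" dest="text"/>
<copyField source="body"  dest="text"/>
```

With this layout, all copied text goes through the text_pl index analyzer, so
the cost of the stemming filter applies to essentially all the text in each
document.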

My environment is:
- Linux, RedHat 6.2, kernel 2.6.32
- 2 physical Xeon 5606 CPUs (4 cores each)
- 32 GB RAM
- 2 SSD disks in RAID 0
- java version:

java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)

- Java is running with -server -Xms4096M -Xmx4096M (I tried many other
settings and always got the same result)
- Solr has the default configuration except ramBufferSizeMB (128 MB)
- Solr 4.0 from nightly builds (2012-02-21 build).
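
For reference, the one non-default setting listed above would look roughly like
this in solrconfig.xml. This is a sketch under assumptions: the enclosing
element is shown as indexConfig, while 3.x-style configs put the same setting
under indexDefaults or mainIndex. Only the value 128 comes from this message.

```xml
<!-- Assumed enclosing element; the name depends on the Solr version. -->
<indexConfig>
  <!-- Amount of RAM (in MB) buffered for added documents before a flush. -->
  <ramBufferSizeMB>128</ramBufferSizeMB>
</indexConfig>
```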

If you need more information, please let me know.
I will also try to use a profiler to see what happens.

Agnieszka


> -----Original Message-----
> From: Jan Høydahl [mailto:jan.asf@cominvent.com]
> Sent: Tuesday, March 13, 2012 9:47 AM
> To: solr-user@lucene.apache.org
> Subject: Re: solr 3.5 and indexing performance
>
> Hi,
>
> Have you confirmed that disabling Hunspell in solrconfig gets you back
> to normal speed?
> What Hunspell configuration and dictionaries do you have?
> Can you share more about your environment and documents?
> Do you have a chance to run a profiler on your Solr instance? Try i.e.
> VisualVM and run the profiler to see what part of the code takes up the
> time
> http://docs.oracle.com/javase/6/docs/technotes/tools/share/jvisualvm.html
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 12. mars 2012, at 16:42, Agnieszka Kukałowicz wrote:
>
> > Hi guys,
> >
> > I have hit the same problem with Hunspell.
> > Doing a few tests for 500 000 documents, I've got:
> >
> > Hunspell from http://code.google.com/p/lucene-hunspell/ with 3.4
> > version -
> > 125 documents per second
> > Build Hunspell from 4.0 trunk - 11 documents per second.
> >
> > All the tests were made on 8 core CPU with 32 GB RAM and index on SSD
> > disks.
> > For Solr 3.5 I've tried to change JVM heap size, rambuffersize,
> > mergefactor but the speed of indexing was about 10 -20 documents per
> > second.
> >
> > Is it possible that there is some performance bug with Solr 4.0?
> > According to previous post the problem exists in 3.5 version.
> >
> > Best regards
> > Agnieszka Kukałowicz
> >
> >
> >> -----Original Message-----
> >> From: mizayah [mailto:mizayah@gmail.com]
> >> Sent: Thursday, February 23, 2012 10:19 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: solr 3.5 and indexing performance
> >>
> >> Ok i found it.
> >>
> >> Its becouse of Hunspell which now is in solr. Somehow when im using
> >> it by myself in 3.4 it is a lot of faster then one from 3.5.
> >>
> >> Dont know about differences, but is there any way i use my old
> Google
> >> Hunspell jar?
> >>
> >> --
> >> View this message in context:
> >> http://lucene.472066.n3.nabble.com/solr-
> >> 3-5-and-indexing-performance-tp3766653p3769139.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.

Re: solr 3.5 and indexing performance

Posted by Jan Høydahl <ja...@cominvent.com>.
Hi,

Have you confirmed that disabling Hunspell in solrconfig gets you back to normal speed?
What Hunspell configuration and dictionaries do you have?
Can you share more about your environment and documents?
Do you have a chance to run a profiler on your Solr instance? Try e.g. VisualVM and run the profiler to see what part of the code takes up the time:
http://docs.oracle.com/javase/6/docs/technotes/tools/share/jvisualvm.html

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 12. mars 2012, at 16:42, Agnieszka Kukałowicz wrote:

> Hi guys,
> 
> I have hit the same problem with Hunspell.
> Doing a few tests for 500 000 documents, I've got:
> 
> Hunspell from http://code.google.com/p/lucene-hunspell/ with 3.4 version -
> 125 documents per second
> Build Hunspell from 4.0 trunk - 11 documents per second.
> 
> All the tests were made on 8 core CPU with 32 GB RAM and index on SSD
> disks.
> For Solr 3.5 I've tried to change JVM heap size, rambuffersize,
> mergefactor but the speed of indexing was about 10 -20 documents per
> second.
> 
> Is it possible that there is some performance bug with Solr 4.0? According
> to previous post the problem exists in 3.5 version.
> 
> Best regards
> Agnieszka Kukałowicz
> 
> 
>> -----Original Message-----
>> From: mizayah [mailto:mizayah@gmail.com]
>> Sent: Thursday, February 23, 2012 10:19 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: solr 3.5 and indexing performance
>> 
>> Ok i found it.
>> 
>> Its becouse of Hunspell which now is in solr. Somehow when im using it
>> by myself in 3.4 it is a lot of faster then one from 3.5.
>> 
>> Dont know about differences, but is there any way i use my old Google
>> Hunspell jar?
>> 
>> --
>> View this message in context: http://lucene.472066.n3.nabble.com/solr-
>> 3-5-and-indexing-performance-tp3766653p3769139.html
>> Sent from the Solr - User mailing list archive at Nabble.com.