Posted to dev@lucene.apache.org by "Agnieszka (Created) (JIRA)" <ji...@apache.org> on 2012/03/14 13:06:38 UTC

[jira] [Created] (SOLR-3245) Poor performance of Hunspell with Polish Dictionary

Poor performance of Hunspell with Polish Dictionary
---------------------------------------------------

                 Key: SOLR-3245
                 URL: https://issues.apache.org/jira/browse/SOLR-3245
             Project: Solr
          Issue Type: Bug
          Components: Schema and Analysis
    Affects Versions: 4.0
         Environment: Centos 6.2, kernel 2.6.32, 2 physical CPU Xeon 5606 (4 cores each), 32 GB RAM, 2 SSD disks in RAID 0, java version 1.6.0_26, java settings -server -Xms4096M -Xmx4096M 
            Reporter: Agnieszka


In Solr 4.0, the Hunspell stemmer with a Polish dictionary performs very poorly, whereas the Hunspell implementation from http://code.google.com/p/lucene-hunspell/ in Solr 3.4 performs very well.

Tests show:

Solr 3.4, full import 489017 documents:

StempelPolishStemFilterFactory -  2908 seconds, 168 docs/sec 
HunspellStemFilterFactory - 3922 seconds, 125 docs/sec

Solr 4.0, full import 489017 documents:

StempelPolishStemFilterFactory - 3016 seconds, 162 docs/sec 
HunspellStemFilterFactory - 44580 seconds (more than 12 hours), 11 docs/sec
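
The docs/sec figures above are the document count divided by the elapsed time. The following Python sketch (not part of the original report; the dictionary layout and names are only illustrative) reproduces them and shows how much slower the Hunspell import became between the two versions:

```python
# Document count and elapsed seconds are taken from the figures reported above.
DOCS = 489017

runs = {
    ("3.4", "Stempel"): 2908,
    ("3.4", "Hunspell"): 3922,
    ("4.0", "Stempel"): 3016,
    ("4.0", "Hunspell"): 44580,
}


def docs_per_sec(seconds: int) -> float:
    """Average indexing throughput for a full import."""
    return DOCS / seconds


rates = {key: docs_per_sec(sec) for key, sec in runs.items()}
for (version, stemmer), rate in rates.items():
    print(f"Solr {version} {stemmer}: {rate:.0f} docs/sec")

# Slowdown of Hunspell between the two versions (ratio of elapsed times).
hunspell_slowdown = runs[("4.0", "Hunspell")] / runs[("3.4", "Hunspell")]
print(f"Hunspell 4.0 vs 3.4: {hunspell_slowdown:.1f}x slower")
```

By the same calculation, Stempel is nearly unchanged between versions (3016 vs 2908 seconds), so the regression appears specific to the Hunspell filter.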

My schema is quite simple. For Hunspell I have one text field, into which I copy 14 text fields:
{code:xml}
<field name="text" type="text_pl_hunspell" indexed="true" stored="false" multiValued="true"/>

<copyField source="field1" dest="text"/>  
....
<copyField source="field14" dest="text"/>
{code}


The "text_pl_hunspell" configuration:

{code:xml}
<fieldType name="text_pl_hunspell" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="dict/stopwords_pl.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.HunspellStemFilterFactory" dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
        <!--filter class="solr.KeywordMarkerFilterFactory" protected="protwords_pl.txt"/-->
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="dict/stopwords_pl.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.HunspellStemFilterFactory" dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
      </analyzer>
    </fieldType>
{code}

I use a Polish dictionary (pl_PL.dic, pl_PL.aff); the files stopwords_pl.txt, protwords_pl.txt, and synonyms_pl.txt are empty. These are the same files I used with version 3.4.

For the Polish stemmer (Stempel), the only difference is in the definition of the text field:
{code:xml}
<field name="text" type="text_pl" indexed="true" stored="false" multiValued="true"/>

    <fieldType name="text_pl" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="dict/stopwords_pl.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StempelPolishStemFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="dict/stopwords_pl.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StempelPolishStemFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
      </analyzer>
    </fieldType>
{code}
One document has 23 fields:
- 14 text fields copied to one text field (above) that is only indexed
- 8 other indexed fields (2 strings, 2 tdates, 3 tint, 1 tfloat)

The size of one document is 3-4 kB.
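
To put the document description in perspective, here is a rough Python sketch (not part of the original report) that estimates the corpus size and the implied text throughput of the slow Hunspell run. It assumes that the 23rd field is the combined copyField target, that 1 MB = 1000 kB, and that the whole 3-4 kB of each document passes through the analyzer, which overstates the real analyzed volume somewhat:

```python
DOCS = 489017                 # documents in the full import (from the report)
DOC_KB_RANGE = (3, 4)         # reported size of one document, in kB
HUNSPELL_40_SECONDS = 44580   # Solr 4.0 Hunspell import time (from the report)

# Assumed breakdown: 14 source text fields + 1 copyField target + 8 other fields.
total_fields = 14 + 1 + 8

# Approximate corpus size in MB, assuming 1 MB = 1000 kB.
corpus_mb = tuple(DOCS * kb / 1000 for kb in DOC_KB_RANGE)

# Implied throughput in kB/s for the slow Hunspell run.
kb_per_sec = tuple(DOCS * kb / HUNSPELL_40_SECONDS for kb in DOC_KB_RANGE)

print(f"fields per document: {total_fields}")
print(f"corpus size: {corpus_mb[0]:.0f}-{corpus_mb[1]:.0f} MB")
print(f"Hunspell 4.0 throughput: {kb_per_sec[0]:.0f}-{kb_per_sec[1]:.0f} kB/s")
```

Under these assumptions the whole corpus is under 2 GB, so a 12-hour import corresponds to only a few tens of kB of text per second.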





--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Comment Edited] (SOLR-3245) Poor performance of Hunspell with Polish Dictionary

Posted by "Romain MERESSE (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481277#comment-13481277 ] 

Romain MERESSE edited comment on SOLR-3245 at 10/22/12 9:51 AM:
----------------------------------------------------------------

Same problem here, with a French dictionary in Solr 3.6.

With Hunspell: ~5 documents/s
Without Hunspell: ~280 documents/s

Has anyone found a solution?
This is quite sad, as it is a very important feature (stemming is poor with Snowball).
                
      was (Author: rohk):
    Same problem here, with a French dictionary in Solr 3.6.

With Hunspell: ~5 documents/s
Without Hunspell: ~280 documents/s

Has anyone found a solution?
                  
> Poor performance of Hunspell with Polish Dictionary
> ---------------------------------------------------
>
>                 Key: SOLR-3245
>                 URL: https://issues.apache.org/jira/browse/SOLR-3245
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 4.0-ALPHA
>         Environment: Centos 6.2, kernel 2.6.32, 2 physical CPU Xeon 5606 (4 cores each), 32 GB RAM, 2 SSD disks in RAID 0, java version 1.6.0_26, java settings -server -Xms4096M -Xmx4096M 
>            Reporter: Agnieszka
>              Labels: performance
>         Attachments: pl_PL.zip


[jira] [Commented] (SOLR-3245) Poor performance of Hunspell with Polish Dictionary

Posted by "Romain MERESSE (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481277#comment-13481277 ] 

Romain MERESSE commented on SOLR-3245:
--------------------------------------

Same problem here, with a French dictionary in Solr 3.6.

With Hunspell: ~5 documents/s
Without Hunspell: ~280 documents/s

Has anyone found a solution?
                


[jira] [Updated] (SOLR-3245) Poor performance of Hunspell with Polish Dictionary

Posted by "Agnieszka (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Agnieszka updated SOLR-3245:
----------------------------

    Attachment: pl_PL.zip

Polish dictionary for Hunspell
                


[jira] [Commented] (SOLR-3245) Poor performance of Hunspell with Polish Dictionary

Posted by "Agnieszka (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229180#comment-13229180 ] 

Agnieszka commented on SOLR-3245:
---------------------------------

I ran one more test for Hunspell with an English dictionary (from OpenOffice.org) in Solr 4.0. It seems that the problem does not occur with the English dictionary.

Solr 4.0, full import 489017 documents, Hunspell, English dictionary:

3146 seconds, 155 docs/sec


However, I'm not sure this result is reliable, because I used documents with Polish text to test the English dictionary.
                
