Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2014/03/11 03:13:44 UTC
[jira] [Resolved] (SOLR-3245) Poor performance of Hunspell with Polish Dictionary
[ https://issues.apache.org/jira/browse/SOLR-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir resolved SOLR-3245.
-------------------------------
Resolution: Fixed
Fix Version/s: 5.0, 4.8
I've recently fixed several bugs in the Hunspell stemmer for the 4.8 release. I don't know exactly which bug was at work here, but I suspect it mostly came down to correctness issues (LUCENE-5483) that also produced bad stems, which in turn cause poor search results.
I compared the performance of the 4.7 release with the current code in branch_4x (to become 4.8). For the corpus, I used the first 10k news snippets from the Polish corpus available here: http://www.corpora.heliohost.org/
||Version||Indexing Speed (docs/second)||Number of tokens (sumTotalTermFreq)||RAM usage||
|4.7|71.1|635117|50.9MB|
|4.8|909.3|456499|2MB|
So I think the performance issues are fixed. As the drop in token count shows, this Polish dictionary was definitely affected by correctness issues, and the over-recursion no longer happens.
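
For readers who want the relative change rather than the raw figures, here is a minimal sketch that derives the ratios from the table above. The numbers are copied from the table; the dictionary layout and the `compare` helper are illustrative names, not part of Lucene or Solr.

```python
# Figures copied from the benchmark table in this message.
results = {
    "4.7": {"docs_per_sec": 71.1, "tokens": 635117, "ram_mb": 50.9},
    "4.8": {"docs_per_sec": 909.3, "tokens": 456499, "ram_mb": 2.0},
}


def compare(old, new):
    """Return (indexing speedup, token reduction fraction, RAM reduction factor)."""
    speedup = new["docs_per_sec"] / old["docs_per_sec"]
    token_reduction = (old["tokens"] - new["tokens"]) / old["tokens"]
    ram_factor = old["ram_mb"] / new["ram_mb"]
    return speedup, token_reduction, ram_factor


speedup, token_reduction, ram_factor = compare(results["4.7"], results["4.8"])
print(f"indexing speedup: {speedup:.1f}x")        # roughly 12.8x
print(f"fewer tokens emitted: {token_reduction:.1%}")  # roughly 28%
print(f"RAM reduction factor: {ram_factor:.2f}x")
```

The token reduction is the interesting figure for correctness: the same 10k documents produce noticeably fewer tokens in 4.8, which is consistent with the stemmer no longer emitting spurious stems.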
> Poor performance of Hunspell with Polish Dictionary
> ---------------------------------------------------
>
> Key: SOLR-3245
> URL: https://issues.apache.org/jira/browse/SOLR-3245
> Project: Solr
> Issue Type: Bug
> Components: Schema and Analysis
> Affects Versions: 4.0-ALPHA
> Environment: Centos 6.2, kernel 2.6.32, 2 physical CPU Xeon 5606 (4 cores each), 32 GB RAM, 2 SSD disks in RAID 0, java version 1.6.0_26, java settings -server -Xms4096M -Xmx4096M
> Reporter: Agnieszka
> Labels: performance
> Fix For: 4.8, 5.0
>
> Attachments: pl_PL.zip
>
>
> In Solr 4.0, the Hunspell stemmer with a Polish dictionary performs poorly, whereas Hunspell from http://code.google.com/p/lucene-hunspell/ in Solr 3.4 performs very well.
> Tests show:
> Solr 3.4, full import 489017 documents:
> StempelPolishStemFilterFactory - 2908 seconds, 168 docs/sec
> HunspellStemFilterFactory - 3922 seconds, 125 docs/sec
> Solr 4.0, full import 489017 documents:
> StempelPolishStemFilterFactory - 3016 seconds, 162 docs/sec
> HunspellStemFilterFactory - 44580 seconds (more than 12 hours), 11 docs/sec
> My schema is quite simple. For Hunspell, I have one text field to which I copy 14 text fields:
> {code:xml}
> <field name="text" type="text_pl_hunspell" indexed="true" stored="false" multiValued="true"/>
> <copyField source="field1" dest="text"/>
> ....
> <copyField source="field14" dest="text"/>
> {code}
> The "text_pl_hunspell" configuration:
> {code:xml}
> <fieldType name="text_pl_hunspell" class="solr.TextField" positionIncrementGap="100">
> <analyzer type="index">
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.StopFilterFactory"
> ignoreCase="true"
> words="dict/stopwords_pl.txt"
> enablePositionIncrements="true"
> />
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.HunspellStemFilterFactory" dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
> <!--filter class="solr.KeywordMarkerFilterFactory" protected="protwords_pl.txt"/-->
> </analyzer>
> <analyzer type="query">
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.SynonymFilterFactory" synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
> <filter class="solr.StopFilterFactory"
> ignoreCase="true"
> words="dict/stopwords_pl.txt"
> enablePositionIncrements="true"
> />
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.HunspellStemFilterFactory" dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
> <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
> </analyzer>
> </fieldType>
> {code}
> I use a Polish dictionary (pl_PL.dic, pl_PL.aff); the files stopwords_pl.txt, protwords_pl.txt, and synonyms_pl.txt are empty. These are the same files I used with version 3.4.
> For the Stempel Polish stemmer, the only difference is in the definition of the text field:
> {code:xml}
> <field name="text" type="text_pl" indexed="true" stored="false" multiValued="true"/>
> <fieldType name="text_pl" class="solr.TextField" positionIncrementGap="100">
> <analyzer type="index">
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.StopFilterFactory"
> ignoreCase="true"
> words="dict/stopwords_pl.txt"
> enablePositionIncrements="true"
> />
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.StempelPolishStemFilterFactory"/>
> <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
> </analyzer>
> <analyzer type="query">
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.SynonymFilterFactory" synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
> <filter class="solr.StopFilterFactory"
> ignoreCase="true"
> words="dict/stopwords_pl.txt"
> enablePositionIncrements="true"
> />
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.StempelPolishStemFilterFactory"/>
> <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
> </analyzer>
> </fieldType>
> {code}
> One document has 23 fields:
> - 14 text fields copied to one text field (above) that is only indexed
> - 8 other indexed fields (2 strings, 2 tdates, 3 tint, 1 tfloat)
> The size of one document is 3-4 kB.
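
The import timings quoted in the issue description can be cross-checked with a short calculation. This is a minimal sketch: the document count and the times in seconds are taken from the description, while the variable and function names are illustrative only.

```python
TOTAL_DOCS = 489017  # documents in the full import, per the issue description

# Wall-clock import times in seconds, as reported in the issue description.
import_seconds = {
    ("3.4", "Stempel"): 2908,
    ("3.4", "Hunspell"): 3922,
    ("4.0", "Stempel"): 3016,
    ("4.0", "Hunspell"): 44580,
}


def docs_per_second(seconds, total_docs=TOTAL_DOCS):
    """Average indexing throughput over the whole import."""
    return total_docs / seconds


throughput = {key: docs_per_second(sec) for key, sec in import_seconds.items()}

for (version, stemmer), rate in throughput.items():
    print(f"Solr {version} {stemmer}: {rate:.0f} docs/sec")

# How much longer the Hunspell import took in 4.0 than in 3.4.
hunspell_slowdown = import_seconds[("4.0", "Hunspell")] / import_seconds[("3.4", "Hunspell")]
print(f"Hunspell 4.0 vs 3.4: {hunspell_slowdown:.1f}x slower")
```

The computed rates round to the docs/sec values given in the report, and the Hunspell import in 4.0 takes a little over eleven times as long as in 3.4, while Stempel is essentially unchanged between the two versions.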