Posted to commits@stanbol.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2013/10/17 15:45:20 UTC

[jira] [Resolved] (STANBOL-1153) Improve Solr schema used by the Entityhub SolrYard

     [ https://issues.apache.org/jira/browse/STANBOL-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rupert Westenthaler resolved STANBOL-1153.
------------------------------------------

    Resolution: Fixed

Fixed with http://svn.apache.org/r1521875

> Improve Solr schema used by the Entityhub SolrYard
> --------------------------------------------------
>
>                 Key: STANBOL-1153
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1153
>             Project: Stanbol
>          Issue Type: Improvement
>          Components: Entityhub
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>              Labels: SolrYard
>
> While working on STANBOL-1128 and Issue 10 of SolrTextTagger [1] I noticed that the current default Solr schema used by the Entityhub SolrYard could be improved in several ways.
> Here is the list of improvements:
> * Some languages use the solr.StandardTokenizerFactory together with the solr.WordDelimiterFilterFactory. The WordDelimiterFilter should always be used in combination with the WhitespaceTokenizer.
> * The solr.WordDelimiterFilterFactory configuration is not optimal for EntityLinking. It should be changed to
>     * splitOnCaseChange="0": For EntityLinking "PowerShot" should not be split into "Power", "Shot".
>     * splitOnNumerics="0": The same is true for "j2se". We do not want to suggest this for "j 2 se".
>     * stemEnglishPossessive="1": Removing the trailing 's from words is OK, even for languages other than English.
>     * generateWordParts="1": Splitting "Wi-Fi" into "Wi Fi" should improve EntityLinking results. Maybe not for "Wi-Fi", but for "Mercedes-Entwicklungsleiter". Note that because splitOnCaseChange="0", words such as "PowerShot" will still not be split.
>     * generateNumberParts="1": Splitting "500-42" is OK. Users should rather decide whether they would like to link number tokens of the text.
>     * catenateWords="1": Concatenation of words can only improve linking results, so all catenate* properties should be enabled for indexing. Disabled for query.
>     * catenateNumbers="1": Disabled for query.
>     * catenateAll="1": Disabled for query.
>     * preserveOriginal="1": Activated for indexing (e.g. to keep punctuation marks in labels) but deactivated for query! Otherwise Entities at the end of sentences could be ignored because of punctuation included in the token.
> * Apply solr.ElisionFilterFactory after the WordDelimiterFilter. This might cause slower phrase queries, but has the advantage that fields are compatible with FST linking (SolrTextTagger).
> * NOT using solr.EnglishPossessiveFilterFactory and solr.HyphenatedWordsFilterFactory as those do not provide additional functionality if WordDelimiterFilter is present.
> * NOT enabling enablePositionIncrements for the stop word filter (solr.StopFilterFactory), as posInc > 1 is not compatible with the SolrTextTagger library used by the FST linking engine.
> * Enable norms for all fields (including types other than String and Text): as the Entityhub SolrYard supports index-time boosts, norms can be used to sort results based on the popularity of Entities.
> [1] https://github.com/OpenSextant/SolrTextTagger/issues/10
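
For illustration, the WordDelimiterFilter settings listed in the issue could be written as a Solr field type roughly like the one below. This is a minimal sketch and not the schema committed in r1521875: the field type name is hypothetical, the attribute values are copied from the list above, and "Disabled for query" is read here as setting the catenate* and preserveOriginal options to "0" in the query analyzer. Lowercasing, stop word and other language-specific filters are left out.

```xml
<!-- Sketch only: "text_linking_sketch" is a hypothetical name, not one from the SolrYard schema. -->
<fieldType name="text_linking_sketch" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <!-- WhitespaceTokenizer instead of StandardTokenizer, as the issue recommends -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0" splitOnNumerics="0"
            stemEnglishPossessive="1"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            preserveOriginal="1"/>
    <!-- for languages that need it, solr.ElisionFilterFactory would follow here -->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- same splitting rules, but catenate* and preserveOriginal switched off -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0" splitOnNumerics="0"
            stemEnglishPossessive="1"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            preserveOriginal="0"/>
  </analyzer>
</fieldType>
```

With omitNorms="false" the field keeps norms, which is what the last point of the issue relies on for index-time boosts.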



--
This message was sent by Atlassian JIRA
(v6.1#6144)