Posted to commits@stanbol.apache.org by rw...@apache.org on 2013/09/03 07:55:17 UTC

svn commit: r1519565 - in /stanbol/trunk/enhancement-engines/lucenefstlinking: README.md src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngineComponent.java

Author: rwesten
Date: Tue Sep  3 05:55:17 2013
New Revision: 1519565

URL: http://svn.apache.org/r1519565
Log:
STANBOL-1128: Fixed an NPE if no default FST configuration was present; corrected some errors in the README; changed the ordering of the config properties.

Modified:
    stanbol/trunk/enhancement-engines/lucenefstlinking/README.md
    stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngineComponent.java

Modified: stanbol/trunk/enhancement-engines/lucenefstlinking/README.md
URL: http://svn.apache.org/viewvc/stanbol/trunk/enhancement-engines/lucenefstlinking/README.md?rev=1519565&r1=1519564&r2=1519565&view=diff
==============================================================================
--- stanbol/trunk/enhancement-engines/lucenefstlinking/README.md (original)
+++ stanbol/trunk/enhancement-engines/lucenefstlinking/README.md Tue Sep  3 05:55:17 2013
@@ -40,7 +40,11 @@ Used Solr indexes need also confirm to t
 The SolrTextTagger README provides an example for a Field Analyzer configuration that does work. To make things easier this engine includes this [XML file](fst_field_types.xml) that includes a schema.xml fragment with FST tagging compatible configurations for most languages supported by Solr.
 
 
-### Field Name Encoding 
+### Solr Index Layout Configuration
+
+This part of the configuration specifies the layout of the Solr index in use, that is, how Entity information is stored in the Solr index.
+
+#### Field Name Encoding 
 
 The Field Name Encoding configuration `enhancer.engines.linking.solrfst.fieldEncoding` specifies how Solr fields for multiple languages are encoded. As an example, a Vocabulary with labels in multiple languages might use "en_label" for the English language labels and "de_label" for the German language labels. In this case users should set this property to `UnderscorePrefix` and simply use "label" when configuring the FST field name. 
 
@@ -60,7 +64,7 @@ This is the full list of supported Field
 * AtSuffix: {field}@{lang} (e.g. "name@en")
 * None: In this case no prefix/suffix rewriting of configured `field` and `store` values is done. This means that the FST Configuration MUST define the exact field names in the Solr index for every configured language.
 
-### FST Tagging Configuration
+#### FST Tagging Configuration
 
 The FST Tagging Configuration `enhancer.engines.linking.solrfst.fstconfig` defines several things:
 
@@ -95,7 +99,12 @@ This would set the index field to "fise:
 
     *;field=fise:fstTagging;stored=rdfs:label;generate=true
 
-__Runtime FST generation Thread Pool__
+#### Additional Entity Information
+
+* __Entity Type Field__ _(enhancer.engines.linking.solrfst.typeField)_: This field specifies the Solr field name holding the type information of Entities. If 'SolrYard' is used as _Field Name Encoding_ one can use the QNAME of the property (typically 'rdf:type'). Otherwise the value must be the exact field name holding the type information. Values are expected to be URIs.
+* __Entity Ranking Field__ _(enhancer.engines.linking.solrfst.rankingField)_: This is an __ADDITIONAL__ property used to configure the name of the Field storing the floating point value of the ranking for the Entity. Entities with higher ranking will get a slightly better `fise:confidence` value if labels of several Entities do match the text.
+
+### Runtime FST generation Thread Pool
 
 The `enhancer.engines.linking.solrfst.fstThreadPoolSize` parameter can be used to configure the size of the thread pool used for the runtime generation of FST models. The default size of the thread pool is `1`. Threads do use the lowest possible priority to reduce the performance impact on enhancements as much as possible.
 
@@ -103,6 +112,7 @@ When configuring the size of the thread 
 
 _NOTE_ that the `generate` parameter of the FST Tagging Configuration needs to be set to `true` to enable runtime generation.
 
+
 ### Entity Cache Configuration
 
 While FST tagging is fully done in-memory the FST linking engine needs to read information of matching Entities from the Solr index. This requires disc IO and is typically the part of the process that consumes the most time. The Entity Cache tries to prevent such disc level IO by caching SolrDocuments containing only fields required for the linking process (labels, types and (if available) entity rankings).  To further reduce memory requirements only labels in languages requested by processed ContentItems are stored in the cache. The Cache uses the LRU semantic and is based on the Solr cache implementation.
@@ -120,11 +130,9 @@ For now this engine uses the exact same 
 
 The Entity Linking Configuration of this Engine is very similar to the one for the [EntityLinking engine](http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entitylinking#entity-linker-configuration). The configuration uses the exact same keys, but it does not support all properties and some have a slightly different meaning. In the following only the differences are described. For everything else please refer to the linked section of the documentation of the EntityLinking engine.
 
-
-* <s>__Label Field__ _(enhancer.engines.linking.labelField)_</s>: The label field is __IGNORED__ as the field holding the labels is anyway provided by the FST Tagging configuration. That means that the field defined by the _stored_ parameter is used. If the _stored_ parameter is not present it fallbacks to the _field_ parameter.
-* __Type Field__ _(enhancer.engines.linking.typeField)_: This must be the name of the Solr field holding the Entity type information. In case 'SolrYard' is used as _Field Name Encoding_ one can use the the QNAME of the property (typically 'rdf:type')
+* <s>__Label Field__ _(enhancer.engines.linking.labelField)_</s>: The label field is __IGNORED__ as the field holding the labels is already provided by the [FST Tagging Configuration]. That means that the field defined by the _stored_ parameter is used. If the _stored_ parameter is not present it falls back to the _field_ parameter.
+* <s>__Type Field__ _(enhancer.engines.linking.typeField)_</s>: This configuration gets __IGNORED__ in favor of the `enhancer.engines.linking.solrfst.typeField`. See the [Additional Entity Information] section for details. 
 * <s>__Redirect Field__ _(enhancer.engines.linking.redirectField)_</s>: Not implemented. __NOTE__: This might not be possible to implement efficiently, as redirects would already need to be considered when building the FST models.
-* __Entity Ranking Field__ _(enhancer.engines.linking.solrfst.rankingField)_: This is an __ADDITIONAL__ property used to configure the name of the Field storing the floating point value of the ranking for the Entity. Entities with higher ranking will get a slightly better `fise:confidence` value if labels of several Entities do match the text.
 * <s>__Use EntityRankings__ _(enhancer.engines.linking.useEntityRankings)_</s>: This configuration gets __IGNORED__. EntityRanking based sorting is enabled as soon as the _Entity Ranking Field_ is configured.
 * <s>__Lemma based Matching__ _(enhancer.engines.linking.lemmaMatching)_</s>: Not Yet implemented
 * <s>__Min Match Score__ _(enhancer.engines.linking.minMatchScore)_</s>: Not Yet Implemented. Currently all linked Entities are added regardless of their score. However the way the Tagging is done makes it very unlikely to have suggestions with `fise:confidence` values less than 0.5.
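
Editor's note on the Field Name Encoding described in the README hunks above: the following is a minimal sketch, not the engine's code, of how a configured field name and a language could be combined into a Solr field name. The class, enum and method names are hypothetical. Only encodings whose examples appear in this diff are covered, and the `AtSuffix` pattern is inferred from its "name@en" example.

```java
public class FieldNameEncodingSketch {

    /** Hypothetical subset of the encodings named in the README. */
    public enum Encoding { UNDERSCORE_PREFIX, AT_SUFFIX, NONE }

    /**
     * Builds the Solr field name for a configured field and a language.
     * For NONE the configured field is returned unchanged, because the
     * configuration must then already contain the exact Solr field name.
     */
    public static String encode(Encoding encoding, String field, String lang) {
        switch (encoding) {
            case UNDERSCORE_PREFIX:
                return lang + "_" + field;   // e.g. "en_label"
            case AT_SUFFIX:
                return field + "@" + lang;   // e.g. "name@en"
            default:
                return field;                // no rewriting
        }
    }

    public static void main(String[] args) {
        System.out.println(encode(Encoding.UNDERSCORE_PREFIX, "label", "en"));
        System.out.println(encode(Encoding.UNDERSCORE_PREFIX, "label", "de"));
        System.out.println(encode(Encoding.AT_SUFFIX, "name", "en"));
        System.out.println(encode(Encoding.NONE, "label", "en"));
    }
}
```

With `UnderscorePrefix`, a single configured field "label" thus maps to "en_label" and "de_label", which is why the README asks users to configure only the bare field name in that case.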

Modified: stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngineComponent.java
URL: http://svn.apache.org/viewvc/stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngineComponent.java?rev=1519565&r1=1519564&r2=1519565&view=diff
==============================================================================
--- stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngineComponent.java (original)
+++ stanbol/trunk/enhancement-engines/lucenefstlinking/src/main/java/org/apache/stanbol/enhancer/engines/lucenefstlinking/FstLinkingEngineComponent.java Tue Sep  3 05:55:17 2013
@@ -143,26 +143,24 @@ import com.google.common.util.concurrent
             name="AtSuffix")
         },value="SolrYard"),
     @Property(name=FstLinkingEngineComponent.FST_CONFIG, cardinality=Integer.MAX_VALUE),
+    @Property(name=FstLinkingEngineComponent.SOLR_TYPE_FIELD, value="rdf:type"),
+    @Property(name=FstLinkingEngineComponent.SOLR_RANKING_FIELD, value="entityhub:entityRank"),
+//  @Property(name=REDIRECT_FIELD,value="rdfs:seeAlso"),
+//  @Property(name=REDIRECT_MODE,options={
+//      @PropertyOption(
+//          value='%'+REDIRECT_MODE+".option.ignore",
+//          name="IGNORE"),
+//      @PropertyOption(
+//          value='%'+REDIRECT_MODE+".option.addValues",
+//          name="ADD_VALUES"),
+//      @PropertyOption(
+//              value='%'+REDIRECT_MODE+".option.follow",
+//              name="FOLLOW")
+//      },value="IGNORE"),
     @Property(name=FstLinkingEngineComponent.FST_THREAD_POOL_SIZE,
         intValue=FstLinkingEngineComponent.DEFAULT_FST_THREAD_POOL_SIZE),
     @Property(name=FstLinkingEngineComponent.ENTITY_CACHE_SIZE, 
         intValue=FstLinkingEngineComponent.DEFAULT_ENTITY_CACHE_SIZE),
-    @Property(name=FstLinkingEngineComponent.SOLR_TYPE_FIELD, value="rdf:type"),
-    @Property(name=FstLinkingEngineComponent.SOLR_RANKING_FIELD, value="entityhub:entityRank"),
-//    @Property(name=REDIRECT_FIELD,value="rdfs:seeAlso"),
-//    @Property(name=REDIRECT_MODE,options={
-//        @PropertyOption(
-//            value='%'+REDIRECT_MODE+".option.ignore",
-//            name="IGNORE"),
-//        @PropertyOption(
-//            value='%'+REDIRECT_MODE+".option.addValues",
-//            name="ADD_VALUES"),
-//        @PropertyOption(
-//                value='%'+REDIRECT_MODE+".option.follow",
-//                name="FOLLOW")
-//        },value="IGNORE"),
-    @Property(name=TYPE_FIELD,value="rdf:type"),
-    @Property(name=ENTITY_TYPES,cardinality=Integer.MAX_VALUE),
     @Property(name=SUGGESTIONS, intValue=DEFAULT_SUGGESTIONS),
     @Property(name=CASE_SENSITIVE,boolValue=DEFAULT_CASE_SENSITIVE_MATCHING_STATE),
     @Property(name=PROCESS_ONLY_PROPER_NOUNS_STATE, boolValue=DEFAULT_PROCESS_ONLY_PROPER_NOUNS_STATE),
@@ -172,6 +170,7 @@ import com.google.common.util.concurrent
                "es;lc=Noun", //the OpenNLP POS tagger for Spanish does not support ProperNouns
                "nl;lc=Noun"}), //same for Dutch 
     @Property(name=DEFAULT_MATCHING_LANGUAGE,value=""),
+    @Property(name=ENTITY_TYPES,cardinality=Integer.MAX_VALUE),
     @Property(name=TYPE_MAPPINGS,cardinality=Integer.MAX_VALUE, value={
         "dbp-ont:Organisation; dbp-ont:Newspaper; schema:Organization > dbp-ont:Organisation",
         "dbp-ont:Person; foaf:Person; schema:Person > dbp-ont:Person",
@@ -709,8 +708,14 @@ public class FstLinkingEngineComponent {
         log.info(" - default config");
         Map<String,String> defaultParams = fstConfig.getDefaultParameters();
         String fstName = defaultParams.get(PARAM_FST);
-        final String indexField = defaultParams.get(PARAM_FIELD);
-        final String storeField = defaultParams.get(PARAM_STORE_FIELD);
+        String indexField = defaultParams.get(PARAM_FIELD);
+        if(indexField == null){ //apply the defaults if null
+            indexField = DEFAULT_FIELD;
+        }
+        String storeField = defaultParams.get(PARAM_STORE_FIELD);
+        if(storeField == null){ //apply the defaults if null
+            storeField = indexField;
+        }
         if(fstName == null){ //use default
             fstName = getDefaultFstFileName(indexField);
         }
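
Editor's note on the last hunk: the fallback order it introduces (index field, then store field defaulting to the index field) can be illustrated in isolation. The sketch below is an assumption-laden stand-in, not the component's code: the parser, the class name and the `DEFAULT_FIELD` value are hypothetical (the real constant's value is not shown in this diff), and the parameter keys `field` and `stored` are taken from the README example `*;field=fise:fstTagging;stored=rdfs:label;generate=true`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FstConfigDefaultsSketch {

    // Placeholder only: the real DEFAULT_FIELD value is not part of this diff.
    public static final String DEFAULT_FIELD = "hypothetical:defaultField";

    /** Parses "lang;key=value;..." into a map; the language is stored under "lang". */
    public static Map<String, String> parse(String line) {
        Map<String, String> params = new LinkedHashMap<>();
        String[] parts = line.split(";");
        params.put("lang", parts[0].trim());
        for (int i = 1; i < parts.length; i++) {
            int sep = parts[i].indexOf('=');
            if (sep > 0) { // skip fragments without a key=value shape
                params.put(parts[i].substring(0, sep).trim(),
                           parts[i].substring(sep + 1).trim());
            }
        }
        return params;
    }

    /** Index field: the configured value, or the default if none is configured. */
    public static String indexField(Map<String, String> params) {
        String field = params.get("field");
        return field == null ? DEFAULT_FIELD : field;
    }

    /** Store field: the configured value, or the index field if none is configured. */
    public static String storeField(Map<String, String> params) {
        String stored = params.get("stored");
        return stored == null ? indexField(params) : stored;
    }

    public static void main(String[] args) {
        Map<String, String> full = parse("*;field=fise:fstTagging;stored=rdfs:label;generate=true");
        System.out.println(indexField(full) + " / " + storeField(full));

        Map<String, String> bare = parse("*;generate=true");
        System.out.println(indexField(bare) + " / " + storeField(bare));
    }
}
```

Resolving both values before they are used means later code, such as the default FST file name lookup in the hunk above, never receives `null` for the index field.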