Posted to dev@stanbol.apache.org by "Lejtovicz, Katalin" <Ka...@oeaw.ac.at> on 2016/01/11 17:44:19 UTC

question on working with custom vocabularies

Dear All,

I have a problem with using custom vocabularies to enhance my content.
I created an index with Stanbol from a vocabulary, deployed the .jar file, and copied the Solr index file to the datafiles folder. Then I created an EntityHub Linking Engine plus a Weighted Chain with the following pipeline: langdetect, opennlp-sentence, opennlp-token, opennlp-pos, opennlp-chunker, and finally the EntityHub Linking Engine for my custom vocabulary.
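
The Weighted Chain configuration itself (set up via the Felix console) looks roughly like this; 'myvocab-chain' and 'myvocab-linking' are only placeholders for the names I gave the chain and the linking engine:

stanbol.enhancer.chain.name = myvocab-chain
stanbol.enhancer.chain.weighted.chain = [langdetect, opennlp-sentence, opennlp-token, opennlp-pos, opennlp-chunker, myvocab-linking]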

It worked fine: when text was pasted into this enhancement chain in the Stanbol user interface, entities were found. However, we had an encoding problem in the RDF resource from which the index was built, so entities with umlauts (e.g. ö, ä) were not found. We corrected the encoding of the RDF and ran the indexing process again with the same config files, but with the new RDF resource.
I deployed everything again (.jar and Solr zip) and created the EntityHub Linking Engine, plus the same Weighted Chain as specified above.
Now I don't get any results when I paste text into the text field of this chain in Stanbol.
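
For completeness, this is roughly how I reproduce the request outside of the web UI (just a sketch; it assumes Stanbol runs on localhost:8080, and 'myvocab-chain' again stands for the name of my chain):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class EnhanceTest {
    public static void main(String[] args) throws Exception {
        // Assumptions: Stanbol on localhost:8080, enhancement chain named "myvocab-chain"
        URL url = new URL("http://localhost:8080/enhancer/chain/myvocab-chain");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "text/plain; charset=UTF-8");
        con.setRequestProperty("Accept", "application/rdf+xml");
        try (OutputStream out = con.getOutputStream()) {
            out.write("Berlin is a big city".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + con.getResponseCode());
        // If linking works, the returned RDF should contain fise:EntityAnnotation
        // resources pointing to entities from the custom vocabulary.
        try (InputStream in = con.getInputStream();
             Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            System.out.println(s.hasNext() ? s.next() : "");
        }
    }
}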

I configured logging so that I can see what is happening. The linkable, matchable tokens etc. are detected correctly, e.g. 'Berlin' in the sentence 'Berlin is a big city' is marked as a linkable token:

11.01.2016 16:14:05.667 *DEBUG* [Thread-9] org.apache.stanbol.enhancer.engines.entitylinking.impl.SectionData     - TokenData: 'Berlin'[linkable=true(linkabkePos=true)| matchable=true(matchablePos=true)| alpha=true| seachLength=true| upperCase=true]

It is also sent to the Solr index, but no results come back from there:
11.01.2016 16:14:05.668 *DEBUG* [Thread-9] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker --- preocess Token 0: Berlin (lemma: null) linkable=true, matchable=true | chunk: Chunk: [0, 6] Berlin
11.01.2016 16:14:05.668 *DEBUG* [Thread-9] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker     - 1:'is' (lemma: null) linkable=false, matchable=false
11.01.2016 16:14:05.668 *DEBUG* [Thread-9] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker     - 2:'a' (lemma: null) linkable=false, matchable=false
11.01.2016 16:14:05.668 *DEBUG* [Thread-9] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker   >> searchStrings [Berlin]
11.01.2016 16:14:05.668 *DEBUG* [Thread-9] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker    > request entities [0-20] entities ...
11.01.2016 16:14:05.669 *DEBUG* [Thread-9] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker       < found 0 entities ...

I also looked at solr.log; the query looks like this:
(((@en\/rdfs\:label\/:"Berlin")) OR ((@\/rdfs\:label\/:"Berlin")))
hits=0 status=0 QTime=1


I installed Solr and copied the index over to execute the above query. It does not return any Solr documents, but the following one does:
(((_\!@en\/rdfs\:label\/:" Berlin ")) OR ((_\!@\/rdfs\:label\/:" Berlin ")))
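
For reference, this is roughly how I compared the two queries against the copied index with SolrJ (just a sketch; 'myvocab' is a placeholder for the core name, and HttpSolrClient.Builder is the SolrJ 6+ style, so adjust for older SolrJ versions; the query strings are the ones from the logs above):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class LabelQueryCheck {
    public static void main(String[] args) throws Exception {
        // Assumptions: standalone Solr on localhost:8983, copied index loaded as core "myvocab"
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/myvocab").build();

        // The query Stanbol logs (no '_!' prefix on the label fields): 0 hits for me
        String stanbolQuery =
                "(((@en\\/rdfs\\:label\\/:\"Berlin\")) OR ((@\\/rdfs\\:label\\/:\"Berlin\")))";
        // The variant with the '_!' prefix and the padded phrase: this one returns documents
        String workingQuery =
                "(((_\\!@en\\/rdfs\\:label\\/:\" Berlin \")) OR ((_\\!@\\/rdfs\\:label\\/:\" Berlin \")))";

        for (String q : new String[] { stanbolQuery, workingQuery }) {
            QueryResponse rsp = solr.query(new SolrQuery(q));
            System.out.println(rsp.getResults().getNumFound() + " hits for " + q);
        }
        solr.close();
    }
}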

Can someone help me figure out what I am missing?
Is it a configuration issue when I create the index? (What is strange is that I used the same config files for the incorrectly encoded RDF resource file, and that index worked.)
Or is it a Stanbol issue?

Thanks for any hints/help!

Best regards,
Kata


Re: question on working with custom vocabularies

Posted by Rafa Haro <rh...@apache.org>.
Hi Kata,

Have you overwritten the old Solr index in the datafiles folder, or have you started from scratch after fixing the encoding of the RDF files?

Just a hint: you can check whether your entities have been indexed by querying them with the EntityHub API in the Stanbol web interface.
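
For example, something along these lines (just a sketch; replace 'myvocab' with the site id you configured for your referenced site):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class EntityhubFindCheck {
    public static void main(String[] args) throws Exception {
        // Assumptions: Stanbol on localhost:8080, referenced site id "myvocab"
        URL url = new URL("http://localhost:8080/entityhub/site/myvocab/find");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        con.setRequestProperty("Accept", "application/json");
        // 'name' supports wildcards, so this would also catch labels with extra whitespace
        String form = "name=Berlin*&limit=5";
        try (OutputStream out = con.getOutputStream()) {
            out.write(form.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + con.getResponseCode());
        try (Scanner s = new Scanner(con.getInputStream(), "UTF-8").useDelimiter("\\A")) {
            // An empty result list here suggests the entities are not in the index at all
            System.out.println(s.hasNext() ? s.next() : "");
        }
    }
}

You can also dereference a single entity directly with a GET on /entityhub/site/<siteId>/entity?id=<entity URI>.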

Hope that helps,
Rafa
