Posted to dev@stanbol.apache.org by Johannes Goslar <jo...@dkd.de> on 2014/01/27 15:23:03 UTC

Extracting english words in german texts

Hi everyone,
is there a way to configure subtext matching, or otherwise improve word recognition where words are not in the same language as the main text?
Concrete example:
The following entity is in an OpenRDF Sesame database:
Linking config: 
Chain:

Linking works great if I input an English text like:
	The Global Toy Conference is a really good thing.

But if I send
	Die Global Toy Conference ist eine gute Sache.

It will report the language as German and will not recognize the entity. Is there any way to configure Stanbol to detect this?
Maybe one could add a chain component that extracts all runs of consecutive uppercase words as labels?

Cheers
Johannes
-- 
Johannes Goslar

dkd Internet Service GmbH 
development // kommunikation // design 
Kaiserstraße 73 
60329 Frankfurt am Main 

Kontakt: 
- email: johannes.goslar@dkd.de 
- fon: +49 69 2475218-0 
- fax: +49 69 2475218-99
- web: http://www.dkd.de
- social media: http://social.dkd.de

Aktuelle Projekte:
- http://j.mp/SehBiS-App – iPhone-App Sehbehinderungssimulator
- http://www.ellen-wille.de - Launch Website (TYPO3)
- http://www.vgf-ffm.de - Relaunch Website (TYPO3)

Geschäftsführer: O. Dobberkau, S. Schaffstein, G. Wegenast, C. Zabanski 
Registergericht: Amtsgericht Frankfurt am Main 
Registernummer: HRB 45590




Re: Extracting english words in german texts

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Johannes,

I implemented a fix for STANBOL-1277 today. So when using a Stanbol
version later than r1587849 [1], your issue should be resolved.

Queries for a TextConstraint with {text1} or {text2} in the languages
{lang1} or {lang2} are expected to look like:

    select ?entity ?label where {
        ?entity rdfs:label ?label
        FILTER((regex(str(?label),"\\b{text1}\\b","i") || regex(str(?label),"\\b{text2}\\b","i"))
            && ((lang(?label) = "{lang1}") || (lang(?label) = "{lang2}"))) .
    }
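
As a rough illustration of the template above, here is a minimal Python sketch that fills in the {text} and {lang} placeholders. The helper name build_label_filter is hypothetical and this is not the Stanbol query builder; it assumes every text is a single alphanumeric token and every language is a simple language tag. Whether \b is honoured depends on the regex engine of the triple store.

```python
import re


def build_label_filter(texts, langs, var="?label"):
    """Return a SPARQL FILTER in the shape of the template above.

    Hypothetical helper, not Stanbol code. Assumes every text is a single
    alphanumeric token and every language is a simple language tag.
    """
    for text in texts:
        if not text.isalnum():
            raise ValueError("unsupported token: %r" % text)
    for lang in langs:
        if not re.fullmatch(r"[A-Za-z]+(-[A-Za-z0-9]+)*", lang):
            raise ValueError("unsupported language tag: %r" % lang)

    # "\\b" inside the SPARQL string literal reaches the regex engine as \b.
    text_part = " || ".join(
        'regex(str(%s),"\\\\b%s\\\\b","i")' % (var, text) for text in texts
    )
    lang_part = " || ".join('(lang(%s) = "%s")' % (var, lang) for lang in langs)
    # Parenthesise both groups so that && applies to the whole text group.
    return "FILTER((%s) && (%s))" % (text_part, lang_part)


print(build_label_filter(["Global", "Toy"], ["de", "en"]))
```

The sketch rejects anything that is not a plain token rather than trying to escape it, since escaping rules differ between regex dialects.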

best
Rupert


[1] http://svn.apache.org/r1587849

On Thu, Feb 6, 2014 at 4:37 PM, Rupert Westenthaler
<ru...@gmail.com> wrote:
> Hi Johannes,
>
> thx for the report. I created STANBOL-1277 [1] for this
>
> best
> Rupert
>
> [1] https://issues.apache.org/jira/browse/STANBOL-1277
>
>
> On Wed, Feb 5, 2014 at 2:54 PM, Johannes Goslar <jo...@dkd.de> wrote:
>> Hi Rupert,
>> yes, moving to a managed site did help.
>>
>> Looking through the logs, the failed SPARQL queries look like:
>> FILTER(regex(str(?v_7),"^Global$","i") || regex(str(?v_7),"^Toy$","i") && ((lang(?v_7) = "de") || (lang(?v_7) = "en"))) .
>> So somewhere the query builder is wrongly inserting ^ and $.
>>
>> best
>> Johnny
>>
>> On 28.01.2014, at 15:21, Rupert Westenthaler <ru...@gmail.com> wrote:
>>
>>> Hi Johnny
>>>
>>> On Mon, Jan 27, 2014 at 5:58 PM, Johannes Goslar <jo...@dkd.de> wrote:
>>>> Hi Rupert,
>>>> the docs are really interesting but sadly did not bring me to a solution.
>>>> Removed the config except the *, but Stanbol still behaved the same way.
>>>> Extra models were not installed by hand.
>>>> At the moment the chain is using a Referenced Site pointing to the Sesame
>>>> Sparql Interface.
>>>
>>> So the most likely cause is that the Yard does not suggest the Entity.
>>> So the issue is most likely in the SPARQL query generated for the
>>> Entity Lookup generated by the Entityhub Linking Engine.
>>>
>>> I will try to replicate this, but I will not have time to do it this
>>> week as I am traveling. In the meantime you could try to upload your
>>> RDF data to a ManagedSite backed by a SolrYard.
>>>
>>> best
>>> Rupert
>>>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                              ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO ..........................................................................
| http://redlink.co/

Re: Extracting english words in german texts

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Johannes,

Thanks for the report. I created STANBOL-1277 [1] for this.

best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-1277



Re: Extracting english words in german texts

Posted by Johannes Goslar <jo...@dkd.de>.
Hi Rupert,
yes, moving to a managed site did help.

Looking through the logs, the failed SPARQL queries look like:
FILTER(regex(str(?v_7),"^Global$","i") || regex(str(?v_7),"^Toy$","i") && ((lang(?v_7) = "de") || (lang(?v_7) = "en"))) .
So somewhere the query builder is wrongly inserting ^ and $.
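
To illustrate why the anchored patterns cannot match a multi-word label, here is a small Python sketch. It uses Python's re module as a stand-in for the regex engine of the triple store, which is an assumption; the label value is taken from the example sentence in this thread.

```python
import re

label = "Global Toy Conference"

# Anchored pattern as seen in the logged query: it must match the entire label.
anchored = re.search(r"^Global$", label, re.IGNORECASE)

# Word-boundary pattern: matches "Global" as a whole word anywhere in the label.
bounded = re.search(r"\bGlobal\b", label, re.IGNORECASE)

print(anchored is None)      # True: the anchored regex rejects the multi-word label
print(bounded is not None)   # True: the word-boundary regex accepts it
```

With ^ and $ in place, a single token can only match a label that consists of exactly that token.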

best
Johnny


Re: Extracting english words in german texts

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Johnny

On Mon, Jan 27, 2014 at 5:58 PM, Johannes Goslar <jo...@dkd.de> wrote:
> Hi Rupert,
> the docs are really interesting but sadly did not bring me to a solution.
> Removed the config except the *, but Stanbol still behaved the same way.
> Extra models were not installed by hand.
> At the moment the chain is using a Referenced Site pointing to the Sesame
> Sparql Interface.

So the most likely cause is that the Yard does not suggest the entity.
That points to the SPARQL query that the Entityhub Linking Engine
generates for the entity lookup.

I will try to replicate this, but I will not have time to do it this
week as I am traveling. In the meantime you could try to upload your
RDF data to a ManagedSite backed by a SolrYard.

best
Rupert


Re: Extracting english words in german texts

Posted by Johannes Goslar <jo...@dkd.de>.
Hi Rupert, 
the docs are really interesting, but sadly they did not lead me to a solution.
I removed everything from the config except the '*', but Stanbol still behaved the same way.
I did not install any extra models by hand.
At the moment the chain is using a Referenced Site pointing to the Sesame SPARQL interface.

Cheers
Johnny


Re: Extracting english words in german texts

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Johannes

This should already work as suggested. EntityLinking already uses
upper-case tokens for lookups (see [1] and also the upper-case
configurations of [2] for more details).

In your specific case:

* Your processed language configuration should not include ';'. Just a
single line with '*' should be sufficient.
* Do you have the German models for OpenNLP installed? If not, it is still
expected to work (by only using upper-case tokens), but having German
models available would be good.
* Do you use the Sesame Yard as the backend for the 'node' Site, or do you
use a SolrYard?
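
As a rough picture of what "only using upper-case tokens" could mean, here is a naive Python sketch that collects runs of consecutive capitalized tokens as lookup candidates. The function uppercase_runs is hypothetical and is not how the EntityLinking engine is implemented; it assumes simple word tokenization and ignores sentence boundaries.

```python
import re


def uppercase_runs(text):
    """Return maximal runs of consecutive tokens that start with an upper-case letter."""
    runs, current = [], []
    for token in re.findall(r"\w+", text):
        if token[0].isupper():
            current.append(token)
        else:
            # A lower-case token ends the current run, if there is one.
            if current:
                runs.append(" ".join(current))
            current = []
    if current:
        runs.append(" ".join(current))
    return runs


print(uppercase_runs("Die Global Toy Conference ist eine gute Sache."))
# ['Die Global Toy Conference', 'Sache']
```

The sentence-initial article "Die" and the capitalized German noun "Sache" both end up as candidates, so capitalization alone only narrows down what gets looked up against the entity labels.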

best
Rupert



[1]
http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/entitylinking#token-types
[2]
http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/entitylinking#text-processing-configuration


