Posted to dev@stanbol.apache.org by Johannes Goslar <jo...@dkd.de> on 2014/01/27 15:23:03 UTC
Extracting english words in german texts
Hi everyone,
is there a way to configure subtext matching, or otherwise improve word recognition where words are not in the same language as the main text?
Concrete example:
The following entity is in an OpenRDF Sesame database:
Linking config:
Chain:
Linking works great if I input an English text like:
The Global Toy Conference is a really good thing.
But if I send
Die Global Toy Conference ist eine gute Sache.
Stanbol reports the language as German and does not recognize the entity. Is there any way to configure it so that the entity is detected?
Maybe one could add a chain component that extracts all consecutive capitalized words as a label?
Cheers
Johannes
--
Johannes Goslar
dkd Internet Service GmbH
development // kommunikation // design
Kaiserstraße 73
60329 Frankfurt am Main
Kontakt:
- email: johannes.goslar@dkd.de
- fon: +49 69 2475218-0
- fax: +49 69 2475218-99
- web: http://www.dkd.de
- social media: http://social.dkd.de
Aktuelle Projekte:
- http://j.mp/SehBiS-App – iPhone-App Sehbehinderungssimulator
- http://www.ellen-wille.de - Launch Website (TYPO3)
- http://www.vgf-ffm.de - Relaunch Website (TYPO3)
Geschäftsführer: O. Dobberkau, S. Schaffstein, G. Wegenast, C. Zabanski
Registergericht: Amtsgericht Frankfurt am Main
Registernummer: HRB 45590
Re: Extracting english words in german texts
Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Johannes,
I implemented a fix for STANBOL-1277 today. So when using a Stanbol
version later than r1587849 [1], your issue should be resolved.
Queries for a TextConstraint with {text1} or {text2} in the languages
{lang1} or {lang2} are expected to look like:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?entity ?label WHERE {
  ?entity rdfs:label ?label .
  FILTER((regex(str(?label), "\\b{text1}\\b", "i") ||
          regex(str(?label), "\\b{text2}\\b", "i"))
      && ((lang(?label) = "{lang1}") || (lang(?label) = "{lang2}"))) .
}
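As a rough illustration of the query shape above, here is a minimal Python sketch that assembles such a FILTER from a list of texts and a list of languages. It is not Stanbol's query builder: the function name build_label_filter and its parameters are hypothetical, and escaping the search text with Python's re.escape is an assumption that may not fit the regex flavour of every SPARQL endpoint.

```python
import re


def build_label_filter(texts, languages, var="label"):
    """Build a FILTER that matches any text as a whole word in any language.

    Hypothetical helper for illustration; not part of Stanbol.
    """
    if not texts or not languages:
        raise ValueError("need at least one text and one language")

    def sparql_string(value):
        # Escape backslashes and double quotes for a SPARQL string literal.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    # One case-insensitive whole-word regex per search text.
    text_parts = [
        'regex(str(?{v}), {p}, "i")'.format(
            v=var, p=sparql_string(r"\b" + re.escape(text) + r"\b"))
        for text in texts
    ]
    # One language test per accepted language.
    lang_parts = [
        "lang(?{v}) = {l}".format(v=var, l=sparql_string(lang))
        for lang in languages
    ]
    # Parenthesize both groups so the language test applies to every regex.
    return ("FILTER((" + " || ".join(text_parts) + ") && ("
            + " || ".join(lang_parts) + "))")


print(build_label_filter(["Global", "Toy"], ["de", "en"]))
```

The explicit parentheses around the regex group matter: without them the language restriction would only bind to the last regex, because && has higher precedence than || in SPARQL.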
best
Rupert
[1] http://svn.apache.org/r1587849
--
| Rupert Westenthaler rupert.westenthaler@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO ..........................................................................
| http://redlink.co/
Re: Extracting english words in german texts
Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Johannes,
Thanks for the report. I created STANBOL-1277 [1] for this.
best
Rupert
[1] https://issues.apache.org/jira/browse/STANBOL-1277
Re: Extracting english words in german texts
Posted by Johannes Goslar <jo...@dkd.de>.
Hi Rupert,
yes, moving to a managed site did help.
Looking through the logs, the failed SPARQL queries look like:
FILTER(regex(str(?v_7),"^Global$","i") || regex(str(?v_7),"^Toy$","i") && ((lang(?v_7) = "de") || (lang(?v_7) = "en"))) .
So the query builder is wrongly inserting the ^ and $ anchors somewhere.
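To make the effect of the anchors concrete, here is a small sketch using Python's re module as a stand-in for the endpoint's regex engine (regex flavours differ between SPARQL implementations, so treat this as an illustration only). The helper names are hypothetical. It also shows a second effect of the logged filter's shape: since && binds tighter than ||, the language test only applies to the second regex.

```python
import re

LANGS = ("de", "en")


def matches(pattern, label):
    # Case-insensitive search, similar in spirit to regex(..., "i").
    return re.search(pattern, label, re.IGNORECASE) is not None


def logged_filter(label, lang):
    # Same shape as the logged FILTER: a || b && c, where && binds tighter.
    return (matches(r"^Global$", label)
            or matches(r"^Toy$", label) and lang in LANGS)


def word_filter(label, lang):
    # Whole-word match, with the language test applied to both regexes.
    return ((matches(r"\bGlobal\b", label) or matches(r"\bToy\b", label))
            and lang in LANGS)


# The anchored form only matches labels that consist of the single token.
print(logged_filter("Global Toy Conference", "en"))  # False
print(word_filter("Global Toy Conference", "en"))    # True

# Precedence: a label "Global" slips past the language test in the logged form.
print(logged_filter("Global", "fr"))  # True
print(word_filter("Global", "fr"))    # False
```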
best
Johnny
Re: Extracting english words in german texts
Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Johnny
On Mon, Jan 27, 2014 at 5:58 PM, Johannes Goslar <jo...@dkd.de> wrote:
> Hi Rupert,
> the docs are really interesting but sadly did not bring me to a solution.
> Removed the config except the *, but Stanbol still behaved the same way.
> Extra models were not installed by hand.
> At the moment the chain is using a Referenced Site pointing to the Sesame
> Sparql Interface.
So the most likely cause is that the Yard does not suggest the entity.
That points to the SPARQL query that the Entityhub Linking Engine
generates for the entity lookup.
I will try to replicate this, but I will not have time to do it this
week as I am traveling. In the meantime you could try to upload your
RDF data to a ManagedSite backed by a SolrYard.
best
Rupert
Re: Extracting english words in german texts
Posted by Johannes Goslar <jo...@dkd.de>.
Hi Rupert,
the docs are really interesting but sadly did not bring me to a solution.
I removed everything from the config except the *, but Stanbol still behaved the same way.
I did not install any extra models by hand.
At the moment the chain is using a Referenced Site pointing to the Sesame SPARQL interface.
Cheers
Johnny
Re: Extracting english words in german texts
Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Johannes
This should already work as suggested. EntityLinking already uses
upper case tokens for lookups (see [1] and also the upper case
configurations of [2] for more details).
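To illustrate the general idea of using upper case tokens as lookup candidates, here is a minimal Python sketch. It is not the EntityLinking engine's implementation; the function capitalized_runs and its min_tokens parameter are hypothetical, and the tokenization is deliberately naive.

```python
import re


def capitalized_runs(text, min_tokens=2):
    """Return runs of consecutive tokens that start with an upper case letter.

    Hypothetical helper for illustration; not how Stanbol tokenizes text.
    """
    runs, current = [], []
    # Split into word tokens and single punctuation characters.
    for token in re.findall(r"\w+|[^\w\s]", text):
        if token[0].isupper():
            current.append(token)
            continue
        # A lower case token or punctuation ends the current run.
        if len(current) >= min_tokens:
            runs.append(" ".join(current))
        current = []
    if len(current) >= min_tokens:
        runs.append(" ".join(current))
    return runs


print(capitalized_runs("Die Global Toy Conference ist eine gute Sache."))
# ['Die Global Toy Conference']
```

Note that the sentence-initial "Die" ends up in the run, and that German capitalizes all nouns, so a heuristic like this produces more candidates in German text than in English text. Looking up the individual tokens against the entity labels, rather than the whole run, sidesteps the first problem.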
In your specific case:
* Your processed language configuration should not include ';'. Just a
single line with '*' should be sufficient.
* Do you have the German models for OpenNLP installed? If not, it is still
expected to work (by only using upper case tokens), but having German
models available would be good.
* Do you use the Sesame Yard as backend for the 'node' Site, or do you
use a SolrYard?
best
Rupert
[1] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/entitylinking#token-types
[2] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/entitylinking#text-processing-configuration