Posted to dev@stanbol.apache.org by Maatari Daniel Okouya <ok...@yahoo.fr> on 2014/05/27 12:49:26 UTC

[Finding the Topic or primary topic of a PDF publication with apache stanbol?]

Hi, 

I have just started using Apache Stanbol and am still playing around with it to figure out everything that is out there. However, I’m puzzled by one thing. I would like to configure it so that, upon uploading a text or PDF document, an RDF document containing only the topic of the PDF is returned.

I’m scratching my head, but I don’t see how to do so. Which engine is supposed to produce <<Fise:Annotation>>

as described in http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html?


I would appreciate it if someone could provide me with some pointers.

Many thanks, 

Maatary

-- 
Maatari Daniel Okouya
Sent with Airmail

Re: [Finding the Topic or primary topic of a PDF publication with apache stanbol?]

Posted by Maatari Daniel Okouya <ok...@yahoo.fr>.
Many thanks, 

Got it. 


Best, 

-M-

On 28 May 2014 at 06:36:52, Rupert Westenthaler (rupert.westenthaler@gmail.com) wrote:


Re: [Finding the Topic or primary topic of a PDF publication with apache stanbol?]

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Maatari,

On Tue, May 27, 2014 at 2:53 PM, Maatari Daniel Okouya
<ok...@yahoo.fr> wrote:
> Hi, thanks for your answer.
>
> I mean Topic Annotation.
>

Currently the only available Topic Classification engine in Stanbol is
the one described in [1]. As Stanbol does not ship with pre-trained
models (e.g. for IPTC or similar thesauri), you will need to train your
own models. [1] also provides an introduction on how to do that.

This year I am mentoring a GSoC (Google Summer of Code) project that
is about defining a clear Topic Classification API [2] [3] and two
additional implementations of such engines.

> Ultimately, what I would like to have is something like { PDFuri foaf:primaryTopic London . } as a triple in the returned RDF.
>
> But for now, I am not concerned with using FOAF.
>

Topic Engines will always use fise:TopicAnnotation to describe
extracted topics. If you just want "{PDF-uri} foaf:primaryTopic
{topic-uri}", you can easily get this by taking the topics referenced
by the fise:TopicAnnotations and linking them directly to the
ContentItem using foaf:primaryTopic.
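
The linking step just described could be sketched roughly as follows. This is
only an illustration working on plain (subject, predicate, object) string
tuples, so it does not depend on any particular RDF library. The FISE namespace
and the property names fise:extracted-from and fise:entity-reference are
assumptions based on the enhancement structure documentation; please check them
against the RDF your own chain returns. The function name and the example URIs
are hypothetical.

```python
# Sketch only: namespace and property names are assumptions, verify them
# against the enhancement structure documentation and your own output.
FISE = "http://fise.iks-project.eu/ontology/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PRIMARY_TOPIC = "http://xmlns.com/foaf/0.1/primaryTopic"


def primary_topic_triples(triples):
    """Derive (content item, foaf:primaryTopic, topic) triples.

    `triples` is an iterable of (subject, predicate, object) string tuples,
    e.g. produced by any RDF parser from the enhancer response.
    """
    triples = list(triples)
    # Subjects that are typed as fise:TopicAnnotation.
    annotations = {s for s, p, o in triples
                   if p == RDF_TYPE and o == FISE + "TopicAnnotation"}
    content_item = {}  # annotation -> content item it was extracted from
    topic = {}         # annotation -> topic it refers to
    for s, p, o in triples:
        if s not in annotations:
            continue
        if p == FISE + "extracted-from":
            content_item[s] = o
        elif p == FISE + "entity-reference":
            topic[s] = o
    # Only annotations that have both a content item and a topic are linked.
    return sorted(
        (content_item[a], FOAF_PRIMARY_TOPIC, topic[a])
        for a in annotations
        if a in content_item and a in topic
    )


# Hypothetical example data standing in for a parsed enhancer response.
example = [
    ("urn:enhancement-1", RDF_TYPE, FISE + "TopicAnnotation"),
    ("urn:enhancement-1", FISE + "extracted-from", "urn:content-item-1"),
    ("urn:enhancement-1", FISE + "entity-reference",
     "http://dbpedia.org/resource/London"),
    ("urn:enhancement-2", RDF_TYPE, FISE + "TextAnnotation"),
    ("urn:enhancement-2", FISE + "extracted-from", "urn:content-item-1"),
]

for triple in primary_topic_triples(example):
    print(triple)
```

If the engine assigns several topics, this emits one foaf:primaryTopic triple
per topic. Keeping only the best-scoring one would need an extra filter, for
example on the confidence value of the annotation.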

> I just want to have the main topics of the PDF. I don’t necessarily want to extract all the entities etc.
>
> So, in terms of the annotations generated, I would want neither fise:EntityAnnotation nor fise:TextAnnotation, but simply fise:TopicAnnotation.
>

No problem, just configure an Enhancement Chain with the following engines:

* tika engine: to extract plain text from the PDFs
* langdetect engine: to detect the language (as an alternative you can
also pass the language by setting the Content-Language HTTP header in
requests)
* the topic engine configured with the model you trained.
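
Once such a chain exists, sending a PDF to it could look roughly like the
sketch below. Everything specific in it is an assumption for illustration: the
base URL http://localhost:8080, the chain name "topic-chain", the
/enhancer/chain/<name> path and the text/turtle response format should all be
checked against your own Stanbol instance. It uses only the Python standard
library, and the function names are hypothetical.

```python
import urllib.request


def build_enhancer_request(pdf_bytes, base_url="http://localhost:8080",
                           chain="topic-chain", language=None):
    """Build (but do not send) a POST request for an enhancement chain.

    `base_url` and `chain` are placeholders for your own installation.
    """
    headers = {
        "Content-Type": "application/pdf",  # media type of the uploaded content
        "Accept": "text/turtle",            # requested RDF serialization
    }
    if language:
        # Alternative to running a language detection engine in the chain.
        headers["Content-Language"] = language
    url = f"{base_url}/enhancer/chain/{chain}"
    return urllib.request.Request(url, data=pdf_bytes, headers=headers,
                                  method="POST")


def enhance_pdf(path, **kwargs):
    """Send the PDF at `path` to the chain and return the RDF response text."""
    with open(path, "rb") as pdf:
        request = build_enhancer_request(pdf.read(), **kwargs)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling enhance_pdf("paper.pdf", language="en") would then return the
enhancement results as text, assuming a server is running at the given address
and a chain with that name has been configured.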

best
Rupert

[1] http://www.iks-project.eu/sites/default/files/Topic-Classification.pdf
[2] http://furkankamaci.com/gsoc-2014-acceptance-apache-stanbol/
[3] https://issues.apache.org/jira/browse/STANBOL-1294



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                              ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO ..........................................................................
| http://redlink.co/

Re: [Finding the Topic or primary topic of a PDF publication with apache stanbol?]

Posted by Maatari Daniel Okouya <ok...@yahoo.fr>.
Hi, thanks for your answer. 

I mean Topic Annotation. 

Ultimately, what I would like to have is something like { PDFuri foaf:primaryTopic London . } as a triple in the returned RDF.

But for now, I am not concerned with using FOAF.

I just want to have the main topics of the PDF. I don’t necessarily want to extract all the entities etc.

So, in terms of the annotations generated, I would want neither fise:EntityAnnotation nor fise:TextAnnotation, but simply fise:TopicAnnotation.

Many thanks



On 27 May 2014 at 13:08:38, Rupert Westenthaler (rupert.westenthaler@gmail.com) wrote:


Re: [Finding the Topic or primary topic of a PDF publication with apache stanbol?]

Posted by Rupert Westenthaler <ru...@gmail.com>.
On Tue, May 27, 2014 at 12:49 PM, Maatari Daniel Okouya
<ok...@yahoo.fr> wrote:
> Hi,
>
> I have just started using Apache Stanbol and am still playing around with it to figure out everything that is out there. However, I’m puzzled by one thing. I would like to configure it so that, upon uploading a text or PDF document, an RDF document containing only the topic of the PDF is returned.
>

What do you mean by "topic"? In the case of PDF files, the Tika Engine [1]
can extract metadata. Such metadata is added directly to the URI of
the ContentItem and does not use FISE.

> I’m scratching my head, but I don’t see how to do so. Which engine is supposed to produce <<Fise:Annotation>>
>

All Stanbol Engines generate FISE enhancements
(fise:TextAnnotation, fise:EntityAnnotation and fise:TopicAnnotation).

When you look at the list of engines [2]:

* Language Detection engines create a fise:TextAnnotation describing
the language of the document (?la dc:type dc:LinguisticSystem;
dc:language ?lang)
* Named Entity Recognition (NER) engines create fise:TextAnnotations
for Entities recognized by the NLP framework.
* Linking / Suggestion engines create fise:EntityAnnotations for Entities
found in the text. They might also add fise:TextAnnotations to mark the
exact mention of such entities in the text.
* Topic Classification engines use fise:TopicAnnotation to describe
assigned topics. They also use a fise:TextAnnotation to mark the part
of the text the topic is assigned to.
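
As an illustration of the language pattern in the first bullet, the detected
language could be read from the results roughly as sketched below. The sketch
works on plain (subject, predicate, object) string tuples. It assumes that the
dc: prefix stands for http://purl.org/dc/terms/, which should be verified
against the actual output; the function name and the example URIs are
hypothetical.

```python
# Assumption: the "dc:" prefix expands to the DCMI terms namespace.
DC = "http://purl.org/dc/terms/"


def detected_languages(triples):
    """Return the sorted language codes found on language annotations.

    `triples` is an iterable of (subject, predicate, object) string tuples.
    """
    triples = list(triples)
    # Annotations matching "?la dc:type dc:LinguisticSystem".
    language_annotations = {s for s, p, o in triples
                            if p == DC + "type"
                            and o == DC + "LinguisticSystem"}
    # Values matching "?la dc:language ?lang" for those annotations.
    return sorted({o for s, p, o in triples
                   if s in language_annotations and p == DC + "language"})


# Hypothetical example data standing in for a parsed enhancer response.
example = [
    ("urn:enhancement-1", DC + "type", DC + "LinguisticSystem"),
    ("urn:enhancement-1", DC + "language", "en"),
    ("urn:enhancement-2", DC + "type", "http://example.org/SomethingElse"),
    ("urn:enhancement-2", DC + "language", "de"),
]

print(detected_languages(example))
```

The second annotation is ignored because its dc:type is not
dc:LinguisticSystem, so only the language of the first one is reported.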

> as described in http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html

Yep, this page describes the annotations as created by the EnhancementEngines.


Without knowing what you mean by " ... only the topic of the pdf ...",
I cannot recommend a suitable Stanbol configuration.

best
Rupert

>
>


[1] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/tikaengine
[2] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/list
