Posted to dev@stanbol.apache.org by Maatari Daniel Okouya <ok...@yahoo.fr> on 2014/05/27 12:49:26 UTC
[Finding the Topic or primary topic of a PDF publication with
apache stanbol?]
Hi,
I have just started to use Apache Stanbol. I’m still playing around with it to figure out everything that is out there. However, I’m puzzled by one thing. I would like to configure it so that, upon uploading a text or PDF document, an RDF graph containing only the topic of the PDF is returned.
I’m scratching my head but I don’t see how to do so. Which engine is supposed to produce <<Fise:Annotation>>
as described in http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html
I would appreciate it if someone could provide me with some pointers.
Many thanks,
Maatary
--
Maatari Daniel Okouya
Re: [Finding the Topic or primary topic of a PDF publication
with apache stanbol?]
Posted by Maatari Daniel Okouya <ok...@yahoo.fr>.
Many thanks,
Got it.
Best,
-M-
--
Maatari Daniel Okouya
On 28 May 2014 at 06:36:52, Rupert Westenthaler (rupert.westenthaler@gmail.com) wrote:
> [quoted reply trimmed; the full message follows below]
Re: [Finding the Topic or primary topic of a PDF publication with
apache stanbol?]
Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Maatari,
On Tue, May 27, 2014 at 2:53 PM, Maatari Daniel Okouya
<ok...@yahoo.fr> wrote:
> Hi, thanks for your answer.
>
> I mean Topic Annotation.
>
Currently the only available Topic Classification engine in Stanbol is
the one described by [1]. As Stanbol does not ship with pre-trained
models (e.g. for IPTC or similar thesauri) you will need to train your
own models. [1] also provides an introduction on how to do that.
This year I am mentoring a GSoC (Google Summer of Code) project that
is about defining a clear Topic Classification API [2] [3] and two
additional implementations of such engines.
> Ultimately what I would like to have is something like: { PDFuri foaf:primaryTopic London . } as a triple in the returned RDF.
>
> But for now, I am not concerned with using FOAF.
>
Topic Engines will always use fise:TopicAnnotation to describe
extracted topics. If you just want "{PDF-uri} foaf:primaryTopic
{topic-uri}" you can easily get this by taking the topics referenced
by fise:TopicAnnotation and linking them using foaf:primaryTopic
directly to the ContentItem.
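The post-processing step described above could be sketched roughly as follows. This is only an illustration, not Stanbol code: triples are modelled as plain (subject, predicate, object) string tuples, the property names fise:extracted-from, fise:entity-reference and fise:confidence are assumed from the enhancement structure documentation, and picking the single highest-confidence topic per content item is a design choice of this sketch.

```python
# Sketch only: triples are plain (subject, predicate, object) string tuples.
# The fise: property names are assumptions taken from the Stanbol
# enhancement structure documentation, not verified against a live server.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FISE = "http://fise.iks-project.eu/ontology/"
TOPIC_ANNOTATION = FISE + "TopicAnnotation"
EXTRACTED_FROM = FISE + "extracted-from"
ENTITY_REFERENCE = FISE + "entity-reference"
CONFIDENCE = FISE + "confidence"
FOAF_PRIMARY_TOPIC = "http://xmlns.com/foaf/0.1/primaryTopic"


def primary_topic_triples(triples):
    """Derive one foaf:primaryTopic triple per content item."""
    triples = list(triples)
    # Subjects typed as fise:TopicAnnotation.
    annotations = {s for s, p, o in triples
                   if p == RDF_TYPE and o == TOPIC_ANNOTATION}
    # Collect the properties of each topic annotation.
    props = {a: {} for a in annotations}
    for s, p, o in triples:
        if s in props:
            props[s][p] = o
    # Keep the highest-confidence topic for every content item.
    best = {}  # content item -> (confidence, topic)
    for values in props.values():
        item = values.get(EXTRACTED_FROM)
        topic = values.get(ENTITY_REFERENCE)
        if item is None or topic is None:
            continue
        confidence = float(values.get(CONFIDENCE, 0.0))
        if item not in best or confidence > best[item][0]:
            best[item] = (confidence, topic)
    return [(item, FOAF_PRIMARY_TOPIC, topic)
            for item, (_, topic) in sorted(best.items())]
```

If all topics should be linked rather than only the best one, the selection step can be dropped and one triple emitted per annotation, using a property such as dc:subject instead if several topics per document are expected.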
> I just want to have the main topics of the PDF. I don’t necessarily want to extract all the entities etc.
>
> So maybe, in terms of the annotations generated, I would say having neither fise:EntityAnnotation nor fise:TextAnnotation but simply fise:TopicAnnotation.
>
No problem, just configure an Enhancement Chain with:
* the tika engine: to extract plain text from the PDFs
* the langdetect engine: to detect the language (as an alternative you can
also pass the language by setting the Content-Language HTTP header in
requests)
* the topic engine configured with the model you trained.
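Once such a chain exists, sending a PDF to it could look roughly like the sketch below. The host and port, the chain name "topic-chain" and the helper name are hypothetical; the /enhancer/chain/{name} path and the text/turtle response format are assumed from the Stanbol Enhancer REST documentation. The function only builds the request, so it can be inspected without a running server.

```python
import urllib.request


def build_enhancer_request(pdf_bytes, base_url="http://localhost:8080",
                           chain="topic-chain", language=None):
    """Build (but do not send) a POST request for a Stanbol chain.

    base_url and chain are placeholders for a local installation.
    """
    headers = {
        "Content-Type": "application/pdf",  # lets the tika engine handle the PDF
        "Accept": "text/turtle",            # ask for the enhancements as Turtle
    }
    if language:
        # Alternative to the langdetect engine: state the language up front.
        headers["Content-Language"] = language
    url = f"{base_url.rstrip('/')}/enhancer/chain/{chain}"
    return urllib.request.Request(url, data=pdf_bytes, headers=headers,
                                  method="POST")
```

Passing the returned object to urllib.request.urlopen() would perform the actual call against a running Stanbol instance and return the enhancement graph in the response body.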
best
Rupert
[1] http://www.iks-project.eu/sites/default/files/Topic-Classification.pdf
[2] http://furkankamaci.com/gsoc-2014-acceptance-apache-stanbol/
[3] https://issues.apache.org/jira/browse/STANBOL-1294
--
| Rupert Westenthaler rupert.westenthaler@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO ..........................................................................
| http://redlink.co/
Re: [Finding the Topic or primary topic of a PDF publication
with apache stanbol?]
Posted by Maatari Daniel Okouya <ok...@yahoo.fr>.
Hi, thanks for your answer.
I mean Topic Annotation.
Ultimately what I would like to have is something like: { PDFuri foaf:primaryTopic London . } as a triple in the returned RDF.
But for now, I am not concerned with using FOAF.
I just want to have the main topics of the PDF. I don’t necessarily want to extract all the entities etc.
So maybe, in terms of the annotations generated, I would say having neither fise:EntityAnnotation nor fise:TextAnnotation but simply fise:TopicAnnotation.
Many thanks
--
Maatari Daniel Okouya
On 27 May 2014 at 13:08:38, Rupert Westenthaler (rupert.westenthaler@gmail.com) wrote:
> [quoted reply trimmed; the full message follows below]
Re: [Finding the Topic or primary topic of a PDF publication with
apache stanbol?]
Posted by Rupert Westenthaler <ru...@gmail.com>.
On Tue, May 27, 2014 at 12:49 PM, Maatari Daniel Okouya
<ok...@yahoo.fr> wrote:
> Hi,
>
> I have just started to use Apache Stanbol. I’m still playing around with it to figure out everything that is out there. However, I’m puzzled by one thing. I would like to configure it so that, upon uploading a text or PDF document, an RDF graph containing only the topic of the PDF is returned.
>
What do you mean by "topic"? In the case of PDF files the Tika Engine [1]
can extract metadata. Such metadata are directly added to the URI of
the contentItem and do not use FISE.
> I’m scratching my head but I don’t see how to do so. Which engine is supposed to produce <<Fise:Annotation>>
>
All Stanbol Engines do generate FISE enhancements
(fise:TextAnnotation, fise:EntityAnnotation and fise:TopicAnnotation).
When you look at the list of engines [2]:
* Language Detection engines create a fise:TextAnnotation describing
the language of the document (?la dc:type dc:LinguisticSystem ;
dc:language ?lang)
* Named Entity Recognition (NER) Engines create fise:TextAnnotations
for Entities recognized by the NLP framework.
* Linking / Suggestion engines create fise:EntityAnnotations for Entities
found in the text. They might also add fise:TextAnnotations to mark the
exact mention of such entities in the text.
* Topic Classification engines use fise:TopicAnnotation to describe
assigned topics. They also use a fise:TextAnnotation to mark the part
of the text the topic is assigned to.
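To see which of these annotation types a given response actually contains, or to keep only one of them, the returned graph can be filtered on rdf:type. The following is a minimal sketch under the assumption that the response has already been parsed into (subject, predicate, object) string tuples; the function names are illustrative and not part of any Stanbol API.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FISE = "http://fise.iks-project.eu/ontology/"  # assumed fise: namespace


def subjects_by_type(triples):
    """Map each rdf:type value to the sorted subjects carrying it."""
    grouped = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            grouped.setdefault(o, set()).add(s)
    return {t: sorted(subjects) for t, subjects in grouped.items()}


def keep_only(triples, annotation_type):
    """Keep only triples whose subject has the given rdf:type."""
    triples = list(triples)
    wanted = {s for s, p, o in triples
              if p == RDF_TYPE and o == annotation_type}
    return [t for t in triples if t[0] in wanted]
```

Calling keep_only(triples, FISE + "TopicAnnotation") drops the text and entity annotations on the client side; any fise:TextAnnotation that a topic annotation points to is dropped as well.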
> as described in http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html
Yep, this page describes the annotations as created by the EnhancementEngines.
Without knowing what you mean by " ... only the topic of the pdf ..."
I cannot recommend a suitable Stanbol configuration.
best
Rupert
>
>
[1] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/tikaengine
[2] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/list
> I would appreciate it if someone could provide me with some pointers.
>
> Many thanks,
>
> Maatary
>
> --
> Maatari Daniel Okouya