Posted to dev@stanbol.apache.org by Maatari Daniel Okouya <ok...@yahoo.fr> on 2014/05/27 21:05:43 UTC

PDF Description Extraction For Linked data

Hi,

To complete my previous question, I think it would be better to give the bigger picture of what I'm trying to achieve.


I have been charged with helping to disseminate the publications of my organisation. Most of them are PDFs.

Therefore, I need a process to produce a meaningful RDF description of our content that links as much as possible to the LOD cloud and LOV (Linked Open Vocabularies). Hence I need to use common core vocabularies as much as I can, e.g. Dublin Core, schema.org, BIBO, FOAF, etc., and reference entities from DBpedia, for instance.

Searching around the web for how to automatically generate these descriptions, which would include creator, publisher, primaryTopic, subject, theme, etc., it seemed to me that Apache Stanbol was the best match.

So that's it: in the first place I would like to automatically generate some description of my PDF publications, though not a rich one. We are not yet planning on providing semantic search; it will probably come in the future.

For now, however, I'm interested in providing some bibliographic data and stating the main topics of the publication, i.e. what it talks about generally speaking.

I will then deploy those descriptions in a SPARQL endpoint, use a frontend like Pubby, and do some content negotiation to redirect toward the PDF when requested. This also means that my descriptions need to have specific URLs that I provide them with.


Can anyone give me some pointers? Is it possible to do this with Stanbol, and if so, how should I go about it? How should I configure the enhancer for that?


Many thanks, 

-M-


-- 
Maatari Daniel Okouya
Sent with Airmail

Re: PDF Description Extraction For Linked data

Posted by Rafa Haro <rh...@apache.org>.
Hi Maatari

On Thursday, May 29, 2014, Maatari Daniel Okouya <ok...@yahoo.fr> wrote:

> Many thanks, Rafa,
>
> One last question: in some cases I will already have the metadata available
> in another format that will need to be translated into RDF.
>
> Let's assume I have that done. What I will get is basically a set of
> instance resources described with the vocabularies of my choice.
>
> That's where the data lifting process you are talking about comes into play.
> To be linked to the LOD, I would need to link my descriptions to other
> datasets available on the LOD.
>
> Is there a way/pipeline available with Stanbol that starts from an RDF
> description and links it to the LOD? To be honest, I already spotted things
> like DataLift and Silk, but I was just wondering if something like that was
> available with Stanbol.
>
If I have understood correctly, I would say that you can use an extension
of Google Refine which uses Stanbol to reconcile your current data with
LOD datasets imported into Stanbol, such as DBpedia. You can find
more information here:

https://code.google.com/p/lmf/wiki/GoogleRefineUsersDocumentation

Hope that helps.
Cheers,

Rafa


Re: PDF Description Extraction For Linked data

Posted by Antonio David Perez Morales <ap...@zaizi.com>.
Hi Maatari

In Stanbol you can import your custom dataset/vocabulary. The component
responsible for that is called the Entityhub, and there are different
implementations that can be used as the entity backend (one of them is
Apache Marmotta, a LOD platform and store).

Once you have imported your data, you can use it in the semantic
lifting process performed by the Enhancer. So, as Rafa described in a
previous message, you can configure a series of engines to be used in the
semantic lifting. One of the available engines is the EntityLinking engine,
where you can configure the Entityhub site to be used to link the mentions
(extracted by other engines) to entities.
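
For illustration, here is a minimal sketch of how that lifting step could be driven over HTTP, assuming a local Stanbol launcher at http://localhost:8080 with its default enhancement chain (the sample text and endpoint are placeholders; adjust them to your setup):

    import requests
    from rdflib import Graph

    # Assumed local Stanbol launcher and default chain; adjust host/chain to your setup
    STANBOL_ENHANCER = "http://localhost:8080/enhancer"

    text = "Apache Stanbol was developed within the IKS project."  # plain text extracted from a PDF

    resp = requests.post(
        STANBOL_ENHANCER,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain", "Accept": "text/turtle"},
    )
    resp.raise_for_status()

    g = Graph()
    g.parse(data=resp.text, format="turtle")

    # List the entities the EntityLinking engine linked, via the fise enhancement vocabulary
    q = """
    PREFIX fise: <http://fise.iks-project.eu/ontology/>
    SELECT ?entity ?label WHERE {
      ?ann a fise:EntityAnnotation ;
           fise:entity-reference ?entity ;
           fise:entity-label ?label .
    }
    """
    for entity, label in g.query(q):
        print(entity, label)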

Hope it helps.

Regards




Re: PDF Description Extraction For Linked data

Posted by Maatari Daniel Okouya <ok...@yahoo.fr>.
Many thanks, Rafa,

One last question: in some cases I will already have the metadata available in another format that will need to be translated into RDF.

Let's assume I have that done. What I will get is basically a set of instance resources described with the vocabularies of my choice.

That's where the data lifting process you are talking about comes into play. To be linked to the LOD, I would need to link my descriptions to other datasets available on the LOD.

Is there a way/pipeline available with Stanbol that starts from an RDF description and links it to the LOD? To be honest, I already spotted things like DataLift and Silk, but I was just wondering if something like that was available with Stanbol.


Many thanks, 

-M-
-- 
Maatari Daniel Okouya
Sent with Airmail


Re: PDF Description Extraction For Linked data

Posted by Rafa Haro <rh...@apache.org>.
Hi Maatari,

On 29/05/14 02:27, Maatari Daniel Okouya wrote:
> Rafa,
>
> Many thanks for your elaborate answer.
>
> It seems from your elaborate answer that I did not completely
> grasp the concepts behind Stanbol. Its primary purpose is semantically
> annotating the content of a file for the purpose of semantic search.
> Although one could divert it by reusing the enhancement infrastructure to
> get the description generated and apply some SPARQL rule to get the
> description in a desired format, it is not geared toward linked data
> out of the box. What I mean is generating a description that you could
> publish as is, which is what I was looking for. As you say, the best
> match here is the description returned by the Topic Annotation engine
> and maybe a few things extracted by Tika.
Well, the primary purpose or use case doesn't necessarily have to be
semantic search. I would say that Stanbol helps in the task of
extracting semantic metadata from content (semantic lifting). It is true
that the most common way of metadata extraction is Entity Linking, and
there is a reason for that: Stanbol was born as a tool for Content
Management Systems, where companies are supposed to manage domain
vocabularies that could be used to enrich the enterprise content.
Anyway, the enhancer has been modularized around extraction engines, so
you can perfectly well implement an engine for your use case and take
advantage of the Stanbol APIs to express your extracted metadata as RDF.
>
> I mean I still need to read a bit, but this is what I get for now,
> from your explanation and my reading.
>
> Am I close?
I think so :-). Cheers

Rafa


Re: PDF Description Extraction For Linked data

Posted by Maatari Daniel Okouya <ok...@yahoo.fr>.
Rafa, 

Many thanks for your elaborate answer.

It seems from your elaborate answer that I did not completely grasp the concepts behind Stanbol. Its primary purpose is semantically annotating the content of a file for the purpose of semantic search. Although one could divert it by reusing the enhancement infrastructure to get the description generated and apply some SPARQL rule to get the description in a desired format, it is not geared toward linked data out of the box. What I mean is generating a description that you could publish as is, which is what I was looking for. As you say, the best match here is the description returned by the Topic Annotation engine and maybe a few things extracted by Tika.

I mean I still need to read a bit, but this is what I get for now, from your explanation and my reading.

Am I close?

Best, 
-M-
-- 
Maatari Daniel Okouya
Sent with Airmail



Re: PDF Description Extraction For Linked data

Posted by Rafa Haro <rh...@apache.org>.
Hi Maatari,

On 27/05/14 21:05, Maatari Daniel Okouya wrote:
> Hi,
>
> To complete my previous question, I think it would be better to give the bigger picture of what I'm trying to achieve.
>
>
> I have been charged with helping to disseminate the publications of my organisation. Most of them are PDFs.
>
> Therefore, I need a process to produce a meaningful RDF description of our content that links as much as possible to the LOD cloud and LOV (Linked Open Vocabularies). Hence I need to use common core vocabularies as much as I can, e.g. Dublin Core, schema.org, BIBO, FOAF, etc., and reference entities from DBpedia, for instance.
>
> Searching around the web for how to automatically generate these descriptions, which would include creator, publisher, primaryTopic, subject, theme, etc., it seemed to me that Apache Stanbol was the best match.
With Stanbol you can enrich your content with your own vocabularies or
datasets from the LOD cloud, as long as you first import them as a site.
Let's say that the "out of the box" enrichment process consists of linking
pieces of text (such as entities'/concepts' names/labels) with entities
within your datasets.
>
> So that's it: in the first place I would like to automatically generate some description of my PDF publications, though not a rich one. We are not yet planning on providing semantic search; it will probably come in the future.
I would say that what you need is not related to Entity Linking for now.
The closest resource that you can use in Stanbol for categorizing your
content in that way is the Topic Annotation Engine, which is able to
classify your content according to a model pre-trained on a certain
set of categories. Those categories should correspond to concepts from a
Stanbol site. Please note that things like primaryTopic, subject,
theme, etc. usually cannot be extracted without first training a
model on already annotated content. There are, of course,
unsupervised alternatives like Latent Semantic Analysis or Latent
Dirichlet Allocation that can be used to extract main terms as topics
for your content, but there is currently no support for those in Stanbol.
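
As a rough illustration of that unsupervised route outside Stanbol, here is a minimal sketch using the gensim library on already tokenized text (the toy documents below are just placeholders for text extracted from your PDFs):

    from gensim import corpora, models

    # Placeholder tokenized documents standing in for text extracted from the PDFs
    docs = [
        ["semantic", "web", "rdf", "linked", "data", "vocabulary"],
        ["pdf", "metadata", "extraction", "tika", "publication"],
        ["topic", "model", "classification", "category", "training"],
    ]

    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    # Train a small LDA model and print the top terms of each topic
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
    for topic_id, terms in lda.print_topics(num_topics=2, num_words=5):
        print(topic_id, terms)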
>
> For now, however, I'm interested in providing some bibliographic data and stating the main topics of the publication, i.e. what it talks about generally speaking.
If the PDFs have correct metadata, you can use Tika to extract it.
Someone on the list can probably correct me, but as far as I know the
current Tika engine in Stanbol is used to extract the content so that it
can later be enriched; it does not map the extracted metadata to RDF. I'm
not 100% sure about this, but in any case it shouldn't be complex to implement.
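
To give an idea of what such a mapping could look like outside Stanbol, here is a rough sketch with the tika-python wrapper and rdflib (the document URI and the metadata keys are assumptions; PDF producers name those fields differently):

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DC, RDF, FOAF
    from tika import parser  # tika-python, talks to a local Apache Tika server

    # Hypothetical document URI under your own namespace; match it to the URLs you plan to publish
    doc = URIRef("http://example.org/publications/report-2014-01")

    parsed = parser.from_file("report-2014-01.pdf")
    meta = parsed.get("metadata", {})

    g = Graph()
    g.add((doc, RDF.type, FOAF.Document))
    # PDF metadata keys vary between producers; these are common Tika/Dublin Core keys
    title = meta.get("dc:title") or meta.get("title")
    creator = meta.get("dc:creator") or meta.get("Author")
    if title:
        g.add((doc, DC.title, Literal(title)))
    if creator:
        g.add((doc, DC.creator, Literal(creator)))

    print(g.serialize(format="turtle"))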
>
> I will then deploy those descriptions in a SPARQL endpoint, use a frontend like Pubby, and do some content negotiation to redirect toward the PDF when requested. This also means that my descriptions need to have specific URLs that I provide them with.
In the 0.12 branch of Stanbol, there is a component called the ContentHub
which is able to automatically store the content metadata as RDF along
with the enhancements, and which also provides a SPARQL endpoint. If you
are planning to store huge volumes of data, then the best idea is probably
to take the RDF response of the enhancer and store it in your own triple
store.
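
For the second option, a minimal sketch of that round trip, assuming a local Stanbol launcher and a store that implements the SPARQL 1.1 Graph Store protocol (the endpoints and document URI below are placeholders):

    import requests

    # Assumed endpoints; adjust to your Stanbol instance and your triple store
    STANBOL_ENHANCER = "http://localhost:8080/enhancer"
    GRAPH_STORE = "http://localhost:3030/publications/data"     # e.g. a Fuseki dataset's Graph Store endpoint
    DOC_URI = "http://example.org/publications/report-2014-01"  # the URL you will later serve via Pubby

    text = open("report-2014-01.txt", encoding="utf-8").read()  # plain text extracted from the PDF

    # 1. Enhance the content and ask for the result as Turtle
    enh = requests.post(
        STANBOL_ENHANCER,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain", "Accept": "text/turtle"},
    )
    enh.raise_for_status()

    # 2. Store the enhancement RDF in a named graph of your own triple store
    put = requests.put(
        GRAPH_STORE,
        params={"graph": DOC_URI},
        data=enh.text.encode("utf-8"),
        headers={"Content-Type": "text/turtle"},
    )
    put.raise_for_status()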
>
>
> Can anyone give me some pointers? Is it possible to do this with Stanbol, and if so, how should I go about it? How should I configure the enhancer for that?
>
>
> Many thanks,
>
> -M-
>
>
> -- 
> Maatari Daniel Okouya
> Sent with Airmail