Posted to dev@stanbol.apache.org by Rüdiger Kurz <r....@alkacon.com> on 2013/03/16 16:22:18 UTC

How to extract entities from documents using Microdata?

Hello Stanbolers,

I want to extract and then store entities from HTML documents that use
Microdata annotations, with the schema.org type hierarchy as the
ontology. I would appreciate any approach, including the use of VIE.

Many thanks in advance
Rüdiger

-- 
Rüdiger Kurz

-------------------

Alkacon Software GmbH - The OpenCms Experts
An der Wachsfabrik 13
50996 Koeln, DE

http://www.alkacon.com
http://www.opencms.org

Managing Director: Alexander Kandzior, Amtsgericht Köln, HRB 54613

Re: How to extract entities from documents using Microdata?

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Walter

On Sat, Mar 16, 2013 at 5:20 PM, Walter Kasper <wk...@apache.org> wrote:
> Hi Rüdiger,
>
> The engine just adds the extracted microdata annotations as RDF to the items
> metadata linked to the content item by the
> "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#contains"
> property. So it should be straightforward to build an index as you described
> by extracting them from the metadata.
>

I think the htmlextractor engine should also create a
"fise:EntityAnnotation" for each extracted Entity, like this:

    {entityAnno} rdf:type fise:EntityAnnotation
    {entityAnno} fise:entity-reference {entity-uri}
    {entityAnno} fise:entity-label {entity-label}
    {entityAnno} fise:entity-type {entity-type}
    {entityAnno} fise:extracted-from {content-item-uri}

If your Engine can also determine the section of the text that mentions
the entity, then it would be good if you could also create a
fise:TextAnnotation for it.

This might be important because:

* other Engines in the Enhancement Chain could then process explicitly
  mentioned Entities in the same way as extracted ones
* the Contenthub will need this information to correctly set the
  Context for executing configured LDPaths.
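
To make the proposed structure concrete, here is a minimal sketch of
the five triples listed above. It is only an illustration, not the
engine's actual code: plain Python tuples stand in for RDF triples, the
helper name entity_annotation is hypothetical, all urn:example URIs are
made-up placeholders, and the fise namespace URI is the one I believe
Stanbol uses for its enhancement structure.

```python
# Sketch only: plain tuples stand in for RDF triples; no RDF library is used.
FISE = "http://fise.iks-project.eu/ontology/"  # assumed fise namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def entity_annotation(anno_uri, entity_uri, label, entity_type, content_item_uri):
    """Return the five triples describing one extracted entity (hypothetical helper)."""
    return [
        (anno_uri, RDF_TYPE, FISE + "EntityAnnotation"),
        (anno_uri, FISE + "entity-reference", entity_uri),
        (anno_uri, FISE + "entity-label", label),
        (anno_uri, FISE + "entity-type", entity_type),
        (anno_uri, FISE + "extracted-from", content_item_uri),
    ]


# All URIs below are made-up placeholders.
triples = entity_annotation(
    "urn:example:anno-1",
    "urn:example:entity-1",
    "Alkacon Software GmbH",
    "http://schema.org/Organization",
    "urn:example:content-item-1",
)
for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

A real engine would write these statements into the ContentItem's
metadata graph instead of returning a list.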

best
Rupert




--
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: How to extract entities from documents using Microdata?

Posted by Walter Kasper <wk...@apache.org>.
Hi Rüdiger,

The engine just adds the extracted microdata annotations as RDF to the
item's metadata, linked to the content item by the
"http://www.semanticdesktop.org/ontologies/2007/01/19/nie#contains"
property. So it should be straightforward to build an index as you
described by extracting them from the metadata.
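
As a rough sketch of what "extracting them from the metadata" could
look like: the snippet below follows the nie#contains property from a
content item to the extracted resources and collects their properties.
It assumes the metadata has already been parsed into
(subject, predicate, object) tuples; the helper names and the
urn:example URIs are hypothetical, and the sample data is invented.

```python
NIE_CONTAINS = "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#contains"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA_NAME = "http://schema.org/name"


def contained_resources(triples, content_item_uri):
    """Return the resources linked to the content item via nie#contains."""
    return [o for s, p, o in triples if s == content_item_uri and p == NIE_CONTAINS]


def properties_of(triples, subject):
    """Group all property values of one resource by predicate."""
    props = {}
    for s, p, o in triples:
        if s == subject:
            props.setdefault(p, []).append(o)
    return props


# Invented sample metadata: one content item containing one schema.org Person.
CONTENT_ITEM = "urn:example:content-item-1"
PERSON = "urn:example:person-1"
metadata = [
    (CONTENT_ITEM, NIE_CONTAINS, PERSON),
    (PERSON, RDF_TYPE, "http://schema.org/Person"),
    (PERSON, SCHEMA_NAME, "Jane Doe"),
]

for resource in contained_resources(metadata, CONTENT_ITEM):
    print(resource, properties_of(metadata, resource))
```

The resulting dictionaries could then be mapped to index fields, for
example one document per extracted entity plus a reference back to the
content item.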

Best regards,

Walter Kasper




-- 
Dr. Walter Kasper
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Tel.:  +49-681-85775-5300
Fax:   +49-681-85775-5338
Email: kasper@dfki.de
-------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Registered office: Trippstadter Strasse 122, D-67663 Kaiserslautern

Management:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Chairman)
Dr. Walter Olthoff

Chairman of the Supervisory Board:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------


Re: How to extract entities from documents using Microdata?

Posted by Rüdiger Kurz <r....@alkacon.com>.
Hi Rupert,

many thanks to you as well.

Overall, I think I first have to start experimenting with the Stanbol
web interface. In any case, I have added some comments and questions
inline below.

On 17.03.2013 11:12, Rupert Westenthaler wrote:
> Hi Rüdiger
>
> On Sat, Mar 16, 2013 at 4:56 PM, Rüdiger Kurz <r....@alkacon.com> wrote:
>> Hi Walter,
>>
>> thanks for the quick reply. Are the extracted entities from the
>> htmlextractor enhancement engine automatically stored into the entity hub?
>>
>
> There is no such component that stores Entities present in the
> ContentItem to the Entityhub, as this is a very uncommon use case.
> Typically entities extracted from parsed content are considered as
> suggestions.  So one would consider an additional user interaction
> (e.g. accepting & storing an Entity) before adding them to the
> Entityhub.
>
> However if this is your use case it should be simple to add such an
> Enhancement engine.
Since my approach starts from existing, already annotated HTML, I
thought the idea made sense. I may be wrong, because I am not that deep
into Stanbol yet. Please let me know whether the idea is reasonable.
>
>> What I want to reach is to get an index that stores the extracted entities
>> and also the document itself with references on the entities related to this
>> document. It would be great if that could be done by configuration only.
>
> If the htmlextractor engines adds extracted Entities to the metadata
> of the ContentItem you can access them with the LDPath configuration
> of the Contenthub.
I don't have any experience with LDPath, but it sounds like it would
make what you describe easy, so it seems worth spending some time
experimenting with it. Is there a good starting point for working with
LDPath together with Stanbol?
>
>> Maybe someone could lend me a hand with building the right enhancement chain
>> as a first step.
>>
>
> If you just want the Enhancer to extract Microdata you can create an
> chain that only contains the htmlextractor engine
Sounds straightforward to me!
>
>> In my mind is building up a Solr Search UI offering entity based
>> autosuggestion including spellchecker and faceted search.
>
> For Entity based autosuggestion you might want to use the Entityhub.
> The rest should be possible by using the Contenthub.
As I said before I urgently have to start some experiments ...

-- 
Rüdiger Kurz

-------------------

Alkacon Software GmbH - The OpenCms Experts
Rüdiger Kurz
An der Wachsfabrik 13
50996 Koeln, DE

Tel: +49 (0)2236 3826-16
Fax: +49 (0)2236 3826-20
Email: r.kurz@alkacon.com

http://www.alkacon.com
http://www.opencms.org

Managing Director: Alexander Kandzior, Amtsgericht Köln, HRB 54613

Re: How to extract entities from documents using Microdata?

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Rüdiger

On Sat, Mar 16, 2013 at 4:56 PM, Rüdiger Kurz <r....@alkacon.com> wrote:
> Hi Walter,
>
> thanks for the quick reply. Are the extracted entities from the
> htmlextractor enhancement engine automatically stored into the entity hub?
>

There is no such component that stores Entities present in the
ContentItem to the Entityhub, as this is a very uncommon use case.
Typically entities extracted from parsed content are considered as
suggestions.  So one would consider an additional user interaction
(e.g. accepting & storing an Entity) before adding them to the
Entityhub.

However if this is your use case it should be simple to add such an
Enhancement engine.

> What I want to reach is to get an index that stores the extracted entities
> and also the document itself with references on the entities related to this
> document. It would be great if that could be done by configuration only.

If the htmlextractor engine adds extracted Entities to the metadata
of the ContentItem, you can access them with the LDPath configuration
of the Contenthub.

> Maybe someone could lend me a hand with building the right enhancement chain
> as a first step.
>

If you just want the Enhancer to extract Microdata, you can create a
chain that only contains the htmlextractor engine.
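
As a hedged illustration of how such a chain could then be called: the
sketch below only builds the HTTP request for the Enhancer's
/enhancer/chain/{name} endpoint and does not send it. The base URL
assumes a default local launcher, the chain name "microdata" is a
hypothetical name for a chain you would have to configure yourself, and
the HTML snippet is invented sample data.

```python
import urllib.request

STANBOL_BASE = "http://localhost:8080"  # assumption: default local launcher
CHAIN_NAME = "microdata"                # hypothetical chain with only htmlextractor


def build_enhance_request(html, base=STANBOL_BASE, chain=CHAIN_NAME):
    """Build (but do not send) a POST request to the given enhancement chain."""
    return urllib.request.Request(
        url=f"{base}/enhancer/chain/{chain}",
        data=html.encode("utf-8"),
        headers={
            "Content-Type": "text/html; charset=UTF-8",
            "Accept": "text/turtle",  # ask for the enhancement results as Turtle
        },
        method="POST",
    )


# Invented sample document with one schema.org Microdata item.
html = (
    '<div itemscope itemtype="http://schema.org/Person">'
    '<span itemprop="name">Jane Doe</span></div>'
)
request = build_enhance_request(html)
print(request.get_method(), request.full_url)

# To actually send it against a running Stanbol instance:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

The response would contain the ContentItem's metadata, which is where
the extracted Microdata should show up.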

> In my mind is building up a Solr Search UI offering entity based
> autosuggestion including spellchecker and faceted search.

For Entity based autosuggestion you might want to use the Entityhub.
The rest should be possible by using the Contenthub.





Re: How to extract entities from documents using Microdata?

Posted by Rüdiger Kurz <r....@alkacon.com>.
Hi Walter,

thanks for the quick reply. Are the entities extracted by the
htmlextractor enhancement engine automatically stored in the Entityhub?

What I want to achieve is an index that stores the extracted entities
and also the document itself, with references to the entities related
to that document. It would be great if that could be done by
configuration only. Maybe someone could lend me a hand with building the
right enhancement chain as a first step.

What I have in mind is building a Solr search UI offering entity-based
autosuggestion, including spellchecking and faceted search.

Thanks again.



Re: How to extract entities from documents using Microdata?

Posted by Walter Kasper <ka...@dfki.de>.
Dear Rüdiger,

The htmlextractor enhancement engine provides a microdata extractor that 
should work well for schema.org annotations. Just test it with your data.

Best regards,

Walter
