Posted to dev@stanbol.apache.org by Rüdiger Kurz <r....@alkacon.com> on 2013/03/16 16:22:18 UTC
How to extract entities from documents using Microdata?
Hello Stanbolers,
I want to extract and then store entities from HTML documents that
use Microdata annotations based on the schema.org type hierarchy as the
ontology. I would appreciate any kind of approach, including the use of VIE.
Many thanks in advance
Rüdiger
--
Rüdiger Kurz
-------------------
Alkacon Software GmbH - The OpenCms Experts
An der Wachsfabrik 13
50996 Koeln, DE
http://www.alkacon.com
http://www.opencms.org
Geschäftsführer: Alexander Kandzior, Amtsgericht Köln, HRB 54613
Re: How to extract entities from documents using Microdata?
Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Walter
On Sat, Mar 16, 2013 at 5:20 PM, Walter Kasper <wk...@apache.org> wrote:
> Hi Rüdiger,
>
> The engine just adds the extracted microdata annotations as RDF to the item's
> metadata, linked to the content item by the
> "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#contains"
> property. So it should be straightforward to build an index as you described
> by extracting them from the metadata.
>
I think the htmlextractor engine should also create
"fise:EntityAnnotations" for extracted Entities, so that:
{entityAnno} rdf:type fise:EntityAnnotation
{entityAnno} fise:entity-reference {entity-uri}
{entityAnno} fise:entity-label {entity-label}
{entityAnno} fise:entity-type {entity-type}
{entityAnno} fise:extracted-from {content-item-uri}
If your Engine can also determine the section in the text mentioning
the entity, then it would be good if you could also create a
fise:TextAnnotation for it.
This might be important because:
* other Engines in the Enhancement Chain could process explicitly
mentioned Entities similarly to extracted ones
* the Contenthub will need that information to correctly set the
Context for executing configured LDPaths.
best
Rupert
> Best regards,
>
> Walter Kasper
>
>
> Rüdiger Kurz wrote:
>>
>> Hi Walter,
>>
>> thanks for the quick reply. Are the extracted entities from the
>> htmlextractor enhancement engine automatically stored into the entity hub?
>>
>> What I want to reach is to get an index that stores the extracted entities
>> and also the document itself with references on the entities related to this
>> document. It would be great if that could be done by configuration only.
>> Maybe someone could lend me a hand with building the right enhancement chain
>> as a first step.
>>
>> In my mind is building up a Solr Search UI offering entity based
>> autosuggestion including spellchecker and faceted search.
>>
>> Thanks again.
>>
>> On 16.03.2013 16:29, Walter Kasper wrote:
>>>
>>> Dear Rüdiger,
>>>
>>> The htmlextractor enhancement engine provides a microdata extractor that
>>> should work well for schema.org annotations. Just test it with your data.
>>>
>>> Best regards,
>>>
>>> Walter
>>>
>>> Rüdiger Kurz wrote:
>>>>
>>>> Hello Stanbolers,
>>>>
>>>> I want to extract and then store entities from HTML documents that are
>>>> using Microdata annotations based on the type hierarchy of schema.org
>>>> as Ontology. I appreciate any kind of approach including the use of VIE.
>>>>
>>>> Many thanks in advance
>>>> Rüdiger
>>
>>
>
>
>
--
| Rupert Westenthaler rupert.westenthaler@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen
Re: How to extract entities from documents using Microdata?
Posted by Walter Kasper <wk...@apache.org>.
Hi Rüdiger,
The engine just adds the extracted microdata annotations as RDF to the
item's metadata, linked to the content item by the
"http://www.semanticdesktop.org/ontologies/2007/01/19/nie#contains"
property. So it should be straightforward to build an index as you
described by extracting them from the metadata.
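As a rough illustration of that extraction step, the sketch below walks a list of (subject, predicate, object) triples, follows nie:contains from the content item, and collects the properties of each linked resource. It assumes the content item is the subject of nie:contains. The sample triples, the schema.org property URI, and the function names are hypothetical and not actual engine output.

```python
NIE_CONTAINS = "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#contains"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA_NAME = "http://schema.org/name"  # assumed property URI, for illustration


def contained_resources(triples, content_item_uri):
    """Return the resources linked from the content item via nie:contains."""
    return [o for s, p, o in triples if s == content_item_uri and p == NIE_CONTAINS]


def properties_of(triples, subject):
    """Group all property values of one subject by predicate."""
    props = {}
    for s, p, o in triples:
        if s == subject:
            props.setdefault(p, []).append(o)
    return props


# Hypothetical metadata, standing in for the RDF returned by the Enhancer.
metadata = [
    ("urn:example:content-item-1", NIE_CONTAINS, "urn:example:item-1"),
    ("urn:example:item-1", RDF_TYPE, "http://schema.org/Person"),
    ("urn:example:item-1", SCHEMA_NAME, "Jane Doe"),
]

index_entries = {
    resource: properties_of(metadata, resource)
    for resource in contained_resources(metadata, "urn:example:content-item-1")
}
print(index_entries)
```

Each entry in index_entries could then be mapped to one document in a search index, with the content item URI stored as a back-reference.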
Best regards,
Walter Kasper
Rüdiger Kurz wrote:
> Hi Walter,
>
> thanks for the quick reply. Are the extracted entities from the
> htmlextractor enhancement engine automatically stored into the entity
> hub?
>
> What I want to reach is to get an index that stores the extracted
> entities and also the document itself with references on the entities
> related to this document. It would be great if that could be done by
> configuration only. Maybe someone could lend me a hand with building
> the right enhancement chain as a first step.
>
> In my mind is building up a Solr Search UI offering entity based
> autosuggestion including spellchecker and faceted search.
>
> Thanks again.
>
> On 16.03.2013 16:29, Walter Kasper wrote:
>> Dear Rüdiger,
>>
>> The htmlextractor enhancement engine provides a microdata extractor that
>> should work well for schema.org annotations. Just test it with your
>> data.
>>
>> Best regards,
>>
>> Walter
>>
>> Rüdiger Kurz wrote:
>>> Hello Stanbolers,
>>>
>>> I want to extract and then store entities from HTML documents that are
>>> using Microdata annotations based on the type hierarchy of schema.org
>>> as Ontology. I appreciate any kind of approach including the use of
>>> VIE.
>>>
>>> Many thanks in advance
>>> Rüdiger
>
--
Dr. Walter Kasper
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Tel.: +49-681-85775-5300
Fax: +49-681-85775-5338
Email: kasper@dfki.de
-------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------
Re: How to extract entities from documents using Microdata?
Posted by Rüdiger Kurz <r....@alkacon.com>.
Hi Rupert,
many thanks to you as well.
Overall, I think I first need to start experimenting with the
Stanbol web interface. In the meantime, I have added some comments and
questions inline.
On 17.03.2013 11:12, Rupert Westenthaler wrote:
> Hi Rüdiger
>
> On Sat, Mar 16, 2013 at 4:56 PM, Rüdiger Kurz <r....@alkacon.com> wrote:
>> Hi Walter,
>>
>> thanks for the quick reply. Are the extracted entities from the
>> htmlextractor enhancement engine automatically stored into the entity hub?
>>
>
> There is no such component that stores Entities present in the
> ContentItem to the Entityhub, as this is a very uncommon use case.
> Typically entities extracted from parsed content are considered as
> suggestions. So one would consider an additional user interaction
> (e.g. accepting & storing an Entity) before adding them to the
> Entityhub.
>
> However if this is your use case it should be simple to add such an
> Enhancement engine.
Since my approach starts from HTML that is already annotated, I
thought the idea made sense. Maybe I'm wrong, because I'm not that deep
into Stanbol. Please let me know whether my idea makes sense.
>
>> What I want to reach is to get an index that stores the extracted entities
>> and also the document itself with references on the entities related to this
>> document. It would be great if that could be done by configuration only.
>
> If the htmlextractor engine adds extracted Entities to the metadata
> of the ContentItem you can access them with the LDPath configuration
> of the Contenthub.
I don't have any experience with LDPath, but it sounds like it would be
easy to do what you wrote, and it would be worthwhile to spend some time
experimenting with it. Is there a good starting point for working with
LDPath together with Stanbol?
>
>> Maybe someone could lend me a hand with building the right enhancement chain
>> as a first step.
>>
>
> If you just want the Enhancer to extract Microdata you can create a
> chain that only contains the htmlextractor engine.
Sounds straightforward to me!
>
>> In my mind is building up a Solr Search UI offering entity based
>> autosuggestion including spellchecker and faceted search.
>
> For Entity based autosuggestion you might want to use the Entityhub.
> The rest should be possible by using the Contenthub.
As I said before, I urgently need to start some experiments ...
>
>>
>> Thanks again.
>>
>> On 16.03.2013 16:29, Walter Kasper wrote:
>>
>>> Dear Rüdiger,
>>>
>>> The htmlextractor enhancement engine provides a microdata extractor that
>>> should work well for schema.org annotations. Just test it with your data.
>>>
>>> Best regards,
>>>
>>> Walter
>>>
>>> Rüdiger Kurz wrote:
>>>>
>>>> Hello Stanbolers,
>>>>
>>>> I want to extract and then store entities from HTML documents that are
>>>> using Microdata annotations based on the type hierarchy of schema.org
>>>> as Ontology. I appreciate any kind of approach including the use of VIE.
>>>>
>>>> Many thanks in advance
>>>> Rüdiger
--
Rüdiger Kurz
-------------------
Alkacon Software GmbH - The OpenCms Experts
Rüdiger Kurz
An der Wachsfabrik 13
50996 Koeln, DE
Tel: +49 (0)2236 3826-16
Fax: +49 (0)2236 3826-20
Email: r.kurz@alkacon.com
http://www.alkacon.com
http://www.opencms.org
Geschäftsführer: Alexander Kandzior, Amtsgericht Köln, HRB 54613
Re: How to extract entities from documents using Microdata?
Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Rüdiger
On Sat, Mar 16, 2013 at 4:56 PM, Rüdiger Kurz <r....@alkacon.com> wrote:
> Hi Walter,
>
> thanks for the quick reply. Are the extracted entities from the
> htmlextractor enhancement engine automatically stored into the entity hub?
>
There is no component that stores Entities present in the
ContentItem in the Entityhub, as this is a very uncommon use case.
Typically, entities extracted from parsed content are considered
suggestions. So one would expect an additional user interaction
(e.g. accepting & storing an Entity) before adding them to the
Entityhub.
However, if this is your use case, it should be simple to add such an
Enhancement engine.
> What I want to reach is to get an index that stores the extracted entities
> and also the document itself with references on the entities related to this
> document. It would be great if that could be done by configuration only.
If the htmlextractor engine adds extracted Entities to the metadata
of the ContentItem you can access them with the LDPath configuration
of the Contenthub.
> Maybe someone could lend me a hand with building the right enhancement chain
> as a first step.
>
If you just want the Enhancer to extract Microdata you can create a
chain that only contains the htmlextractor engine.
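Once such a chain is configured, a document can be sent to it over the Enhancer's REST interface. The sketch below only builds the HTTP request and does not send it. The chain name "microdata", the host, and the port are assumptions, so adjust them to your installation.

```python
import urllib.request


def build_enhancer_request(html, base_url="http://localhost:8080", chain="microdata"):
    """Build a POST request for a named enhancement chain.

    The chain name and base URL are hypothetical defaults; the chain is
    expected to contain only the htmlextractor engine.
    """
    url = f"{base_url}/enhancer/chain/{chain}"
    return urllib.request.Request(
        url,
        data=html.encode("utf-8"),
        headers={
            "Content-Type": "text/html; charset=UTF-8",  # the parsed content
            "Accept": "text/turtle",                     # RDF format of the result
        },
        method="POST",
    )


request = build_enhancer_request("<html><body>...</body></html>")
print(request.get_method(), request.full_url)
```

With a running Stanbol instance, passing the request to urllib.request.urlopen should return the enhancement metadata as RDF.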
> In my mind is building up a Solr Search UI offering entity based
> autosuggestion including spellchecker and faceted search.
For Entity based autosuggestion you might want to use the Entityhub.
The rest should be possible by using the Contenthub.
>
> Thanks again.
>
> On 16.03.2013 16:29, Walter Kasper wrote:
>
>> Dear Rüdiger,
>>
>> The htmlextractor enhancement engine provides a microdata extractor that
>> should work well for schema.org annotations. Just test it with your data.
>>
>> Best regards,
>>
>> Walter
>>
>> Rüdiger Kurz wrote:
>>>
>>> Hello Stanbolers,
>>>
>>> I want to extract and then store entities from HTML documents that are
>>> using Microdata annotations based on the type hierarchy of schema.org
>>> as Ontology. I appreciate any kind of approach including the use of VIE.
>>>
>>> Many thanks in advance
>>> Rüdiger
>
>
--
| Rupert Westenthaler rupert.westenthaler@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen
Re: How to extract entities from documents using Microdata?
Posted by Rüdiger Kurz <r....@alkacon.com>.
Hi Walter,
thanks for the quick reply. Are the entities extracted by the
htmlextractor enhancement engine automatically stored in the Entityhub?
What I want to achieve is an index that stores the extracted entities
and also the document itself, with references to the entities related
to this document. It would be great if that could be done by
configuration only. Maybe someone could lend me a hand with building the
right enhancement chain as a first step.
What I have in mind is building a Solr search UI that offers
entity-based autosuggestion, including a spellchecker and faceted search.
Thanks again.
On 16.03.2013 16:29, Walter Kasper wrote:
> Dear Rüdiger,
>
> The htmlextractor enhancement engine provides a microdata extractor that
> should work well for schema.org annotations. Just test it with your data.
>
> Best regards,
>
> Walter
>
> Rüdiger Kurz wrote:
>> Hello Stanbolers,
>>
>> I want to extract and then store entities from HTML documents that are
>> using Microdata annotations based on the type hierarchy of schema.org
>> as Ontology. I appreciate any kind of approach including the use of VIE.
>>
>> Many thanks in advance
>> Rüdiger
--
Rüdiger Kurz
-------------------
Alkacon Software GmbH - The OpenCms Experts
An der Wachsfabrik 13
50996 Koeln, DE
http://www.alkacon.com
http://www.opencms.org
Geschäftsführer: Alexander Kandzior, Amtsgericht Köln, HRB 54613
Re: How to extract entities from documents using Microdata?
Posted by Walter Kasper <ka...@dfki.de>.
Dear Rüdiger,
The htmlextractor enhancement engine provides a microdata extractor that
should work well for schema.org annotations. Just test it with your data.
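For readers unfamiliar with Microdata, the following sketch shows what such annotations look like and what an extractor has to recover from them. It is a simplified standard-library illustration, not the htmlextractor implementation: it assumes well-formed HTML, reads only text-valued properties, and treats nested items as separate top-level items. The class name and sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Elements without an end tag; they are skipped to keep the stack balanced.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class MicrodataSketch(HTMLParser):
    """Collect itemscope elements and their text-valued itemprop children."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._stack = []   # one entry per open element: ("item", dict), ("prop", name) or None
        self._text = None  # text buffer while inside a property element

    def _current_item(self):
        for entry in reversed(self._stack):
            if entry and entry[0] == "item":
                return entry[1]
        return None

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        if "itemscope" in attrs:
            item = {"type": attrs.get("itemtype"), "properties": {}}
            self.items.append(item)
            self._stack.append(("item", item))
        elif "itemprop" in attrs and self._current_item() is not None:
            self._stack.append(("prop", attrs["itemprop"]))
            self._text = []
        else:
            self._stack.append(None)

    def handle_endtag(self, tag):
        if tag in VOID or not self._stack:
            return
        entry = self._stack.pop()
        if entry and entry[0] == "prop":
            value = "".join(self._text).strip()
            self._current_item()["properties"].setdefault(entry[1], []).append(value)
            self._text = None

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)


sample = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <span itemprop="jobTitle">Professor</span>
</div>
"""
parser = MicrodataSketch()
parser.feed(sample)
print(parser.items)
```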
Best regards,
Walter
Rüdiger Kurz wrote:
> Hello Stanbolers,
>
> I want to extract and then store entities from HTML documents that are
> using Microdata annotations based on the type hierarchy of schema.org
> as Ontology. I appreciate any kind of approach including the use of VIE.
>
> Many thanks in advance
> Rüdiger
>