
Posted to dev@stanbol.apache.org by Rüdiger Kurz <r....@alkacon.com> on 2012/11/08 10:49:27 UTC

User story: Don't want to lose the semantic information I already have inside my CMS

Hi Stanbolers,

during ApacheCon in Sinsheim I had some interesting conversations with
Fabian, Rupert and Anil. As a result, I want to summarize one of the
discussions as a user story describing a typical requirement for us as
a CMS provider.

It is not correct to assume that traditional Content Management Systems
do not store semantic information. For example, CMS systems already
deliver RDFa-annotated HTML, and nearly all of them provide some
tagging/categorizing mechanism. OpenCms in particular provides a
generic approach to defining structured content, so we know that a
specific field/item of a content has a specified type and a defined
label. For example, "a technology event named ApacheCon takes place in
Sinsheim from 05 Nov until 08 Nov 2012" is information that is already
stored in OpenCms. Moreover, OpenCms is able to connect that event with
all speakers/persons who will give a presentation at that event, ...

What we would like to achieve is more than plain text enhancement: we
want to tell Stanbol all the information and associations we already
know. In other words, we absolutely don't want to lose the semantic
information that already exists in OpenCms.

A good starting point would be a REST endpoint that can retrieve an
RDFa-annotated HTML document and then extract the RDFa in order to
store it inside the semantic-index/entity-hub/..., as I previously
suggested on the list under the subject "Extend stanbol content hub for
RDFa support". Maybe the content hub is not the right component, but
the requirement for RDFa extraction still exists.
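
To make the requirement concrete, here is a minimal sketch of what "extracting the RDFa" from such a document could look like. It is not Stanbol code and not a conformant RDFa processor: it assumes well-formed nesting, subjects given via `about`, and values taken from `content` or from the element text. The class name, the example IRI and the schema.org terms are assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SimpleRdfaExtractor(HTMLParser):
    """Collects (subject, property, value) tuples from a small RDFa subset."""

    def __init__(self):
        super().__init__()
        self.statements = []
        # One entry per open element: (subject, property or None, text parts).
        self._stack = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Inherit the subject from the enclosing element unless `about` resets it.
        subject = self._stack[-1][0] if self._stack else None
        if "about" in attrs:
            subject = attrs["about"]
            if "typeof" in attrs:
                self.statements.append((subject, "rdf:type", attrs["typeof"]))
        prop = attrs.get("property")
        if prop and subject is not None and "content" in attrs:
            # The value is given explicitly, e.g. <meta property=... content=...>.
            self.statements.append((subject, prop, attrs["content"]))
            prop = None
        if tag not in VOID_TAGS:
            self._stack.append((subject, prop, []))

    def handle_data(self, data):
        # Text belongs to every open element that is collecting a literal.
        for _subject, prop, parts in self._stack:
            if prop:
                parts.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        subject, prop, parts = self._stack.pop()
        if prop and subject is not None:
            self.statements.append((subject, prop, "".join(parts).strip()))


# Hypothetical markup for the event described above.
EVENT_HTML = """
<div about="http://example.org/event/apachecon-2012" typeof="schema:Event">
  <span property="schema:name">ApacheCon</span> takes place in
  <span property="schema:location">Sinsheim</span>
  <meta property="schema:startDate" content="2012-11-05">
  <meta property="schema:endDate" content="2012-11-08">
</div>
"""

extractor = SimpleRdfaExtractor()
extractor.feed(EVENT_HTML)
extractor.close()
for statement in extractor.statements:
    print(statement)
```

The resulting tuples are the kind of statements a CMS would want handed over unchanged, instead of having them rediscovered from plain text.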

-- 
Kind Regards,
Rüdiger.

-------------------

Rüdiger Kurz

Alkacon Software GmbH  - The OpenCms Experts
http://www.alkacon.com - http://www.opencms.org

Re: User story: Don't want to lose the semantic information I already have inside my CMS

Posted by Stéphane Corlosquet <sc...@gmail.com>.

On Fri, Nov 9, 2012 at 7:30 AM, Walter Kasper <ka...@dfki.de> wrote:

> Hi,
>
>
> Rupert Westenthaler wrote:
>
>> Hi Walter, all
>>
>> I had already a look at the htmlextractor and I think it is a nice
>> addition to Stanbol!
>>
>> I would be interested in an Engine that does not only extract embedded
>> knowledge, but also keeps the link to the actual position within the
>> parsed Content. In more detail I would like to link the extracted
>> knowledge with an fise:Enhancement (e.g. a fise:TextAnnotation) that
>> selects the annotated part of the content.
>>
>> This would not only allow to have the extracted knowledge in the
>> metadata of the ContentItem, but also allow EnhancementEngines to
>> process those information in the same way as if they would be
>> extracted by an other engine (e.g. linking an RDFa annotation about an
>> Person, Place in the same way as an Person, Place detected by an NER
>> engine).
>>
>
> I think that could be done.
>
>
>
>> Jukka Zitting  presentation "Content extraction with Apache Tika" [1]
>> at the ApacheCon included a nice example on how to extract the text of
>> an Link. I think this is a nice starting point for such an feature.
>>
>> Generally I think it would be better to add RDFa, Micro Data support
>> to directly to Tika instead of implementing custom solutions within
>> Stanbol. WDYT?
>>
>
> Tika currently is not suitable for RDFa extraction etc. because its HTML
> parser (TagSoup) throws away all namespace declarations needed for the RDF.
>

You might want to consider any23 [1], another Apache project, which can
extract RDFa and other semantic markup from HTML. There are also some
independent RDFa parsers you can use in Java, such as [2].

Steph.

[1] http://any23.apache.org/extractors.html
[2] https://github.com/niklasl/clj-rdfa-jena



-- 
Steph.

Re: User story: Don't want to lose the semantic information I already have inside my CMS

Posted by Walter Kasper <ka...@dfki.de>.

Hi,

Rupert Westenthaler wrote:
> Hi Walter, all
>
> I had already a look at the htmlextractor and I think it is a nice
> addition to Stanbol!
>
> I would be interested in an Engine that does not only extract embedded
> knowledge, but also keeps the link to the actual position within the
> parsed Content. In more detail I would like to link the extracted
> knowledge with an fise:Enhancement (e.g. a fise:TextAnnotation) that
> selects the annotated part of the content.
>
> This would not only allow to have the extracted knowledge in the
> metadata of the ContentItem, but also allow EnhancementEngines to
> process those information in the same way as if they would be
> extracted by an other engine (e.g. linking an RDFa annotation about an
> Person, Place in the same way as an Person, Place detected by an NER
> engine).

I think that could be done.

>
> Jukka Zitting  presentation "Content extraction with Apache Tika" [1]
> at the ApacheCon included a nice example on how to extract the text of
> an Link. I think this is a nice starting point for such an feature.
>
> Generally I think it would be better to add RDFa, Micro Data support
> to directly to Tika instead of implementing custom solutions within
> Stanbol. WDYT?

Tika currently is not suitable for RDFa extraction etc. because its HTML 
parser (TagSoup) throws away all namespace declarations needed for the RDF.
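
A short sketch of why the namespace declarations matter: in RDFa 1.0 style markup, a compact name such as `dc:title` can only be expanded to a full IRI if the `xmlns:dc` declaration is still available. The two helper functions below are hypothetical and only illustrate that dependency; they do not model the behaviour of any particular parser.

```python
def collect_prefixes(attrs):
    """Builds a prefix -> namespace IRI map from xmlns:* attributes."""
    marker = "xmlns:"
    return {name[len(marker):]: value
            for name, value in attrs.items()
            if name.startswith(marker)}


def expand_curie(curie, prefixes):
    """Returns the full IRI for a prefix:reference name, or None if the
    prefix is unknown (for example because its declaration was dropped)."""
    prefix, _, reference = curie.partition(":")
    if prefix not in prefixes:
        return None
    return prefixes[prefix] + reference


# Attributes as they appear in the source document.
original_attrs = {"xmlns:dc": "http://purl.org/dc/elements/1.1/", "class": "event"}
# The same element after the namespace declaration has been discarded.
stripped_attrs = {"class": "event"}

print(expand_curie("dc:title", collect_prefixes(original_attrs)))
print(expand_curie("dc:title", collect_prefixes(stripped_attrs)))
```

With the declaration present the name expands to a Dublin Core IRI; without it the property can no longer be resolved, and no usable triple can be produced.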

Best regards,

Walter

-- 
Dr. Walter Kasper
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Tel.:  +49-681-85775-5300
Fax:   +49-681-85775-5338
Email: kasper@dfki.de
-------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern

Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------

Re: User story: Don't want to lose the semantic information I already have inside my CMS

Posted by Rupert Westenthaler <ru...@gmail.com>.

Hi Walter, all

I have already had a look at the htmlextractor and I think it is a nice
addition to Stanbol!

I would be interested in an engine that not only extracts embedded
knowledge, but also keeps the link to the actual position within the
parsed content. In more detail, I would like to link the extracted
knowledge with a fise:Enhancement (e.g. a fise:TextAnnotation) that
selects the annotated part of the content.

This would not only make the extracted knowledge available in the
metadata of the ContentItem, but also allow EnhancementEngines to
process that information in the same way as if it had been extracted
by another engine (e.g. linking an RDFa annotation about a Person or
Place in the same way as a Person or Place detected by an NER engine).
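
The idea of keeping the position can be sketched as follows, assuming the annotated elements carry a `property` attribute and the markup is well nested. The class name and the dictionary keys are illustrative only and are not the Stanbol enhancement vocabulary; the offsets refer to the extracted plain text, not to the raw HTML.

```python
from html.parser import HTMLParser

# Elements without an end tag are skipped when tracking open elements.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class OffsetTrackingExtractor(HTMLParser):
    """Extracts plain text and records where annotated elements occur in it."""

    def __init__(self):
        super().__init__()
        self.text = ""          # plain text collected so far
        self.annotations = []   # one dict per annotated element
        self._stack = []        # (property or None, start offset) per open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self._stack.append((dict(attrs).get("property"), len(self.text)))

    def handle_data(self, data):
        self.text += data

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        prop, start = self._stack.pop()
        if prop:
            end = len(self.text)
            self.annotations.append({
                "property": prop,
                "start": start,
                "end": end,
                "selected_text": self.text[start:end],
            })


# Hypothetical annotated sentence.
SAMPLE = ('<p><span property="schema:name">ApacheCon</span> takes place in '
          '<span property="schema:location">Sinsheim</span>.</p>')

tracker = OffsetTrackingExtractor()
tracker.feed(SAMPLE)
tracker.close()
print(tracker.text)
for annotation in tracker.annotations:
    print(annotation)
```

Each recorded span carries the same kind of information a text annotation needs (start, end, selected text), so a later engine could treat it like a mention found by any other engine.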

Jukka Zitting's presentation "Content extraction with Apache Tika" [1]
at ApacheCon included a nice example of how to extract the text of
a link. I think this is a nice starting point for such a feature.

Generally, I think it would be better to add RDFa and Microdata support
directly to Tika instead of implementing custom solutions within
Stanbol. WDYT?

best
Rupert

[1] http://www.slideshare.net/jukka/content-extraction-with-apache-tika Slide 19

On Thu, Nov 8, 2012 at 12:31 PM, Walter Kasper <ka...@dfki.de> wrote:
> Hi Rüdiger,
>
> RDFa extraction from HTML is part of the htmlextractor engine in Stanbol.
> I would welcome it if you could test it with your OpenCms docs.
>
> Best regards,
>
> Walter
>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: User story: Don't want to lose the semantic information I already have inside my CMS

Posted by Walter Kasper <ka...@dfki.de>.

Hi Rüdiger,

RDFa extraction from HTML is part of the htmlextractor engine in
Stanbol. I would welcome it if you could test it with your OpenCms docs.

Best regards,

Walter




Re: User story: Don't want to lose the semantic information I already have inside my CMS

Posted by Suat Gonul <su...@gmail.com>.

Hi Gabriel,

Currently, the component that gets documents from a CMIS repository and
sends them to the Contenthub is the CMISContenthubFeeder[1], which is a
CMIS-based implementation of the ContenthubFeeder[2] interface. This
interface and its implementations are open to improvement based on the
requirements of actual end users like you.

Currently, CMISContenthubFeeder assumes that documents in the CMS have
the cmis:Document type. This component traverses all of the properties
of a document except the ones defined in the "excludedProperties" field
of the CMISContenthubFeeder class. This can easily be made
configurable.
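
As a language-neutral illustration of the configurable variant, the exclusion list can be passed in instead of being fixed in the class. This is a sketch in Python, not the Java code of CMISContenthubFeeder; the function name, the default set and the sample properties are assumptions, and the real contents of "excludedProperties" may differ.

```python
# Hypothetical default; the real excludedProperties field may list other names.
DEFAULT_EXCLUDED = frozenset({"cmis:objectId", "cmis:changeToken"})


def select_properties(document_properties, excluded=DEFAULT_EXCLUDED):
    """Returns all properties of a document except the excluded ones."""
    return {name: value
            for name, value in document_properties.items()
            if name not in excluded}


# Sample document properties, made up for illustration.
document = {
    "cmis:objectId": "42",
    "cmis:changeToken": "7",
    "cmis:name": "agenda.html",
    "my:category": "event",
}

# Default behaviour: the built-in exclusions apply.
print(select_properties(document))
# Configured behaviour: the caller decides what to leave out.
print(select_properties(document, excluded={"cmis:changeToken"}))
```

Passing the set as a parameter is what would let a configuration value, for example one entered through the OSGi console, replace the hardcoded list.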

Another point: in the current version in SVN, the properties
of a cmis:Document are sent to the Contenthub as explicit constraints
along with the ContentItem. We have changed this so that the
properties of the document are added as a ContentItem part. This is a
more convenient way considering the structure of a ContentItem[3].

It has been a long time since I last worked on this feature, and
currently I have a problem getting a CMIS session, so I cannot test
it. Once that issue is resolved, I can add a configuration where you
can specify the properties to be excluded through the OSGi console.

Best,
Suat

[1]
https://svn.apache.org/repos/asf/stanbol/trunk/cmsadapter/cmis/src/main/java/org/apache/stanbol/cmsadapter/cmis/mapping/CMISContenthubFeeder.java
[2]
https://svn.apache.org/repos/asf/stanbol/trunk/cmsadapter/servicesapi/src/main/java/org/apache/stanbol/cmsadapter/servicesapi/mapping/ContenthubFeeder.java
[3]
http://stanbol.apache.org/docs/0.9.0-incubating/enhancer/contentitem.html

On 11/8/2012 12:15 PM, Gabriel Vince wrote:
> Hello,
>
> I have a related question and suggestion.
>
> We are heavily using metadata in our CMS and considering to use
> Stanbol as a semantic extension and enrichment for the already stored
> metadata. Seems CMIS-RDF mapping stores only hardcoded set of CMIS
> object properties (see the CMSAdapterVocabulary).
>
> Whouldn't be more flexible to have an extendable configuration what
> properties to store? Or simply store all found CMIS properties as RDF
> triples. I'm currently working on this extension if somebody
> interested (along with Alfresco aspect support).
>
> Best regards
>           Gabriel
>

Re: User story: Don't want to lose the semantic information I already have inside my CMS

Posted by Gabriel Vince <ga...@apogado.com>.

Hello,

I have a related question and suggestion.

We use metadata heavily in our CMS and are considering using
Stanbol as a semantic extension and enrichment for the already stored
metadata. It seems the CMIS-RDF mapping stores only a hardcoded set of
CMIS object properties (see the CMSAdapterVocabulary).

Wouldn't it be more flexible to have an extensible configuration of
which properties to store? Or simply to store all CMIS properties found
as RDF triples? I'm currently working on this extension, if anybody is
interested (along with Alfresco aspect support).
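
The "store everything" option could look roughly like the sketch below: every property found on an object becomes one triple per value. This is only an illustration in Python, not the extension itself; the function name, the namespace, the subject IRI and the sample property names are all assumptions.

```python
def properties_to_triples(subject_iri, properties, namespace):
    """Maps every property of a CMIS object to (subject, predicate, value)
    triples. Multi-valued properties yield one triple per value."""
    triples = []
    for name, value in sorted(properties.items()):
        # Derive a predicate IRI from the property id, e.g. cmis:name -> cmis_name.
        predicate = namespace + name.replace(":", "_")
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            if item is None:
                continue  # unset properties produce no triple
            triples.append((subject_iri, predicate, str(item)))
    return triples


# Sample properties, made up for illustration.
sample_properties = {
    "cmis:name": "agenda.html",
    "cm:title": "ApacheCon agenda",
    "my:keywords": ["semantic", "cms"],
    "my:reviewer": None,
}

triples = properties_to_triples(
    "http://example.org/doc/42",   # hypothetical subject IRI
    sample_properties,
    "http://example.org/cmis#",    # hypothetical namespace
)
for triple in triples:
    print(triple)
```

Because nothing is filtered by a fixed vocabulary, custom properties such as those contributed by repository-specific extensions end up in the graph as well.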

Best regards
          Gabriel


On Thu, Nov 8, 2012 at 10:59 AM, Fabian Christ
<ch...@googlemail.com> wrote:
> Hi Rüdiger,
>
> thank you for bringing this up. Yes, we should reshape our view on
> traditional CMS and not only focus on plain text without any metadata. We
> have to think about the right ways to ensure that we do "never lose any
> semantics", like in "never lose content".
>
> Best,
>  - Fabian
>



-- 
Gabriel Vince
Senior Consultant
Apogado
http://www.apogado.com

Re: User story: Don't want to lose the semantic information I already have inside my CMS

Posted by Fabian Christ <ch...@googlemail.com>.

Hi Rüdiger,

thank you for bringing this up. Yes, we should reshape our view of
traditional CMSs and not only focus on plain text without any metadata. We
have to think about the right ways to ensure that we "never lose any
semantics", just as in "never lose content".

Best,
 - Fabian





-- 
Fabian
http://twitter.com/fctwitt