Posted to dev@stanbol.apache.org by Mihály Héder <he...@gmail.com> on 2012/09/25 21:07:09 UTC

Lessons learnt from EAP+ questions about future directions

Hi All,

I have written a blog post about the lessons learnt from the EAP project I
had been working on:
http://blog.iks-project.eu/lessons-learnt-while-working-with-apache-stanbol/

The reason I'm citing this here is that I'm interested in your opinion on
the following mid-term development questions and suggestions (discussed in
detail in the post):
-What is the best way to monitor a running Stanbol instance with
munin/nagios/icinga, etc.? How can I extract e.g. an enhancements/hour
statistic from Stanbol?
-I think at some point we should create a standardized REST API through
which non-Java EEs could be accessed.
-Also, I think that if we had some standardized description XML or whatever
format that would tell what kind of output a certain EE produces, that
would be helpful.
-Finally, a general Enhancement Feedback interface would be great (but
there has been some discussion about this already on the list).

Cheers
Mihály

Re: Lessons learnt from EAP+ questions about future directions

Posted by Mihály Héder <he...@gmail.com>.
Hi!

On 2 October 2012 08:31, Rupert Westenthaler
<ru...@gmail.com> wrote:
> Hi,
>
> let me just comment on your last point
>
> On Mon, Oct 1, 2012 at 8:55 PM, Mihály Héder <he...@gmail.com> wrote:
>>> However this feature is much more important for UIMA than for Stanbol,
>>> because with Stanbol EnhancementEngines are expected to create
>>> Annotations that conform to the EnhancementStructure.
>>
>> I totally support the self-description interface you propose, as the
>> conformity to the structure is really helpful but not everything. For
>> instance I had to experiment with Stanbol to figure out that LangId
>> will provide a "dc:language" property, and there will be only one of
>> this, not multiple ones (e.g. for every sentence).
>
> This is defined by STANBOL-613.
>
>> Another example is
>> that the UIMAToTriples in my current deployment adds an sso:posTag
>> property to every TextAnnotation.
>
> Here the idea is to use NIF (NLP Interchange Format), but this is
> still in the works. Current work is done in STANBOL-741, but most
> likely I will create a separate issue that defines how NIF annotations are
> linked to Stanbol Enhancements.
>
> Generally representing Word/Phrase level annotations as RDF does not
> scale. This is the reason why STANBOL-733 introduced the AnalyzedText
> ContentPart. So if you would like to allow other Engines to consume
> NLP annotations the UIMA integration should also support the
> AnalyzedText ContentPart.

That's good news. Will look into it!

>> That might be helpful for other EE
>> developers but they have to figure the uri of the property somehow -
>> ok, it is in the documentation, but still...
>>
>
> Maybe we can use the already existing
>
>     org.apache.stanbol.enhancer.servicesapi.ServiceProperties
>
> interface (already implemented by most Enhancement Engines). Possible
> additions would include
>
> * EnhancementFeature: MetadataExtraction, PlainTextExtraction,
> LanguageIdentification, POS tagging, Chunking, NER, EntityLinking, ...

I think I see the benefits of describing the features by naming their
functions. But as we surely cannot foresee all the ways people might
intend to use EEs, I'm afraid this kind of ontology will have to be
continuously expanded, or we end up with some catch-all category. So
the question arises: where will this ontology be kept and maintained?
Anyway, I think a useful addition to this descriptor would be another
one that says little about the function of the EE but describes
precisely what the structure of the produced RDF looks like (which
namespaces/ontologies it uses, the hierarchy/multiplicity of the
triples, etc.).
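
To make the idea of a structural descriptor concrete, here is a minimal
sketch in Python. It is not part of Stanbol: the OutputProperty class, the
violations function and the "max_per_subject" field are hypothetical names,
and triples are modelled as plain (subject, predicate, object) tuples. The
only fact taken from this thread is that LangId writes a single
"dc:language" value.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(frozen=True)
class OutputProperty:
    """One property an engine declares it will write (hypothetical)."""
    predicate: str                         # e.g. "dc:language"
    max_per_subject: Optional[int] = None  # None means no upper bound


def violations(declared: Iterable[OutputProperty], triples: Iterable[Triple]):
    """Return (subject, predicate, count) wherever a declared limit is exceeded."""
    limits = {p.predicate: p.max_per_subject for p in declared}
    # Count how often each declared predicate occurs on each subject.
    counts = Counter((s, p) for s, p, _ in triples if p in limits)
    return sorted(
        (s, p, n) for (s, p), n in counts.items()
        if limits[p] is not None and n > limits[p]
    )


# Descriptor for the LangId behaviour described above: one dc:language value.
LANGID_OUTPUT = [OutputProperty("dc:language", max_per_subject=1)]
```

A consumer could read such a descriptor instead of experimenting with the
engine, and a test could use it to check that an engine keeps its promise.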

Best
Mihály

> * RequiresFeature: Enhancements required by an EnhancementEngine
> * supportsLanguage: list of languages supported (with support for
> exclusions and wildcards, e.g. !fr, !de, *)
> * supportsMimeType: allows an EnhancementEngine to define the
> supported mime types
> * ...
>
> If we use an Ontology for those Features we can
>
> 1. implement the Webservice that publishes the RDF metadata for
> EnhancementEngines based on the ServiceProperties provided by an
> EnhancementEngine
> 2. the URIs of those properties would be also a good entry point for
> the documentation of how those features are represented in the
> EnhancementStructure (or NIF)
>
> best
> Rupert
>
>> Cheers
>> Mihály
>>
>>> best
>>> Rupert
>>>
>>>
>>> --
>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>> | A-5500 Bischofshofen
>
>
>
> --
> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen

Re: Lessons learnt from EAP+ questions about future directions

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi,

let me just comment on your last point

On Mon, Oct 1, 2012 at 8:55 PM, Mihály Héder <he...@gmail.com> wrote:
>> However this feature is much more important for UIMA than for Stanbol,
>> because with Stanbol EnhancementEngines are expected to create
>> Annotations that conform to the EnhancementStructure.
>
> I totally support the self-description interface you propose, as the
> conformity to the structure is really helpful but not everything. For
> instance I had to experiment with Stanbol to figure out that LangId
> will provide a "dc:language" property, and there will be only one of
> this, not multiple ones (e.g. for every sentence).

This is defined by STANBOL-613.

> Another example is
> that the UIMAToTriples in my current deployment adds an sso:posTag
> property to every TextAnnotation.

Here the idea is to use NIF (NLP Interchange Format), but this is
still in the works. Current work is done in STANBOL-741, but most
likely I will create a separate issue that defines how NIF annotations are
linked to Stanbol Enhancements.

Generally representing Word/Phrase level annotations as RDF does not
scale. This is the reason why STANBOL-733 introduced the AnalyzedText
ContentPart. So if you would like to allow other Engines to consume
NLP annotations the UIMA integration should also support the
AnalyzedText ContentPart.

> That might be helpful for other EE
> developers but they have to figure the uri of the property somehow -
> ok, it is in the documentation, but still...
>

Maybe we can use the already existing

    org.apache.stanbol.enhancer.servicesapi.ServiceProperties

interface (already implemented by most Enhancement Engines). Possible
additions would include

* EnhancementFeature: MetadataExtraction, PlainTextExtraction,
LanguageIdentification, POS tagging, Chunking, NER, EntityLinking, ...
* RequiresFeature: Enhancements required by an EnhancementEngine
* supportsLanguage: list of languages supported (with support for
exclusions and wildcards, e.g. !fr, !de, *)
* supportsMimeType: allows an EnhancementEngine to define the
supported mime types
* ...
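
As an illustration of the supportsLanguage entry, here is a minimal sketch
in Python of how such a list could be evaluated. The function name and the
exact semantics are assumptions, not an existing Stanbol API: a "!" prefix
excludes a language, "*" accepts any language, and exclusions win over the
wildcard.

```python
def supports_language(config, lang):
    """Return True if `lang` is accepted by a list such as ["!fr", "!de", "*"]."""
    entries = [e.strip().lower() for e in config if e.strip()]
    # Entries starting with "!" are exclusions; everything else is an inclusion.
    excluded = {e[1:] for e in entries if e.startswith("!")}
    included = {e for e in entries if not e.startswith("!")}
    lang = lang.lower()
    if lang in excluded:
        return False  # assumption: an exclusion overrides the wildcard
    return "*" in included or lang in included
```

With the example from the list above, ["!fr", "!de", "*"] accepts every
language except French and German, while a list without "*" accepts only
the languages it names.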

If we use an Ontology for those Features we can

1. implement the Webservice that publishes the RDF metadata for
EnhancementEngines based on the ServiceProperties provided by an
EnhancementEngine
2. use the URIs of those properties as an entry point for
the documentation of how those features are represented in the
EnhancementStructure (or NIF)

best
Rupert

> Cheers
> Mihály
>
>> best
>> Rupert
>>
>>
>> --
>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Lessons learnt from EAP+ questions about future directions

Posted by Mihály Héder <he...@gmail.com>.
Hi Rupert,

On 27 September 2012 15:49, Rupert Westenthaler
<ru...@gmail.com> wrote:
> Hi Mihály
>
> On Tue, Sep 25, 2012 at 9:07 PM, Mihály Héder <he...@gmail.com> wrote:
>> Hi All,
>>
>> I have written a blog post about the lessons learnt from the EAP project I
>> had been working on:
>> http://blog.iks-project.eu/lessons-learnt-while-working-with-apache-stanbol/
>>
>
> Thanks for this blog post. It is really valuable feedback.
> I will try to answer some of your questions.
>
>> The reason I'm citing this here is that I'm interested in your opinion on
>> the following mid-term development questions and suggestions (discussed in
>> detail in the post):
>> -What is the best way to monitor a running Stanbol instance with
>> munin/nagios/icinga, etc.? How can I extract e.g. an enhancements/hour
>> statistic from Stanbol?
>
> Within Apache Stanbol the EnhancementJobManager collects the
> ExecutionMetadata [1]. They are stored in an own ContentPart of the
> processed ContentItem.
>
> So one possibility would be to add a feature to the EnhancementJobManager that
> allows logging this information (or even storing it in an RDF triple store).
>
> If we do that, this would allow very fine-grained analyses of requests
> processed by the Stanbol Enhancer.
>
>
> [1] http://stanbol.apache.org/docs/trunk/components/enhancer/executionmetadata.html

Looks good, thanks. I think at some time in the not-so-immediate
future I will develop a munin and nagios plugin for Stanbol based on
this.
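
For reference, the munin side of such a plugin is small. The sketch below
(Python) only formats the two kinds of output munin asks a plugin for:
the graph definition when called with the "config" argument, and the
current value otherwise. The field name "enhancements" and the function
name are hypothetical, and how the hourly count is obtained from Stanbol
is left open.

```python
def munin_output(argv, enhancements_last_hour):
    """Return the lines a munin plugin prints for `config` or a normal fetch."""
    if len(argv) > 1 and argv[1] == "config":
        # Graph definition: title, axis label and one labelled data field.
        return [
            "graph_title Stanbol enhancements",
            "graph_vlabel enhancements per hour",
            "enhancements.label enhancements",
        ]
    # Normal run: report the current value of the field.
    return [f"enhancements.value {enhancements_last_hour}"]
```

A plugin script would print these lines, one per line, using sys.argv and
a count read from wherever the ExecutionMetadata ends up being logged.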

>> -I think at some point we should create a standardized REST API through
>> which non-Java EEs could be accessed.
>
> I am not sure what such an interface should look like. I could think
> of an interface that POSTs the current metadata of the ContentItem
> to some URI. The results could again be RDF that is then added to the
> ContentItem. Maybe one could even allow the definition of some kind of
> Filter so that not all of the RDF metadata needs to be serialized.
>
> Non-Java EEs that also need the content (e.g. the text/plain Blob)
> would need a different kind of interface.

I'm sure that basically everyone wants the content, too. I can imagine
cases in which the non-Java EE only provides RDF metadata and consumes
nothing but the content.

> BTW: Serialization/Deserialization of ContentItems is already
> implemented (by using multipart mime).

Sounds good!

>> -Also, I think that if we had some standardized description XML or whatever
>> format that would tell what kind of output a certain EE produces, that
>> would be helpful.
>
> I would really like to have EnhancementEngines providing RDF
> descriptions of themselves when making a GET request to
>
>     http://{stanbol-instance}/enhancer/engine/{engine-name}
>
> if those descriptions would also include information about the
> consumed/produced elements that would be great.
>
> However this feature is much more important for UIMA than for Stanbol,
> because with Stanbol EnhancementEngines are expected to create
> Annotations that conform to the EnhancementStructure.

I totally support the self-description interface you propose, as the
conformity to the structure is really helpful but not everything. For
instance I had to experiment with Stanbol to figure out that LangId
will provide a "dc:language" property, and that there will be only one
of these, not multiple ones (e.g. one for every sentence). Another
example is that the UIMAToTriples in my current deployment adds an
sso:posTag property to every TextAnnotation. That might be helpful for
other EE developers, but they have to figure out the URI of the property
somehow - ok, it is in the documentation, but still...

Cheers
Mihály

> best
> Rupert
>
>
> --
> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen

Re: Lessons learnt from EAP+ questions about future directions

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Mihály

On Tue, Sep 25, 2012 at 9:07 PM, Mihály Héder <he...@gmail.com> wrote:
> Hi All,
>
> I have written a blog post about the lessons learnt from the EAP project I
> had been working on:
> http://blog.iks-project.eu/lessons-learnt-while-working-with-apache-stanbol/
>

Thanks for this blog post. It is really valuable feedback.
I will try to answer some of your questions.

> The reason I'm citing this here is that I'm interested in your opinion on
> the following mid-term development questions and suggestions (discussed in
> detail in the post):
> -What is the best way to monitor a running Stanbol instance with
> munin/nagios/icinga, etc.? How can I extract e.g. an enhancements/hour
> statistic from Stanbol?

Within Apache Stanbol the EnhancementJobManager collects the
ExecutionMetadata [1]. It is stored in a separate ContentPart of the
processed ContentItem.

So one possibility would be to add a feature to the EnhancementJobManager that
allows logging this information (or even storing it in an RDF triple store).

If we do that, this would allow very fine-grained analyses of requests
processed by the Stanbol Enhancer.


[1] http://stanbol.apache.org/docs/trunk/components/enhancer/executionmetadata.html
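
If such a log existed, the enhancements/hour statistic asked about above
would be a simple aggregation. The following is a minimal sketch in Python,
assuming (this is not an existing feature) that the log yields one
completion timestamp per processed request; the function name is
hypothetical.

```python
from collections import Counter
from datetime import datetime


def enhancements_per_hour(completed_at):
    """Count completed enhancement requests per clock hour."""
    # Truncate every timestamp to the start of its hour and count the buckets.
    buckets = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in completed_at
    )
    return dict(sorted(buckets.items()))
```

The result maps the start of each hour to the number of requests completed
in that hour, which is the shape a munin or nagios check would need.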

> -I think at some point we should create a standardized REST API through
> which non-Java EEs could be accessed.

I am not sure what such an interface should look like. I could think
of an interface that POSTs the current metadata of the ContentItem
to some URI. The results could again be RDF that is then added to the
ContentItem. Maybe one could even allow the definition of some kind of
Filter so that not all of the RDF metadata needs to be serialized.

Non-Java EEs that also need the content (e.g. the text/plain Blob)
would need a different kind of interface.
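
The filter-then-merge flow described above can be sketched as follows, in
Python. Everything here is an assumption for illustration: metadata is
modelled as (subject, predicate, object) tuples, the filter is a list of
predicates the remote engine is interested in, and the HTTP POST itself is
left out.

```python
def filter_metadata(triples, wanted_predicates):
    """Keep only the triples whose predicate the remote engine asked for."""
    wanted = set(wanted_predicates)
    return [t for t in triples if t[1] in wanted]


def merge_results(metadata, returned):
    """Add the remote engine's triples to the metadata, skipping duplicates."""
    seen = set(metadata)
    merged = list(metadata)
    for triple in returned:
        if triple not in seen:
            seen.add(triple)
            merged.append(triple)
    return merged
```

The output of filter_metadata would be serialized and sent to the remote
engine, and the parsed response would be passed to merge_results before
the next engine runs.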

BTW: Serialization/Deserialization of ContentItems is already
implemented (by using multipart mime).

> -Also, I think that if we had some standardized description XML or whatever
> format that would tell what kind of output a certain EE produces, that
> would be helpful.

I would really like EnhancementEngines to provide RDF
descriptions of themselves in response to a GET request to

    http://{stanbol-instance}/enhancer/engine/{engine-name}

If those descriptions also included information about the
consumed/produced elements, that would be great.

However this feature is much more important for UIMA than for Stanbol,
because with Stanbol EnhancementEngines are expected to create
Annotations that conform to the EnhancementStructure.

best
Rupert


-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen