Posted to dev@stanbol.apache.org by florent andré <fl...@4sengines.com> on 2012/01/17 00:52:08 UTC

Feedback about stanbol-414 specification

Hi Rupert, *

First, thanks a lot for this first draft definition.

I really like the idea of an RDF graph description of "enhancement 
chain" and "engine".

Here are my points:

°°°°°° Enterprise integration patterns (EIP) and Apache Camel °°°°°°

My main remark is that the draft does not build on a well-known and
well-defined approach: the enterprise integration patterns [1].
Behind this "big name", it is all about transferring messages between
"processing units".
Camel is a very generic framework that implements most of the EIPs [2],
where messages and processing units can be almost anything.
Applied to Stanbol, we can consider the ContentItem as the message and
the Engines as the processing units.
Cherry on the cake: Camel takes care not only of messages and processing
units but also of the machinery that orchestrates them (polling,
ordering, grouping, error management, ...), and provides pretty simple
ways to manage all this.

Enough of my Camel "sales pitch" :). Let me just say that I will
really try to commit the first version of a Camel enhancer this week.

By the way, as far as I know, Camel doesn't provide a graph-to-route
(route is Camel's term for chain) or route-to-graph utility... but there
are well-defined DSLs (Spring [3], Scala, ...), so this can be a lead.

°°°°°° Forward building of chain °°°°°°

In your proposal, the chain is built in a "forward" manner:
you know that A comes before B because B depends on A (property
ep:dependsOn).

I don't really like this way of defining a chain (though it may be
mostly my personal taste), for two main reasons:
- As a human, building, and even more so reading, understanding and
forming a mental picture of a chain built that way is pretty difficult,
and the difficulty increases with the complexity of the chain. Forward
processing is not a natural way of thinking about chains.
- A chain is about processing data, information, messages, and usually
information comes from one point and goes to another... so IMO
describing a chain is more about describing the path of the message
than the inner structure of the chain.
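
To make the point about dependency-based definitions concrete, here is a
minimal sketch in Python (not Stanbol code). The engine names are
hypothetical, and the dictionary only mimics "B depends on A" statements
such as the draft's ep:dependsOn property. It shows that the execution
order is never written down by the author of the chain; it has to be
computed from the dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical engines: each key maps to the set of engines it depends on,
# mirroring "B depends on A" statements (ep:dependsOn in the draft).
depends_on = {
    "langid": set(),
    "ner": {"langid"},
    "topic": {"langid"},
    "linking": {"ner"},
}


def execution_order(dependencies):
    """Return one linear order in which every engine follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(depends_on)
print(order)
```

Several orders can satisfy the same dependencies ("ner" and "topic" may
be swapped here), which is one reason such a definition is harder to
read than an explicit path.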

°°°°°° Missing features °°°°°°

There are IMO two main missing features in this definition:
1) No way to link chains to each other ("chain linking")
2) No way to select engines (or subchains) depending on a condition
("selector")

Let me illustrate these features with an example:

Imagine we have these 4 chains already defined:
- MusicChain: a chain with music-specific engines (thesaurus, ws, etc.)
- FoodChain: a chain with food-specific engines
- PizzaChain: the best chain for pizza
- otherStuffChain: a chain for the rest

So far so good, but now I have content and no idea what it is about...
I can submit it to all chains (not optimal), or to one random chain
(with the risk of putting a restaurant story in the MusicChain)...

So let's define a CategorisationChain.
This chain has, for example, the topic engine and a generic DBpedia
enhancer.
At the end of the chain we have a graph that gives a pretty good idea
of the content's nature.

Now, with the "chain linking" and "selector" features we can define an
"UltimateBigChain" like this:

from(input_file) --> categorisationChain
--> if (graph has "music") --> musicChain.
--> elseif (graph has "food") --> foodChain --> if (graph has
"pizza") --> pizzaChain.
--> otherwise() --> otherStuffChain.
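
The routing above can be sketched in a few lines of Python. This is only
an illustration of the "selector" and "chain linking" ideas, not Camel or
Stanbol code: the keyword-based categoriser and the use of a plain set of
topics as the "graph" are assumptions made for the example.

```python
def categorisation_chain(text):
    """Toy categoriser: the 'graph' is the set of known topics found in the text."""
    known_topics = {"music", "food", "pizza"}
    return known_topics & set(text.lower().split())


def ultimate_big_chain(text):
    """Return the names of the chains the content is routed through, in order."""
    graph = categorisation_chain(text)
    visited = ["categorisationChain"]
    if "music" in graph:                      # selector
        visited.append("musicChain")          # chain linking
    elif "food" in graph:
        visited.append("foodChain")
        if "pizza" in graph:                  # nested selector
            visited.append("pizzaChain")
    else:
        visited.append("otherStuffChain")
    return visited


print(ultimate_big_chain("good food and pizza in Naples"))
```

As in the pseudo-route, content that mentions "pizza" but not "food"
falls through to otherStuffChain, because the pizza test is nested
inside the food branch.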

My two cents...
++

[1] : http://www.enterpriseintegrationpatterns.com/toc.html
[2] : http://camel.apache.org/eip.html
[3] : http://camel.apache.org/schema/spring/camel-spring-2.9.0.xsd

Re: Feedback about stanbol-414 specification

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi

In my opinion EIP and enhancement chains are two different things on
two separate (architectural) layers. I think the example provided by
Florent shows this very nicely, as it clearly shows how users can solve
this kind of problem by combining the EIPs provided by Apache Camel
with the Enhancement Chains provided by the Stanbol Enhancer. In other
words, I see Apache Camel more as an alternative to the RESTful
interface of Apache Stanbol.

Florent, as soon as your Camel enhancer is available we should start
working on a "How to: Enterprise Level Information Extraction with
Apache Camel and Apache Stanbol" usage scenario. This would help not
only potential users but also us developers to better understand the
whole stack. WDYT?


Some more comments inline

On Tue, Jan 17, 2012 at 1:29 PM, Fabian Christ
<ch...@googlemail.com> wrote:
>
> IMO the main use case is:
>
> - Have different URIs for different chains
> - At each chain URI you can configure N engines in a user defined order
> (optional: allow chains to be nested within other chains)
>
> That's what I would start with and wait for user feedback if more
> complex scenarios come up on the mailing list.
>

Exactly. From the configuration perspective I still think that linear
enhancement chains will be the most used ones.
The chain types would be:

* "WeightedChain" allows users to just list the names of all the
engines they want to have in the chain. The ordering is calculated
automatically, similar to the current WeightedJobManager.
* "ListChain": we could add this Chain type. Here the user MUST
provide the list of engines in the exact order of execution.
* "GraphChain": this is intended for expert users who want to
optimize chain configurations (e.g. explicitly tell the
EnhancementJobManager which engines can be executed in parallel).
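
A rough Python sketch of how the three chain types differ in what the
user has to provide. Everything here is an assumption made for
illustration: the engine names, the numeric weights, and the function
names are hypothetical, and the actual ordering rule of the
WeightedJobManager is not described in this thread.

```python
from graphlib import TopologicalSorter

# Hypothetical ordering weights (lower runs first).
WEIGHTS = {"langid": 10, "ner": 20, "topic": 20, "linking": 30}


def weighted_chain(engine_names):
    """WeightedChain-like: the user only names engines; order comes from weights."""
    return sorted(engine_names, key=lambda name: (WEIGHTS[name], name))


def list_chain(engine_names):
    """ListChain-like: the order given by the user is the execution order."""
    return list(engine_names)


def graph_chain(depends_on):
    """GraphChain-like: batches of engines; one batch could run in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


print(weighted_chain(["linking", "topic", "ner", "langid"]))
print(graph_chain({"langid": set(), "ner": {"langid"},
                   "topic": {"langid"}, "linking": {"ner"}}))
```

Only the graph variant exposes which engines are independent of each
other, which is the information a job manager would need in order to
run them in parallel.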

Defining the Execution Plan in RDF has the advantage that it makes it
very easy to provide information about the execution within the
metadata of the enhanced content item. In case you have not seen it
yet: yesterday I added a new section "Execution Metadata" to the
specification describing how the EnhancementJobManager should encode
metadata about the enhancement process. This information is critical
if we want to use the "org.apache.stanbol.commons.jobs" API for the
async REST API as suggested by David in [1].


>>> 2012/1/17 florent andré <fl...@4sengines.com>:
>>> °°°°°° Missing features °°°°°°
>>>
>>> There is IMO two main missing features in this definition :
>>> 1) No way to link chains each others ("chain linking")

As also mentioned by Fabian, this might be useful and could be added to
Enhancement Chains at some point. It should also be relatively easy to
implement.

Regarding

>>> Now, with the "linking chain" and "selector" features we can define an
>>> "UltimateBigChain" like that :
>>>
>>> from(input_file) --> categorisationChain
>>> --> if (graph has "music") --> musicChain.
>>> --> elseif (graph has "food") --> foodChain --> if (graph has "pizza")-->
>>> pizzaChain.
>>> --> otherwise() --> otherStuffChain.

and

>> However that does not prevent us to expose Stanbol engines and chains
>> as Camel Endpoints [1] for people would like to benefit from the Camel
>> wide support for various messaging systems (i.e. as an ETL).
>>
>>  [1] https://svn.apache.org/repos/asf/camel/trunk/camel-core/src/main/java/org/apache/camel/Endpoint.java
>>

+1

The addition of Enhancement Chains shortens the definition of such
Camel workflows, because users no longer need to call single
EnhancementEngines but can use Chains instead. In addition, this makes
it possible to change the configuration of a chain without affecting
the workflow.

best
Rupert

-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Feedback about stanbol-414 specification

Posted by Fabian Christ <ch...@googlemail.com>.
Hi,

I had similar thoughts to Olivier. We have to find a balance between
a complex configuration and the needs of a "standard" Stanbol user.
For simple linear chains I don't see why we would want a user to learn
the Camel style. This should be straightforward to configure.

IMO the main use case is:

- Have different URIs for different chains
- At each chain URI you can configure N engines in a user defined order
(optional: allow chains to be nested within other chains)

That's what I would start with and wait for user feedback if more
complex scenarios come up on the mailing list.
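
As an illustration of this use case, here is a minimal Python sketch
under stated assumptions: the chain and engine names are hypothetical,
a chain is looked up by a name standing in for the last segment of its
URI, and the "chain:" prefix for nested chains is an invented
convention, not something defined in the specification.

```python
# Hypothetical registry: chain name -> entries in user-defined order.
# An entry starting with "chain:" refers to another chain (optional nesting).
CHAINS = {
    "default": ["langid", "ner", "linking"],
    "entities": ["ner", "linking"],
    "music": ["langid", "chain:entities", "music-thesaurus"],
}


def resolve(chain_name, _seen=()):
    """Flatten a chain into its ordered engine list, expanding nested chains."""
    if chain_name in _seen:
        raise ValueError("cyclic chain reference: " + chain_name)
    engines = []
    for entry in CHAINS[chain_name]:
        if entry.startswith("chain:"):
            nested = entry[len("chain:"):]
            engines.extend(resolve(nested, _seen + (chain_name,)))
        else:
            engines.append(entry)
    return engines


print(resolve("music"))
```

The configuration stays a plain ordered list per chain, and nesting
only adds one lookup step plus a cycle check.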

Best,
 - Fabian

On 17 January 2012 at 12:08, Olivier Grisel <ol...@ensta.org> wrote:
> I am not entirely sure this use case is worth the configuration
> complexity that will be induced and also I am not sure the Enhancer
> jobmanager should handle this kind of semantic reasoning at its level.
> What would not the engines them self be able to handle that directly?
> First engine could be a topic extractor and then the following engines
> in the chain only process the content items is they found the
> previously extracted metadata suiting their own configuration an
> behaviors.
>
> Debugging chain routing issues from a REST client developer who has no
> idea on how to debug java code will be hard. I prefer to have explicit
> linear chain configurations with the explicit order list of engine ids
> in a direct OSGi configuration.
>
> However that does not prevent us to expose Stanbol engines and chains
> as Camel Endpoints [1] for people would like to benefit from the Camel
> wide support for various messaging systems (i.e. as an ETL).
>
>  [1] https://svn.apache.org/repos/asf/camel/trunk/camel-core/src/main/java/org/apache/camel/Endpoint.java
>
> However I don't think the default the deployment of Stanbol Enhancer
> should force the administrator to understand the generic concept model
> and configuration format of Apache Camel just to chain three Stanbol
> engines by ids.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel



-- 
Fabian
http://twitter.com/fctwitt

Re: Feedback about stanbol-414 specification

Posted by Olivier Grisel <ol...@ensta.org>.
2012/1/17 florent andré <fl...@4sengines.com>:
> Now, with the "linking chain" and "selector" features we can define an
> "UltimateBigChain" like that :
>
> from(input_file) --> categorisationChain
> --> if (graph has "music") --> musicChain.
> --> elseif (graph has "food") --> foodChain --> if (graph has "pizza")-->
> pizzaChain.
> --> otherwise() --> otherStuffChain.

I am not entirely sure this use case is worth the configuration
complexity it would induce, and I am also not sure the Enhancer
jobmanager should handle this kind of semantic reasoning at its level.
Why would the engines themselves not be able to handle that directly?
The first engine could be a topic extractor, and then the following
engines in the chain would only process the content item if they found
that the previously extracted metadata suited their own configuration
and behavior.
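
A minimal Python sketch of this alternative, with hypothetical engine
names and a plain dictionary standing in for the content item and its
metadata (none of this is the Stanbol API). The chain stays strictly
linear, and each engine decides for itself whether the metadata left by
earlier engines concerns it.

```python
def topic_engine(item):
    """First engine: toy topic extractor that records topics in the metadata."""
    words = set(item["text"].lower().split())
    item["metadata"]["topics"] = sorted(words & {"music", "food"})


def make_domain_engine(name, required_topic):
    """Build an engine that only acts if an earlier engine found its topic."""
    def engine(item):
        if required_topic in item["metadata"].get("topics", []):
            item["metadata"].setdefault("enhanced_by", []).append(name)
    return engine


# Explicit, ordered list of engines: no routing logic in the chain itself.
LINEAR_CHAIN = [
    topic_engine,
    make_domain_engine("music-engine", "music"),
    make_domain_engine("food-engine", "food"),
]


def run_chain(text):
    """Pass one content item through every engine, in order."""
    item = {"text": text, "metadata": {}}
    for engine in LINEAR_CHAIN:
        engine(item)
    return item["metadata"]


print(run_chain("an evening of music"))
```

The trade-off is that every engine is still invoked for every content
item, but the chain configuration remains a simple ordered list.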

Debugging chain routing issues will be hard for a REST client developer
who has no idea how to debug Java code. I prefer to have explicit
linear chain configurations with an explicit ordered list of engine ids
in a direct OSGi configuration.

However, that does not prevent us from exposing Stanbol engines and
chains as Camel Endpoints [1] for people who would like to benefit from
Camel's wide support for various messaging systems (i.e. as an ETL).

 [1] https://svn.apache.org/repos/asf/camel/trunk/camel-core/src/main/java/org/apache/camel/Endpoint.java

However, I don't think the default deployment of the Stanbol Enhancer
should force the administrator to understand the generic concept model
and configuration format of Apache Camel just to chain three Stanbol
engines by id.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

RE : Feedback about stanbol-414 specification

Posted by Tanneguy DULONG <ta...@arisem.com>.
Hi all,

@Florent
FYI, Talend did a pretty good job of packaging Camel within their Talend Open Studio (TOS) ESB.
TOS is open source and provides a documented Eclipse RCP graphical design tool for routes (among other nice features).

I have not tested the Camel integration, but found the ETL part to have a high level of polish (better than some commercial software, actually).
If, as I surmise, some chain prototyping could be done from there, it could save you a lot of time.

Hope it helps

Tanneguy




________________________________________
From: florent andré [florent.andre-dev@4sengines.com]
Sent: Tuesday, 17 January 2012 00:52
To: stanbol-dev@incubator.apache.org
Subject: Feedback about stanbol-414 specification