Posted to dev@stanbol.apache.org by David Riccitelli <da...@insideout.io> on 2013/01/14 10:58:57 UTC

Stanbol front-end API

Hello,

I would like to introduce one more contribution for Apache Stanbol.

It is not an engine, but an HTTP API for Stanbol which pre-processes and
submits analysis tasks, and returns the result synchronously to the
consumer. It aims to simplify integration work and to provide a
powerful pre-processing API for the analysis of URLs.

It uses the *Readability* library in order to support URL
submissions:
 - loading contents from remote URLs and
 - stripping all the surrounding noise from them.

Readability is the same library behind the *Reader* function of Safari that
many users know already.

To summarize:

   - extremely simple APIs to ease prototyping, integration and usage
   - support for textual contents
   - support for URLs
   - *for URLs, preprocessing of HTML pages to capture the actual URL
   content while skipping noise such as ads, menus and so forth*
   - synchronous access (for asynchronous access see idntik.it)

You can find more information and the source code here:
https://github.com/insideout10/stanbol-facade

Shall I open a JIRA to discuss a possible integration in the trunk?

BR,
David Riccitelli

-- check the Swagger for WordLift <http://bit.ly/VtoM5H>
********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************

Re: Stanbol front-end API

Posted by Fabian Christ <ch...@googlemail.com>.
2013/1/14 David Riccitelli <da...@insideout.io>

> > 3) The accepted media-type is also defined in the JSON file for the
> > request
>
> The HTTP *Accept* header is currently ignored. Indeed it would be probably
> more correct to eliminate the *mimeType* property and rely solely on
> the *Accept* header.


Yes, that was also my first thought when thinking RESTful.


-- 
Fabian
http://twitter.com/fctwitt

Re: Stanbol front-end API

Posted by Fabian Christ <ch...@googlemail.com>.
2013/1/15 David Riccitelli <da...@insideout.io>

> Task Request-based API, features:
>  1. a new end-point that can be added in /enhancer/task
>  2. the end-point takes a Task Request (interface to be defined)
>  3. the Task Request will allow to post:
>       a) content or URL submission
>       b) per-call engine parameters
>       c) per-call EnhancementChain definitions
>

Could we also use this task API to submit already available metadata to
Stanbol along with the to be enhanced content?


>  4. it supports synchronous operations
>  5. eventually it can support asynchronous operations with a callback URL
> (this point is to review as probably a proxy/gateway is more appropriate)
>

I am interested in this callback feature. Maybe we should discuss it
separately.

-- 
Fabian
http://twitter.com/fctwitt

Re: Stanbol front-end API

Posted by David Riccitelli <da...@insideout.io>.
All right, I had a quick chat with Rupert as I needed to understand some
more things.

We'll split the two goals:
 1. provide a Task Request-based API in the enhancer scope
 2. provide a Text Extraction feature as a Preprocessing Engine.

Task Request-based API, features:
 1. a new end-point that can be added in /enhancer/task
 2. the end-point takes a Task Request (interface to be defined)
 3. the Task Request will allow posting:
      a) content or URL submission
      b) per-call engine parameters
      c) per-call EnhancementChain definitions
 4. it supports synchronous operations
 5. eventually it can support asynchronous operations with a callback URL
(this point needs review, as a proxy/gateway is probably more appropriate)

In order to implement the above the JIRA STANBOL-488 [1] must be taken into
consideration.

Text Extraction features:
 1. currently Readability
 2. it might be interesting to try out Boilerpipe and Goose to compare
their performance and quality
 3. to implement, just use the ContentReference to create the ContentItem.
It will load the data automatically.

[1] https://issues.apache.org/jira/browse/STANBOL-488

BR
David


On Mon, Jan 14, 2013 at 8:22 PM, Rupert Westenthaler <
rupert.westenthaler@gmail.com> wrote:

> Hi
>
> On Mon, Jan 14, 2013 at 2:21 PM, David Riccitelli <da...@insideout.io>
> wrote:
> >> Are you now asking how this could be made available via the /api/tasks
> API
> >> you proposed?
> >
> > Yep, should it be then restricted to that specific URL [1] or would the
> > engine be allowed to create an additional end-point at /api/tasks?
> >
>
> I would not add another RESTful service to an EnhancementEngine,
> because this goes against modularity, forcing users to have both - the
> engine and the RESTful API.
> I would rather have a service that does the TextExtraction and then
> implement an EnhancementEngine and a RESTful service based on that.
>
> best
> Rupert
>
> > BR,
> > David
> >
> > [1] http://{host}:{port}/{stanbol-root}/enhancer/engine/{engine-name}
> >
> >
> > On Mon, Jan 14, 2013 at 2:21 PM, Fabian Christ <
> christ.fabian@googlemail.com
> >> wrote:
> >
> >> 2013/1/14 David Riccitelli <da...@insideout.io>
> >>
> >> > About point a) I have a question. As the API allow for selection of
> the
> >> > Enhancement Chain, how would that work if we move the API in an
> engine.
> >> The
> >> > engine can be executed outside of the scope of an enhancement chain?
> >> >
> >>
> >> You can call single engine [1] by using
> >>
> >> http://{host}:{port}/{stanbol-root}/enhancer/engine/{engine-name}
> >>
> >> Are you now asking how this could be made available via the /api/tasks
> API
> >> you proposed?
> >>
> >> [1]
> >>
> http://stanbol.apache.org/docs/trunk/components/enhancer/engines/index.html
> >>
> >>
> >> > Shall we move this thread on a JIRA thread?
> >> >
> >>
> >> No, discussing this here is totally fine. We should create a JIRA that
> >> describes what to do/implement after the discussion and we have some
> >> consensus.
> >>
> >> Best,
> >>  - Fabian
> >>
> >>
> >> --
> >> Fabian
> >> http://twitter.com/fctwitt
> >>
> >
> >
> >
> > --
> > David Riccitelli
> >
>
>
>
> --
> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>



-- 
David Riccitelli


Re: Stanbol front-end API

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi

On Mon, Jan 14, 2013 at 2:21 PM, David Riccitelli <da...@insideout.io> wrote:
>> Are you now asking how this could be made available via the /api/tasks API
>> you proposed?
>
> Yep, should it be then restricted to that specific URL [1] or would the
> engine be allowed to create an additional end-point at /api/tasks?
>

I would not add another RESTful service to an EnhancementEngine,
because this goes against modularity, forcing users to have both - the
engine and the RESTful API.
I would rather have a service that does the TextExtraction and then
implement an EnhancementEngine and a RESTful service based on that.
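
The layering described above could look roughly like the following Python
sketch. All names (TextExtractionService, ExtractionEngine,
handle_rest_request) are hypothetical, and the tag-stripping code is only a
stand-in for a real extraction library; Stanbol's actual Java/OSGi
interfaces are not reproduced here.

```python
import re


class TextExtractionService:
    """Shared service: the only component that knows how to extract text."""

    def extract(self, html: str) -> str:
        # Stand-in for a real extractor: drop tags, then collapse whitespace.
        text = re.sub(r"<[^>]+>", " ", html)
        return " ".join(text.split())


class ExtractionEngine:
    """Engine adapter: usable inside a chain, with no REST dependency."""

    def __init__(self, service: TextExtractionService) -> None:
        self._service = service

    def compute_enhancements(self, content_item: dict) -> None:
        # Hypothetical content item shape: a dict with an "html" entry.
        content_item["text"] = self._service.extract(content_item["html"])


def handle_rest_request(service: TextExtractionService, body: str) -> dict:
    """REST adapter: usable without installing the engine."""
    return {"text": service.extract(body)}
```

Both adapters depend only on the service, so either one can be deployed
without the other.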

best
Rupert

> BR,
> David
>
> [1] http://{host}:{port}/{stanbol-root}/enhancer/engine/{engine-name}
>
>
> On Mon, Jan 14, 2013 at 2:21 PM, Fabian Christ <christ.fabian@googlemail.com
>> wrote:
>
>> 2013/1/14 David Riccitelli <da...@insideout.io>
>>
>> > About point a) I have a question. As the API allow for selection of the
>> > Enhancement Chain, how would that work if we move the API in an engine.
>> The
>> > engine can be executed outside of the scope of an enhancement chain?
>> >
>>
>> You can call single engine [1] by using
>>
>> http://{host}:{port}/{stanbol-root}/enhancer/engine/{engine-name}
>>
>> Are you now asking how this could be made available via the /api/tasks API
>> you proposed?
>>
>> [1]
>> http://stanbol.apache.org/docs/trunk/components/enhancer/engines/index.html
>>
>>
>> > Shall we move this thread on a JIRA thread?
>> >
>>
>> No, discussing this here is totally fine. We should create a JIRA that
>> describes what to do/implement after the discussion and we have some
>> consensus.
>>
>> Best,
>>  - Fabian
>>
>>
>> --
>> Fabian
>> http://twitter.com/fctwitt
>>
>
>
>
> --
> David Riccitelli
>



--
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Stanbol front-end API

Posted by David Riccitelli <da...@insideout.io>.
> Are you now asking how this could be made available via the /api/tasks API
> you proposed?

Yep, should it be then restricted to that specific URL [1] or would the
engine be allowed to create an additional end-point at /api/tasks?

BR,
David

[1] http://{host}:{port}/{stanbol-root}/enhancer/engine/{engine-name}


On Mon, Jan 14, 2013 at 2:21 PM, Fabian Christ <christ.fabian@googlemail.com
> wrote:

> 2013/1/14 David Riccitelli <da...@insideout.io>
>
> > About point a) I have a question. As the API allow for selection of the
> > Enhancement Chain, how would that work if we move the API in an engine.
> The
> > engine can be executed outside of the scope of an enhancement chain?
> >
>
> You can call single engine [1] by using
>
> http://{host}:{port}/{stanbol-root}/enhancer/engine/{engine-name}
>
> Are you now asking how this could be made available via the /api/tasks API
> you proposed?
>
> [1]
> http://stanbol.apache.org/docs/trunk/components/enhancer/engines/index.html
>
>
> > Shall we move this thread on a JIRA thread?
> >
>
> No, discussing this here is totally fine. We should create a JIRA that
> describes what to do/implement after the discussion and we have some
> consensus.
>
> Best,
>  - Fabian
>
>
> --
> Fabian
> http://twitter.com/fctwitt
>



-- 
David Riccitelli


Re: Stanbol front-end API

Posted by Fabian Christ <ch...@googlemail.com>.
2013/1/14 David Riccitelli <da...@insideout.io>

> About point a) I have a question. As the API allow for selection of the
> Enhancement Chain, how would that work if we move the API in an engine. The
> engine can be executed outside of the scope of an enhancement chain?
>

You can call single engine [1] by using

http://{host}:{port}/{stanbol-root}/enhancer/engine/{engine-name}

Are you now asking how this could be made available via the /api/tasks API
you proposed?

[1]
http://stanbol.apache.org/docs/trunk/components/enhancer/engines/index.html


> Shall we move this thread on a JIRA thread?
>

No, discussing this here is totally fine. We should create a JIRA that
describes what to do/implement after the discussion and we have some
consensus.

Best,
 - Fabian


-- 
Fabian
http://twitter.com/fctwitt

Re: Stanbol front-end API

Posted by David Riccitelli <da...@insideout.io>.
Thanks for the valuable feedback.

To summarize:
 a) unless there's a specific use case, the API could be implemented inside
a pre-processing engine (Fabian).
 b) other tools exist to extract content from HTML, such as Boilerpipe and
Goose (Goose was initially based on Readability). It could be worth
trying out these tools as well, to understand which one is best and
eventually allow the consumer to choose the most suitable tool for
the requested analysis (Andrea).

About point a) I have a question. As the API allows selection of the
Enhancement Chain, how would that work if we move the API into an engine?
Can the engine be executed outside of the scope of an enhancement chain?

Shall we move this thread to JIRA?

Thanks,
David


On Mon, Jan 14, 2013 at 12:53 PM, Andrea Di Menna <ni...@gmail.com> wrote:

> Hi David,
>
> what is the performance of Readability compared with other text extraction
> tools like Boilerpipe [1] or Goose [2]?
>
> I think it would be interesting to extend your approach to configurable
> text extraction engines.
> As Fabian is suggesting, to create specialized enhancement engines which
> extract text from HTML contents and feed them to other enhancement engines.
>
> Regards,
> Andrea
>
> [1] http://code.google.com/p/boilerpipe/
> [2] https://github.com/jiminoc/goose/wiki
>
> 2013/1/14 Fabian Christ <ch...@googlemail.com>
>
> > 2013/1/14 David Riccitelli <da...@insideout.io>
> >
> > > > 1) You introduce an new endpoint http://localhost:8080/api/tasks
> > >
> > > Correct. Ideally the API front-end focuses more on developer adoption
> so
> > to
> > > provide APIs that ease integration.
> >
> >
> > At this point I am not sure how we want to add such an API layer. Or even
> > if we want such a thing at all. It may be confusing for people.
> >
> > Why not add your service as an enhancement engine that can be configured
> as
> > the first engine in an enhancement chain. That would be the natural way
> of
> > doing it with the stuff we have right now.
> >
> > What is the benefit of having another REST API facade? Are there more use
> > cases for it?
> >
> > Thanks,
> >  - Fabian
> > --
> > Fabian
> > http://twitter.com/fctwitt
> >
>



-- 
David Riccitelli


Re: Stanbol front-end API

Posted by Andrea Di Menna <ni...@gmail.com>.
Hi David,

What is the performance of Readability compared with other text extraction
tools like Boilerpipe [1] or Goose [2]?

I think it would be interesting to extend your approach to configurable
text extraction engines.
As Fabian suggests, one could create specialized enhancement engines which
extract text from HTML content and feed it to other enhancement engines.

Regards,
Andrea

[1] http://code.google.com/p/boilerpipe/
[2] https://github.com/jiminoc/goose/wiki

2013/1/14 Fabian Christ <ch...@googlemail.com>

> 2013/1/14 David Riccitelli <da...@insideout.io>
>
> > > 1) You introduce an new endpoint http://localhost:8080/api/tasks
> >
> > Correct. Ideally the API front-end focuses more on developer adoption so
> to
> > provide APIs that ease integration.
>
>
> At this point I am not sure how we want to add such an API layer. Or even
> if we want such a thing at all. It may be confusing for people.
>
> Why not add your service as an enhancement engine that can be configured as
> the first engine in an enhancement chain. That would be the natural way of
> doing it with the stuff we have right now.
>
> What is the benefit of having another REST API facade? Are there more use
> cases for it?
>
> Thanks,
>  - Fabian
> --
> Fabian
> http://twitter.com/fctwitt
>

Re: Stanbol front-end API

Posted by Fabian Christ <ch...@googlemail.com>.
2013/1/14 David Riccitelli <da...@insideout.io>

> > 1) You introduce an new endpoint http://localhost:8080/api/tasks
>
> Correct. Ideally the API front-end focuses more on developer adoption so to
> provide APIs that ease integration.


At this point I am not sure how we want to add such an API layer. Or even
if we want such a thing at all. It may be confusing for people.

Why not add your service as an enhancement engine that can be configured as
the first engine in an enhancement chain? That would be the natural way of
doing it with the stuff we have right now.

What is the benefit of having another REST API facade? Are there more use
cases for it?

Thanks,
 - Fabian
-- 
Fabian
http://twitter.com/fctwitt

Re: Stanbol front-end API

Posted by David Riccitelli <da...@insideout.io>.
Thanks Fabian.

Here are the answers:

> 1) You introduce an new endpoint http://localhost:8080/api/tasks

Correct. Ideally the API front-end focuses on developer adoption,
providing APIs that ease integration.

> 2) The endpoint consumes JSON that either has the HTML content or a URL
> pointing to HTML content

When the consumer posts:
 - a 'content', it is sent straight to the enhancement chain.
 - a URL, it is parsed with Readability and the output content is then sent
to the enhancement chain. Note that if Readability detects that the URL
points to an article split across multiple pages (and therefore multiple
URLs), it will load the content from all the related URLs.
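
As a rough illustration of that dispatch, here is a minimal Python sketch.
The function name resolve_content and the 'url' property name are
assumptions for the example (only 'content' is named above), and the
extractor is passed in as a callable so that no Readability API has to be
guessed.

```python
from typing import Callable


def resolve_content(task: dict, extract_from_url: Callable[[str], str]) -> str:
    """Return the text that would be handed to the enhancement chain."""
    if "content" in task:
        # Posted content bypasses extraction entirely.
        return task["content"]
    if "url" in task:
        # URL submissions go through the extractor first.
        return extract_from_url(task["url"])
    raise ValueError("task must carry either 'content' or 'url'")
```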

> 3) The accepted media-type is also defined in the JSON file for the
> request

The HTTP *Accept* header is currently ignored. Indeed, it would probably be
more correct to eliminate the *mimeType* property and rely solely on
the *Accept* header.

> 4) Using readability the HTML is cleaned and then some enhancement chain
> is triggered. Which chain is used here?

The default chain is used unless the consumer specifies which chain to use
by setting the chainName property in the JSON payload [1].
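
In Python terms the fallback is roughly the following. The function name
effective_chain and the literal chain name "default" are assumptions made
for this sketch; only the chainName property comes from the payload
described above.

```python
def effective_chain(payload: dict, default_chain: str = "default") -> str:
    """Pick the chain named in the payload, else fall back to the default."""
    # An absent or empty chainName both mean "use the default chain".
    return payload.get("chainName") or default_chain
```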

> 5) The usual enhancement RDF is returned to the user

Correct.

BR,
David

[1] ln 95:
https://github.com/insideout10/stanbol-facade/blob/master/stanbol-facade-api/src/main/java/io/insideout/stanbol/facade/services/TaskService.java




On Mon, Jan 14, 2013 at 12:21 PM, Fabian Christ <
christ.fabian@googlemail.com> wrote:

> Hi David,
>
> nice idea. First let me summarize what this contribution is about to see if
> I understood it correctly.
>
> 1) You introduce an new endpoint http://localhost:8080/api/tasks
> 2) The endpoint consumes JSON that either has the HTML content or a URL
> pointing to HTML content
> 3) The accepted media-type is also defined in the JSON file for the request
> 4) Using readability the HTML is cleaned and then some enhancement chain is
> triggered. Which chain is used here?
> 5) The usual enhancement RDF is returned to the user
>
> Is this what it does?
>
> Thanks,
>   - Fabian
>
>
> 2013/1/14 David Riccitelli <da...@insideout.io>
>
> > Hello,
> >
> > I would like to introduce one more contribution for Apache Stanbol.
> >
> > It is not an engine, but an HTTP API for Stanbol which pre-processes and
> > submits analysis tasks, and returns the result synchronously to the
> > consumer. It aims to simplify development integrations and to provide a
> > powerful pre-processing API for analysis of URLs.
> >
> > It implements the *Readability* library, in order to support URL
> > submissions:
> >  - loading contents from remote URLs and
> >  - cleaning them up of all the surrounding noise.
> >
> > Readability is the same library behind the *Reader* function of Safari
> that
> > many users know already.
> >
> > To summarize:
> >
> >    - extremely simple APIs to ease prototyping, integration and usage
> >    - support for textual contents
> >    - support for URLs
> >    - *for URLs, preprocessing of HTML pages to capture the actual URL
> >    content while skipping noise such as ads, menus and so forth*
> >    - synchronous access (for asynchronous access see idntik.it)
> >
> > You can find more information and the source code here:
> > https://github.com/insideout10/stanbol-facade
> >
> > Shall I open a JIRA to discuss a possible integration in the trunk?
> >
> > BR,
> > David Riccitelli
> >
> >
>
>
>
> --
> Fabian
> http://twitter.com/fctwitt
>



-- 
David Riccitelli


Re: Stanbol front-end API

Posted by Fabian Christ <ch...@googlemail.com>.
Hi David,

nice idea. First let me summarize what this contribution is about to see if
I understood it correctly.

1) You introduce a new endpoint http://localhost:8080/api/tasks
2) The endpoint consumes JSON that either has the HTML content or a URL
pointing to HTML content
3) The accepted media-type is also defined in the JSON file for the request
4) Using readability the HTML is cleaned and then some enhancement chain is
triggered. Which chain is used here?
5) The usual enhancement RDF is returned to the user

Is this what it does?

Thanks,
  - Fabian


2013/1/14 David Riccitelli <da...@insideout.io>

> Hello,
>
> I would like to introduce one more contribution for Apache Stanbol.
>
> It is not an engine, but an HTTP API for Stanbol which pre-processes and
> submits analysis tasks, and returns the result synchronously to the
> consumer. It aims to simplify development integrations and to provide a
> powerful pre-processing API for analysis of URLs.
>
> It implements the *Readability* library, in order to support URL
> submissions:
>  - loading contents from remote URLs and
>  - cleaning them up of all the surrounding noise.
>
> Readability is the same library behind the *Reader* function of Safari that
> many users know already.
>
> To summarize:
>
>    - extremely simple APIs to ease prototyping, integration and usage
>    - support for textual contents
>    - support for URLs
>    - *for URLs, preprocessing of HTML pages to capture the actual URL
>    content while skipping noise such as ads, menus and so forth*
>    - synchronous access (for asynchronous access see idntik.it)
>
> You can find more information and the source code here:
> https://github.com/insideout10/stanbol-facade
>
> Shall I open a JIRA to discuss a possible integration in the trunk?
>
> BR,
> David Riccitelli
>
>



-- 
Fabian
http://twitter.com/fctwitt

Re: Stanbol front-end API

Posted by Rupert Westenthaler <ru...@gmail.com>.
On Tue, Jan 15, 2013 at 11:40 AM, David Riccitelli <da...@insideout.io> wrote:
>> 1. Simple enhancement of textual content
>
> The POST request is very similar to the Enhancer APIs [1], although the
> return data would be different. You're proposing to define an output
> different from the Enhancement Structure [2], like an XML/JSON format
> indexed by Entity, correct? e.g.
>
> {
>  language: "en",
>  entities: [{
>   "<about>": {
>     type: "{Person|Organization|Product|...}",
>     confidence: 1.0
>   }
>  },{
>    ...
>  }]
> }

To implement this one needs 2 things:

1. extract the interesting information from the Enhancement-Metadata.
This should be done in a PostProcessing Engine.
2. the serialization: For this part I would not give up on RDF, but
rather define a nice JSON-LD context [1] that produces JSON as shown in
the example above.
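
A minimal Python sketch of both steps, under stated assumptions: the input
shape (dicts with "entity", "type" and "confidence"), the function name
summarize and the example.org vocabulary are all hypothetical, and a real
context would map terms to the Stanbol enhancement vocabulary instead. The
output lists entities with an "about" key rather than indexing by it, which
differs slightly from the example above.

```python
# Hypothetical JSON-LD context: "about" and "type" alias the @id and @type
# keywords, and every other term resolves against a placeholder vocabulary.
CONTEXT = {
    "@vocab": "http://example.org/vocab#",
    "about": "@id",
    "type": "@type",
}


def summarize(enhancements: list, language: str) -> dict:
    """Step 1: keep the best-scoring annotation per entity.
    Step 2: emit plain JSON that is still RDF through the context."""
    best = {}
    for e in enhancements:
        current = best.get(e["entity"])
        if current is None or e["confidence"] > current["confidence"]:
            best[e["entity"]] = e
    return {
        "@context": CONTEXT,
        "language": language,
        "entities": [
            {"about": uri, "type": e["type"], "confidence": e["confidence"]}
            for uri, e in sorted(best.items())
        ],
    }
```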

For this I think we should start with

* typical use cases (e.g. tag suggestion (with user interaction), auto
tagging , inline text annotation (like with annotate.js),  ...)
* specify annotations suitable for such scenarios
* implement (1) and (2) for those scenarios.


Regarding the "enhancer/task" API:

I see this only as a different RESTful service to access the Enhancer
Service. For some use cases the current API is more efficient while
for others the  enhancer/task API  has more appeal. If we provide both
options the users will decide in the end.

In any case, for EnhancementRequest-specific parameters we need to
change/extend some APIs in the Enhancer. This was already discussed on
the list [2]. There was even a decision on how to do it, and work will
start after the next Enhancer release (which will happen within a week
or so). After these changes there will be an EnhancementJob class that
can be created based on the request by the JAX-RS resource.


Regarding "enhancement pipeline"

> I'm saying pipeline and not enhancement chain as this goes a bit
> further, the pipeline can include selection/configuration of the DCE,
> selection/configuration of the renderer used for the enhancement graph
> etc., probably using a mini flow language to allow parts of the
> pipeline to depend on previous results (similar to the
> https://gist.github.com/2931050 idea).

Especially with all the new NLP processing related EnhancementEngines
added after STANBOL-733 this would for sure be a very welcome
extension.

best
Rupert

[1] http://json-ld.org/spec/latest/json-ld-syntax/#the-context
[2] http://markmail.org/message/ylzv4iipa5t3g5qs

--
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Stanbol front-end API

Posted by Bertrand Delacretaz <bd...@apache.org>.
Hi David,

On Tue, Jan 15, 2013 at 11:40 AM, David Riccitelli <da...@insideout.io> wrote:
> ...I think what you're describing has the same aim, has many points in common
> and is giving even more value to it...

Cool

>
>> 1. Simple enhancement of textual content
> ...You're proposing to define an output
> different from the Enhancement Structure [2], like an XML/JSON format
> indexed by Entity, correct? e.g....

Something like that...but thinking about it this is orthogonal with
what we're discussing now, so for now maybe just say "output format is
selectable, with a default format that helps newbies find their way
around Stanbol".

>> 3. Enhancement of remote content
...
> {
>  url: "http://server/path/doc.ext",  -- or -- content: "actual content",
>  mimeType: "content/mime-type",
>  parameters: {
>    "engine-a-param-1": "value-1",
>    "engine-b-param-2": "value-2",
>    "engine-c-param-n": "value-3"
>  }
> }...

ok, and in the case of URLs I'd see the mime-type as an optional
fallback - the GET response should provide it.

> ...The DCE could retrieve the content directly. In the case of Readability it
> is required for it to be able to access contents that are spread on
> multiple pages....

Ok, works for me!

-Bertrand

Re: Stanbol front-end API

Posted by David Riccitelli <da...@insideout.io>.
Hello Bertrand,

I think what you're describing has the same aim, has many points in common
and is giving even more value to it.
Therefore it should be taken into consideration when further defining the
scope of the activity.

> 1. Simple enhancement of textual content

The POST request is very similar to the Enhancer APIs [1], although the
return data would be different. You're proposing to define an output
different from the Enhancement Structure [2], like an XML/JSON format
indexed by Entity, correct? e.g.

{
 language: "en",
 entities: [{
  "<about>": {
    type: "{Person|Organization|Product|...}",
    confidence: 1.0
  }
 },{
   ...
 }]
}

We could also try to mimic the output formats of similar APIs, to make it
easy to switch from one system to another.

> 2. Enhancement of binary content

Agree.

> 3. Enhancement of remote content

I think this matches the proposal for the Task Request JSON. We would also
allow adding some per-call analysis settings here. Maybe something like
(similar to what has been implemented so far):

{
 url: "http://server/path/doc.ext",  -- or -- content: "actual content",
 mimeType: "content/mime-type",
 parameters: {
   "engine-a-param-1": "value-1",
   "engine-b-param-2": "value-2",
   "engine-c-param-n": "value-3"
 }
}
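
To make the shape concrete, here is a small Python sketch that parses such
a request. It assumes strict JSON (quoted keys, unlike the shorthand above)
and assumes that exactly one of url/content must be present; TaskRequest
and parse_task_request are hypothetical names, not part of the facade.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TaskRequest:
    url: Optional[str] = None
    content: Optional[str] = None
    mime_type: Optional[str] = None
    parameters: Dict[str, str] = field(default_factory=dict)


def parse_task_request(body: str) -> TaskRequest:
    """Parse a Task Request body and check the url/content alternative."""
    data = json.loads(body)
    if ("url" in data) == ("content" in data):
        raise ValueError("exactly one of 'url' or 'content' is required")
    return TaskRequest(
        url=data.get("url"),
        content=data.get("content"),
        mime_type=data.get("mimeType"),
        # Per-call engine parameters are passed through untouched.
        parameters=dict(data.get("parameters", {})),
    )
```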

> Same as 2. but the posted (json?) document contains URLs of content
> that Stanbol first retrieves

The DCE could retrieve the content directly. In the case of Readability
this is required, so that it can access content that is spread over
multiple pages.


BR,
David

[1]
http://stanbol.apache.org/docs/trunk/components/enhancer/#RESTful_API
[2]
http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html




On Tue, Jan 15, 2013 at 12:17 PM, Bertrand Delacretaz <
bdelacretaz@apache.org> wrote:

> On Tue, Jan 15, 2013 at 11:02 AM, Bertrand Delacretaz
> <bd...@apache.org> wrote:
> ...
> > 4. Requests including enhancement pipeline definitions ("stateless
> Stanbol")...
> > Using a multipart POST in the previous use cases, one part can be a
> > pipeline definition...
>
> We could also use a part to supply initial metadata about the content
> being submitted.
>
> -Bertrand
>



-- 
David Riccitelli


Re: Stanbol front-end API

Posted by Bertrand Delacretaz <bd...@apache.org>.
On Tue, Jan 15, 2013 at 11:02 AM, Bertrand Delacretaz
<bd...@apache.org> wrote:
...
> 4. Requests including enhancement pipeline definitions ("stateless Stanbol")...
> Using a multipart POST in the previous use cases, one part can be a
> pipeline definition...

We could also use a part to supply initial metadata about the content
being submitted.

-Bertrand

Re: Stanbol front-end API

Posted by Bertrand Delacretaz <bd...@apache.org>.
Hi,

On Mon, Jan 14, 2013 at 10:58 AM, David Riccitelli <da...@insideout.io> wrote:
> ...You can find more information and the source code here:
> https://github.com/insideout10/stanbol-facade ...

Interesting - I think that matches my recent thoughts about Stanbol as
a (mostly) stateless content enhancement service - let me try to
describe my use cases for that, to see how much our ideas overlap.

I don't want to derail your efforts, what I describe here might have a
larger scope and I don't have code to back it so far, so feel free to
go ahead with your proposal...but maybe this helps refine the idea or
design it in an extensible way.

1. Simple enhancement of textual content
Client either POSTS a text/plain document (that's the default mime
type), or does a GET with the content in a request parameter.

Stanbol uses a default enhancement pipeline (more on that below) and
returns enhancements in a simple default format (ideally a human
readable format that doesn't scare "semantic newbies"). The client can
request other output formats with the Accept header or by adding an
extension to the request URL.
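
A rough Python sketch of that format selection follows. The extension
table, the supported media types, the text/plain default and the rule that
a URL extension wins over the Accept header are all assumptions for
illustration; q-values in the Accept header are deliberately ignored to
keep the example short.

```python
import posixpath

# Hypothetical tables: nothing above fixes the extensions or media types.
EXTENSIONS = {
    ".json": "application/json",
    ".rdf": "application/rdf+xml",
    ".ttl": "text/turtle",
}
SUPPORTED = set(EXTENSIONS.values()) | {"text/plain"}
DEFAULT = "text/plain"


def select_output_format(path: str, accept: str = "") -> str:
    """Choose the response media type for a request path and Accept header."""
    ext = posixpath.splitext(path)[1].lower()
    if ext in EXTENSIONS:
        return EXTENSIONS[ext]
    # Take the first listed media type we support, in header order.
    for item in accept.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type in SUPPORTED:
            return media_type
    return DEFAULT
```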

2. Enhancement of binary content
Client POSTS a PDF, image or other document.

Stanbol uses a default content extractor (DCE) to get text from that
binary content, and then runs as above.

3. Enhancement of remote content
Same as 2. but the posted (json?) document contains URLs of content
that Stanbol first retrieves. Textual content is then extracted and
aggregated from the responses using the DCE, then proceed as in 2.

4. Requests including enhancement pipeline definitions ("stateless Stanbol")
Using a multipart POST in the previous use cases, one part can be a
pipeline definition that describes the enhancement pipeline to use.
The only configuration required on the Stanbol side is making the
engines available with unique names; their assembly happens dynamically
while processing the request.

I'm saying pipeline and not enhancement chain as this goes a bit
further, the pipeline can include selection/configuration of the DCE,
selection/configuration of the renderer used for the enhancement graph
etc., probably using a mini flow language to allow parts of the
pipeline to depend on previous results (similar to the
https://gist.github.com/2931050 idea).

The pipeline granularity can also be smaller than enhancement engines,
for example to select specific NLP components as introduced by
STANBOL-733. One example is dynamic selection of a different part of
speech tagger depending on the detected language.
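
To illustrate the last point, here is a toy Python sketch in which one
pipeline step is chosen from the result of an earlier one. Everything in it
is hypothetical: the step names, the word-list language detector and the
tagger names are stand-ins, not Stanbol components.

```python
from typing import Callable, Dict, List

Step = Callable[[dict], dict]


def detect_language(state: dict) -> dict:
    # Toy detector: a real pipeline would use a language-identification engine.
    words = state["text"].lower().split()
    german = any(w in {"der", "die", "das", "und"} for w in words)
    return {**state, "language": "de" if german else "en"}


def make_tagger(name: str) -> Step:
    # Stand-in tagger that only records which tagger ran.
    return lambda state: {**state, "pos_tagger": name}


TAGGERS: Dict[str, Step] = {
    "en": make_tagger("pos-en"),
    "de": make_tagger("pos-de"),
}


def select_tagger(state: dict) -> dict:
    # The step to run depends on the result of the language detection step.
    return TAGGERS[state["language"]](state)


def run_pipeline(steps: List[Step], text: str) -> dict:
    state = {"text": text}
    for step in steps:
        state = step(state)
    return state
```

A pipeline definition sent with the request would then amount to the list
of step names, for example detect_language followed by select_tagger.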

-Bertrand