Posted to dev@stanbol.apache.org by Tommaso Teofili <to...@gmail.com> on 2011/03/25 12:15:30 UTC

Solr StanbolUpdateHandler

Hi all,
recently I've been working with Solr to enable named entity recognition on
indexed documents, which I did with UIMA, so I wonder whether that could be an
interesting use case for Stanbol as well.

For the mentioned purpose I've developed a custom UpdateHandler[1] for Solr
which enables enriching of documents being indexed with Apache UIMA on the
basis of the following use case:

   1. user sends documents to Solr
   2. each document received by Solr is sent to a UIMA analysis pipeline
   just before it gets indexed
   3. the UIMA pipeline extracts enrichments, i.e. named entities
   4. the enrichments are written to Solr fields on the basis of a mapping
   configuration
   5. the enriched Solr document is actually written inside the index
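
The five steps above can be sketched in plain Java as follows. This is only a
minimal illustration under stated assumptions, not the actual UpdateHandler
code: every name in it (EnrichingUpdateSketch, Analyzer, Enrichment, the
"people_ss" field) is hypothetical, the analyzer is a stand-in for the UIMA
pipeline (or the Stanbol Enhancer), and a list stands in for the Solr index.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these types are illustrative stand-ins,
// not Solr, UIMA, or Stanbol APIs.
final class EnrichingUpdateSketch {

    /** One extracted enrichment, e.g. type "person" with value "Ada Lovelace". */
    static final class Enrichment {
        final String type;
        final String value;

        Enrichment(String type, String value) {
            this.type = type;
            this.value = value;
        }
    }

    /** Stand-in for the analysis step (a UIMA pipeline or the Stanbol Enhancer). */
    interface Analyzer {
        List<Enrichment> analyze(String text);
    }

    private final Analyzer analyzer;
    private final String textField;
    private final Map<String, String> typeToField; // mapping configuration (step 4)
    private final List<Map<String, List<String>>> index = new ArrayList<>();

    EnrichingUpdateSketch(Analyzer analyzer, String textField,
                          Map<String, String> typeToField) {
        this.analyzer = analyzer;
        this.textField = textField;
        this.typeToField = typeToField;
    }

    /** Steps 2 to 5: analyze the text, map enrichments to fields, store the document. */
    Map<String, List<String>> add(Map<String, List<String>> document) {
        // Copy the incoming document so the caller's map is left untouched.
        Map<String, List<String>> enriched = new LinkedHashMap<>();
        document.forEach((field, values) -> enriched.put(field, new ArrayList<>(values)));

        String text = String.join(" ", document.getOrDefault(textField, List.of()));
        for (Enrichment enrichment : analyzer.analyze(text)) {
            String field = typeToField.get(enrichment.type);
            if (field != null) { // enrichment types without a mapping are dropped
                enriched.computeIfAbsent(field, f -> new ArrayList<>())
                        .add(enrichment.value);
            }
        }
        index.add(enriched); // stands in for the actual index write
        return enriched;
    }

    List<Map<String, List<String>>> indexed() {
        return index;
    }

    public static void main(String[] args) {
        // Toy analyzer: reports a "person" whenever a known name occurs in the text.
        Analyzer toy = text -> {
            List<Enrichment> found = new ArrayList<>();
            if (text.contains("Ada Lovelace")) {
                found.add(new Enrichment("person", "Ada Lovelace"));
            }
            return found;
        };
        EnrichingUpdateSketch handler =
                new EnrichingUpdateSketch(toy, "text", Map.of("person", "people_ss"));
        Map<String, List<String>> doc = Map.of(
                "id", List.of("1"),
                "text", List.of("Ada Lovelace wrote the first program."));
        System.out.println(handler.add(doc));
    }
}
```

Keeping the analyzer behind an interface is the point of the sketch: the same
mapping and indexing steps would apply whether the enrichments come from UIMA
or from Stanbol.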

In my opinion that could also be done with the Stanbol Enhancer.
Such an integration could run on top of the already developed contrib module
[2][3] or with a separate one written from scratch; obviously such options
have advantages and drawbacks we can discuss (later?).
What do you think?
Cheers,
Tommaso

[1] : http://wiki.apache.org/solr/SolrPlugins#UpdateHandler
[2] : http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/uima/
[3] : http://wiki.apache.org/solr/SolrUIMA

Re: Solr StanbolUpdateHandler

Posted by Rupert Westenthaler <ru...@gmail.com>.
On Fri, Mar 25, 2011 at 12:42 PM, Olivier Grisel
<ol...@ensta.org> wrote:
> 2011/3/25 Tommaso Teofili <to...@gmail.com>:
>> Hi all,
>> recently I've been working with Solr to enable named entity recognition of
>> indexed documents which I did with UIMA so I wonder if that could be an
>> interesting use case for Stanbol as well.
>>
>> For the mentioned purpose I've developed a custom UpdateHandler[1] for Solr
>> which enables enriching of documents being indexed with Apache UIMA on the
>> basis of the following use case:
>>
>>   1. user sends documents to Solr
>>   2. each document received by Solr is sent to a UIMA analysis pipeline
>>   just before it gets indexed
>>   3. the UIMA pipeline extracts enrichments, i.e. named entities
>>   4. the enrichments are written to Solr fields on the basis of a mapping
>>   configuration
>>   5. the enriched Solr document is actually written inside the index
>>
>> In my opinion that could be done also with Stanbol Enhancer.
>> Such an integration could run on top of the already developed contrib module
>> [2][3] or with a separate one written from scratch; obviously such options
>> have advantages and drawbacks we can discuss (later?).
>> What do you think?
>
> I think that we should definitely work at some point to be able to run
> an arbitrary UIMA analysis chain inside a Stanbol Enhancer. We need to
> write a dummy collection reader that turns a ContentItem into a CAS
> and a generic cas consumer that converts the output into a Clerezza
> Graph + a UIMAEnhancer that takes a CPE configuration to embed. Also
> the CAS to Clerezza Graph consumer could be directly contributed to
> the clerezza project while the ContentItem to CAS collection reader is
> stanbol specific.
>
> That would allow Stanbol users to reuse existing UIMA tools and turn
> them into a more linked data centric REST service.
>
> As for the use case, this is indeed interesting. Please note that the
> Solr engine embedded inside the entity hub is dedicated to fast local
> indexing of Linked Data entities (dbpedia entries for instance) and not
> documents. Stanbol is not really meant to be a document management
> system (at least not in the short term) but more like a knowledge base
> management system that lives next to an existing CMS that would
> probably have its own instance of Solr to index its documents.
>

The suggested module would be interesting for CMSs that already use
Solr within their search infrastructure. It would allow them to
incorporate the semantic lifting capabilities of Stanbol very easily.

> Extending Stanbol to build semantically enriched indices of documents
> would still be in the scope of stanbol but I think we should first
> focus on finishing the cleaning / refactoring of the existing code
> base before implementing new services.
>
The /store and /sparql endpoints already provide this functionality to
some degree and, as far as I know, they are used especially in portal
environments (where the documents provided by the portal are actually
managed by different CMSs).

Semantic search over the metadata extracted from documents by the
Stanbol Enhancer is an interesting feature, and I think it could be
implemented by combining a triple store with an adapted version of
the SolrYard (the Solr-based storage component that is part of the
Entityhub).
However I would define this as an additional component (same level as
Enhancer and Entityhub) - maybe Documenthub?

best
Rupert






> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Solr StanbolUpdateHandler

Posted by Tommaso Teofili <to...@gmail.com>.
2011/3/25 Olivier Grisel <ol...@ensta.org>

>
> I think that we should definitely work at some point to be able to run
> an arbitrary UIMA analysis chain inside a Stanbol Enhancer. We need to
> write a dummy collection reader that turns a ContentItem into a CAS
> and a generic cas consumer that converts the output into a Clerezza
> Graph + a UIMAEnhancer that takes a CPE configuration to embed. Also
> the CAS to Clerezza Graph consumer could be directly contributed to
> the clerezza project while the ContentItem to CAS collection reader is
> stanbol specific.
>

Good points. I can work on the Clerezza Graph CAS consumer, based on the
uima.utils Clerezza module [1].
I was also thinking about how an engine could be packaged as a UIMA PEAR to
allow the execution of Stanbol engines inside UIMA pipelines without writing
"mapping code" (I wonder if a custom Maven plugin could do that).



>
> That would allow Stanbol users to reuse existing UIMA tools and turn
> them into a more linked data centric REST service.
>

:)


>
> As for the use case, this is indeed interesting. Please note that the
> Solr engine embedded inside the entity hub is dedicated to fast local
> indexing of Linked Data entities (dbpedia entries for instance) and not
> documents. Stanbol is not really meant to be a document management
> system (at least not in the short term) but more like a knowledge base
> management system that lives next to an existing CMS that would
> probably have its own instance of Solr to index its documents.


> Extending Stanbol to build semantically enriched indices of documents
> would still be in the scope of stanbol but I think we should first
> focus on finishing the cleaning / refactoring of the existing code
> base before implementing new services.



I completely agree; I just wanted to raise the point early enough to allow
proper architectural considerations and to share existing usage scenarios :)
Cheers,
Tommaso

[1] :
http://svn.apache.org/repos/asf/incubator/clerezza/trunk/parent/uima/uima.utils/

Re: Solr StanbolUpdateHandler

Posted by Olivier Grisel <ol...@ensta.org>.
2011/3/25 Tommaso Teofili <to...@gmail.com>:
> Hi all,
> recently I've been working with Solr to enable named entity recognition of
> indexed documents which I did with UIMA so I wonder if that could be an
> interesting use case for Stanbol as well.
>
> For the mentioned purpose I've developed a custom UpdateHandler[1] for Solr
> which enables enriching of documents being indexed with Apache UIMA on the
> basis of the following use case:
>
>   1. user sends documents to Solr
>   2. each document received by Solr is sent to a UIMA analysis pipeline
>   just before it gets indexed
>   3. the UIMA pipeline extracts enrichments, i.e. named entities
>   4. the enrichments are written to Solr fields on the basis of a mapping
>   configuration
>   5. the enriched Solr document is actually written inside the index
>
> In my opinion that could be done also with Stanbol Enhancer.
> Such an integration could run on top of the already developed contrib module
> [2][3] or with a separate one written from scratch; obviously such options
> have advantages and drawbacks we can discuss (later?).
> What do you think?

I think that we should definitely work at some point to be able to run
an arbitrary UIMA analysis chain inside a Stanbol Enhancer. We need to
write a dummy collection reader that turns a ContentItem into a CAS
and a generic cas consumer that converts the output into a Clerezza
Graph + a UIMAEnhancer that takes a CPE configuration to embed. Also
the CAS to Clerezza Graph consumer could be directly contributed to
the clerezza project while the ContentItem to CAS collection reader is
stanbol specific.
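
The three pieces named above (a reader from ContentItem to CAS, a consumer
from CAS to graph, and an enhancer that runs the chain in between) could be
wired together roughly as in the following sketch. It deliberately uses
simplified, hypothetical stand-ins (Item, Cas, Mention, Triple, Annotator,
the "urn:example:" predicate prefix) rather than the real UIMA, Stanbol, or
Clerezza classes, so it shows only the shape of the adapter, not an actual
implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: simplified stand-ins, not the UIMA, Stanbol, or Clerezza APIs.
final class UimaBridgeSketch {

    /** Stand-in for a Stanbol ContentItem: an identifier plus its text. */
    static final class Item {
        final String uri;
        final String text;

        Item(String uri, String text) {
            this.uri = uri;
            this.text = text;
        }
    }

    /** Stand-in for a UIMA annotation: a typed span over the text. */
    static final class Mention {
        final String type;
        final int begin;
        final int end;

        Mention(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Stand-in for a UIMA CAS: the text plus the mentions added by the chain. */
    static final class Cas {
        final String text;
        final List<Mention> mentions = new ArrayList<>();

        Cas(String text) {
            this.text = text;
        }
    }

    /** Stand-in for an RDF triple in a Clerezza graph. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return "<" + subject + "> <" + predicate + "> \"" + object + "\"";
        }
    }

    /** Stand-in for one step of the analysis chain. */
    interface Annotator {
        void process(Cas cas);
    }

    /** "Collection reader" role: turn a content item into a CAS. */
    static Cas toCas(Item item) {
        return new Cas(item.text);
    }

    /** "CAS consumer" role: turn each mention into a triple about the item. */
    static List<Triple> toTriples(Item item, Cas cas) {
        List<Triple> triples = new ArrayList<>();
        for (Mention m : cas.mentions) {
            triples.add(new Triple(item.uri, "urn:example:" + m.type,
                    cas.text.substring(m.begin, m.end)));
        }
        return triples;
    }

    /** "UIMAEnhancer" role: run the configured chain and return the triples. */
    static List<Triple> enhance(Item item, List<Annotator> chain) {
        Cas cas = toCas(item);
        for (Annotator annotator : chain) {
            annotator.process(cas);
        }
        return toTriples(item, cas);
    }

    public static void main(String[] args) {
        // Toy annotator: marks the first occurrence of the word "Paris" as a city.
        Annotator cityFinder = cas -> {
            int at = cas.text.indexOf("Paris");
            if (at >= 0) {
                cas.mentions.add(new Mention("city", at, at + "Paris".length()));
            }
        };
        Item item = new Item("urn:example:doc1", "A conference was held in Paris.");
        enhance(item, List.of(cityFinder)).forEach(System.out::println);
    }
}
```

In this shape only toCas depends on the content item type, which matches the
split suggested above: the graph-producing side could live with Clerezza,
while the reader stays Stanbol specific.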

That would allow Stanbol users to reuse existing UIMA tools and turn
them into a more linked data centric REST service.

As for the use case, this is indeed interesting. Please note that the
Solr engine embedded inside the entity hub is dedicated to fast local
indexing of Linked Data entities (dbpedia entries for instance) and not
documents. Stanbol is not really meant to be a document management
system (at least not in the short term) but more like a knowledge base
management system that lives next to an existing CMS that would
probably have its own instance of Solr to index its documents.

Extending Stanbol to build semantically enriched indices of documents
would still be in the scope of stanbol but I think we should first
focus on finishing the cleaning / refactoring of the existing code
base before implementing new services.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel