Posted to solr-user@lucene.apache.org by JCodina <jo...@barcelonamedia.org> on 2009/07/20 12:43:38 UTC

Solr and UIMA

We are starting to use UIMA as a platform for analyzing text.
The result of analyzing a document is a UIMA CAS. A CAS is a generic data
structure that can contain different kinds of data.
UIMA processes one document at a time: documents come from a CAS producer,
are processed through a pipeline that the user defines, and finally the
result is sent to a CAS consumer, which "saves" or "stores" the result.
The pipeline is thus a chain of different tools that annotate the text with
different information. Different sets of tools are available out there, each
of them defining its own data types that are included in the CAS. To build a
pipeline, the output and input CAS types of the elements being connected
need to be compatible.
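
As a minimal sketch of the consumer side (the class name is made up; the
base class and callback are the standard UIMA SDK API):

import org.apache.uima.cas.CAS;
import org.apache.uima.collection.CasConsumer_ImplBase;
import org.apache.uima.resource.ResourceProcessException;

public class SolrCasConsumer extends CasConsumer_ImplBase {
    @Override
    public void processCas(CAS cas) throws ResourceProcessException {
        // the CAS carries the document text plus all the annotations
        // added by the pipeline
        String text = cas.getDocumentText();
        // here we would walk the annotation indexes and build the
        // Solr XML to post (see below)
    }
}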

There is a CAS consumer that feeds a Lucene index, called Lucas, but I was
looking at it and I would prefer to connect UIMA to Solr. Why?
A: I know Solr ;-) and I like it.
B: I can configure the fields and their processing in Solr using XML. Once
that is done, I have it ready to use with a set of tools that let me easily
explore the data....
C: It is easier to use Solr as a "web service" that may receive docs from
different UIMA instances (natural language processing is CPU intensive).
D: Break things down. The CAS consumer would only produce XML that Solr can
process; then different tokenizers can be used to deal with the data from
the CAS. The main point is that the XML carries the doc and field labels of
Solr (see the example after this list).
E: The set of capabilities to process the XML is defined in XML: as in
Lucas to define the output, and in the Solr schema to define how it is
processed.
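
For point D, the XML the consumer emits would already be in Solr's update
format, for example (the field names here are only illustrative):

<add>
  <doc>
    <field name="id">doc-0001</field>
    <field name="text">the plain text of the document...</field>
  </doc>
</add>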


I want to use it to index something that is common, but that I cannot get
any tool to do with Solr: indexing a word while encoding the syntactic and
semantic information at the same position. I know that in Lucene this is
evolving and it will become possible to include such metadata, but not for
the moment.


So, my idea is first to produce a UIMA CAS consumer that POSTs an XML file
containing the plain text of the document to Solr, then modify it to
include multiple fields, and then start encoding the semantic information.
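
The POST step itself is simple; a minimal sketch (the update URL is Solr's
standard endpoint; error handling and commits are omitted):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SolrPoster {
    public static void post(String updateUrl, String addXml) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(updateUrl).openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = con.getOutputStream();
        out.write(addXml.getBytes("UTF-8"));
        out.close();
        if (con.getResponseCode() != 200) {
            throw new IllegalStateException("Solr update failed: " + con.getResponseCode());
        }
    }
}

// e.g. post("http://localhost:8983/solr/update", "<add><doc>...</doc></add>");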

So, before starting, I would like to hear your opinions, and to know
whether anyone is interested in collaborating or has some code that could
be integrated into this.
 


Re: Solr and UIMA

Posted by JCodina <jo...@barcelonamedia.org>.




Let me summarize:

We (well, I think Grant?) change the DPTFF
(DelimitedPayloadTokenFilterFactory) so that it is able to index, at the
same position, different tokens that may carry payloads (as sketched
below), using:
1. a token delimiter (#)
2. a payload delimiter (|)
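
For reference, this is how the existing DPTFF is wired into a schema.xml
analyzer today (a sketch with an illustrative field type name; the extra
"#" token delimiter would need the new variant and does not exist yet):

<fieldType name="payloads" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- turns "car|NC" into the token "car" with payload "NC" -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory"
            delimiter="|" encoder="identity"/>
  </analyzer>
</fieldType>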

We (that's me) build SolCAS: a UIMA CAS consumer equivalent to LuCAS but
that allows indexing using Solr. SolCAS is able to generate different
tokens at the same position, possibly with payloads, so the result is ready
for the new DPTFF.

We (me again) develop some filtering utilities based on the payloads:
something like the stopwords filter, but instead of rejecting the tokens
that are in a stopwords list, it would reject the tokens whose payloads are
in a "payloads" list.
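
A minimal sketch of such a filter (the class name is made up; it assumes a
recent Lucene where payloads are BytesRef):

import java.io.IOException;
import java.util.Set;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

public final class PayloadStopFilter extends TokenFilter {
    private final Set<BytesRef> stopPayloads;
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    public PayloadStopFilter(TokenStream input, Set<BytesRef> stopPayloads) {
        super(input);
        this.stopPayloads = stopPayloads;
    }

    @Override
    public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
            BytesRef payload = payloadAtt.getPayload();
            // keep tokens without a payload, or whose payload is not stopped;
            // inverting this test gives the "keep only" variant
            if (payload == null || !stopPayloads.contains(payload)) {
                return true;
            }
        }
        return false;
    }
}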

We will also try to develop an n-gram generator based on the payloads: for
example, finding the nouns followed by an adjective that is fewer than 4
positions away.

For the moment searches cannot be performed based on payloads, not even as
a filter... but this is a matter of time.

Problems to solve:
processing the N tokens that share the same position in a clean way, since
tokenizer.next() will not deliver them together (which is a pity). We could
write some utility that gives the tools that manage multi-tokens a common
front-end and back-end: the front-end performs multiple next() calls in
order to put together all the information at the same position, the tool
performs its treatment on that multi-token structure, and the back-end then
generates single tokens again, one next() at a time (see the sketch below).
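
A sketch of the front-end idea (a hypothetical helper, not part of Lucene
or Solr; note that the look-ahead consumes the first token of the next
position, so a real implementation would keep it as pending state):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class SamePositionBuffer {
    // Assumes the stream is currently on the first token of a position.
    public static List<String> collectPosition(TokenStream in) throws IOException {
        CharTermAttribute term = in.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute posIncr = in.addAttribute(PositionIncrementAttribute.class);
        List<String> stacked = new ArrayList<String>();
        stacked.add(term.toString());
        // a position increment of 0 means "same position as the previous token"
        while (in.incrementToken() && posIncr.getPositionIncrement() == 0) {
            stacked.add(term.toString());
        }
        return stacked;
    }
}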

Joan


Re: Solr and UIMA

Posted by JCodina <jo...@barcelonamedia.org>.
You can test our UIMA-to-Solr CAS consumer.
It is based on JulieLab's Lucas and uses their CAS,
but transformed to generate XML, which can be saved to a file or posted
directly to Solr.
In the map file you can define which information is generated for each
token, and how it is concatenated, allowing the generation of things like
"the|AD car|NC", which can then be processed using payloads.

You can now get it from my page:
http://www.barcelonamedia.org/personal/joan.codina/en




Re: Solr and UIMA

Posted by Roland Cornelissen <ro...@kabelco.nl>.
Hi Joan,
I'm curious to try your Solr CAS consumer.
Do you have news already? ;-)

Roland


On 02/11/2010 03:15 PM, JCodina wrote:
>
> Things are done!!!!  :-)
>
> We have now finished the UIMA CAS consumer for Solr;
> we are making it public, more news soon.
> [...]
>



Re: Solr and UIMA

Posted by JCodina <jo...@barcelonamedia.org>.
Things are done!!!!  :-)

We have now finished the UIMA CAS consumer for Solr;
we are making it public, more news soon.

We have also been developing some filters based on payloads.
One of the filters removes the words whose payloads are in a list; the
other one keeps only the tokens whose payloads are in the list. They work
the same way as the StopFilterFactory.
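
A sketch of how such filters might be wired into a field type; the factory
class name and its attributes are made up for illustration, by analogy with
solr.StopFilterFactory:

<fieldType name="text_pos" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory"
            delimiter="|" encoder="identity"/>
    <!-- hypothetical factory: drop tokens whose payload is listed;
         keep="true" would instead keep only those tokens -->
    <filter class="org.barcelonamedia.solr.PayloadStopFilterFactory"
            payloads="payloadstop.txt" keep="false"/>
  </analyzer>
</fieldType>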

You can find it at my page:
http://www.barcelonamedia.org/personal/joan.codina/en



Re: Solr and UIMA

Posted by Jussi Arpalahti <ja...@gmail.com>.
2009/7/23 Grant Ingersoll <gs...@apache.org>:
>
> [...]
>
> the DPTFF (nice acronym, eh?) allows you to send in your normal Solr XML,
> but with payloads encoded in the text.  For instance:
>
> <field name="foo">the quick|JJ red|JJ fox|NN jumped|VB over the lazy|JJ
> brown|JJ dogs|NN</field>
>
> The DPTFF will take the value before the delimiter as the Token and the
> value after the delimiter as the payload.  This then allows Solr to add
> Payloads without modifying a single thing in Solr, at least on the indexing
> side.
>
> [...]
>
> We could likely make a generic TokenFilter that can capture both multiple
> tokens and payloads all at the same time, simply by allowing it to have two
> attributes:
> 1. token delimiter (#)
> 2. payload delimiter (|)
>
> Then, you could do something like:
> was#be|verb
> or
> was#be|0.3
>
> where "was" and "be" are both tokens at the same position and "verb" or
> "0.3" are payloads on those tokens.  This is a nearly trivial variation of
> the DelimitedPayloadTokenFilter

Hi.

Apologies if I'm hijacking the thread.. I for one would very much like
this behaviour when indexing XML documents. I have a requirement to
get the matching field's XPath location in the document. I currently
generate the index like this:

some_field: {{ payload "//p[1]" }} actual text content of first p element

Then I strip the "payload" part with a custom filter (before the other,
"normal" filters), but store the text with the "payload" part. The client
side then gets the XPath, and the user can choose to fetch the matched part
from the found document. The user of course sees the actual text with
highlighting, with the "payload" part removed. I think Lucene's payload
mechanism would be a better fit for this, but not being too competent with
Java I developed this hack. It does make client-side parsing that much more
difficult..

Of course, the payload would need to find its way into Solr's query
response XML somehow.
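
Incidentally, the strip-before-analysis step could probably be done without
custom Java, for example with Solr's PatternReplaceCharFilterFactory (a
sketch; the field type name and the exact pattern are illustrative):

<fieldType name="xpath_text" class="solr.TextField">
  <analyzer>
    <!-- remove the leading {{ payload ... }} marker before tokenizing;
         the stored value keeps it for the client -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^\{\{ payload [^}]*\}\} " replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>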

Thank you.


Jussi Arpalahti


Re: Solr and UIMA

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 21, 2009, at 11:57 AM, JCodina wrote:

>
> Hello, Grant,
> there are two ways to implement this: one is payloads, and the other one
> is multiple tokens at the same position.
> Each of them can be useful; let me explain the way I think they can be
> used.
> Payloads: every token has extra information that can be used in the
> processing. For example, if I can add part-of-speech tags then I can
> develop tokenizers that take the POS into account (or, for example, I can
> generate bigrams of Noun Adjective, or Noun prep Noun, and I can have a
> better stopwords algorithm....)
>
> Multiple tokens in one position: if I can have different tokens at the
> same place, I can have different information like "was #verb _be", so I
> can do a search for "you _be #adjective" to find all the sentences that
> talk about "you", for example "you were clever", "you are tall" ......

This was one of the use cases for payloads as well, but it likely  
needs more Query support at the moment, as the BoostingTermQuery would  
only allow you to boost values where it's a verb, not include/exclude.

>
>
> I have not understood how the DelimitedPayloadTokenFilterFactory
> may work in Solr; what is the input format?

the DPTFF (nice acronym, eh?) allows you to send in your normal Solr
XML, but with payloads encoded in the text.  For instance:

<field name="foo">the quick|JJ red|JJ fox|NN jumped|VB over the lazy|JJ
brown|JJ dogs|NN</field>

The DPTFF will take the value before the delimiter as the Token and
the value after the delimiter as the payload.  This then allows Solr
to add Payloads without modifying a single thing in Solr, at least on
the indexing side.

>
> So I was thinking of generating an XML where for each token a single
> string is generated, like "was#verb#be",
> and then there is a token filter that splits each whitespace-separated
> string by "#", in this case into three words, and adds the trailing
> character that allows searching for the right semantic info, but gives
> them the same position increment. Of course the full processing chain
> must be aware of this.
> But I must think about multiword tokens
>

We could likely make a generic TokenFilter that can capture both
multiple tokens and payloads all at the same time, simply by allowing
it to have two attributes:
1. token delimiter (#)
2. payload delimiter (|)

Then, you could do something like:
was#be|verb
or
was#be|0.3

where "was" and "be" are both tokens at the same position and "verb"  
or "0.3" are payloads on those tokens.  This is a nearly trivial  
variation of the DelimitedPayloadTokenFilter
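
A sketch of what that variation might look like (the class name is
hypothetical; only the DelimitedPayloadTokenFilter itself exists today):
"was#be|verb" becomes the tokens "was" and "be" at the same position, with
"verb" stored as the payload of "be".

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.BytesRef;

public final class DelimitedTokenAndPayloadFilter extends TokenFilter {
    private final Pattern tokenDelim;   // e.g. "#"
    private final char payloadDelim;    // e.g. '|'
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
    private final PositionIncrementAttribute posIncrAtt =
        addAttribute(PositionIncrementAttribute.class);
    private final Deque<String> stacked = new ArrayDeque<String>();

    public DelimitedTokenAndPayloadFilter(TokenStream input, char tokenDelim, char payloadDelim) {
        super(input);
        this.tokenDelim = Pattern.compile(Pattern.quote(String.valueOf(tokenDelim)));
        this.payloadDelim = payloadDelim;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!stacked.isEmpty()) {
            emit(stacked.poll());
            posIncrAtt.setPositionIncrement(0); // same position as previous token
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String[] parts = tokenDelim.split(termAtt.toString());
        for (int i = 1; i < parts.length; i++) {
            stacked.add(parts[i]);
        }
        emit(parts[0]); // keeps the position increment of the original token
        return true;
    }

    // Split "be|verb" into the term "be" and the payload "verb".
    private void emit(String raw) {
        int p = raw.indexOf(payloadDelim);
        termAtt.setEmpty().append(p >= 0 ? raw.substring(0, p) : raw);
        payloadAtt.setPayload(p >= 0 ? new BytesRef(raw.substring(p + 1)) : null);
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        stacked.clear();
    }
}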







--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Solr and UIMA

Posted by JCodina <jo...@barcelonamedia.org>.
Hello, Grant,
there are two ways to implement this: one is payloads, and the other one is
multiple tokens at the same position.
Each of them can be useful; let me explain the way I think they can be used.
Payloads: every token has extra information that can be used in the
processing. For example, if I can add part-of-speech tags then I can develop
tokenizers that take the POS into account (or, for example, I can generate
bigrams of Noun Adjective, or Noun prep Noun, and I can have a better
stopwords algorithm....)

Multiple tokens in one position: if I can have different tokens at the same
place, I can have different information like "was #verb _be", so I can do a
search for "you _be #adjective" to find all the sentences that talk about
"you", for example "you were clever", "you are tall" ......


I have not understood how the DelimitedPayloadTokenFilterFactory may work
in Solr; what is the input format?

So I was thinking of generating an XML where for each token a single string
is generated, like "was#verb#be",
and then there is a token filter that splits each whitespace-separated
string by "#", in this case into three words, and adds the trailing
character that allows searching for the right semantic info, but gives them
the same position increment. Of course the full processing chain must be
aware of this.
But I must think about multiword tokens


Grant Ingersoll-6 wrote:
>
>
> I just committed the DelimitedPayloadTokenFilterFactory, I suspect
> this is along the lines of what you are thinking, but I haven't done
> all that much with UIMA.
>
> [...]
>
> What does Lucas do with Lucene?  Is it putting multiple tokens at the
> same position or using Payloads?
>



Re: Solr and UIMA

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 20, 2009, at 6:43 AM, JCodina wrote:

> D: Break things down. The CAS consumer would only produce XML that Solr
> can process; then different tokenizers can be used to deal with the data
> from the CAS. The main point is that the XML carries the doc and field
> labels of Solr.

I just committed the DelimitedPayloadTokenFilterFactory, I suspect  
this is along the lines of what you are thinking, but I haven't done  
all that much with UIMA.

I also suspect the Tee/Sink capabilities of Lucene could be helpful,  
but they aren't available in Solr yet.
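
(For readers unfamiliar with it, a minimal sketch of the Lucene-side
Tee/Sink idea; package names and constructors vary across Lucene versions:)

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.sinks.TeeSinkTokenFilter;

public class TeeSinkSketch {
    public static void main(String[] args) {
        TokenStream source = new WhitespaceTokenizer(new StringReader("the|AD car|NC"));
        TeeSinkTokenFilter tee = new TeeSinkTokenFilter(source);
        TokenStream sink = tee.newSinkTokenStream();
        // index `tee` as the tokens of one field and `sink` as the tokens of
        // another; a TeeSinkTokenFilter.SinkFilter could keep, say, only the
        // POS tokens in the second field
    }
}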


> E: The set of capabilities to process the XML is defined in XML: as in
> Lucas to define the output, and in the Solr schema to define how it is
> processed.
>
>
> I want to use it to index something that is common, but that I cannot
> get any tool to do with Solr: indexing a word while encoding the
> syntactic and semantic information at the same position. I know that in
> Lucene this is evolving and it will become possible to include such
> metadata, but not for the moment.

What does Lucas do with Lucene?  Is it putting multiple tokens at the  
same position or using Payloads?

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search