Posted to java-user@lucene.apache.org by "Wu, Stephen T., Ph.D." <Wu...@mayo.edu> on 2013/01/04 00:16:06 UTC

More about storing NLP-type stuff in the index

I think we've been saying that if we put something in a Payload, it will be
indexed.  From what I understand of the indexing format, that means that
what you put in the Payload will be stored in the Lucene index... But it
won't *itself* be indexed & optimized for search.

That's good, but can we build inverted indices on the contents of the
Payloads (or the Attributes) as well?
 Ex1: Say I put semantic role labels like ARG0 into my index. Say my search
is looking for all instances of ARG0.
 Ex2: Say I add payloads to terms indicating that they're named entities
belonging to a semantic group.  Then say my query looks for all instances of
the "Medications" semantic group.

It's almost like just putting these things in different fields, with the
exception that the things in different fields need to be linked so you know
what the original text was.  Maybe the linking can be done via Payloads
(offsets in the original text)?  If I want to store multiple things at the
same startOffset then I just use something like SynonymFilter?
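
To make the idea concrete, here is a minimal sketch in plain Java (no
Lucene classes) of labels stacked at the same position as the word they
annotate, with character offsets linking back to the original text.  The
names StackedLabels and Tok, the label strings, and the example sentence
are all made up for illustration; in Lucene the corresponding roles would
presumably be played by position increments and offsets on the token
stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch in plain Java; none of these names are Lucene classes.
class StackedLabels {
    static final String TEXT = "Patient takes aspirin daily";

    // A token with a position increment (0 = same position as the previous
    // token) and the character offsets of the text it covers.
    static final class Tok {
        final String term;
        final int posInc, start, end;
        Tok(String term, int posInc, int start, int end) {
            this.term = term; this.posInc = posInc;
            this.start = start; this.end = end;
        }
    }

    // Words plus two labels stacked on the words they annotate.
    static List<Tok> exampleTokens() {
        return Arrays.asList(
            new Tok("patient", 1, 0, 7),
            new Tok("ARG0", 0, 0, 7),           // same position as "patient"
            new Tok("takes", 1, 8, 13),
            new Tok("aspirin", 1, 14, 21),
            new Tok("Medications", 0, 14, 21),  // same position as "aspirin"
            new Tok("daily", 1, 22, 27));
    }

    // Inverted index: term -> list of {position, startOffset, endOffset}.
    static Map<String, List<int[]>> index(List<Tok> toks) {
        Map<String, List<int[]>> postings = new HashMap<>();
        int pos = -1;
        for (Tok t : toks) {
            pos += t.posInc;
            postings.computeIfAbsent(t.term, k -> new ArrayList<>())
                    .add(new int[] {pos, t.start, t.end});
        }
        return postings;
    }

    // Original text covered by every occurrence of a label (or word).
    static List<String> instancesOf(String label,
                                    Map<String, List<int[]>> postings,
                                    String text) {
        List<String> out = new ArrayList<>();
        for (int[] p : postings.getOrDefault(label, Collections.<int[]>emptyList())) {
            out.add(text.substring(p[1], p[2]));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<int[]>> postings = index(exampleTokens());
        System.out.println(instancesOf("Medications", postings, TEXT));
        System.out.println(instancesOf("ARG0", postings, TEXT));
    }
}
```

Because the labels are ordinary terms in the inverted index, looking up
"Medications" is a plain term lookup, and the offsets recover the
original text.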

stephen


On 12/21/12 6:45 AM, "Michael McCandless" <lu...@mikemccandless.com> wrote:

> On Thu, Dec 20, 2012 at 3:54 PM, Wu, Stephen T., Ph.D.
> <Wu...@mayo.edu> wrote:
>>> If you stuff the end of the span into the payload you'd have to create
>>> a custom variant of PhraseQuery to properly match based on the end
>>> span.
>> 
>> How different is this from the functionality already available through
>> SpanQuery?
> 
> Good question!
> 
> I think the difference would be index-time (payload encoding span-end
> + new Query) vs search time (SpanQuery)?
> 
> I.e., with the former (index-time) you'd have a TokenFilter spotting the
> spans and encoding them into the index, and with the latter all
> spotting happens at search time?
> 
> So net/net I guess (?) the results would be the same, but performance
> should be faster if you do it index-time?
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 




Re: More about storing NLP-type stuff in the index

Posted by Michael Sokolov <so...@ifactory.com>.
On 1/3/2013 6:16 PM, Wu, Stephen T., Ph.D. wrote:
> I think we've been saying that if we put something in a Payload, it will be
> indexed.  From what I understand of the indexing format, that means that
> what you put in the Payload will be stored in the Lucene index... But it
> won't *itself* be indexed & optimized for search.
>
> That's good, but can we build inverted indices on the contents of the
> Payloads (or the Attributes) as well?
>   Ex1: Say I put semantic role labels like ARG0 into my index. Say my search
> is looking for all instances of ARG0.
>   Ex2: Say I add payloads to terms indicating that they're named entities
> belonging to a semantic group.  Then say my query looks for all instances of
> the "Medications" semantic group.
>
> It's almost like just putting these things in different fields, with the
> exception that the things in different fields need to be linked so you know
> what the original text was.  Maybe the linking can be done via Payloads
> (offsets in the original text)?  If I want to store multiple things at the
> same startOffset then I just use something like SynonymFilter?
>
I've been working on a different but (in a way) related problem: 
indexing text in XML documents.  In that case, we want to associate the 
names of enclosing elements with each term so that it's possible to 
search for (say) "ermine" in the context /doc/title as distinct from 
"ermine" in the context of //paragraph, or something like that.  Anyway 
what I've done doesn't use payloads.  I index two fields that are 
relevant to this: a full text field, which is just the usual text index 
(per document), and then an element-text field which indexes each term 
as a concatenation of the element name and the term value, so: 
title:ermine, doc:ermine, and paragraph:ermine would be typical terms.  
I index all of the enclosing element names for each word at the same 
position (like synonym filter does). This relies on a magical character 
(":") that isn't allowed to appear in any tokens, which is too bad, but 
not terribly restrictive.
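
A rough sketch of the scheme, in plain Java rather than actual Lucene
code: every word is indexed once per enclosing element, as
element + ":" + word, and all of those terms share the word's position.
The class ElementTextIndex and its methods are hypothetical names used
only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch, not Lucene API: an "element-text" index whose terms
// are elementName + ':' + word, all stacked at the word's position.
class ElementTextIndex {
    static final char SEP = ':';

    // term -> positions at which it occurs
    private final TreeMap<String, List<Integer>> postings = new TreeMap<>();
    private int nextPos = 0;

    // Index one word together with the names of all its enclosing elements.
    void addWord(String word, List<String> enclosingElements) {
        if (word.indexOf(SEP) >= 0) {
            // The separator must never appear inside a token.
            throw new IllegalArgumentException("token contains '" + SEP + "': " + word);
        }
        int pos = nextPos++;
        for (String element : enclosingElements) {
            postings.computeIfAbsent(element + SEP + word, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Positions where the word occurs inside the given element.
    List<Integer> positions(String element, String word) {
        return postings.getOrDefault(element + SEP + word,
                                     Collections.<Integer>emptyList());
    }

    public static void main(String[] args) {
        // <doc><title>ermine</title><paragraph>the ermine</paragraph></doc>
        ElementTextIndex idx = new ElementTextIndex();
        idx.addWord("ermine", Arrays.asList("doc", "title"));
        idx.addWord("the", Arrays.asList("doc", "paragraph"));
        idx.addWord("ermine", Arrays.asList("doc", "paragraph"));
        System.out.println(idx.positions("title", "ermine"));
        System.out.println(idx.positions("paragraph", "ermine"));
        System.out.println(idx.positions("doc", "ermine"));
    }
}
```

Searching for "ermine" in a title then becomes an ordinary lookup of the
single term title:ermine.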

Something like this might work for you.  The prefixing also has the nice 
feature that when you enumerate terms, they are ordered first by prefix. 
Of course, you could flip the order if it were more interesting to list 
all "contexts" for a word rather than all words in a context (or with 
some POS tag).
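
The ordering point can be illustrated with a sorted set of strings
standing in for the term dictionary.  This is only a sketch in plain
Java: the class PrefixTermEnum and the sample terms are invented for the
example, and it assumes plain ASCII terms, where Java's string order and
a byte-wise term order agree.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch: a TreeSet stands in for a sorted term dictionary.
class PrefixTermEnum {
    static SortedSet<String> sample() {
        return new TreeSet<>(Arrays.asList(
            "title:ermine", "doc:ermine", "paragraph:ermine",
            "doc:the", "paragraph:the"));
    }

    // All terms of the form prefix + ':' + something, in sorted order.
    // ';' is the character right after ':', so the half-open range
    // [prefix + ":", prefix + ";") covers exactly those terms.
    static List<String> termsWithPrefix(SortedSet<String> terms, String prefix) {
        return new ArrayList<>(terms.subSet(prefix + ":", prefix + ";"));
    }

    // Flip "context:word" into "word:context" so that enumeration groups
    // by word instead.  Assumes every term contains the ':' separator.
    static SortedSet<String> flip(SortedSet<String> terms) {
        SortedSet<String> out = new TreeSet<>();
        for (String t : terms) {
            int i = t.indexOf(':');
            out.add(t.substring(i + 1) + ":" + t.substring(0, i));
        }
        return out;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = sample();
        System.out.println(termsWithPrefix(terms, "doc"));           // words in a context
        System.out.println(termsWithPrefix(flip(terms), "ermine"));  // contexts of a word
    }
}
```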

-Mike
