You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by "Wu, Stephen T., Ph.D." <Wu...@mayo.edu> on 2012/12/12 21:02:41 UTC

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

>> Is there any (preliminary) code checked in somewhere that I can look at,
>> that would help me understand the practical issues that would need to be
>> addressed?
> 
> Maybe we can make this more concrete: what new attribute are you
> needing to record in the postings and access at search time?

For example: 
 - part of speech of a token.
 - syntactic parse subtree (over a span).
 - semantically normalized phrase (to canonical text or ontological code).
 - semantic group (of a span).
 - coreference link.

stephen


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by SUJIT PAL <su...@comcast.net>.

Hi Glen,

I don't believe you can attach a single payload to multiple tokens. What I did for a similar requirement was to combine the tokens into a single "_" delimited single token and attached the payload to it. For example:

The Big Bad Wolf huffed and puffed and blew the house of the Three Little Pigs down.

Now assume "Big Bad Wolf" and "Three Little Pigs" are spans to which I would like to attach payloads to. I run the tokens through a custom tokenizer that produces:

The Big_Bad_Wolf$payload1 huffed and puffed and blew the house of the Three_Little_Pigs$payload2 down.

In my case this makes sense, ie I can treat the span as a single unit. Not sure about your use case.

HTH
Sujit

On Dec 13, 2012, at 2:08 PM, Glen Newton wrote:

> Cool! Sounds great!  :-)
> 
> Any pointers to a (Lucene) example that attaches a payload to a
> start..end span that is more than one token?
> 
> thanks,
> -Glen
> 
> On Thu, Dec 13, 2012 at 5:03 PM, Lance Norskog <go...@gmail.com> wrote:
>> I should not have added that note. The Opennlp patch gives a concrete
>> example of adding an annotation to text.
>> 
>> 
>> On 12/13/2012 01:54 PM, Glen Newton wrote:
>>> 
>>> It is not clear this is exactly what is needed/being discussed.
>>> 
>>> From the issue:
>>> "We are also planning a Tokenizer/TokenFilter that can put parts of
>>> speech as either payloads (PartOfSpeechAttribute?) on a token or at
>>> the same position."
>>> 
>>> This adds it to a token, not a span. 'same position' does not suggest
>>> it also records the end position.
>>> 
>>> -Glen
>>> 
>>> On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog <go...@gmail.com> wrote:
>>>> 
>>>> Parts-of-speech is available now, in the indexer.
>>>> 
>>>> LUCENE-2899 adds OpenNLP to the Lucene&Solr codebase. It does
>>>> parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an
>>>> Apache
>>>> project for natural-language processing.
>>>> 
>>>> Some parts are in Solr that could be in Lucene.
>>>> 
>>>> https://issues.apache.org/jira/browse/lucene-2899
>>>> 
>>>> 
>>>> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>>>>>> 
>>>>>>> Is there any (preliminary) code checked in somewhere that I can look
>>>>>>> at,
>>>>>>> that would help me understand the practical issues that would need to
>>>>>>> be
>>>>>>> addressed?
>>>>>> 
>>>>>> Maybe we can make this more concrete: what new attribute are you
>>>>>> needing to record in the postings and access at search time?
>>>>> 
>>>>> For example:
>>>>>   - part of speech of a token.
>>>>>   - syntactic parse subtree (over a span).
>>>>>   - semantically normalized phrase (to canonical text or ontological
>>>>> code).
>>>>>   - semantic group (of a span).
>>>>>   - coreference link.
>>>>> 
>>>>> stephen
>>>>> 
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>> 
>>> 
>>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> 
> 
> 
> 
> -- 
> -
> http://zzzoot.blogspot.com/
> -
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Glen Newton <gl...@gmail.com>.

Cool! Sounds great!  :-)

Any pointers to a (Lucene) example that attaches a payload to a
start..end span that is more than one token?

thanks,
-Glen

On Thu, Dec 13, 2012 at 5:03 PM, Lance Norskog <go...@gmail.com> wrote:
> I should not have added that note. The Opennlp patch gives a concrete
> example of adding an annotation to text.
>
>
> On 12/13/2012 01:54 PM, Glen Newton wrote:
>>
>> It is not clear this is exactly what is needed/being discussed.
>>
>>  From the issue:
>> "We are also planning a Tokenizer/TokenFilter that can put parts of
>> speech as either payloads (PartOfSpeechAttribute?) on a token or at
>> the same position."
>>
>> This adds it to a token, not a span. 'same position' does not suggest
>> it also records the end position.
>>
>> -Glen
>>
>> On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog <go...@gmail.com> wrote:
>>>
>>> Parts-of-speech is available now, in the indexer.
>>>
>>> LUCENE-2899 adds OpenNLP to the Lucene&Solr codebase. It does
>>> parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an
>>> Apache
>>> project for natural-language processing.
>>>
>>> Some parts are in Solr that could be in Lucene.
>>>
>>> https://issues.apache.org/jira/browse/lucene-2899
>>>
>>>
>>> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>>>>>
>>>>>> Is there any (preliminary) code checked in somewhere that I can look
>>>>>> at,
>>>>>> that would help me understand the practical issues that would need to
>>>>>> be
>>>>>> addressed?
>>>>>
>>>>> Maybe we can make this more concrete: what new attribute are you
>>>>> needing to record in the postings and access at search time?
>>>>
>>>> For example:
>>>>    - part of speech of a token.
>>>>    - syntactic parse subtree (over a span).
>>>>    - semantically normalized phrase (to canonical text or ontological
>>>> code).
>>>>    - semantic group (of a span).
>>>>    - coreference link.
>>>>
>>>> stephen
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>



-- 
-
http://zzzoot.blogspot.com/
-

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Lance Norskog <go...@gmail.com>.

I should not have added that note. The Opennlp patch gives a concrete 
example of adding an annotation to text.

On 12/13/2012 01:54 PM, Glen Newton wrote:
> It is not clear this is exactly what is needed/being discussed.
>
>  From the issue:
> "We are also planning a Tokenizer/TokenFilter that can put parts of
> speech as either payloads (PartOfSpeechAttribute?) on a token or at
> the same position."
>
> This adds it to a token, not a span. 'same position' does not suggest
> it also records the end position.
>
> -Glen
>
> On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog <go...@gmail.com> wrote:
>> Parts-of-speech is available now, in the indexer.
>>
>> LUCENE-2899 adds OpenNLP to the Lucene&Solr codebase. It does
>> parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an Apache
>> project for natural-language processing.
>>
>> Some parts are in Solr that could be in Lucene.
>>
>> https://issues.apache.org/jira/browse/lucene-2899
>>
>>
>> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>>>> that would help me understand the practical issues that would need to be
>>>>> addressed?
>>>> Maybe we can make this more concrete: what new attribute are you
>>>> needing to record in the postings and access at search time?
>>> For example:
>>>    - part of speech of a token.
>>>    - syntactic parse subtree (over a span).
>>>    - semantically normalized phrase (to canonical text or ontological
>>> code).
>>>    - semantic group (of a span).
>>>    - coreference link.
>>>
>>> stephen
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Glen Newton <gl...@gmail.com>.

It is not clear this is exactly what is needed/being discussed.

>From the issue:
"We are also planning a Tokenizer/TokenFilter that can put parts of
speech as either payloads (PartOfSpeechAttribute?) on a token or at
the same position."

This adds it to a token, not a span. 'same position' does not suggest
it also records the end position.

-Glen

On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog <go...@gmail.com> wrote:
> Parts-of-speech is available now, in the indexer.
>
> LUCENE-2899 adds OpenNLP to the Lucene&Solr codebase. It does
> parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an Apache
> project for natural-language processing.
>
> Some parts are in Solr that could be in Lucene.
>
> https://issues.apache.org/jira/browse/lucene-2899
>
>
> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>>>
>>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>>> that would help me understand the practical issues that would need to be
>>>> addressed?
>>>
>>> Maybe we can make this more concrete: what new attribute are you
>>> needing to record in the postings and access at search time?
>>
>> For example:
>>   - part of speech of a token.
>>   - syntactic parse subtree (over a span).
>>   - semantically normalized phrase (to canonical text or ontological
>> code).
>>   - semantic group (of a span).
>>   - coreference link.
>>
>> stephen
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>



-- 
-
http://zzzoot.blogspot.com/
-

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Lance Norskog <go...@gmail.com>.

Parts-of-speech is available now, in the indexer.

LUCENE-2899 adds OpenNLP to the Lucene&Solr codebase. It does 
parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an 
Apache project for natural-language processing.

Some parts are in Solr that could be in Lucene.

https://issues.apache.org/jira/browse/lucene-2899

On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>> that would help me understand the practical issues that would need to be
>>> addressed?
>> Maybe we can make this more concrete: what new attribute are you
>> needing to record in the postings and access at search time?
> For example:
>   - part of speech of a token.
>   - syntactic parse subtree (over a span).
>   - semantically normalized phrase (to canonical text or ontological code).
>   - semantic group (of a span).
>   - coreference link.
>
> stephen
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Carsten Schnober <sc...@ids-mannheim.de>.

Am 18.12.2012 12:36, schrieb Michael McCandless:
> On Thu, Dec 13, 2012 at 8:32 AM, Carsten Schnober
> <sc...@ids-mannheim.de> wrote:


>> This is a relatively easy example, but how would deal with e.g.
>> annotations that include multiple tokens (as in spans), such as chunks,
>> or relations between tokens (and token spans), as in the coreference
>> links example given by Steven above?
> 
> I think you'd do something like what SynonymFilter does for
> multi-token synonyms.
> 
> Eg a synonym for "wireless network" - > wifi would insert a new token
> ("wifi"), overlapped on wireless.
> 
> Lucene doesn't store the end span, but if this is really important for
> your use case, you could add a payload to that wifi token that would
> encode the number of positions that the inserted token spans (2 in
> this case), and then the information would be present in the index.
> 
> You'd still need to do something custom at read/search time to decode
> this end position and do something interesting with it ...

Thanks for the pointer!
I'm still puzzled whether something there is an optimal way to encode
(labelled) relations between tokens or even spans; the latter part would
probably lead back to the synonym-like solution.
Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Thu, Dec 13, 2012 at 8:32 AM, Carsten Schnober
<sc...@ids-mannheim.de> wrote:
> Am 13.12.2012 12:27, schrieb Michael McCandless:
>
>>> For example:
>>>  - part of speech of a token.
>>>  - syntactic parse subtree (over a span).
>>>  - semantically normalized phrase (to canonical text or ontological code).
>>>  - semantic group (of a span).
>>>  - coreference link.
>>
>> So for example part-of-speech is a per-Token-position attribute.
>>
>> Today the easiest way to handle this is to encode these attributes
>> into a Payload, which is straightforward (make a custom TokenFilter
>> that creates the payload).
>>
>> At search time you would then use e.g. PayloadTermQuery to decode the
>> Payload and do something with it to alter how the query is being
>> scored.
>
> This is a relatively easy example, but how would deal with e.g.
> annotations that include multiple tokens (as in spans), such as chunks,
> or relations between tokens (and token spans), as in the coreference
> links example given by Steven above?

I think you'd do something like what SynonymFilter does for
multi-token synonyms.

Eg a synonym for "wireless network" - > wifi would insert a new token
("wifi"), overlapped on wireless.

Lucene doesn't store the end span, but if this is really important for
your use case, you could add a payload to that wifi token that would
encode the number of positions that the inserted token spans (2 in
this case), and then the information would be present in the index.

You'd still need to do something custom at read/search time to decode
this end position and do something interesting with it ...

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Carsten Schnober <sc...@ids-mannheim.de>.

Am 13.12.2012 12:27, schrieb Michael McCandless:

>> For example:
>>  - part of speech of a token.
>>  - syntactic parse subtree (over a span).
>>  - semantically normalized phrase (to canonical text or ontological code).
>>  - semantic group (of a span).
>>  - coreference link.
> 
> So for example part-of-speech is a per-Token-position attribute.
> 
> Today the easiest way to handle this is to encode these attributes
> into a Payload, which is straightforward (make a custom TokenFilter
> that creates the payload).
> 
> At search time you would then use e.g. PayloadTermQuery to decode the
> Payload and do something with it to alter how the query is being
> scored.

This is a relatively easy example, but how would deal with e.g.
annotations that include multiple tokens (as in spans), such as chunks,
or relations between tokens (and token spans), as in the coreference
links example given by Steven above?
Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: More about storing NLP-type stuff in the index

Posted by Michael Sokolov <so...@ifactory.com>.

On 1/3/2013 6:16 PM, Wu, Stephen T., Ph.D. wrote:
> I think we've been saying that if we put something in a Payload, it will be
> indexed.  From what I understand of the indexing format, that means that
> what you put in the Payload will be stored in the Lucene index... But it
> won't *itself* be indexed & optimized for search.
>
> That's good, but can we build inverted indices on the contents of the
> Payloads (or the Attributes) as well?
>   Ex1: Say I put semantic role labels like ARG0 into my index. Say my search
> is looking for all instances of ARG0.
>   Ex2: Say I add payloads to terms indicating that they're named entities
> belonging to a semantic group.  Then say my query looks for all instances of
> the "Medications" semantic group.
>
> It's almost like just putting these things in different fields, with the
> exception that the things in different fields need to be linked so you know
> what the original text was.  Maybe the linking can be done via Payloads
> (offsets in the original text)?  If I want to store multiple things at the
> same startOffset then I just use something like SynonymFilter?
>
I've been working on a different but (in a way) related problem: 
indexing text in XML documents.  In that case, we want to associate the 
names of enclosing elements with each term so that it's possible to 
search for (say) "ermine" in the context /doc/title as distinct from 
"ermine" in the context of //paragraph, or something like that.  Anyway 
what I've done doesn't use payloads.  I index two fields that are 
relevant to this: a full text field, which is just the usual text index 
(per document), and then an element-text field which indexes each term 
as a concatenation of the element name and the term value, so: 
title:ermine, doc:ermine, and paragraph:ermine would be typical terms.  
I index all of the enclosing element names for each word at the same 
position (like synonym filter does). This relies on a magical character 
(":") that isn't allowed to appear in any tokens, which is too bad, but 
not terribly restrictive.

Something like this might work for you.  The prefixing also has the nice 
feature that when you enumerate terms, they are ordered first by prefix: 
of course you could flip the order if it were more interesting to list 
all "contexts" for a word rather than all words in a context (or with 
some POS tag).

-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

More about storing NLP-type stuff in the index

Posted by "Wu, Stephen T., Ph.D." <Wu...@mayo.edu>.

I think we've been saying that if we put something in a Payload, it will be
indexed.  From what I understand of the indexing format, that means that
what you put in the Payload will be stored in the Lucene index... But it
won't *itself* be indexed & optimized for search.

That's good, but can we build inverted indices on the contents of the
Payloads (or the Attributes) as well?
 Ex1: Say I put semantic role labels like ARG0 into my index. Say my search
is looking for all instances of ARG0.
 Ex2: Say I add payloads to terms indicating that they're named entities
belonging to a semantic group.  Then say my query looks for all instances of
the "Medications" semantic group.

It's almost like just putting these things in different fields, with the
exception that the things in different fields need to be linked so you know
what the original text was.  Maybe the linking can be done via Payloads
(offsets in the original text)?  If I want to store multiple things at the
same startOffset then I just use something like SynonymFilter?

stephen

On 12/21/12 6:45 AM, "Michael McCandless" <lu...@mikemccandless.com> wrote:

> On Thu, Dec 20, 2012 at 3:54 PM, Wu, Stephen T., Ph.D.
> <Wu...@mayo.edu> wrote:
>>> If you stuff the end of the span into the payload you'd have to create
>>> a custom variant of PhraseQuery to properly match based on the end
>>> span.
>> 
>> How different is this from the functionality already avaialable through
>> SpanQuery?
> 
> Good question!
> 
> I think the difference would be index-time (payload encoding span-end
> + new Query) vs search time (SpanQuery)?
> 
> Ie, with the former (index-time) you'd have a TokenFilter spotting the
> spans and encoding them into the index, and with the latter all
> spotting happens at search time?
> 
> So net/net I guess (?) the results would be the same, but performance
> should be faster if you do it index-time?
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Thu, Dec 20, 2012 at 3:54 PM, Wu, Stephen T., Ph.D.
<Wu...@mayo.edu> wrote:
>> If you stuff the end of the span into the payload you'd have to create
>> a custom variant of PhraseQuery to properly match based on the end
>> span.
>
> How different is this from the functionality already avaialable through
> SpanQuery?

Good question!

I think the difference would be index-time (payload encoding span-end
+ new Query) vs search time (SpanQuery)?

Ie, with the former (index-time) you'd have a TokenFilter spotting the
spans and encoding them into the index, and with the latter all
spotting happens at search time?

So net/net I guess (?) the results would be the same, but performance
should be faster if you do it index-time?

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by "Wu, Stephen T., Ph.D." <Wu...@mayo.edu>.

> If you stuff the end of the span into the payload you'd have to create
> a custom variant of PhraseQuery to properly match based on the end
> span.

How different is this from the functionality already avaialable through
SpanQuery?

stephen


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Thu, Dec 13, 2012 at 10:09 AM, Glen Newton <gl...@gmail.com> wrote:
>>Unfortunately, Lucene doesn't properly index
> spans (it records the start position but not the end position), so
> that limits what kind of matching you can do at search time.
>
> If this could be fixed (i.e. indexing the _end_ of a span) I think all
> the things that I want to do, and the things that can now be done in
> GATE very easily, would be possible using Mike's suggested method.

What would you use the end of the span for?

For example, do you need to do the equivalent of and end-of-span-aware
PhraseQuery?

Ie, so that if the document is "wireless network is down", and I apply
the synonym "wireless network" -> "wifi" at indexing time, then the
end-span-aware-PhraseQuery would match "wifi is down" (unlike today).

If you stuff the end of the span into the payload you'd have to create
a custom variant of PhraseQuery to properly match based on the end
span.

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by "Wu, Stephen T., Ph.D." <Wu...@mayo.edu>.

That would be really nice. Full standoff annotations open a lot of doors.

If we had them, though, I'm not sure exactly which of Mike's methods you'd
use?  I thought payloads were completely token-based and could not be
attached to spans regardless.  And the SynonymFilter is really to mimic the
behavior of multiple tokens/span... (though maybe you could add the other
tokens in as "synonyms" and then skip the tokens you added...?).
Mike, is all this stuff possible if we can just index the ends of spans?

stephen


On 12/13/12 9:09 AM, "Glen Newton" <gl...@gmail.com> wrote:

>> Unfortunately, Lucene doesn't properly index
> spans (it records the start position but not the end position), so
> that limits what kind of matching you can do at search time.
> 
> If this could be fixed (i.e. indexing the _end_ of a span) I think all
> the things that I want to do, and the things that can now be done in
> GATE very easily, would be possible using Mike's suggested method.
> 
> 
> -Glen
> 
> On Thu, Dec 13, 2012 at 6:27 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>> On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D.
>> <Wu...@mayo.edu> wrote:
>>>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>>>> that would help me understand the practical issues that would need to be
>>>>> addressed?
>>>> 
>>>> Maybe we can make this more concrete: what new attribute are you
>>>> needing to record in the postings and access at search time?
>>> 
>>> For example:
>>>  - part of speech of a token.
>>>  - syntactic parse subtree (over a span).
>>>  - semantically normalized phrase (to canonical text or ontological code).
>>>  - semantic group (of a span).
>>>  - coreference link.
>> 
>> So for example part-of-speech is a per-Token-position attribute.
>> 
>> Today the easiest way to handle this is to encode these attributes
>> into a Payload, which is straightforward (make a custom TokenFilter
>> that creates the payload).
>> 
>> At search time you would then use e.g. PayloadTermQuery to decode the
>> Payload and do something with it to alter how the query is being
>> scored.
>> 
>> For the span-like attributes (eg a syntactic parse, semantically
>> normalized phrase) I think you'd need to do something like
>> SynonymFilter in your analysis, i.e. insert new tokens at the position
>> where the span started.  Unfortunately, Lucene doesn't properly index
>> spans (it records the start position but not the end position), so
>> that limits what kind of matching you can do at search time.
>> 
>> Mike McCandless
>> 
>> http://blog.mikemccandless.com
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Glen Newton <gl...@gmail.com>.

>Unfortunately, Lucene doesn't properly index
spans (it records the start position but not the end position), so
that limits what kind of matching you can do at search time.

If this could be fixed (i.e. indexing the _end_ of a span) I think all
the things that I want to do, and the things that can now be done in
GATE very easily, would be possible using Mike's suggested method.


-Glen

On Thu, Dec 13, 2012 at 6:27 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D.
> <Wu...@mayo.edu> wrote:
>>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>>> that would help me understand the practical issues that would need to be
>>>> addressed?
>>>
>>> Maybe we can make this more concrete: what new attribute are you
>>> needing to record in the postings and access at search time?
>>
>> For example:
>>  - part of speech of a token.
>>  - syntactic parse subtree (over a span).
>>  - semantically normalized phrase (to canonical text or ontological code).
>>  - semantic group (of a span).
>>  - coreference link.
>
> So for example part-of-speech is a per-Token-position attribute.
>
> Today the easiest way to handle this is to encode these attributes
> into a Payload, which is straightforward (make a custom TokenFilter
> that creates the payload).
>
> At search time you would then use e.g. PayloadTermQuery to decode the
> Payload and do something with it to alter how the query is being
> scored.
>
> For the span-like attributes (eg a syntactic parse, semantically
> normalized phrase) I think you'd need to do something like
> SynonymFilter in your analysis, i.e. insert new tokens at the position
> where the span started.  Unfortunately, Lucene doesn't properly index
> spans (it records the start position but not the end position), so
> that limits what kind of matching you can do at search time.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>



-- 
-
http://zzzoot.blogspot.com/
-

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D.
<Wu...@mayo.edu> wrote:
>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>> that would help me understand the practical issues that would need to be
>>> addressed?
>>
>> Maybe we can make this more concrete: what new attribute are you
>> needing to record in the postings and access at search time?
>
> For example:
>  - part of speech of a token.
>  - syntactic parse subtree (over a span).
>  - semantically normalized phrase (to canonical text or ontological code).
>  - semantic group (of a span).
>  - coreference link.

So for example part-of-speech is a per-Token-position attribute.

Today the easiest way to handle this is to encode these attributes
into a Payload, which is straightforward (make a custom TokenFilter
that creates the payload).

At search time you would then use e.g. PayloadTermQuery to decode the
Payload and do something with it to alter how the query is being
scored.

For the span-like attributes (eg a syntactic parse, semantically
normalized phrase) I think you'd need to do something like
SynonymFilter in your analysis, i.e. insert new tokens at the position
where the span started.  Unfortunately, Lucene doesn't properly index
spans (it records the start position but not the end position), so
that limits what kind of matching you can do at search time.

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by wgggfiy <wu...@qq.com>.

Thx very much!
Lingpipe and Gate are very useful, and new to me,
but is it too larger to realize the custom like
class TestPostingItem
{
        int termId;
        long startOffset;
        long endOffset;
        float score;
        int segId;
        long timeStamp;
} ?



-----
--------------------------
Email: wuqiu.main@qq.com
--------------------------
--
View this message in context: http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933p4026571.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Glen Newton <gl...@gmail.com>.

+10

These are the kind of things you can do in GATE[1] using annotations[2].
A VERY useful feature.

-Glen

[1]http://gate.ac.uk
[2]http://gate.ac.uk/wiki/jape-repository/annotations.html

On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D.
<Wu...@mayo.edu> wrote:
>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>> that would help me understand the practical issues that would need to be
>>> addressed?
>>
>> Maybe we can make this more concrete: what new attribute are you
>> needing to record in the postings and access at search time?
>
> For example:
>  - part of speech of a token.
>  - syntactic parse subtree (over a span).
>  - semantically normalized phrase (to canonical text or ontological code).
>  - semantic group (of a span).
>  - coreference link.
>
> stephen
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>



-- 
-
http://zzzoot.blogspot.com/
-

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org