Posted to java-user@lucene.apache.org by Igor Shalyminov <is...@yandex-team.ru> on 2013/01/05 13:36:55 UTC

Lucene for a linguistic corpus

Hello!

I'm considering Lucene as an engine for linguistic corpus search.

There's a feature in this search: each word is treated as ambiguous, i.e., it has multiple sets of grammatical annotations (the number of sets is bounded: a word can have at most 8 parses).
For example, in the phrase "A man saw an elephant", "saw" has the following annotations (let's also say that its position in the index is 1234):

{lemma: see, pos: verb, tense: past}, {lemma: saw, pos: noun, number: singular}

Normally, we index each annotation as an independent feature (i.e., there are posting lists for "lemma", "pos", "number", etc.). The problem is that for the query "pos = Verb AND number = Singular" we DON'T want to match position 1234, because the two values appear in different parses.
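
To make the false match concrete, here is a minimal sketch in plain Java (no Lucene involved). The class NaiveAnnotationIndex and its methods are hypothetical names chosen for illustration; the sketch only assumes that every attribute value gets its own posting list keyed by token position, as described above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration, not Lucene API: one posting list per "attribute=value".
class NaiveAnnotationIndex {
    // "attribute=value" -> sorted token positions where that value occurs in ANY parse
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Index one parse of the token at the given position, one posting per attribute.
    void addParse(int position, Map<String, String> parse) {
        for (Map.Entry<String, String> e : parse.entrySet()) {
            postings.computeIfAbsent(e.getKey() + "=" + e.getValue(), k -> new TreeSet<>())
                    .add(position);
        }
    }

    // Plain conjunction: positions present in every term's posting list.
    SortedSet<Integer> and(String... terms) {
        TreeSet<Integer> result = null;
        for (String term : terms) {
            TreeSet<Integer> list = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(list);
            } else {
                result.retainAll(list);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // The two parses of "saw" at position 1234 from the example above.
    static NaiveAnnotationIndex example() {
        NaiveAnnotationIndex index = new NaiveAnnotationIndex();
        index.addParse(1234, Map.of("lemma", "see", "pos", "verb", "tense", "past"));
        index.addParse(1234, Map.of("lemma", "saw", "pos", "noun", "number", "singular"));
        return index;
    }

    public static void main(String[] args) {
        // Position 1234 is returned although no single parse is both verb and singular.
        System.out.println(example().and("pos=verb", "number=singular"));
    }
}
```

The conjunction only sees positions, so the information about which parse each value came from is already lost by the time the lists are intersected.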

As a solution, one might consider indexing all annotation subsets (this would increase index size and query complexity), searching with regexps (but the search would be dead slow), or indexing parses rather than words (but queries with a given distance between words would break). None of these solutions is acceptable.

I think it would be more effective to store the parse index as a payload in each attribute's posting list entry and use it at the intersection stage. E.g., we have a posting list for 'pos = Verb' like ...|...|1.1234|...|..., and a posting list for 'number = Singular' like ...|...|2.1234|...|... While processing a query like 'pos = Verb AND number = Singular', 'x.1234' will be accepted at every stage of posting list processing until the intersection stage, where it will be rejected because the parse indexes do not match.
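
The idea can be sketched in plain Java, independent of Lucene. Everything here is hypothetical: the Posting class, the intersect method, and in particular the encoding of the parse index as a bitmask (bit 0 for parse 1, bit 1 for parse 2, and so on), which is just one convenient assumption given that a word has at most 8 parses and the mask therefore fits in one byte. The entry at position 2000 is made-up data added to show an accepted hit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a parse-aware AND over two position-sorted posting lists.
class ParseAwareIntersection {
    // One posting entry: token position plus a bitmask of the parses it occurs in.
    static final class Posting {
        final int position;
        final int parseMask; // assumed encoding: bit (n - 1) set means "occurs in parse n"

        Posting(int position, int parseMask) {
            this.position = position;
            this.parseMask = parseMask;
        }
    }

    // Merge-style intersection; a position is a hit only if both terms share a parse.
    static List<Integer> intersect(List<Posting> a, List<Posting> b) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            Posting pa = a.get(i), pb = b.get(j);
            if (pa.position < pb.position) {
                i++;
            } else if (pa.position > pb.position) {
                j++;
            } else {
                // Same position: accept only when the parse masks overlap.
                if ((pa.parseMask & pb.parseMask) != 0) {
                    hits.add(pa.position);
                }
                i++;
                j++;
            }
        }
        return hits;
    }

    static List<Integer> exampleHits() {
        // 'pos = Verb': parse 1 at position 1234, parse 1 at (made-up) position 2000
        List<Posting> posVerb = Arrays.asList(new Posting(1234, 0b01), new Posting(2000, 0b01));
        // 'number = Singular': parse 2 at position 1234, parse 1 at position 2000
        List<Posting> numberSingular = Arrays.asList(new Posting(1234, 0b10), new Posting(2000, 0b01));
        return intersect(posVerb, numberSingular);
    }

    public static void main(String[] args) {
        System.out.println(exampleHits()); // prints [2000]; 1234 is rejected
    }
}
```

A bitmask rather than a single parse number lets one posting entry cover the case where the same value occurs in several parses of a word. Whether Lucene's own conjunction logic can be made to consult payloads in this way is exactly the open question of this message; the sketch only shows the intended semantics.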

I am also new to Lucene, so could you please tell me whether this idea is implementable in Lucene, and how much effort the implementation would take?


-- 
Best Regards,
Igor Shalyminov

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Lucene for a linguistic corpus

Posted by "Wu, Stephen T., Ph.D." <Wu...@mayo.edu>.

>> For example, in the phrase "A man saw an elephant", "saw" has the following
>> annotations (let's also say that its position in the index is 1234):
>> 
>> {lemma: see, pos: verb, tense: past}, {lemma: saw, pos: noun, number:
>> singular}
>> 
>> I think it would be more effective to store the parse index as a payload in
>> each attribute's posting list entry and use it at the intersection stage.
>> E.g., we have a posting list for 'pos = Verb' like ...|...|1.1234|...|...,
>> and a posting list for 'number = Singular' like ...|...|2.1234|...|... While
>> processing a query like 'pos = Verb AND number = Singular', 'x.1234' will be
>> accepted at every stage of posting list processing until the intersection
>> stage, where it will be rejected because the parse indexes do not match.
We're working on something very similar.
Are there really posting lists like this (e.g., separate lists for pos=Verb,
number=Singular) for things in Payloads?  I think some previous discussion
was saying this kind of posting list is not available.  I couldn't find
anything like that in the documentation about the index format. If there
are, this would be really efficient.

> You might be able to insert your parses as payloads on a term and then
> implement a scorer extension (override computePayloadFactor) to handle your
> join cases for a given word.  You may also need to extend PayloadQuery or
> PayloadTermQuery.  Note, I don't know how well this will perform.
We've done it this way before, storing a slightly different set of
information in the Payload.  I thought making use of a Payload, though,
requires you to iterate through all the tokens, whether in the Analyzer
(i.e., in a TokenFilter) or Similarity (in an overridden scorePayload()
function).

If I'm right, then filtering this out at intersection time might not be
quite as efficient as you're talking about, but it's definitely a reasonable
way to do it.

stephen



Re: Lucene for a linguistic corpus

Posted by Grant Ingersoll <gs...@apache.org>.

Hi Igor,

On Jan 5, 2013, at 7:36 AM, Igor Shalyminov wrote:

> Hello!
> 
> I'm considering Lucene as an engine for linguistic corpus search.
> 
> There's a feature in this search: each word is treated as ambiguous, i.e., it has multiple sets of grammatical annotations (the number of sets is bounded: a word can have at most 8 parses).
> For example, in the phrase "A man saw an elephant", "saw" has the following annotations (let's also say that its position in the index is 1234):
> 
> {lemma: see, pos: verb, tense: past}, {lemma: saw, pos: noun, number: singular}
> 
> Normally, we index each annotation as an independent feature (i.e., there are posting lists for "lemma", "pos", "number", etc.). The problem is that for the query "pos = Verb AND number = Singular" we DON'T want to match position 1234, because the two values appear in different parses.
> 
> As a solution, one might consider indexing all annotation subsets (this would increase index size and query complexity), searching with regexps (but the search would be dead slow)

Might be worth trying, esp. in Lucene 4 with some of the new automaton stuff.  At least you will have a baseline.

> , or indexing parses rather than words (but queries with a given distance between words would break). None of these solutions is acceptable.
> 
> I think it would be more effective to store the parse index as a payload in each attribute's posting list entry and use it at the intersection stage. E.g., we have a posting list for 'pos = Verb' like ...|...|1.1234|...|..., and a posting list for 'number = Singular' like ...|...|2.1234|...|... While processing a query like 'pos = Verb AND number = Singular', 'x.1234' will be accepted at every stage of posting list processing until the intersection stage, where it will be rejected because the parse indexes do not match.
> 
> I am also new to Lucene, so could you please tell me whether this idea is implementable in Lucene, and how much effort the implementation would take?

You might be able to insert your parses as payloads on a term and then implement a scorer extension (override computePayloadFactor) to handle your join cases for a given word.  You may also need to extend PayloadQuery or PayloadTermQuery.  Note, I don't know how well this will perform.

So, "saw" would carry the two parses above as payloads, and the logic in your new Query class would be able to tell from those two payloads that neither parse satisfies both Verb and Singular, and would then return a score of 0.  There is likely other work involved to make sure it all works appropriately, but that is where I would start.  Others might have different approaches.
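
As a rough sketch of that per-token check, here is the decision logic in plain Java, without any Lucene classes. The payload format (parses separated by ';', features by ','), the class ParsePayloadCheck, and the methods encode and factor are all assumptions made up for this illustration; in a real implementation the equivalent logic would sit in whatever payload-scoring hook ends up being overridden.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical sketch: all parses of a token are packed into one payload.
class ParsePayloadCheck {
    // Assumed payload format: "k=v,k=v;k=v,k=v" (parses split by ';', features by ',').
    static byte[] encode(List<Map<String, String>> parses) {
        StringJoiner all = new StringJoiner(";");
        for (Map<String, String> parse : parses) {
            StringJoiner one = new StringJoiner(",");
            // TreeMap gives a stable feature order inside each parse.
            for (Map.Entry<String, String> e : new TreeMap<>(parse).entrySet()) {
                one.add(e.getKey() + "=" + e.getValue());
            }
            all.add(one.toString());
        }
        return all.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Returns 1 if some single parse contains every required feature, else 0.
    static float factor(byte[] payload, Set<String> required) {
        String text = new String(payload, StandardCharsets.UTF_8);
        for (String parse : text.split(";")) {
            Set<String> features = new HashSet<>(Arrays.asList(parse.split(",")));
            if (features.containsAll(required)) {
                return 1.0f;
            }
        }
        return 0.0f;
    }

    // Payload for "saw" with the two parses from the original message.
    static byte[] sawPayload() {
        return encode(List.of(
                Map.of("lemma", "see", "pos", "verb", "tense", "past"),
                Map.of("lemma", "saw", "pos", "noun", "number", "singular")));
    }

    public static void main(String[] args) {
        System.out.println(factor(sawPayload(), Set.of("pos=verb", "number=singular"))); // 0.0
        System.out.println(factor(sawPayload(), Set.of("pos=verb", "tense=past")));      // 1.0
    }
}
```

A text payload keeps the sketch readable; a compact binary encoding would presumably be preferable in a real index.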

I'd have to play around w/ the code a bit more to make this work, but I think it is doable in Lucene.  I can't say how long it would take you, b/c I have no clue as to what you know about Lucene.  For someone who knows Lucene, it's probably anywhere from a few days to a week or two.  

-Grant