
Posted to java-user@lucene.apache.org by David Villarejo <vi...@gmail.com> on 2015/03/03 19:21:40 UTC

Part of speech search with lucene

After many Google searches I decided to post my problem here, hoping that
someone can help me. What I want to achieve is to perform queries like the
following (don't worry about the query format):

q1: (adjective) "jumps" (preposition) // any adj followed by "jumps"
followed by any prep.
q2: (adjective:brown) "jumps" (preposition) // brown as adj. followed by
"jumps" followed by any prep.
q3: (adjective:brown) (verb:jumps) (preposition) // brown as adj followed
by jumps as verb followed by any preposition.

In a more general form, what I want is
(POS[:specific_word]) (POS[:specific_word]) (POS[:specific_word])

For that, I have the text tagged as follows:

the|[pos:DT][lemma:the] quick|[pos:JJ][lemma:quick]
brown|[pos:JJ][lemma:brown] fox|[pos:NN][lemma:fox]
jumps|[pos:NNS][lemma:jump] over|[pos:IN][lemma:over]
the|[pos:DT][lemma:the] lazy|[pos:JJ][lemma:lazy] dog|[pos:NN][lemma:dog]
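
For illustration, here is a minimal sketch of how text in this format could be parsed before indexing. It is plain Java and does not use Lucene; the names `TaggedTextParser` and `TaggedToken` are hypothetical, and the sketch assumes tokens are separated by whitespace and that attribute values never contain `]`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: parses tokens of the form word|[key:value][key:value]. */
final class TaggedTextParser {

    /** One surface word plus its annotations, e.g. "quick" with {pos=JJ, lemma=quick}. */
    static final class TaggedToken {
        final String word;
        final Map<String, String> attributes;

        TaggedToken(String word, Map<String, String> attributes) {
            this.word = word;
            this.attributes = attributes;
        }
    }

    // Matches one [key:value] group.
    private static final Pattern ATTRIBUTE = Pattern.compile("\\[([^:\\]]+):([^\\]]*)\\]");

    static List<TaggedToken> parse(String text) {
        List<TaggedToken> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // empty input yields no tokens
            }
            int bar = chunk.indexOf('|');
            String word = bar < 0 ? chunk : chunk.substring(0, bar);
            Map<String, String> attributes = new LinkedHashMap<>();
            if (bar >= 0) {
                Matcher m = ATTRIBUTE.matcher(chunk.substring(bar + 1));
                while (m.find()) {
                    attributes.put(m.group(1), m.group(2));
                }
            }
            tokens.add(new TaggedToken(word, attributes));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (TaggedToken t : parse("the|[pos:DT][lemma:the] quick|[pos:JJ][lemma:quick]")) {
            System.out.println(t.word + " " + t.attributes);
        }
    }
}
```

Each parsed token keeps its position in the list, which is the information a positional query needs later on.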

My first idea was to index the extra info for each term as a payload and
then use PayloadNearQuery to access the payload of each span. The problem
is that PayloadNearQuery matches the terms first and only then accesses
their payloads, so none of the three queries above will work (correct me
if I'm wrong).

My second idea was to index the extra info as synonyms of the term, but
then the second query won't work, since I can't ask whether the first term
is an adjective and the specific word "brown" at the same time.

Any suggestions on how to address this problem will be appreciated.


David.

Re: Part of speech search with lucene

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

You're welcome; thanks for letting us know

-Mike

On 03/04/2015 01:21 PM, David Villarejo wrote:
> Hi Mike,
>
> Your solution works! I've been trying it with PhraseQuery and it works
> pretty well.
>
> Thank you so much.
>
> David.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Part of speech search with lucene

Posted by David Villarejo <vi...@gmail.com>.

Hi Mike,

Your solution works! I've been trying it with PhraseQuery and it works
pretty well.

Thank you so much.

David.

2015-03-03 23:00 GMT+01:00 Michael Sokolov <ms...@safaribooksonline.com>:

> I believe you can accomplish what you are describing with, say,
> PhraseQuery. Note that it has
>
> public void add(Term term, int position)
>
> which enables searching for multiple terms at the same position.
>
> You should also be able to encode different kinds of attributes using
> text tricks like the one I suggested, or with payloads, though I'm less
> clear about how to use payloads in queries.
>
> -Mike

Re: Part of speech search with lucene

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

I believe you can accomplish what you are describing with, say,
PhraseQuery. Note that it has

public void add(Term term, int position)

which enables searching for multiple terms at the same position.

You should also be able to encode different kinds of attributes using
text tricks like the one I suggested, or with payloads, though I'm less
clear about how to use payloads in queries.
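
To make the "multiple terms at the same position" idea concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: `PhraseSketch`, `Index`, and the `pos:` token prefix are hypothetical names chosen for illustration, and the only thing borrowed from PhraseQuery is the shape of `add(term, position)`. The model indexes a single document and checks whether some start position satisfies every term at its relative position.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Plain-Java model (not Lucene) of a phrase query whose terms carry explicit positions. */
final class PhraseSketch {

    /** Toy single-document index: token -> positions where it occurs. */
    static final class Index {
        private final Map<String, Set<Integer>> postings = new HashMap<>();

        /** Stacks all given tokens at the same position. */
        Index put(int position, String... tokens) {
            for (String token : tokens) {
                postings.computeIfAbsent(token, k -> new HashSet<>()).add(position);
            }
            return this;
        }

        Set<Integer> positionsOf(String token) {
            return postings.getOrDefault(token, Collections.<Integer>emptySet());
        }
    }

    private final List<String> terms = new ArrayList<>();
    private final List<Integer> offsets = new ArrayList<>();

    /** Mirrors the shape of add(Term term, int position); position is relative to the phrase start. */
    PhraseSketch add(String term, int position) {
        terms.add(term);
        offsets.add(position);
        return this;
    }

    /** True if some start s exists such that every term occurs at s + its relative position. */
    boolean matches(Index index) {
        Set<Integer> starts = null;
        for (int i = 0; i < terms.size(); i++) {
            Set<Integer> candidates = new HashSet<>();
            for (int p : index.positionsOf(terms.get(i))) {
                candidates.add(p - offsets.get(i));
            }
            if (starts == null) {
                starts = candidates;
            } else {
                starts.retainAll(candidates);
            }
            if (starts.isEmpty()) {
                return false;
            }
        }
        return starts != null; // an empty query matches nothing
    }

    public static void main(String[] args) {
        Index doc = new Index()
                .put(0, "the", "pos:DT").put(1, "quick", "pos:JJ").put(2, "brown", "pos:JJ")
                .put(3, "fox", "pos:NN").put(4, "jumps", "pos:NNS").put(5, "over", "pos:IN");
        // "brown" as an adjective, then "jumps" two positions later, then any preposition.
        PhraseSketch query = new PhraseSketch()
                .add("pos:JJ", 0).add("brown", 0).add("jumps", 2).add("pos:IN", 3);
        System.out.println(query.matches(doc));
    }
}
```

Adding two terms with the same position, as with `pos:JJ` and `brown` here, is what expresses "this word, and it must be an adjective" in a single query.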

-Mike

On 03/03/2015 04:41 PM, David Villarejo wrote:
> What you propose is good if you want to index only the pos of a token. But
> I want to index some extra info, such as the "lemma" of a token, phonetic
> encoding, etc. Sorry, my previous post was not general enough.
> Imagine you want to ask this:
>
> an adj whose lemma is "quick" followed by "brown" followed by a noun whose
> phonetic encoding is "fots".
>
> So, the main problem is you cannot ask if several "synonyms" exist at the
> same position.
>
> Thank you Michael for your answer.

Re: Part of speech search with lucene

Posted by David Villarejo <vi...@gmail.com>.

What you propose is good if you want to index only the pos of a token. But
I want to index some extra info, such as the "lemma" of a token, phonetic
encoding, etc. Sorry, my previous post was not general enough.
Imagine you want to ask this:

an adj whose lemma is "quick" followed by "brown" followed by a noun whose
phonetic encoding is "fots".

So, the main problem is you cannot ask if several "synonyms" exist at the
same position.
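
The requirement can be stated as a small plain-Java sketch, independent of Lucene: each position holds a set of tokens, and a query clause names several tokens that must all be present at one position. `StackedTokens` and the `pos:`, `lemma:`, and `phon:` prefixes are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of a field where several tokens share one position. */
final class StackedTokens {

    // positions.get(i) holds every token stacked at position i.
    private final List<Set<String>> positions = new ArrayList<>();

    /** Appends one position carrying all the given tokens. */
    StackedTokens addPosition(String... tokens) {
        positions.add(new HashSet<>(Arrays.asList(tokens)));
        return this;
    }

    /** True if every required token is present at the given position. */
    boolean hasAll(int position, String... required) {
        if (position < 0 || position >= positions.size()) {
            return false;
        }
        return positions.get(position).containsAll(Arrays.asList(required));
    }

    public static void main(String[] args) {
        StackedTokens doc = new StackedTokens()
                .addPosition("quick", "pos:JJ", "lemma:quick")
                .addPosition("brown", "pos:JJ", "lemma:brown")
                .addPosition("fox", "pos:NN", "lemma:fox", "phon:fots");
        // An adj whose lemma is "quick", then "brown", then a noun whose phonetic encoding is "fots".
        boolean match = doc.hasAll(0, "pos:JJ", "lemma:quick")
                && doc.hasAll(1, "brown")
                && doc.hasAll(2, "pos:NN", "phon:fots");
        System.out.println(match);
    }
}
```

The open question in this message is whether a query can express the `hasAll` check on the index side, rather than whether the tokens can be stored.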

Thank you Michael for your answer.

2015-03-03 20:52 GMT+01:00 Michael Sokolov <ms...@safaribooksonline.com>:

> What if you indexed every word with two synonyms: the plain unadorned word
> and a token formed by concatenating the pos and the word with some unusual
> separator character?
>
> For example, "the quick brown fox" would be:
>
> { the | article:the } { quick | adj:quick } { brown | adj:brown }
> { fox | noun:fox }
>
> with punctuation to suggest the token graph
>
> -Mike

Re: Part of speech search with lucene

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

What if you indexed every word with two synonyms: the plain unadorned 
word and a token formed by concatenating the pos and the word with some 
unusual separator character?

For example, "the quick brown fox" would be:

{ the | article:the } { quick | adj:quick } { brown | adj:brown }
{ fox | noun:fox }

with punctuation to suggest the token graph
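
A minimal sketch of that token layout, in plain Java rather than as a Lucene TokenFilter: for every position it produces the plain word and the tag-qualified token. `PosSynonymTokens` is a hypothetical name, and `:` is used as the separator only because the example above uses it; any character that cannot occur in the text would do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: builds the per-position token pairs from the example above. */
final class PosSynonymTokens {

    /** Separator between tag and word. */
    static final char SEPARATOR = ':';

    /**
     * For each position, returns the plain word and the tag-qualified token.
     * words[i] and tags[i] describe the token at position i.
     */
    static List<List<String>> tokenGraph(String[] words, String[] tags) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        List<List<String>> positions = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            positions.add(Arrays.asList(words[i], tags[i] + SEPARATOR + words[i]));
        }
        return positions;
    }

    public static void main(String[] args) {
        String[] words = {"the", "quick", "brown", "fox"};
        String[] tags = {"article", "adj", "adj", "noun"};
        System.out.println(tokenGraph(words, tags));
    }
}
```

In a real analyzer the second token of each pair would be emitted with a position increment of 0, which is how Lucene places synonyms at the same position as the original token.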

-Mike
