Posted to dev@lucene.apache.org by Mike Ree <mi...@olytech.net> on 2013/01/04 19:02:06 UTC

Search across a specified number of boundaries

I have an index of books, and I want to let users find terms that occur in
nearby sentences.

For example:
"TermA NEAR3 TermB" would find every TermA that is within 3 sentences of
TermB.

I have found ways to match TermA in the same sentence as TermB by adding
sentence boundaries to the index and combining SpanNotQuery with
SpanNearQuery, but I can't find a way to extend this idea so that a match
may cross a limited number of sentence boundaries.
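
To make the same-sentence idea concrete, here is a minimal sketch in plain
Python. It is a model of the logic only, not Lucene's SpanNotQuery or
SpanNearQuery API; the function name and the token stream are hypothetical.
Two term positions count as being in the same sentence when no indexed
boundary token lies between them.

```python
from bisect import bisect_left

def same_sentence(pos_a, pos_b, boundary_positions):
    """True if no boundary token lies strictly between the two positions.

    boundary_positions must be sorted ascending and holds the positions
    of the sentence-boundary tokens that were added to the index.
    """
    lo, hi = sorted((pos_a, pos_b))
    # Index of the first boundary at or after the lower term position.
    i = bisect_left(boundary_positions, lo)
    return not (i < len(boundary_positions) and boundary_positions[i] < hi)

# Hypothetical stream: TermA(0) x(1) <boundary>(2) y(3) TermB(4)
boundaries = [2]
print(same_sentence(0, 1, boundaries))  # True: no boundary in between
print(same_sentence(0, 4, boundaries))  # False: the boundary at 2 separates them
```

This is a yes/no test on a single boundary, which is why it does not
directly extend to "at most N boundaries".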

The only thing I can think of is to use a dynamic field per sentence plus a
new type of query that can search across those fields, but before I do
anything I was hoping to get some feedback.

Thanks,
Mike

Re: Search across a specified number of boundaries

Posted by Mike Ree <mi...@olytech.net>.
Mikhail,

Yeah, I considered that originally, but after analyzing the data I found
that it was not possible. Some of the content we analyze contains large
tables that OCR turns into run-on sentences of 500k+ words each. Overall
there are probably around 10k of those anomalies. They stop the ranges from
working: with gaps that large we run out of positions before reaching the
max value an integer can hold, and we run the risk of a future document
breaking it.
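
A rough back-of-the-envelope check of that limit, as a sketch only. It
assumes positions are signed 32-bit integers and that every sentence gets a
fixed-width slot at least as wide as the longest sentence; the 500,000 figure
is a lower bound taken from the "500k+" above, and the function name is
hypothetical.

```python
INT_MAX = 2**31 - 1  # largest signed 32-bit integer

def max_sentence_slots(stride):
    """Number of fixed-width sentence slots that fit in positions 0..INT_MAX."""
    return (INT_MAX + 1) // stride

print(max_sentence_slots(100))      # 21474836 slots with a gap of 100
print(max_sentence_slots(500_000))  # 4294 slots once the gap must cover 500k words
```

Under these assumptions a document could hold only a few thousand sentences,
which is far fewer than a typical book contains.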

I found a Jira issue covering what I'm looking for. I'm going to look into
it and see if I can get it to work for my situation.

https://issues.apache.org/jira/browse/LUCENE-777

Thanks for the help.

Mike

On Mon, Jan 14, 2013 at 11:48 AM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Mike,
>
> When Lucene's Analyzer indexes the text, it adds positions to the index,
> which are later used by SpanQueries. Have you considered the idea of a
> position increment gap? E.g. the first sentence is indexed with word
> positions 0,1,2,3,..., the second sentence with 100,101,102,103,..., the
> third with 200,201,202,... Then applying some span constraint allows you
> to search across or inside the sentences.
> WDYT?
>
>

Re: Search across a specified number of boundaries

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Mike,

When Lucene's Analyzer indexes the text, it adds positions to the index,
which are later used by SpanQueries. Have you considered the idea of a
position increment gap? E.g. the first sentence is indexed with word
positions 0,1,2,3,..., the second sentence with 100,101,102,103,..., the
third with 200,201,202,... Then applying some span constraint allows you to
search across or inside the sentences.
WDYT?
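
The numbering scheme above can be sketched as follows. This is a plain
Python model, not Lucene code: the stride of 100 comes from the example
positions, while the function names and the sample sentences are
hypothetical. It only works if no sentence has more tokens than the stride.

```python
SENTENCE_STRIDE = 100  # assumed slot width, taken from the 0/100/200 example

def assign_positions(sentences, stride=SENTENCE_STRIDE):
    """Map each token to sentence_index * stride + offset within the sentence."""
    positions = {}
    for s_idx, sentence in enumerate(sentences):
        tokens = sentence.split()
        if len(tokens) > stride:
            raise ValueError("sentence does not fit into one slot")
        for offset, token in enumerate(tokens):
            positions.setdefault(token, []).append(s_idx * stride + offset)
    return positions

def within_sentences(positions, term_a, term_b, n, stride=SENTENCE_STRIDE):
    """True if some term_a occurrence is at most n sentences from a term_b."""
    return any(
        abs(pa // stride - pb // stride) <= n
        for pa in positions.get(term_a, [])
        for pb in positions.get(term_b, [])
    )

book = [
    "TermA opens the book", "first filler", "second filler",
    "TermB appears here", "third filler", "fourth filler",
    "fifth filler", "TermC closes the book",
]
pos = assign_positions(book)
print(within_sentences(pos, "TermA", "TermB", 3))  # True: sentences 0 and 3
print(within_sentences(pos, "TermA", "TermC", 3))  # False: sentences 0 and 7
```

Dividing a position by the stride recovers the sentence index, so the
sentence distance is the difference of the two quotients.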


On Sun, Jan 6, 2013 at 6:50 PM, Erick Erickson <er...@gmail.com>wrote:

> Mike:
>
> I'm _really_ stretching here, but you might be able to do something
> interesting
>  with payloads. Say each word had a payload with the sentence number and
> you _somehow_ made use of that information in a custom scorer. But like I
> said, I really have no good idea how to accomplish that...
>
> BTW, in future this kind of question is better asked on the users' list
> (either Lucene or Solr); this list is intended for discussing development
> work.
>
> Best
> Erick
>
>
> On Fri, Jan 4, 2013 at 1:02 PM, Mike Ree <mi...@olytech.net> wrote:
>
>> d terms that are in nearby sentences.
>>
>> IE:
>> "TermA NEAR3 TermB" would find all TermA's that are within 3 sentences of
>> TermB.
>>
>> Have found ways to find TermA within same sentence
>>
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Search across a specified number of boundaries

Posted by Erick Erickson <er...@gmail.com>.
Mike:

I'm _really_ stretching here, but you might be able to do something
interesting
with payloads. Say each word had a payload with the sentence number and
you _somehow_ made use of that information in a custom scorer. But like I
said, I really have no good idea how to accomplish that...
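
One way to picture the payload idea, strictly as a sketch: the plain Python
below models postings that carry a sentence number next to each position. It
does not use Lucene's payload or scorer APIs, and every name in it is
hypothetical. Positions stay consecutive, so no large gaps are needed.

```python
def index_with_sentence_numbers(sentences):
    """Return {term: [(position, sentence_number), ...]} with consecutive positions."""
    postings = {}
    position = 0
    for sentence_no, sentence in enumerate(sentences):
        for token in sentence.split():
            postings.setdefault(token, []).append((position, sentence_no))
            position += 1
    return postings

def near_sentences(postings, term_a, term_b, n):
    """Position pairs (term_a, term_b) whose sentence numbers differ by at most n."""
    return [
        (pos_a, pos_b)
        for pos_a, sent_a in postings.get(term_a, [])
        for pos_b, sent_b in postings.get(term_b, [])
        if abs(sent_a - sent_b) <= n
    ]

sentences = ["TermA one", "two", "three", "TermB four",
             "five", "six", "seven", "TermA eight"]
postings = index_with_sentence_numbers(sentences)
print(near_sentences(postings, "TermA", "TermB", 3))  # [(0, 4)]
```

The second TermA sits in sentence 7, four sentences away from TermB in
sentence 3, so only the first occurrence matches.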

BTW, in future this kind of question is better asked on the users' list
(either Lucene or Solr); this list is intended for discussing development
work.

Best
Erick

On Fri, Jan 4, 2013 at 1:02 PM, Mike Ree <mi...@olytech.net> wrote:

> d terms that are in nearby sentences.
>
> IE:
> "TermA NEAR3 TermB" would find all TermA's that are within 3 sentences of
> TermB.
>
> Have found ways to find TermA within same sentence
>