Posted to solr-user@lucene.apache.org by "Kaminski, Adi" <Ad...@verint.com> on 2019/10/15 06:25:03 UTC

Position search

Hi,
What's the recommended way to search in Solr (assuming 8.2 is used) for specific terms/phrases/expressions while limiting the search by position?
For example, to search only in the first/last 100 words of the document?

Is there any built-in functionality for that?

Thanks in advance,
Adi


This electronic message may contain proprietary and confidential information of Verint Systems Inc., its affiliates and/or subsidiaries. The information is intended to be for the use of the individual(s) or entity(ies) named above. If you are not the intended recipient (or authorized to receive this e-mail for the intended recipient), you may not use, copy, disclose or distribute to anyone this message or any information contained in this message. If you have received this electronic message in error, please notify us by replying to this e-mail.

Re: Position search

Posted by Tim Casey <tc...@gmail.com>.

Adi,

If you are looking for something specific you might want to try something
different.  Before you search 'the end of a document', you might
think about segmenting the document and searching specific segments.  At
the end of a lot of things, like email, there will be signatures.  Those use
fairly standard language: mostly the same in meaning, though they differ in
specific wording.  They are a common segment.

If you are searching something like research papers, then you would be
thinking about the conclusion (?), bibliography (?).  It does not matter,
but there will be specific segments.

I think you will find the last N tokens of a document have some odd
categories within the search results.  I might guess you have a different
purpose in mind.  Either way, you would likely do better to segment what
you are searching.

tim

On Mon, Oct 14, 2019 at 11:25 PM Kaminski, Adi <Ad...@verint.com>
wrote:

> Hi,
> What's the recommended way to search in Solr (assuming 8.2 is used) for
> specific terms/phrases/expressions while limiting the search from position
> perspective.
> For example to search only in the first/last 100 words of the document ?
>
> Is there any built-in functionality for that ?
>
> Thanks in advance,
> Adi

Re: Position search

Posted by Erick Erickson <er...@gmail.com>.

Three things off the top of my head, in order of how long it’d take to implement:

***
If it’s _always_ some distance from the start or end, index special beginning and end tags, perhaps nonsense strings like BEGINslkdjfhsldkfhsdkfh and ENDslakshalskdfhj. Now your searches become phrase queries with slop. Searching for “erick in the first 100 words” becomes:

"BEGINslkdjfhsldkfhsdkfh erick"~100
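A minimal sketch of how an application might build such queries, assuming the sentinel tokens above are indexed as the very first and very last tokens of the field. The helper names are hypothetical, and exactly how many positions a given slop value admits depends on the analysis chain, so the boundary is worth checking against real documents.

```python
# Sentinel tokens as written in the message above; any string that never
# occurs in real text would do.
BEGIN_TOKEN = "BEGINslkdjfhsldkfhsdkfh"
END_TOKEN = "ENDslakshalskdfhj"


def near_start_query(term: str, max_distance: int) -> str:
    """Sloppy phrase query: `term` within `max_distance` of the start tag."""
    return f'"{BEGIN_TOKEN} {term}"~{max_distance}'


def near_end_query(term: str, max_distance: int) -> str:
    """Sloppy phrase query: `term` within `max_distance` of the end tag."""
    return f'"{term} {END_TOKEN}"~{max_distance}'


print(near_start_query("erick", 100))
print(near_end_query("erick", 100))
```

Terms containing query-syntax characters (quotes, colons and so on) would need escaping before being passed to these helpers.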

***
Index each term with a payload indicating its position and use a payload function to determine whether the term should count as a hit. You’d probably have to have a field telling you how long the field is to know what offset “50 words from the end” is.

***
Get into the low-level Lucene code. After all, if you index the position information to support phrase queries, you have exactly the position of the word. NOTE: you’d also probably have to index a separate field with the total length of the field in it so you know what position “100 words from the end” is. I suspect you could make this the most efficient, but I wouldn’t go here unless your performance is poor, as it’d take some development work.
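To make the "position from the end" arithmetic concrete, here is a small pure-Python simulation. It is not Lucene or Solr code: it assumes naive whitespace tokenization and 0-based positions, and it uses the token count as the stored "length" value that both of the last two approaches would need.

```python
def positions(text: str, term: str) -> list:
    """0-based token positions of `term` under naive whitespace tokenization."""
    wanted = term.lower()
    return [i for i, tok in enumerate(text.lower().split()) if tok == wanted]


def in_first_n(text: str, term: str, n: int) -> bool:
    """True if `term` occurs among the first `n` tokens."""
    return any(pos < n for pos in positions(text, term))


def in_last_n(text: str, term: str, n: int) -> bool:
    """True if `term` occurs among the last `n` tokens.

    This needs the total token count, which is why a separate length
    field would have to be indexed alongside the text.
    """
    length = len(text.split())
    return any(pos >= length - n for pos in positions(text, term))


# Sample document taken from later in this thread.
doc = "hello thanks for calling the support how can I help you"
print(in_first_n(doc, "thanks", 3))   # "thanks" sits at position 1
print(in_last_n(doc, "support", 3))   # "support" sits at position 5 of 11 tokens
```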

Note: I haven’t thought these out very carefully so caveat emptor.

Here’s a place to get started with payloads if you decide to go that route:

https://lucidworks.com/post/solr-payloads/

Best,
Erick


> On Oct 16, 2019, at 5:47 AM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
> 
> So are these really text locations, or rather sections of the
> document? If the latter, can you parse out sections during indexing?
> 
> Regards,
>     Alex

Re: Position search

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

Well, after some digging and trying to recall things:
1) The XML Query Parser lets you specify a query in a different way from
normal query parameters:
https://lucene.apache.org/solr/guide/8_1/other-parsers.html#xml-query-parser
2) SpanFirst lets you anchor the search to the start of the text and
specify the initial number of tokens to search within. It is not well
documented, but apparently somebody did some tests:
https://coding-art.blogspot.com/2016/05/apache-solr-xml-query-parser.html
3) SpanFirst is actually a simpler use case of a more general matcher
(SpanPositionRangeQuery)
4) SpanPositionRangeQuery is not yet exposed in Solr, but will be in
8.3: https://issues.apache.org/jira/browse/SOLR-13663

So, I would test your example with XMLParser and SpanFirst (perhaps on
the latest 8.x Solr). If that works, you have an approach for at least
the "first X words" query and know you have an easy upgrade when 8.3 is
out (soon). Alternatively, you can play with SpanFirst and reversal of
the field, which would cover matches near the end.
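As a starting point for that test, here is a sketch that builds the XML for the example from this thread ("thanks" near "support" within the first 30 words). The element and attribute names (`SpanFirst` with `end`, `SpanNear` with `slop` and `inOrder`, `SpanTerm` with `fieldName`) follow the Lucene XML query parser syntax as far as I know it; the field name `text` and the slop of 5 are assumptions, and the whole thing is untested against a live Solr.

```python
import xml.etree.ElementTree as ET


def span_first_near(field: str, terms: list, slop: int, end: int) -> str:
    """XML for: all `terms` near each other, within the first `end` positions."""
    span_first = ET.Element("SpanFirst", {"end": str(end)})
    span_near = ET.SubElement(
        span_first, "SpanNear", {"slop": str(slop), "inOrder": "false"}
    )
    for term in terms:
        # Span terms are typically not analyzed, so pass the indexed form
        # (e.g. already lowercased).
        span_term = ET.SubElement(span_near, "SpanTerm", {"fieldName": field})
        span_term.text = term
    return ET.tostring(span_first, encoding="unicode")


xml_query = span_first_near("text", ["thanks", "support"], slop=5, end=30)
# Value for the q parameter when selecting the parser via local params.
print("{!xmlparser}" + xml_query)
```

Because the position limit is just the `end` attribute, each user search can supply its own boundary (10, 30, 100) without any change to the index.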

Regards,
   Alex.
P.S. Also, SpanFirst apparently boosts matches early in the text
higher than those later. That's in the mailing list archive
discussions, which you can search on the web. E.g.
https://lists.apache.org/thread.html/014db9dcef44a8f9641600d19cfaa528f33bac676b7ac68903537b75@%3Csolr-user.lucene.apache.org%3E

On Wed, 16 Oct 2019 at 08:17, Kaminski, Adi <Ad...@verint.com> wrote:
>
> Hi,
> These are really text positions.
> For example I have a document: "hello thanks for calling the support how can I help you"
>
> And in the application I would like to search for documents that match "thanks" NEAR "support" only in first 30 words of the document (greeting part for example), and not in the middle/end part of the document.
>
> Regards,
> Adi

RE: Position search

Posted by "Kaminski, Adi" <Ad...@verint.com>.

Hi,
These are really text positions.
For example, I have a document: "hello thanks for calling the support how can I help you"

And in the application I would like to search for documents that match "thanks" NEAR "support" only in the first 30 words of the document (the greeting part, for example), and not in the middle/end part of the document.

Regards,
Adi

-----Original Message-----
From: Alexandre Rafalovitch <ar...@gmail.com>
Sent: Wednesday, October 16, 2019 12:48 PM
To: solr-user <so...@lucene.apache.org>
Subject: Re: Position search

So are these really text locations, or rather sections of the document? If the latter, can you parse out sections during indexing?

Regards,
     Alex


Re: Position search

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

So are these really text locations, or rather sections of the
document? If the latter, can you parse out sections during indexing?

Regards,
     Alex

On Wed, Oct 16, 2019, 3:57 AM Kaminski, Adi, <Ad...@verint.com>
wrote:

> Hi,
> Thanks for the responses.
>
> It's a soft boundary which is resulted by dynamic syntax from our
> application. So may vary from different user searches, one user can search
> some "word1" in starting 30 words, and another can search "word2" in
> starting 10 words. The use case is to match some terms/phrase in specific
> document places in order to identify scripts/specific word occurrences.
>
> So I guess copy field won't work here.
>
> Any other suggestions/thoughts ?
> Maybe some hidden position filters in native level to limit from start/end
> of the document ?
>
> Thanks,
> Adi

RE: Position search

Posted by "Kaminski, Adi" <Ad...@verint.com>.

Hi,
Thanks for the responses.

It's a soft boundary that results from dynamic syntax in our application, so it may vary between user searches: one user can search for "word1" in the first 30 words, and another can search for "word2" in the first 10 words. The use case is to match terms/phrases in specific places in the document in order to identify scripts/specific word occurrences.

So I guess copy field won't work here.

Any other suggestions/thoughts?
Maybe some hidden position filters at the native level to limit from the start/end of the document?

Thanks,
Adi

-----Original Message-----
From: Tim Casey <tc...@gmail.com>
Sent: Tuesday, October 15, 2019 11:05 PM
To: solr-user@lucene.apache.org
Subject: Re: Position search

If this is about a normalized query, I would put the normalization text into a specific field.  The reason for this is you may want to search the overall text during any form of expansion phase of searching for data.
That is, maybe you want to know the context of up to the 120th word.  At least you have both.
Also, you may want to note which normalized fields were truncated or were simply too small. This would give some guidance as to the bias of the normalization.  If 95% of the fields were not truncated, there is a chance you are not doing good at normalizing because you have a set of particularly short messages.  So I would expect a small set of side fields remarking this.  This would allow you to carry the measures along with the data.

tim


Re: Position search

Posted by Tim Casey <tc...@gmail.com>.

If this is about a normalized query, I would put the normalization text
into a specific field.  The reason for this is you may want to search the
overall text during any form of expansion phase of searching for data.
That is, maybe you want to know the context of up to the 120th word.  At
least you have both.
Also, you may want to note which normalized fields were truncated or were
simply too small. This would give some guidance as to the bias of the
normalization.  If 95% of the fields were not truncated, there is a chance
you are not normalizing well because you have a set of
particularly short messages.  So I would expect a small set of side fields
recording this.  This would allow you to carry the measures along with the
data.
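A rough sketch of what those side fields could look like at indexing time. The field names (`head_text`, `head_truncated`, `head_short`) and the whitespace word split are assumptions made for illustration, not anything Solr prescribes.

```python
def head_fields(text: str, limit: int = 100) -> dict:
    """Build the truncated search field plus side fields describing it."""
    words = text.split()  # assumption: whitespace-delimited words
    return {
        "full_text": text,                      # overall text stays searchable
        "head_text": " ".join(words[:limit]),   # the normalized, truncated part
        "head_truncated": len(words) > limit,   # original was longer than the limit
        "head_short": len(words) < limit,       # original never reached the limit
    }


def untruncated_share(texts: list, limit: int = 100) -> float:
    """Fraction of documents whose head field was not truncated."""
    if not texts:
        return 0.0
    kept_whole = sum(1 for t in texts if not head_fields(t, limit)["head_truncated"])
    return kept_whole / len(texts)


docs = ["hello thanks for calling", "hi", "one two three four five six"]
print(untruncated_share(docs, limit=5))
```

A share close to 1.0 would be the signal described above: most messages are shorter than the limit, so the truncated field adds little over the full text.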

tim

On Tue, Oct 15, 2019 at 12:19 PM Alexandre Rafalovitch <ar...@gmail.com>
wrote:

> Is the 100 words a hard boundary or a soft one?
>
> If it is a hard one (always 100 words), the easiest is probably copy
> field and in the (unstored) copy, trim off whatever you don't want to
> search. Possibly using regular expressions. Of course, "what's a word"
> is an important question here.
>
> Similarly, you could do that with Update Request Processors and
> clone/process field even before it hits the schema. Then you could
> store the extract for highlighting purposes.
>
> Regards,
>    Alex.

Re: Position search

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

Is the 100 words a hard boundary or a soft one?

If it is a hard one (always 100 words), the easiest is probably a copy
field: in the (unstored) copy, trim off whatever you don't want to
search, possibly using regular expressions. Of course, "what's a word"
is an important question here.

Similarly, you could do that with Update Request Processors and
clone/process the field even before it hits the schema. Then you could
store the extract for highlighting purposes.
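The trimming step itself might look like the sketch below, written in Python only to show the logic (in Solr it would live in an analysis chain or an update processor). It assumes a "word" is a run of letters, digits or underscores, which is exactly the kind of definition that needs deciding up front.

```python
import re

# Assumption: a "word" is a run of word characters. Under this definition
# "don't" counts as two words ("don" and "t").
WORD_RE = re.compile(r"\w+")


def first_n_words(text: str, n: int) -> str:
    """Return `text` cut off after its first `n` words (whole text if shorter)."""
    if n <= 0:
        return ""
    matches = list(WORD_RE.finditer(text))
    if len(matches) <= n:
        return text
    return text[: matches[n - 1].end()]


def last_n_words(text: str, n: int) -> str:
    """Return the tail of `text` starting at its `n`-th word from the end."""
    if n <= 0:
        return ""
    matches = list(WORD_RE.finditer(text))
    if len(matches) <= n:
        return text
    return text[matches[-n].start():]


sample = "hello thanks for calling the support"
print(first_n_words(sample, 3))
print(last_n_words(sample, 2))
```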

Regards,
   Alex.

On Tue, 15 Oct 2019 at 02:25, Kaminski, Adi <Ad...@verint.com> wrote:
>
> Hi,
> What's the recommended way to search in Solr (assuming 8.2 is used) for specific terms/phrases/expressions while limiting the search from position perspective.
> For example to search only in the first/last 100 words of the document ?
>
> Is there any built-in functionality for that ?
>
> Thanks in advance,
> Adi