Posted to solr-user@lucene.apache.org by David Hastings <ha...@gmail.com> on 2018/07/25 18:47:45 UTC

Section symbol, ignore in some queries but not others?

Hey all.  I have a situation that seems pretty rough.  Currently in our data
we have a lot of sentences like this:

elements comprise the "stuff" of the tax. 3 Reg. § 1.901-2(a)(2). 4 Only
non-Saudis are subject to the
<https://heinonline.org/HOL/SearchVolumeSOLR?input=(((%223%20Regulation%201%22%20OR%20%223%20Regulation%201%22%20OR%20%223%20Reg.%201%22)%20AND%20NOT%20id:hein.journals/rcatorbg3.14))&div=13&handle=hein.journals/taxlr53&collection=journals>
By default the word delimiter is treating all punctuation as a space.  So
when you search for 3 Reg. 1, your results can include 3 Reg. § 1.901.

I have experimented with the WDF and added § => ALPHA, and this works; it
treats the character as a letter (roughly the change sketched below).
However, during some queries I still need searches such as

Servitudes 2.10

to return results with:


Servitudes § 2.10
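
For reference, this is roughly the analyzer change I experimented with.  The
field type name, tokenizer choice, and the types file name here are just
placeholders, not my exact schema:

  <fieldType name="text_sections" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- the types file maps the section sign to ALPHA so WDF keeps it as a letter -->
      <filter class="solr.WordDelimiterGraphFilterFactory" types="wdftypes.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  # wdftypes.txt (file name is made up); \u00A7 is the section sign §
  \u00A7 => ALPHA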


At the moment I cannot conceive of a way to do this aside from two
separate text fields, effectively doubling the size of my index,
which currently sits at 300 GB optimized, and 500 GB if left to its
own devices.


Thanks for any help or suggestions

Re: Section symbol, ignore in some queries but not others?

Posted by David Hastings <ha...@gmail.com>.
Ah, so I could index the text including the § character as an alpha, use no
qs value when trying to ignore it, and for user searches add a qs value,
assuming I use edismax, which I currently am.
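
In other words, something like this (field name and slop values are just
illustrative; the exact slop needed will depend on how WDF splits the numbers):

  # strict: no phrase slop, so the § token has to line up
  q="3 Reg. 1"&defType=edismax&qf=text&qs=0

  # loose: allow a small gap so "Servitudes 2.10" can still match "Servitudes § 2.10"
  q="Servitudes 2.10"&defType=edismax&qf=text&qs=2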

Tested this method and it works as expected.  Thanks, saved me a lot of
time!
-David

On Wed, Jul 25, 2018 at 3:15 PM, Alexandre Rafalovitch <ar...@gmail.com>
wrote:

> If you copyField and don't store the copy, then only the indexed
> (term) representation exists for the copy, which is much smaller. Just a
> thought.
>
> The other thing is that you seem to be saying that you want to do a
> phrase match but with a token gap, right? Like an eDisMax slop?
> http://lucene.apache.org/solr/guide/7_4/the-extended-dismax-query-parser.html
>
> Regards,
>    Alex.
>

Re: Section symbol, ignore in some queries but not others?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
If you copyField and don't store the copy, then only the indexed
(term) representation exists for the copy, which is much smaller. Just a
thought.
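
Something along these lines (the field and type names here are made up, not
your schema):

  <!-- stored field users see; § indexed as a letter via the WDF types mapping -->
  <field name="text" type="text_sections" indexed="true" stored="true"/>
  <!-- unstored copy with the default word-delimiter behaviour; terms only, so much smaller -->
  <field name="text_loose" type="text_general" indexed="true" stored="false"/>
  <copyField source="text" dest="text_loose"/>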

The other thing is that you seem to be saying that you want to do a
phrase match but with a token gap, right? Like an eDisMax slop?
http://lucene.apache.org/solr/guide/7_4/the-extended-dismax-query-parser.html
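
For example, an explicit phrase with slop against a hypothetical text field:

  q=text:"Servitudes 2.10"~2

or the same thing applied to all user-entered phrases via the edismax qs
parameter.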

Regards,
   Alex.
