Posted to solr-user@lucene.apache.org by Paul Rogers <pa...@gmail.com> on 2014/07/31 18:31:17 UTC

How to search for phrase "IAE_UPC_0001"

Hi Guys

I have a Solr application searching on data uploaded by Nutch.  The search
I wish to carry out is for a particular document reference contained within
the "url" field, e.g. IAE-UPC-0001.

The problem is that the file names that make up the URLs are not
consistent, so a URL might contain the reference as IAE-UPC-0001 or
IAE_UPC_0001 (i.e. using either the minus sign or the underscore as the
delimiter) but not both.

I have created the query (in the solr admin interface):

url:"IAE-UPC-0001"

which works (returning the single expected document), as do:

url:"IAE*UPC*0001"
url:"IAE?UPC?0001"

when the doc ref is in the format IAE-UPC-0001 (ie using the minus sign as
a delimiter).

However:

url:"IAE_UPC_0001"
url:"IAE*UPC*0001"
url:"IAE?UPC?0001"

do not work (returning zero documents) when the doc ref is in the format
IAE_UPC_0001 (ie using the underscore character as the delimiter).

I'm assuming the underscore is a special character, but I have looked
through the Solr wiki and can't find anything that explains the problem.
The minus sign also has a specific meaning, but that is nullified by
adding the quotes.

Can anyone suggest what I'm doing wrong?

Many thanks

Paul
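
On the special-character question above: as far as the standard query
parser goes, the underscore is not one of its reserved characters, whereas
the minus sign is. The Python sketch below illustrates this with a
hand-written escaping helper. The character set and the function names are
assumptions made for illustration, not a Solr API.

```python
# Characters treated here as special by the standard query parser.
# This set is an illustrative assumption; check the query parser docs
# for the Solr version in use.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')


def escape_term(term):
    """Backslash-escape every special character in a bare (unquoted) term."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in term)


def phrase_query(field, text):
    """Build a quoted phrase query; only backslash and double quote are escaped."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '{}:"{}"'.format(field, escaped)


print(escape_term("IAE_UPC_0001"))          # underscore is left alone
print(escape_term("IAE-UPC-0001"))          # each minus sign gets a backslash
print(phrase_query("url", "IAE_UPC_0001"))
```

Wildcards are documented as applying to single terms rather than to quoted
phrases, so the * and ? inside the quotes are most likely handed to the
analyzer as ordinary punctuation rather than acting as wildcards.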

Re: How to search for phrase "IAE_UPC_0001"

Posted by Paul Rogers <pa...@gmail.com>.
Hi Erick

Thanks for the reply.  I'll have a look and see if it is any help.  Again
thanks for pointing me in the right direction.

regards

Paul


On 31 July 2014 11:58, Erick Erickson <er...@gmail.com> wrote:

> Take a look at WordDelimiterFilterFactory. It has a bunch of
> options to allow this kind of thing to be indexed and searched.
>
> Note that in the default schema, the definition in the index part
> of the fieldType definition has slightly different parameters than
> the query time WordDelimiterFilterFactory, that's a good place
> to start.
>
> WARNING: WDFF is a bit complex, you _really_ would be well
> served by spending some time with the Admin/Analysis page to
> understand the effects of these parameters...
>
> Best,
> Erick

Re: How to search for phrase "IAE_UPC_0001"

Posted by Paul Rogers <pa...@gmail.com>.
Hi Jack

Thanks for the info. I'll take a look and see if I can figure it out (just
purchased the book).

P


On 31 July 2014 17:16, Jack Krupansky <ja...@basetechnology.com> wrote:

> And I have a lot more explanation and examples for word delimiter filter
> in my e-book:
> http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-
> deep-dive-early-access-release-7/ebook/product-21203548.html
>
> -- Jack Krupansky

Re: How to search for phrase "IAE_UPC_0001"

Posted by Jack Krupansky <ja...@basetechnology.com>.
And I have a lot more explanation and examples for word delimiter filter in 
my e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

-- Jack Krupansky

-----Original Message----- 
From: Erick Erickson
Sent: Thursday, July 31, 2014 12:58 PM
To: solr-user@lucene.apache.org
Subject: Re: How to search for phrase "IAE_UPC_0001"

Take a look at WordDelimiterFilterFactory. It has a bunch of
options to allow this kind of thing to be indexed and searched.

Note that in the default schema, the definition in the index part
of the fieldType definition has slightly different parameters than
the query time WordDelimiterFilterFactory, that's a good place
to start.

WARNING: WDFF is a bit complex, you _really_ would be well
served by spending some time with the Admin/Analysis page to
understand the effects of these parameters...

Best,
Erick






Re: How to search for phrase "IAE_UPC_0001"

Posted by Erick Erickson <er...@gmail.com>.
Take a look at WordDelimiterFilterFactory. It has a bunch of
options to allow this kind of thing to be indexed and searched.

Note that in the default schema, the definition in the index part
of the fieldType definition has slightly different parameters than
the query-time WordDelimiterFilterFactory. That's a good place
to start.

WARNING: WDFF is a bit complex, you _really_ would be well
served by spending some time with the Admin/Analysis page to
understand the effects of these parameters...

Best,
Erick
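
To make the effect of these options a little more concrete, here is a rough
Python stand-in for the splitting step. It is only a sketch: the function
name is made up, it splits on any non-alphanumeric character, and it ignores
the other behaviour of the real filter (case-change and letter/digit
splitting, catenation and so on). The Admin/Analysis page remains the
authoritative check.

```python
import re


def split_on_delimiters(token, preserve_original=False):
    """Split a token on runs of non-alphanumeric characters.

    With preserve_original=True the unsplit token is emitted as well,
    loosely mimicking the preserveOriginal="1" option.
    """
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    if preserve_original and parts != [token]:
        return [token] + parts
    return parts


for ref in ("IAE-UPC-0001", "IAE_UPC_0001"):
    print(ref, "->", split_on_delimiters(ref))

# With the original preserved there is one extra token to account for.
print(split_on_delimiters("IAE_UPC_0001", preserve_original=True))
```

Under this simplified rule both spellings reduce to the same three tokens,
which is what allows a phrase query written in one form to match a document
stored in the other.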





Re: How to search for phrase "IAE_UPC_0001"

Posted by Erick Erickson <er...@gmail.com>.
NP, glad you're making forward progress!

Erick

On Mon, Aug 18, 2014 at 12:31 PM, Paul Rogers <pa...@gmail.com> wrote:
> Hi Erick
>
> Thanks for the assist.  Did as you suggested (tho' I used Nutch).  Cleared
> out Solr's index and Nutch's crawl DB and then emptied all the documents
> out of the web server bar 10 of each type (IAE-UPC-#### and IAE_UPC_####).
> Then crawled the site using Nutch.
>
> Then confirmed that all 20 docs had been uploaded and that a *:* search
> returned all 20 docs.
>
> Now when I do a url search on either (for example) q=url:"IAE-UPC-220" or
> q="IAE_UPC_0001" I get a result returned for each as expected, i.e. it now
> works as expected.
>
> So seems I now need to figure out why Nutch isn't crawling the documents.
>
> Again many thanks.
>
> P
>

Re: How to search for phrase "IAE_UPC_0001"

Posted by Paul Rogers <pa...@gmail.com>.
Hi Erick

Thanks for the assist.  Did as you suggested (tho' I used Nutch).  Cleared
out Solr's index and Nutch's crawl DB and then emptied all the documents
out of the web server bar 10 of each type (IAE-UPC-#### and IAE_UPC_####).
Then crawled the site using Nutch.

Then confirmed that all 20 docs had been uploaded and that a *:* search
returned all 20 docs.

Now when I do a url search on either (for example) q=url:"IAE-UPC-220" or
q="IAE_UPC_0001" I get a result returned for each as expected, i.e. it now
works as expected.

So seems I now need to figure out why Nutch isn't crawling the documents.

Again many thanks.

P
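
A check along these lines can also be scripted rather than eyeballed. The
sketch below is in Python and assumes the match-all response has been saved
as JSON in the usual response/docs layout with a single-valued url field.
The sample data and the function name are invented for illustration.

```python
import json
import re

# Matches references such as IAE-UPC-0001 or IAE_UPC_0001 and captures
# the delimiter that was used.
REF_PATTERN = re.compile(r"IAE([-_])UPC\1\d+")


def count_by_delimiter(response_text):
    """Count docs per delimiter style found in each doc's url field."""
    counts = {"-": 0, "_": 0}
    for doc in json.loads(response_text)["response"]["docs"]:
        match = REF_PATTERN.search(doc.get("url", ""))
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample standing in for a saved query response.
sample = json.dumps({"response": {"numFound": 3, "docs": [
    {"url": "http://example.com/docs/IAE-UPC-0220.pdf"},
    {"url": "http://example.com/docs/IAE_UPC_0001.pdf"},
    {"url": "http://example.com/docs/index.html"},
]}})

print(count_by_delimiter(sample))
```

If the count for one delimiter style comes back as zero, the documents never
reached the index, and the crawl rather than the query is the place to look.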




On 18 August 2014 11:22, Erick Erickson <er...@gmail.com> wrote:

> I'd pull Nutch out of the mix here as a test. Create
> some test docs (use the exampleDocs directory?) and
> go from there at least long enough to ensure that Solr
> does what you expect if the data gets there properly.
>
> You can set this up in about 10 minutes, and test it
> in about 15 more. May save you endless hours.
>
> Because you're conflating two issues here:
> 1> whether Nutch is sending the data
> 2> whether Solr is indexing and searching as you expect.
>
> Some of the Solr/Lucene analysis chains do transformations
> that may not be what you assume, particularly things
> like StandardTokenizer and WordDelimiterFilterFactory.
>
> So I'd take the time to see that the values you're dealing
> with are behaving as you expect. The admin/analysis page
> will help you a _lot_ here.
>
> Best,
> Erick
>

Re: How to search for phrase "IAE_UPC_0001"

Posted by Erick Erickson <er...@gmail.com>.
I'd pull Nutch out of the mix here as a test. Create
some test docs (use the exampleDocs directory?) and
go from there at least long enough to ensure that Solr
does what you expect if the data gets there properly.

You can set this up in about 10 minutes, and test it
in about 15 more. May save you endless hours.

Because you're conflating two issues here:
1> whether Nutch is sending the data
2> whether Solr is indexing and searching as you expect.

Some of the Solr/Lucene analysis chains do transformations
that may not be what you assume, particularly things
like StandardTokenizer and WordDelimiterFilterFactory.

So I'd take the time to see that the values you're dealing
with are behaving as you expect. The admin/analysis page
will help you a _lot_ here.

Best,
Erick
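
As one way of setting up such a test, the Python sketch below builds two
small test documents and the matching query URLs without involving Nutch.
The host, port, core name (collection1), document ids and example URLs are
all assumptions; adjust them to the local installation. It only prints what
would be sent, so nothing here touches a running Solr.

```python
import json
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr/collection1"  # assumed location

# Two hypothetical documents, one per delimiter style.
test_docs = [
    {"id": "test-1", "url": "http://example.com/docs/IAE-UPC-0001.pdf"},
    {"id": "test-2", "url": "http://example.com/docs/IAE_UPC_0001.pdf"},
]


def select_url(field, phrase):
    """Return a /select URL for a phrase query on one field."""
    params = {"q": '{}:"{}"'.format(field, phrase), "wt": "json"}
    return SOLR_BASE + "/select?" + urlencode(params)


print(json.dumps(test_docs, indent=2))       # body for a JSON update request
for ref in ("IAE-UPC-0001", "IAE_UPC_0001"):
    print(select_url("url", ref))
```

With the documents posted by hand, a query that still returns nothing points
at the analysis chain, while a query that works points back at the crawl.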




On Mon, Aug 18, 2014 at 7:16 AM, Paul Rogers <pa...@gmail.com> wrote:
> Hi Guys
>
> I've been checking into this further and have deleted the index a couple of
> times and rebuilt it with the suggestions you've supplied.
>
> I had a bit of an epiphany last week and decided to check if the document I
> was searching for was actually in the index (did this by doing a *:* query
> to a file and grep'ing for the 'IAE_UPC_0001' string).  It seems it isn't!!
> Not sure if it was in the original index or not, tho' I suspect not.
>
> As far as I can see anything with the reference in the form IAE_UPC_####
> has not been indexed while those with the reference in the form
> IAE-UPC-#### have.  Not sure if that's a coincidence or not.
>
> Need to see if I can get the docs into the index and then check if the
> search works or not.  Will see if the guys on the Nutch list can shed any
> light.
>
> All the best.
>
> P
>
>
> On 4 August 2014 17:09, Jack Krupansky <ja...@basetechnology.com> wrote:
>
>> The standard tokenizer treats underscore as a valid token character, not a
>> delimiter.
>>
>> The word delimiter filter will treat underscore as a delimiter though.
>>
>> Make sure your query-time WDF does not have preserveOriginal="1" - but the
>> index-time WDF should have preserveOriginal="1". Otherwise, the query
>> phrase will generate an extra token which will participate in the matching
>> and might cause a mismatch.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Paul Rogers
>> Sent: Monday, August 4, 2014 5:55 PM
>>
>> To: solr-user@lucene.apache.org
>> Subject: Re: How to search for phrase "IAE_UPC_0001"
>>
>> Hi Guys
>>
>> Thanks for the replies.  I've had a look at the WordDelimiterFilterFactory
>> and the Term Info for the url field.  It seems that all the terms exist and
>> I now understand that each url is being broken up using the delimiters
>> specified.  But I think I'm still missing something.
>>
>> Am I correct in assuming the minus sign (-) is also a delimiter?
>>
>> If so why then does  url:"IAE-UPC-0001" return a result (when the url
>> contains the substring IAE-UPC-0001) whereas  url:"IAE_UPC_0001" doesn't
>> (when the url contains the substring IAE_UPC_0001)?
>>
>> Secondly if the url has indeed been broken into the terms IAE UPC and 0001
>> why do all the searches suggested or tried succeed when the delimiter is a
>> minus sign (-) but not when the delimiter is an underscore (_), returning
>> zero matches?
>>
>> Finally, shouldn't the query url:"IAE UPC 0001"~1 work since all it is
>> looking for is the three terms?
>>
>> Many thanks for any enlightenment.
>>
>> P
>>
>>
>>
>>
>> On 4 August 2014 01:33, Harald Kirsch <Ha...@raytion.com> wrote:
>>
>>  This all depends on how the tokenizers take your URLs apart. To quickly
>>> see what ended up in the index, go to a core in the UI, select Schema
>>> Browser, select the field containing your URLs, click on "Load Term Info".
>>>
>>> In your case, for the field holding the URL you could try to switch to a
>>> tokenizer that defines tokens as a sequence of alphanumeric characters,
>>> roughly [a-z0-9]+ plus diacritics. In particular punctuation and
>>> separation
>>> characters like dash, underscore, slash, dot and the like would never be
>>> part of a token, i.e. they don't make a difference.
>>>
>>> Then you can search the url parts with a phrase query
>>> (https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser#TheStandardQueryParser-SpecifyingTermsfortheStandardQueryParser)
>>> like
>>>
>>>  url:"IAE-UPC-0001"
>>>
>>> In the same way as during indexing, the dashes are removed to end up with
>>> three tokens, namely IAE, UPC and 0001. Further they have to be in that
>>> order. Naturally this will then match anything like:
>>>
>>>   "IAE_UPC_0001"
>>>   "IAE UPC 0001"
>>>   "IAE/UPC+0001"
>>>   "IAE\UPC\0001"
>>>   "IAE.UPC,0001"
>>>
>>> Depending on how your URLs are structured, there is the chance for false
>>> positives, of course.
>>>
>>> The Really Good Thing here is, that you don't need to use wildcards.
>>>
>>> I have not yet looked at the wildcard-queries implementation in
>>> Solr/Lucene, but with the  commercial search engines I know, they are a
>>> great way to loose the confidence of your users, because they just don't
>>> work as expected by anyone not knowing the implementation. Either they
>>> deliver only partial results or they kill the performance or they even go
>>> OOM. If Solr committers have not done something really ingenious,
>>> Solr/Lucene does have the same problems.
>>>
>>> Harald.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 31.07.2014 18:31, Paul Rogers wrote:
>>>
>>>  Hi Guys
>>>>
>>>> I have a Solr application searching on data uploaded by Nutch.  The
>>>> search
>>>> I wish to carry out is for a particular document reference contained
>>>> within
>>>> the "url" field, e.g. IAE-UPC-0001.
>>>>
>>>> The problem is is that the file names that comprise the url's are not
>>>> consistent, so a url might contain the reference as IAE-UPC-0001 or
>>>> IAE_UPC_0001 (ie using either the minus or underscore as the delimiter)
>>>> but
>>>> not both.
>>>>
>>>> I have created the query (in the solr admin interface):
>>>>
>>>> url:"IAE-UPC-0001"
>>>>
>>>> which works (returning the single expected document), as do:
>>>>
>>>> url:"IAE*UPC*0001"
>>>> url:"IAE?UPC?0001"
>>>>
>>>> when the doc ref is in the format IAE-UPC-0001 (ie using the minus sign
>>>> as
>>>> a delimiter).
>>>>
>>>> However:
>>>>
>>>> url:"IAE_UPC_0001"
>>>> url:"IAE*UPC*0001"
>>>> url:"IAE?UPC?0001"
>>>>
>>>> do not work (returning zero documents) when the doc ref is in the format
>>>> IAE_UPC_0001 (ie using the underscore character as the delimiter).
>>>>
>>>> I'm assuming the underscore is a special character but have tried looking
>>>> at the solr wiki but can't find anything to say what the problem is. Also
>>>> the minus sign also has a specific meaning but is nullified by adding the
>>>> quotes.
>>>>
>>>> Can anyone suggest what I'm doing wrong?
>>>>
>>>> Many thanks
>>>>
>>>> Paul
>>>>
>>>>
>>>>  --
>>> Harald Kirsch
>>> Raytion GmbH
>>> Kaiser-Friedrich-Ring 74
>>> 40547 Duesseldorf
>>> Fon +49 211 53883-216
>>> Fax +49-211-550266-19
>>> http://www.raytion.com
>>>
>>>
>>

Re: How to search for phrase "IAE_UPC_0001"

Posted by Paul Rogers <pa...@gmail.com>.
Hi Guys

I've been checking into this further and have deleted the index a couple of
times and rebuilt it with the suggestions you've supplied.

I had a bit of an epiphany last week and decided to check if the document I
was searching for was actually in the index (did this by dumping a *:* query
to a file and grep'ing for the 'IAE_UPC_0001' string).  It seems it isn't!!
Not sure if it was in the original index or not, tho' I suspect not.

As far as I can see anything with the reference in the form IAE_UPC_####
has not been indexed while those with the reference in the form
IAE-UPC-#### have.  Not sure if that's a coincidence or not.
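
For anyone wanting to repeat that check, here is a rough sketch of it. It is
only an illustration: the host, port and core name (localhost:8983,
collection1) are assumptions, the helper names are made up, and the sample
URLs are hypothetical. It builds a match-all request that returns just the
url field, then filters a list of URLs for the reference the way the grep
did.

```python
from urllib.parse import urlencode

# Assumed location of the Solr core; adjust to your own installation.
SOLR_CORE_URL = "http://localhost:8983/solr/collection1"


def build_match_all_url(core_url, rows=1000):
    """Build a /select URL that asks for the url field of every document."""
    params = {"q": "*:*", "fl": "url", "rows": rows, "wt": "json"}
    return core_url + "/select?" + urlencode(params)


def urls_containing(urls, reference):
    """Return the URLs that contain the reference as a plain substring."""
    return [u for u in urls if reference in u]


if __name__ == "__main__":
    print(build_match_all_url(SOLR_CORE_URL))
    # Hypothetical URLs standing in for the dumped query result.
    sample = [
        "http://example.com/docs/IAE-UPC-0001.pdf",
        "http://example.com/docs/IAE-UPC-0002.pdf",
    ]
    # An empty list here means the underscore form never reached the index.
    print(urls_containing(sample, "IAE_UPC_0001"))
```

If the filtered list is empty for the underscore form, no query syntax will
help until the documents themselves are indexed.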

Need to see if I can get the docs into the index and then check if the
search works or not.  Will see if the guys on the Nutch list can shed any
light.

All the best.

P


On 4 August 2014 17:09, Jack Krupansky <ja...@basetechnology.com> wrote:

> The standard tokenizer treats underscore as a valid token character, not a
> delimiter.
>
> The word delimiter filter will treat underscore as a delimiter though.
>
> Make sure your query-time WDF does not have preserveOriginal="1" - but the
> index-time WDF should have preserveOriginal="1". Otherwise, the query
> phrase will generate an extra token which will participate in the matching
> and might cause a mismatch.
>
> -- Jack Krupansky

Re: How to search for phrase "IAE_UPC_0001"

Posted by Jack Krupansky <ja...@basetechnology.com>.
The standard tokenizer treats underscore as a valid token character, not a 
delimiter.

The word delimiter filter will treat underscore as a delimiter though.

Make sure your query-time WordDelimiterFilter (WDF) does not have
preserveOriginal="1", but the index-time WDF should have
preserveOriginal="1". Otherwise, the query phrase will generate an extra
token (the original, unsplit term), which will participate in the matching
and might cause a mismatch.
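
To make the difference concrete, here is a small Python sketch. It is not
Lucene code, just a rough imitation under simplifying assumptions: the
function names are made up, the "standard" tokenizer is approximated by runs
of word characters (which include the underscore), and the word delimiter
step only splits on non-alphanumeric characters. The real filter has more
rules (case changes, letter/digit transitions, token positions), all ignored
here.

```python
import re


def standard_like_tokens(text):
    """Rough stand-in for the standard tokenizer: underscore stays inside a token."""
    return re.findall(r"\w+", text)


def word_delimiter_like_tokens(token, preserve_original=False):
    """Rough stand-in for the word delimiter filter: split on non-alphanumerics."""
    parts = re.findall(r"[A-Za-z0-9]+", token)
    if preserve_original and parts != [token]:
        # The unsplit token is emitted in addition to its parts.
        return [token] + parts
    return parts


if __name__ == "__main__":
    for ref in ("IAE-UPC-0001", "IAE_UPC_0001"):
        # The hyphen form falls apart here; the underscore form does not.
        print(ref, "->", standard_like_tokens(ref))
    print(word_delimiter_like_tokens("IAE_UPC_0001"))
    print(word_delimiter_like_tokens("IAE_UPC_0001", preserve_original=True))
```

With preserve_original switched on, the extra unsplit token shows up next to
the three parts, which is the extra query token mentioned above.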

-- Jack Krupansky

-----Original Message----- 
From: Paul Rogers
Sent: Monday, August 4, 2014 5:55 PM
To: solr-user@lucene.apache.org
Subject: Re: How to search for phrase "IAE_UPC_0001"

Hi Guys

Thanks for the replies.  I've had a look at the WordDelimiterFilterFactory
and the Term Info for the url field.  It seems that all the terms exist and
I now understand that each url is being broken up using the delimiters
specified.  But I think I'm still missing something.

Am I correct in assuming the minus sign (-) is also a delimiter?

If so why then does  url:"IAE-UPC-0001" return a result (when the url
contains the substring IAE-UPC-0001) whereas  url:"IAE_UPC_0001" doesn't
(when the url contains the substring IAE_UPC_0001)?

Secondly if the url has indeed been broken into the terms IAE UPC and 0001
why do all the searches suggested or tried succeed when the delimiter is a
minus sign (-) but not when the delimiter is an underscore (_), returning
zero matches?

Finally, shouldn't the query url:"IAE UPC 0001"~1 work since all it is
looking for is the three terms?

Many thanks for any enlightenment.

P






Re: How to search for phrase "IAE_UPC_0001"

Posted by Paul Rogers <pa...@gmail.com>.
Hi Guys

Thanks for the replies.  I've had a look at the WordDelimiterFilterFactory
and the Term Info for the url field.  It seems that all the terms exist and
I now understand that each url is being broken up using the delimiters
specified.  But I think I'm still missing something.

Am I correct in assuming the minus sign (-) is also a delimiter?

If so why then does  url:"IAE-UPC-0001" return a result (when the url
contains the substring IAE-UPC-0001) whereas  url:"IAE_UPC_0001" doesn't
(when the url contains the substring IAE_UPC_0001)?

Secondly if the url has indeed been broken into the terms IAE UPC and 0001
why do all the searches suggested or tried succeed when the delimiter is a
minus sign (-) but not when the delimiter is an underscore (_), returning
zero matches?

Finally, shouldn't the query url:"IAE UPC 0001"~1 work since all it is
looking for is the three terms?

Many thanks for any enlightenment.

P




On 4 August 2014 01:33, Harald Kirsch <Ha...@raytion.com> wrote:

> This all depends on how the tokenizers take your URLs apart. To quickly
> see what ended up in the index, go to a core in the UI, select Schema
> Browser, select the field containing your URLs, click on "Load Term Info".
>
> In your case, for the field holding the URL you could try to switch to a
> tokenizer that defines tokens as a sequence of alphanumeric characters,
> roughly [a-z0-9]+ plus diacritics. In particular punctuation and separation
> characters like dash, underscore, slash, dot and the like would never be
> part of a token, i.e. they don't make a difference.
>
> Then you can search the url parts with a phrase query (
> https://cwiki.apache.org/confluence/display/solr/The+
> Standard+Query+Parser#TheStandardQueryParser-
> SpecifyingTermsfortheStandardQueryParserwhich) like
>
>  url:"IAE-UPC-0001"
>
> In the same way as during indexing, the dashes are removed to end up with
> three tokens, namely IAE, UPC and 0001. Further they have to be in that
> order. Naturally this will then match anything like:
>
>   "IAE_UPC_0001"
>   "IAE UPC 0001"
>   "IAE/UPC+0001"
>   "IAE\UPC\0001"
>   "IAE.UPC,0001"
>
> Depending on how your URLs are structured, there is the chance for false
> positives, of course.
>
> The Really Good Thing here is, that you don't need to use wildcards.
>
> I have not yet looked at the wildcard-queries implementation in
> Solr/Lucene, but with the  commercial search engines I know, they are a
> great way to loose the confidence of your users, because they just don't
> work as expected by anyone not knowing the implementation. Either they
> deliver only partial results or they kill the performance or they even go
> OOM. If Solr committers have not done something really ingenious,
> Solr/Lucene does have the same problems.
>
> Harald.

Re: How to search for phrase "IAE_UPC_0001"

Posted by Harald Kirsch <Ha...@raytion.com>.
This all depends on how the tokenizers take your URLs apart. To quickly 
see what ended up in the index, go to a core in the UI, select Schema 
Browser, select the field containing your URLs, click on "Load Term Info".

In your case, for the field holding the URL you could try to switch to a 
tokenizer that defines tokens as a sequence of alphanumeric characters, 
roughly [a-z0-9]+ plus diacritics. In particular punctuation and 
separation characters like dash, underscore, slash, dot and the like 
would never be part of a token, i.e. they don't make a difference.

Then you can search the url parts with a phrase query 
(https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser#TheStandardQueryParser-SpecifyingTermsfortheStandardQueryParserwhich) 
like

  url:"IAE-UPC-0001"

In the same way as during indexing, the dashes are removed to end up 
with three tokens, namely IAE, UPC and 0001. Further they have to be in 
that order. Naturally this will then match anything like:

   "IAE_UPC_0001"
   "IAE UPC 0001"
   "IAE/UPC+0001"
   "IAE\UPC\0001"
   "IAE.UPC,0001"

Depending on how your URLs are structured, there is the chance for false 
positives, of course.
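
A small Python sketch of the idea, in case it helps. This is only an
illustration and not how Solr implements phrase queries: the function names
are made up, tokens are plain runs of [a-z0-9] after lower-casing
(diacritics are ignored), and a phrase matches when the query tokens appear
next to each other in the same order. The example URL at the end is
hypothetical.

```python
import re


def alnum_tokens(text):
    """Lower-case the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_matches(query, document):
    """True if the query tokens occur consecutively, in order, in the document."""
    q, d = alnum_tokens(query), alnum_tokens(document)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


if __name__ == "__main__":
    variants = [
        "IAE_UPC_0001",
        "IAE UPC 0001",
        "IAE/UPC+0001",
        "IAE\\UPC\\0001",
        "IAE.UPC,0001",
    ]
    for v in variants:
        print(v, phrase_matches("IAE-UPC-0001", v))
    # A possible false positive: three path segments, not one reference.
    print(phrase_matches("IAE-UPC-0001", "http://example.com/iae/upc/0001/index.html"))
```

All five variants reduce to the same three tokens, and so does the last URL,
which is the kind of false positive to watch for.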

The really good thing here is that you don't need to use wildcards.

I have not yet looked at the wildcard-query implementation in
Solr/Lucene, but in the commercial search engines I know, wildcards are a
great way to lose the confidence of your users, because they just don't
work as expected by anyone who doesn't know the implementation. Either they
deliver only partial results, or they kill performance, or they even
go OOM. Unless the Solr committers have done something really ingenious,
Solr/Lucene will have the same problems.

Harald.





On 31.07.2014 18:31, Paul Rogers wrote:
> Hi Guys
>
> I have a Solr application searching on data uploaded by Nutch.  The search
> I wish to carry out is for a particular document reference contained within
> the "url" field, e.g. IAE-UPC-0001.
>
> The problem is is that the file names that comprise the url's are not
> consistent, so a url might contain the reference as IAE-UPC-0001 or
> IAE_UPC_0001 (ie using either the minus or underscore as the delimiter) but
> not both.
>
> I have created the query (in the solr admin interface):
>
> url:"IAE-UPC-0001"
>
> which works (returning the single expected document), as do:
>
> url:"IAE*UPC*0001"
> url:"IAE?UPC?0001"
>
> when the doc ref is in the format IAE-UPC-0001 (ie using the minus sign as
> a delimiter).
>
> However:
>
> url:"IAE_UPC_0001"
> url:"IAE*UPC*0001"
> url:"IAE?UPC?0001"
>
> do not work (returning zero documents) when the doc ref is in the format
> IAE_UPC_0001 (ie using the underscore character as the delimiter).
>
> I'm assuming the underscore is a special character but have tried looking
> at the solr wiki but can't find anything to say what the problem is.  Also
> the minus sign also has a specific meaning but is nullified by adding the
> quotes.
>
> Can anyone suggest what I'm doing wrong?
>
> Many thanks
>
> Paul
>

-- 
Harald Kirsch
Raytion GmbH
Kaiser-Friedrich-Ring 74
40547 Duesseldorf
Fon +49 211 53883-216
Fax +49-211-550266-19
http://www.raytion.com