Posted to solr-user@lucene.apache.org by Phil Scadden <P....@gns.cri.nz> on 2017/06/09 02:39:01 UTC

including a minus sign "-" in the token

We have important entities referenced in indexed documents which follow a naming convention of geographicname-number, e.g. Wainui-8.
I want the tokenizer to treat it as Wainui-8 when indexing, and when I search I want a q of Wainui-8 (must it be specified as Wainui\-8 ??) to return docs with Wainui-8 but not with Wainui-9 or plain Wainui.

Docs are PDFs, and I am using Tika to extract text.

How do I set up solr for queries like this?

Notice: This email and any attachments are confidential and may not be used, published or redistributed without the prior written consent of the Institute of Geological and Nuclear Sciences Limited (GNS Science). If received in error please destroy and immediately notify GNS Science. Do not copy or disclose the contents.

RE: including a minus sign "-" in the token

Posted by Phil Scadden <P....@gns.cri.nz>.

Looking at the Classic tokenizer, I notice that it does not split on a hyphen if there is a number in the word. That is pretty much exactly what I want. What are the downsides to using Classic?
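
For illustration only, here is an untested sketch of what swapping in the Classic tokenizer might look like. The field type name text_classic is made up for this example; solr.ClassicTokenizerFactory is the stock factory, and the filters and the stopwords.txt / synonyms.txt files are assumed to be the same ones used by text_general elsewhere in this thread.

```xml
<!-- Hypothetical field type: "text_classic" is an illustrative name, not taken from the thread. -->
<fieldType name="text_classic" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Classic tokenizer: documented to leave a hyphenated token unsplit when it contains a number -->
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same tokenizer on the query side so query terms match the indexed tokens -->
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A tokenizer change only applies to documents indexed afterwards, so a full reindex would be needed. The Analysis screen in the Solr admin UI is a quick way to check how Wainui-8, Wainui-10A and inter-montane actually come out before committing to it.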

-----Original Message-----
From: Shawn Heisey [mailto:apache@elyograg.org]
Sent: Monday, 12 June 2017 2:44 a.m.
To: Phil Scadden <P....@gns.cri.nz>
Subject: Re: including a minus sign "-" in the token

On 6/9/2017 8:12 PM, Phil Scadden wrote:
> So, the field I am using for search has type of:
>   <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
>     <analyzer type="index">
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>       <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>   </fieldType>
>
> You are saying "wainui-8" will be indexed as one token? But I should add a WordDelimiterFilter to the analyser to prevent it being split? Or I guess the WordDelimiterGraphFilter.

No, I was saying that the query parser won't look at the hyphen in
wainui-8 and treat it as a "NOT" operator.

Whatever you've got for index/query analysis will still take effect after that -- and it will do that even if you escape characters with a backslash.

Your index and query analysis are almost the same, but query analysis does synonym replacement.  The StandardTokenizerFactory will split "wainui-8" into two tokens and remove the hyphen, even if you escape it at query time.

> Ideally I want "inter-montane", say, to be treated as hyphenated, but a hyphen followed by a number NOT to be treated as hyphenated. That would mean catenateWords:1 but catenateNumbers:0???
> What would it do with Wainui-10A?

I'm not sure that there is any single built-in analysis component that will do what you want.  Your index analysis includes StandardTokenizerFactory, so it is going to remove hyphens and split tokens at those locations, whether they are followed by numbers or not.
You're going to need to switch to the whitespace tokenizer and add a filter (like the word delimiter filter) to do further splitting.  The "splitOnNumerics" setting for the word delimiter filter *might* do it, but I'm not sure.  It might take a combination of filters.
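
A minimal sketch of that combination, offered with the same uncertainty. The field type name text_ws is invented for this example, and the attribute values are a starting guess to experiment with rather than a verified answer. The factories and attribute names are the stock Solr ones; the synonym filter from the original query analyzer is left out to keep the sketch short.

```xml
<!-- Hypothetical field type: "text_ws" and the attribute values below are assumptions to test. -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split on whitespace only, so the hyphen survives tokenization -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- preserveOriginal="1" asks the filter to keep the unsplit token;
         the generate/catenate/split switches control any additional tokens -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="0" generateNumberParts="0"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="0" splitOnNumerics="0"
            preserveOriginal="1"/>
    <!-- Graph filters need flattening at index time -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="0" generateNumberParts="0"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="0" splitOnNumerics="0"
            preserveOriginal="1"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

One thing to watch with the whitespace tokenizer is that punctuation next to a word stays attached to the token, so text such as "Wainui-8," may need extra handling. Running sample text through the Analysis screen in the admin UI would show what each stage emits.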

Thanks,
Shawn

RE: including a minus sign "-" in the token

Posted by Phil Scadden <P....@gns.cri.nz>.

So, the field I am using for search has type of:
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

You are saying "wainui-8" will be indexed as one token? But I should add a WordDelimiterFilter to the analyser to prevent it being split? Or I guess the WordDelimiterGraphFilter.

Ideally I want "inter-montane", say, to be treated as hyphenated, but a hyphen followed by a number NOT to be treated as hyphenated. That would mean catenateWords:1 but catenateNumbers:0???
What would it do with Wainui-10A?

-----Original Message-----
From: Shawn Heisey [mailto:apache@elyograg.org]
Sent: Saturday, 10 June 2017 12:43 a.m.
To: solr-user@lucene.apache.org
Subject: Re: including a minus sign "-" in the token

On 6/8/2017 8:39 PM, Phil Scadden wrote:
> We have important entities referenced in indexed documents which follow a naming convention of geographicname-number, e.g. Wainui-8.
> I want the tokenizer to treat it as Wainui-8 when indexing, and when I search I want a q of Wainui-8 (must it be specified as Wainui\-8 ??) to return docs with Wainui-8 but not with Wainui-9 or plain Wainui.
>
> Docs are PDFs, and I am using Tika to extract text.
>
> How do I set up solr for queries like this?

At indexing time, Solr does not treat the hyphen as a special character like it does at query time.  Many analysis components do, though.  If your analysis chain includes certain components (the standard tokenizer, the ICU tokenizer, and WordDelimiterFilter are on that list), then the hyphen may be treated as a word break character and the analysis could remove it.

At query time, a hyphen in the middle of a word is not treated as a special character.  It would need to be at the beginning of the query text or after a space for the query parser to treat it as a negation.
So Wainui-8 would not be a problem, but -7 would, and you'd need to specify it as \-7 for it to work like you want.

Thanks,
Shawn

Re: including a minus sign "-" in the token

Posted by Shawn Heisey <ap...@elyograg.org>.

On 6/8/2017 8:39 PM, Phil Scadden wrote:
> We have important entities referenced in indexed documents which follow a naming convention of geographicname-number, e.g. Wainui-8.
> I want the tokenizer to treat it as Wainui-8 when indexing, and when I search I want a q of Wainui-8 (must it be specified as Wainui\-8 ??) to return docs with Wainui-8 but not with Wainui-9 or plain Wainui.
>
> Docs are PDFs, and I am using Tika to extract text.
>
> How do I set up solr for queries like this?

At indexing time, Solr does not treat the hyphen as a special character
like it does at query time.  Many analysis components do, though.  If
your analysis chain includes certain components (the standard tokenizer,
the ICU tokenizer, and WordDelimiterFilter are on that list), then the
hyphen may be treated as a word break character and the analysis could
remove it.

At query time, a hyphen in the middle of a word is not treated as a
special character.  It would need to be at the beginning of the query
text or after a space for the query parser to treat it as a negation. 
So Wainui-8 would not be a problem, but -7 would, and you'd need to
specify it as \-7 for it to work like you want.

Thanks,
Shawn

Re: including a minus sign "-" in the token

Posted by Susheel Kumar <su...@gmail.com>.

Hi Phil,

The WordDelimiterFilterFactory (
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory)
can be used along with WhitespaceTokenizerFactory to avoid splitting at
hyphens etc.  Use generateWordParts="0"...

Thanks

On Thu, Jun 8, 2017 at 10:39 PM, Phil Scadden <P....@gns.cri.nz> wrote:

> We have important entities referenced in indexed documents which follow a
> naming convention of geographicname-number, e.g. Wainui-8.
> I want the tokenizer to treat it as Wainui-8 when indexing, and when I
> search I want a q of Wainui-8 (must it be specified as Wainui\-8 ??) to
> return docs with Wainui-8 but not with Wainui-9 or plain Wainui.
>
> Docs are PDFs, and I am using Tika to extract text.
>
> How do I set up solr for queries like this?