Posted to solr-user@lucene.apache.org by Bernd Schmidt <b....@eggheads.de> on 2017/12/06 16:09:14 UTC

Howto search for § character

Hi all,


We have defined a field named "_text_" for full-text search, based on the field type "text_general":
<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>


When we search for the "§" character, we see strange behaviour:


q=_text_:§ AND entityClass:StructureNodeImpl  => numFound:469 (all nodes where entityClass:StructureNodeImpl)
q=_text_:§ => numFound:0


How can we search for occurrences of the § character?


Best regards, 
    Bernd

 Kind regards

 Bernd Schmidt
 SOFTWARE DEVELOPMENT

 b.schmidt@eggheads.de

 eggheads GmbH
 Herner Straße 370
 44807 Bochum

 Phone +49 234 89397-0
 Fax +49 234 89397-28

 www.eggheads.de
 -----------------------------------------------

 Customers: DER TOURISTIK, EMSA, FRIATEC, MAMMUT, SUTTERLÜTY, SCHÄFER SHOP, THOMAS COOK, TUI, WILO SE, WÜRTH, and many more.

 Services: standard software for Product Information Management, Cross Media Publishing & Multi Channel Commerce, process consulting

 Innovation award 2017: eggheads is the winner of the Innovationspreis-IT 2017 in the E-Commerce category.

 -----------------------------------------------

 Webinar: presentation of the new features of the eggheads Suite on 12.12.2017.

 -----------------------------------------------

Re: Howto search for § character

Posted by Rick Leir <rl...@leirtech.com>.

Bernd,
What analysis chain do you have in schema.xml? The chain tokenizes the text and filters characters, and there is both an index-time chain and a query-time chain. My suspicion is that your analysis chain is mapping that char to a plain ASCII char. Use the Analysis tab in the Solr admin UI to debug this.
Cheers -- Rick

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com

Re: Howto search for § character

Posted by Erick Erickson <er...@gmail.com>.

The Analysis page in the admin UI (select your core first) will show
you exactly what happens. In addition, the Schema Browser will show you
what is actually in the index, i.e. the terms as they appear after the
whole analysis chain has run. Between them, those two will tell you
definitively what happens to that character.

Best,
Erick

Re: Howto search for § character

Posted by Erick Erickson <er...@gmail.com>.

You have to use a different analysis chain. There are about a zillion
options, here's a _start_:
https://lucene.apache.org/solr/guide/6_6/understanding-analyzers-tokenizers-and-filters.html
You'll probably define a <fieldType> much like the one for
text_general, then use your new type in your <field>. This is
really the heart of how you make Solr do what you want when it comes
to what's searchable and what's not.

When you use the admin/analysis page, hover over the light gray
two-letter abbreviations and it'll pop up the class used for that
transformation.

You can start with WhitespaceTokenizerFactory, which breaks only on
whitespace. Be aware that the filters that follow can still manipulate
the tokens the tokenizer creates. WhitespaceTokenizerFactory will _not_
remove punctuation, for instance, so you have to deal with that
yourself. For example, the period at the end of the sentence
"I Like Cake." would be included in the emitted tokens, so you'd have
I
Like
Cake.

You can use one of the filters to deal with that.
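
As a minimal sketch of the above: a whitespace-tokenized field type that
strips trailing sentence punctuation could look like the following. The
type name "text_ws" and the regular expression are illustrative
assumptions, not something given in this thread. The factories themselves
(WhitespaceTokenizerFactory, PatternReplaceFilterFactory,
LengthFilterFactory, LowerCaseFilterFactory) are stock Solr classes.

```xml
<!-- Hypothetical field type: the name "text_ws" and the regex are assumptions for illustration. -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer element is used for both indexing and querying. -->
  <analyzer>
    <!-- Split on whitespace only, so "§45" stays one token. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Strip sentence punctuation at the end of a token: "Cake." becomes "Cake". -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="[.,;:!?]+$" replacement="" replace="all"/>
    <!-- Drop tokens that the previous step left empty, e.g. a lone ".". -->
    <filter class="solr.LengthFilterFactory" min="1" max="255"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- The field from the original question, switched to the new type. -->
<field name="_text_" type="text_ws" multiValued="true" indexed="true" stored="false"/>
```

With this chain, "I Like Cake." should come out as the tokens i, like
and cake, while "§45" is kept as a single token. Leading punctuation such
as an opening parenthesis is not handled here, and the result should be
checked on the analysis page before reindexing.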

I would be very reluctant to use the "string" type: it's not analyzed
in any way and is almost always the wrong solution for something like
this. Input like this
I Like Cake.
would match _only_ I\ Like\ Cake.
You couldn't search on just the term "like", or even "Like", only on a
wildcard pattern such as "*Like*", which rather defeats the purpose of
using tokenized search.

Best,
Erick

Re: Howto search for § character

Posted by Tim Casey <tc...@gmail.com>.

At my last company we ended up writing a custom analyzer to handle
punctuation, but that was for Lucene 2 or 3.  That analyzer was carried
forward as we upgraded and was used for all human-written text.

There are now much better analyzers and much better ways to hook
them up, as Erick noted above, but we really cared about how this was
done, and all of the work put into the analyzer paid off.

I would expect there to be an analyzer that keeps punctuation
tokens for search.  One issue that comes up is whether you want
runs of punctuation to be a single token or separate tokens: what
happens to "§!" or "§?" or "?§", and, in the case of things like
text/email, what happens to "§!!!!"?

In any event, my 2 pence worth....

tim

Re: Howto search for § character

Posted by Bernd Schmidt <b....@eggheads.de>.

Thanks for all the information.
That helps me understand the issue so far.

Cheers, Bernd

Re: Howto search for § character

Posted by Shawn Heisey <ap...@elyograg.org>.

On 12/7/2017 9:37 AM, Bernd Schmidt wrote:
> Indeed, I saw in the analysis tab of the solr admin that the § char will be removed when using type text_general.
> But in this use case we want to make a full text search like "_text_:§45" or "_text_:§*" to find words starting with §.
> We need a text field here, not a string field!
> What is your recommended way to deal with it? 
> Is it possible to remove the word break behaviour for the  § char?
> Or is the best way to encode all § chars when indexing and searching?

This character is classified by Unicode as punctuation:

http://www.fileformat.info/info/unicode/char/00a7/index.htm

Almost any example field type for full-text search that you're likely to
encounter is going to be designed to split on punctuation and remove it
from the token stream.  That's one of the most common things that
full-text search engines do.

You're going to need to design a new analysis chain that *doesn't* do
this, apply the fieldType containing that analysis to your field,
restart/reload, and reindex.
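
On the question of encoding the § character: one option that keeps the
encoding inside Solr, rather than in client code, is a char filter that
rewrites § before the tokenizer sees it. The following is a sketch under
assumptions, not something proposed in this thread: the type name
"text_general_sect", the file name "mapping-section.txt" and the
replacement string "sect" are all made up for illustration.
MappingCharFilterFactory and StandardTokenizerFactory are stock Solr
classes.

```xml
<!-- Hypothetical variant of a StandardTokenizer-based type; all names are assumptions. -->
<!-- Assumed content of conf/mapping-section.txt, a single rule: "§" => "sect" -->
<fieldType name="text_general_sect" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer element is used for both indexing and querying. -->
  <analyzer>
    <!-- Runs before tokenization, so the tokenizer never sees the "§" character. -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-section.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the same analyzer runs at index and query time, a query term
such as §45 is rewritten in the same way as the indexed text. The cost
is that a literal word "sect45" becomes indistinguishable from §45, so
the replacement should be a string that does not occur in the data.
Whether wildcard queries such as §* behave as wanted with this chain
would need to be verified with real queries.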

Designing analysis chains is an art form, and tends to be one of the
hardest parts of setting up a production Solr install.  It took me at
least a month of almost constant work to settle on the schema design for
the indexes that I maintain.  All of the "solr.TextField" types in my
schema are completely custom -- none of the analysis chains in Solr
examples are in that schema.

Thanks,
Shawn

Re: Howto search for § character

Posted by Bernd Schmidt <b....@eggheads.de>.

Indeed, I saw in the Analysis tab of the Solr admin UI that the § char is removed when using the type text_general.
But in this use case we want to run full-text searches like "_text_:§45" or "_text_:§*" to find words starting with §.
We need a text field here, not a string field!
What is your recommended way to deal with it?
Is it possible to turn off the word-break behaviour for the § char?
Or is the best way to encode all § chars when indexing and searching?

Thanks, Bernd

Re: Howto search for § character

Posted by Shawn Heisey <ap...@elyograg.org>.

On 12/6/2017 9:09 AM, Bernd Schmidt wrote:
> we have defined a field named "_text_" for a full text search based on field-type "text_general":
> <field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>
>
> When trying to search for the "§" character, we have strange behaviour:
>
> q=_text_:§ AND entityClass:StructureNodeImpl  => numFound:469 (all nodes where entityClass:StructureNodeImpl)
> q=_text_:§ => numFound:0
>
> How can we search for occurrences of the § character?

We can't see how your "text_general" type is defined, but if it is
anything like the same type included in Solr examples, then it probably
is using StandardTokenizerFactory.  It appears that this tokenizer
treats the § character as a word break and removes it from the token
stream.  Most likely, the reason the search with the extra clause works
is that the part with that character is removed, and the query ends up
ONLY being the extra clause.
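
For reference, the text_general type in the example schemas that shipped
with Solr around the 6.x releases looked roughly like the following. It
is shown only as an illustration of where the tokenizer sits in the
chain; the definition in Bernd's schema is not shown in this thread and
may differ, and other Solr versions use somewhat different filters.

```xml
<!-- Approximate stock definition from the Solr 6.x example schemas; an assumption about -->
<!-- what "text_general" looks like here, since the poster's schema is not shown. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- StandardTokenizerFactory is the step that discards "§" as a word break. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- The same tokenizer runs on the query, so a query of just "§" yields no tokens. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Since the index-time and query-time analyzers both start with the same
tokenizer, the character is lost on both sides: it is never written to
the index, and it never reaches the query that is actually executed.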

You will need a fieldType with an analysis chain that doesn't remove the
§ character, and it's almost guaranteed that you'll need to reindex. 
Unless you do that, searching for that character is not going to be
possible.

Also keep in mind that searching for a single character may not do what
you expect if that character is not a single word in the text, and that
certain filters can end up trimming out really short terms like that.

Thanks,
Shawn