Posted to solr-user@lucene.apache.org by arik <ar...@gmail.com> on 2017/06/07 14:54:31 UTC

international characters in facet.prefix

I'm finding that the facet.prefix query parameter does not seem to support
international characters, regardless of url encoding.  All the other
parameters work fine, but that one seems unique in this respect.

For example with this data:

François Nédélec

*These queries produce relevant facets:*

/select?facet=on&facet.field=description&facet.prefix=ned&indent=on&q=description:françois&wt=json
/select?facet=on&facet.field=description&facet.prefix=franc&indent=on&q=description:nédélec&wt=json
/select?facet=on&facet.field=description&facet.prefix=ned&indent=on&q=description:fran%C3%A7ois&wt=json

*But these do not produce any facets:*

/select?facet=on&facet.field=description&facet.prefix=néd&indent=on&q=description:françois&wt=json
/select?facet=on&facet.field=description&facet.prefix=franç&indent=on&q=description:nédélec&wt=json
/select?facet=on&facet.field=description&facet.prefix=n%C3%A9d&indent=on&q=description:fran%C3%A7ois&wt=json

It seems therefore that facet.prefix only supports the "post-filtered" text,
not the raw original text?  Note how the international characters are
working fine in the "q" param, just not in facet.prefix.  
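
Percent-encoding itself can be ruled out with a quick check. The Python sketch below is not part of the original report, and its variable names are illustrative. It shows that n%C3%A9d is simply the UTF-8 percent-encoding of néd, so both forms of the request should carry the same prefix to Solr:

```python
from urllib.parse import quote, unquote

raw_prefix = "n\u00e9d"                 # "néd", written with an escape for clarity
encoded_prefix = quote(raw_prefix)      # percent-encodes the UTF-8 bytes
decoded_prefix = unquote("n%C3%A9d")    # what a server sees after URL decoding

print(encoded_prefix)
print(decoded_prefix == raw_prefix)
```

If the two forms are equivalent, the difference in behavior has to come from how the prefix is compared against the index, not from how it is transmitted.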

Any info or clarity or work-arounds on this would be much appreciated.



--
View this message in context: http://lucene.472066.n3.nabble.com/international-characters-in-facet-prefix-tp4339415.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: international characters in facet.prefix

Posted by Stefan Matheis <ma...@gmail.com>.
If you don't mind my asking, what are you trying to do in the first place?

And please don't describe it with the technical approach you're already
using (or at least trying to) but rather in basic/business terms.

-Stefan

On Jun 8, 2017 3:03 AM, "arik" <ar...@gmail.com> wrote:

> Thanks Erick, indeed your hunch is correct, it's the analyzing filters that
> facet.prefix seems to bypass, and getting rid of my
> ASCIIFoldingFilterFactory and MappingCharFilterFactory makes it work ok.
>
> The problem is I need those filters... otherwise how should I create facets
> which match against both Anglicized as well as international prefix
> spellings?  I could of course maintain separate fields and do multiple
> queries, but seems like that quickly gets out of hand if I also want to
> support mixed case and other filtering dimensions.
>
> Is there a way to route facet.prefix through the field type filters like all
> the other params? I suppose I could manually instantiate and pre-apply the
> filters in the client code... any other ideas?

Re: international characters in facet.prefix

Posted by arik <ar...@gmail.com>.
Thanks for the guidance.  I have a reasonable "middle ground" blend of
client-side and server side tweaks working now.  In solr I copied my field
into a duplicate field sans folding filters, so that I essentially have
"myfield_raw" and "myfield_analyzed".  Then on the client side I include both
of these fields in my facet query.  Finally, I prefer results from
myfield_analyzed when they exist, and fall back to the myfield_raw results
when the analyzed one turns up empty, which is what happens when foreign
characters are in the facet.prefix.
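
A minimal sketch of that client-side preference logic, in Python. It assumes the default wt=json facet layout (a flat list alternating term and count). The function names, the sample responses, and their counts are made up for illustration and are not output from a real index:

```python
def facet_terms(response, field):
    """Turn Solr's flat [term, count, term, count, ...] list into (term, count) pairs."""
    flat = response["facet_counts"]["facet_fields"].get(field, [])
    return list(zip(flat[0::2], flat[1::2]))


def pick_facets(response, analyzed_field="myfield_analyzed", raw_field="myfield_raw"):
    """Prefer the analyzed field's facets; fall back to the raw field when empty."""
    analyzed = facet_terms(response, analyzed_field)
    return analyzed if analyzed else facet_terms(response, raw_field)


# Hypothetical response for facet.prefix=ned: only the folded field matches.
plain_prefix_response = {
    "facet_counts": {"facet_fields": {
        "myfield_analyzed": ["nedelec", 1],
        "myfield_raw": [],
    }}
}

# Hypothetical response for facet.prefix=néd: only the unfolded field matches.
accented_prefix_response = {
    "facet_counts": {"facet_fields": {
        "myfield_analyzed": [],
        "myfield_raw": ["nédélec", 1],
    }}
}

print(pick_facets(plain_prefix_response))
print(pick_facets(accented_prefix_response))
```

Both facet fields ride along in the same request, so the fallback costs no extra round trip.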

So I still get all my results in a single solr call.  Capitalization is
lost, all results are lowercased (I kept the lowercase analyzer in the raw)
but that's ok for my needs.




Re: international characters in facet.prefix

Posted by Erick Erickson <er...@gmail.com>.
If you require that the facets show both the folded and non-folded
versions, then you have no choice except to index both somehow.

But I think you're saying that you expect "néd" and "ned" to be
counted in one bucket. Then, indeed, you have to somehow pre-apply the
relevant filters. You can do that in the client code or you could
write a QueryComponent that intercepted the query (probably a
first-component) and "did the right thing". The advantage there is
that since this is running on the server it has full access to the
analysis chain and could force the token to go through selected parts
of the chain without having to change the client code.
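
For the client-side option, a rough Python sketch might look like the following. Treat it as an approximation under stated assumptions: fold_prefix and facet_url are hypothetical helper names, and stripping combining marks after NFKD normalization is not a port of ASCIIFoldingFilterFactory, so the two will disagree on characters that do not decompose:

```python
import unicodedata
from urllib.parse import urlencode


def fold_prefix(prefix):
    """Approximate lowercasing plus accent folding on the client.

    Assumption: this only strips combining marks after NFKD decomposition.
    Characters that do not decompose (for example "ø" or "ß") pass through
    unchanged, unlike Solr's own folding filter.
    """
    decomposed = unicodedata.normalize("NFKD", prefix.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def facet_url(field, user_prefix, q):
    """Build a /select URL whose facet.prefix has been folded beforehand."""
    params = {
        "q": q,
        "facet": "on",
        "facet.field": field,
        "facet.prefix": fold_prefix(user_prefix),
        "wt": "json",
    }
    return "/select?" + urlencode(params)


print(facet_url("description", "Néd", "description:françois"))
```

Only the prefix is folded here; the q parameter is left alone because it already goes through the field's query analysis on the server.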

I say "parts of the chain" because some things just wouldn't make
sense. Say you had WordDelimiterFilterFactory in your chain. If your
prefix has a change in case, you'd get two tokens, definitely not what
you want. Which is one of the reasons facet prefixes don't do this by
default. Another gotcha would be, say, stemming. facet.prefix=runn
doesn't stem like "runner" for instance. In fact it doesn't stem at
all....

Note that case sensitivity matters here too. If you specified a prefix
of Ned I don't think you'd get anything counted in that bucket.

If I were going to make a queryComponent out of it, I'd probably just
define a new field that has selected filters in it (lowerCase,
folding, etc.) and force the prefix through that.

Here's some background on the general problem:
https://lucidworks.com/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/

Skimming that again it _does_ seem possible that sending a facet
prefix through the analysis chain as though it were a wildcarded term
would do what you're asking, but nobody has yet volunteered to write
the code. It would probably require a new facet parameter like
facet.analyze=true or something.

But frankly I think that's overkill. My bet is that you could do this
on the client side "well enough" and much more quickly....

Best,
Erick

On Wed, Jun 7, 2017 at 6:03 PM, arik <ar...@gmail.com> wrote:
> Thanks Erick, indeed your hunch is correct, it's the analyzing filters that
> facet.prefix seems to bypass, and getting rid of my
> ASCIIFoldingFilterFactory and MappingCharFilterFactory makes it work ok.
>
> The problem is I need those filters... otherwise how should I create facets
> which match against both Anglicized as well as international prefix
> spellings?  I could of course maintain separate fields and do multiple
> queries, but seems like that quickly gets out of hand if I also want to
> support mixed case and other filtering dimensions.
>
> Is there a way to route facet.prefix through the field type filters like all
> the other params? I suppose I could manually instantiate and pre-apply the
> filters in the client code... any other ideas?

Re: international characters in facet.prefix

Posted by arik <ar...@gmail.com>.
Thanks Erick, indeed your hunch is correct, it's the analyzing filters that
facet.prefix seems to bypass, and getting rid of my
ASCIIFoldingFilterFactory and MappingCharFilterFactory makes it work ok.

The problem is I need those filters... otherwise how should I create facets
which match against both Anglicized as well as international prefix
spellings?  I could of course maintain separate fields and do multiple
queries, but seems like that quickly gets out of hand if I also want to
support mixed case and other filtering dimensions.

Is there a way to route facet.prefix through the field type filters like all
the other params? I suppose I could manually instantiate and pre-apply the
filters in the client code... any other ideas?




Re: international characters in facet.prefix

Posted by Erick Erickson <er...@gmail.com>.
I'll bet your field definition has one of the folding filters in it.
I'm pretty sure that the facet.prefix parameter doesn't send the value
through your analysis chain, it uses it "as is". So my guess (without
looking at the code) is that the facet.prefix value franç is not _in_
your index, rather the term is just franc. If your prefix is fran, what
do you get back? I bet you get no terms back with the cedilla since
the values returned are the values after they've been through the
indexing analysis chain.

So it's not that the facet.prefix doesn't handle international
characters, it's just that it doesn't go through any of your analysis
chain. You can test this by defining a field without any folding and
using facet.prefix with diacritics....

Of course if your analysis chain for the field doesn't have any
folding filters this theory is out the window...
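
To make the suspected mismatch concrete, here is a small Python simulation. It is only a sketch: the indexed_terms list is an assumption about what a folding analysis chain would leave in the index for this data, and facet_prefix_matches is a hypothetical stand-in for an unanalyzed prefix comparison, not Solr code:

```python
# Assumed index contents after lowercasing and folding "François Nédélec".
indexed_terms = ["francois", "nedelec"]


def facet_prefix_matches(terms, prefix):
    """Compare the prefix to the indexed terms exactly as given, with no analysis."""
    return [term for term in terms if term.startswith(prefix)]


print(facet_prefix_matches(indexed_terms, "ned"))    # folded prefix
print(facet_prefix_matches(indexed_terms, "néd"))    # accented prefix
print(facet_prefix_matches(indexed_terms, "fran"))   # prefix that stops before the cedilla
```

Under these assumptions the accented prefix finds nothing, because no indexed term contains an accented character, while the folded prefixes match the folded terms.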



Best,
Erick

On Wed, Jun 7, 2017 at 7:54 AM, arik <ar...@gmail.com> wrote:
> I'm finding that the facet.prefix query parameter does not seem to support
> international characters, regardless of url encoding.  All the other
> parameters work fine, but that one seems unique in this respect.
>
> For example with this data:
>
> François Nédélec
>
> *These queries produce relevant facets:*
>
> /select?facet=on&facet.field=description&facet.prefix=ned&indent=on&q=description:françois&wt=json
> /select?facet=on&facet.field=description&facet.prefix=franc&indent=on&q=description:nédélec&wt=json
> /select?facet=on&facet.field=description&facet.prefix=ned&indent=on&q=description:fran%C3%A7ois&wt=json
>
> *But these do not produce any facets:*
>
> /select?facet=on&facet.field=description&facet.prefix=néd&indent=on&q=description:françois&wt=json
> /select?facet=on&facet.field=description&facet.prefix=franç&indent=on&q=description:nédélec&wt=json
> /select?facet=on&facet.field=description&facet.prefix=n%C3%A9d&indent=on&q=description:fran%C3%A7ois&wt=json
>
> It seems therefore that facet.prefix only supports the "post-filtered" text,
> not the raw original text?  Note how the international characters are
> working fine in the "q" param, just not in facet.prefix.
>
> Any info or clarity or work-arounds on this would be much appreciated.