Posted to solr-user@lucene.apache.org by Angel Todorov <at...@gmail.com> on 2017/07/24 20:27:00 UTC

FreeTextSuggester throwing error "token must not contain separator byte"

Hi guys,

I am trying to set up the FreeTextSuggester / Lookup Factory in a suggester
definition in Solr. Unfortunately, while the index is building, I am
encountering the following error:

*"msg":"tokens must not contain separator byte; got token=[30 20 30 20 32
20 72 20 61 6c 6c 65 6e 20 72] but gramCount=6, which is greater than
expected max ngram size=5","trace":"java.lang.IllegalArgumentException:
tokens must not contain separator byte; got token=[30 20 30 20 32 20 72 20
61 6c 6c 65 6e 20 72] but gramCount=6, which is greater than expected max
ngram size=5\r\n\tat
org.apache.lucene.search.suggest.analyzing.FreeTextSuggester.build(FreeTextSuggester.java:362)\r\n\tat
*

I've also opened the following issue, because I don't think it's right to
leave this exception unhandled:

https://issues.apache.org/jira/browse/SOLR-11139

But my question is about the error in general - why is it occurring? I only
have English text, nothing special.

Thanks,
Angel

Re: FreeTextSuggester throwing error "token must not contain separator byte"

Posted by Angel Todorov <at...@gmail.com>.
Hi guys,

Thank you very much for the help. I think I see what is going on. Yes, it is
related to the Shingle filter added to the analyzer. It shouldn't be there
when a FreeTextLookup factory is used in the suggester, because the two
conflict. The StandardTokenizer drops punctuation and whitespace, but the
shingle filter then puts a space between the tokens it joins into each
shingle, and this makes the freetext analyzer / lookup throw an error.
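
To make the conflict concrete, here is a toy model in Python (used only for
brevity; it is not Lucene code, and the real ShingleFilter handles token
positions differently). It assumes the suggester's separator is a space,
which is what the 0x20 bytes counted in the error suggest, and takes the
maximum of 5 grams from the error message. The helper names are made up for
this sketch.

```python
SEPARATOR = " "   # assumption: suggester separator is a space (0x20)
MAX_NGRAMS = 5    # "expected max ngram size=5" from the error message


def shingles(tokens, min_size, max_size, sep=" "):
    """Join every run of min_size..max_size adjacent tokens with sep."""
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(sep.join(tokens[start:start + size]))
    return out


def gram_count(token, sep=SEPARATOR):
    """Count grams the way the error reports them: separators + 1."""
    return token.count(sep) + 1


words = ["0", "0", "2", "r", "allen", "r"]

# Case 1: the analyzer already shingles (here: bigrams joined by a space),
# and the suggester then builds its own n-grams on top of those tokens.
pre_shingled = shingles(words, 2, 2)
stacked = shingles(pre_shingled, 1, MAX_NGRAMS, SEPARATOR)
too_long = [g for g in stacked if gram_count(g) > MAX_NGRAMS]

# Case 2: no shingle filter in the analyzer; only the suggester shingles.
plain = shingles(words, 1, MAX_NGRAMS, SEPARATOR)
plain_too_long = [g for g in plain if gram_count(g) > MAX_NGRAMS]

print("stacked shingles over the limit:", len(too_long))
print("plain tokens over the limit:", len(plain_too_long))
```

In this model only the stacked case yields tokens whose gram count exceeds
the limit, which matches the advice to drop the shingle filter from the
analyzer.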

Unfortunately, I tried the approach without shingles some time ago and got
it working, but it doesn't produce the results I expect. It doesn't behave
like Google's autosuggest, so to speak. Let me give you a couple of
examples:

Input: don (without the quotes)
Output: only single terms. "donald", but not "donald trump", for example

Input: "don" (with quotes)
Output: multi-terms only, but the first term must start with don. So it
still doesn't output "donald trump".

Input: "donald t" (with quotes)
Output: I also get all terms starting with "t", which I don't want

So I am thinking SOLR / Elasticsearch really needs a brand new suggester
implementation. Since most people are using Google as the "example", it
should work as it works there.

Thanks again,
Angel


On Tue, Jul 25, 2017 at 12:00 PM, alessandro.benedetti <a.benedetti@sease.io> wrote:

> I think this bit is the problem :
>
> "I am using a Shingle filter right after the StandardTokenizer, not sure if
> that has anything to do with it. "
>
> When using the FreeTextLookup approach, you don't need to use shingles in
> your analyser, shingles are added by the suggester itself.
> As Erick mentioned, the reason spaces come back is because you produce
> shingles on your own and then the Lookup approach will add additional
> shingles.
>
> I recommend to read this section of my blog [1] ( you may have read it as
> there is one comment with a similar problem to you)
>
>
> [1] http://alexbenedetti.blogspot.co.uk/2015/07/solr-you-complete-me.html

Re: FreeTextSuggester throwing error "token must not contain separator byte"

Posted by "alessandro.benedetti" <a....@sease.io>.
I think this bit is the problem:

"I am using a Shingle filter right after the StandardTokenizer, not sure if
that has anything to do with it."

When using the FreeTextLookup approach, you don't need shingles in your
analyser; the suggester adds them itself.
As Erick mentioned, the spaces come back because you produce shingles of
your own, and the lookup then builds additional shingles on top of them.

I recommend reading this section of my blog [1] (you may have read it
already, as one comment there describes a problem similar to yours).


[1] http://alexbenedetti.blogspot.co.uk/2015/07/solr-you-complete-me.html



-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: http://lucene.472066.n3.nabble.com/FreeTextSuggester-throwing-error-token-must-not-contain-separator-byte-tp4347406p4347454.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: FreeTextSuggester throwing error "token must not contain separator byte"

Posted by govind nitk <go...@gmail.com>.
Hi Angel,
Please share the FreeText suggester definition from your config.

I guess you may have set whitespace as the separator in the suggester
definition, like this:

<str name="separator"> </str>
which is causing the trouble.




On Tue, Jul 25, 2017 at 9:01 AM, Erick Erickson <er...@gmail.com> wrote:

> The shingle filter may use space as the separator between shingles that it
> generates. The admin/ analysis page is your friend.

Re: FreeTextSuggester throwing error "token must not contain separator byte"

Posted by Erick Erickson <er...@gmail.com>.
The shingle filter may use a space as the separator between the tokens of
the shingles it generates. The admin/analysis page is your friend.

On Jul 24, 2017 2:45 PM, "Angel Todorov" <at...@gmail.com> wrote:

> Hi Rick,
>
> Yep, that's really weird, because I am using the StandardTokenizerFactory,
> which is supposed to remove whitespace. Also tried the
> WhitespaceTokenizerFactory. I'll have a look at other analyzers or if
> nothing works maybe implement my own.
>
> I am using a Shingle filter right after the StandardTokenizer, not sure if
> that has anything to do with it.
>
>
> Thanks,
> Angel

Re: FreeTextSuggester throwing error "token must not contain separator byte"

Posted by Angel Todorov <at...@gmail.com>.
Hi Rick,

Yep, that's really weird, because I am using the StandardTokenizerFactory,
which is supposed to remove whitespace. I also tried the
WhitespaceTokenizerFactory. I'll have a look at other analyzers or, if
nothing works, maybe implement my own.

I am using a Shingle filter right after the StandardTokenizer, not sure if
that has anything to do with it.


Thanks,
Angel


On Tue, Jul 25, 2017 at 12:09 AM Rick Leir <rl...@leirtech.com> wrote:

> Angel,
> The 20 byte is an ASCII space character, which is a separator in most
> contexts. Breaking the buffer at spaces, you can see 6 non-space tokens.
>
> Have a look at your analysis chain and see why you are getting this.
> Cheers -- Rick

Re: FreeTextSuggester throwing error "token must not contain separator byte"

Posted by Rick Leir <rl...@leirtech.com>.
Angel,
The 20 byte (hex 0x20) is an ASCII space character, which is a separator in most contexts. Breaking the buffer at the spaces, you can see 6 non-space tokens.
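
The byte dump in the error can be decoded to check this. A minimal sketch in
Python (used only for brevity; it is not part of Solr), with the hex bytes
copied from the error message:

```python
# Token bytes copied from the error message (hex).
TOKEN_HEX = "30 20 30 20 32 20 72 20 61 6c 6c 65 6e 20 72"

# Decode the bytes as ASCII text, then break the text at the 0x20 spaces.
token = bytes.fromhex(TOKEN_HEX).decode("ascii")
grams = token.split(" ")

print(repr(token))        # the token as text
print(len(grams), grams)  # number of space-separated pieces, and the pieces
```

The decoded token is "0 0 2 r allen r", which splits into 6 pieces, the same
number the error reports as gramCount.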

Have a look at your analysis chain and see why you are getting this. Cheers -- Rick

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com