Posted to java-user@lucene.apache.org by sr...@gmail.com on 2011/08/22 07:47:08 UTC

Issue with StandardAnalyzer which splits single word with _(Lucene Version: 3.0)

Hello All,

I observed some unexpected behavior when using StandardAnalyzer to parse a query. Here is a demonstration.

I am passing the query as (key:xyz_abc) && (text:blabla)

Expecting the parsed query to be +key:xyz_abc +text:blabla

Actual Result is +key:"xyz abc" +text:blabla

As I understand it, StandardAnalyzer splits text at word boundaries into multiple words, but xyz_abc above is a single word. Please correct me if I am wrong.

I also observed that when a digit follows the underscore, the parsed query is as expected. For example:

If I give the query as (key:xyz_1abc) && (text:blabla), the parsed query is +key:xyz_1abc +text:blabla

This is the behavior I am expecting.

Please help.

Thanks,
Srinivas

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Issue with StandardAnalyzer which splits single word with _(Lucene Version: 3.0)

Posted by govind bhardwaj <go...@gmail.com>.

Hi Erick,

Thanks for your reply.

I verified Srinivas' query by changing the Lucene version (in the constructor
of StandardAnalyzer) to LUCENE_30 and found that the parsed query
indeed changes to 'xyz abc' (the input query was 'xyz_abc'), while with
LUCENE_33 that does not happen and the parsed query remains 'xyz_abc'.
I can't figure out why that may be happening.

Regards,
Govind





-- 
No trees were harmed in the creation of this message, but several thousand
electrons were mildly inconvenienced.

Re: Issue with StandardAnalyzer which splits single word with _(Lucene Version: 3.0)

Posted by Erick Erickson <er...@gmail.com>.

No, that's expected. StandardAnalyzer breaks on '_' as far as I know.

NOTE: the behavior changed a bit as of Lucene/Solr 3.1. To get the old
StandardAnalyzer behavior, I believe you need ClassicAnalyzer...

More than you ever wanted to know about word breaking (3.1+):
http://unicode.org/reports/tr29/#Word_Boundaries
Linked to from:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StandardTokenizerFactory
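
As a rough illustration of the difference, here is a simplified sketch in plain Java. It is not Lucene's tokenizer code, and the class and method names are hypothetical. The classic-like rule assumes that an underscore joins two alphanumeric runs only when one of them contains a digit, which is consistent with the results reported in this thread. The UAX #29-like rule assumes that an underscore always joins letters and digits. The render helper mimics how a term that analyzes to several tokens shows up as a quoted phrase in the parsed query. Lowercasing and stop words are not modeled, and only ASCII input is handled.

```java
import java.util.ArrayList;
import java.util.List;

class UnderscoreTokenizingSketch {

    // True if the segment contains at least one digit.
    static boolean hasDigit(String segment) {
        for (char c : segment.toCharArray()) {
            if (Character.isDigit(c)) {
                return true;
            }
        }
        return false;
    }

    // Assumed pre-3.1 style rule: '_' joins two neighbouring segments only
    // when at least one of them contains a digit. Runs of underscores are
    // treated like a single underscore, which is a further simplification.
    public static List<String> classicLike(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.split("[^A-Za-z0-9_]+")) {
            StringBuilder current = new StringBuilder();
            String previous = "";
            for (String segment : chunk.split("_")) {
                if (segment.isEmpty()) {
                    continue;
                }
                boolean join = current.length() > 0
                        && (hasDigit(previous) || hasDigit(segment));
                if (join) {
                    current.append('_').append(segment);
                } else {
                    if (current.length() > 0) {
                        tokens.add(current.toString());
                    }
                    current = new StringBuilder(segment);
                }
                previous = segment;
            }
            if (current.length() > 0) {
                tokens.add(current.toString());
            }
        }
        return tokens;
    }

    // Assumed 3.1+ style rule: '_' behaves like a connector, so a run of
    // letters, digits and underscores stays one token. Edge cases such as
    // leading or trailing underscores are not modeled faithfully.
    public static List<String> uax29Like(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.split("[^A-Za-z0-9_]+")) {
            if (chunk.matches(".*[A-Za-z0-9].*")) {
                tokens.add(chunk);
            }
        }
        return tokens;
    }

    // One token prints as field:token, several tokens as a quoted phrase.
    public static String render(String field, List<String> tokens) {
        if (tokens.size() == 1) {
            return field + ":" + tokens.get(0);
        }
        return field + ":\"" + String.join(" ", tokens) + "\"";
    }

    public static void main(String[] args) {
        System.out.println(render("key", classicLike("xyz_abc")));
        System.out.println(render("key", classicLike("xyz_1abc")));
        System.out.println(render("key", uax29Like("xyz_abc")));
    }
}
```

Under these assumptions, the classic-like rule splits xyz_abc into two tokens (hence the phrase query) but keeps xyz_1abc whole, while the UAX #29-like rule keeps xyz_abc as a single term.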


Best
Erick


Re: Issue with StandardAnalyzer which splits single word with _(Lucene Version: 3.0)

Posted by govind bhardwaj <go...@gmail.com>.

Hi Srinivas,

It works with the latest Lucene version, 3.3.0 (in fact, with 3.1 and
later). StandardAnalyzer splits the text into tokens and drops a set of
stop words such as "is" and "in".

The class documentation of StandardAnalyzer in the Lucene 3.3.0 API
states:
"As of 3.1, StandardTokenizer implements Unicode text segmentation, and
StopFilter correctly handles Unicode 4.0 supplementary characters in
stopwords." I guess that takes care of the underscore character now.

So I suggest switching to the latest version for better
performance and functionality. Hope that helps.

Regards,
Govind
