Posted to java-user@lucene.apache.org by li...@alphamatrix.org on 2012/06/25 03:40:54 UTC

how to remove the dash

Hi,

I have strings like "drinks - water", and I've read in "Lucene in
Action" that StandardAnalyzer and other analyzers remove the "-" from
the string, but so far none of them has worked. All of them change my
string to something like "drinks -water", so the "-" is used as a
"prohibit operator", and this is a BIG problem for me.

I'm using Lucene 3.6.
I'm also using my own Analyzer, Filters and a Tokenizer based on
StandardTokenizer, with changes to the Flex file to remove some other
stuff.

How can I remove the "-"?

Thanks
xpete

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: how to remove the dash

Posted by Jack Krupansky <ja...@basetechnology.com>.
Oops... I was mistaken to suggest that "a simple term query" would invoke
the field analyzer - it passes the literal text without invoking any field
analyzer.

-- Jack Krupansky

-----Original Message----- 
From: Jack Krupansky
Sent: Monday, June 25, 2012 10:14 PM
To: java-user@lucene.apache.org
Subject: Re: how to remove the dash

Most query parsers will "parse" a leading hyphen as an operator, so it will
never get to the analyzer for any field. Whether white space is permitted
between the "-" operator and the following term is dependent on the specific
query parser, and not guaranteed.

So, "bebidas - agua" is parsed by the query parser the same as
"bebidas -agua", which is the "prohibit" operator. This is all as it should
be.

Generally, all operators, including "+", "-", parentheses, "AND", "OR", etc.,
need to be escaped if you want them to be passed through to the field
analyzers. Operators embedded within terms do not need to be escaped, except
for parentheses.

So, if you want user input to be treated as raw English text, as opposed to
a "structured" query, be sure to filter or escape the user query text before
parsing it. Or, consider using a simple term query that does no query
"parsing", but does pass the term through the field analyzer for the desired
field type.

-- Jack Krupansky


Re: how to remove the dash

Posted by Jack Krupansky <ja...@basetechnology.com>.
Most query parsers will "parse" a leading hyphen as an operator, so it will 
never get to the analyzer for any field. Whether white space is permitted 
between the "-" operator and the following term is dependent on the specific 
query parser, and not guaranteed.

So, "bebidas - agua" is parsed by the query parser the same as 
"bebidas -agua", which is the "prohibit" operator. This is all as it should 
be.

Generally, all operators, including "+", "-", parentheses, "AND", "OR", etc.,
need to be escaped if you want them to be passed through to the field
analyzers. Operators embedded within terms do not need to be escaped, except
for parentheses.

So, if you want user input to be treated as raw English text, as opposed to 
a "structured" query, be sure to filter or escape the user query text before 
parsing it. Or, consider using a simple term query that does no query 
"parsing", but does pass the term through the field analyzer for the desired 
field type.

-- Jack Krupansky
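
The "filter or escape the user query text before parsing it" advice above can be sketched roughly as follows. This is a standalone Java illustration with no Lucene dependency, so it runs on its own. Lucene's QueryParser offers a static escape(String) method for this purpose; the class name QuerySyntaxEscaper and the character list used here are assumptions made for illustration and should be checked against the query parser actually in use. Word operators such as AND and OR are not handled by this sketch.

```java
// Hypothetical helper: backslash-escapes characters that a Lucene-style
// query parser would otherwise read as query syntax.
final class QuerySyntaxEscaper {

    // Assumed set of query-syntax characters; verify it for your parser version.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    static String escape(String input) {
        StringBuilder sb = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\'); // prefix each special character with a backslash
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: bebidas \- agua
        System.out.println(escape("bebidas - agua"));
    }
}
```

With the hyphen escaped, the parser should pass it on to the field analyzer as ordinary text instead of reading it as the prohibit operator.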

-----Original Message----- 
From: listas@alphamatrix.org
Sent: Monday, June 25, 2012 4:12 PM
To: java-user@lucene.apache.org
Subject: Re: how to remove the dash

More information...
If I change
System.out.println("Query: " + query.toString("contents"));
to this:
System.out.println("Query: " + query.toString());
I get this result:
"Query: contents:bebidas -contents:agua"

As I have already tried many different Analyzers and always get the same
result, maybe it's a problem in the query parser?



Re: how to remove the dash

Posted by li...@alphamatrix.org.
More information...
If I change
System.out.println("Query: " + query.toString("contents"));
to this:
System.out.println("Query: " + query.toString());
I get this result:
"Query: contents:bebidas -contents:agua"

As I have already tried many different Analyzers and always get the same
result, maybe it's a problem in the query parser?


On Monday, 25 June 2012 21:10:02, listas@alphamatrix.org wrote:
> You are right... I'm not getting the hyphen inside any token... but it is
> still used as a "prohibit operator".
> 
> This is my output:
> Test: bebidas - agua
> Query: bebidas -agua
> Tokens:
> 1: [bebidas:0->7:<ALPHANUM>]
> 2: [agua:10->14:<ALPHANUM>]
> 
> Test is the original string.
> Thanks



Re: how to remove the dash

Posted by li...@alphamatrix.org.
You are right... I'm not getting the hyphen inside any token... but it is
still used as a "prohibit operator".

This is my output:
Test: bebidas - agua
Query: bebidas -agua
Tokens:
1: [bebidas:0->7:<ALPHANUM>]
2: [agua:10->14:<ALPHANUM>]

Test is the original string.
Thanks

On Monday, 25 June 2012 19:28:06, Steven A Rowe wrote:
> I added the following to both TestStandardAnalyzer and TestClassicAnalyzer
> in branches/lucene_solr_3_6/, and it passed in both cases:
> 
>   public void testWhitespaceHyphenWhitespace() throws Exception {
>     BaseTokenStreamTestCase.assertAnalyzesTo
>       (a, "drinks - water", new String[]{"drinks", "water"});
>   }
> 
> So I'm not seeing the same behavior as you guys - the hyphen is not part of
> any emitted token.
> 
> Steve


RE: how to remove the dash

Posted by Steven A Rowe <sa...@syr.edu>.
I added the following to both TestStandardAnalyzer and TestClassicAnalyzer in branches/lucene_solr_3_6/, and it passed in both cases:

  public void testWhitespaceHyphenWhitespace() throws Exception {
    BaseTokenStreamTestCase.assertAnalyzesTo
      (a, "drinks - water", new String[]{"drinks", "water"});
  }

So I'm not seeing the same behavior as you guys - the hyphen is not part of any emitted token.

Steve

-----Original Message-----
From: listas@alphamatrix.org [mailto:listas@alphamatrix.org] 
Sent: Monday, June 25, 2012 11:33 AM
To: java-user@lucene.apache.org
Subject: Re: how to remove the dash

On Monday, 25 June 2012 16:10:38, Ian Lea wrote:
> My apologies - you are right.
>
> With both ClassicAnalyzer and StandardAnalyzer, "drinks - water" comes
> out as "drinks -water" whereas "drinks-water" comes out as "drinks
> water", as I'd expected.
>
> I guess this is fixable in JFlex, or I think there is some replace
> tokenizer somewhere that can replace character X with character Y, e.g.
> "-" with " ".  Or pre-process your text/queries with a regexp.  Maybe
> someone else has better ideas.

I guess the same... I'm already using my own Tokenizer (based on
StandardTokenizer) to mark some strings for replacement or removal, and I'm
using one filter to replace them and another filter to remove them. I tried
to do the same with the "-", but it didn't work... I can't even mark the "-".
I'm avoiding pre-processing...
I'm hoping that somebody can tell me what I can change in the
StandardTokenizer JFlex file to change this behavior.

Thanks



Re: how to remove the dash

Posted by li...@alphamatrix.org.
On Monday, 25 June 2012 16:10:38, Ian Lea wrote:
> My apologies - you are right.
>
> With both ClassicAnalyzer and StandardAnalyzer, "drinks - water" comes
> out as "drinks -water" whereas "drinks-water" comes out as "drinks
> water", as I'd expected.
>
> I guess this is fixable in JFlex, or I think there is some replace
> tokenizer somewhere that can replace character X with character Y, e.g.
> "-" with " ".  Or pre-process your text/queries with a regexp.  Maybe
> someone else has better ideas.

I guess the same... I'm already using my own Tokenizer (based on
StandardTokenizer) to mark some strings for replacement or removal, and I'm
using one filter to replace them and another filter to remove them. I tried
to do the same with the "-", but it didn't work... I can't even mark the "-".
I'm avoiding pre-processing...
I'm hoping that somebody can tell me what I can change in the
StandardTokenizer JFlex file to change this behavior.

Thanks



Re: how to remove the dash

Posted by Ian Lea <ia...@gmail.com>.
My apologies - you are right.

With both ClassicAnalyzer and StandardAnalyzer, "drinks - water" comes
out as "drinks -water" whereas "drinks-water" comes out as "drinks
water", as I'd expected.

I guess this is fixable in JFlex, or I think there is some replace
tokenizer somewhere that can replace character X with character Y e.g.
"-" with " ".  Or pre-process your text/queries with a regexp.  Maybe
someone else has better ideas.


--
Ian.
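
The regexp pre-processing idea above could look roughly like the following standalone Java sketch. It replaces every run of hyphens with a space and collapses the resulting whitespace before the text is handed to a query parser. The class name HyphenStripper and the choice to drop all hyphens (including ones inside words, and any intended prohibit operators) are assumptions for illustration, not something confirmed in this thread.

```java
import java.util.regex.Pattern;

// Hypothetical pre-processor: removes hyphens from raw user input so that a
// query parser never sees them as operators.
final class HyphenStripper {

    private static final Pattern HYPHENS = Pattern.compile("-+");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    static String stripHyphens(String input) {
        // Turn each run of hyphens into a single space...
        String noHyphens = HYPHENS.matcher(input).replaceAll(" ");
        // ...then collapse repeated whitespace and trim the ends.
        return SPACES.matcher(noHyphens).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // prints: bebidas agua
        System.out.println(stripHyphens("bebidas - agua"));
    }
}
```

The cleaned string can then be passed to the parser in place of the raw input; whether losing the "-" operator entirely is acceptable depends on the application.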


On Mon, Jun 25, 2012 at 3:35 PM,  <li...@alphamatrix.org> wrote:
> As I said, I've tried StandardAnalyzer (without changes) and
> others (WhitespaceAnalyzer, SimpleAnalyzer, StopAnalyzer).
> Now I've tried ClassicAnalyzer as well... same result.
>
> Code:
>  ClassicAnalyzer analyzer = new ClassicAnalyzer(Version.LUCENE_36);
>  QueryParser parser = new QueryParser(Version.LUCENE_36, "contents", analyzer);
>  Query query = parser.parse(food);
>  System.out.println("Query: " + query.toString("contents"));
>  TopDocs results = searcher.search(query, 10);
>
> Thanks
> xpete


Re: how to remove the dash

Posted by li...@alphamatrix.org.
As I said, I've tried StandardAnalyzer (without changes) and
others (WhitespaceAnalyzer, SimpleAnalyzer, StopAnalyzer).
Now I've tried ClassicAnalyzer as well... same result.

Code:
 ClassicAnalyzer analyzer = new ClassicAnalyzer(Version.LUCENE_36);
 QueryParser parser = new QueryParser(Version.LUCENE_36, "contents", analyzer);
 // Parse the raw input string against the default field "contents"
 Query query = parser.parse(food);
 System.out.println("Query: " + query.toString("contents"));
 TopDocs results = searcher.search(query, 10);

Thanks
xpete

On Monday, 25 June 2012 14:37:37, Ian Lea wrote:
> I'm positive that StandardAnalyzer won't change "drinks - water" to
> "drinks -water".  So it must be something in your code.  Which you
> don't show us.  Best guess is that the changes you've made to the Flex
> file have caused the problem.  If you created your tokenizer by
> copying and modifying StandardTokenizer you could start again or do a
> diff or something.  Good luck.
> 
> 
> --
> Ian.


Re: how to remove the dash

Posted by Ian Lea <ia...@gmail.com>.
I'm positive that StandardAnalyzer won't change "drinks - water" to
"drinks -water".  So it must be something in your code.  Which you
don't show us.  Best guess is that the changes you've made to the Flex
file have caused the problem.  If you created your tokenizer by
copying and modifying StandardTokenizer you could start again or do a
diff or something.  Good luck.


--
Ian.

