Posted to java-user@lucene.apache.org by li...@alphamatrix.org on 2012/06/25 03:40:54 UTC
how to remove the dash
hi
I have strings like "drinks - water", and I've read in "Lucene in
Action" that StandardAnalyzer and other analyzers remove the "-" from
the string, but so far none of them has worked. All of them change my
string to something like "drinks -water", so the "-" is used as a
"prohibit operator", and this is a BIG problem for me.
I'm using Lucene 3.6.
I'm also using my own Analyzer, Filters and a Tokenizer based on
StandardTokenizer, with changes to the JFlex file to remove some other
stuff.
How can I remove the "-"?
Thanks
xpete
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: how to remove the dash
Posted by Jack Krupansky <ja...@basetechnology.com>.
Oops... I was mistaken to suggest that "a simple term query" would invoke
the field analyzer. A term query passes the literal text through without
invoking any field analyzer.
-- Jack Krupansky
Re: how to remove the dash
Posted by Jack Krupansky <ja...@basetechnology.com>.
Most query parsers will "parse" a leading hyphen as an operator, so it will
never get to the analyzer for any field. Whether white space is permitted
between the "-" operator and the following term is dependent on the specific
query parser, and not guaranteed.
So, "bebidas - agua" is parsed by the query parser the same as
"bebidas -agua", which is the "prohibit" operator. This is all as it should
be.
Generally, all operators, including "+", "-", parentheses, "AND", "OR", etc.,
need to be escaped if you want them to be passed through to the field
analyzers. Operators embedded within terms do not need to be escaped, except
for parentheses.
So, if you want user input to be treated as raw English text, as opposed to
a "structured" query, be sure to filter or escape the user query text before
parsing it. Or, consider using a simple term query that does no query
"parsing", but does pass the term through the field analyzer for the desired
field type.
-- Jack Krupansky
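The escaping step described above can be sketched without any Lucene dependency. This is only an illustration: the class name QueryTextEscaper and the list of special characters are assumptions made for the example, not something taken from this thread. Lucene's QueryParser also has a static escape(String) method that is meant for this job and is probably the better choice in real code.

```java
// Sketch only: a plain-Java stand-in for escaping query-syntax characters
// before the text is handed to a query parser.
// The character list is an assumption based on the classic query syntax.
final class QueryTextEscaper {
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    static String escape(String userText) {
        StringBuilder sb = new StringBuilder(userText.length() + 8);
        for (int i = 0; i < userText.length(); i++) {
            char c = userText.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\'); // prefix each operator character with a backslash
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The free-standing hyphen is escaped, so a parser should no longer
        // treat it as the prohibit operator.
        System.out.println(escape("bebidas - agua"));
    }
}
```

Once escaped, the hyphen should reach the field analyzer as ordinary text, where (going by the analyzer test reported in this thread) it is not emitted as a token. Note that this sketch does not touch keyword operators such as "AND" or "OR".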
Re: how to remove the dash
Posted by li...@alphamatrix.org.
More information...
If I change
System.out.println("Query: " + query.toString("contents"));
to this:
System.out.println("Query: " + query.toString());
I get this result:
"Query: contents:bebidas -contents:agua"
Since I have already tried many different analyzers and always get the
same result, maybe the problem is in the query parser?
Re: how to remove the dash
Posted by li...@alphamatrix.org.
You are right... I'm not getting the hyphen inside any token, but it is still
used as a "prohibit operator".
This is my output:
Test: bebidas - agua
Query: bebidas -agua
Tokens:
1: [bebidas:0->7:<ALPHANUM>]
2: [agua:10->14:<ALPHANUM>]
Test is the original string.
Thanks
RE: how to remove the dash
Posted by Steven A Rowe <sa...@syr.edu>.
I added the following to both TestStandardAnalyzer and TestClassicAnalyzer in branches/lucene_solr_3_6/, and it passed in both cases:
public void testWhitespaceHyphenWhitespace() throws Exception {
  BaseTokenStreamTestCase.assertAnalyzesTo
    (a, "drinks - water", new String[]{"drinks", "water"});
}
So I'm not seeing the same behavior as you guys - the hyphen is not part of any emitted token.
Steve
Re: how to remove the dash
Posted by li...@alphamatrix.org.
On Monday, 25 June 2012 16:10:38, Ian Lea wrote:
> My apologies - you are right.
>
> With both ClassicAnalyzer and StandardAnalyzer, "drinks - water" comes
> out as "drinks -water" whereas "drinks-water" comes out as "drinks
> water", as I'd expected.
>
> I guess this is fixable in JFlex, or I think there is some replace
> tokenizer somewhere that can replace character X with character Y e.g.
> "-" with " ". Or pre-process your text/queries with a regexp. Maybe
> someone else has better ideas.
I guess the same... I'm already using my own Tokenizer (based on
StandardTokenizer) to mark some strings for replacement or removal,
and I'm using one filter to replace them and another filter to remove
them. I tried to do the same with the "-", but it didn't work: I can't
even mark the "-".
I'm avoiding pre-processing...
I'm hoping somebody can tell me what I can change in the
StandardTokenizer JFlex file to change this behavior.
Thanks
Re: how to remove the dash
Posted by Ian Lea <ia...@gmail.com>.
My apologies - you are right.
With both ClassicAnalyzer and StandardAnalyzer, "drinks - water" comes
out as "drinks -water" whereas "drinks-water" comes out as "drinks
water", as I'd expected.
I guess this is fixable in JFlex, or I think there is some replace
tokenizer somewhere that can replace character X with character Y e.g.
"-" with " ". Or pre-process your text/queries with a regexp. Maybe
someone else has better ideas.
--
Ian.
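The regexp pre-processing idea mentioned above could look roughly like the following. It is a minimal sketch under stated assumptions: the class name HyphenPreprocessor and the exact pattern are made up for illustration, and it only removes hyphens that stand alone between whitespace (or at the start or end of the text), so "drinks-water" and an intentional "drinks -water" are left untouched.

```java
import java.util.regex.Pattern;

// Sketch only: strips hyphens that stand alone between whitespace,
// so that a query parser never sees them as the prohibit operator.
final class HyphenPreprocessor {
    // One or more hyphens preceded by start-of-text or whitespace
    // and followed by whitespace or end-of-text.
    private static final Pattern STANDALONE_HYPHEN =
            Pattern.compile("(^|\\s)-+(?=\\s|$)");
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    static String stripStandaloneHyphens(String text) {
        // Keep the leading whitespace (group 1), drop the hyphen run.
        String withoutHyphens = STANDALONE_HYPHEN.matcher(text).replaceAll("$1");
        // Collapse the whitespace left behind and trim the ends.
        return WHITESPACE_RUN.matcher(withoutHyphens).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripStandaloneHyphens("drinks - water"));
        System.out.println(stripStandaloneHyphens("drinks-water"));
    }
}
```

The cleaned string would then be passed to the query parser in place of the raw user input. Whether a leading "-" directly attached to a word should also be stripped depends on whether users are expected to type prohibit operators on purpose.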
Re: how to remove the dash
Posted by li...@alphamatrix.org.
As I said, I've tried StandardAnalyzer (without changes) and
others (WhitespaceAnalyzer, SimpleAnalyzer, StopAnalyzer).
Now I've tried ClassicAnalyzer as well... same result.
Code:
ClassicAnalyzer analyzer = new ClassicAnalyzer(Version.LUCENE_36);
QueryParser parser = new QueryParser(Version.LUCENE_36, "contents", analyzer);
Query query = parser.parse(food);
System.out.println("Query: " + query.toString("contents"));
TopDocs results = searcher.search(query, 10);
Thanks
xpete
Re: how to remove the dash
Posted by Ian Lea <ia...@gmail.com>.
I'm positive that StandardAnalyzer won't change "drinks - water" to
"drinks -water". So it must be something in your code. Which you
don't show us. Best guess is that the changes you've made to the Flex
file have caused the problem. If you created your tokenizer by
copying and modifying StandardTokenizer you could start again or do a
diff or something. Good luck.
--
Ian.