Posted to solr-user@lucene.apache.org by Jay Luker <lb...@reallywow.com> on 2011/11/28 18:01:58 UTC

PatternTokenizer failure

Hi all,

I'm trying to use PatternTokenizer and am not getting the expected results.
I'm not sure where the failure lies. What I'm trying to do is split my
input on whitespace, except where the whitespace is preceded by a hyphen
character. To do this I'm using a negative lookbehind assertion in the
pattern, e.g. "(?<!-)\s+".

Expected behavior:
"foo bar" -> ["foo","bar"] - OK
"foo \n bar" -> ["foo","bar"] - OK
"foo- bar" -> ["foo- bar"] - OK
"foo-\nbar" -> ["foo-\nbar"] - OK
"foo- \n bar" -> ["foo- \n bar"] - FAILS

Here's a test case that demonstrates the failure:

	public void testPattern() throws Exception {
		Map<String,String> args = new HashMap<String, String>();
		args.put( PatternTokenizerFactory.GROUP, "-1" );
		args.put( PatternTokenizerFactory.PATTERN, "(?<!-)\\s+" );
		Reader reader = new StringReader(
				"blah \n foo bar- baz\nfoo-\nbar- baz foo- \n bar");
		PatternTokenizerFactory tokFactory = new PatternTokenizerFactory();
		tokFactory.init( args );
		TokenStream stream = tokFactory.create( reader );
		assertTokenStreamContents(stream, new String[] {
				"blah", "foo", "bar- baz", "foo-\nbar- baz", "foo- \n bar" });
	}

This fails with the following output:
"org.junit.ComparisonFailure: term 4 expected:<foo- [\n bar]> but was:<foo- []>"

Am I doing something wrong? Incorrect expectations? Or could this be a bug?

Thanks,
--jay

Re: PatternTokenizer failure

Posted by Jay Luker <lb...@reallywow.com>.
On Tue, Nov 29, 2011 at 9:37 AM, Michael Kuhlmann <ku...@solarier.de> wrote:
> Jay,
> I think the problem is this:
>
> You're checking that the character preceding a run of one or more
> whitespace characters is not a hyphen.
>
> However, when there is more than one whitespace character, like this:
> "foo- \n bar"
> then there's another run of whitespace - "\n " - which is preceded by the
> first whitespace character - " ".
>
> Therefore, you'll need to check not only for a preceding hyphen, but also
> for preceding whitespace.
>
> I'll leave this as an exercise for you. ;)
>
> -Kuli

Just for the sake of closure, you were correct. I needed to update the
regex to include a whitespace character in the negative look-behind,
i.e., "(?<![-\s])\s+".
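
For anyone who finds this thread later, here is a minimal sketch of the updated pattern run against the input string from the original test case. It uses plain java.util.regex rather than PatternTokenizerFactory, on the assumption that splitting with Pattern.split behaves like the tokenizer with group -1 for this input; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class FixedLookbehindSplit {
    // Whitespace run that is preceded by neither a hyphen nor whitespace.
    // Excluding whitespace stops a match from starting in the middle of a run.
    static final Pattern FIXED = Pattern.compile("(?<![-\\s])\\s+");

    // Hypothetical helper: split the input on the pattern above.
    static String[] split(String input) {
        return FIXED.split(input);
    }

    public static void main(String[] args) {
        String input = "blah \n foo bar- baz\nfoo-\nbar- baz foo- \n bar";
        for (String token : split(input)) {
            // Show newlines as a visible escape so each token stays on one line.
            System.out.println("[" + token.replace("\n", "\\n") + "]");
        }
    }
}
```

With this pattern the last piece stays together as "foo- \n bar", which is what the test case expected.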

Thanks,
--jay

Re: PatternTokenizer failure

Posted by Michael Kuhlmann <ku...@solarier.de>.
On 29.11.2011 15:20, Erick Erickson wrote:
> Hmmm, I tried this in straight Java, no Solr/Lucene involved and the
> behavior I'm seeing is that no example works if it has more than
> one whitespace character after the hyphen, including your failure
> example.
>
> I haven't lived inside regexes long enough to know what
> the right regex should be, but it doesn't appear to be a Solr problem.

Jay,
I think the problem is this:

You're checking that the character preceding a run of one or more
whitespace characters is not a hyphen.

However, when there is more than one whitespace character, like this:
"foo- \n bar"
then there's another run of whitespace - "\n " - which is preceded by
the first whitespace character - " ".

Therefore, you'll need to check not only for a preceding hyphen, but also
for preceding whitespace.
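
The point above can be made visible by listing where the original pattern actually matches. This is only a sketch using plain java.util.regex, with no Solr or Lucene involved; the class and helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindMatchPositions {
    // Hypothetical helper: collect the [start, end) offsets of every match.
    static List<int[]> matches(String regex, String input) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            spans.add(new int[] { m.start(), m.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        // Offsets: f=0 o=1 o=2 -=3 ' '=4 '\n'=5 ' '=6 b=7 a=8 r=9
        String input = "foo- \n bar";
        for (int[] span : matches("(?<!-)\\s+", input)) {
            System.out.println(span[0] + ".." + span[1]); // prints 5..7
        }
    }
}
```

The match cannot start at offset 4 because a hyphen precedes it, but it can start at offset 5, where the preceding character is the space.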

I'll leave this as an exercise for you. ;)

-Kuli

Re: PatternTokenizer failure

Posted by Erick Erickson <er...@gmail.com>.
Hmmm, I tried this in straight Java, no Solr/Lucene involved and the
behavior I'm seeing is that no example works if it has more than
one whitespace character after the hyphen, including your failure
example.

I haven't lived inside regexes long enough to know what
the right regex should be, but it doesn't appear to be a Solr problem.
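
A check along these lines can be sketched in plain Java as follows. It is not necessarily the code that was run here; it assumes that Pattern.split is a fair stand-in for the tokenizer's split mode, and the class and method names are made up.

```java
import java.util.regex.Pattern;

public class LookbehindSplitDemo {
    // Original pattern from the thread: whitespace not preceded by a hyphen.
    static final Pattern ORIGINAL = Pattern.compile("(?<!-)\\s+");

    // Hypothetical helper: split the input on the original pattern.
    static String[] split(String input) {
        return ORIGINAL.split(input);
    }

    public static void main(String[] args) {
        String[] inputs = { "foo bar", "foo \n bar", "foo- bar", "foo-\nbar", "foo- \n bar" };
        for (String input : inputs) {
            StringBuilder line = new StringBuilder();
            for (String piece : split(input)) {
                // Show newlines as a visible escape so the output stays readable.
                line.append('[').append(piece.replace("\n", "\\n")).append(']');
            }
            System.out.println(line);
        }
    }
}
```

For the last input this yields the two pieces "foo- " and "bar", which lines up with the failing term in the JUnit output.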

Sorry I can't be more helpful.
Erick
