Posted to oro-user@jakarta.apache.org by "Daniel F. Savarese" <df...@savarese.org> on 2004/04/04 22:04:33 UTC

Re: splitting a search string into tokens

In message <OC...@mulework.com>, "Robert Taylor"
 writes:
>I need to parse the search string into tokens in the manner that search
>engines would.

Lexical analysis (i.e., tokenization) and parsing are two separate activities.
Sometimes you can get away with combining the two, but you'll find you can
only do so much with split.  Define a regular expression for each of
your tokens and consume the input matching against each in a specified
order.  In your case, the tokens appear to be \s+ (i.e., the separator,
which would be discarded), \S+, and "[^"]+"?.  You have to test for the last
token (the quoted string) first to avoid misidentifying the whitespace inside
quotes as a separator.  It so happens that you can
manage this with split.  You appear to have almost gotten there already with:

>Here is what I've tried (but it doesn't cover escaping metacharacters
>which might be in the search string):
>
>    /"(.*?)"|(\w+)/
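
The "one regular expression per token, tried in a fixed order" approach
described above might be sketched as follows.  This is only an
illustration under stated assumptions: it uses java.util.regex from the
standard library rather than ORO so that it runs without extra jars, and
the class name SearchTokenizer and method tokenize are hypothetical.  The
three patterns are the ones named above, with the quoted form tested first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ORO or the original thread.
class SearchTokenizer {
    // Token patterns from the message: "[^"]+"?, \s+ and \S+.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]+)\"?");
    private static final Pattern SPACE = Pattern.compile("\\s+");
    private static final Pattern WORD = Pattern.compile("\\S+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<String>();
        Matcher quoted = QUOTED.matcher(input);
        Matcher space = SPACE.matcher(input);
        Matcher word = WORD.matcher(input);
        int pos = 0;
        int end = input.length();
        while (pos < end) {
            // lookingAt() only matches at the start of the region, so each
            // pattern is tested against the input at the current position.
            if (quoted.region(pos, end).lookingAt()) {
                tokens.add(quoted.group(1)); // phrase without its quotes
                pos = quoted.end();
            } else if (space.region(pos, end).lookingAt()) {
                pos = space.end();           // separator: discard
            } else if (word.region(pos, end).lookingAt()) {
                tokens.add(word.group());
                pos = word.end();
            } else {
                break; // not reachable: every character is \s or \S
            }
        }
        return tokens;
    }
}
```

With this sketch, SearchTokenizer.tokenize("apache \"jakarta oro\"  regex")
yields the three tokens apache, jakarta oro, and regex.  A quote that opens
in the middle of a word (foo"bar) is swallowed by \S+, which may or may not
be what a search box should do.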

I don't understand where your search string and escaped metacharacters
enter the picture.  If you need to escape metacharacters in a string,
use Perl5Compiler.quotemeta.  I hope that helps.
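
For readers without ORO on the classpath, the idea of quoting
metacharacters can be illustrated with the standard library's
Pattern.quote.  This is a stand-in for Perl5Compiler.quotemeta, not the
same call, and the exact form of the quoted string may differ between the
two.  The class name LiteralSearchTerm and method literal are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper, shown with java.util.regex instead of ORO.
class LiteralSearchTerm {
    // Compiles a pattern that matches the user's term character for
    // character, so '.', '*', '(' and friends lose their special meaning.
    static Pattern literal(String userTerm) {
        return Pattern.compile(Pattern.quote(userTerm));
    }
}
```

A term such as a.b* then matches only the literal text a.b* and not, say,
aXbbb.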

daniel



---------------------------------------------------------------------
To unsubscribe, e-mail: oro-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: oro-user-help@jakarta.apache.org


RE: splitting a search string into tokens

Posted by Robert Taylor <rt...@mulework.com>.
Daniel, 

> Lexical analysis (i.e., tokenization) and parsing are two separate activities.
I know little about either.

> Sometimes you can get away with combining the two, but you'll find you can
> only do so much with split.  
I started coming to that conclusion as well.

> Define a regular expression for each of
> your tokens and consume the input matching against each in a specified
> order.
Since my knowledge of regexps is slim to nil at best, I hadn't heard back from
anyone on this list, and my Googling produced only "almost" solutions, I was
going to brute force it: extract all quoted strings and then split whatever
was left on whitespace.
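
A rough sketch of that brute-force plan, for comparison.  It assumes
java.util.regex rather than ORO, treats a phrase as anything between a
pair of double quotes, and uses the hypothetical names BruteForceTokenizer
and tokenize.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the two-pass approach.
class BruteForceTokenizer {
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = QUOTED.matcher(input);
        StringBuffer rest = new StringBuffer();
        // Pass 1: pull out quoted phrases, leaving a space in their place.
        while (m.find()) {
            if (!m.group(1).isEmpty()) {
                tokens.add(m.group(1));
            }
            m.appendReplacement(rest, " ");
        }
        m.appendTail(rest);
        // Pass 2: split whatever is left on whitespace.
        for (String word : rest.toString().trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(word);
            }
        }
        return tokens;
    }
}
```

Note that this version returns every phrase before any single word, so the
original order of the terms is lost; consuming the input token by token
keeps it.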


> token first to avoid misidentifying whitespace.  It so happens that you can
> manage this with split.  You appear to have almost gotten there already with:
> 
> >Here is what I've tried (but it doesn't cover escaping metacharacters
> >which might be in the search string):
> >
> >    /"(.*?)"|(\w+)/
I didn't come up with this myself; I found it on the web from someone who
was trying to solve the same problem. Call it lazy, but if you've
been in the Windoz world and aren't exposed to regular expressions often, then
they (and the rules involved) are quite intimidating. My problem seemed so
common (parsing a search string into tokens), and someone mentioned trying
String.split(), which uses regexps, which is why I ended up here.



Thanks for the advice and help.

robert
