Posted to dev@commons.apache.org by Stephen Colebourne <sc...@btopenworld.com> on 2004/08/28 14:04:30 UTC

[lang] Tokenizer

I took a look at Tokenizer this morning and fixed some issues. Others need
some discussion...

a) Tokenizer has constructors and methods that take a char[] to process as
the source text. This char[] is cloned as it is input. It seems that the
main reason why someone would use the char[] method as opposed to String is
to get faster performance and avoid cloning.

I propose that the cloning be removed. The class becomes less thread-safe, but
then it shouldn't be shared between threads anyway.
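
To make the trade-off in (a) concrete, here is a minimal sketch. The two holder classes are hypothetical stand-ins, not the real Tokenizer; they only show what a caller can observe when the input array is cloned versus shared.

```java
// Hypothetical classes for illustration only; not the Tokenizer source.
final class CloneVsShareSketch {

    /** Copies the caller's array on input, so later changes are not seen. */
    static final class CloningSource {
        private final char[] chars;

        CloningSource(char[] input) {
            this.chars = input.clone(); // one extra allocation and copy
        }

        char charAt(int index) {
            return chars[index];
        }
    }

    /** Keeps the caller's array as is, so later changes are visible. */
    static final class SharingSource {
        private final char[] chars;

        SharingSource(char[] input) {
            this.chars = input; // no copy; caller and holder share state
        }

        char charAt(int index) {
            return chars[index];
        }
    }

    /** Mutates the input after construction and reports what each holder sees. */
    static String demo() {
        char[] input = "a,b".toCharArray();
        CloningSource cloned = new CloningSource(input);
        SharingSource shared = new SharingSource(input);
        input[0] = 'z'; // caller modifies the array afterwards
        return "" + cloned.charAt(0) + shared.charAt(0);
    }
}
```

Without the clone, a caller that keeps modifying the array (from the same thread or another one) changes what is tokenized, which is the loss of safety mentioned above.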


b) Tokenizer uses a Matcher to spot characters. It seems like this could be
too restrictive; what if you want a String delimiter?

I propose to change Matcher to be
   int isMatch(char[] text, int textLen, int pos)
Matcher implementations can then check against a string, or could even do
context-based tests by looking backwards/forwards in the text. PS. I
have coded this, and it does work.
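
A sketch of how the signature in (b) could be used. Only the isMatch signature comes from the proposal; the meaning of the int result (number of characters matched, 0 for no match) is an assumption, and StringMatcher, UnescapedCommaMatcher and split are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class MatcherSketch {

    /** Proposed shape. Assumption: returns the match length, or 0 if none. */
    interface Matcher {
        int isMatch(char[] text, int textLen, int pos);
    }

    /** Matches a whole String delimiter starting at pos. */
    static final class StringMatcher implements Matcher {
        private final char[] delim;

        StringMatcher(String delim) {
            this.delim = delim.toCharArray();
        }

        public int isMatch(char[] text, int textLen, int pos) {
            if (delim.length == 0 || pos + delim.length > textLen) {
                return 0;
            }
            for (int i = 0; i < delim.length; i++) {
                if (text[pos + i] != delim[i]) {
                    return 0;
                }
            }
            return delim.length;
        }
    }

    /** Context-based test: a comma counts only if not preceded by a backslash. */
    static final class UnescapedCommaMatcher implements Matcher {
        public int isMatch(char[] text, int textLen, int pos) {
            if (text[pos] != ',') {
                return 0;
            }
            boolean escaped = pos > 0 && text[pos - 1] == '\\';
            return escaped ? 0 : 1;
        }
    }

    /** Tiny driver loop; the real Tokenizer does far more than this. */
    static List<String> split(String source, Matcher matcher) {
        char[] text = source.toCharArray();
        List<String> tokens = new ArrayList<>();
        int start = 0;
        int pos = 0;
        while (pos < text.length) {
            int len = matcher.isMatch(text, text.length, pos);
            if (len > 0) {
                tokens.add(new String(text, start, pos - start));
                pos += len;   // skip the whole delimiter
                start = pos;
            } else {
                pos++;
            }
        }
        tokens.add(new String(text, start, text.length - start));
        return tokens;
    }
}
```

Returning a length rather than a boolean is what lets the driver skip a multi-character delimiter in one step.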

c) Should we add a PairedMatcher to Tokenizer? This would handle strings of
the form a=b,c=d,e=f, returning a, then b, then c, and so on, treating the
first, third, fifth,... delimiter as an equals and the second, fourth,...
delimiter as a comma. Is this a common enough format to warrant a class/method?
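
The alternating behaviour described in (c) could look roughly like this. It is a standalone sketch rather than a PairedMatcher implementation; the class and method names are hypothetical, and it deliberately ignores quoting, trimming and empty-token handling.

```java
import java.util.ArrayList;
import java.util.List;

final class PairedDelimiterSketch {

    /**
     * Splits on two delimiters that must alternate: 'first' ends the
     * 1st, 3rd, 5th... token and 'second' ends the 2nd, 4th, 6th... token.
     */
    static List<String> tokenize(String source, char first, char second) {
        List<String> tokens = new ArrayList<>();
        char expected = first;
        int start = 0;
        for (int i = 0; i < source.length(); i++) {
            if (source.charAt(i) == expected) {
                tokens.add(source.substring(start, i));
                start = i + 1;
                // Swap which delimiter ends the next token.
                expected = (expected == first) ? second : first;
            }
        }
        tokens.add(source.substring(start)); // trailing token
        return tokens;
    }
}
```

Because only the expected delimiter is recognised at each step, an equals sign inside a value (k=a=b) stays part of that value, which a plain two-character delimiter set would not give you.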


I did wonder whether it might be better to create a commons-format at this
point. It could contain Tokenizer and Interpolator. The trouble is what
happens to FastDateFormat or DurationFormat? In the end, I felt it would be
more confusing; we just need to control the formats we create.

Stephen


---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org

Re: [lang] Tokenizer

Posted by Stephen Colebourne <sc...@btopenworld.com>.

Any views on this?
