Posted to dev@commons.apache.org by Jeff Varszegi <jv...@yahoo.com> on 2002/11/19 11:01:54 UTC

[codec] Handling text encodings (one more thing, sorry)

I also think that if there are going to be lots of codecs in the project over time, all the
classes for a particular area should be in subpackages, like the Base64 codec currently is.  That
means that the Metaphone codec etc. should be moved down into a subpackage, and the codec package
should just have the generic stuff.  

You're really getting this insomniac's seventy-five cents' worth tonight. ;O)

-Jeff

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: [codec] Handling text encodings (one more thing, sorry)

Posted by Ola Berg <ol...@ports.se>.
> Maybe we need two concepts:
> 
> A ChunkCodec - like Soundex, Metaphone, Refined Soundex, Message
> digests....
> And a StreamCodec - like Base64, Rot13, compression algorithms, sound
> encoding...

Sun dealt with I/O in a block/stream-neutral way: InputStream provides a stream-oriented view of block-oriented media and vice versa, completely hiding the underlying medium's orientation. I don't like that. OTOH, I do like being able to view everything in a unified way.

Lots of munching algorithms benefit from or need chunks (Base64 for one), so the implementation has to fake a completely stream-based view. Writing the algorithm in a block-oriented way is easy; the problem is interfacing with it.

The problem with a codec interface that takes a String as input is that few algorithms benefit from taking strings of arbitrary length. Different chunk-oriented algorithms work, at the lowest level, with different chunk sizes and types.

Ideally, you want it to be easy to write the codec in whatever way suits you, and then to interface with it in another way. Smells like some kind of adapter task / service provider interface.
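
One way to picture the adapter idea is the rough sketch below. It is not code from this thread or from the codec project: BlockEncoder, Base64BlockEncoder, BlockEncodingOutputStream and StreamAdapterDemo are all hypothetical names, and java.util.Base64 (Java 8 and later) is used only as a convenient stand-in for a block-oriented algorithm. The codec author writes against blocks; the adapter buffers bytes and presents the usual stream API to callers.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical SPI: the codec author only ever sees blocks.
interface BlockEncoder {
    int blockSize();                            // preferred input chunk size in bytes
    byte[] encodeBlock(byte[] block, int len);  // len < blockSize() only for the last block
}

// Block-oriented Base64: every 3 input bytes become 4 output characters.
class Base64BlockEncoder implements BlockEncoder {
    public int blockSize() { return 3; }
    public byte[] encodeBlock(byte[] block, int len) {
        return Base64.getEncoder().encode(Arrays.copyOf(block, len));
    }
}

// Adapter: exposes any BlockEncoder through the stream API.
class BlockEncodingOutputStream extends FilterOutputStream {
    private final BlockEncoder encoder;
    private final byte[] buffer;
    private int filled;

    BlockEncodingOutputStream(OutputStream out, BlockEncoder encoder) {
        super(out);
        this.encoder = encoder;
        this.buffer = new byte[encoder.blockSize()];
    }

    @Override
    public void write(int b) throws IOException {
        buffer[filled++] = (byte) b;
        if (filled == buffer.length) {
            flushBlock();                       // a full block is ready
        }
    }

    private void flushBlock() throws IOException {
        if (filled > 0) {
            out.write(encoder.encodeBlock(buffer, filled));
            filled = 0;
        }
    }

    @Override
    public void close() throws IOException {
        flushBlock();                           // emit the final, possibly short, block
        super.close();
    }
}

// Small usage helper so callers do not have to deal with IOException.
class StreamAdapterDemo {
    static String encodeViaStream(byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream o = new BlockEncodingOutputStream(sink, new Base64BlockEncoder())) {
            o.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(sink.toByteArray(), StandardCharsets.US_ASCII);
    }
}
```

The point of the split is that Base64BlockEncoder never touches a stream, while callers never see a block size; a different chunk-oriented algorithm would only need its own BlockEncoder.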

Suggestions?

/O





RE: [codec] Handling text encodings (one more thing, sorry)

Posted by "O'brien, Tim" <to...@transolutions.net>.
A lot of good points.  Soundex, Metaphone, and Refined Soundex all deal
with language, so it would make more sense to move these classes into
a language subpackage.

With regard to streams, I think a stream-oriented codec makes sense for
something like Base64 - most definitely.  My only question relates to
something like Metaphone or Soundex.  The Soundex algorithm is a
truncated encoding that was primarily developed to encode last names -
for example "O'Brien" or "Varszegi".  It seems like wrapping "O'Brien"
in a StringReader just to get the Soundex code "O165" is overkill.  In
other words, even if I had a 512-character String, I'd still only be
producing a 4-character code (unless I use Refined Soundex).

The only reason I bring that up is that I need to be able to Soundex
about 120,000 strings and populate a ternary search tree in a very
limited time (2-4 seconds).  If I had to insert a "new StringReader()"
into this process, I'd imagine I'd be waiting much longer to create
this index.
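
To make the String-in, String-out shape concrete, here is a minimal sketch of a simplified American Soundex that works directly on a String, with no Reader and no intermediate objects beyond the result. SimpleSoundex is a hypothetical name and this is not the codec project's implementation; it assumes the common rule set in which H and W do not separate letters with the same code, and it ignores anything outside A-Z.

```java
// Hypothetical, simplified American Soundex operating directly on a String.
final class SimpleSoundex {
    // Digit for each letter A..Z; '0' marks letters that produce no digit.
    private static final String CODES = "01230120022455012623010202";

    static String encode(String name) {
        char[] out = {'0', '0', '0', '0'};      // result is zero-padded to 4 characters
        int count = 0;
        char last = '0';
        for (int i = 0; i < name.length() && count < 4; i++) {
            char c = Character.toUpperCase(name.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue;                       // skip apostrophes, spaces, digits
            }
            char code = CODES.charAt(c - 'A');
            if (count == 0) {
                out[count++] = c;               // first letter is kept verbatim
                last = code;
            } else if (c == 'H' || c == 'W') {
                // H and W are skipped and do not reset the previous code
            } else {
                if (code != '0' && code != last) {
                    out[count++] = code;        // new consonant group
                }
                last = code;                    // vowels reset the previous code
            }
        }
        return count == 0 ? "" : new String(out);
    }
}
```

With this sketch, SimpleSoundex.encode("Varszegi") returns "V622", and each call allocates only the four-character buffer and the result String.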

For Soundex, Metaphone, Refined Soundex, I'm more inspired by the
java.security.MessageDigest class. 

Maybe we need two concepts:

A ChunkCodec - like Soundex, Metaphone, Refined Soundex, Message
digests....
And a StreamCodec - like Base64, Rot13, compression algorithms, sound
encoding...
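
As a rough illustration of how the two concepts could differ in shape (a sketch only; the interface signatures and the classes Sha1ChunkCodec, Rot13StreamCodec and TwoConceptsDemo are assumptions, not anything agreed in this thread): the chunk flavour takes the whole input and returns a small result, in the style of java.security.MessageDigest, while the stream flavour transforms data as it flows from an InputStream to an OutputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical: whole input in, short result out (MessageDigest style).
interface ChunkCodec {
    byte[] encode(byte[] input);
}

// Hypothetical: output is produced incrementally as input is read.
interface StreamCodec {
    void encode(InputStream in, OutputStream out) throws IOException;
}

// A message digest fits the chunk shape: any input length, 20 bytes out.
class Sha1ChunkCodec implements ChunkCodec {
    public byte[] encode(byte[] input) {
        try {
            return MessageDigest.getInstance("SHA-1").digest(input);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}

// Rot13 fits the stream shape: one byte in, one byte out, no buffering.
class Rot13StreamCodec implements StreamCodec {
    public void encode(InputStream in, OutputStream out) throws IOException {
        int b;
        while ((b = in.read()) != -1) {
            out.write(rotate(b));
        }
    }

    private static int rotate(int b) {
        if (b >= 'a' && b <= 'z') return 'a' + (b - 'a' + 13) % 26;
        if (b >= 'A' && b <= 'Z') return 'A' + (b - 'A' + 13) % 26;
        return b;                               // non-letters pass through unchanged
    }
}

// Usage helper that runs the stream codec over an in-memory String.
class TwoConceptsDemo {
    static String rot13(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            new Rot13StreamCodec().encode(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.US_ASCII)), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }
}
```

A Soundex-style codec would sit on the ChunkCodec side, possibly with a String-based variant of the interface, so callers indexing many short names never pay for a Reader.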

--------
Tim O'Brien 
Transolutions, Inc.
W 847-574-2143
M 847-863-7045


> -----Original Message-----
> From: Jeff Varszegi [mailto:jvarszegi@yahoo.com] 
> Sent: Tuesday, November 19, 2002 4:02 AM
> To: Jakarta Commons Developers List
> Subject: [codec] Handling text encodings (one more thing, sorry)
> 
> 
> I also think that if there are going to be lots of codecs in 
> the project over time, all the classes for a particular area 
> should be in subpackages, like the Base64 codec currently is. 
>  That means that the Metaphone codec etc. should be moved 
> down into a subpackage, and the codec package should just 
> have the generic stuff.  
> 
> You're really getting this insomniac's seventy-five cents' 
> worth tonight. ;O)
> 
> -Jeff
> 



