Posted to dev@commons.apache.org by Ola Berg <ol...@ports.se> on 2002/11/19 09:03:57 UTC

Re: [codec] Handling text encodings

> > The codec package is very simple.  Right now it contains 3 encoders
> > specifically geared towards language ( Soundex, RefinedSoundex, and
> > Metaphone ).  It also contains a Base64 encoder and decoder.
> >
> > There is only one interface "Encoder" with one method  "public
> > String encode(String pString)".  I think we need another interface
> > "Decoder", with a similarly simple interface "public String decode(String
> > pString)".

Hmm, I see a couple of issues with this.

1) It encodes chunks, and not streams. This is a scalability issue.

2) It is geared towards text. For Bootstring, I need arbitrary symbols.

3) There is no need for another interface with an identical signature. Maybe a Codec class that points to two "coders" (one encoder and one decoder) instead.
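
Point 3 could be sketched roughly as follows. This is only an illustration, not the codec package's API: the names Coder, CodecPair and Base64Coders are made up here, and java.util.Base64 (Java 8+) merely stands in for a real pair of coders.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical: one interface serves both directions, since the signatures match.
interface Coder {
    String code(String input);
}

// Hypothetical: the codec only records which coder plays which role.
final class CodecPair {
    private final Coder encoder;
    private final Coder decoder;

    CodecPair(Coder encoder, Coder decoder) {
        this.encoder = encoder;
        this.decoder = decoder;
    }

    String encode(String s) { return encoder.code(s); }
    String decode(String s) { return decoder.code(s); }
}

final class Base64Coders {
    // Builds a codec out of two independent coders.
    static CodecPair codec() {
        Coder enc = s -> Base64.getEncoder()
                .encodeToString(s.getBytes(StandardCharsets.UTF_8));
        Coder dec = s -> new String(Base64.getDecoder().decode(s),
                StandardCharsets.UTF_8);
        return new CodecPair(enc, dec);
    }
}
```

With this shape, adding a new encoding means writing two Coder instances and pairing them; no second interface is needed.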

For the short term, I think that a Punycode codec will do, and I will of course use Encoder as you have put it.

/O



--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: [codec] Handling text encodings

Posted by Ola Berg <ol...@ports.se>.
> It should work with streams, no doubt about it.  

> I think that there should be two separate
> interfaces-- at least that's what I've usually done in such situations.  

An argument against that would be that both en- and decoding are simply stream transformations. It is the context (or your mind or need at that particular time) that decides whether this is a decoding or encoding transformation. 

In a neutral way, two transformations could be defined, and a third object (the codec) defines that transformation a is encoding while transformation b is decoding.
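
ROT13 makes the point concrete, since the very same stream transformation serves as encoder and decoder. A minimal sketch, assuming ASCII input; Rot13InputStream and Rot13Demo are hypothetical names, built on the standard java.io.FilterInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical name: a neutral stream transformation (ROT13 on ASCII letters).
class Rot13InputStream extends FilterInputStream {
    Rot13InputStream(InputStream in) {
        super(in);
    }

    private static int rot13(int b) {
        if (b >= 'a' && b <= 'z') return 'a' + (b - 'a' + 13) % 26;
        if (b >= 'A' && b <= 'Z') return 'A' + (b - 'A' + 13) % 26;
        return b; // everything else, including -1 (end of stream), passes through
    }

    @Override
    public int read() throws IOException {
        return rot13(super.read());
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = off; i < off + n; i++) {
            buf[i] = (byte) rot13(buf[i]);
        }
        return n;
    }
}

final class Rot13Demo {
    // Pushes a String through the transformation once.
    static String through(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = new Rot13InputStream(new ByteArrayInputStream(bytes))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
            return new String(out.toByteArray(), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling Rot13Demo.through twice returns the original text; whether the first call is "the encoding" is purely a matter of context.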

> (or at least providing interfaces in advance to point the way, so that everything
> will grow nicely together).

Sure, anything we come up with should be able to adapt to common stream handling routines.

What I smell is a generic interface for this kind of transformation, belonging not in codec but in lang.

And while I think about it, I suspect one will end up with something very similar to the stream classes in io (because block handling is as justified as single-symbol handling).

/O





Re: [codec] Handling text encodings (one more thing, sorry)

Posted by Ola Berg <ol...@ports.se>.
> Maybe we need two concepts:
> 
> A ChunkCodec - like Soundex, Metaphone, Refined Soundex, Message
> digests....
> And a StreamCodec - like Base64, Rot13, compression algorithms, sound
> encoding...

Sun dealt with IO in a block/stream-neutral way, where InputStream provides a stream-oriented view of block-oriented media and vice versa. InputStream completely hides the underlying media's orientation. I don't like that. OTOH, I like the possibility of viewing it in a unified way.

Lots of munching algorithms benefit from or need chunks (Base64 for one), so the implementation has to fake a completely stream-based view. Writing the algorithm in a block-oriented way is easy; the problem is interfacing with it.

The problem with a codec interface that takes a String as input is that few algorithms benefit from taking strings of arbitrary length. Different chunk-oriented algorithms work, at the lowest level, with different chunk sizes and types.

Ideally, you want it to be easy to write the codec in a way that suits you, and then to interface with it in another way. Smells like some kind of adapter task / service provider interface.
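
One possible shape for such an adapter, as a sketch only: the codec author implements a small block-oriented interface and a generic adapter supplies the stream-facing side. BlockEncoder, Base64BlockEncoder and StreamAdapter are hypothetical names; java.util.Base64 (Java 8+) is used just to have a real block algorithm at hand.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical SPI: the codec author only ever sees blocks of a preferred size.
interface BlockEncoder {
    int blockSize();                  // preferred input block size in bytes
    byte[] encodeBlock(byte[] block); // shorter than blockSize() only at end of input
}

// Base64 maps 3 input bytes to 4 output characters, so 3 is its natural block size.
final class Base64BlockEncoder implements BlockEncoder {
    @Override
    public int blockSize() { return 3; }

    @Override
    public byte[] encodeBlock(byte[] block) {
        return Base64.getEncoder().encode(block);
    }
}

// Hypothetical adapter: gives any BlockEncoder a stream-shaped front end.
final class StreamAdapter {
    static void encode(BlockEncoder enc, InputStream in, OutputStream out)
            throws IOException {
        byte[] buf = new byte[enc.blockSize()];
        int filled;
        while ((filled = fill(in, buf)) > 0) {
            out.write(enc.encodeBlock(Arrays.copyOf(buf, filled)));
        }
    }

    // Reads until the buffer is full or the stream ends; returns the bytes read.
    private static int fill(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) break;
            total += n;
        }
        return total;
    }

    // Convenience for trying the adapter on a String.
    static String encodeToString(BlockEncoder enc, String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            encode(enc, new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8)), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }
}
```

The adapter guarantees that only the last block can be short, which is what lets the Base64 padding end up in the right place.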

Suggestions?

/O





RE: [codec] Handling text encodings (one more thing, sorry)

Posted by "O'brien, Tim" <to...@transolutions.net>.
A lot of good points.  Soundex, Metaphone, and Refined Soundex all deal
with language, so it would make more sense if these classes were moved
into a language subpackage.

With regard to streams, I think it makes sense for something like
Base64 - most definitely this should be a stream-oriented codec.  My
only question relates to something like Metaphone or Soundex.  The
Soundex algorithm is a truncated encoding that was primarily developed
to encode last names - for example "O'Brien" or "Varszegi".  It seems
like wrapping "O'Brien" in a StringReader just to get the Soundex "O165"
is overkill.  In other words, even if I had a 512-character String, I'm
still only producing a 4-character code ( unless I use Refined Soundex
).

The only reason I bring that up is that I need to be able to Soundex
about 120,000 strings and populate a ternary search tree in a very
limited time ( 2-4 seconds ).  If I had to insert a "new StringReader()"
into this process, I'd imagine that I'd be waiting much longer to create
this index.
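
To illustrate why a plain String-in, String-out call suits this case, here is a simplified sketch of the American Soundex rules. It is not the codec package's Soundex class; SimpleSoundex is a hypothetical name, and the sketch assumes plain A-Z input and skips everything else.

```java
// Hypothetical class; a simplified sketch of the American Soundex rules.
final class SimpleSoundex {
    // Code per letter A..Z: '0' = vowel-like (separates equal codes),
    // '-' = H or W (ignored entirely), '1'..'6' = consonant groups.
    private static final String CODES = "0123012-02245501262301-202";

    static String encode(String name) {
        StringBuilder out = new StringBuilder(4);
        char last = '0';
        for (int i = 0; i < name.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(name.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue; // skips the apostrophe in O'Brien, spaces, digits
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept verbatim
                last = code;
            } else if (code != '-') {
                // a consonant is emitted unless it repeats the previous code
                if (code != '0' && code != last) {
                    out.append(code);
                }
                last = code;
            }
        }
        while (out.length() > 0 && out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }
}
```

Under these rules SimpleSoundex.encode("O'Brien") yields "O165" and SimpleSoundex.encode("Robert") yields "R163". The loop stops as soon as four characters have been produced, so the length of the input beyond that point costs nothing and no Reader is involved.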

For Soundex, Metaphone, Refined Soundex, I'm more inspired by the
java.security.MessageDigest class. 
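
For reference, this is the usage pattern of java.security.MessageDigest that the comparison points at: input can arrive in one call or in several update() calls, and the result is a short fixed-size value either way. DigestStyle is a hypothetical helper name; the choice of SHA-256 is an assumption made only for the sketch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper showing the two ways of feeding a MessageDigest.
final class DigestStyle {
    private static MessageDigest sha256() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    // Whole chunk in a single call.
    static byte[] oneShot(String chunk) {
        return sha256().digest(chunk.getBytes(StandardCharsets.UTF_8));
    }

    // Same input delivered piece by piece, finished with digest().
    static byte[] inPieces(String... pieces) {
        MessageDigest md = sha256();
        for (String piece : pieces) {
            md.update(piece.getBytes(StandardCharsets.UTF_8));
        }
        return md.digest();
    }
}
```

DigestStyle.oneShot("O'Brien") and DigestStyle.inPieces("O'", "Brien") produce the same 32 bytes, which is the property a Soundex-like encoder could borrow: cheap for a single short String, yet not tied to having all input at once.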

Maybe we need two concepts:

A ChunkCodec - like Soundex, Metaphone, Refined Soundex, Message
digests....
And a StreamCodec - like Base64, Rot13, compression algorithms, sound
encoding...

--------
Tim O'Brien 
Transolutions, Inc.
W 847-574-2143
M 847-863-7045


> -----Original Message-----
> From: Jeff Varszegi [mailto:jvarszegi@yahoo.com] 
> Sent: Tuesday, November 19, 2002 4:02 AM
> To: Jakarta Commons Developers List
> Subject: [codec] Handling text encodings (one more thing, sorry)
> 
> 
> I also think that if there are going to be lots of codecs in 
> the project over time, all the classes for a particular area 
> should be in subpackages, like the Base64 codec currently is. 
>  That means that the Metaphone codec etc. should be moved 
> down into a subpackage, and the codec package should just 
> have the generic stuff.  
> 
> You're really getting this insomniac's seventy-five cents' 
> worth tonight. ;O)
> 
> -Jeff


[codec] Handling text encodings (one more thing, sorry)

Posted by Jeff Varszegi <jv...@yahoo.com>.
I also think that if there are going to be lots of codecs in the project over time, all the
classes for a particular area should be in subpackages, like the Base64 codec currently is.  That
means that the Metaphone codec etc. should be moved down into a subpackage, and the codec package
should just have the generic stuff.  

You're really getting this insomniac's seventy-five cents' worth tonight. ;O)

-Jeff



Re: [codec] Handling text encodings

Posted by Jeff Varszegi <jv...@yahoo.com>.
I just wanted to point out that in similar situations, more than one XML API developer has chosen
to force String input to be wrapped in StringReader instances.  This is in a whole programming
area devoted to nothing but processing text!  I don't think it's inappropriate at all, even though
I've seen XML utility methods that take Strings more times than I can remember.  My main point,
though, is that we all may as well code it right the first time for anything that's not trivial. 
You can always wrap codec code that deals with streams to deal with chunks, but not the other way
around.
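
The wrapping direction described here can be sketched in a few lines. StreamCodec, Rot13StreamCodec and ChunkAdapter are hypothetical names, and ROT13 is only a toy stand-in for a real stream codec.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical stream-first codec interface.
interface StreamCodec {
    void encode(Reader in, Writer out) throws IOException;
}

// Toy stream codec: ROT13 over ASCII letters, one character at a time.
final class Rot13StreamCodec implements StreamCodec {
    @Override
    public void encode(Reader in, Writer out) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (c >= 'a' && c <= 'z') {
                c = 'a' + (c - 'a' + 13) % 26;
            } else if (c >= 'A' && c <= 'Z') {
                c = 'A' + (c - 'A' + 13) % 26;
            }
            out.write(c);
        }
    }
}

// The chunk-oriented view is a thin wrapper over the stream-oriented one.
final class ChunkAdapter {
    static String encode(StreamCodec codec, String chunk) {
        StringWriter out = new StringWriter();
        try {
            codec.encode(new StringReader(chunk), out);
        } catch (IOException e) {
            // in-memory readers and writers are not expected to fail
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

The wrapper works for any StreamCodec without knowing anything about the algorithm. Going the other way would require the chunk codec to expose where its chunk boundaries are, which a plain String-to-String method cannot do.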

Jeff



Re: [codec] Handling text encodings

Posted by Jeff Varszegi <jv...@yahoo.com>.
It should work with streams, no doubt about it.  I think that there should be two separate
interfaces-- at least that's what I've usually done in such situations.  You can make a separate
Encoder and Decoder interface, and a Codec interface that extends them both.  That gives lots of
flexibility if you want to include everything in one class.
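
In Java terms the arrangement might look like the sketch below. The String signatures follow the ones quoted in this thread; HexCodec is a hypothetical example implementation, not part of the package.

```java
import java.nio.charset.StandardCharsets;

interface Encoder {
    String encode(String input);
}

interface Decoder {
    String decode(String input);
}

// A codec is simply something that can do both.
interface Codec extends Encoder, Decoder {
}

// Hypothetical example: UTF-8 bytes written as lowercase hex digits.
final class HexCodec implements Codec {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    @Override
    public String encode(String input) {
        byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(DIGITS[(b >> 4) & 0xF]).append(DIGITS[b & 0xF]);
        }
        return sb.toString();
    }

    @Override
    public String decode(String input) {
        if (input.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] bytes = new byte[input.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            int hi = Character.digit(input.charAt(2 * i), 16);
            int lo = Character.digit(input.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit");
            }
            bytes[i] = (byte) ((hi << 4) | lo);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

A one-way algorithm such as Soundex would implement only Encoder, while callers that need just one direction can accept an Encoder or a Decoder and still be handed a full Codec.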

Check out com.sun.image.codec.jpeg; here they have separate encoder and decoder classes.  I read
that stuff a while back and it flavored my thinking.  Now check out the classes in
com.sun.imageio.  Everything is readers and writers.  You may want to think about setting things
up this way, too (or at least providing interfaces in advance to point the way, so that everything
will grow nicely together).

Now, here's one more thing to think about: intermediate encodings.  I had to write some stuff
using IBM machine-translation engines a while back.  I remember thinking how dumb it was that one
needed to install a separate engine for every language pair.  Lots of pairs, as you can guess,
hadn't been implemented yet, but there were presumably thousands of IBM coders hard at work
implementing the n(n-1) engines necessary to supply comprehensive coverage for the world's
languages.  They all had different dictionaries, even.  After that (actually, even before that
time), a lot of focus in the translation-research community was put on translating to an
intermediate form.  Like Microsoft's CLR.  Maybe we can wrassle out (without spending too too
much time) a decent way of arranging that.
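
Applied to codecs, the intermediate-form idea might look like this sketch, where raw bytes play the role of the pivot. With n formats, direct conversion needs one converter per ordered pair, n(n-1) in total, while a pivot needs only two per format, 2n. Format, PlainFormat, HexDigitsFormat, Base64Format and Pivot are hypothetical names; java.util.Base64 requires Java 8+.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical: each format only knows how to reach the shared intermediate form.
interface Format {
    byte[] toIntermediate(String text);
    String fromIntermediate(byte[] raw);
}

final class PlainFormat implements Format {
    public byte[] toIntermediate(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }
    public String fromIntermediate(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }
}

final class Base64Format implements Format {
    public byte[] toIntermediate(String text) {
        return Base64.getDecoder().decode(text);
    }
    public String fromIntermediate(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }
}

final class HexDigitsFormat implements Format {
    public byte[] toIntermediate(String text) {
        if (text.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] raw = new byte[text.length() / 2];
        for (int i = 0; i < raw.length; i++) {
            int hi = Character.digit(text.charAt(2 * i), 16);
            int lo = Character.digit(text.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit");
            }
            raw[i] = (byte) ((hi << 4) | lo);
        }
        return raw;
    }
    public String fromIntermediate(byte[] raw) {
        StringBuilder sb = new StringBuilder(raw.length * 2);
        for (byte b : raw) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }
}

final class Pivot {
    // Any-to-any conversion is composed from two halves via the pivot.
    static String transcode(Format from, Format to, String text) {
        return to.fromIntermediate(from.toIntermediate(text));
    }

    static int directConverters(int n) { return n * (n - 1); }
    static int pivotConverters(int n) { return 2 * n; }
}
```

For ten formats that is 90 direct converters against 20 pivot halves. The trade-off, as with the translation engines, is that the intermediate form has to be rich enough to carry everything each format can express.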

Jeff

--- Ola Berg <ol...@ports.se> wrote:
> > > The codec package is very simple.  Right now it contains 3 encoders
> > > specifically geared towards language ( Soundex, RefinedSoundex, and
> > > Metaphone ).  It also contains a Base64 encoder and decoder.
> > >
> > > There is only one interface "Encoder" with one method  "public
> > > String encode(String pString)".  I think we need another interface
> > > "Decoder", with a similarly simple interface "public String decode(String
> > > pString)".
> 
> Hmm, I see a couple of issues with this.
> 
> 1) It encodes chunks, and not streams. This is a scalability issue.
> 
> 2) It is geared towards text. For Bootstring, I need arbitrary symbols.
> 
> 3) There is no need for another interface with an identical signature. Maybe a Codec class that
> points to two "coders" (one encoder and one decoder) instead.
> 
> For the short term, I think that a Punycode codec will do, and I will of course use Encoder as
> you have put it.
> 
> /O