Posted to dev@commons.apache.org by Brett Henderson <br...@mail15.com> on 2003/11/03 01:39:54 UTC

[codec] Streamable Codec Framework

I just realised I left off "codec" in the subject.  Sorry about
that.

-----Original Message-----
From: Brett Henderson [mailto:bretth@mail15.com] 
Sent: Monday, 3 November 2003 10:47 AM
To: commons-dev@jakarta.apache.org
Subject: Streamable Codec Framework


Hi All,

I noticed Alexander Hvostov's recent email containing streamable
base64 codecs.  Given that the current codec implementations are
oriented around in-memory buffers, is there room for an
alternative codec framework supporting stream functionality?  I
realise the need for streamable codecs may not be that great but
it does seem like a gap in the current library.

I have done some work in this area over the last couple of months
as a small hobby project and have produced a small framework for
streamable codecs.

Some of the goals I was working towards were:
1. No memory allocation during streaming.  This eliminates
garbage collection during large conversions.
2. Pipelineable codecs.  This allows multiple codecs to be chained
together and treated as a single codec, so that a codec such as
base64 can be broken into two components (base64 and line wrapping
codecs).
3. Single OutputStream and InputStream implementations which
utilise codec engines internally.  This eliminates the need to
produce a buffer based engine and a stream engine for every codec.
Note that this requires codec engines to be written in a manner
that supports streaming.
4. Customisable receivers.  All codecs utilise receivers to
handle conversion results.  This allows different outputs such as
streams, in-memory buffers, etc to be supported (a rough sketch
of the engine/receiver split follows below).
5. Direction agnostic codecs.  Decoupling the engine from the
streams allows the engines to be used in ways other than
originally intended, e.g. you can perform base64 encoding
during reads from an InputStream.
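
To make the engine/receiver split concrete, here is a rough sketch of the
kind of interfaces involved.  The names and signatures are illustrative only
and won't necessarily match the code in the zip below:

    import java.io.IOException;

    // An engine never writes anywhere itself; it pushes converted bytes to a
    // Receiver.  The same engine can then sit behind an OutputStream, an
    // InputStream, or another engine in a pipeline.
    interface Receiver {
        /** Accept the next block of converted output. */
        void receive(byte[] buffer, int offset, int length) throws IOException;
    }

    interface CodecEngine {
        /** Convert the supplied bytes, posting the results to the receiver. */
        void encode(byte[] buffer, int offset, int length, Receiver out) throws IOException;

        /** Signal end of input so trailing output (e.g. base64 '=' padding) can be emitted. */
        void finish(Receiver out) throws IOException;
    }

    // Pipelining falls out naturally: a receiver can simply feed another engine.
    class PipelineReceiver implements Receiver {
        private final CodecEngine next;
        private final Receiver out;

        PipelineReceiver(CodecEngine next, Receiver out) {
            this.next = next;
            this.out = out;
        }

        public void receive(byte[] buffer, int offset, int length) throws IOException {
            next.encode(buffer, offset, length, out);
        }
    }

With this shape a base64 codec chained to a line wrapping codec is just a
base64 engine whose receiver is a PipelineReceiver around the line wrapper.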

I have produced base64 and ascii hex codecs as a proof of concept
and to evaluate performance.  They aren't as fast as the
current buffer based codecs and are unlikely ever to be,
due to the extra overheads associated with streaming.
Both the base64 and ascii hex implementations can produce a
data rate of approximately 40MB/sec on a Pentium Mobile
1.5GHz notebook.  With some performance tuning I'm sure this
could be improved; I think array bounds checking is the
largest performance hit.

Currently requires jdk1.4 (exception handling requires rework
for jdk1.3).
Running ant without arguments in the root directory will build
the project, run all unit tests and run performance tests.  Note
that the tests require junit to be available within ant.

Javadocs are the only documentation at the moment.

Files can be found at:
http://www32.brinkster.com/bretthenderson/BHCodec-0.2.zip

I hope someone finds this useful.  I'm not trying to force my
implementation on anybody and I'm sure it could be improved in
many ways.  I'm simply putting it forward as an optional approach.
If it is decided that streamable codecs are a useful addition to
commons I'd be glad to help.

Cheers,
Brett

PS.  Some areas that currently need improving are:
1. Exception handling requires jdk1.4 and should be rewritten
to support older Java versions.
2. BufferReceiver allocates memory continuously during streamed
conversions; it should be fixed to recycle memory buffers.
3. Engines should have a new flush method added to allow them
to hold off posting to receivers until their internal buffers
fill up.  This would prevent fragmented buffers during
pipelined conversions.
4. OutputStream flush needs rework: it shouldn't call finalize,
it should call the new flush method on CodecEngines (see the
sketch below).
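
As a rough illustration of points 3 and 4 (again using the illustrative
interfaces sketched above rather than the exact code in the zip, and assuming
CodecEngine gains the proposed flush(Receiver) method), the output stream
would end up looking something like this:

    import java.io.IOException;
    import java.io.OutputStream;

    // flush() should push partially filled engine buffers through, while only
    // close() ends the conversion (emitting base64 padding and the like).
    class CodecOutputStream extends OutputStream {
        private final CodecEngine engine;
        private final Receiver receiver;

        CodecOutputStream(CodecEngine engine, Receiver receiver) {
            this.engine = engine;
            this.receiver = receiver;
        }

        public void write(int b) throws IOException {
            // A real implementation would buffer single bytes instead of allocating.
            write(new byte[] { (byte) b }, 0, 1);
        }

        public void write(byte[] b, int off, int len) throws IOException {
            engine.encode(b, off, len, receiver);
        }

        public void flush() throws IOException {
            engine.flush(receiver);   // proposed flush: emit buffered output, keep the conversion open
        }

        public void close() throws IOException {
            flush();
            engine.finish(receiver);  // only close() finalises the conversion
        }
    }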



Re: [codec] Streamable Codec Framework

Posted by Tim O'Brien <to...@discursive.com>.
Eek!  I'll have to sheepishly admit that I closed this Bugzilla issue. 
I'll reopen this issue and put it into the bug pile.

But... I consider this outside of Gary's 1.2 RC tag.  


On Mon, 2003-11-10 at 19:04, Ryan Hoegg wrote:
> My apologies as well, I didn't realize you weren't a committer, nor that 
> the MD5 stuff never made it into the release.
> 
> Tim, everyone, now that codec is a released component, might we create a 
> contrib directory or a codec-sandbox?  Chris's MD5 stuff has been done 
> for ages now.
> 
> I am not a commons committer, or I'd take some action myself.
> 
> This was previously tracked in Bug 17091 
> (http://issues.apache.org/bugzilla/show_bug.cgi?id=17091) but was CLOSED 
> with resolution LATER.  See Chris's attachment on 6/12/03.
> 
> --
> Ryan Hoegg
> ISIS Networks
> http://www.isisnetworks.net
> 
> Christopher (siege) O'Brien wrote:
> 
> >Apologies, that was not intended for the entire list. But since it went
> >there, may as well elaborate.
> >
> >The ChunkedInputStream used a call-back system to provide data written
> >to the stream back in consistently-sized chunks (except for the last
> >data written, which would be sized appropriately). This was useful in
> >MD5 for obvious reasons in the streaming implementation. I factored it
> >into its own class because I figured it could also be used in the SHA1
> >implementation that was in the works, and perhaps other registers-based
> >hashes or checksumming codecs.
> >
> >Ryan, the original ChunkedInputStream should be a part of the package I
> >put together for the MD5 package, as you correctly recalled. I had
> >posted a note at one point offering the idea up to the IO folks, but I
> >never got a response on that.
> >
> >
> >- siege
> >
> >On Mon, 2003-11-10 at 19:12, Christopher (siege) O'Brien wrote:
> >  
> >
> >>I don't have CVS access! But you do, and you should have a copy of the
> >>code...
> >>
> >>- siege
> >>    
> >>
> 
> 
> 
-- 
-----------------------------------------------------	
Tim O'Brien - tobrien@discursive.com - (847) 863-7045




Re: [codec] Streamable Codec Framework

Posted by Ryan Hoegg <rh...@isisnetworks.net>.
My apologies as well, I didn't realize you weren't a committer, nor that 
the MD5 stuff never made it into the release.

Tim, everyone, now that codec is a released component, might we create a 
contrib directory or a codec-sandbox?  Chris's MD5 stuff has been done 
for ages now.

I am not a commons committer, or I'd take some action myself.

This was previously tracked in Bug 17091 
(http://issues.apache.org/bugzilla/show_bug.cgi?id=17091) but was CLOSED 
with resolution LATER.  See Chris's attachment on 6/12/03.

--
Ryan Hoegg
ISIS Networks
http://www.isisnetworks.net

Christopher (siege) O'Brien wrote:

>Apologies, that was not intended for the entire list. But since it went
>there, may as well elaborate.
>
>The ChunkedInputStream used a call-back system to provide data written
>to the stream back in consistently-sized chunks (except for the last
>data written, which would be sized appropriately). This was useful in
>MD5 for obvious reasons in the streaming implementation. I factored it
>into its own class because I figured it could also be used in the SHA1
>implementation that was in the works, and perhaps other registers-based
>hashes or checksumming codecs.
>
>Ryan, the original ChunkedInputStream should be a part of the package I
>put together for the MD5 package, as you correctly recalled. I had
>posted a note at one point offering the idea up to the IO folks, but I
>never got a response on that.
>
>
>- siege
>
>On Mon, 2003-11-10 at 19:12, Christopher (siege) O'Brien wrote:
>  
>
>>I don't have CVS access! But you do, and you should have a copy of the
>>code...
>>
>>- siege
>>    
>>





Re: [codec] Streamable Codec Framework

Posted by "Christopher (siege) O'Brien" <si...@preoccupied.net>.
Apologies, that was not intended for the entire list. But since it went
there, may as well elaborate.

The ChunkedInputStream used a call-back system to provide data written
to the stream back in consistently-sized chunks (except for the last
data written, which would be sized appropriately). This was useful in
MD5 for obvious reasons in the streaming implementation. I factored it
into its own class because I figured it could also be used in the SHA1
implementation that was in the works, and perhaps other registers-based
hashes or checksumming codecs.

Ryan, the original ChunkedInputStream should be a part of the package I
put together for the MD5 package, as you correctly recalled. I had
posted a note at one point offering the idea up to the IO folks, but I
never got a response on that.
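
Roughly, the idea is as follows (the names and details here are illustrative
rather than lifted from the attachment):

    import java.io.FilterInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    /** Callback invoked with each fixed-size chunk (the final chunk may be shorter). */
    interface ChunkHandler {
        void handleChunk(byte[] chunk, int length) throws IOException;
    }

    class ChunkedInputStream extends FilterInputStream {
        private final byte[] chunk;
        private final ChunkHandler handler;
        private int filled;

        ChunkedInputStream(InputStream in, int chunkSize, ChunkHandler handler) {
            super(in);
            this.chunk = new byte[chunkSize];
            this.handler = handler;
        }

        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) {
                chunk[filled++] = (byte) b;
                if (filled == chunk.length) {      // a full chunk has accumulated
                    handler.handleChunk(chunk, filled);
                    filled = 0;
                }
            }
            return b;
        }

        // A real implementation must also override read(byte[], int, int) so that
        // bulk reads are chunked as well; omitted here for brevity.

        /** Deliver the trailing, possibly short, chunk once the caller is done. */
        public void close() throws IOException {
            if (filled > 0) {
                handler.handleChunk(chunk, filled);
                filled = 0;
            }
            super.close();
        }
    }

This suits block-oriented digests such as MD5, which consume their input in
fixed 64-byte blocks.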


- siege

On Mon, 2003-11-10 at 19:12, Christopher (siege) O'Brien wrote:
> I don't have CVS access! But you do, and you should have a copy of the
> code...
> 
> - siege
> 
> On Mon, 2003-11-10 at 15:27, Ryan Hoegg wrote:
> > IIRC, Chris O'Brien had a ChunkedInputStream for the MD5 digest code he 
> > put together.  If it's not in CVS, Chris, put it there!
> > 
> > --
> > Ryan Hoegg
> > ISIS Networks
> > http://www.isisnetworks.net
> > 
> > 
-- 
Christopher (siege) O'Brien <si...@preoccupied.net>




Re: [codec] Streamable Codec Framework

Posted by "Christopher (siege) O'Brien" <si...@preoccupied.net>.
I don't have CVS access! But you do, and you should have a copy of the
code...

- siege

On Mon, 2003-11-10 at 15:27, Ryan Hoegg wrote:
> IIRC, Chris O'Brien had a ChunkedInputStream for the MD5 digest code he 
> put together.  If it's not in CVS, Chris, put it there!
> 
> --
> Ryan Hoegg
> ISIS Networks
> http://www.isisnetworks.net
> 
> 
-- 
Christopher (siege) O'Brien <si...@preoccupied.net>




Re: [codec] Streamable Codec Framework

Posted by Ryan Hoegg <rh...@isisnetworks.net>.
IIRC, Chris O'Brien had a ChunkedInputStream for the MD5 digest code he 
put together.  If it's not in CVS, Chris, put it there!

--
Ryan Hoegg
ISIS Networks
http://www.isisnetworks.net




RE: [codec] Streamable Codec Framework

Posted by Brett Henderson <br...@mail15.com>.
I think the design of the codec framework could cover
your requirements but it will require more functionality
than it currently has.

> > > > Some of the goals I was working towards were:
> > > > 1. No memory allocation during streaming.  This eliminates
> > > > garbage collection during large conversions.
> > > Cool. I got large conversions... I'm already at
> > > mediumblob in mysql , and it goes up/down XML
> > > stream
> > > :)
> > 
> > I have a lot to learn here.  While I have some
> > knowledge
> > of XML (like every other developer on the planet), I
> > have never used it for large data sets or used SAX
> > parsing.
> > Sounds like a good test to find holes in the design
> > :-)
> 
> It's easy. You got callback, where you can gobble up
> string buffers with incoming chars for element
> contents.  ( and there is a lot of this stuff... )
> After tag is closed, you have all the chars in a big
> string buffer, and get another callback - in this
> callback you have to convert data, and do whatever
> necessary ( in my case, create input stream, and pass
> it to database ) 

This could be tricky; it's something I've been thinking
about, but I'd like feedback from others about the best
way of going about it.

The data you have available is in character format.
The base64 codec engine operates on byte buffers.
The writer you want to write to requires the data
to be in character format.

I have concentrated on byte processing for now because
it is the most common requirement.  XML processing
requires that characters be used instead.

It makes no sense to perform base64 conversion on
character arrays directly because base64 is only 8-bit
aware (you could split each character into two bytes
but this would blow out the result buffer size where
chars only contain ASCII data).

I think it makes more sense to perform character to
byte conversion separately (perhaps through
extensions to the existing framework) and then perform
base64 encoding on the result.  I guess this is a
UTF-16 to UTF-8 conversion ...

What support is there within the JDK for performing
character to byte conversion?
JDK1.4 has the java.nio.charset package but I can't
see an equivalent for JDK1.3 and lower; they seem to
use com.sun classes internally when charset conversion
is required.

If JDK1.4 is considered a sufficient base, I could
extend the current framework to provide conversion
engines that translate from one data representation
to another.  I could then create a new CodecEngine
interface to handle character buffers (eg.
CodecEngineChar).
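
For what it's worth, the JDK1.4 route looks roughly like this (illustrative
only; CoderResult handling and buffer management are elided):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;

    public class CharToByteSketch {
        public static void main(String[] args) throws Exception {
            // Java chars are UTF-16 internally, so encoding them as UTF-8 produces
            // the byte stream a byte-oriented base64 engine expects.
            CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
            CharBuffer chars = CharBuffer.wrap("text gathered from an XML characters() callback");
            ByteBuffer bytes = ByteBuffer.allocate(8192);

            encoder.encode(chars, bytes, true);   // true: no further input follows
            encoder.flush(bytes);
            bytes.flip();

            // 'bytes' now holds the UTF-8 form, ready to feed to the base64 engine
            // (or to be wrapped up inside something like the CodecEngineChar above).
            System.out.println(bytes.remaining() + " bytes ready for encoding");
        }
    }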


> > > > 4. Customisable receivers.  All codecs utilise
> > > > receivers to
> > > > handle conversion results.  This allows different
> > > > outputs such as
> > > > streams, in-memory buffers, etc to be supported.
> > > 
> > > And writers :) Velocity directives use them.
> > 
> > Do you mean java.io.Writer?  If so I haven't
> > included
> > direct support for them because I focused on raw
> > byte
> > streams.  However it shouldn't be hard to add a
> > receiver to write to java.io.Writer instances.
> 
> 
> My scenarios: 
> - I'm exporting information as base64 to XML with help
> of velocity. I do it through a custom directive - 
> in this directive I get a Writer from velocity, where
> I have to put my data. 
> 
> Ideally codec would do: read input stream - encode -
> put it into writer without allocating too much 
> memory. 
> 
> I'm importing information:
> - I have stream ( string ) of base 64 data - 
> codec gives me an input stream which is fed from this
> source and does not allocate too much memory and
> behaves polite...
> 
The current framework doesn't handle direct conversion
from an input stream to an output stream but this
would be simple to add if required.
Again, the hard part would be the char/byte issues.
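
The pump itself would be little more than a loop over the engine, along the
lines of the sketch below (again using the illustrative CodecEngine/Receiver
shapes from my first mail, so treat the names as placeholders):

    import java.io.IOException;
    import java.io.InputStream;

    class StreamPump {
        // Read the raw input and push it through the engine.  The fixed buffer is
        // reused, so the loop itself allocates nothing per iteration.
        static void pump(InputStream in, CodecEngine engine, Receiver out) throws IOException {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                engine.encode(buffer, 0, read, out);
            }
            engine.finish(out);   // emit any trailing output such as '=' padding
        }
    }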




RE: [codec] Streamable Codec Framework

Posted by Konstantin Priblouda <kp...@yahoo.com>.
> > > Some of the goals I was working towards were:
> > > 1. No memory allocation during streaming.  This
> > > eliminates
> > > garbage collection during large conversions.
> > Cool. I got large conversions... I'm already at
> > mediumblob in mysql , and it goes up/down XML
> > stream
> > :)
> 
> I have a lot to learn here.  While I have some
> knowledge
> of XML (like every other developer on the planet), I
> have never used it for large data sets or used SAX
> parsing.
> Sounds like a good test to find holes in the design
> :-)

It's easy. You got callback, where you can gobble up
string buffers with incoming chars for element
contents.  ( and there is a lot of this stuff... )
After tag is closed, you have all the chars in a big
string buffer, and get another callback - in this
callback you have to convert data, and do whatever
necessary ( in my case, create input stream, and pass
it to database ) 
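
In code, the pattern is roughly this (an illustrative SAX handler, not tied
to any particular codec API):

    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    // characters() may fire many times for one element, so the text is gathered
    // into one buffer and only converted once the element closes.
    class BlobHandler extends DefaultHandler {
        private final StringBuffer text = new StringBuffer();

        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            text.setLength(0);                  // start collecting a new element's content
        }

        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);     // chunks of base64 text arrive here
        }

        public void endElement(String uri, String local, String qName) throws SAXException {
            // All of the element's characters are now in 'text'; this is the point
            // where the base64 data gets decoded and streamed into the database.
        }
    }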


> > > 4. Customisable receivers.  All codecs utilise
> > > receivers to
> > > handle conversion results.  This allows different
> > > outputs such as
> > > streams, in-memory buffers, etc to be supported.
> > 
> > And writers :) Velocity directives use them.
> 
> Do you mean java.io.Writer?  If so I haven't
> included
> direct support for them because I focused on raw
> byte
> streams.  However it shouldn't be hard to add a
> receiver to write to java.io.Writer instances.


My scenarios: 
- I'm exporting information as base64 to XML with help
of velocity. I do it through a custom directive - 
in this directive I get a Writer from velocity, where
I have to put my data. 

Ideally codec would do: read input stream - encode -
put it into writer without allocating too much 
memory. 

I'm importing information:
- I have stream ( string ) of base 64 data - 
codec gives me an input stream which is fed from this
source and does not allocate too much memory and
behaves polite...

regards,

=====
----[ Konstantin Pribluda ( ko5tik ) ]----------------
To strengthen my team I am looking, starting immediately,
for a software developer for a permanent position.
Location: Mainz
Skills: programming, knowledge of the open source area
----[ http://www.pribluda.de ]------------------------



RE: [codec] Streamable Codec Framework

Posted by Brett Henderson <br...@mail15.com>.
> > I noticed Alexander Hvostov's recent email
> > containing streamable
> > base64 codecs.  Given that the current codec
> > implementations are
> > oriented around in-memory buffers, is there room for
> > an
> > alternative codec framework supporting stream
> > functionality?  I
> > realise the need for streamable codecs may not be
> > that great but
> > it does seem like a gap in the current library.
> 
> I'm in the need. So we are at least 3 :) 
> 
> 
> > Some of the goals I was working towards were:
> > 1. No memory allocation during streaming.  This
> > eliminates
> > garbage collection during large conversions.
> Cool. I got large conversions... I'm already at
> mediumblob in mysql , and it goes up/down XML stream
> :)

I have a lot to learn here.  While I have some knowledge
of XML (like every other developer on the planet), I
have never used it for large data sets or used SAX parsing.
Sounds like a good test to find holes in the design :-)

> > 4. Customisable receivers.  All codecs utilise
> > receivers to
> > handle conversion results.  This allows different
> > outputs such as
> > streams, in-memory buffers, etc to be supported.
> 
> And writers :) Velocity directives use them.

Do you mean java.io.Writer?  If so I haven't included
direct support for them because I focused on raw byte
streams.  However it shouldn't be hard to add a
receiver to write to java.io.Writer instances.
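
Something along these lines ought to do it (hypothetical code, assuming the
Receiver shape sketched in my first mail; base64 and hex output are plain
ASCII, so each output byte maps straight onto one char):

    import java.io.IOException;
    import java.io.Writer;

    class WriterReceiver implements Receiver {
        private final Writer writer;
        private final char[] chars = new char[8192];   // reused, so no per-call allocation

        WriterReceiver(Writer writer) {
            this.writer = writer;
        }

        public void receive(byte[] buffer, int offset, int length) throws IOException {
            int done = 0;
            while (done < length) {
                int count = Math.min(chars.length, length - done);
                for (int i = 0; i < count; i++) {
                    // Encoded output is ASCII, so a simple widening copy is safe here.
                    chars[i] = (char) (buffer[offset + done + i] & 0xff);
                }
                writer.write(chars, 0, count);
                done += count;
            }
        }
    }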

> I'll give it a look at and come back later today :) 

I look forward to your feedback.




Re: [codec] Streamable Codec Framework

Posted by Konstantin Priblouda <kp...@yahoo.com>.
> Hi All,

Hi. 

> I noticed Alexander Hvostov's recent email
> containing streamable
> base64 codecs.  Given that the current codec
> implementations are
> oriented around in-memory buffers, is there room for
> an
> alternative codec framework supporting stream
> functionality?  I
> realise the need for streamable codecs may not be
> that great but
> it does seem like a gap in the current library.

I'm in the need. So we are at least 3 :) 


> Some of the goals I was working towards were:
> 1. No memory allocation during streaming.  This
> eliminates
> garbage collection during large conversions.
Cool. I got large conversions... I'm already at
mediumblob in mysql , and it goes up/down XML stream
:)

> 2. Pipelineable codecs.  This allows multiple codecs
> to be chained
> together and treated as a single codec.  This allows
> codecs such as
> base 64 to be broken into two components (base64 and
> line wrapping
> codecs).

Also nice. 

> 3. Single OutputStream, InputStream implementations
> which
> utilise codec engines internally.  This eliminates
> the need to
> produce a buffer based engine and a stream engine
> for every codec.
> Note that this requires codec engines to be written
> in a manner
> that supports streaming.

If a stream based engine is there, it's not a problem to 
work on buffers... Though some codecs with internal
state may be tricky. 


> 4. Customisable receivers.  All codecs utilise
> receivers to
> handle conversion results.  This allows different
> outputs such as
> streams, in-memory buffers, etc to be supported.

And writers :) Velocity directives use them.

> 5. Direction agnostic codecs.  Decoupling the engine
> from the
> streams allows the engines to be used in different
> ways than
> originally intended.  Ie. You can perform base64
> encoding
> during reads from an InputStream.
> 
> I have produced base64 and ascii hex codecs as a
> proof of concept
> and to evaluate performance.  It isn't as fast as
> the current
> buffer based codecs but is unlikely to ever be as
> fast due to the
> extra overheads associated with streaming.
> Both base64 and ascii hex implementations can
> produce a data rate
> of approximately 40MB/sec on a Pentium Mobile 1.5GHz
> notebook.
> With some performance tuning I'm sure this could be
> improved,
> I think array bounds checking is the largest
> performance hit.
> 
> Currently requires jdk1.4 (exception handling
> requires rework
> for jdk1.3).
> Running ant without arguments in the root directory
> will build
> the project, run all unit tests and run performance
> tests.  Note
> that the tests require junit to be available within
> ant.
> 
> Javadocs are the only documentation at the moment.
> 
> Files can be found at:
> http://www32.brinkster.com/bretthenderson/BHCodec-0.2.zip
> 
> I hope someone finds this useful.  I'm not trying to
> force my
> implementation on anybody and I'm sure it could be
> improved in
> many ways.  I'm simply putting it forward as an
> optional approach.
> If it is decided that streamable codecs are a useful
> addition to
> commons I'd be glad to help.

I'll give it a look at and come back later today :)

regards,

=====
----[ Konstantin Pribluda ( ko5tik ) ]----------------
To strengthen my team I am looking, starting immediately,
for a software developer for a permanent position.
Location: Mainz
Skills: programming, knowledge of the open source area
----[ http://www.pribluda.de ]------------------------
