Posted to dev@commons.apache.org by Simon Tyler <st...@mimecast.net> on 2010/03/10 16:31:04 UTC

FW: Compress 1.1 release?

Hi,

Do we have a date yet for the compress 1.1 release? There are a few features
in it that I need to use.

Also, is there time to add a couple of minor feature enhancements? I could
do with access to the following:

1. A public method to check if a ZipArchiveInputStream has a data descriptor
(e.g. return hasDataDescriptor).

2. Better handling when ZipArchiveInputStream is used to read such streams.
Currently it fails silently when it hits an invalid LFH_SIG, returning null.

Cheers
Simon

Re: Compress 1.1 release?

Posted by Stefan Bodewig <bo...@apache.org>.
I got it wrong, ZipArchiveInputStream relied on the data descriptor to
have a signature - it will now deal with entries without one as well.

Stefan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: Compress 1.1 release?

Posted by Stefan Bodewig <bo...@apache.org>.
On 2010-03-12, Simon Tyler <st...@mimecast.net> wrote:

> The file called Word.xps at:

> http://www.wssdemo.com/XPS/Forms/AllItems.aspx

> exhibits the problem.

COMPRESS-103

> You are entirely correct: the entry is STORED, so COMPRESS-100 does the
> trick for me.

Well, it will signal an error but not allow you to skip the entry and go
ahead and read the next one, since the stream doesn't know how many bytes
to skip.

In order to really solve the problem we'd need to fix COMPRESS-103.
I'll be sitting in trains, planes and airports for quite some time
tomorrow, so I stand a chance of getting this done, but I can't promise
anything and wouldn't want to hold up the release for this.

Stefan



Re: Compress 1.1 release?

Posted by Simon Tyler <st...@mimecast.net>.
The file called Word.xps at:

http://www.wssdemo.com/XPS/Forms/AllItems.aspx

exhibits the problem.

You are entirely correct: the entry is STORED, so COMPRESS-100 does the
trick for me.

Simon 



On 12/03/2010 15:17, "Stefan Bodewig" <bo...@apache.org> wrote:

> On 2010-03-12, Simon Tyler <st...@mimecast.net> wrote:
> 
>> If I explain the scenario in more detail then it might become clearer.
> 
>> I am seeing issues with certain zip files and file formats based on zip (such
>> as docx and xps). We are reading these files from a stream, so we are using
>> ZipArchiveInputStream.
> 
>> What I see is that we loop around getting each entry with getNextZipEntry
>> and we get a null and stop. All looks good. However we have only extracted 1
>> or 2 entries out of a known 20 or 30 entries - the file-based extractor
>> extracts all the files.
> 
> Understood.  My guess is that whatever is creating your archives is
> using the optional header to identify data descriptors.  I'll try to
> create one with InfoZIP, can't promise anything, though.
> 
>> I cannot provide an example of a file as the examples I have are all
>> customer owned.
> 
> That's a pity.
> 
>> However every xps file I have seen suffers the issue:
> 
> I just created one using the "Save as XPS" add-in for Word 2007 on a
> "Hello world" document and the stream worked just fine.
> 
>> http://www.microsoft.com/whdc/xps/xpssampdoc.mspx
> 
> I'll take a look later, likely not today.
> 
>> I have investigated the issue and it is caused by entries that use the
>> central directory.
> 
> You mean the data descriptor, right?
> 
>> What happens in the zip stream reader is that the size, csize and crc
>> fields are all zero, there is no central directory available to the
>> reader so it performs no extraction.
> 
> This is not true.  If the archiver works correctly it has set a flag
> that it is going to use a data descriptor after the entry's data.  If
> this flag has been set AND the compression method is DEFLATE, the stream
> can figure out itself where the entry data ends (since DEFLATE marks EOF
> internally).  If the entry data is STORED the stream cannot know where
> the data ends.
> 
> I see several problems while looking through the code:
> 
> * it doesn't verify the method is DEFLATE when a data descriptor is used
>   and it will try to read 0 bytes instead of throwing an exception -
>   this may be causing your problem.  COMPRESS-100
> 
> * the stream just skips over the data descriptor and never reads it - it
>   rather sets size and crc fields from what it has found.  This may be
>   OK since we never check the claimed CRC anyway.
> 
> * the stream skips over exactly four words while the archiver may have
>   used a signature of four bytes.  In that case the stream must skip
>   those extra bytes.  COMPRESS-101
> 
>> So my two change requests are simply to enable me to validate entries and
>> detect these types of stream so I can do something appropriate.
> 
> If I'm correct and you are bitten by what is now COMPRESS-100 then it
> should suffice if canReadEntryData returned false.  Right?
> 
>> The second request is to not return a null when this type of error occurs
>> but indicate the error somehow. There might be issues here (I am no zip
>> expert) but I would be worried about false errors being reported.
> 
> That could be COMPRESS-100 as well.  Or COMPRESS-101 is the problem for
> you, in which case we should be able to fix it.  Or it is yet another
> issue that we can't really identify without a testcase.
> 
> Stefan
> 


Re: Compress 1.1 release?

Posted by Stefan Bodewig <bo...@apache.org>.
On 2010-03-12, Simon Tyler <st...@mimecast.net> wrote:

> If I explain the scenario in more detail then it might become clearer.

> I am seeing issues with certain zip files and file formats based on zip (such
> as docx and xps). We are reading these files from a stream, so we are using
> ZipArchiveInputStream.

> What I see is that we loop around getting each entry with getNextZipEntry
> and we get a null and stop. All looks good. However we have only extracted 1
> or 2 entries out of a known 20 or 30 entries - the file-based extractor
> extracts all the files.

Understood.  My guess is that whatever is creating your archives is
using the optional header to identify data descriptors.  I'll try to
create one with InfoZIP, can't promise anything, though.

> I cannot provide an example of a file as the examples I have are all
> customer owned.

That's a pity.

> However every xps file I have seen suffers the issue:

I just created one using the "Save as XPS" add-in for Word 2007 on a
"Hello world" document and the stream worked just fine.

> http://www.microsoft.com/whdc/xps/xpssampdoc.mspx

I'll take a look later, likely not today.

> I have investigated the issue and it is caused by entries that use the
> central directory.

You mean the data descriptor, right?

> What happens in the zip stream reader is that the size, csize and crc
> fields are all zero, there is no central directory available to the
> reader so it performs no extraction.

This is not true.  If the archiver works correctly it has set a flag
that it is going to use a data descriptor after the entry's data.  If
this flag has been set AND the compression method is DEFLATE, the stream
can figure out itself where the entry data ends (since DEFLATE marks EOF
internally).  If the entry data is STORED the stream cannot know where
the data ends.
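The rule above can be sketched as a small predicate. This is a hypothetical helper for illustration, not part of the Commons Compress API; the method ids and bit-3 flag follow the ZIP appnote.

```java
// Hypothetical helper illustrating the rule above; not Commons Compress code.
public class StreamabilityCheck {
    static final int METHOD_STORED = 0;    // compression method ids per the ZIP appnote
    static final int METHOD_DEFLATED = 8;
    static final int DATA_DESCRIPTOR_FLAG = 1 << 3; // bit 3 of the general purpose bit flag

    /** True if a forward-only reader can find where the entry's data ends. */
    static boolean canStreamEntry(int gpFlag, int method) {
        if ((gpFlag & DATA_DESCRIPTOR_FLAG) == 0) {
            return true; // sizes are present up front in the local file header
        }
        // With a data descriptor, only DEFLATE marks its own end of stream.
        return method == METHOD_DEFLATED;
    }

    public static void main(String[] args) {
        System.out.println(canStreamEntry(0, METHOD_STORED));                     // true
        System.out.println(canStreamEntry(DATA_DESCRIPTOR_FLAG, METHOD_DEFLATED)); // true
        System.out.println(canStreamEntry(DATA_DESCRIPTOR_FLAG, METHOD_STORED));   // false
    }
}
```

The STORED-plus-descriptor combination is the case the stream cannot handle, which is what COMPRESS-100 is about.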

I see several problems while looking through the code:

* it doesn't verify the method is DEFLATE when a data descriptor is used
  and it will try to read 0 bytes instead of throwing an exception -
  this may be causing your problem.  COMPRESS-100

* the stream just skips over the data descriptor and never reads it - it
  rather sets size and crc fields from what it has found.  This may be
  OK since we never check the claimed CRC anyway.

* the stream skips over exactly four words while the archiver may have
  used a signature of four bytes.  In that case the stream must skip
  those extra bytes.  COMPRESS-101
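To make the third bullet concrete: the descriptor body is three little-endian words (CRC-32, compressed size, uncompressed size; 12 bytes in total), optionally preceded by the unofficial signature 0x08074b50 for 16 bytes. A hypothetical sketch of the length decision, not the actual stream code:

```java
// Hypothetical sketch of the data-descriptor skip decision; not Commons Compress code.
public class DataDescriptor {
    static final long DD_SIG = 0x08074b50L; // the "unofficial" signature from the appnote

    /** Bytes to skip, depending on whether the archiver wrote the optional signature. */
    static int descriptorLength(long firstWord) {
        // crc + csize + size = 3 * 4 bytes; 4 more if a signature precedes them.
        // Note: without a signature the first word is the CRC, which could by
        // coincidence equal DD_SIG - part of why streaming ZIP reads are fragile.
        return firstWord == DD_SIG ? 16 : 12;
    }

    public static void main(String[] args) {
        System.out.println(descriptorLength(0x08074b50L)); // 16
        System.out.println(descriptorLength(0x12345678L)); // 12
    }
}
```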

> So my two change requests are simply to enable me to validate entries and
> detect these types of stream so I can do something appropriate.

If I'm correct and you are bitten by what is now COMPRESS-100 then it
should suffice if canReadEntryData returned false.  Right?

> The second request is to not return a null when this type of error occurs
> but indicate the error somehow. There might be issues here (I am no zip
> expert) but I would be worried about false errors being reported.

That could be COMPRESS-100 as well.  Or COMPRESS-101 is the problem for
you, in which case we should be able to fix it.  Or it is yet another
issue that we can't really identify without a testcase.

Stefan



Re: Compress 1.1 release?

Posted by Simon Tyler <st...@mimecast.net>.

If I explain the scenario in more detail then it might become clearer.

I am seeing issues with certain zip files and file formats based on zip (such
as docx and xps). We are reading these files from a stream, so we are using
ZipArchiveInputStream.

What I see is that we loop around getting each entry with getNextZipEntry
and we get a null and stop. All looks good. However we have only extracted 1
or 2 entries out of a known 20 or 30 entries - the file-based extractor
extracts all the files.

I cannot provide an example of a file as the examples I have are all
customer owned. However every xps file I have seen suffers the issue:

http://www.microsoft.com/whdc/xps/xpssampdoc.mspx

I have investigated the issue and it is caused by entries that use the
central directory. What happens in the zip stream reader is that the size,
csize and crc fields are all zero, there is no central directory available
to the reader, so it performs no extraction. This means the next call to
getNextZipEntry is incorrectly positioned and fails the entry signature
check (LFH_SIG); it returns null, and to the calling code it appears
that we have succeeded.
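The read loop described above looks roughly like the following. For a self-contained sketch this uses the JDK's java.util.zip classes rather than Commons Compress, but ZipArchiveInputStream.getNextZipEntry follows the same null-terminated pattern, which is exactly why a premature null is indistinguishable from a clean end of archive.

```java
import java.io.*;
import java.util.zip.*;

public class ZipLoop {

    /** Build a small in-memory archive so the loop below has something to read. */
    static byte[] sampleZip(String... names) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ZipOutputStream zout = new ZipOutputStream(bos)) {
                for (String name : names) {
                    zout.putNextEntry(new ZipEntry(name));
                    zout.write(name.getBytes());
                    zout.closeEntry();
                }
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Count the entries a streaming reader sees before getNextEntry returns null. */
    static int countEntries(byte[] zipBytes) {
        int count = 0;
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            // null normally means "end of archive" - but, as described above, an
            // unreadable entry can surface the same way, silently truncating the loop.
            while (zin.getNextEntry() != null) {
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countEntries(sampleZip("a.txt", "b.txt"))); // 2
    }
}
```

The caller has no way to tell a count of 2 from a 20-entry archive that was abandoned after entry 2, which is the essence of the second change request.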

So my two change requests are simply to enable me to validate entries and
detect these types of stream so I can do something appropriate. With
compress 1.1 there is support to identify encrypted entries which I need and
hence the request to identify entries using the data descriptor.

The second request is to not return a null when this type of error occurs
but indicate the error somehow. There might be issues here (I am no zip
expert) but I would be worried about false errors being reported.

Simon


On 11/03/2010 13:11, "Stefan Bodewig" <bo...@apache.org> wrote:

> On 2010-03-10, Simon Tyler <st...@mimecast.net> wrote:
> 
>> Do we have a date yet for the compress 1.1 release?
> 
> What Christian said.
> 
>> Also, is there time to add a couple of minor feature enhancements? I could
>> do with access to the following:
> 
>> 1. A public method to check if a ZipArchiveInputStream has a data
>> descriptor (e.g. return hasDataDescriptor).
> 
> This is a property of the individual entry, not the stream as a whole,
> isn't it?  Why would you want to know that (just curious)?  We could
> probably make the general purpose flags available and you could look at
> bit 3.
> 
>> 2. Better handling when ZipArchiveInputStream is used to read such streams.
>> Currently it fails silently when it hits an invalid LFH_SIG, returning null.
> 
> I'm not sure what you mean, could you describe what happens under what
> circumstances in more detail?  I see that the data descriptor isn't read
> anywhere and I see that the stream may fail if the data descriptor uses
> the "unofficial signature" mentioned in appnote, is this what you mean?
> 
> Stefan
> 


Re: FW: Compress 1.1 release?

Posted by Stefan Bodewig <bo...@apache.org>.
On 2010-03-10, Simon Tyler <st...@mimecast.net> wrote:

> Do we have a date yet for the compress 1.1 release?

What Christian said.

> Also, is there time to add a couple of minor feature enhancements? I could
> do with access to the following:

> 1. A public method to check if a ZipArchiveInputStream has a data
> descriptor (e.g. return hasDataDescriptor).

This is a property of the individual entry, not the stream as a whole,
isn't it?  Why would you want to know that (just curious)?  We could
probably make the general purpose flags available and you could look at
bit 3.
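If the flags were exposed as suggested, the caller's test would be a single bit operation. A hypothetical sketch (the method name is made up; bit 3 is defined in the ZIP appnote):

```java
// Hypothetical illustration of testing bit 3 of the general purpose bit flag;
// the helper name is made up for this sketch, not an existing API.
public class GpFlagBit3 {
    static boolean usesDataDescriptor(int generalPurposeFlags) {
        return (generalPurposeFlags & (1 << 3)) != 0; // bit 3 per the ZIP appnote
    }

    public static void main(String[] args) {
        System.out.println(usesDataDescriptor(0x0008)); // true
        System.out.println(usesDataDescriptor(0x0001)); // false (bit 0 is encryption)
    }
}
```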

> 2. Better handling when ZipArchiveInputStream is used to read such streams.
> Currently it fails silently when it hits an invalid LFH_SIG, returning null.

I'm not sure what you mean, could you describe what happens under what
circumstances in more detail?  I see that the data descriptor isn't read
anywhere and I see that the stream may fail if the data descriptor uses
the "unofficial signature" mentioned in appnote, is this what you mean?

Stefan



Re: FW: Compress 1.1 release?

Posted by Stefan Bodewig <bo...@apache.org>.
On 2010-03-11, Christian Grobmeier <gr...@gmail.com> wrote:

> Stefan/Torsten - you are mainly active on compress right now, how is
> the state?

So far I don't intend to make any changes before the release, unless
Simon's changes are really small.  If COMPRESS-18 gets accepted it may
need some documentation updates as well.

> COMPRESS-75 might be closed.

Likely.

> COMPRESS-18 needs a review as far as I know, but that could be done
> quite soon as well (I'll have some time left next week).

We may want to think about COMPRESS-91 if COMPRESS-18 goes in since the
jar streams will rely on META-INF/ to be the very first entry inside the
ZIP (we'd need to increase the signature window to at least 39 bytes).

> However, wdyt about preparing the 1.1 somewhere next week?

Sounds good.

> If you have additions, pls create a patch and add it to a new JIRA issue:
> https://issues.apache.org/jira/browse/COMPRESS

A patch including tests is the optimal solution, but a feature request
without a patch but with a testcase may be enough. 8-)

Stefan



Re: FW: Compress 1.1 release?

Posted by Christian Grobmeier <gr...@gmail.com>.
Hi,

> Do we have a date yet for the compress 1.1 release? There are a few features
> in it that I need to use.

Not yet, but we have already discussed that it's time for a new release.

Stefan/Torsten - you are mainly active on compress right now, how is the state?
I see there are 3 issues open for 1.1.
"COMPRESS-72 - Move acks"  should be straightforward. COMPRESS-75
might be closed.
COMPRESS-18 needs a review as far as I know, but that could be done
quite soon as well (I'll have some time left next week).

However, wdyt about preparing the 1.1 somewhere next week?


> Also, is there time to add a couple of minor feature enhancements? I could
> do with access to the following:

If you have additions, please create a patch and add it to a new JIRA issue:
https://issues.apache.org/jira/browse/COMPRESS

If they do not cause too much discussion, they might be added in 1.1
as well. You should make sure to add test cases too.

Cheers


>
> 1. A public method to check if a ZipArchiveInputStream has a data descriptor
> (e.g. return hasDataDescriptor).
>
> 2. Better handling when ZipArchiveInputStream is used to read such streams.
> Currently it fails silently when it hits an invalid LFH_SIG, returning null.
>
> Cheers
> Simon
>
