Posted to dev@commons.apache.org by Stefan Bodewig <bo...@apache.org> on 2011/07/26 14:00:59 UTC

[compress] Where to Place Big Test Archives?

Hi,

ZIP64 is all about supporting archives with entries bigger than 4GB and
archives with more than 65535 entries, so it comes as no surprise that
test archives for ZIP64 are big.
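
For illustration only, the two limits can be written down as a small
check.  This is a sketch in Java; the class and method names are made up
here and are not part of Commons Compress.  The boundary handling (strictly
"more than") follows the wording above and is an assumption: the format
reserves the all-ones field values as markers, so a real writer may
already switch to ZIP64 when a value equals the limit.

```java
// Sketch only: Zip64Limits and needsZip64 are hypothetical names.
final class Zip64Limits {
    // Largest value that fits the classic 4-byte size fields (2^32 - 1).
    static final long MAX_CLASSIC_SIZE = 0xFFFFFFFFL;
    // Largest value that fits the classic 2-byte entry count field (2^16 - 1).
    static final long MAX_CLASSIC_ENTRIES = 0xFFFFL; // 65535

    // True if an archive with these properties exceeds the classic limits.
    static boolean needsZip64(long largestEntrySize, long entryCount) {
        return largestEntrySize > MAX_CLASSIC_SIZE
            || entryCount > MAX_CLASSIC_ENTRIES;
    }

    public static void main(String[] args) {
        // The two archives described in this mail: 5e9 bytes, and 100k entries.
        System.out.println(needsZip64(5_000_000_000L, 1));
        System.out.println(needsZip64(0L, 100_000L));
    }
}
```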

Right now I'm working with two archives.  One contains a single file that
consists of 5e9 zeros; the InfoZIP-generated ZIP is about 4MB in size.
The other one contains 100k empty files and is about 15MB in size.
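
The 4MB figure is plausible: DEFLATE can shrink long runs of identical
bytes by roughly a factor of a thousand.  A minimal sketch, assuming
java.util.zip instead of InfoZIP and a much smaller entry as a stand-in
for the 5e9 zeros (class, method and entry names are hypothetical):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Sketch only: writes one entry of zero bytes and reports the archive size.
final class ZeroEntryDemo {
    static long zippedSizeOfZeros(long zeroCount) {
        ByteArrayOutputStream archive = new ByteArrayOutputStream();
        byte[] chunk = new byte[64 * 1024]; // Java arrays start out zero-filled
        try (ZipOutputStream zip = new ZipOutputStream(archive)) {
            zip.putNextEntry(new ZipEntry("zeros.bin"));
            long remaining = zeroCount;
            while (remaining > 0) {
                // Stream the zeros in chunks so memory use stays constant.
                int n = (int) Math.min(chunk.length, remaining);
                zip.write(chunk, 0, n);
                remaining -= n;
            }
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return archive.size();
    }

    public static void main(String[] args) {
        long zeros = 10_000_000L; // stand-in for 5e9
        System.out.println(zeros + " zero bytes -> "
            + zippedSizeOfZeros(zeros) + " byte archive");
    }
}
```

The payload is streamed, so only the small compressed result is held in
memory here.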

I wouldn't want to add them to the normal src/test/resources tree as
those two archives alone would bump the size of the source distribution
by about 20 MB (ZIP won't make them any smaller, neither will gzip -
bzip2 might).

Right now I'm looking into ways to place them outside of src/ and to
write unit tests that are simply skipped if the archives are not where
they are expected.  Those tests will take ages to run anyway (creating
the archives using zip 3.0 on my Linux notebook took several minutes
each) and may be better run only when asked for explicitly.
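
A rough sketch of the "skip if absent" idea in plain Java, so that it
runs without a test framework.  All names are hypothetical; in a JUnit 4
test the same effect would usually come from
org.junit.Assume.assumeTrue(...) on the existence check.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.function.Consumer;

// Sketch only: BigArchiveLocator is a hypothetical helper.
final class BigArchiveLocator {
    // Returns the archive path if it exists as a regular file in dir.
    static Optional<Path> locate(Path dir, String archiveName) {
        Path candidate = dir.resolve(archiveName);
        return Files.isRegularFile(candidate)
            ? Optional.of(candidate)
            : Optional.empty();
    }

    // Runs testBody only if the archive is present.
    // Returns true if the body ran, false if the test was skipped.
    static boolean runIfPresent(Path dir, String archiveName,
                                Consumer<Path> testBody) {
        Optional<Path> archive = locate(dir, archiveName);
        if (!archive.isPresent()) {
            System.out.println("SKIPPED: " + archiveName
                + " not found in " + dir);
            return false;
        }
        testBody.accept(archive.get());
        return true;
    }
}
```

A missing archive is then reported as skipped instead of failing the
build.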

The question for me right now is "how far" outside of src I should place
them.  Do I add them somewhere under the branch and later trunk so that
all compress developers have to download them at least once or do I
place them somewhere next to trunk and only those who intend to run the
tests will need to download them?  Currently I think it would be best to
do the latter and place the corresponding tests right along there as
well.

Stefan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [compress] Where to Place Big Test Archives?

Posted by Ted Dunning <te...@gmail.com>.
Multiple Maven artifacts are a nice way to get a download only if you need
it.  I was going to suggest S3 or a similar place, but Maven Central is
better in many ways.

I would suggest tarring up all of the test files and making a single Maven
artifact from that.

On Wed, Jul 27, 2011 at 12:31 AM, Jörg Schaible <joerg.schaible@scalaris.com> wrote:

> > In the end I expect our corpus of test archives to grow as interop
> > problems are discovered.
>
> Different approach: Create out of them separate Maven artifacts.

Re: [compress] Where to Place Big Test Archives?

Posted by Jörg Schaible <jo...@gmx.de>.
Torsten Curdt wrote:

>>> The best workaround I can see is to create a separate download for the
>>> test data. This allows developers who need sources for debugging to
>>> still get them without downloading GBs of test data.
>>
>> With a separate Maven profile you have exactly that.
> 
> Why would we need a separate profile for this?
> 
> I would think just an ordinary artifact should be fine
> 
> ...or do you mean to prevent the download of that artifact on a normal
> build without running the tests?

Stefan asked what to do with really big test data (e.g. >4GB for ZIP64) and 
I proposed to use a profile to run those tests and declare this data as 
separate deps also in the profile.

- Jörg



Re: [compress] Where to Place Big Test Archives?

Posted by Torsten Curdt <tc...@vafer.org>.
>> The best workaround I can see is to create a separate download for the
>> test data. This allows developers who need sources for debugging to
>> still get them without downloading GBs of test data.
>
> With a separate Maven profile you have exactly that.

Why would we need a separate profile for this?

I would think just an ordinary artifact should be fine

...or do you mean to prevent the download of that artifact on a normal
build without running the tests?

cheers,
Torsten


Re: [compress] Where to Place Big Test Archives?

Posted by Jörg Schaible <jo...@gmx.de>.
Gary Gregory wrote:

> On Wed, Jul 27, 2011 at 9:45 AM, Luc Maisonobe <Lu...@free.fr>
> wrote:
>> On 27/07/2011 14:54, Stefan Bodewig wrote:
>>>
>>> On 2011-07-27, Torsten Curdt wrote:
>>>
>>>> But to be clear: this is quite some work just to save some disk space
>>>> for people who check out the whole of commons but do not care about
>>>> compress.
>>>
>>> Actually my main concern was/is the size of the source distribution and
>>> network bandwidth rather than disk space.
>>
>> I second that. I know people who even in occidental countries still have
>> no access to ADSL and can use only 56k modems, with unreliable connexion.
>> For them, 15MB is huge.
>>
>> We should take care of the digital divide.
> 
> I do not see how. If I want to be a developer on [compress], I need to
> be able to run the full test suite.
> 
> The best workaround I can see is to create a separate download for the
> test data. This allows developers who need sources for debugging to
> still get them without downloading GBs of test data.

With a separate Maven profile you have exactly that.

- Jörg



Re: [compress] Where to Place Big Test Archives?

Posted by Luc Maisonobe <Lu...@free.fr>.
On 27/07/2011 18:30, Gary Gregory wrote:
> On Wed, Jul 27, 2011 at 9:45 AM, Luc Maisonobe<Lu...@free.fr>  wrote:
>> On 27/07/2011 14:54, Stefan Bodewig wrote:
>>>
>>> On 2011-07-27, Torsten Curdt wrote:
>>>
>>>> But to be clear: this is quite some work just to save some disk space
>>>> for people who check out the whole of commons but do not care about
>>>> compress.
>>>
>>> Actually my main concern was/is the size of the source distribution and
>>> network bandwidth rather than disk space.
>>
>> I second that. I know people who even in occidental countries still have no
>> access to ADSL and can use only 56k modems, with unreliable connexion. For
>> them, 15MB is huge.
>>
>> We should take care of the digital divide.
>
> I do not see how. If I want to be a developer on [compress], I need to
> be able to run the full test suite.
>
> The best workaround I can see is to create a separate download for the
> test data. This allows developers who need sources for debugging to
> still get them without downloading GBs of test data.

This would be fine.

Luc

>
> Gary
>
>
>>
>> Luc
>>
>>> If I'm the only one who
>>> thinks a 15MB file in svn trunk is excessive, then I'll be happy to shut
>>> up - I'm going to need the files anyway. ;-)
>>>
>>> Stefan



Re: [compress] Where to Place Big Test Archives?

Posted by Stefan Bodewig <bo...@apache.org>.
On 2011-07-27, Gary Gregory wrote:

> The best workaround I can see is to create a separate download for the
> test data. This allows developers who need sources for debugging to
> still get them without downloading GBs of test data.

I agree and - to the extent that I understand it - Jörg's suggestion to
put the corresponding tests into a separate profile sounds good to me.

So far I don't know what I'd need to do concretely to make the test
files "separate artifacts", will need to read up on a few things.

Stefan


Re: [compress] Where to Place Big Test Archives?

Posted by Bear Giles <bg...@coyotesong.com>.
In case you're talking about my test data the uncompressed test case could
be externally compressed and the test code adjusted accordingly. It never
occurred to me to do that since you never compress dump files like that in
the wild - you use the native compression instead - but the file is almost
entirely \0. I guess I also assumed the test data would be a separate
download.

You definitely want to have both compressed and uncompressed test data
though, and large enough files and directories to force multiple segments. I
think I managed to fail on nearly every case during development.

Bear


On Wed, Jul 27, 2011 at 10:30 AM, Gary Gregory <ga...@gmail.com> wrote:

> On Wed, Jul 27, 2011 at 9:45 AM, Luc Maisonobe <Lu...@free.fr>
> wrote:
> > On 27/07/2011 14:54, Stefan Bodewig wrote:
> >>
> >> On 2011-07-27, Torsten Curdt wrote:
> >>
> >>> But to be clear: this is quite some work just to save some disk space
> >>> for people who check out the whole of commons but do not care about
> >>> compress.
> >>
> >> Actually my main concern was/is the size of the source distribution and
> >> network bandwidth rather than disk space.
> >
> > I second that. I know people who even in occidental countries still have no
> > access to ADSL and can use only 56k modems, with unreliable connexion. For
> > them, 15MB is huge.
> >
> > We should take care of the digital divide.
>
> I do not see how. If I want to be a developer on [compress], I need to
> be able to run the full test suite.
>
> The best workaround I can see is to create a separate download for the
> test data. This allows developers who need sources for debugging to
> still get them without downloading GBs of test data.
>
> Gary
>
>
> >
> > Luc
> >
> >> If I'm the only one who
> >> thinks a 15MB file in svn trunk is excessive, then I'll be happy to shut
> >> up - I'm going to need the files anyway. ;-)
> >>
> >> Stefan
> --
> Thank you,
> Gary
>
> http://garygregory.wordpress.com/
> http://garygregory.com/
> http://people.apache.org/~ggregory/
> http://twitter.com/GaryGregory

Re: [compress] Where to Place Big Test Archives?

Posted by Gary Gregory <ga...@gmail.com>.
On Wed, Jul 27, 2011 at 9:45 AM, Luc Maisonobe <Lu...@free.fr> wrote:
> On 27/07/2011 14:54, Stefan Bodewig wrote:
>>
>> On 2011-07-27, Torsten Curdt wrote:
>>
>>> But to be clear: this is quite some work just to save some disk space
>>> for people who check out the whole of commons but do not care about
>>> compress.
>>
>> Actually my main concern was/is the size of the source distribution and
>> network bandwidth rather than disk space.
>
> I second that. I know people who even in occidental countries still have no
> access to ADSL and can use only 56k modems, with unreliable connexion. For
> them, 15MB is huge.
>
> We should take care of the digital divide.

I do not see how. If I want to be a developer on [compress], I need to
be able to run the full test suite.

The best workaround I can see is to create a separate download for the
test data. This allows developers who need sources for debugging to
still get them without downloading GBs of test data.

Gary


>
> Luc
>
>> If I'm the only one who
>> thinks a 15MB file in svn trunk is excessive, then I'll be happy to shut
>> up - I'm going to need the files anyway. ;-)
>>
>> Stefan



-- 
Thank you,
Gary

http://garygregory.wordpress.com/
http://garygregory.com/
http://people.apache.org/~ggregory/
http://twitter.com/GaryGregory


Re: [compress] Where to Place Big Test Archives?

Posted by Luc Maisonobe <Lu...@free.fr>.
On 27/07/2011 14:54, Stefan Bodewig wrote:
> On 2011-07-27, Torsten Curdt wrote:
>
>> But to be clear: this is quite some work just to save some disk space
>> for people who check out the whole of commons but do not care about
>> compress.
>
> Actually my main concern was/is the size of the source distribution and
> network bandwidth rather than disk space.

I second that. I know people who even in occidental countries still have 
no access to ADSL and can use only 56k modems, with unreliable 
connexion. For them, 15MB is huge.

We should take care of the digital divide.

Luc

> If I'm the only one who
> thinks a 15MB file in svn trunk is excessive, then I'll be happy to shut
> up - I'm going to need the files anyway. ;-)
>
> Stefan



Re: [compress] Where to Place Big Test Archives?

Posted by Stefan Bodewig <bo...@apache.org>.
On 2011-07-27, Torsten Curdt wrote:

> But to be clear: this is quite some work just to save some disk space
> for people who check out the whole of commons but do not care about
> compress.

Actually my main concern was/is the size of the source distribution and
network bandwidth rather than disk space.  If I'm the only one who
thinks a 15MB file in svn trunk is excessive, then I'll be happy to shut
up - I'm going to need the files anyway. ;-)

Stefan


Re: [compress] Where to Place Big Test Archives?

Posted by Torsten Curdt <tc...@vafer.org>.
>> Different approach: Create out of them separate Maven artifacts.

If we go down that road I would prefer to keep all tests in the core
and only get the resources from the interop jar. The interop jar would
then hold all manually created zip, tgz, etc. files.
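
As a sketch of what "tests in the core, resources from the interop jar"
could look like: if the interop jar is on the test classpath, a test can
look the archive up as a classpath resource and treat "not found" as
"jar not installed".  The helper class and the resource names are
hypothetical.

```java
import java.net.URL;
import java.util.Optional;

// Sketch only: InteropResources is a hypothetical helper.
final class InteropResources {
    // Looks up a test archive on the classpath; empty if the jar
    // that would contain it is not present.
    static Optional<URL> find(String resourceName) {
        ClassLoader loader = InteropResources.class.getClassLoader();
        return Optional.ofNullable(loader.getResource(resourceName));
    }

    public static void main(String[] args) {
        // "interop/zip64/zeros.zip" is a made-up resource path.
        Optional<URL> archive = find("interop/zip64/zeros.zip");
        System.out.println(archive.isPresent()
            ? "found " + archive.get()
            : "interop jar not on classpath, skipping");
    }
}
```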

But to be clear: this is quite some work just to save some disk space
for people who check out the whole of commons but do not care about
compress. For those who care about compress: you really don't want to
have multiple versions of that huge artifact in your local repository.

Let's say - it's not an ideal solution.

cheers,
Torsten


Re: [compress] Where to Place Big Test Archives?

Posted by Jörg Schaible <jo...@scalaris.com>.
Stefan Bodewig wrote:

> On 2011-07-27, Jörg Schaible wrote:
> 
>> Different approach: Create out of them separate Maven artifacts.
> 
> This probably is what I had in mind when I suggested to move them -
> along with the testcases - somewhere outside of trunk.  I just didn't
> know the proper nomenclature and likely don't know how to do that
> properly, I'm an Ant guy and all that. 8-)
> 
> Technically that would mean we'd add a reactor for Compress with two
> artifacts.  One would be what Compress is right now and one would hold
> "interop tests" or something similar.  Is this correct?

Not necessarily. You may also define a profile that activates the additional 
tests (and the additional dependencies for your large test files).

> I'm with Ted and would only create one artifact for all interop tests.
> And I'd probably prefer to keep the small tests with the code as unit
> tests.

AFAICS Ted proposed one artifact for the test files i.e. the (large) 
archives you will operate on in your tests. That's also possible, but I'd 
not recommend it, since I expect you probably will never change them again, 
while a combined ZIP artifact with all the test files *will* change (every 
time you add a new one).

- Jörg



Re: [compress] Where to Place Big Test Archives?

Posted by Stefan Bodewig <bo...@apache.org>.
On 2011-07-27, Jörg Schaible wrote:

> Different approach: Create out of them separate Maven artifacts.

This probably is what I had in mind when I suggested to move them -
along with the testcases - somewhere outside of trunk.  I just didn't
know the proper nomenclature and likely don't know how to do that
properly, I'm an Ant guy and all that. 8-)

Technically that would mean we'd add a reactor for Compress with two
artifacts.  One would be what Compress is right now and one would hold
"interop tests" or something similar.  Is this correct?

I'm with Ted and would only create one artifact for all interop tests.
And I'd probably prefer to keep the small tests with the code as unit
tests.

Stefan


Re: [compress] Where to Place Big Test Archives?

Posted by Jörg Schaible <jo...@scalaris.com>.
Hi Stefan,

Stefan Bodewig wrote:

> On 2011-07-26, Ted Dunning wrote:
>> On Tue, Jul 26, 2011 at 5:31 AM, Stefan Bodewig <bo...@apache.org>
>> wrote:
> 
>>>> Perhaps the large test files could be generated on the fly if absent
>>>> in the user's temp directory?
> 
>>> This would require 5 GB of disk space in temp and a working ZIP64
>>> implementation to create the archives.  Disk space is likely not that
>>> big a problem anymore, but still it seems excessive.
> 
>> Generating these files is an excellent test in itself.  Just generate
>> them
>> and check the md5 to verify they are correct.  Then use them.
> 
> Oh, absolutely, no question.
> 
> But (1) we are not there, yet (ZipArchiveInputStream can read ZIP64 - did
> I say that? - but we have no way to write them, yet) and (2) it will not
> be sufficient.  Compress really is a jungle of dialects for almost all
> supported formats and we need to do interop tests.
> 
> An archive created by InfoZIP is different from one created by PKZIP
> which is different from Windows Compressed folders ... you get the idea.
> If Compress' output matched that of InfoZIP it still wouldn't show that
> we are able to read PKZIP created archives.
> 
> In the end I expect our corpus of test archives to grow as interop
> problems are discovered.

Different approach: Create out of them separate Maven artifacts.

- Jörg



Re: [compress] Where to Place Big Test Archives?

Posted by Stefan Bodewig <bo...@apache.org>.
On 2011-07-26, Ted Dunning wrote:
> On Tue, Jul 26, 2011 at 5:31 AM, Stefan Bodewig <bo...@apache.org> wrote:

>>> Perhaps the large test files could be generated on the fly if absent
>>> in the user's temp directory?

>> This would require 5 GB of disk space in temp and a working ZIP64
>> implementation to create the archives.  Disk space is likely not that
>> big a problem anymore, but still it seems excessive.

> Generating these files is an excellent test in itself.  Just generate them
> and check the md5 to verify they are correct.  Then use them.

Oh, absolutely, no question.

But (1) we are not there, yet (ZipArchiveInputStream can read ZIP64 - did
I say that? - but we have no way to write them, yet) and (2) it will not
be sufficient.  Compress really is a jungle of dialects for almost all
supported formats and we need to do interop tests.

An archive created by InfoZIP is different from one created by PKZIP
which is different from Windows Compressed folders ... you get the idea.
If Compress' output matched that of InfoZIP it still wouldn't show that
we are able to read PKZIP created archives.

In the end I expect our corpus of test archives to grow as interop
problems are discovered.

Stefan


Re: [compress] Where to Place Big Test Archives?

Posted by Ted Dunning <te...@gmail.com>.
Generating these files is an excellent test in itself.  Just generate them
and check the md5 to verify they are correct.  Then use them.
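
A minimal sketch of the checksum step, using only the JDK (class and
method names are hypothetical).  One assumption worth stating: the
checksum of a generated archive is only stable if the generator is
deterministic, for example by fixing the entry timestamps.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch only: Md5Check is a hypothetical helper.
final class Md5Check {
    // Streams the file through MD5 and returns the lowercase hex digest.
    static String md5Hex(Path file) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[64 * 1024];
            try (InputStream in =
                     new DigestInputStream(Files.newInputStream(file), md5)) {
                while (in.read(buffer) != -1) {
                    // Reading updates the digest as a side effect.
                }
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Compares the file's digest with the expected hex string.
    static boolean matches(Path file, String expectedHex) {
        return md5Hex(file).equalsIgnoreCase(expectedHex);
    }
}
```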

On Tue, Jul 26, 2011 at 5:31 AM, Stefan Bodewig <bo...@apache.org> wrote:

> > Perhaps the large test files could be generated on the fly if absent
> > in the user's temp directory?
>
> This would require 5 GB of disk space in temp and a working ZIP64
> implementation to create the archives.  Disk space is likely not that
> big a problem anymore, but still it seems excessive.

Re: [compress] Where to Place Big Test Archives?

Posted by Stefan Bodewig <bo...@apache.org>.
On 2011-07-26, Gary Gregory wrote:

> For "small" files I would not worry, the [sanselan] test data
> directory is >80MB and no one is complaining.

Really?  My DSL provider still cannot offer me more than 3MBit/s and
this is close to the city center in a town > 250k citizens in Germany, I
could imagine people get upset if they have to download files that big
on slower lines.

Sure, when I started hacking on Ant in 2000 all I had was a 28k modem,
oh my.  8-)

> Perhaps the large test files could be generated on the fly if absent
> in the user's temp directory?

This would require 5 GB of disk space in temp and a working ZIP64
implementation to create the archives.  Disk space is likely not that
big a problem anymore, but still it seems excessive.

Stefan


Re: [compress] Where to Place Big Test Archives?

Posted by Gary Gregory <ga...@gmail.com>.
For "small" files I would not worry, the [sanselan] test data
directory is >80MB and no one is complaining.

Perhaps the large test files could be generated on the fly if absent
in the user's temp directory?

Gary

On Tue, Jul 26, 2011 at 8:00 AM, Stefan Bodewig <bo...@apache.org> wrote:
> Hi,
>
> ZIP64 is all about supporting archives with entries bigger than 4GB and
> archives with more than 65535 entries so it comes as no surprise that
> test archives for ZIP64 are big.
>
> Right now I'm working with two archives, one contains a single file that
> consists of 5e9 zeros, the InfoZIP generated ZIP is about 4MB in size.
> The other one contains 100k empty files and is about 15MB in size.
>
> I wouldn't want to add them to the normal src/test/resources tree as
> those two archives alone would bump the size of the source distribution
> by about 20 MB (ZIP won't make them any smaller, neither will gzip -
> bzip2 might).
>
> Right now I'm looking into ways that place them outside of src/ and to
> write unit tests that are simply skipped if the archives are not where
> they are expected.  Those tests will take ages to run anyway (creating
> the archives using zip 3.0 on my Linux notebook took several minutes
> each) and may be better only run when asked for explicitly.
>
> The question for me right now is "how far" outside of src I should place
> them.  Do I add them somewhere under the branch and later trunk so that
> all compress developers have to download them at least once or do I
> place them somewhere next to trunk and only those who intend to run the
> tests will need to download them?  Currently I think it would be best to
> do the latter and place the corresponding tests right along there as
> well.
>
> Stefan



-- 
Thank you,
Gary

http://garygregory.wordpress.com/
http://garygregory.com/
http://people.apache.org/~ggregory/
http://twitter.com/GaryGregory
