Posted to repository@apache.org by "Noel J. Bergman" <no...@devtech.com> on 2004/02/20 23:18:44 UTC

duplicate data

Mark,

An issue was raised earlier today that should be addressed.  The impression
is that java-repository is publishing copies of jars that are also under
dist/TLP/..., which puts more of a load on the mirrors.  It might be best if
the jars/ directories contained symlinks and not copies of artifacts
published elsewhere.

This doesn't address the fact that at some point an artifact may exist only
in the archives, but that would require meta-data aware clients.

	--- Noel


Re: duplicate data

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
I'll try to expand on the functionalities of Maven below.

Sander Striker wrote:

> On Sat, 2004-02-21 at 01:01, Mark R. Diggory wrote:
> 
>>Noel J. Bergman wrote:
>>
>>>>The issue is... the jars/distributables are placed into the
>>>>java-repository using maven.
> 
> 
> Can you explain this a bit?  I thought Maven was used to fetch
> projects and dependencies.  Ofcourse I can read up on Maven,
> but a quick summary of the technicalities would be appreciated.
> 

Maven is used both to fetch jars from the repository and to publish 
jars to the repository. For the latter, it essentially opens ssh 
sessions and runs a series of commands (scp, md5, chmod, chgrp). 
Because this is encapsulated within Maven, the user can rely on 
Maven's deployment mechanism to set up the jar/signature in the 
repository for their project. Since it's scripted, it is done the same 
way every time. This takes a great deal of the effort involved in 
publishing jars to the repository out of the user's hands.

Maven is really doing nothing more than acting as an ssh client for the 
user and automating the deployment process for them using their apache 
account.
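
As a rough illustration of that sequence, the sketch below builds the 
kind of command list such a scripted deployment might run. It is not 
Maven's actual implementation: the function name, the md5sum 
invocation, and the 664 mode are assumptions. Only the 
scp/md5/chmod/chgrp steps, the repository directory, and the apcvs 
group come from this thread.

```python
import posixpath
import shlex


def build_deploy_commands(jar, user, host, repo_dir, project, group):
    """Return the commands a scripted jar deployment might run (hypothetical)."""
    name = posixpath.basename(jar)
    remote_jar = posixpath.join(repo_dir, project, "jars", name)
    target = f"{user}@{host}"
    quoted = shlex.quote(remote_jar)
    return [
        # Copy the jar into the project's jars/ directory.
        ["scp", jar, f"{target}:{remote_jar}"],
        # Write a checksum next to the jar (assumes md5sum exists on the
        # server; BSD hosts ship "md5" instead).
        ["ssh", target, f"md5sum {quoted} > {quoted}.md5"],
        # Make both files group-writable and world-readable (assumed mode).
        ["ssh", target, f"chmod 664 {quoted} {quoted}.md5"],
        # Hand both files to the shared group.
        ["ssh", target, f"chgrp {shlex.quote(group)} {quoted} {quoted}.md5"],
    ]
```

Because the list is built the same way on every call, each deployment 
performs the same steps in the same order, which is the property 
described above.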

This benefits Maven because it can rely on the repository being 
maintained in a structure it can predict and locate dependencies within.
> 
>> so, currently, if you look in
>>
>>>>something like the commons project.properties you'll see that
>>>>they are pointing to the central repository for the location
>>>>of where to "publish" files.
>>>
>>>
>>>>The "convergence issues" we currently have for the repository:
>>>> 1.) We want single copies of files on the mirrors.
>>>
>>> +1
> 
> 
> This is the core point.
> 

Yes, we all agree on this one...

> 
>>>>My best conclusion is
>>>>keep "jars" in the java-repository, do not keep them
>>>>in your /dist/<project>/<binaries> directory.  Remove all
>>>>[jar/zip/tar files] from the java-repository.
>>>
>>>
>>>>symlnk the appropriate java-repository dir into their appropriate
>>>>"dist" directory.
> 
> 
> That would mean that this entire area would have to be rw to all
> groups producing releases that are to be in there.  This kindof means
> apcvs group ownership, which I don't really fancy doing.  The other
> way around, control and access of each projects dist/ area seperated,
> and symlinking to that from java-repository, seems a bit sa[fn]er to
> me.
>  

Ultimately we are seeking a convergence here between what the repository 
folks want to see, the maven users want to see and the infrastructure 
folks want to see.

1.) For the repository (and Maven) folks, we want to see the contents of 
dist become standardized according to the Repository URI specification. 
This means "all" distributables (java or not) are organized according to 
this specification.

2.) For Maven users, no matter what happens, we need to maintain a 
functionally working repository that works with the existing version of 
Maven.

3.) For Infrastructure, all this needs to be properly secured and 
maintained according to Apache standards.

The java-repository structure is broken down into

.../java-repository/<project>/<jars|distributables|...>/<foo-version.ext>

this would mean each project would need to maintain a separate set of 
symlinks for "jars", "distributables", "...".
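
To make the layout concrete, here is a minimal sketch that splits a 
path of that shape into its parts. The RepoEntry type and the 
parse_repo_path name are made up for illustration; the example path is 
the excalibur-component jar mentioned elsewhere in this thread.

```python
from typing import NamedTuple


class RepoEntry(NamedTuple):
    project: str
    kind: str       # e.g. "jars" or "distributables"
    filename: str   # e.g. "foo-version.ext"


def parse_repo_path(path, root="java-repository"):
    """Split .../<root>/<project>/<kind>/<file> into its three parts."""
    parts = path.strip("/").split("/")
    if root not in parts:
        raise ValueError(f"{path!r} is not under {root}/")
    tail = parts[parts.index(root) + 1:]
    if len(tail) != 3:
        raise ValueError(f"expected <project>/<kind>/<file> after {root}/")
    return RepoEntry(*tail)


entry = parse_repo_path(
    "java-repository/excalibur-component/jars/excalibur-component-1.1.jar"
)
print(entry.project, entry.kind, entry.filename)
```

Since the type directory ("jars", "distributables", ...) sits below the 
project, a symlink made at that level covers only one kind of file, 
hence one set of symlinks per kind.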





> 
>>>That sounds OK to me, but folks like Sander and others more involved in
>>>mirroring should be put in the loop.  Everything we put under dist/ effects
>>>100s of mirrors.
> 
> 
> Not me specifically, but Infrastructure.  Others are more actively
> maintaining the mirrors list and monitoring the mirrors.  The mirrors
> are a precious resource and we want to be careful not to 'scare' any
> mirrors away with actions on our end.
> 
> 
>>Yes, I learned that the hard way when we created the contents of 
>>java-repository... that was not a happy weekend. I don't make any "rash" 
>>changes to dist any more...Only well thought out moves. But we are in a 
>>state of cleanup now as well, we have to consider what we are going to 
>>do next.
> 
> 
> If you are making large changes to the directory structure and the
> majority of the files is already on the mirrors, send a mail to
> mirrors@, attach a shell script that moves everything around locally,
> and give them a heads up on when this shuffle is happening.  This
> save a _lot_ of bandwidth.
> 
> Also, when adding a lot, make sure to inform the mirrors, so they
> are prepared.
> 
> 
>>>>Discussion about how to finalize the directory structure such
>>>>that "Repository", "Dist/Mirror" and "Maven" has to move forward.
> 
> 
> I don't parse this, but since Noel can read it, I am probably missing
> context/background.
> 

Just that these groups are all focused on different aspects of the 
distributables in the dist directory:

The Repository project's URL structure is important in standardizing and 
improving the dist contents into a more formal structure.

The Maven project represents a working example of a tool that implements 
itself upon this structure.

The dist directory maintainers and the mirrors out there together 
represent a "control" on the whole situation: if it doesn't work for 
them, then it's not realistic as a strategy.

> 
>>>That would be good.
> 
> 
> 
>>In our last discussion, I think one of the conclusions that was arrived 
>>at as well, was the idea of breaking the java-repository up into two 
>>different locations.
>>
>>www/cvs.apache.org/dist/java-repository --> nightly builds
>>
>>www/www.apache.org/dist/java-repository --> official releases.
>>
>>the idea was that nightly/weekly builds are not things we want to see on 
>>mirrors but to be available for developers. And that official release of 
>>jars are things we want to see mirrored.
> 
> 
> Is Maven using the mirrors today, like getting the list of active
> mirrors from the main site and finding the closest?  Or is it only
> using the main site and perhaps iblibio?
> 
Currently, all Maven clients use www.ibiblio.org/maven to retrieve 
content. www.ibiblio.org is also a mirror of /java-repository for all 
its Apache content. In fact, Maven users DO NOT go to 
www.apache.org/dist/java-repository to download files, and only Apache 
developers can publish to www.apache.org/dist/java-repository.


Which server is used currently depends on the configuration of the Maven 
client; servers currently have no capability to hand the client off to 
another mirror. I think that in the future, as the Repository comes into 
existence along with machine-readable metadata or mechanisms for 
directing clients to mirrors, clients like Maven will implement such 
capabilities.

> 
>>When it comes to things like the ibiblio maven repository, it would only 
>>maintain full version releases of apache projects.
> 
> 
> Can you explain why ibiblio is special here?  I mean, what you describe
> is what is supposed to be on all the mirrors right?
> 

Just because it is the "default" repository used by the Maven Client.
> 
>> If your an apache 
>>project and need to be on the bleeding edge for a component, then you 
>>can simply add
>>
>>www/cvs.apache.org/dist/java-repository
>>
>>as your first repository location and get your apache jars straight off 
>>the nightly builds...
>>
>>The big question is how to facilitate this a build process, I think the 
>>last decision on the Jakarta Commons/General/Maven lists was that we 
>>would automate the build process for releasing the nightly jars into
>>
>>www/cvs.apache.org/dist/java-repository
>>
>>And the only publishing of jars by actual humans (Release Managers) 
>>would be the full releases onto
>>
>>www/www.apache.org/dist/java-repository
> 
> 
> Symlinks I hope.  Mirrors handle symlinks efficiently, that is,
> if they follow our rsync instructions.

The only mirroring that would be done would be via:

www.apache.org/dist

All other content in cvs.apache.org or archive.apache.org is not to be 
"synced", as it's not to be published out to mirrors; such content is 
"developer builds" and not for public consumption.

Within the www.apache.org/dist directory, yes symlinking should be used 
to resolve duplication.

> 
> Take a look at http://www.apache.org/~henkp/md5/, specifically
> the fyi: some duplicates section.  Dups are a waste of bandwidth
> and diskspace.

Yes, approx 50% of instances of duplication on this page are currently 
caused by avalon components (avalon also was using their dist directory 
as a private maven repository). For example:

avalon/excalibur-component/jars/excalibur-component-1.1.jar
java-repository/excalibur-component/jars/excalibur-component-1.1.jar

As I understand it, the policy when rsyncing can be that a symlink is 
not followed if it and its target directory do not have the same 
ownership. I believe this creates a problem: I cannot simply create 
symlinks from java-repository/excalibur-component/ to 
avalon/excalibur-component/, as they will not be followed by rsync.

However, the other 50% of duplicates within the java-repository 
directory should be properly alleviated with symlinking. I can work on 
this, as I now (as of a couple days ago) own all the files :-). I will 
start working on a script I can run periodically which will accomplish this.
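
A script of that kind might look roughly like the sketch below. This is 
an illustration under assumptions, not the script that was actually 
run: it treats the first copy found (in sorted walk order) as the one 
to keep, compares files by size and MD5, and replaces later copies with 
relative symlinks. The function names and the dry_run flag are 
invented here.

```python
import hashlib
import os


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def link_duplicates(root, dry_run=True):
    """Replace later byte-identical copies under root with relative symlinks.

    Returns a list of (duplicate_path, symlink_target) pairs. With
    dry_run=True nothing on disk is changed.
    """
    first_seen = {}  # (size, md5) -> path of the copy that is kept
    actions = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # already a symlink, leave it alone
            key = (os.path.getsize(path), file_md5(path))
            keeper = first_seen.setdefault(key, path)
            if keeper == path:
                continue  # first copy with this content
            target = os.path.relpath(keeper, os.path.dirname(path))
            actions.append((path, target))
            if not dry_run:
                os.remove(path)
                os.symlink(target, path)
    return actions
```

Running it first with dry_run=True lists what would change without 
touching anything. Relative targets are used so the links still resolve 
when the tree is copied to a mirror at a different absolute path.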

> 
> I'll ask Henk to disable the checks for presence of md5 in the
> dist/java-repository, since that doesn't seem to be applicable
> there.  It seems to me that you do want to do some verification
> in maven, but you are probably storing signature information
> somewhere in the maven 'database'?
> 

No, it is in the directory structure (no db), and md5s should exist next 
to the files. There is a bug in Maven caused by the fact that on BSD 
checksums are generated by "md5", not "md5sum" as on Linux. This needs 
to be addressed; for example, you can see my md5 was bad on the math jar 
(which I just fixed).
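
One way a tool could sidestep the difference is to accept either 
output format, or to compute the digest itself. The sketch below is an 
assumption about how that might be done, not a description of Maven's 
code: BSD md5 prints "MD5 (file) = digest", GNU md5sum prints 
"digest  file", and the helper names are hypothetical.

```python
import hashlib
import re

# BSD md5:     MD5 (foo.jar) = d41d8cd98f00b204e9800998ecf8427e
_BSD_FORMAT = re.compile(r"^MD5 \(.+\) = ([0-9a-fA-F]{32})$")
# GNU md5sum:  d41d8cd98f00b204e9800998ecf8427e  foo.jar
_GNU_FORMAT = re.compile(r"^([0-9a-fA-F]{32})\s+\*?.+$")
# A bare digest with no file name at all.
_BARE_FORMAT = re.compile(r"^([0-9a-fA-F]{32})$")


def extract_md5(line):
    """Pull the hex digest out of a line in any of the formats above."""
    line = line.strip()
    for pattern in (_BSD_FORMAT, _GNU_FORMAT, _BARE_FORMAT):
        match = pattern.match(line)
        if match:
            return match.group(1).lower()
    raise ValueError(f"unrecognized checksum line: {line!r}")


def checksum_matches(data, checksum_line):
    """Compare the MD5 of data against a stored checksum line."""
    return hashlib.md5(data).hexdigest() == extract_md5(checksum_line)
```

Computing the digest with hashlib avoids depending on whichever 
command-line tool the host happens to provide.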

-Mark
-- 
Mark Diggory
Software Developer
Harvard MIT Data Center
http://www.hmdc.harvard.edu

Re: duplicate data

Posted by Sander Striker <st...@apache.org>.
On Sat, 2004-02-21 at 01:01, Mark R. Diggory wrote:
> Noel J. Bergman wrote:
> >>The issue is... the jars/distributables are placed into the
> >>java-repository using maven.

Can you explain this a bit?  I thought Maven was used to fetch
projects and dependencies.  Of course I can read up on Maven,
but a quick summary of the technicalities would be appreciated.

>  so, currently, if you look in
> >>something like the commons project.properties you'll see that
> >>they are pointing to the central repository for the location
> >>of where to "publish" files.
> > 
> > 
> >>The "convergence issues" we currently have for the repository:
> >>  1.) We want single copies of files on the mirrors.
> > 
> >  +1

This is the core point.

> >>My best conclusion is
> >> keep "jars" in the java-repository, do not keep them
> >> in your /dist/<project>/<binaries> directory.  Remove all
> >> [jar/zip/tar files] from the java-repository.
> > 
> > 
> >>symlnk the appropriate java-repository dir into their appropriate
> >>"dist" directory.

That would mean that this entire area would have to be rw to all
groups producing releases that are to be in there.  This kind of means
apcvs group ownership, which I don't really fancy doing.  The other
way around, keeping control and access of each project's dist/ area
separate and symlinking to that from java-repository, seems a bit
sa[fn]er to me.
 
> > That sounds OK to me, but folks like Sander and others more involved in
> > mirroring should be put in the loop.  Everything we put under dist/ effects
> > 100s of mirrors.

Not me specifically, but Infrastructure.  Others are more actively
maintaining the mirrors list and monitoring the mirrors.  The mirrors
are a precious resource and we want to be careful not to 'scare' any
mirrors away with actions on our end.

> Yes, I learned that the hard way when we created the contents of 
> java-repository... that was not a happy weekend. I don't make any "rash" 
> changes to dist any more...Only well thought out moves. But we are in a 
> state of cleanup now as well, we have to consider what we are going to 
> do next.

If you are making large changes to the directory structure and the
majority of the files are already on the mirrors, send a mail to
mirrors@, attach a shell script that moves everything around locally,
and give them a heads up on when this shuffle is happening.  This
saves a _lot_ of bandwidth.
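
For illustration only, such a script could be generated from a list of 
old and new paths, as in the sketch below. The render_move_script name 
and the example paths are hypothetical, and the sketch assumes the 
script is run from the top of a mirror's local dist/ copy.

```python
import posixpath
import shlex


def render_move_script(moves):
    """Render a POSIX shell script that applies (old, new) path moves."""
    lines = [
        "#!/bin/sh",
        "# Run from the top of the local dist/ copy.",
        "set -e",
    ]
    for old, new in moves:
        parent = posixpath.dirname(new)
        if parent:
            # Create the destination directory if it does not exist yet.
            lines.append(f"mkdir -p {shlex.quote(parent)}")
        lines.append(f"mv {shlex.quote(old)} {shlex.quote(new)}")
    return "\n".join(lines) + "\n"


print(render_move_script([
    ("foo/binaries/foo-1.0.jar", "java-repository/foo/jars/foo-1.0.jar"),
]))
```

A mirror that runs the script before its next sync already has the 
files in their new locations, so rsync only needs to confirm them 
rather than transfer them again.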

Also, when adding a lot, make sure to inform the mirrors, so they
are prepared.

> >>Discussion about how to finalize the directory structure such
> >>that "Repository", "Dist/Mirror" and "Maven" has to move forward.

I don't parse this, but since Noel can read it, I am probably missing
context/background.

> > That would be good.


> In our last discussion, I think one of the conclusions that was arrived 
> at as well, was the idea of breaking the java-repository up into two 
> different locations.
> 
> www/cvs.apache.org/dist/java-repository --> nightly builds
> 
> www/www.apache.org/dist/java-repository --> official releases.
> 
> the idea was that nightly/weekly builds are not things we want to see on 
> mirrors but to be available for developers. And that official release of 
> jars are things we want to see mirrored.

Is Maven using the mirrors today, like getting the list of active
mirrors from the main site and finding the closest?  Or is it only
using the main site and perhaps ibiblio?

> When it comes to things like the ibiblio maven repository, it would only 
> maintain full version releases of apache projects.

Can you explain why ibiblio is special here?  I mean, what you describe
is what is supposed to be on all the mirrors right?

>  If your an apache 
> project and need to be on the bleeding edge for a component, then you 
> can simply add
> 
> www/cvs.apache.org/dist/java-repository
> 
> as your first repository location and get your apache jars straight off 
> the nightly builds...
> 
> The big question is how to facilitate this a build process, I think the 
> last decision on the Jakarta Commons/General/Maven lists was that we 
> would automate the build process for releasing the nightly jars into
> 
> www/cvs.apache.org/dist/java-repository
> 
> And the only publishing of jars by actual humans (Release Managers) 
> would be the full releases onto
> 
> www/www.apache.org/dist/java-repository

Symlinks I hope.  Mirrors handle symlinks efficiently, that is,
if they follow our rsync instructions.

Take a look at http://www.apache.org/~henkp/md5/, specifically
the fyi: some duplicates section.  Dups are a waste of bandwidth
and diskspace.

I'll ask Henk to disable the checks for presence of md5 in the
dist/java-repository, since that doesn't seem to be applicable
there.  It seems to me that you do want to do some verification
in maven, but you are probably storing signature information
somewhere in the maven 'database'?

> I believe this sort of approach would be inline with the policies of the 
> dist/mirrors.

Sander

Re: duplicate data

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.

Noel J. Bergman wrote:
>>The issue is... the jars/distributables are placed into the
>>java-repository using maven. so, currently, if you look in
>>something like the commons project.properties you'll see that
>>they are pointing to the central repository for the location
>>of where to "publish" files.
> 
> 
>>The "convergence issues" we currently have for the repository:
>>  1.) We want single copies of files on the mirrors.
> 
> 
>  +1
> 
> 
>>My best conclusion is
>> keep "jars" in the java-repository, do not keep them
>> in your /dist/<project>/<binaries> directory.  Remove all
>> [jar/zip/tar files] from the java-repository.
> 
> 
>>symlnk the appropriate java-repository dir into their appropriate
>>"dist" directory.
> 
> 
> That sounds OK to me, but folks like Sander and others more involved in
> mirroring should be put in the loop.  Everything we put under dist/ effects
> 100s of mirrors.
> 


Yes, I learned that the hard way when we created the contents of 
java-repository... that was not a happy weekend. I don't make any "rash" 
changes to dist any more...Only well thought out moves. But we are in a 
state of cleanup now as well, we have to consider what we are going to 
do next.


> 

>>Discussion about how to finalize the directory structure such
>>that "Repository", "Dist/Mirror" and "Maven" has to move forward.
> 
> 
> That would be good.
> 

In our last discussion, I think one of the conclusions that was arrived 
at as well, was the idea of breaking the java-repository up into two 
different locations.

www/cvs.apache.org/dist/java-repository --> nightly builds

www/www.apache.org/dist/java-repository --> official releases.

The idea was that nightly/weekly builds are not things we want to see on 
mirrors, but they should be available for developers, and that official 
releases of jars are things we want to see mirrored.

When it comes to things like the ibiblio maven repository, it would only 
maintain full version releases of Apache projects. If you're an Apache 
project and need to be on the bleeding edge for a component, then you 
can simply add

www/cvs.apache.org/dist/java-repository

as your first repository location and get your apache jars straight off 
the nightly builds...
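
The effect of listing the nightly location first can be sketched as a 
simple ordered lookup. This is a simplified model rather than Maven's 
resolution logic: first_available, the exists callback, and the nightly 
URL used in the example are assumptions made for illustration.

```python
def first_available(artifact_path, repositories, exists):
    """Return the URL of the artifact in the first repository that has it.

    repositories is an ordered list of base URLs; exists is a callable
    that reports whether a URL can be fetched. Returns None if no
    repository has the artifact.
    """
    for base in repositories:
        url = f"{base.rstrip('/')}/{artifact_path.lstrip('/')}"
        if exists(url):
            return url
    return None


# Hypothetical repository contents, standing in for real HTTP checks.
nightly = "http://nightly.example.org/java-repository"
releases = "http://www.ibiblio.org/maven"
published = {
    nightly + "/foo/jars/foo-1.1-dev.jar",
    releases + "/foo/jars/foo-1.0.jar",
}

print(first_available("foo/jars/foo-1.1-dev.jar",
                      [nightly, releases], published.__contains__))
```

Earlier entries win, so a project that lists the nightly location first 
picks up development jars there and falls back to the release 
repository for everything else.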

The big question is how to facilitate this as a build process. I think 
the last decision on the Jakarta Commons/General/Maven lists was that we 
would automate the build process for releasing the nightly jars into

www/cvs.apache.org/dist/java-repository

And the only publishing of jars by actual humans (Release Managers) 
would be the full releases onto

www/www.apache.org/dist/java-repository

I believe this sort of approach would be in line with the policies of 
the dist/mirrors.

-- 
Mark Diggory
Software Developer
Harvard MIT Data Center
http://www.hmdc.harvard.edu

RE: duplicate data

Posted by "Noel J. Bergman" <no...@devtech.com>.
> The issue is... the jars/distributables are placed into the
> java-repository using maven. so, currently, if you look in
> something like the commons project.properties you'll see that
> they are pointing to the central repository for the location
> of where to "publish" files.

> The "convergence issues" we currently have for the repository:
>   1.) We want single copies of files on the mirrors.

 +1

> My best conclusion is
>  keep "jars" in the java-repository, do not keep them
>  in your /dist/<project>/<binaries> directory.  Remove all
>  [jar/zip/tar files] from the java-repository.

> symlnk the appropriate java-repository dir into their appropriate
> "dist" directory.

That sounds OK to me, but folks like Sander and others more involved in
mirroring should be put in the loop.  Everything we put under dist/ affects
100s of mirrors.

> Discussion about how to finalize the directory structure such
> that "Repository", "Dist/Mirror" and "Maven" has to move forward.

That would be good.

	--- Noel


Re: duplicate data

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
The issue is... the jars/distributables are placed into the 
java-repository using Maven. So, currently, if you look in something 
like the commons project.properties you'll see that they are pointing to 
the central repository for the location of where to "publish" files.

######################################################################
# Apache Central Repository
######################################################################
maven.repo.central=www.apache.org
maven.repo.central.directory=/www/www.apache.org/dist/java-repository
maven.remote.group=apcvs


To publish using Maven, the only logical way I can see to resolve this 
discrepancy is to symlink from the project's dir itself, something like

.../java-repository/<project> ---> /dist/<tlp>/**<subproject>/...

The "convergence issues" we currently have for the repository:

1.) We want single copies of files on the mirrors.

2.) We want the repository to reflect the hierarchical structure of our 
projects/subprojects.

3.) Many projects already have groupId/projectIds in Maven that do not 
match the hierarchical nature of the projects at Apache. It's very 
difficult to actually move these at this time due to dependency issues.

for instance

/java-repository/commons-collections

I can't identify if this is jakarta, xml or apache commons... (we all 
know from experience it's jakarta commons collections).

So the issue is, which direction is the convergence going to go. I'd 
personally like to see the day when

/www/www.apache.org/dist

is the repository location and the maven project ids are something like:

jakarta/commons/collections/<artifact>/<version>/...
xml/commons/resolver/<artifact>/<version>/...
commons/.../<artifact>/<version>/...
avalon/.../<artifact>/<version>/...

which would match fairly well the directory structure of the dist 
directory, the major changes would then be

<subproject>/source
<subproject>/binary

would be replaced with

<subproject>/<artifact>/<version>/<artifact>-<version>.<ext>

or whatever we can finally agree upon for the repository URL structure 
(one that's not "theoretical", but actually "used" by tools)...
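
As a small sketch of that proposed pattern (not an agreed 
specification), the helper below assembles a path from its parts. The 
proposed_path name is hypothetical, and the version number in the 
example is made up.

```python
def proposed_path(subproject, artifact, version, ext):
    """Build <subproject>/<artifact>/<version>/<artifact>-<version>.<ext>.

    subproject may itself be hierarchical, e.g. "jakarta/commons/collections".
    """
    for label, value in (("artifact", artifact),
                         ("version", version),
                         ("ext", ext)):
        if not value or "/" in value:
            raise ValueError(f"{label} must be a single non-empty path segment")
    subproject = subproject.strip("/")
    if not subproject:
        raise ValueError("subproject must not be empty")
    return f"{subproject}/{artifact}/{version}/{artifact}-{version}.{ext}"


print(proposed_path("jakarta/commons/collections",
                    "commons-collections", "2.1", "jar"))
```

With the owning project encoded in the path, a name like 
commons-collections no longer has to be disambiguated from experience.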

So the point is, yes, we want to resolve the replication issues. My best 
conclusion is

A.) Currently just keep "jars" in the java-repository, do not keep them 
in your /dist/<project>/<binaries> directory. Remove all distributions 
from the java-repository. Currently, I haven't seen much use in actually 
releasing tarballs/zips into the Maven repository; others may have 
better opinions.

B.) If a project really wants to publish both jars and tar/zip 
distributions to the same location, then have them symlink the 
appropriate java-repository dir into their appropriate "dist" directory.

C.) Discussion about how to finalize the directory structure such that 
"Repository", "Dist/Mirror" and "Maven" has to move forward. Refactoring 
steps and interim solutions have to be discussed and thought about.

This will probably make for a great weekend discussion,
-Mark

Noel J. Bergman wrote:

> Mark,
> 
> An issue was raised earlier today that should be addressed.  The impression
> is that java-repository is publishing copies of jars that are also under
> dist/TLP/..., which puts more of a load on the mirrors.  It might be best if
> the jars/ directories contained symlinks and not copies of artifacts
> published elsewhere.
> 
> This doesn't address the fact that at some point an artifact may exist only
> in the archives, but that would require meta-data aware clients.
> 
> 	--- Noel
> 

-- 
Mark Diggory
Software Developer
Harvard MIT Data Center
http://www.hmdc.harvard.edu