Posted to repository@apache.org by "Noel J. Bergman" <no...@devtech.com> on 2004/02/20 23:18:44 UTC
duplicate data
Mark,
An issue was raised earlier today that should be addressed. The impression
is that java-repository is publishing copies of jars that are also under
dist/TLP/..., which puts more of a load on the mirrors. It might be best if
the jars/ directories contained symlinks and not copies of artifacts
published elsewhere.
This doesn't address the fact that at some point an artifact may exist only
in the archives, but that would require meta-data aware clients.
--- Noel
Re: duplicate data
Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
I'll try to expand on the functionalities of Maven below.
Sander Striker wrote:
> On Sat, 2004-02-21 at 01:01, Mark R. Diggory wrote:
>
>>Noel J. Bergman wrote:
>>
>>>>The issue is... the jars/distributables are placed into the
>>>>java-repository using maven.
>
>
> Can you explain this a bit? I thought Maven was used to fetch
> projects and dependencies. Ofcourse I can read up on Maven,
> but a quick summary of the technicalities would be appreciated.
>
Maven is used both to fetch jars from the repository and to publish jars
to it. For publishing, it essentially opens ssh sessions and runs a fixed
sequence of commands (scp, md5, chmod, chgrp). Because this is encapsulated
within Maven, the user can rely on Maven's deployment mechanism to set up
the jar/signature in the repository for their project, and since it is
scripted, it is done the same way every time. This takes a great deal of
the effort involved in publishing jars to the repository out of the
user's hands.
Maven is really doing nothing more than acting as an ssh client for the
user and automating the deployment process for them using their apache
account.
This benefits Maven because it can rely on the repository being
maintained in a structure it can predict and locate dependencies within.
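As a rough illustration of the scripted sequence described above (not
Maven's actual implementation), the following Python sketch builds the
commands such a deployment might run. The function name, the "jars"
subdirectory layout, and the 664 file mode are assumptions made for the
example; the host, directory, and group values in the usage note are the
ones from the commons project.properties quoted elsewhere in this thread.

```python
def deploy_steps(jar_name, host, repo_dir, project, group):
    """Return the commands a scripted jar deployment might run, in order.

    Hypothetical sketch: it only builds argument lists and runs nothing.
    """
    target = f"{repo_dir}/{project}/jars/{jar_name}"
    return [
        # 1. Copy the jar into the project's jars/ directory.
        ["scp", jar_name, f"{host}:{target}"],
        # 2. Write a checksum file next to the jar ("md5" is the BSD tool).
        ["ssh", host, f"md5 {target} > {target}.md5"],
        # 3. Make both files group-writable (mode 664 is an assumption).
        ["ssh", host, f"chmod 664 {target} {target}.md5"],
        # 4. Hand both files to the shared group.
        ["ssh", host, f"chgrp {group} {target} {target}.md5"],
    ]
```

Calling deploy_steps("commons-math-1.0.jar", "www.apache.org",
"/www/www.apache.org/dist/java-repository", "commons-math", "apcvs")
yields four argument lists, one each for scp, md5, chmod and chgrp. The
jar name here is made up for the example.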
>
>> so, currently, if you look in
>>
>>>>something like the commons project.properties you'll see that
>>>>they are pointing to the central repository for the location
>>>>of where to "publish" files.
>>>
>>>
>>>>The "convergence issues" we currently have for the repository:
>>>> 1.) We want single copies of files on the mirrors.
>>>
>>> +1
>
>
> This is the core point.
>
Yes, we all agree on this one...
>
>>>>My best conclusion is
>>>>keep "jars" in the java-repository, do not keep them
>>>>in your /dist/<project>/<binaries> directory. Remove all
>>>>[jar/zip/tar files] from the java-repository.
>>>
>>>
>>>>symlnk the appropriate java-repository dir into their appropriate
>>>>"dist" directory.
>
>
> That would mean that this entire area would have to be rw to all
> groups producing releases that are to be in there. This kindof means
> apcvs group ownership, which I don't really fancy doing. The other
> way around, control and access of each projects dist/ area seperated,
> and symlinking to that from java-repository, seems a bit sa[fn]er to
> me.
>
Ultimately we are seeking a convergence here between what the repository
folks want to see, the maven users want to see and the infrastructure
folks want to see.
1.) For the repository (and Maven) folks, we want to see the contents of
dist become standardized according to the Repository URI specification.
This means "all" distributables (java or not) are organized according to
this specification.
2.) For Maven users, no matter what happens, we need to maintain a
functioning repository that works with the existing version of
Maven.
3.) For Infrastructure, all this needs to be properly secured and
maintained according to Apache standards.
The java-repository structure is broken down into
.../java-repository/<project>/<jars|distributables|...>/<foo-version.ext>
this would mean each project would need to maintain a separate set of
symlinks for "jars", "distributables", "...".
>
>>>That sounds OK to me, but folks like Sander and others more involved in
>>>mirroring should be put in the loop. Everything we put under dist/ effects
>>>100s of mirrors.
>
>
> Not me specifically, but Infrastructure. Others are more actively
> maintaining the mirrors list and monitoring the mirrors. The mirrors
> are a precious resource and we want to be careful not to 'scare' any
> mirrors away with actions on our end.
>
>
>>Yes, I learned that the hard way when we created the contents of
>>java-repository... that was not a happy weekend. I don't make any "rash"
>>changes to dist any more...Only well thought out moves. But we are in a
>>state of cleanup now as well, we have to consider what we are going to
>>do next.
>
>
> If you are making large changes to the directory structure and the
> majority of the files is already on the mirrors, send a mail to
> mirrors@, attach a shell script that moves everything around locally,
> and give them a heads up on when this shuffle is happening. This
> save a _lot_ of bandwidth.
>
> Also, when adding a lot, make sure to inform the mirrors, so they
> are prepared.
>
>
>>>>Discussion about how to finalize the directory structure such
>>>>that "Repository", "Dist/Mirror" and "Maven" has to move forward.
>
>
> I don't parse this, but since Noel can read it, I am probably missing
> context/background.
>
Just that these groups are all focused on different aspects of the
distributables in the dist directory:
The Repository project's URL structure is important for standardizing
the dist contents into a more formal structure.
The Maven project represents a working example of a tool built upon
this structure.
The dist directory maintainers and the mirrors out there together
represent a "control" on the whole situation: if it doesn't work for
them, then it's not realistic as a strategy.
>
>>>That would be good.
>
>
>
>>In our last discussion, I think one of the conclusions that was arrived
>>at as well, was the idea of breaking the java-repository up into two
>>different locations.
>>
>>www/cvs.apache.org/dist/java-repository --> nightly builds
>>
>>www/www.apache.org/dist/java-repository --> official releases.
>>
>>the idea was that nightly/weekly builds are not things we want to see on
>>mirrors but to be available for developers. And that official release of
>>jars are things we want to see mirrored.
>
>
> Is Maven using the mirrors today, like getting the list of active
> mirrors from the main site and finding the closest? Or is it only
> using the main site and perhaps iblibio?
>
Currently, all Maven clients use www.ibiblio.org/maven to retrieve
content. www.ibiblio.org is also a mirror of /java-repository for all
its Apache content. In practice, Maven users DO NOT go to
www.apache.org/dist/java-repository to download files, and only Apache
developers can publish to www.apache.org/dist/java-repository.
Which server is used currently depends on the configuration of the Maven
client; servers have no way to hand the client off to another mirror. I
think that in the future, as the Repository comes into existence along
with machine-readable metadata or mechanisms for directing clients to
mirrors, clients like Maven will implement such capabilities.
>
>>When it comes to things like the ibiblio maven repository, it would only
>>maintain full version releases of apache projects.
>
>
> Can you explain why ibiblio is special here? I mean, what you describe
> is what is supposed to be on all the mirrors right?
>
Just because it is the "default" repository used by the Maven Client.
>
>> If your an apache
>>project and need to be on the bleeding edge for a component, then you
>>can simply add
>>
>>www/cvs.apache.org/dist/java-repository
>>
>>as your first repository location and get your apache jars straight off
>>the nightly builds...
>>
>>The big question is how to facilitate this a build process, I think the
>>last decision on the Jakarta Commons/General/Maven lists was that we
>>would automate the build process for releasing the nightly jars into
>>
>>www/cvs.apache.org/dist/java-repository
>>
>>And the only publishing of jars by actual humans (Release Managers)
>>would be the full releases onto
>>
>>www/www.apache.org/dist/java-repository
>
>
> Symlinks I hope. Mirrors handle symlinks efficiently, that is,
> if they follow our rsync instructions.
The only mirroring that would be done would be via:
www.apache.org/dist
All other content in cvs.apache.org or archive.apache.org is not to be
"synced", as it's not to be published out to mirrors; such content
consists of "developer builds" and is not for public consumption.
Within the www.apache.org/dist directory, yes, symlinking should be used
to resolve duplication.
>
> Take a look at http://www.apache.org/~henkp/md5/, specifically
> the fyi: some duplicates section. Dups are a waste of bandwidth
> and diskspace.
Yes, approx 50% of instances of duplication on this page are currently
caused by avalon components (avalon also was using their dist directory
as a private maven repository). For example:
avalon/excalibur-component/jars/excalibur-component-1.1.jar
java-repository/excalibur-component/jars/excalibur-component-1.1.jar
I understand it can be the policy when rsyncing that a symlink will not
be followed if it and its target directory do not have the same
ownership. I believe this creates a problem, in that I cannot simply
create symlinks from java-repository/excalibur-component/ to
avalon/excalibur-component/, as they will not be followed by rsync.
However, the other 50% of duplicates, those within the java-repository
directory, should be properly alleviated with symlinking. I can work on
this, as I now (as of a couple of days ago) own all the files :-). I will
start working on a script I can run periodically which will accomplish this.
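A minimal sketch of what such a periodic script could look like, in
Python. This is an illustration, not the script that was actually
written: the function names are hypothetical, it assumes the two
directory trees do not overlap, and it treats two files as duplicates
only when both the file name and the MD5 digest match. It defaults to a
dry run so nothing is changed until the result has been reviewed.

```python
import hashlib
import os


def file_md5(path):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def symlink_duplicates(canonical_root, duplicate_root, dry_run=True):
    """Replace copies under duplicate_root with relative symlinks.

    A file counts as a duplicate when a regular file with the same name
    and the same MD5 digest exists under canonical_root.
    Returns a list of (path, link_target) pairs.
    """
    originals = {}
    for dirpath, _dirs, files in os.walk(canonical_root):
        for name in files:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                originals.setdefault((name, file_md5(path)), path)

    replaced = []
    for dirpath, _dirs, files in os.walk(duplicate_root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # already a symlink, nothing to do
            original = originals.get((name, file_md5(path)))
            if original is None:
                continue  # no identical file in the canonical tree
            link_target = os.path.relpath(original, start=dirpath)
            if not dry_run:
                os.remove(path)
                os.symlink(link_target, path)
            replaced.append((path, link_target))
    return replaced
```

Relative link targets are used so that the links stay valid when the
tree is copied to a different location, which matters for mirrors. Note
that this does nothing about the ownership rule mentioned above: links
that cross into a directory with different ownership may still not be
followed.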
>
> I'll ask Henk to disable the checks for presence of md5 in the
> dist/java-repository, since that doesn't seem to be applicable
> there. It seems to me that you do want to do some verification
> in maven, but you are probably storing signature information
> somewhere in the maven 'database'?
>
No, it is in the directory structure (no db), and md5s should exist next
to the files. There is a bug in Maven caused by the fact that on BSD
checksums are generated by "md5", not "md5sum" as on Linux. This needs
to be addressed; for example, you saw that my md5 was bad on the math jar
(which I just fixed).
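To make the md5/md5sum difference concrete: the two tools print their
results in different layouts, so a script that expects one layout will
record a bad checksum when given the other. The Python sketch below
normalizes both layouts to a (digest, filename) pair. It is an
illustration of the problem under that assumption, not the fix that went
into Maven, and the function name is hypothetical.

```python
import re

# BSD "md5" prints:        MD5 (file.jar) = <32 hex digits>
_BSD_LINE = re.compile(r"^MD5 \((?P<name>.+)\) = (?P<digest>[0-9a-fA-F]{32})$")
# GNU "md5sum" prints:     <32 hex digits>  file.jar   (or " *file.jar")
_GNU_LINE = re.compile(r"^(?P<digest>[0-9a-fA-F]{32}) [ *](?P<name>.+)$")


def parse_md5_line(line):
    """Return (digest, filename) from one line of md5 or md5sum output."""
    line = line.strip()
    for pattern in (_BSD_LINE, _GNU_LINE):
        match = pattern.match(line)
        if match:
            return match.group("digest").lower(), match.group("name")
    raise ValueError(f"unrecognized checksum line: {line!r}")
```

With this, a line from either tool can be reduced to the bare digest
before it is written to the .md5 file next to the jar.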
-Mark
--
Mark Diggory
Software Developer
Harvard MIT Data Center
http://www.hmdc.harvard.edu
Re: duplicate data
Posted by Sander Striker <st...@apache.org>.
On Sat, 2004-02-21 at 01:01, Mark R. Diggory wrote:
> Noel J. Bergman wrote:
> >>The issue is... the jars/distributables are placed into the
> >>java-repository using maven.
Can you explain this a bit? I thought Maven was used to fetch
projects and dependencies. Of course I can read up on Maven,
but a quick summary of the technicalities would be appreciated.
> so, currently, if you look in
> >>something like the commons project.properties you'll see that
> >>they are pointing to the central repository for the location
> >>of where to "publish" files.
> >
> >
> >>The "convergence issues" we currently have for the repository:
> >> 1.) We want single copies of files on the mirrors.
> >
> > +1
This is the core point.
> >>My best conclusion is
> >> keep "jars" in the java-repository, do not keep them
> >> in your /dist/<project>/<binaries> directory. Remove all
> >> [jar/zip/tar files] from the java-repository.
> >
> >
> >>symlnk the appropriate java-repository dir into their appropriate
> >>"dist" directory.
That would mean that this entire area would have to be rw to all
groups producing releases that are to be in there. This kind of means
apcvs group ownership, which I don't really fancy doing. The other
way around, keeping control and access of each project's dist/ area
separate and symlinking to that from java-repository, seems a bit
sa[fn]er to me.
> > That sounds OK to me, but folks like Sander and others more involved in
> > mirroring should be put in the loop. Everything we put under dist/ effects
> > 100s of mirrors.
Not me specifically, but Infrastructure. Others are more actively
maintaining the mirrors list and monitoring the mirrors. The mirrors
are a precious resource and we want to be careful not to 'scare' any
mirrors away with actions on our end.
> Yes, I learned that the hard way when we created the contents of
> java-repository... that was not a happy weekend. I don't make any "rash"
> changes to dist any more...Only well thought out moves. But we are in a
> state of cleanup now as well, we have to consider what we are going to
> do next.
If you are making large changes to the directory structure and the
majority of the files are already on the mirrors, send a mail to
mirrors@, attach a shell script that moves everything around locally,
and give them a heads up on when this shuffle is happening. This
saves a _lot_ of bandwidth.
Also, when adding a lot, make sure to inform the mirrors, so they
are prepared.
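One way to produce the kind of shell script mentioned above is to
generate it from a list of old and new paths, so the script sent to
mirrors@ matches exactly what was done on the main site. The Python
sketch below is only an illustration under that assumption; the function
name and the example paths in the usage note are hypothetical.

```python
import posixpath
import shlex


def relocation_script(moves):
    """Build a POSIX shell script that performs the given moves locally.

    moves is a list of (old_path, new_path) pairs, relative to the root
    of the mirrored tree.
    """
    lines = ["#!/bin/sh", "set -e"]
    for old, new in moves:
        parent = posixpath.dirname(new)
        if parent:
            # Make sure the destination directory exists first.
            lines.append(f"mkdir -p {shlex.quote(parent)}")
        lines.append(f"mv {shlex.quote(old)} {shlex.quote(new)}")
    return "\n".join(lines) + "\n"
```

For example, relocation_script([("old/foo-1.0.jar",
"new/jars/foo-1.0.jar")]) produces a script with one mkdir -p line and
one mv line. A mirror that runs it before its next rsync already has the
files in their new location and has nothing to download again.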
> >>Discussion about how to finalize the directory structure such
> >>that "Repository", "Dist/Mirror" and "Maven" has to move forward.
I don't parse this, but since Noel can read it, I am probably missing
context/background.
> > That would be good.
> In our last discussion, I think one of the conclusions that was arrived
> at as well, was the idea of breaking the java-repository up into two
> different locations.
>
> www/cvs.apache.org/dist/java-repository --> nightly builds
>
> www/www.apache.org/dist/java-repository --> official releases.
>
> the idea was that nightly/weekly builds are not things we want to see on
> mirrors but to be available for developers. And that official release of
> jars are things we want to see mirrored.
Is Maven using the mirrors today, like getting the list of active
mirrors from the main site and finding the closest? Or is it only
using the main site and perhaps ibiblio?
> When it comes to things like the ibiblio maven repository, it would only
> maintain full version releases of apache projects.
Can you explain why ibiblio is special here? I mean, what you describe
is what is supposed to be on all the mirrors, right?
> If your an apache
> project and need to be on the bleeding edge for a component, then you
> can simply add
>
> www/cvs.apache.org/dist/java-repository
>
> as your first repository location and get your apache jars straight off
> the nightly builds...
>
> The big question is how to facilitate this a build process, I think the
> last decision on the Jakarta Commons/General/Maven lists was that we
> would automate the build process for releasing the nightly jars into
>
> www/cvs.apache.org/dist/java-repository
>
> And the only publishing of jars by actual humans (Release Managers)
> would be the full releases onto
>
> www/www.apache.org/dist/java-repository
Symlinks I hope. Mirrors handle symlinks efficiently, that is,
if they follow our rsync instructions.
Take a look at http://www.apache.org/~henkp/md5/, specifically
the fyi: some duplicates section. Dups are a waste of bandwidth
and diskspace.
I'll ask Henk to disable the checks for presence of md5 in the
dist/java-repository, since that doesn't seem to be applicable
there. It seems to me that you do want to do some verification
in maven, but you are probably storing signature information
somewhere in the maven 'database'?
> I believe this sort of approach would be inline with the policies of the
> dist/mirrors.
Sander
Re: duplicate data
Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
Noel J. Bergman wrote:
>>The issue is... the jars/distributables are placed into the
>>java-repository using maven. so, currently, if you look in
>>something like the commons project.properties you'll see that
>>they are pointing to the central repository for the location
>>of where to "publish" files.
>
>
>>The "convergence issues" we currently have for the repository:
>> 1.) We want single copies of files on the mirrors.
>
>
> +1
>
>
>>My best conclusion is
>> keep "jars" in the java-repository, do not keep them
>> in your /dist/<project>/<binaries> directory. Remove all
>> [jar/zip/tar files] from the java-repository.
>
>
>>symlnk the appropriate java-repository dir into their appropriate
>>"dist" directory.
>
>
> That sounds OK to me, but folks like Sander and others more involved in
> mirroring should be put in the loop. Everything we put under dist/ effects
> 100s of mirrors.
>
Yes, I learned that the hard way when we created the contents of
java-repository... that was not a happy weekend. I don't make any "rash"
changes to dist any more...Only well thought out moves. But we are in a
state of cleanup now as well, we have to consider what we are going to
do next.
>
>>Discussion about how to finalize the directory structure such
>>that "Repository", "Dist/Mirror" and "Maven" has to move forward.
>
>
> That would be good.
>
In our last discussion, I think one of the conclusions we arrived at
was the idea of splitting the java-repository into two different
locations.
www/cvs.apache.org/dist/java-repository --> nightly builds
www/www.apache.org/dist/java-repository --> official releases.
The idea was that nightly/weekly builds are not things we want to see on
mirrors, but they should be available for developers, and that official
releases of jars are things we want to see mirrored.
When it comes to things like the ibiblio Maven repository, it would only
maintain full version releases of Apache projects. If you're an Apache
project and need to be on the bleeding edge for a component, then you
can simply add
www/cvs.apache.org/dist/java-repository
as your first repository location and get your apache jars straight off
the nightly builds...
The big question is how to facilitate this in a build process. I think
the last decision on the Jakarta Commons/General/Maven lists was that we
would automate the build process for releasing the nightly jars into
www/cvs.apache.org/dist/java-repository
And the only publishing of jars by actual humans (Release Managers)
would be the full releases onto
www/www.apache.org/dist/java-repository
I believe this sort of approach would be in line with the policies of the
dist/mirrors.
RE: duplicate data
Posted by "Noel J. Bergman" <no...@devtech.com>.
> The issue is... the jars/distributables are placed into the
> java-repository using maven. so, currently, if you look in
> something like the commons project.properties you'll see that
> they are pointing to the central repository for the location
> of where to "publish" files.
> The "convergence issues" we currently have for the repository:
> 1.) We want single copies of files on the mirrors.
+1
> My best conclusion is
> keep "jars" in the java-repository, do not keep them
> in your /dist/<project>/<binaries> directory. Remove all
> [jar/zip/tar files] from the java-repository.
> symlnk the appropriate java-repository dir into their appropriate
> "dist" directory.
That sounds OK to me, but folks like Sander and others more involved in
mirroring should be put in the loop. Everything we put under dist/ affects
100s of mirrors.
> Discussion about how to finalize the directory structure such
> that "Repository", "Dist/Mirror" and "Maven" has to move forward.
That would be good.
--- Noel
Re: duplicate data
Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
The issue is... the jars/distributables are placed into the
java-repository using Maven. So, currently, if you look in something
like the commons project.properties, you'll see that they are pointing to
the central repository as the location where files are "published".
######################################################################
# Apache Central Repository
######################################################################
maven.repo.central=www.apache.org
maven.repo.central.directory=/www/www.apache.org/dist/java-repository
maven.remote.group=apcvs
To publish using Maven, the only logical way I can see to resolve this
discrepancy is to symlink from the project's dir itself, something like
.../java-repository/<project> ---> /dist/<tlp>/**<subproject>/...
The "convergence issues" we currently have for the repository:
1.) We want single copies of files on the mirrors.
2.) We want the repository to reflect the hierarchical structure of our
projects/subprojects.
3.) Many projects already have groupIds/projectIds in Maven that do not
match the hierarchical nature of the projects at Apache. It's very
difficult to actually move these at this time due to dependency issues.
For instance, given
/java-repository/commons-collections
I can't tell whether this is Jakarta, XML or Apache Commons (we all
know from experience it's Jakarta Commons Collections).
So the issue is which direction the convergence is going to go. I'd
personally like to see the day when
/www/www.apache.org/dist
is the repository location and the maven project ids are something like:
jakarta/commons/collections/<artifact>/<version>/...
xml/commons/resolver/<artifact>/<version>/...
commons/.../<artifact>/<version>/...
avalon/.../<artifact>/<version>/...
which would match fairly well the directory structure of the dist
directory, the major changes would then be
<subproject>/source
<subproject>/binary
would be replaced with
<subproject>/<artifact>/<version>/<artifact>-<version>.<ext>
or whatever we can finally agree upon for the repository URL structure
(one that's not "theoretical", but actually "used" by tools)...
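To make the proposed layout concrete, here is a small Python sketch that
builds a path of the form
<subproject>/<artifact>/<version>/<artifact>-<version>.<ext>. It only
illustrates the pattern proposed above, which had not been agreed upon;
the function name and the version number in the usage note are made up
for the example.

```python
import posixpath


def artifact_path(subproject, artifact, version, ext):
    """Build <subproject>/<artifact>/<version>/<artifact>-<version>.<ext>."""
    for part in (subproject, artifact, version, ext):
        if not part:
            raise ValueError("every path component must be non-empty")
    filename = f"{artifact}-{version}.{ext}"
    return posixpath.join(subproject, artifact, version, filename)
```

For example, artifact_path("jakarta/commons/collections",
"commons-collections", "3.0", "jar") gives
jakarta/commons/collections/commons-collections/3.0/commons-collections-3.0.jar,
so source and binary distributions of the same version would sit side by
side and differ only in their extension.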
So the point is, yes, we want to resolve the replication issues. My best
conclusion is
A.) For now, just keep "jars" in the java-repository; do not keep them
in your /dist/<project>/<binaries> directory. Remove all distributions
from the java-repository. So far, I haven't seen much use in actually
releasing tarballs/zips into the Maven repository; others may have
better opinions.
B.) If a project really wants to publish both jars and tar/zip
distributions to the same location, then have them symlink the
appropriate java-repository dir into their appropriate "dist" directory.
C.) Discussion about how to finalize the directory structure such that
it works for "Repository", "Dist/Mirror" and "Maven" has to move forward.
Refactoring steps and interim solutions have to be discussed and thought
about.
This will probably make for a great weekend discussion,
-Mark
Noel J. Bergman wrote:
> Mark,
>
> An issue was raised earlier today that should be addressed. The impression
> is that java-repository is publishing copies of jars that are also under
> dist/TLP/..., which puts more of a load on the mirrors. It might be best if
> the jars/ directories contained symlinks and not copies of artifacts
> published elsewhere.
>
> This doesn't address the fact that at some point an artifact may exist only
> in the archives, but that would require meta-data aware clients.
>
> --- Noel
>