Posted to dev@maven.apache.org by Boris Baldassari <bo...@gmail.com> on 2021/06/14 10:14:06 UTC

Software Heritage connector to Maven repositories

Hiho good people,

I am currently developing a Maven repository connector for the
Software Heritage Foundation [1]. In a nutshell, SWH aims to
archive all existing source code in the world, and provides useful
publicly available services and related tools (unique IDs/DOIs, search,
datasets, graph tools, ...). It's all open source, and many large forges
and software systems have already been archived (GitHub, GitLab, npm,
PyPI, Debian packages, CRAN, ...) [2]. Now we would like to archive the
Maven ecosystem.

[1] https://www.softwareheritage.org/
[2] https://archive.softwareheritage.org/

I'm reaching out to ask for wisdom and start a discussion about how this
could be achieved without impacting anybody, i.e. neither the maintainers
of Maven repositories nor their users. Our plan for now is to use the
Maven Indexer indexes for the listing, and then download POMs and source
JARs, in the way that we see as the most efficient and fair. We of course
respect all rate-limiting policies (and HTTP error codes), and we are
polite and patient (although tenacious).
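
To make the rate-limiting part concrete, here is a minimal Python sketch
of how a client could decide how long to wait after an error response.
The function name, the set of retryable status codes and the base/cap
values are illustrative assumptions, not the actual connector code and
not a policy published by any repository. It honours a Retry-After header
given in seconds and otherwise backs off exponentially; the HTTP-date
form of Retry-After is ignored here for brevity.

```python
from typing import Optional

# Status codes treated as "slow down and try again later".
# This set is an assumption made for the sketch.
RETRYABLE = {429, 502, 503, 504}


def retry_delay(status: int, retry_after: Optional[str], attempt: int,
                base: float = 2.0, cap: float = 600.0) -> Optional[float]:
    """Return the number of seconds to wait before retrying a request,
    or None if the response should not be retried at all."""
    if status not in RETRYABLE:
        return None
    if retry_after is not None:
        value = retry_after.strip()
        # Only the delta-seconds form is handled; an HTTP-date value
        # falls through to the exponential backoff below.
        if value.isascii() and value.isdigit():
            return float(value)
    # Exponential backoff: base, 2*base, 4*base, ... capped at `cap`.
    return min(cap, base * (2 ** attempt))
```

For example, retry_delay(429, "30", 0) asks for a 30 second pause, while
a 404 returns None, so the artefact is simply skipped instead of being
requested again.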

So, here are my questions:

* Who should we talk to in order to achieve that? I.e. are there Maven
repository maintainers on the list, or do you know of a better place to ask?

* Although we believe the above-mentioned process is the most efficient
and fair one, maybe there is a better way to list and archive artefact
sources? Any feedback or mere thoughts are welcome.


Thanks in advance, have a wonderful day!


-- 
Boris Baldassari
Castalia Solutions -- Elegant Software Engineering
Web: http://castalia.solutions
Tel: +33 6 48 03 82 89

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@maven.apache.org
For additional commands, e-mail: dev-help@maven.apache.org


Re: Software Heritage connector to Maven repositories

Posted by Boris Baldassari <bo...@gmail.com>.
On 15/06/2021 02:25, Bernd Eckenfels wrote:
> Hello Boris.
Hi Bernd,

> I know that opening a Nexus JIRA is the usual way to get responses.
Good point about the JIRA, thanks!
https://issues.sonatype.org/browse/MVNCENTRAL-6804

> BTW, also consider scraping the SCM URLs from the POM files and contacting the upstream repos; the Maven -src archives are often pruned down and not buildable (if present at all). So it does not hurt to archive them, but don’t expect them to help you in all cases.
We'll do that too, yes -- that's actually why we want to download all
POMs. And partial src jars are fine, since their content will be
mapped/deduplicated with the origins in git/svn/...

> Sadly, tag parsing and SCM scraping are not the most reliable things (we do it for consumed dependencies), but with some manual overwrite it’s manageable at small scale. Maybe you would get help if you provided such a registry as a GitHub project or asked OSSIndex for cooperation.
Hmm, interesting. May I ask why they're not reliable in your case, and
what "manual overwrite" means? I realise that the scm element is not
always (or even often) present in the POMs; are there any other caveats
you encountered?
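
For reference, a minimal Python sketch of the SCM scraping discussed
here, using only the standard library. The function name is hypothetical
and this is not the connector's actual code. It matches elements by
local name, because most POMs declare the
http://maven.apache.org/POM/4.0.0 namespace but some omit it. It does not
resolve parent POMs or property placeholders such as ${project.version},
so an empty result does not prove that the project has no SCM
information.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Union


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def scm_info(pom_xml: Union[str, bytes]) -> Dict[str, str]:
    """Return the non-empty children of the top-level <scm> element
    (typically connection, developerConnection, url, tag) as a dict.
    Returns an empty dict when the POM has no <scm> element."""
    root = ET.fromstring(pom_xml)
    for child in root:
        if local_name(child.tag) == "scm":
            info = {}
            for entry in child:
                text = (entry.text or "").strip()
                if text:
                    info[local_name(entry.tag)] = text
            return info
    return {}
```

Only the direct <scm> child of <project> is read, so SCM elements nested
elsewhere in the file are deliberately ignored.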

I hadn't considered publishing the full list of POMs, src jars or scm tags
(they can all be extracted from the Maven Indexer indexes, see [1]), but
if it is useful to others, that's easily doable (besides getting help).

[1] https://stackoverflow.com/a/59052733


Thanks a lot for the thoughtful reply, cheers!



-- 
Boris Baldassari
Castalia Solutions -- Elegant Software Engineering
Web: http://castalia.solutions
Tel: +33 6 48 03 82 89



Re: Software Heritage connector to Maven repositories

Posted by Bernd Eckenfels <ec...@zusammenkunft.net>.
Hello Boris.

I know that opening a Nexus JIRA is the usual way to get responses.

BTW, also consider scraping the SCM URLs from the POM files and contacting the upstream repos; the Maven -src archives are often pruned down and not buildable (if present at all). So it does not hurt to archive them, but don’t expect them to help you in all cases.

Sadly, tag parsing and SCM scraping are not the most reliable things (we do it for consumed dependencies), but with some manual overwrite it’s manageable at small scale. Maybe you would get help if you provided such a registry as a GitHub project or asked OSSIndex for cooperation.

Regards
Bernd
--
http://bernd.eckenfels.net


Re: Software Heritage connector to Maven repositories

Posted by Boris Baldassari <bo...@gmail.com>.
Hi Frederik,

Thanks for the kind answer and pointers.

Yes, we know that non-consensual mirroring, as well as scraping, is
explicitly forbidden. Hence the question here. :-)

It should be noted, however, that our case is a bit specific: we want to
get only *some* types of artefacts (POMs and src jars), but we want
*all* of them. So it's a special case of partial mirroring. Furthermore,
we will only access them *once* for the archiving, so the "re-use"
feature of a proxy, mirror or similar is not needed.
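
As an illustration of this "some types, but all of them, once" idea,
here is a small Python sketch. The Artifact record and the function names
are assumptions made for the example; the real listing would come from
the Maven Indexer indexes. It keeps only POMs and jars with the "sources"
classifier, builds their paths following the standard Maven 2 repository
layout, and yields each path a single time. Timestamped snapshot file
names are not handled.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Artifact(NamedTuple):
    """One entry of a repository listing (hypothetical record shape)."""
    group_id: str
    artifact_id: str
    version: str
    classifier: Optional[str]
    extension: str


def wanted(a: Artifact) -> bool:
    """Keep plain POMs and jars carrying the 'sources' classifier."""
    if a.extension == "pom" and not a.classifier:
        return True
    return a.extension == "jar" and a.classifier == "sources"


def repo_path(a: Artifact) -> str:
    """Path relative to the repository root, Maven 2 layout:
    group/as/dirs/artifactId/version/artifactId-version[-classifier].ext"""
    name = f"{a.artifact_id}-{a.version}"
    if a.classifier:
        name += f"-{a.classifier}"
    return "/".join([a.group_id.replace(".", "/"), a.artifact_id,
                     a.version, f"{name}.{a.extension}"])


def select_paths(artifacts: Iterable[Artifact]) -> Iterator[str]:
    """Yield the path of every wanted artefact exactly once."""
    seen = set()
    for a in artifacts:
        if wanted(a):
            path = repo_path(a)
            if path not in seen:
                seen.add(path)
                yield path
```

The resulting paths can be appended to a repository base URL; binary jars
and javadoc jars never reach the download step.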

Maybe a repo maintainer could provide some more wisdom?


Thanks, cheers!



--
boris







Re: Software Heritage connector to Maven repositories

Posted by Frederik Boster <fr...@boster.de.INVALID>.
I am not in any way affiliated with Apache or Sonatype. So take my opinion
with a grain of salt :)

Trying to mirror the entire Maven Central repository will unfortunately get
you automatically banned.
To avoid that, I would suggest you set up your own Maven Central mirror
first. [1]

[1]
https://maven.apache.org/guides/mini/guide-mirror-settings.html#creating-your-own-mirror
