Posted to dev@archiva.apache.org by Marc Lustig <ml...@marclustig.com> on 2009/10/14 15:06:18 UTC

Proposal: concurrent remote-requests

Hi all,

we have configured about 25 remote-repos for our public-artifacts managed
repo.
In certain cases, blacklists and whitelists don't help and a request is
proxied to all 25 remote-repos _sequentially_. Even though we have
configured a short timeout of 5 secs, this takes 125 secs in case the
artifact doesn't exist in any remote-repo - per artifact!

So I was wondering if it would make sense to send requests to all of the
remote-repos _concurrently_.
The first thread that finds the artifact could cause all the other threads
to cancel their http-requests.
The total request time would drop from over 100 secs to merely 5 secs.
A tremendous win, no?

Has this been discussed before?
Is there an argument against this strategy?

The implementation could be based on a thread-pool, or rather a pool of
thread-pools.
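
To make this concrete, here is a minimal sketch of the first-found-wins
idea using plain java.util.concurrent - invokeAny() returns the result of
the first task that completes successfully and cancels the rest, which is
exactly the semantics described above. Class, method and parameter names
here are mine for illustration, not actual Archiva code:

import java.io.FileNotFoundException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

public class ConcurrentProxyFetch {

    // Ask every remote repo for the artifact at once; invokeAny returns
    // the payload from the first repo that answers successfully and
    // cancels (interrupts) all the other in-flight requests.
    public static byte[] fetchFirstFound(List<String> repoBaseUrls,
                                         String artifactPath,
                                         long timeoutSecs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(repoBaseUrls.size());
        try {
            List<Callable<byte[]>> tasks = repoBaseUrls.stream()
                .map(base -> (Callable<byte[]>) () -> download(base + artifactPath))
                .collect(Collectors.toList());
            // Throws TimeoutException if no repo has answered in time,
            // ExecutionException if every repo failed.
            return pool.invokeAny(tasks, timeoutSecs, TimeUnit.SECONDS);
        } finally {
            pool.shutdownNow(); // interrupt whatever is still running
        }
    }

    private static byte[] download(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5_000);
        conn.setReadTimeout(5_000);
        if (conn.getResponseCode() != 200) {
            // a miss is just a failed task; invokeAny moves on to the others
            throw new FileNotFoundException(url);
        }
        try (InputStream in = conn.getInputStream()) {
            return in.readAllBytes(); // Java 9+, just to keep the sketch short
        }
    }
}

With 25 repos and a 5-sec timeout this caps the worst case at roughly
5 secs instead of 125.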

greetings
Marc


Re: Proposal: concurrent remote-requests / "ASF Certified Maven Repository"

Posted by Brett Porter <br...@apache.org>.
On 15/10/2009, at 7:00 PM, Marc Lustig wrote:

> For companies, this would be a compelling feature! I (working for
> insurances and banks) often hear the argument "oh boy - they are
> downloading software from some obscure server in Russia". Having the
> label "Certified Maven Repository" would surely make those noises
> quieter :-)
> The ASF could release a rule-set that the Maven repo must conform to
> in order to get the "certified" label.

This isn't really in the ASF's mission to provide. Everyone is going  
to have their own rules for what is certified - there are varying  
levels of trust, even if you verify it comes from the project itself  
(for example, see Eclipse's IP verification process).

In this case you are better off having a dedicated group of people
approving third-party artifacts as they arrive in Archiva for use by
others, and limiting proxy access to the outside. You can obviously do
this manually in Archiva now, though ideally you want a "quarantine" area
where artifacts can be retrieved and await approval, with a decent
workflow for moving them into an accessible repository.

- Brett

Re: Proposal: concurrent remote-requests / "ASF Certified Maven Repository"

Posted by Marc Lustig <ml...@marclustig.com>.
Thanks, Brett, for the input.
I can confirm that with blacklists and whitelists in place, the case where
all remote-repos are searched sequentially and the artifact is still not
found is rather rare. However, it is typical for some scenarios, e.g. when
you enable source-jars to be downloaded for a project. Out of 40 deps,
maybe 5 will have source-jars available. That way, a simple mvn goal takes
30 minutes or more.

I mentioned the timeout just to have a maximum value. Of course the
requests don't usually hit the timeout (except when the repo is down) -
the average response time is maybe 3-4 secs (for our installation).

Also, it is clear that the first-found concept conflicts with the existing
concept of an (ordered) list of repos that is searched through.
Can we not assume that artifacts with a given specification are identical
no matter which repo they come from, provided the hashes match?
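
For illustration, a sketch of what that hash check could look like - Maven
repositories publish a .sha1 file next to each artifact, so "identical"
could simply mean the computed digest matches that file (the class and
helper names are hypothetical):

import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class ChecksumCheck {

    // Maven repos serve <artifact>.sha1 next to each artifact. If two
    // repos publish the same coordinates and this digest matches, the
    // payloads can be treated as interchangeable.
    static boolean hashMatches(byte[] artifactBytes, String artifactUrl)
            throws Exception {
        String expected;
        try (InputStream in = new URL(artifactUrl + ".sha1").openStream()) {
            // some .sha1 files contain "<hex>  <filename>"; the hex comes first
            expected = new String(in.readAllBytes(), StandardCharsets.US_ASCII)
                    .trim().split("\\s+")[0];
        }
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(artifactBytes);
        StringBuilder actual = new StringBuilder();
        for (byte b : digest) {
            actual.append(String.format("%02x", b));
        }
        return actual.toString().equalsIgnoreCase(expected);
    }
}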

Btw., this brings up another idea: could the ASF possibly grant "official"
certificates for remote-repos?
That way, Archiva could distinguish between trusted and non-trusted
repos.
For companies, this would be a compelling feature! I (working for
insurances and banks) often hear the argument "oh boy - they are
downloading software from some obscure server in Russia". Having the
label "Certified Maven Repository" would surely make those noises
quieter :-)
The ASF could release a rule-set that the Maven repo must conform to
in order to get the "certified" label.
Or even better, the ASF could offer a VMware image that includes all the
software ready to run the Maven repo - including some logic to verify that
known artifacts are mirrored correctly. Total control of repos is not
possible, of course. But the contract between Archiva and the remote repo
could be tightened considerably.


Back to the concurrent-requests idea: sending a HEAD request before the
actual GET is surely a good idea. Archiva could decide which repo to send
the GET to based on the shortest response time.
Anyway, this feature needs more brainstorming...
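
A rough sketch of that HEAD-probing variant - probe all repos concurrently
and take whichever answers 200 first, which is effectively the repo with
the shortest response time (again illustrative names, not actual Archiva
code):

import java.io.FileNotFoundException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class HeadProbe {

    // Send a HEAD to every repo concurrently and return the base URL of
    // the first one to report the artifact exists - effectively the repo
    // with the shortest response time. The GET then goes only there.
    static String fastestRepoWithArtifact(List<String> repoBaseUrls, String path)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(repoBaseUrls.size());
        CompletionService<String> probes = new ExecutorCompletionService<>(pool);
        try {
            for (String base : repoBaseUrls) {
                probes.submit(() -> {
                    HttpURLConnection conn =
                        (HttpURLConnection) new URL(base + path).openConnection();
                    conn.setRequestMethod("HEAD");
                    conn.setConnectTimeout(5_000);
                    conn.setReadTimeout(5_000);
                    if (conn.getResponseCode() != 200) {
                        throw new FileNotFoundException(base + path);
                    }
                    return base;
                });
            }
            // Completions arrive in finish order; a failure just means
            // "not in that repo (or repo down)", so skip and keep waiting.
            for (int i = 0; i < repoBaseUrls.size(); i++) {
                try {
                    return probes.take().get();
                } catch (ExecutionException missOrError) {
                    // try the next repo that finishes
                }
            }
            return null; // not found in any repo
        } finally {
            pool.shutdownNow();
        }
    }
}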






Re: Proposal: concurrent remote-requests

Posted by Brett Porter <br...@apache.org>.
On 15/10/2009, at 12:06 AM, Marc Lustig wrote:

>
> Hi all,
>
> we have configured about 25 remote-repos for our public-artifacts
> managed repo.
> In certain cases, blacklists and whitelists don't help and a request
> is proxied to all 25 remote-repos _sequentially_. Even though we have
> configured a short timeout of 5 secs, this takes 125 secs in case the
> artifact doesn't exist in any remote-repo - per artifact!
>
> So I was wondering if it would make sense to send requests to all of
> the remote-repos _concurrently_.
> The first thread that finds the artifact could cause all the other
> threads to cancel their http-requests.
> The total request time would drop from over 100 secs to merely 5 secs.
> A tremendous win, no?
>
> Has this been discussed before?

I think this is a pretty unusual case. I don't quite understand why
you are hitting the timeout limit on the remote repos - if they are up
they should be fast. Also, "first that finds" is different from the
current rule, which is "first that appears in the list". I worry that
in this setup you're not entirely sure which repository the artifacts
are meant to be coming from, so maybe it points to another problem.

> Is there an argument against this strategy?

Yes - particularly if we turned on streaming of the proxied download to
the client (which is intended): we couldn't do that if requests were
pooled like this, unless we accepted the "first found" rule.

That said, this might speed up requests with a long list of proxies,  
even if they are functioning properly. So it might be reasonable as an  
optional capability. One thing to consider would be doing a HEAD  
request instead of a GET for all the remotes first to select where to  
download from, then execute the GET from the desired one.

- Brett