Posted to dev@spark.apache.org by Nicholas Chammas <ni...@gmail.com> on 2015/12/24 06:59:35 UTC

Re: Downloading Hadoop from s3://spark-related-packages/

FYI: I opened an INFRA ticket with questions about how best to use the
Apache mirror network.

https://issues.apache.org/jira/browse/INFRA-10999

Nick

On Mon, Nov 2, 2015 at 8:00 AM Luciano Resende <lu...@gmail.com> wrote:

> I am getting the same results using closer.lua versus closer.cgi: both
> seem to download a page where the user can choose the closest mirror. I
> tried adding parameters to follow the redirect, without much success.
> There seems to be already a jira for a similar request with infra:
> https://issues.apache.org/jira/browse/INFRA-10240.
>
> A workaround is to use a url pointing to the mirror directly.
>
> curl -O -L
> http://ftp.unicamp.br/pub/apache/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
>
> I second the point about the lack of documentation on what is available
> with these scripts; I'll see if I can find the source and look for other
> options.
>
>
> On Sun, Nov 1, 2015 at 8:40 PM, Shivaram Venkataraman <
> shivaram@eecs.berkeley.edu> wrote:
>
>> I think the lua one at
>>
>> https://svn.apache.org/repos/asf/infrastructure/site/trunk/content/dyn/closer.lua
>> has replaced the cgi one from before. It also looks like the lua one
>> supports `action=download` with a filename argument. So you could
>> just do something like
>>
>> wget "http://www.apache.org/dyn/closer.lua?filename=hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz&action=download"
>>
>> Thanks
>> Shivaram
>>
>> On Sun, Nov 1, 2015 at 3:18 PM, Nicholas Chammas
>> <ni...@gmail.com> wrote:
>> > Oh, sweet! For example:
>> >
>> >
>> > http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz?asjson=1
>> >
>> > Thanks for sharing that tip. Looks like you can also use as_json (vs.
>> > asjson).
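>> >
>> > A minimal sketch that chains these two tips together, resolving the
>> > preferred mirror and then downloading through it (assuming python is
>> > on the PATH, and that the response’s path_info field carries the
>> > requested path, as it appears to):
>> >
>> > url='http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz?asjson=1'
>> > mirror=$(curl -sL "$url" | python -c "import json,sys; d=json.load(sys.stdin); print(d['preferred'] + d['path_info'])")
>> > curl -O -L "$mirror"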
>> >
>> > Nick
>> >
>> >
>> > On Sun, Nov 1, 2015 at 5:32 PM Shivaram Venkataraman
>> > <sh...@eecs.berkeley.edu> wrote:
>> >>
>> >> On Sun, Nov 1, 2015 at 2:16 PM, Nicholas Chammas
>> >> <ni...@gmail.com> wrote:
>> >> > OK, I’ll focus on the Apache mirrors going forward.
>> >> >
>> >> > The problem with the Apache mirrors, if I am not mistaken, is that
>> >> > you cannot use a single URL that automatically redirects you to a
>> >> > working mirror to download Hadoop. You have to pick a specific
>> >> > mirror and pray it doesn’t disappear tomorrow.
>> >> >
>> >> > They don’t go away, especially http://mirror.ox.ac.uk, and in the
>> >> > US there is apache.osuosl.org, OSU being where a lot of the ASF
>> >> > servers are kept.
>> >> >
>> >> > So does Apache offer no way to query a URL and automatically get
>> >> > the closest working mirror? If I’m installing HDFS onto servers in
>> >> > various EC2 regions, the best mirror will vary depending on my
>> >> > location.
>> >> >
>> >> Not sure if this is officially documented somewhere, but if you pass
>> >> '&asjson=1' you will get back a JSON response which has a 'preferred'
>> >> field set to the closest mirror.
>> >>
>> >> Shivaram
>> >> > Nick
>> >> >
>> >> >
>> >> > On Sun, Nov 1, 2015 at 12:25 PM Shivaram Venkataraman
>> >> > <sh...@eecs.berkeley.edu> wrote:
>> >> >>
>> >> >> I think that getting them from the ASF mirrors is a better
>> >> >> strategy in general, as it'll remove the overhead of keeping the
>> >> >> S3 bucket up to date. It works in the spark-ec2 case because we
>> >> >> only support a limited number of Hadoop versions from the tool.
>> >> >> FWIW I don't have write access to the bucket and also haven't
>> >> >> heard of any plans to support newer versions in spark-ec2.
>> >> >>
>> >> >> Thanks
>> >> >> Shivaram
>> >> >>
>> >> >> On Sun, Nov 1, 2015 at 2:30 AM, Steve Loughran
>> >> >> <stevel@hortonworks.com> wrote:
>> >> >> >
>> >> >> > On 1 Nov 2015, at 03:17, Nicholas Chammas
>> >> >> > <ni...@gmail.com>
>> >> >> > wrote:
>> >> >> >
>> >> >> > https://s3.amazonaws.com/spark-related-packages/
>> >> >> >
>> >> >> > spark-ec2 uses this bucket to download and install HDFS on
>> >> >> > clusters. Is it owned by the Spark project or by the AMPLab?
>> >> >> >
>> >> >> > Anyway, it looks like the latest Hadoop install available on
>> >> >> > there is Hadoop 2.4.0.
>> >> >> >
>> >> >> > Are there plans to add newer versions of Hadoop for use by
>> >> >> > spark-ec2 and similar tools, or should we just be getting that
>> >> >> > stuff via an Apache mirror? The latest version is 2.7.1, by the
>> >> >> > way.
>> >> >> >
>> >> >> >
>> >> >> > you should be grabbing the artifacts off the ASF and then
>> >> >> > verifying their SHA1 checksums as published on the ASF HTTPS
>> >> >> > web site
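>> >> >> >
>> >> >> > for example, a minimal sketch (assuming, as Hadoop appears to
>> >> >> > do, that a .mds digest file sits next to the tarball on the
>> >> >> > non-mirrored archive):
>> >> >> >
>> >> >> > sha1sum hadoop-2.7.1.tar.gz
>> >> >> > # compare against the SHA1 recorded in
>> >> >> > # https://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz.mds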
>> >> >> >
>> >> >> >
>> >> >> > The problem with the Apache mirrors, if I am not mistaken, is
>> >> >> > that you cannot use a single URL that automatically redirects
>> >> >> > you to a working mirror to download Hadoop. You have to pick a
>> >> >> > specific mirror and pray it doesn't disappear tomorrow.
>> >> >> >
>> >> >> >
>> >> >> > They don't go away, especially http://mirror.ox.ac.uk, and in
>> >> >> > the US there is apache.osuosl.org, OSU being where a lot of the
>> >> >> > ASF servers are kept.
>> >> >> >
>> >> >> > full list with availability stats
>> >> >> >
>> >> >> > http://www.apache.org/mirrors/
>> >> >> >
>> >> >> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>

Re: Downloading Hadoop from s3://spark-related-packages/

Posted by Nicholas Chammas <ni...@gmail.com>.
not that likely to get an answer as it’s really a support call, not a
bug/task.

The first question is about proper documentation of all the stuff we’ve
been discussing in this thread, so one would think that’s a valid task. It
doesn’t seem right that closer.lua, for example, is undocumented. Either
it’s not meant for public use (and I am not an intended user), or there
should be something out there that explains how to use it.

I’m not looking for much; just some basic info that covers the various
things I’ve had to piece together from mailing lists and Google.

there’s no mirroring; if you install to lots of machines your download time
will be slow. You could automate it though: do something like D/L, upload
to your own bucket, do an s3 GET.

Yeah, this is what I’m probably going to do eventually—just use my own S3
bucket.
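
Roughly, something like this (a sketch; aws here is the AWS CLI, which
would need to be configured, and s3://my-bucket is a placeholder for a
bucket I own):

curl -O -L http://apache.osuosl.org/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
aws s3 cp hadoop-2.7.1.tar.gz s3://my-bucket/hadoop/hadoop-2.7.1.tar.gz
# then, from each machine being provisioned:
aws s3 cp s3://my-bucket/hadoop/hadoop-2.7.1.tar.gz .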

It’s disappointing that, at least as far as I can tell, the Apache
Software Foundation doesn’t have a fast CDN or something like that to
serve its files. So users like me are left needing to come up with their
own solution
if they regularly download Apache software to many machines in an automated
fashion.

Now, perhaps Apache mirrors are not meant to be used in this way. Perhaps
they’re just meant for people to do the one-off download to their personal
machines and that’s it. That’s totally fine! But that goes back to my first
question from the ticket—there should be a simple doc that spells this out
for us if that’s the case: “Don’t use the mirror network for automated
provisioning/deployments.” That would suffice. But as things stand now, I
have to guess and wonder at this stuff.

Nick

On Thu, Dec 24, 2015 at 5:43 AM Steve Loughran <st...@hortonworks.com>
wrote:

>
> On 24 Dec 2015, at 05:59, Nicholas Chammas <ni...@gmail.com>
> wrote:
>
> FYI: I opened an INFRA ticket with questions about how best to use the
> Apache mirror network.
>
> https://issues.apache.org/jira/browse/INFRA-10999
>
> Nick
>
>
>
> not that likely to get an answer as it's really a support call, not a
> bug/task. You never know though.
>
> There's another way to get at binaries, which is to check them out
> directly from SVN:
>
> https://dist.apache.org/repos/dist/release/
>
> This is a direct view into how you release things in the ASF (you just
> create a new dir under your project, copy the files, and then do an svn
> commit). I believe the replicated servers may just do an svn update on
> their local cache.
>
> there's no mirroring; if you install to lots of machines your download
> time will be slow. You could automate it though: do something like D/L,
> upload to your own bucket, do an s3 GET.
>

Re: Downloading Hadoop from s3://spark-related-packages/

Posted by Steve Loughran <st...@hortonworks.com>.
On 24 Dec 2015, at 05:59, Nicholas Chammas <ni...@gmail.com> wrote:

FYI: I opened an INFRA ticket with questions about how best to use the Apache mirror network.

https://issues.apache.org/jira/browse/INFRA-10999

Nick


not that likely to get an answer as it's really a support call, not a bug/task. You never know though.

There's another way to get at binaries, which is to check them out directly from SVN:

https://dist.apache.org/repos/dist/release/

This is a direct view into how you release things in the ASF (you just create a new dir under your project, copy the files, and then do an svn commit). I believe the replicated servers may just do an svn update on their local cache.
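
For instance, a rough sketch (assuming Hadoop keeps its current releases under hadoop/common/ in the dist tree, matching the mirror layout; dist/release only carries current releases, so older ones live on archive.apache.org):

svn export https://dist.apache.org/repos/dist/release/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz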

there's no mirroring; if you install to lots of machines your download time will be slow. You could automate it though: do something like D/L, upload to your own bucket, do an s3 GET.