Posted to user@spark.apache.org by Tom <th...@gmail.com> on 2014/07/15 23:10:15 UTC

Retrieve dataset of Big Data Benchmark

Hi,

I would like to use the dataset from the Big Data Benchmark
<https://amplab.cs.berkeley.edu/benchmark/> on my own cluster, to run some
tests comparing Hadoop and Spark. The dataset should be available at
s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix]
on Amazon S3. Is there a way I can download it without being an AWS user? I
tried

"bin/hadoop distcp s3n://123:456@big-data-benchmark/pavlo/text/tiny/* ./"

but it asks for an AWS Access Key ID and Secret Access Key, which I do not
have.
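If someone can confirm: I assume that with real credentials they could also
be supplied as Hadoop properties instead of being embedded in the URL, along
these lines (untested on my side, with placeholder keys):

bin/hadoop distcp \
  -Dfs.s3n.awsAccessKeyId=YOUR_KEY_ID \
  -Dfs.s3n.awsSecretAccessKey=YOUR_SECRET_KEY \
  s3n://big-data-benchmark/pavlo/text/tiny/* ./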

Thanks in advance,

Tom




Re: Retrieve dataset of Big Data Benchmark

Posted by Tom <th...@gmail.com>.
Hi,

I was able to download the dataset this way (and just reconfirmed it by
doing so again):

# In the shell, before starting Spark, export the credentials:
export AWS_ACCESS_KEY_ID=*key_id*
export AWS_SECRET_ACCESS_KEY=*access_key*
# Start the Spark shell:
./spark-shell

// In the Spark shell:
val dataset = sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")
dataset.saveAsTextFile("/home/tom/hadoop/bigDataBenchmark/test/crawl3.txt")

If you want to do this more often, or read the data directly from S3 instead
of from a local copy (which will be slower), you can add these keys to
./conf/spark-env.sh.
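A minimal sketch of what to add there (substitute your own keys):

export AWS_ACCESS_KEY_ID=*key_id*
export AWS_SECRET_ACCESS_KEY=*access_key*

As far as I can tell, Spark copies these environment variables into the
Hadoop configuration it uses for s3n:// paths, so no code changes are needed.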





Re: Retrieve dataset of Big Data Benchmark

Posted by Tom <th...@gmail.com>.
Hi Burak,

I tried running it through the Spark shell, but I still ended up with the
same error message as in Hadoop:
"java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key
must be specified as the username or password (respectively) of a s3n URL,
or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey
properties (respectively)."
I guess the files are publicly available, but only to registered AWS users,
so I caved in and registered for the service. Using the credentials I got, I
was able to download the files using the local spark-shell.
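For anyone hitting the same error: the properties it names can also be set
from within the spark-shell before reading, e.g. (a sketch, with placeholder
keys):

// Set the s3n credentials on the underlying Hadoop configuration
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
// s3n:// reads should now authenticate
val dataset = sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")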

Thanks!

Tom




Re: Retrieve dataset of Big Data Benchmark

Posted by Burak Yavuz <by...@stanford.edu>.
Hi Tom,

Actually I was mistaken, sorry about that. Indeed, the keys for the datasets
you mention are not showing up on the website. However, they are still
accessible through the spark-shell, which means that they are there.

So in order to answer your questions:
- Are the tiny and 1node sets still available? 

Yes, they are.

- Are the Uservisits and Rankings still available?

Yes, they are.

- Why is the crawl set bigger than expected, and how big is it?

It says on the website that it is ~30 GB per node. Since you're downloading the 5nodes version, the total size should be 150 GB.

Coming to other ways you can download them:

I'd say using the spark-shell is the easiest (at least it was for me :).

Once you start the spark-shell, you can access the files as follows (the
example is for the tiny crawl dataset; swap in 1node or 5nodes, and
uservisits or rankings, as desired. Mind the lowercase):

val dataset = sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")

dataset.saveAsTextFile("your/local/relative/path/here")

The file will be saved relative to where you run the spark-shell from.
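One thing to keep in mind: saveAsTextFile writes a directory of part-files
(part-00000, part-00001, ...) rather than a single file. To sanity-check the
result you can read it back and count the records, e.g.:

val check = sc.textFile("your/local/relative/path/here")
check.count()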

Hope this helps!
Burak




Re: Retrieve dataset of Big Data Benchmark

Posted by Tom <th...@gmail.com>.
Hi Burak,

Thank you for your pointer, it really helped. I do have some follow-up
questions though.

After looking at the Big Data Benchmark page
<https://amplab.cs.berkeley.edu/benchmark/> (section "Run this benchmark
yourself"), I was expecting the following combinations of files:
Sets: Uservisits, Rankings, Crawl
Sizes: tiny, 1node, 5nodes
each in both text and sequence-file format.

When looking at http://s3.amazonaws.com/big-data-benchmark/, I only see:
sequence-snappy/5nodes/_distcp_logs_44js2v part 0 to 103
sequence-snappy/5nodes/_distcp_logs_nclxhd part 0 to 102
sequence-snappy/5nodes/_distcp_logs_vnuhym part 0 to 24
sequence-snappy/5nodes/crawl part 0 to 743

As "Crawl" is the name of a set I am looking for, I started to download it.
Since it was the end of the day and I was going to download it overnight, I
just wrote a for loop from 0 to 999 with wget, expecting it to download
until 7-something and 404 errors for the others. When I looked at it this
morning, I noticed that it all completed downloading. The total Crawl set
for 5 nodes should be ~30Gb, I am currently at part 1020 with a total set of
40G. 
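For reference, the loop was along these lines (a sketch; the exact part-file
naming and zero-padding should be taken from the bucket listing):

for i in $(seq 0 999); do
  wget "http://s3.amazonaws.com/big-data-benchmark/sequence-snappy/5nodes/crawl/part-$i"
done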

This leads to my sub-questions. Does anybody know what exactly is still
hosted?
- Are the tiny and 1node sets still available? 
- Are the Uservisits and Rankings still available?
- Why is the crawl set bigger than expected, and how big is it?





Re: Retrieve dataset of Big Data Benchmark

Posted by Burak Yavuz <by...@stanford.edu>.
Hi Tom,

If you wish to load the files directly in Spark, you can use
sc.textFile("s3n://big-data-benchmark/pavlo/...") where sc is your
SparkContext. This should work because the files are meant to be publicly
available, so you don't need AWS credentials to access them.

If you want to download the files to your local drive, you can use links of
the form http://s3.amazonaws.com/big-data-benchmark/pavlo/...

One note though: the tiny dataset doesn't seem to exist anymore. You can
look at http://s3.amazonaws.com/big-data-benchmark/ to see the available
files; a Ctrl+F for "tiny" returned no matches.
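(One possible explanation, just a guess: S3 bucket listings are truncated at
1000 keys per page, so some files may simply not show up on the first page.
You can page through by appending a marker parameter, e.g.:

http://s3.amazonaws.com/big-data-benchmark/?marker=pavlo

which lists keys that sort after "pavlo".)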


Best,
Burak
