Posted to user@spark.apache.org by kamatsuoka <ke...@gmail.com> on 2014/05/07 02:19:07 UTC

How to read a multipart s3 file?

I have a Spark app that writes out a file, s3://mybucket/mydir/myfile.txt.

Behind the scenes, the S3 driver creates a bunch of files like
s3://mybucket//mydir/myfile.txt/part-0000, as well as the block files like
s3://mybucket/block_3574186879395643429.

How do I construct a URL to use this file as input to another Spark app?  I
tried all the variations of s3://mybucket/mydir/myfile.txt, but none of them
work.






Re: How to read a multipart s3 file?

Posted by kamatsuoka <ke...@gmail.com>.
Thanks Nicholas!  I looked at those docs several times without noticing that
critical part you highlighted.




Re: How to read a multipart s3 file?

Posted by Nicholas Chammas <ni...@gmail.com>.
Amazon also strongly discourages the use of s3:// because the block file
system it maps to is deprecated.

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html

Note
> The configuration of Hadoop running on Amazon EMR differs from the default
> configuration provided by Apache Hadoop. On Amazon EMR, s3n:// and s3://
> both map to the Amazon S3 native file system, *while in the default
> configuration provided by Apache Hadoop s3:// is mapped to the Amazon S3
> block storage system.*


> Amazon S3 block is a deprecated file system that is not recommended because
> it can trigger a race condition that might cause your cluster to fail. It
> may be required by legacy applications.
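
A minimal sketch of checking this scheme-to-implementation mapping in a stock
Apache Hadoop setup (the class names are the Hadoop defaults of this era,
shown for illustration only):

// In stock Apache Hadoop, unlike EMR, the two schemes resolve to different FileSystems.
println(sc.hadoopConfiguration.get("fs.s3.impl"))   // org.apache.hadoop.fs.s3.S3FileSystem (block-based)
println(sc.hadoopConfiguration.get("fs.s3n.impl"))  // org.apache.hadoop.fs.s3native.NativeS3FileSystem (native)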




On Tue, May 6, 2014 at 8:23 PM, Matei Zaharia <ma...@gmail.com> wrote:

> There’s a difference between s3:// and s3n:// in the Hadoop S3 access
> layer. Make sure you use the right one when reading stuff back. [...]

Re: How to read a multipart s3 file?

Posted by kamatsuoka <ke...@gmail.com>.
Whereas with s3://, the write takes 32 seconds and the rename takes 33
seconds:

14/05/06 20:23:08 INFO DAGScheduler: Stage 0 (saveAsTextFile at
FileCopy.scala:17) finished in 32.208 s
14/05/06 20:23:08 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
have all completed, from pool
14/05/06 20:23:08 INFO SparkContext: Job finished: saveAsTextFile at
FileCopy.scala:17, took 32.430051749 s
14/05/06 20:23:41 INFO SparkDeploySchedulerBackend: Shutting down all
executors





Re: How to read a multipart s3 file?

Posted by kamatsuoka <ke...@gmail.com>.
For example, this app just reads a 4GB file and writes a copy of it.  It
takes 41 seconds to write the file, then 3 more minutes to move all the
temporary files.

I guess this is an issue with the Hadoop / JetS3t code layer, not Spark.

14/05/06 20:11:41 INFO TaskSetManager: Finished TID 63 in 8688 ms on
ip-10-143-138-33.ec2.internal (progress: 63/63)
14/05/06 20:11:41 INFO DAGScheduler: Stage 0 (saveAsTextFile at
FileCopy.scala:17) finished in 41.326 s
14/05/06 20:11:41 INFO SparkContext: Job finished: saveAsTextFile at
FileCopy.scala:17, took 41.605480454 s
14/05/06 20:14:48 INFO NativeS3FileSystem: OutputStream for key
'dad-20140101-9M.copy/_SUCCESS' writing to tempfile
'/tmp/hadoop-root/s3/output-1223846975509014265.tmp'
14/05/06 20:14:48 INFO NativeS3FileSystem: OutputStream for key
'dad-20140101-9M.copy/_SUCCESS' closed. Now beginning upload
14/05/06 20:14:48 INFO NativeS3FileSystem: OutputStream for key
'dad-20140101-9M.copy/_SUCCESS' upload complete
14/05/06 20:14:48 INFO SparkDeploySchedulerBackend: Shutting down all
executors




Re: How to read a multipart s3 file?

Posted by Nicholas Chammas <ni...@gmail.com>.
On Wed, May 7, 2014 at 4:44 PM, Aaron Davidson <il...@gmail.com> wrote:

> Spark can only run as many tasks as there are partitions, so if you don't
> have enough partitions, your cluster will be underutilized.

This is a very important point.

kamatsuoka, how many partitions does your RDD have when you try to save it?
You can check this with myrdd._jrdd.splits().size() in PySpark. If it’s
less than the number of cores in your cluster, try repartition()-ing the
RDD as Aaron suggested.

Nick
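
The same check in Scala, as a minimal sketch (rdd stands for whatever RDD you
are about to save):

// Compare the RDD's partition count with the cluster's default parallelism.
val numParts = rdd.partitions.size
println(s"partitions = $numParts, default parallelism = ${sc.defaultParallelism}")
// If numParts is well below the number of cores, consider rdd.repartition(...).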

Re: How to read a multipart s3 file?

Posted by Aaron Davidson <il...@gmail.com>.
One way to ensure Spark writes more partitions is by using
RDD#repartition() to make each partition smaller. One Spark partition
always corresponds to one file in the underlying store, and it's usually a
good idea to have each partition size somewhere between 64 MB and 256 MB.
Having too few partitions leads to other problems, such as too little
concurrency -- Spark can only run as many tasks as there are partitions, so
if you don't have enough partitions, your cluster will be underutilized.
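
As a rough sketch of that sizing guideline (the 4 GB figure, the path, and
rdd are illustrative assumptions, not values from this thread):

// Aim for ~128 MB per partition, inside the suggested 64-256 MB range.
val totalBytes  = 4L * 1024 * 1024 * 1024        // e.g. a 4 GB dataset
val targetBytes = 128L * 1024 * 1024             // desired bytes per partition
val numParts    = math.max(1, (totalBytes / targetBytes).toInt)
rdd.repartition(numParts).saveAsTextFile("s3n://mybucket/out")  // one file per partition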


On Tue, May 6, 2014 at 7:07 PM, kamatsuoka <ke...@gmail.com> wrote:

> Yes, I'm using s3:// for both. I was using s3n:// but I got frustrated by
> how slow it is at writing files. [...]

Re: How to read a multipart s3 file?

Posted by Nicholas Chammas <ni...@gmail.com>.
On Tue, May 6, 2014 at 10:07 PM, kamatsuoka <ke...@gmail.com> wrote:

> I was using s3n:// but I got frustrated by how
> slow it is at writing files.
>

I'm curious: How slow is slow? How long does it take you, for example, to
save a 1GB file to S3 using s3n vs s3?

Re: How to read a multipart s3 file?

Posted by kamatsuoka <ke...@gmail.com>.
Yes, I'm using s3:// for both. I was using s3n:// but I got frustrated by how
slow it is at writing files. In particular, the phase where it moves the
temporary files to their permanent location takes as long as writing the
file itself.  I can't believe anyone uses this.




Re: How to read a multipart s3 file?

Posted by sparkuser2345 <hm...@gmail.com>.
sparkuser2345 wrote
> I'm using Spark 1.0.0.

The same code works when:
- using Spark 0.9.1
- saving to and reading from the local file system (Spark 1.0.0)
- saving to and reading from HDFS (Spark 1.0.0)





Re: How to read a multipart s3 file?

Posted by sparkuser2345 <hm...@gmail.com>.
Matei Zaharia wrote
> If you use s3n:// for both, you should be able to pass the exact same file
> to load as you did to save. 

I'm trying to write a file to s3n in a Spark app and to read it in another
one using the same file name, but without luck. Writing data to s3n as

val data = Array(1.0, 1.0, 1.0)
sc.parallelize(data).saveAsTextFile("s3n://<access_key>:<secret_access_key>@<bucket-name>/test")

creates the following files: 

test/_SUCCESS
test/_temporary/0/task_201408071147_m_000000_$folder$
test/_temporary/0/task_201408071147_m_000000/part-00000
test/_temporary/0/task_201408071147_m_000001_$folder$
test/_temporary/0/task_201408071147_m_000001/part-00001

When trying to read the file as

val data2 = sc.textFile("s3n://<access_key>:<secret_access_key>@<bucket-name>/test")

data2 is an empty array:

scala> data2.collect
14/08/07 11:49:56 INFO mapred.FileInputFormat: Total input paths to process
: 0
14/08/07 11:49:56 INFO spark.SparkContext: Starting job: collect at
<console>:15
14/08/07 11:49:56 INFO spark.SparkContext: Job finished: collect at
<console>:15, took 3.7227E-5 s
res5: Array[String] = Array()

I'm using Spark 1.0.0. 





Re: How to read a multipart s3 file?

Posted by Matei Zaharia <ma...@gmail.com>.
There’s a difference between s3:// and s3n:// in the Hadoop S3 access layer. Make sure you use the right one when reading stuff back. In general s3n:// ought to be better because it will create things that look like files in other S3 tools. s3:// was present when the file size limit in S3 was much lower, and it uses S3 objects as blocks in a kind of overlay file system.

If you use s3n:// for both, you should be able to pass the exact same file to load as you did to save. Make sure you also set your AWS keys in the environment or in SparkContext.hadoopConfiguration.

Matei
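
A minimal sketch of this approach (the fs.s3n.* property names are the
standard Hadoop ones; the bucket, path, and key values are placeholders):

// Set AWS credentials on the SparkContext's Hadoop configuration.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// Use the same s3n:// path to save and to load.
sc.parallelize(1 to 100).map(_.toString).saveAsTextFile("s3n://mybucket/mydir/myfile.txt")
val loaded = sc.textFile("s3n://mybucket/mydir/myfile.txt")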


Re: How to read a multipart s3 file?

Posted by Sean Owen <so...@cloudera.com>.
That won't be it, since you can see from the directory listing that
there are no data files under test -- only "_" files and dirs. The
output looks like it was written, or at least partially written, but
didn't finish, in that the part-* files were never moved to the target
dir. I don't know why, but at least that is the nature of the final
problem.

On Thu, Aug 7, 2014 at 5:14 PM, Ashish Rangole <ar...@gmail.com> wrote:
> Specify a folder instead of a file name for input and output, as in:
>
> Output:
> s3n://your-bucket-name/your-data-folder
>
> Input: (when consuming the above output)
>
> s3n://your-bucket-name/your-data-folder/*


Re: How to read a multipart s3 file?

Posted by sparkuser2345 <hm...@gmail.com>.
Ashish Rangole wrote
> Specify a folder instead of a file name for input and output, as in:
> 
> Output:
> s3n://your-bucket-name/your-data-folder
> 
> Input: (when consuming the above output)
> 
> s3n://your-bucket-name/your-data-folder/*

Unfortunately no luck: 

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
Input Pattern s3n://<bucket-name>/test/* matches 0 files





Re: How to read a multipart s3 file?

Posted by Ashish Rangole <ar...@gmail.com>.
Specify a folder instead of a file name for input and output, as in:

Output:
s3n://your-bucket-name/your-data-folder

Input: (when consuming the above output)

s3n://your-bucket-name/your-data-folder/*
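
A minimal sketch of this pattern (bucket and folder names are placeholders,
and rdd is whatever RDD the first app produces):

// Write to a folder; Spark creates the part-* files underneath it.
rdd.saveAsTextFile("s3n://your-bucket-name/your-data-folder")

// Read the folder back in the consuming app, with or without the wildcard.
val in = sc.textFile("s3n://your-bucket-name/your-data-folder/*")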

Re: How to read a multipart s3 file?

Posted by paul <pa...@datalogix.com>.
darkjh wrote
> But in my experience, when reading directly from
> s3n, Spark creates only one input partition per file, regardless of the
> file size. This may lead to performance problems if you have big files.

This is actually not true: Spark uses the underlying Hadoop input formats to
read the files, so if the input format you are using supports splittable
files (text, Avro, etc.) then it can use multiple splits per file (leading
to multiple map tasks per file).  You do have to set the max input split
size, as an example:

FileInputFormat.setMaxInputSplitSize(job, 256000000L)

In this case any file larger than 256,000,000 bytes is split.  If you don't
explicitly set it, the limit is infinite, which leads to the behavior you
are seeing where it is one split per file.

Regards,
Paul Hamilton
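
A sketch of wiring that limit into a Spark read via the new Hadoop API (this
assumes Hadoop 2's Job.getInstance -- on Hadoop 1 use new Job(conf) -- and
the path is a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

// Cap input splits at ~256 MB so large files are read by multiple tasks.
val job = Job.getInstance(new Configuration())
FileInputFormat.setMaxInputSplitSize(job, 256000000L)

val lines = sc.newAPIHadoopFile(
  "s3n://mybucket/bigfile.txt",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
  job.getConfiguration
).map(_._2.toString)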





Re: How to read a multipart s3 file?

Posted by Nicholas Chammas <ni...@gmail.com>.
On Wed, May 7, 2014 at 4:00 AM, Han JU <ju...@gmail.com> wrote:

> But in my experience, when reading directly from s3n, Spark creates only
> one input partition per file, regardless of the file size. This may lead
> to performance problems if you have big files.

You can (and perhaps should) always repartition() the RDD explicitly to
increase your level of parallelism to match the number of cores in your
cluster. It’s pretty quick, and will speed up all subsequent operations.

Re: How to read a multipart s3 file?

Posted by Han JU <ju...@gmail.com>.
Just some complements to the other answers:

If you output to, say, `s3://bucket/myfile`, then you can use this path as
the input of other jobs (sc.textFile("s3://bucket/myfile")). By default all
`part-xxx` files will be used. There's also `sc.wholeTextFiles` that you
can play with.

If your file is small and needs to be interoperable with other tools and
languages, s3n may be a better choice. But in my experience, when reading
directly from s3n, Spark creates only one input partition per file,
regardless of the file size. This may lead to performance problems if you
have big files.
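
A small sketch of that round trip (paths are placeholders):

// Writing creates part-00000, part-00001, ... under the given prefix.
sc.parallelize(1 to 1000, 4).map(_.toString).saveAsTextFile("s3://bucket/myfile")

// Reading the same prefix picks up all the part-* files as one RDD.
val back = sc.textFile("s3://bucket/myfile")

// wholeTextFiles yields (filename, content) pairs instead of lines.
val pairs = sc.wholeTextFiles("s3://bucket/myfile")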


2014-05-07 2:39 GMT+02:00 Andre Kuhnen <an...@gmail.com>:

> Try using s3n instead of s3


-- 
*JU Han*

Data Engineer @ Botify.com

+33 0619608888

Re: How to read a multipart s3 file?

Posted by Andre Kuhnen <an...@gmail.com>.
Try using s3n instead of s3