Posted to user@spark.apache.org by Tomer Benyamini <to...@gmail.com> on 2014/11/26 09:06:29 UTC

S3NativeFileSystem inefficient implementation when calling sc.textFile

Hello,

I'm building a Spark app that needs to read a large number of log files from
S3. I do this in the code by constructing the file list and passing it to the
context as follows:

val myRDD = sc.textFile("s3n://mybucket/file1, s3n://mybucket/file2, ... ,
s3n://mybucket/fileN")
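
(As a rough sketch, the list construction looks something like this - the
file names and count below are just placeholders:)

// Hypothetical example of building the comma-separated path list
val files = (1 to 1000).map(i => s"s3n://mybucket/file$i")
val myRDD = sc.textFile(files.mkString(","))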

When running it locally there are no issues, but when running it on the
yarn-cluster (Spark 1.1.0, Hadoop 2.4) I see an inefficient, serial piece of
code running that could probably be parallelized easily:


[main] INFO  com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus
s3n://mybucket/file1

[main] INFO  com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus
s3n://mybucket/file2

[main] INFO  com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus
s3n://mybucket/file3

....

[main] INFO  com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus
s3n://mybucket/fileN


I believe there are some differences between my local classpath and the
cluster's classpath - locally I see that
*org.apache.hadoop.fs.s3native.NativeS3FileSystem* is being used, whereas
on the cluster *com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem* is
being used. Any suggestions?


Thanks,

Tomer

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

Posted by Aaron Davidson <il...@gmail.com>.
Note that it does not appear that s3a solves the original problems in this
thread, which are either on the Spark side or due to the fact that metadata
listing in S3 is slow simply because it goes over the network.

On Sun, Nov 30, 2014 at 10:07 AM, David Blewett <da...@dawninglight.net>
wrote:

> You might be interested in the new s3a filesystem in Hadoop 2.6.0 [1].
>
> 1.
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-10400
> On Nov 26, 2014 12:24 PM, "Aaron Davidson" <il...@gmail.com> wrote:
>
>> Spark has a known problem where it will do a pass of metadata on a large
>> number of small files serially, in order to find the partition information
>> prior to starting the job. This will probably not be repaired by switching
>> the FS impl.
>>
>> However, you can change the FS being used like so (prior to the first
>> usage):
>> sc.hadoopConfiguration.set("fs.s3n.impl",
>> "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>>
>> On Wed, Nov 26, 2014 at 1:47 AM, Tomer Benyamini <to...@gmail.com>
>> wrote:
>>
>>> Thanks Lalit; Setting the access + secret keys in the configuration
>>> works even when calling sc.textFile. Is there a way to select which hadoop
>>> s3 native filesystem implementation would be used at runtime using the
>>> hadoop configuration?
>>>
>>> Thanks,
>>> Tomer
>>>
>>> On Wed, Nov 26, 2014 at 11:08 AM, lalit1303 <la...@sigmoidanalytics.com>
>>> wrote:
>>>
>>>>
>>>> you can try creating hadoop Configuration and set s3 configuration i.e.
>>>> access keys etc.
>>>> Now, for reading files from s3 use newAPIHadoopFile and pass the config
>>>> object here along with key, value classes.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -----
>>>> Lalit Yadav
>>>> lalit@sigmoidanalytics.com
>>>
>>

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

Posted by David Blewett <da...@dawninglight.net>.
You might be interested in the new s3a filesystem in Hadoop 2.6.0 [1].

1. https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-10400
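
As a rough sketch (assuming the Hadoop 2.6.0 s3a classes from hadoop-aws are
on the classpath; the bucket and keys below are placeholders), using it would
look something like:

// Hypothetical s3a setup; property names are the Hadoop 2.6.0 s3a keys
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
val myRDD = sc.textFile("s3a://mybucket/file1,s3a://mybucket/file2")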
On Nov 26, 2014 12:24 PM, "Aaron Davidson" <il...@gmail.com> wrote:

> Spark has a known problem where it will do a pass of metadata on a large
> number of small files serially, in order to find the partition information
> prior to starting the job. This will probably not be repaired by switching
> the FS impl.
>
> However, you can change the FS being used like so (prior to the first
> usage):
> sc.hadoopConfiguration.set("fs.s3n.impl",
> "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>
> On Wed, Nov 26, 2014 at 1:47 AM, Tomer Benyamini <to...@gmail.com>
> wrote:
>
>> Thanks Lalit; Setting the access + secret keys in the configuration works
>> even when calling sc.textFile. Is there a way to select which hadoop s3
>> native filesystem implementation would be used at runtime using the hadoop
>> configuration?
>>
>> Thanks,
>> Tomer
>>
>> On Wed, Nov 26, 2014 at 11:08 AM, lalit1303 <la...@sigmoidanalytics.com>
>> wrote:
>>
>>>
>>> you can try creating hadoop Configuration and set s3 configuration i.e.
>>> access keys etc.
>>> Now, for reading files from s3 use newAPIHadoopFile and pass the config
>>> object here along with key, value classes.
>>>
>>>
>>>
>>>
>>>
>>> -----
>>> Lalit Yadav
>>> lalit@sigmoidanalytics.com
>>
>

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

Posted by Tomer Benyamini <to...@gmail.com>.
Thanks - this is very helpful!

On Thu, Nov 27, 2014 at 5:20 AM, Michael Armbrust <mi...@databricks.com>
wrote:

> In the past I have worked around this problem by avoiding sc.textFile().
> Instead I read the data directly inside of a Spark job.  Basically, you
> start with an RDD where each entry is a file in S3 and then flatMap that
> with something that reads the files and returns the lines.
>
> Here's an example: https://gist.github.com/marmbrus/fff0b058f134fa7752fe
>
> Using this class you can do something like:
>
> sc.parallelize("s3n://mybucket/file1" :: "s3n://mybucket/file1" ... ::
> Nil).flatMap(new ReadLinesSafe(_))
>
> You can also build up the list of files by running a Spark job:
> https://gist.github.com/marmbrus/15e72f7bc22337cf6653
>
> Michael
>
> On Wed, Nov 26, 2014 at 9:23 AM, Aaron Davidson <il...@gmail.com>
> wrote:
>
>> Spark has a known problem where it will do a pass of metadata on a large
>> number of small files serially, in order to find the partition information
>> prior to starting the job. This will probably not be repaired by switching
>> the FS impl.
>>
>> However, you can change the FS being used like so (prior to the first
>> usage):
>> sc.hadoopConfiguration.set("fs.s3n.impl",
>> "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>>
>> On Wed, Nov 26, 2014 at 1:47 AM, Tomer Benyamini <to...@gmail.com>
>> wrote:
>>
>>> Thanks Lalit; Setting the access + secret keys in the configuration
>>> works even when calling sc.textFile. Is there a way to select which hadoop
>>> s3 native filesystem implementation would be used at runtime using the
>>> hadoop configuration?
>>>
>>> Thanks,
>>> Tomer
>>>
>>> On Wed, Nov 26, 2014 at 11:08 AM, lalit1303 <la...@sigmoidanalytics.com>
>>> wrote:
>>>
>>>>
>>>> you can try creating hadoop Configuration and set s3 configuration i.e.
>>>> access keys etc.
>>>> Now, for reading files from s3 use newAPIHadoopFile and pass the config
>>>> object here along with key, value classes.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -----
>>>> Lalit Yadav
>>>> lalit@sigmoidanalytics.com
>>>
>>
>

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

Posted by Michael Armbrust <mi...@databricks.com>.
In the past I have worked around this problem by avoiding sc.textFile().
Instead, I read the data directly inside a Spark job. Basically, you
start with an RDD where each entry is a file in S3 and then flatMap that
with something that reads the files and returns the lines.

Here's an example: https://gist.github.com/marmbrus/fff0b058f134fa7752fe

Using this class you can do something like:

sc.parallelize("s3n://mybucket/file1" :: "s3n://mybucket/file1" ... ::
Nil).flatMap(new ReadLinesSafe(_))
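
For illustration, a minimal sketch of such a reader (a hypothetical stand-in
written as a plain function; the actual ReadLinesSafe class in the gist may
differ):

import java.net.URI
import scala.io.Source
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: opens one S3 path via the Hadoop FileSystem API and
// returns its lines; returning an empty Seq on failure is what makes it
// "safe". The S3 credentials must be available to the Hadoop Configuration
// on the executors (e.g. via core-site.xml).
def readLinesSafe(path: String): Seq[String] = {
  try {
    val fs = FileSystem.get(new URI(path), new Configuration())
    val in = fs.open(new Path(path))
    try Source.fromInputStream(in).getLines().toVector
    finally in.close()
  } catch {
    case _: Exception => Seq.empty
  }
}

// Usage, mirroring the snippet above:
// sc.parallelize(paths).flatMap(readLinesSafe)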

You can also build up the list of files by running a Spark job:
https://gist.github.com/marmbrus/15e72f7bc22337cf6653

Michael

On Wed, Nov 26, 2014 at 9:23 AM, Aaron Davidson <il...@gmail.com> wrote:

> Spark has a known problem where it will do a pass of metadata on a large
> number of small files serially, in order to find the partition information
> prior to starting the job. This will probably not be repaired by switching
> the FS impl.
>
> However, you can change the FS being used like so (prior to the first
> usage):
> sc.hadoopConfiguration.set("fs.s3n.impl",
> "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>
> On Wed, Nov 26, 2014 at 1:47 AM, Tomer Benyamini <to...@gmail.com>
> wrote:
>
>> Thanks Lalit; Setting the access + secret keys in the configuration works
>> even when calling sc.textFile. Is there a way to select which hadoop s3
>> native filesystem implementation would be used at runtime using the hadoop
>> configuration?
>>
>> Thanks,
>> Tomer
>>
>> On Wed, Nov 26, 2014 at 11:08 AM, lalit1303 <la...@sigmoidanalytics.com>
>> wrote:
>>
>>>
>>> you can try creating hadoop Configuration and set s3 configuration i.e.
>>> access keys etc.
>>> Now, for reading files from s3 use newAPIHadoopFile and pass the config
>>> object here along with key, value classes.
>>>
>>>
>>>
>>>
>>>
>>> -----
>>> Lalit Yadav
>>> lalit@sigmoidanalytics.com
>>
>

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

Posted by Aaron Davidson <il...@gmail.com>.
Spark has a known problem where it will do a pass of metadata on a large
number of small files serially, in order to find the partition information
prior to starting the job. This will probably not be repaired by switching
the FS impl.

However, you can change the FS being used like so (prior to the first
usage):
sc.hadoopConfiguration.set("fs.s3n.impl",
"org.apache.hadoop.fs.s3native.NativeS3FileSystem")

On Wed, Nov 26, 2014 at 1:47 AM, Tomer Benyamini <to...@gmail.com>
wrote:

> Thanks Lalit; Setting the access + secret keys in the configuration works
> even when calling sc.textFile. Is there a way to select which hadoop s3
> native filesystem implementation would be used at runtime using the hadoop
> configuration?
>
> Thanks,
> Tomer
>
> On Wed, Nov 26, 2014 at 11:08 AM, lalit1303 <la...@sigmoidanalytics.com>
> wrote:
>
>>
>> you can try creating hadoop Configuration and set s3 configuration i.e.
>> access keys etc.
>> Now, for reading files from s3 use newAPIHadoopFile and pass the config
>> object here along with key, value classes.
>>
>>
>>
>>
>>
>> -----
>> Lalit Yadav
>> lalit@sigmoidanalytics.com
>

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

Posted by Tomer Benyamini <to...@gmail.com>.
Thanks Lalit; setting the access + secret keys in the configuration works
even when calling sc.textFile. Is there a way to select which Hadoop S3
native filesystem implementation will be used at runtime via the Hadoop
configuration?

Thanks,
Tomer

On Wed, Nov 26, 2014 at 11:08 AM, lalit1303 <la...@sigmoidanalytics.com>
wrote:

>
> you can try creating hadoop Configuration and set s3 configuration i.e.
> access keys etc.
> Now, for reading files from s3 use newAPIHadoopFile and pass the config
> object here along with key, value classes.
>
>
>
>
>
> -----
> Lalit Yadav
> lalit@sigmoidanalytics.com

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

Posted by lalit1303 <la...@sigmoidanalytics.com>.
You can try creating a Hadoop Configuration and setting the S3 configuration,
i.e. access keys etc. Then, for reading files from S3, use newAPIHadoopFile
and pass the config object along with the key and value classes.
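
A minimal sketch of that approach (the bucket and keys are placeholders;
LongWritable/Text with TextInputFormat is the usual pairing for
line-oriented files):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Hadoop configuration carrying the S3 credentials (placeholders).
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// Read via the new Hadoop API, passing the config and key/value classes.
val lines = sc.newAPIHadoopFile(
    "s3n://mybucket/file1",
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    conf)
  .map(_._2.toString)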





-----
Lalit Yadav
lalit@sigmoidanalytics.com
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/S3NativeFileSystem-inefficient-implementation-when-calling-sc-textFile-tp19841p19845.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

Posted by Peng Cheng <pc...@uow.edu.au>.
I stumbled upon this thread and conjecture that this may affect restoring a
checkpointed RDD as well:

http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-checkpointed-RDD-in-Apache-Spark-has-long-gt-10-hour-between-stage-latency-td22925.html#a22928

In my case I have 1600+ fragmented checkpoint files, and the time to read all
the metadata takes a staggering 11 hours.

If this is really the cause then it's an obvious handicap, as a checkpointed
RDD already has all of its file partition information available and doesn't
need to read it from S3 into the driver again (which causes a single point of
failure and a bottleneck).



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/S3NativeFileSystem-inefficient-implementation-when-calling-sc-textFile-tp19841p22984.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org