Posted to user@spark.apache.org by Cheolsoo Park <pi...@gmail.com> on 2015/07/24 02:50:49 UTC

Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

Hi,

I am wondering if anyone has successfully enabled
"mapreduce.input.fileinputformat.list-status.num-threads" in Spark jobs. I
usually set this property to 25 to speed up file listing in MR jobs (Hive
and Pig). But for some reason, the property does not take effect in Spark's
HadoopRDD, which results in a serious delay in file listing.

I verified that the property is indeed set in HadoopRDD by logging the
value of the property in the getPartitions() function. I also attached
VisualVM to the Spark and Pig clients; the results are as follows:

In Pig, I can see 25 threads running in parallel for file listing:
[image: Inline image 1]

In Spark, I only see 2 threads running in parallel for file listing:
[image: Inline image 2]

What's strange is that the number of concurrent threads in Spark is
throttled no matter how high I set
"mapreduce.input.fileinputformat.list-status.num-threads".

Is anyone using Spark with this property enabled? If so, can you please
share how you do it?

Thanks!
Cheolsoo
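
For readers unfamiliar with what the property controls, the idea of listing
several input directories with a bounded pool of worker threads can be
sketched as below. This is only a conceptual illustration in plain Python and
is not Hadoop's or Spark's implementation; the helper names (list_files,
list_inputs, make_sample_tree) and the "part=N" directory layout are
hypothetical and exist only for this example.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def list_files(directory):
    """Return the sorted full paths of the regular files directly under `directory`."""
    return sorted(
        entry.path for entry in os.scandir(directory) if entry.is_file()
    )


def list_inputs(directories, num_threads=1):
    """List every input directory using up to `num_threads` worker threads.

    With num_threads=1 the directories are listed one after another. With a
    larger value the per-directory listings can overlap, which matters most
    when each listing call is slow (for example against a remote store).
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() returns results in the same order as `directories`.
        per_directory = list(pool.map(list_files, directories))
    return [path for paths in per_directory for path in paths]


def make_sample_tree(root, num_dirs=4, files_per_dir=3):
    """Create `num_dirs` partition-like directories under `root` and return them."""
    directories = []
    for d in range(num_dirs):
        directory = os.path.join(root, "part=%d" % d)
        os.makedirs(directory)
        for f in range(files_per_dir):
            with open(os.path.join(directory, "file-%d.txt" % f), "w") as handle:
                handle.write("x")
        directories.append(directory)
    return directories


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        dirs = make_sample_tree(root)
        files = list_inputs(dirs, num_threads=25)
        print(len(files), "files listed")
```

The listing result is the same for any thread count; only the wall-clock time
changes. If a framework caps the pool size somewhere else, raising the
configured value has no visible effect, which would be consistent with the
VisualVM observation above.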

Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

Posted by Alex Nastetsky <al...@vervemobile.com>.

Thanks. I was actually able to get
mapreduce.input.fileinputformat.list-status.num-threads working in Spark 1.5.2
against a regular fileset in S3, so it looks like the issue is isolated to Hive.

On Tue, Jan 12, 2016 at 6:48 PM, Cheolsoo Park <pi...@gmail.com> wrote:

> Alex, see this jira-
> https://issues.apache.org/jira/browse/SPARK-9926
>
> On Tue, Jan 12, 2016 at 10:55 AM, Alex Nastetsky <
> alex.nastetsky@vervemobile.com> wrote:
>
>> Ran into this need myself. Does Spark have an equivalent of
>> "mapreduce.input.fileinputformat.list-status.num-threads"?
>>
>> Thanks.
>>
>> On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park <pi...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am wondering if anyone has successfully enabled
>>> "mapreduce.input.fileinputformat.list-status.num-threads" in Spark jobs. I
>>> usually set this property to 25 to speed up file listing in MR jobs (Hive
>>> and Pig). But for some reason, this property does not take effect in Spark
>>> HadoopRDD resulting in serious delay in file listing.
>>>
>>> I verified that the property is indeed set in HadoopRDD by logging the
>>> value of the property in the getPartitions() function. I also tried to
>>> attach VisualVM to Spark and Pig clients, which look as follows-
>>>
>>> In Pig, I can see 25 threads running in parallel for file listing-
>>> [image: Inline image 1]
>>>
>>> In Spark, I only see 2 threads running in parallel for file listing-
>>> [image: Inline image 2]
>>>
>>> What's strange is that the # of concurrent threads in Spark is throttled
>>> no matter how high I
>>> set "mapreduce.input.fileinputformat.list-status.num-threads".
>>>
>>> Is anyone using Spark with this property enabled? If so, can you please
>>> share how you do it?
>>>
>>> Thanks!
>>> Cheolsoo
>>>
>>
>>
>

Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

Posted by Cheolsoo Park <pi...@gmail.com>.

Alex, see this JIRA:
https://issues.apache.org/jira/browse/SPARK-9926

On Tue, Jan 12, 2016 at 10:55 AM, Alex Nastetsky <
alex.nastetsky@vervemobile.com> wrote:

> Ran into this need myself. Does Spark have an equivalent of
> "mapreduce.input.fileinputformat.list-status.num-threads"?
>
> Thanks.
>
> On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park <pi...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am wondering if anyone has successfully enabled
>> "mapreduce.input.fileinputformat.list-status.num-threads" in Spark jobs. I
>> usually set this property to 25 to speed up file listing in MR jobs (Hive
>> and Pig). But for some reason, this property does not take effect in Spark
>> HadoopRDD resulting in serious delay in file listing.
>>
>> I verified that the property is indeed set in HadoopRDD by logging the
>> value of the property in the getPartitions() function. I also tried to
>> attach VisualVM to Spark and Pig clients, which look as follows-
>>
>> In Pig, I can see 25 threads running in parallel for file listing-
>> [image: Inline image 1]
>>
>> In Spark, I only see 2 threads running in parallel for file listing-
>> [image: Inline image 2]
>>
>> What's strange is that the # of concurrent threads in Spark is throttled
>> no matter how high I
>> set "mapreduce.input.fileinputformat.list-status.num-threads".
>>
>> Is anyone using Spark with this property enabled? If so, can you please
>> share how you do it?
>>
>> Thanks!
>> Cheolsoo
>>
>
>

Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

Posted by Alex Nastetsky <al...@vervemobile.com>.

Ran into this need myself. Does Spark have an equivalent of
"mapreduce.input.fileinputformat.list-status.num-threads"?

Thanks.

On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park <pi...@gmail.com> wrote:

> Hi,
>
> I am wondering if anyone has successfully enabled
> "mapreduce.input.fileinputformat.list-status.num-threads" in Spark jobs. I
> usually set this property to 25 to speed up file listing in MR jobs (Hive
> and Pig). But for some reason, this property does not take effect in Spark
> HadoopRDD resulting in serious delay in file listing.
>
> I verified that the property is indeed set in HadoopRDD by logging the
> value of the property in the getPartitions() function. I also tried to
> attach VisualVM to Spark and Pig clients, which look as follows-
>
> In Pig, I can see 25 threads running in parallel for file listing-
> [image: Inline image 1]
>
> In Spark, I only see 2 threads running in parallel for file listing-
> [image: Inline image 2]
>
> What's strange is that the # of concurrent threads in Spark is throttled
> no matter how high I
> set "mapreduce.input.fileinputformat.list-status.num-threads".
>
> Is anyone using Spark with this property enabled? If so, can you please
> share how you do it?
>
> Thanks!
> Cheolsoo
>
