Posted to user@spark.apache.org by Eric Friedman <er...@gmail.com> on 2014/09/15 22:35:53 UTC

minPartitions for non-text files?

sc.textFile takes a minimum # of partitions to use.

Is there a way to get sc.newAPIHadoopFile to do the same?

I know I can repartition() and get a shuffle.  I'm wondering if there's a
way to tell the underlying InputFormat (AvroParquet, in my case) how many
partitions to use at the outset.

What I'd really prefer is to get the partitions automatically defined based
on the number of blocks.
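
For concreteness, a minimal sketch of the two options as I see them, run
from the PySpark shell (where sc is predefined). The paths and Parquet
class names are made up for illustration, and a real Avro/Parquet read may
need extra converters:

    # textFile accepts a partition hint directly:
    rdd = sc.textFile("hdfs:///data/events.txt", minPartitions=100)

    # newAPIHadoopFile has no equivalent argument, so the obvious
    # fallback is an explicit repartition, at the cost of a full shuffle:
    rdd = sc.newAPIHadoopFile(
        "hdfs:///data/events.parquet",
        "parquet.avro.AvroParquetInputFormat",   # illustrative class names
        "java.lang.Void",
        "org.apache.avro.generic.IndexedRecord",
    ).repartition(100)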

Re: minPartitions for non-text files?

Posted by Eric Friedman <er...@gmail.com>.
Yes, it's AvroParquetInputFormat, which is splittable. If I force a
repartitioning, it works. If I don't, Spark chokes on my not-terribly-large
250MB files.

PySpark's documentation says that the dictionary is turned into a
Configuration object.

@param conf: Hadoop configuration, passed in as a dict (None by default)
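
In case it helps to reproduce: the call looked roughly like the following
(path and class names are placeholders). The property value is a string,
since the dict entries get copied into a Hadoop Configuration on the JVM
side:

    # Suggest ~32MB splits via the new-API property name.
    conf = {"mapreduce.input.fileinputformat.split.maxsize": "33554432"}

    rdd = sc.newAPIHadoopFile(
        "hdfs:///data/events.parquet",
        "parquet.avro.AvroParquetInputFormat",   # placeholder class names
        "java.lang.Void",
        "org.apache.avro.generic.IndexedRecord",
        conf=conf,
    )
    print(rdd.getNumPartitions())  # still 1, despite the hint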

On Mon, Sep 15, 2014 at 3:26 PM, Sean Owen <so...@cloudera.com> wrote:


Re: minPartitions for non-text files?

Posted by Sean Owen <so...@cloudera.com>.
Heh, it's still just a suggestion to Hadoop I guess, not guaranteed.

Is it a splittable format? For example, some compressed formats are
not splittable, so Hadoop has to process whole files at a time.
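
For instance (untested), gzip is not splittable while bzip2 is, and the
difference shows up directly in the partition count:

    # gzip: the whole file becomes a single partition, whatever you ask for.
    gz = sc.textFile("hdfs:///data/big.log.gz", minPartitions=10)
    print(gz.getNumPartitions())  # 1

    # bzip2: splittable, so the hint can take effect.
    bz = sc.textFile("hdfs:///data/big.log.bz2", minPartitions=10)
    print(bz.getNumPartitions())  # >= 10, if the file is large enough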

I'm also not sure if this is something to do with PySpark, since the
underlying Scala API takes a Configuration object rather than a
dictionary.

On Mon, Sep 15, 2014 at 11:23 PM, Eric Friedman
<er...@gmail.com> wrote:



Re: minPartitions for non-text files?

Posted by Eric Friedman <er...@gmail.com>.
That would be awesome, but doesn't seem to have any effect.

In PySpark, I created a dict with that key and a numeric value, then passed
it into newAPIHadoopFile as a value for the "conf" keyword.  The returned
RDD still has a single partition.

On Mon, Sep 15, 2014 at 1:56 PM, Sean Owen <so...@cloudera.com> wrote:


Re: minPartitions for non-text files?

Posted by Sean Owen <so...@cloudera.com>.
I think the reason is simply that there is no longer an explicit
min-partitions argument for Hadoop InputSplits in the new Hadoop APIs.
At least, I didn't see it when I glanced just now.

However, you should be able to get the same effect by setting a
Configuration property, and you can do so through the newAPIHadoopFile
method. You set it as a suggested maximum split size rather than a
suggested minimum number of splits.

Although I think the old config property mapred.max.split.size is
still respected, you may try
mapreduce.input.fileinputformat.split.maxsize instead, which appears
to be the intended replacement in the new APIs.
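
If passing it per call doesn't take, another thing to try (untested) is
setting the property once on the context's shared Hadoop Configuration.
Note _jsc is an internal PySpark handle, so treat this as a workaround
rather than a supported API:

    # Suggest ~64MB maximum splits; set both the new- and old-API
    # property names, since either one may be the one consulted.
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set("mapreduce.input.fileinputformat.split.maxsize", "67108864")
    hconf.set("mapred.max.split.size", "67108864")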

On Mon, Sep 15, 2014 at 9:35 PM, Eric Friedman
<er...@gmail.com> wrote:
