Posted to user@spark.apache.org by Brad Miller <bm...@eecs.berkeley.edu> on 2014/09/12 03:12:14 UTC

coalesce on SchemaRDD in pyspark

Hi All,

I'm having some trouble with the coalesce and repartition functions for
SchemaRDD objects in pyspark.  When I run:

sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}',
'{"foo":"baz"}'])).coalesce(1)

I get this error:

Py4JError: An error occurred while calling o94.coalesce. Trace:
py4j.Py4JException: Method coalesce([class java.lang.Integer, class
java.lang.Boolean]) does not exist
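
The Py4JException above means the Python wrapper asked the JVM for a two-argument coalesce(Integer, Boolean), while the Java-side SchemaRDD at the time only defined a three-argument signature; Py4J resolves methods by name and argument types at call time, so the mismatch surfaces only when the method is invoked. A rough pure-Python analogy of that mismatch (not Spark code; the class names here are illustrative only):

```python
class JavaSchemaRDD:
    """Stands in for the Java-side SchemaRDD in this analogy."""
    def coalesce(self, num_partitions, shuffle, ordering):  # three parameters
        return "coalesced to %d" % num_partitions

class PySchemaRDD:
    """Stands in for the broken Python wrapper (pre-SPARK-3500 fix)."""
    def __init__(self, jrdd):
        self._jschema_rdd = jrdd

    def coalesce(self, num_partitions, shuffle=False):
        # Bug analogue: forwards only two arguments, but the "Java" side
        # requires three, so the call fails at invocation time (Py4J raises
        # Py4JException; plain Python raises TypeError).
        return self._jschema_rdd.coalesce(num_partitions, shuffle)

srdd = PySchemaRDD(JavaSchemaRDD())
try:
    srdd.coalesce(1)
except TypeError as exc:
    print("call failed:", exc)
```

Passing all three arguments directly (as in the workaround discussed below in the thread) sidesteps the broken wrapper entirely.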

For context, I have a dataset stored in a parquet file, and I'm using
SQLContext to make several queries against the data.  I then register the
results of these queries as new tables in the SQLContext.  Unfortunately
each new table has the same number of partitions as the original (despite
being much smaller).  Hence my interest in coalesce and repartition.

Has anybody else encountered this bug?  Is there an alternate workflow I
should consider?

I am running the 1.1.0 binaries released today.

best,
-Brad

Re: coalesce on SchemaRDD in pyspark

Posted by Davies Liu <da...@databricks.com>.
On Fri, Sep 12, 2014 at 8:55 AM, Brad Miller <bm...@eecs.berkeley.edu> wrote:
> Hi Davies,
>
> Thanks for the quick fix. I'm sorry to send out a bug report on release day
> - 1.1.0 really is a great release.  I've been running the 1.1 branch for a
> while and there's definitely lots of good stuff.
>
> For the workaround, I think you may have meant:
>
> srdd2 = SchemaRDD(srdd._jschema_rdd.coalesce(N, False, None), sqlCtx)

Yes, thanks for the correction.

> Note:
> "_schema_rdd" -> "_jschema_rdd"
> "false" -> "False"
>
> That workaround seems to work fine (in that I've observed the correct number
> of partitions in the web UI, although I haven't tested it beyond that).
>
> Thanks!
> -Brad
>
> On Thu, Sep 11, 2014 at 11:30 PM, Davies Liu <da...@databricks.com> wrote:
>>
>> This is a bug; I have created an issue to track it:
>> https://issues.apache.org/jira/browse/SPARK-3500
>>
>> Also, there is a PR to fix it: https://github.com/apache/spark/pull/2369
>>
>> Before the next bugfix release, you can work around it by:
>>
>> srdd = sqlCtx.jsonRDD(rdd)
>> srdd2 = SchemaRDD(srdd._schema_rdd.coalesce(N, false, None), sqlCtx)
>>
>>
>> On Thu, Sep 11, 2014 at 6:12 PM, Brad Miller <bm...@eecs.berkeley.edu>
>> wrote:
>> > Hi All,
>> >
>> > I'm having some trouble with the coalesce and repartition functions for
>> > SchemaRDD objects in pyspark.  When I run:
>> >
>> > sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}',
>> > '{"foo":"baz"}'])).coalesce(1)
>> >
>> > I get this error:
>> >
>> > Py4JError: An error occurred while calling o94.coalesce. Trace:
>> > py4j.Py4JException: Method coalesce([class java.lang.Integer, class
>> > java.lang.Boolean]) does not exist
>> >
>> > For context, I have a dataset stored in a parquet file, and I'm using
>> > SQLContext to make several queries against the data.  I then register
>> > the
>> > results of these queries as new tables in the SQLContext.  Unfortunately
>> > each new table has the same number of partitions as the original
>> > (despite
>> > being much smaller).  Hence my interest in coalesce and repartition.
>> >
>> > Has anybody else encountered this bug?  Is there an alternate workflow I
>> > should consider?
>> >
>> > I am running the 1.1.0 binaries released today.
>> >
>> > best,
>> > -Brad
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: coalesce on SchemaRDD in pyspark

Posted by Brad Miller <bm...@eecs.berkeley.edu>.
Hi Davies,

Thanks for the quick fix. I'm sorry to send out a bug report on release day
- 1.1.0 really is a great release.  I've been running the 1.1 branch for a
while and there's definitely lots of good stuff.

For the workaround, I think you may have meant:

srdd2 = SchemaRDD(srdd._jschema_rdd.coalesce(N, False, None), sqlCtx)

Note:
"_schema_rdd" -> "_jschema_rdd"
"false" -> "False"

That workaround seems to work fine (in that I've observed the correct
number of partitions in the web UI, although I haven't tested it beyond
that).
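
For reuse, the corrected workaround can be wrapped in a small helper. A minimal sketch, assuming Spark 1.1.x PySpark (the helper name is mine, not part of the API); the returned Java-side object still needs to be rewrapped in a SchemaRDD as shown in the thread:

```python
def coalesce_jschema_rdd(srdd, num_partitions):
    """Call the three-argument Java-side coalesce directly via Py4J.

    Bypasses the broken Python wrapper (SPARK-3500) by passing Python's
    False/None for the Scala shuffle/ordering parameters.
    """
    return srdd._jschema_rdd.coalesce(num_partitions, False, None)

# Usage (assuming an existing sqlCtx and SchemaRDD srdd, as in the thread):
# srdd2 = SchemaRDD(coalesce_jschema_rdd(srdd, 1), sqlCtx)
```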

Thanks!
-Brad

On Thu, Sep 11, 2014 at 11:30 PM, Davies Liu <da...@databricks.com> wrote:

> This is a bug; I have created an issue to track it:
> https://issues.apache.org/jira/browse/SPARK-3500
>
> Also, there is a PR to fix it: https://github.com/apache/spark/pull/2369
>
> Before the next bugfix release, you can work around it by:
>
> srdd = sqlCtx.jsonRDD(rdd)
> srdd2 = SchemaRDD(srdd._schema_rdd.coalesce(N, false, None), sqlCtx)
>
>
> On Thu, Sep 11, 2014 at 6:12 PM, Brad Miller <bm...@eecs.berkeley.edu>
> wrote:
> > Hi All,
> >
> > I'm having some trouble with the coalesce and repartition functions for
> > SchemaRDD objects in pyspark.  When I run:
> >
> > sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}',
> > '{"foo":"baz"}'])).coalesce(1)
> >
> > I get this error:
> >
> > Py4JError: An error occurred while calling o94.coalesce. Trace:
> > py4j.Py4JException: Method coalesce([class java.lang.Integer, class
> > java.lang.Boolean]) does not exist
> >
> > For context, I have a dataset stored in a parquet file, and I'm using
> > SQLContext to make several queries against the data.  I then register the
> > results of these queries as new tables in the SQLContext.  Unfortunately
> > each new table has the same number of partitions as the original (despite
> > being much smaller).  Hence my interest in coalesce and repartition.
> >
> > Has anybody else encountered this bug?  Is there an alternate workflow I
> > should consider?
> >
> > I am running the 1.1.0 binaries released today.
> >
> > best,
> > -Brad
>

Re: coalesce on SchemaRDD in pyspark

Posted by Davies Liu <da...@databricks.com>.
This is a bug; I have created an issue to track it:
https://issues.apache.org/jira/browse/SPARK-3500

Also, there is a PR to fix it: https://github.com/apache/spark/pull/2369

Before the next bugfix release, you can work around it by:

srdd = sqlCtx.jsonRDD(rdd)
srdd2 = SchemaRDD(srdd._schema_rdd.coalesce(N, false, None), sqlCtx)


On Thu, Sep 11, 2014 at 6:12 PM, Brad Miller <bm...@eecs.berkeley.edu> wrote:
> Hi All,
>
> I'm having some trouble with the coalesce and repartition functions for
> SchemaRDD objects in pyspark.  When I run:
>
> sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}',
> '{"foo":"baz"}'])).coalesce(1)
>
> I get this error:
>
> Py4JError: An error occurred while calling o94.coalesce. Trace:
> py4j.Py4JException: Method coalesce([class java.lang.Integer, class
> java.lang.Boolean]) does not exist
>
> For context, I have a dataset stored in a parquet file, and I'm using
> SQLContext to make several queries against the data.  I then register the
> results of these queries as new tables in the SQLContext.  Unfortunately
> each new table has the same number of partitions as the original (despite
> being much smaller).  Hence my interest in coalesce and repartition.
>
> Has anybody else encountered this bug?  Is there an alternate workflow I
> should consider?
>
> I am running the 1.1.0 binaries released today.
>
> best,
> -Brad
