Posted to user@spark.apache.org by Eric Friedman <er...@gmail.com> on 2014/09/18 17:49:28 UTC

schema for schema

I have a SchemaRDD which I've gotten from a parquetFile.

I did some transforms on it and now want to save it back out as parquet again.

Getting a SchemaRDD back proves challenging because some of my fields can be
null/None, and SQLContext.inferSchema rejects those.
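
For illustration, a minimal sketch of what fails for me (hypothetical field
names, and assuming sc/sq are the usual SparkContext/SQLContext; the exact
error depends on the Spark version):

from pyspark.sql import Row

# score is None in the sampled row, so inferSchema cannot map it
# to a SQL type and refuses the RDD
rows = sc.parallelize([Row(name="a", score=None)])
sq.inferSchema(rows)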

So, I decided to use the schema on the original RDD with
SQLContext.applySchema.

This works, but only if I add a map function to turn my Row objects into
lists (PySpark):

applied = sq.applySchema(transformed_rows.map(lambda r: list(r)),
                         original_parquet_file.schema())


This seems a bit kludgy.  Is there a better way?  Should there be?
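
For reference, the whole round trip as a minimal sketch (paths and the
identity transform are placeholders; this assumes the Spark 1.1 PySpark API,
where parquetFile returns a SchemaRDD and saveAsParquetFile writes one back
out):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="reapply-schema")
sq = SQLContext(sc)

# hypothetical input path; parquetFile returns a SchemaRDD
original_parquet_file = sq.parquetFile("hdfs:///data/in.parquet")

# any map over a SchemaRDD yields a plain RDD of Row objects;
# the identity lambda stands in for the real transforms
transformed_rows = original_parquet_file.map(lambda r: r)

# workaround: turn each Row into a list before applySchema,
# reusing the schema captured from the original file
applied = sq.applySchema(transformed_rows.map(lambda r: list(r)),
                         original_parquet_file.schema())

# write the result back out as parquet
applied.saveAsParquetFile("hdfs:///data/out.parquet")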

Re: schema for schema

Posted by Eric Friedman <er...@gmail.com>.
Thanks!

On Thu, Sep 18, 2014 at 1:14 PM, Davies Liu <da...@databricks.com> wrote:

> Thanks for reporting this; it will be fixed by
> https://github.com/apache/spark/pull/2448

Re: schema for schema

Posted by Davies Liu <da...@databricks.com>.
Thanks for reporting this; it will be fixed by
https://github.com/apache/spark/pull/2448
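
If that fix makes applySchema accept Row objects directly (the pain point
behind the list(r) workaround), the extra map should no longer be needed:

applied = sq.applySchema(transformed_rows,
                         original_parquet_file.schema())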

On Thu, Sep 18, 2014 at 12:32 PM, Michael Armbrust
<mi...@databricks.com> wrote:
> This looks like a bug; we are investigating.

Re: schema for schema

Posted by Michael Armbrust <mi...@databricks.com>.
This looks like a bug; we are investigating.
