Posted to user@avro.apache.org by Fengyun RAO <ra...@gmail.com> on 2014/07/30 11:14:32 UTC

Is there a way to write spark RDD to Avro files

We used mapreduce for ETL and storing results in Avro files, which are
loaded to hive/impala for query.

Now we are trying to migrate to spark, but didn't find a way to write
resulting RDD to Avro files.

I wonder if there is a way to do it, or if not, why Spark doesn't support
Avro as well as MapReduce does. Are there any plans?

Or what's the recommended way to output spark results with schema? I don't
think plain text is a good choice.

Re: Is there a way to write spark RDD to Avro files

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,
Have you checked out SchemaRDD?
There should be an example of writing to Parquet files there.
BTW, FYI I was discussing this with the SparkSQL developers last week and
possibly using Apache Gora [0] for achieving this.
HTH
Lewis
[0] http://gora.apache.org
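
For reference, a minimal sketch of the SchemaRDD route (Spark 1.0-era API; the
case class and output path below are illustrative assumptions, not from the
original poster's job):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// A case class gives the RDD a schema via reflection.
case class LogRecord(id: String, count: Int)

object ParquetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "parquet-sketch")
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit: RDD[Product] -> SchemaRDD

    val records = sc.parallelize(Seq(LogRecord("a", 1), LogRecord("b", 2)))
    // The Parquet file is self-describing, so Hive/Impala can read it back.
    records.saveAsParquetFile("/tmp/records.parquet")  // assumed path
  }
}
```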


On Wed, Jul 30, 2014 at 5:14 AM, Fengyun RAO <ra...@gmail.com> wrote:

> We used mapreduce for ETL and storing results in Avro files, which are
> loaded to hive/impala for query.
>
> Now we are trying to migrate to spark, but didn't find a way to write
> resulting RDD to Avro files.
>
> I wonder if there is a way to make it, or if not, why spark doesn't
> support Avro as well as mapreduce? Are there any plans?
>
> Or what's the recommended way to output spark results with schema? I don't
> think plain text is a good choice.
>



-- 
*Lewis*

Re: Is there a way to write spark RDD to Avro files

Posted by touchdown <yu...@gmail.com>.
YES! This worked! thanks!



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-a-way-to-write-spark-RDD-to-Avro-files-tp10947p11245.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Is there a way to write spark RDD to Avro files

Posted by Fengyun RAO <ra...@gmail.com>.
Below works for me:

        import org.apache.avro.Schema
        import org.apache.avro.mapred.AvroKey
        import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
        import org.apache.hadoop.io.NullWritable
        import org.apache.hadoop.mapreduce.Job

        val job = Job.getInstance
        val schema = Schema.create(Schema.Type.STRING)
        AvroJob.setOutputKeySchema(job, schema)

        // Each record becomes an (AvroKey, NullWritable) pair; AvroKeyOutputFormat
        // writes the keys using the schema set on the job above.
        records.map(item => (new AvroKey[String](item.getGridsumId), NullWritable.get()))
               .saveAsNewAPIHadoopFile(args(1),
                                       classOf[AvroKey[String]],
                                       classOf[NullWritable],
                                       classOf[AvroKeyOutputFormat[String]],
                                       job.getConfiguration)


2014-08-02 13:49 GMT+08:00 touchdown <yu...@gmail.com>:

> Yes, I saw that after I looked at it closer. Thanks! But I am running into
> a
> schema not set error:
> Writer schema for output key was not set. Use AvroJob.setOutputKeySchema()
>
> I am in the process of figuring out how to set schema for an AvroJob from a
> HDFS file, but any pointer is much appreciated! Thanks again!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-a-way-to-write-spark-RDD-to-Avro-files-tp10947p11241.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>

Re: Is there a way to write spark RDD to Avro files

Posted by touchdown <yu...@gmail.com>.
Yes, I saw that after I looked at it closer. Thanks! But I am running into a
schema not set error:
Writer schema for output key was not set. Use AvroJob.setOutputKeySchema()

I am in the process of figuring out how to set schema for an AvroJob from a
HDFS file, but any pointer is much appreciated! Thanks again!
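
One way to load the writer schema from a .avsc file on HDFS before setting it
on the job, sketched below (the path handling and object name are assumptions;
Schema.Parser accepts an InputStream directly):

```scala
import org.apache.avro.Schema
import org.apache.avro.mapreduce.AvroJob
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.mapreduce.Job

object SchemaFromHdfs {
  // Read and parse an Avro schema file stored on HDFS.
  def loadSchema(conf: Configuration, avscPath: String): Schema = {
    val fs = FileSystem.get(conf)
    val in = fs.open(new Path(avscPath))
    try new Schema.Parser().parse(in)
    finally in.close()
  }

  // Build a Job with the output key schema set, which avoids the
  // "Writer schema for output key was not set" error.
  def configuredJob(avscPath: String): Job = {
    val job = Job.getInstance
    AvroJob.setOutputKeySchema(job, loadSchema(job.getConfiguration, avscPath))
    job
  }
}
```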



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-a-way-to-write-spark-RDD-to-Avro-files-tp10947p11241.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Is there a way to write spark RDD to Avro files

Posted by Ron Gonzalez <zl...@yahoo.com>.
You have to import org.apache.spark.rdd._, which will automatically make available this method.
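
A note on where the method comes from (the exact import has moved between Spark
versions, so treat this as a sketch): saveAsNewAPIHadoopFile is defined on
org.apache.spark.rdd.PairRDDFunctions, not on RDD itself, and appears on any
RDD of (K, V) pairs through an implicit conversion. On Spark 1.0-1.2 that
conversion is brought in by importing from the SparkContext companion:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // rddToPairRDDFunctions implicit (Spark 1.0-1.2)

object PairSaveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "pair-save-sketch")
    // With the implicit in scope, this pair RDD gains the Hadoop save
    // methods (saveAsHadoopFile, saveAsNewAPIHadoopFile, ...) as well as
    // the usual pair operations such as reduceByKey.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
    println(pairs.reduceByKey(_ + _).count())
  }
}
```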

Thanks,
Ron

Sent from my iPhone

> On Aug 1, 2014, at 3:26 PM, touchdown <yu...@gmail.com> wrote:
> 
> Hi, I am facing a similar dilemma. I am trying to aggregate a bunch of small
> avro files into one avro file. I read it in with:
> sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
> AvroKeyInputFormat[GenericRecord]](path)
> 
> but I can't find saveAsHadoopFile or saveAsNewAPIHadoopFile. Can you please
> tell us how it worked for you thanks!
> 
> 
> 
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-a-way-to-write-spark-RDD-to-Avro-files-tp10947p11219.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Is there a way to write spark RDD to Avro files

Posted by touchdown <yu...@gmail.com>.
Hi, I am facing a similar dilemma. I am trying to aggregate a bunch of small
avro files into one avro file. I read it in with:
sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
AvroKeyInputFormat[GenericRecord]](path)

but I can't find saveAsHadoopFile or saveAsNewAPIHadoopFile. Can you please
tell us how it worked for you thanks!



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-a-way-to-write-spark-RDD-to-Avro-files-tp10947p11219.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Is there a way to write spark RDD to Avro files

Posted by Fengyun RAO <ra...@gmail.com>.
Thanks, Marcelo. It works!


2014-07-31 5:37 GMT+08:00 Marcelo Vanzin <va...@cloudera.com>:

> Hi Fengyun,
>
> Have you tried to use saveAsHadoopFile() (or
> saveAsNewAPIHadoopFile())? You should be able to do something with
> that API by using AvroKeyValueOutputFormat.
>
> The API is defined here:
>
> http://spark.apache.org/docs/1.0.0/api/scala/#org.apache.spark.rdd.PairRDDFunctions
>
> Lots of RDD types include that functionality already.
>
>
> On Wed, Jul 30, 2014 at 2:14 AM, Fengyun RAO <ra...@gmail.com> wrote:
> > We used mapreduce for ETL and storing results in Avro files, which are
> > loaded to hive/impala for query.
> >
> > Now we are trying to migrate to spark, but didn't find a way to write
> > resulting RDD to Avro files.
> >
> > I wonder if there is a way to make it, or if not, why spark doesn't
> support
> > Avro as well as mapreduce? Are there any plans?
> >
> > Or what's the recommended way to output spark results with schema? I
> don't
> > think plain text is a good choice.
>
>
>
> --
> Marcelo
>

Re: Is there a way to write spark RDD to Avro files

Posted by Marcelo Vanzin <va...@cloudera.com>.
Hi Fengyun,

Have you tried to use saveAsHadoopFile() (or
saveAsNewAPIHadoopFile())? You should be able to do something with
that API by using AvroKeyValueOutputFormat.

The API is defined here:
http://spark.apache.org/docs/1.0.0/api/scala/#org.apache.spark.rdd.PairRDDFunctions

Lots of RDD types include that functionality already.
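
A minimal sketch of what that could look like with AvroKeyValueOutputFormat
(the schemas, record values, and output path here are assumptions for
illustration, not tested production code):

```scala
import org.apache.avro.Schema
import org.apache.avro.mapred.{AvroKey, AvroValue}
import org.apache.avro.mapreduce.{AvroJob, AvroKeyValueOutputFormat}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object AvroSaveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "avro-save-sketch")
    val job = Job.getInstance

    // Both key and value schemas must be set on the job before saving.
    AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.STRING))
    AvroJob.setOutputValueSchema(job, Schema.create(Schema.Type.LONG))

    val pairs = sc.parallelize(Seq(("a", 1L), ("b", 2L)))
      .map { case (k, v) => (new AvroKey[String](k), new AvroValue[java.lang.Long](v)) }

    pairs.saveAsNewAPIHadoopFile(
      "/tmp/avro-out",  // assumed output path
      classOf[AvroKey[String]],
      classOf[AvroValue[java.lang.Long]],
      classOf[AvroKeyValueOutputFormat[String, java.lang.Long]],
      job.getConfiguration)
  }
}
```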


On Wed, Jul 30, 2014 at 2:14 AM, Fengyun RAO <ra...@gmail.com> wrote:
> We used mapreduce for ETL and storing results in Avro files, which are
> loaded to hive/impala for query.
>
> Now we are trying to migrate to spark, but didn't find a way to write
> resulting RDD to Avro files.
>
> I wonder if there is a way to make it, or if not, why spark doesn't support
> Avro as well as mapreduce? Are there any plans?
>
> Or what's the recommended way to output spark results with schema? I don't
> think plain text is a good choice.



-- 
Marcelo
