Posted to user@spark.apache.org by al b <be...@googlemail.com> on 2014/11/20 19:19:39 UTC

How can I read this avro file using spark & scala?

I've read several posts of people struggling to read avro in spark. The
examples I've tried don't work. When I try this solution (
https://stackoverflow.com/questions/23944615/how-can-i-load-avros-in-spark-using-the-schema-on-board-the-avro-files)
I get errors:

spark java.io.NotSerializableException: org.apache.avro.mapred.AvroWrapper

How can I read the following sample file in spark using scala?

http://www.4shared.com/file/SxnYcdgJce/sample.html

Thomas

Re: How can I read this avro file using spark & scala?

Posted by Simone Franzini <ca...@gmail.com>.
I have also been struggling with reading Avro. Very glad to hear that there
is a new Avro library coming in Spark 1.2 (which, by the way, seems to have
a lot of other very useful improvements).

In the meantime, I have been able to piece together snippets and tips from
various sources, and I am now able to read/write Avro correctly. From my
understanding, you basically need three pieces:
1. Use the Kryo serializer.
2. Register your Avro classes. I have done this using Twitter chill 0.4.0.
3. Read/write Avro with a snippet of code like the one you posted.

Here is relevant code (hopefully all of it).

// All of the following are needed in order to read/write Avro files
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import org.apache.hadoop.fs.{ FileSystem, Path }
// Uncomment the following line if you want to use generic Avro; I am using specific
//import org.apache.avro.generic.GenericData
import org.apache.avro.Schema
import org.apache.avro.mapreduce.{ AvroJob, AvroKeyInputFormat, AvroKeyOutputFormat }
import org.apache.avro.mapred.AvroKey
// Kryo/Avro serialization stuff
import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.avro.AvroSerializer
// Spark imports
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.serializer.{ KryoSerializer, KryoRegistrator }

object MyApp {
    def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .set("spark.kryo.registrator", "MyRegistrator")
        val sc = new SparkContext(conf)

        // schema, inputPath and outputPath are placeholders: schema is the
        // org.apache.avro.Schema of MyAvroClass, the paths point at your data

        // Read
        val readJob = new Job()
        AvroJob.setInputKeySchema(readJob, schema)
        val rdd = sc.newAPIHadoopFile(inputPath,
            classOf[AvroKeyInputFormat[MyAvroClass]],
            classOf[AvroKey[MyAvroClass]],
            classOf[NullWritable],
            readJob.getConfiguration)
            .map { t => t._1.datum }

        // Write
        val rddAvroWritable = rdd.map { s => (new AvroKey(s), NullWritable.get) }
        val writeJob = new Job()
        AvroJob.setOutputKeySchema(writeJob, schema)
        writeJob.setOutputFormatClass(classOf[AvroKeyOutputFormat[MyAvroClass]])
        rddAvroWritable.saveAsNewAPIHadoopFile(outputPath,
            classOf[AvroKey[MyAvroClass]],
            classOf[NullWritable],
            classOf[AvroKeyOutputFormat[MyAvroClass]],
            writeJob.getConfiguration)
    }
}


class MyRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo) {
        // Put a line like the following for each of your Avro classes if
        // you use specific Avro. If you use generic Avro, chill also has a
        // function for that: GenericRecordSerializer
        kryo.register(classOf[MyAvroClass],
            AvroSerializer.SpecificRecordBinarySerializer[MyAvroClass])
    }
}
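
For completeness, this is a guess at the sbt dependencies the setup above
relies on; the versions are assumptions based on the chill 0.4.0 mentioned
earlier and may need adjusting for your environment:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
  "org.apache.avro" % "avro" % "1.7.7",
  "org.apache.avro" % "avro-mapred" % "1.7.7" classifier "hadoop2",
  "com.twitter" %% "chill-avro" % "0.4.0"
)

And if you use generic Avro instead of specific, the registration would
look roughly like this (a sketch assuming chill-avro's
GenericRecordSerializer; `schema` is your Avro Schema):

// Sketch only: register generic records, handing Kryo the schema
kryo.register(classOf[GenericData.Record],
    AvroSerializer.GenericRecordSerializer[GenericData.Record](schema))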

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Fri, Nov 21, 2014 at 7:04 AM, thomas j <be...@googlemail.com> wrote:

> I've been able to load a different avro file based on GenericRecord with:
>
> val person = sqlContext.avroFile("/tmp/person.avro")
>
> When I try to call `first()` on it, I get "NotSerializableException"
> exceptions again:
>
> person.first()
>
> ...
> 14/11/21 12:59:17 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID
> 20)
> java.io.NotSerializableException:
> org.apache.avro.generic.GenericData$Record
>     at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
>     at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
>     at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
>     at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>     ...
>
> Apart from this, I want to transform the records into pairs of (user_id,
> record). I can do this by specifying the offset of the user_id column with
> something like this:
>
> person.map(r => (r.getInt(2), r)).take(4)
>
> Is there any way to specify the column name ("user_id") instead of
> needing to know/calculate the offset?
>
> Thanks again
>
>
> On Fri, Nov 21, 2014 at 11:48 AM, thomas j <be...@googlemail.com>
> wrote:
>
>> Thanks for the pointer Michael.
>>
>> I've downloaded spark 1.2.0 from
>> https://people.apache.org/~pwendell/spark-1.2.0-snapshot1/ and cloned and
>> built the spark-avro repo you linked to.
>>
>> When I run it against the example avro file linked to in the
>> documentation it works. However, when I try to load my avro file (linked to
>> in my original question) I receive the following error:
>>
>> java.lang.RuntimeException: Unsupported type LONG
>>     at scala.sys.package$.error(package.scala:27)
>>     at com.databricks.spark.avro.AvroRelation.com
>> $databricks$spark$avro$AvroRelation$$toSqlType(AvroRelation.scala:116)
>>     at
>> com.databricks.spark.avro.AvroRelation$$anonfun$5.apply(AvroRelation.scala:97)
>>     at
>> com.databricks.spark.avro.AvroRelation$$anonfun$5.apply(AvroRelation.scala:96)
>>     at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>     at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>     ...
>>
>> If it would be useful, I'm happy to load the various Avro files I have
>> to help battle-test spark-avro.
>>
>> Thanks
>>
>> On Thu, Nov 20, 2014 at 6:30 PM, Michael Armbrust <michael@databricks.com
>> > wrote:
>>
>>> One option (starting with Spark 1.2, which is currently in preview) is
>>> to use the Avro library for Spark SQL.  This is very new, but we would love
>>> to get feedback: https://github.com/databricks/spark-avro
>>>
>>> On Thu, Nov 20, 2014 at 10:19 AM, al b <be...@googlemail.com> wrote:
>>>
>>>> I've read several posts of people struggling to read avro in spark. The
>>>> examples I've tried don't work. When I try this solution (
>>>> https://stackoverflow.com/questions/23944615/how-can-i-load-avros-in-spark-using-the-schema-on-board-the-avro-files)
>>>> I get errors:
>>>>
>>>> spark java.io.NotSerializableException:
>>>> org.apache.avro.mapred.AvroWrapper
>>>>
>>>> How can I read the following sample file in spark using scala?
>>>>
>>>> http://www.4shared.com/file/SxnYcdgJce/sample.html
>>>>
>>>> Thomas
>>>>
>>>
>>>
>>
>

Re: Re: How can I read this avro file using spark & scala?

Posted by VISHNU SUBRAMANIAN <jo...@gmail.com>.
Check this link, the home page for the spark-avro project:
https://github.com/databricks/spark-avro

Thanks,
Vishnu

On Wed, Feb 11, 2015 at 10:19 PM, Todd <bi...@163.com> wrote:

> Databricks provides sample code on its website, but I can't find it
> right now.
>
> At 2015-02-12 00:43:07, "captainfranz" <ca...@gmail.com> wrote:
> >I am confused as to whether Avro support was merged into Spark 1.2 or
> >whether it is still an independent library.
> >I see some people writing sqlContext.avroFile, similar to jsonFile, but
> >this does not work for me, nor do I see it in the Scala docs.
> >
> >
> >
>
>

Re: Re: How can I read this avro file using spark & scala?

Posted by Todd <bi...@163.com>.
Databricks provides sample code on its website, but I can't find it right now.

At 2015-02-12 00:43:07, "captainfranz" <ca...@gmail.com> wrote:
>I am confused as to whether Avro support was merged into Spark 1.2 or
>whether it is still an independent library.
>I see some people writing sqlContext.avroFile, similar to jsonFile, but
>this does not work for me, nor do I see it in the Scala docs.
>
>
>

Re: How can I read this avro file using spark & scala?

Posted by captainfranz <ca...@gmail.com>.
I am confused as to whether Avro support was merged into Spark 1.2 or
whether it is still an independent library.
I see some people writing sqlContext.avroFile, similar to jsonFile, but
this does not work for me, nor do I see it in the Scala docs.





Re: How can I read this avro file using spark & scala?

Posted by Michael Armbrust <mi...@databricks.com>.
Thanks for the feedback, I filed a couple of issues:
https://github.com/databricks/spark-avro/issues

On Fri, Nov 21, 2014 at 5:04 AM, thomas j <be...@googlemail.com> wrote:

> I've been able to load a different avro file based on GenericRecord with:
>
> val person = sqlContext.avroFile("/tmp/person.avro")
>
> When I try to call `first()` on it, I get "NotSerializableException"
> exceptions again:
>
> person.first()
>
> ...
> 14/11/21 12:59:17 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID
> 20)
> java.io.NotSerializableException:
> org.apache.avro.generic.GenericData$Record
>     at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
>     at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
>     at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
>     at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>     ...
>
> Apart from this, I want to transform the records into pairs of (user_id,
> record). I can do this by specifying the offset of the user_id column with
> something like this:
>
> person.map(r => (r.getInt(2), r)).take(4)
>
> Is there any way to specify the column name ("user_id") instead of
> needing to know/calculate the offset?
>
> Thanks again
>
>
> On Fri, Nov 21, 2014 at 11:48 AM, thomas j <be...@googlemail.com>
> wrote:
>
>> Thanks for the pointer Michael.
>>
>> I've downloaded spark 1.2.0 from
>> https://people.apache.org/~pwendell/spark-1.2.0-snapshot1/ and cloned and
>> built the spark-avro repo you linked to.
>>
>> When I run it against the example avro file linked to in the
>> documentation it works. However, when I try to load my avro file (linked to
>> in my original question) I receive the following error:
>>
>> java.lang.RuntimeException: Unsupported type LONG
>>     at scala.sys.package$.error(package.scala:27)
>>     at com.databricks.spark.avro.AvroRelation.com
>> $databricks$spark$avro$AvroRelation$$toSqlType(AvroRelation.scala:116)
>>     at
>> com.databricks.spark.avro.AvroRelation$$anonfun$5.apply(AvroRelation.scala:97)
>>     at
>> com.databricks.spark.avro.AvroRelation$$anonfun$5.apply(AvroRelation.scala:96)
>>     at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>     at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>     ...
>>
>> If it would be useful, I'm happy to load the various Avro files I have
>> to help battle-test spark-avro.
>>
>> Thanks
>>
>> On Thu, Nov 20, 2014 at 6:30 PM, Michael Armbrust <michael@databricks.com
>> > wrote:
>>
>>> One option (starting with Spark 1.2, which is currently in preview) is
>>> to use the Avro library for Spark SQL.  This is very new, but we would love
>>> to get feedback: https://github.com/databricks/spark-avro
>>>
>>> On Thu, Nov 20, 2014 at 10:19 AM, al b <be...@googlemail.com> wrote:
>>>
>>>> I've read several posts of people struggling to read avro in spark. The
>>>> examples I've tried don't work. When I try this solution (
>>>> https://stackoverflow.com/questions/23944615/how-can-i-load-avros-in-spark-using-the-schema-on-board-the-avro-files)
>>>> I get errors:
>>>>
>>>> spark java.io.NotSerializableException:
>>>> org.apache.avro.mapred.AvroWrapper
>>>>
>>>> How can I read the following sample file in spark using scala?
>>>>
>>>> http://www.4shared.com/file/SxnYcdgJce/sample.html
>>>>
>>>> Thomas
>>>>
>>>
>>>
>>
>

Re: How can I read this avro file using spark & scala?

Posted by thomas j <be...@googlemail.com>.
I've been able to load a different avro file based on GenericRecord with:

val person = sqlContext.avroFile("/tmp/person.avro")

When I try to call `first()` on it, I get "NotSerializableException"
exceptions again:

person.first()

...
14/11/21 12:59:17 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID
20)
java.io.NotSerializableException: org.apache.avro.generic.GenericData$Record
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
    at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
    at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    ...

Apart from this, I want to transform the records into pairs of (user_id,
record). I can do this by specifying the offset of the user_id column with
something like this:

person.map(r => (r.getInt(2), r)).take(4)

Is there any way to specify the column name ("user_id") instead of needing
to know/calculate the offset?
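
One route that should allow column-by-name access, sketched here under the
assumption that avroFile returns a SchemaRDD with Spark 1.2's SQL API, is
to register the data as a temp table and select the column by name:

// Sketch: select user_id by name via Spark SQL instead of by offset
person.registerTempTable("person")
val pairs = sqlContext.sql("SELECT user_id FROM person")
    .map(row => (row.getInt(0), row))
pairs.take(4)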

Thanks again


On Fri, Nov 21, 2014 at 11:48 AM, thomas j <be...@googlemail.com> wrote:

> Thanks for the pointer Michael.
>
> I've downloaded spark 1.2.0 from
> https://people.apache.org/~pwendell/spark-1.2.0-snapshot1/ and cloned and
> built the spark-avro repo you linked to.
>
> When I run it against the example avro file linked to in the documentation
> it works. However, when I try to load my avro file (linked to in my
> original question) I receive the following error:
>
> java.lang.RuntimeException: Unsupported type LONG
>     at scala.sys.package$.error(package.scala:27)
>     at com.databricks.spark.avro.AvroRelation.com
> $databricks$spark$avro$AvroRelation$$toSqlType(AvroRelation.scala:116)
>     at
> com.databricks.spark.avro.AvroRelation$$anonfun$5.apply(AvroRelation.scala:97)
>     at
> com.databricks.spark.avro.AvroRelation$$anonfun$5.apply(AvroRelation.scala:96)
>     at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>     at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>     ...
>
> If it would be useful, I'm happy to load the various Avro files I have
> to help battle-test spark-avro.
>
> Thanks
>
> On Thu, Nov 20, 2014 at 6:30 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> One option (starting with Spark 1.2, which is currently in preview) is to
>> use the Avro library for Spark SQL.  This is very new, but we would love to
>> get feedback: https://github.com/databricks/spark-avro
>>
>> On Thu, Nov 20, 2014 at 10:19 AM, al b <be...@googlemail.com> wrote:
>>
>>> I've read several posts of people struggling to read avro in spark. The
>>> examples I've tried don't work. When I try this solution (
>>> https://stackoverflow.com/questions/23944615/how-can-i-load-avros-in-spark-using-the-schema-on-board-the-avro-files)
>>> I get errors:
>>>
>>> spark java.io.NotSerializableException:
>>> org.apache.avro.mapred.AvroWrapper
>>>
>>> How can I read the following sample file in spark using scala?
>>>
>>> http://www.4shared.com/file/SxnYcdgJce/sample.html
>>>
>>> Thomas
>>>
>>
>>
>

Re: How can I read this avro file using spark & scala?

Posted by thomas j <be...@googlemail.com>.
Thanks for the pointer Michael.

I've downloaded spark 1.2.0 from
https://people.apache.org/~pwendell/spark-1.2.0-snapshot1/ and cloned and
built the spark-avro repo you linked to.

When I run it against the example avro file linked to in the documentation
it works. However, when I try to load my avro file (linked to in my
original question) I receive the following error:

java.lang.RuntimeException: Unsupported type LONG
    at scala.sys.package$.error(package.scala:27)
    at com.databricks.spark.avro.AvroRelation.com
$databricks$spark$avro$AvroRelation$$toSqlType(AvroRelation.scala:116)
    at
com.databricks.spark.avro.AvroRelation$$anonfun$5.apply(AvroRelation.scala:97)
    at
com.databricks.spark.avro.AvroRelation$$anonfun$5.apply(AvroRelation.scala:96)
    at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    ...
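
To pinpoint which field triggers the error, the file's schema can be
dumped with the plain Avro API (a sketch, assuming Avro 1.7.x on the
classpath; the path is a placeholder):

import java.io.File
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{ GenericDatumReader, GenericRecord }

// Open the container file and pretty-print the schema embedded in it
val reader = new DataFileReader(new File("sample.avro"),
    new GenericDatumReader[GenericRecord]())
println(reader.getSchema.toString(true))
reader.close()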

If it would be useful, I'm happy to load the various Avro files I have to
help battle-test spark-avro.

Thanks

On Thu, Nov 20, 2014 at 6:30 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> One option (starting with Spark 1.2, which is currently in preview) is to
> use the Avro library for Spark SQL.  This is very new, but we would love to
> get feedback: https://github.com/databricks/spark-avro
>
> On Thu, Nov 20, 2014 at 10:19 AM, al b <be...@googlemail.com> wrote:
>
>> I've read several posts of people struggling to read avro in spark. The
>> examples I've tried don't work. When I try this solution (
>> https://stackoverflow.com/questions/23944615/how-can-i-load-avros-in-spark-using-the-schema-on-board-the-avro-files)
>> I get errors:
>>
>> spark java.io.NotSerializableException:
>> org.apache.avro.mapred.AvroWrapper
>>
>> How can I read the following sample file in spark using scala?
>>
>> http://www.4shared.com/file/SxnYcdgJce/sample.html
>>
>> Thomas
>>
>
>

Re: How can I read this avro file using spark & scala?

Posted by Michael Armbrust <mi...@databricks.com>.
One option (starting with Spark 1.2, which is currently in preview) is to
use the Avro library for Spark SQL.  This is very new, but we would love to
get feedback: https://github.com/databricks/spark-avro
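
For anyone landing here, a minimal sketch of what usage looks like,
assuming spark-avro is built and on the classpath (episodes.avro is the
sample file from the project's README, and sc is an existing SparkContext):

import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._

// avroFile is added to SQLContext by the implicit in the import above
val sqlContext = new SQLContext(sc)
val episodes = sqlContext.avroFile("episodes.avro")
episodes.registerTempTable("episodes")
sqlContext.sql("SELECT title FROM episodes").collect().foreach(println)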

On Thu, Nov 20, 2014 at 10:19 AM, al b <be...@googlemail.com> wrote:

> I've read several posts of people struggling to read avro in spark. The
> examples I've tried don't work. When I try this solution (
> https://stackoverflow.com/questions/23944615/how-can-i-load-avros-in-spark-using-the-schema-on-board-the-avro-files)
> I get errors:
>
> spark java.io.NotSerializableException: org.apache.avro.mapred.AvroWrapper
>
> How can I read the following sample file in spark using scala?
>
> http://www.4shared.com/file/SxnYcdgJce/sample.html
>
> Thomas
>