Posted to issues@spark.apache.org by "Nira Amit (JIRA)" <ji...@apache.org> on 2017/03/05 17:41:32 UTC

[jira] [Reopened] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

     [ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nira Amit reopened SPARK-19656:
-------------------------------

[~srowen] Will you at least consider the possibility that I'm on to a real problem here? I may be wrong, of course, but I'm telling you that I've been struggling with this for weeks and that other developers are struggling with the same thing. At the very least it's worth clarifying this issue. I'm attaching a screenshot of what my code returns at runtime; it is definitely the right class, so the problem is NOT with my file reader. My reader extends AvroRecordReaderBase<ABgoEventAvroKey, NullWritable, ABgoEvent>.
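For context, the reader's structure is roughly the following (a minimal sketch, not my exact code; it assumes ABgoEvent is an Avro-generated class with a getClassSchema() method):

{code}
public static class ABgoEventAvroReader
        extends AvroRecordReaderBase<ABgoEventAvroKey, NullWritable, ABgoEvent> {

    public ABgoEventAvroReader() {
        super(ABgoEvent.getClassSchema()); // the reader schema
    }

    @Override
    public ABgoEventAvroKey getCurrentKey() {
        ABgoEventAvroKey key = new ABgoEventAvroKey();
        key.datum(getCurrentRecord()); // getCurrentRecord() comes from AvroRecordReaderBase
        return key;
    }

    @Override
    public NullWritable getCurrentValue() {
        return NullWritable.get();
    }
}
{code}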

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> ------------------------------------------------------------------
>
>                 Key: SPARK-19656
>                 URL: https://issues.apache.org/jira/browse/SPARK-19656
>             Project: Spark
>          Issue Type: Question
>          Components: Java API
>    Affects Versions: 2.0.2
>            Reporter: Nira Amit
>
> If I understand correctly, in Scala it's possible to load custom objects from Avro files into RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
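> (A direct, near-literal translation of that call doesn't work as-is in Java, because class literals can't be parameterized; as far as I can tell it would need unchecked casts, something like this sketch, which I haven't verified:)
> {code}
> JavaPairRDD<AvroWrapper<MyClassInAvroFile>, NullWritable> records =
>     sc.hadoopFile("/path/to/the/avro/file.avro",
>         (Class<AvroInputFormat<MyClassInAvroFile>>) (Class<?>) AvroInputFormat.class,
>         (Class<AvroWrapper<MyClassInAvroFile>>) (Class<?>) AvroWrapper.class,
>         NullWritable.class);
> {code}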
> I'm not a Scala developer, so I tried to "translate" this to Java as best I could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey<MyCustomClass> {}
> public static class MyCustomAvroReader extends AvroRecordReaderBase<MyCustomAvroKey, NullWritable, MyCustomClass> {
>     // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends FileInputFormat<MyCustomAvroKey, NullWritable> {
>     @Override
>     public RecordReader<MyCustomAvroKey, NullWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
>         return new MyCustomAvroReader();
>     }
> }
> ...
> JavaPairRDD<MyCustomAvroKey, NullWritable> records =
>     sc.newAPIHadoopFile("file:/path/to/datafile.avro",
>         MyCustomInputFormat.class, MyCustomAvroKey.class,
>         NullWritable.class,
>         sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `records.first()._1.datum()` actually returns a `GenericData$Record` at runtime, not a `MyCustomClass` instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to my.package.containing.MyCustomClass
> {code}
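> The same thing can be seen without the debugger by printing the runtime class of the datum (a minimal check):
> {code}
> // Assigning to Object avoids the implicit cast, so this runs without the exception:
> Object datum = records.first()._1.datum();
> // prints org.apache.avro.generic.GenericData$Record, not my.package.containing.MyCustomClass
> System.out.println(datum.getClass().getName());
> {code}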
> Am I doing it wrong? Or is this not possible in Java?


