Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2019/03/03 19:55:00 UTC

[jira] [Resolved] (SPARK-25405) Saving RDD with new Hadoop API file as a Sequence File too restrictive

     [ https://issues.apache.org/jira/browse/SPARK-25405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-25405.
-------------------------------
    Resolution: Not A Problem

Looks like you are using the old MapReduce OutputFormat classes (under org.apache.hadoop.mapred) with the 'new'-API methods in Spark. This isn't a bug; use the newer OutputFormat implementations under org.apache.hadoop.mapreduce.
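
For illustration, a minimal sketch of that change, reusing the RDD, configuration, and HBase key/value types from the snippet in the report below; the only substantive difference is that SequenceFileOutputFormat comes from the new-API package org.apache.hadoop.mapreduce.lib.output rather than from org.apache.hadoop.mapred:

{code:java}
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
// New-API ("mapreduce") output format, which saveAsNewAPIHadoopFile accepts:
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

// subsetRDD and hc are the RDD and Hadoop Configuration built in the report below.
subsetRDD.saveAsNewAPIHadoopFile(
  "output/sequence",
  classOf[ImmutableBytesWritable],
  classOf[Result],
  classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
  hc
)
{code}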

> Saving RDD with new Hadoop API file as a Sequence File too restrictive
> ----------------------------------------------------------------------
>
>                 Key: SPARK-25405
>                 URL: https://issues.apache.org/jira/browse/SPARK-25405
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.2.0
>            Reporter: Marcin Gasior
>            Priority: Major
>
> I tried to transform an HBase export (a sequence file) with a Spark job, and ran into a compilation issue:
>  
> {code:java}
> val hc = sc.hadoopConfiguration
> val serializers = List(
>   classOf[WritableSerialization].getName,
>   classOf[ResultSerialization].getName
> ).mkString(",")
> hc.set("io.serializations", serializers)
> val c = new Configuration(sc.hadoopConfiguration)
> c.set("mapred.input.dir", sourcePath)
> val subsetRDD = sc.newAPIHadoopRDD(
>   c,
>   classOf[SequenceFileInputFormat[ImmutableBytesWritable, Result]],
>   classOf[ImmutableBytesWritable],
>   classOf[Result])
> subsetRDD.saveAsNewAPIHadoopFile(
>   "output/sequence",
>   classOf[ImmutableBytesWritable],
>   classOf[Result],
>   classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
>   hc
> )
> {code}
>  
>  
> During compilation I received:
> {code:java}
> Error: type mismatch;
>  found   : Class[org.apache.hadoop.mapred.SequenceFileOutputFormat[org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.hbase.client.Result]](classOf[org.apache.hadoop.mapred.SequenceFileOutputFormat])
>  required: Class[_ <: org.apache.hadoop.mapreduce.OutputFormat[_, _]]
>   classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
> {code}
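> 
> The mismatch is between Hadoop's two MapReduce APIs: the error shows the old-API class was picked up, while saveAsNewAPIHadoopFile is typed against the new-API OutputFormat. A minimal illustration of the two imports involved (inferred from the error message above, nothing else assumed):
> {code:java}
> // Old ("mapred") API -- what the error says was imported; it does not
> // extend org.apache.hadoop.mapreduce.OutputFormat, hence the mismatch:
> //   import org.apache.hadoop.mapred.SequenceFileOutputFormat
>
> // New ("mapreduce") API -- the type that saveAsNewAPIHadoopFile expects:
> import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
> {code}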
>  
> Using the low-level Hadoop API, I could work around the issue as follows:
> {code:java}
> val writer = SequenceFile.createWriter(hc, Writer.file(new Path("sample")),
>   Writer.keyClass(classOf[ImmutableBytesWritable]),
>   Writer.valueClass(classOf[Result]),
>   Writer.bufferSize(fs.getConf().getInt("io.file.buffer.size",4096)),
>   Writer.replication(fs.getDefaultReplication()),
>   Writer.blockSize(1073741824),
>   Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec()),
>   Writer.progressable(null),
>   Writer.metadata(new Metadata()))
> subset.foreach(p => writer.append(p._1, p._2))
> IOUtils.closeStream(writer)
> {code}
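> 
> One caveat on this workaround: SequenceFile.Writer is not serializable, so the append loop has to run on the driver rather than inside an RDD closure. A sketch of that, under the assumption that subset here is the subsetRDD built above (toLocalIterator stands in for whatever actually feeds the writer):
> {code:java}
> // Stream partitions back to the driver one at a time and append locally;
> // the Writer itself never leaves the driver.
> subsetRDD.toLocalIterator.foreach { case (k, v) => writer.append(k, v) }
> IOUtils.closeStream(writer)
> {code}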
>  
> I think the interface is too restrictive and does not allow passing external (de)serializers.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org