Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2019/03/03 19:55:00 UTC
[jira] [Resolved] (SPARK-25405) Saving RDD with new Hadoop API file as a Sequence File too restrictive
[ https://issues.apache.org/jira/browse/SPARK-25405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-25405.
-------------------------------
Resolution: Not A Problem
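It looks like you are using the old Hadoop OutputFormat classes (org.apache.hadoop.mapred) with the 'new' API methods in Spark. This isn't a bug. Use the newer OutputFormat implementations under org.apache.hadoop.mapreduce; for sequence files that is org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.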
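As a rough, untested sketch of that change, the snippet from the report might look like the following once the input and output formats both come from the new-API packages. The method name copyExport and the targetPath parameter are hypothetical, introduced only for illustration; the import locations for ResultSerialization and the other classes are assumed from the standard HBase and Hadoop packages, and the "mapred.input.dir" key is kept as the report uses it.

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.ResultSerialization
import org.apache.hadoop.io.serializer.WritableSerialization
// New-API formats live under org.apache.hadoop.mapreduce, not org.apache.hadoop.mapred.
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
import org.apache.spark.SparkContext

// Hypothetical helper: reads an HBase export and writes it back out as a sequence file.
def copyExport(sc: SparkContext, sourcePath: String, targetPath: String): Unit = {
  // Work on a copy so the shared SparkContext configuration is left untouched.
  val conf = new Configuration(sc.hadoopConfiguration)

  // Register the serializers needed for Writable keys and HBase Result values.
  val serializers = List(
    classOf[WritableSerialization].getName,
    classOf[ResultSerialization].getName
  ).mkString(",")
  conf.set("io.serializations", serializers)
  conf.set("mapred.input.dir", sourcePath)

  val rdd = sc.newAPIHadoopRDD(
    conf,
    classOf[SequenceFileInputFormat[ImmutableBytesWritable, Result]],
    classOf[ImmutableBytesWritable],
    classOf[Result])

  // The output format class now satisfies Class[_ <: org.apache.hadoop.mapreduce.OutputFormat[_, _]].
  rdd.saveAsNewAPIHadoopFile(
    targetPath,
    classOf[ImmutableBytesWritable],
    classOf[Result],
    classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
    conf)
}
{code}

The Configuration passed as the last argument carries the io.serializations setting, so external serializers should be available to the writer without dropping down to SequenceFile.createWriter.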
> Saving RDD with new Hadoop API file as a Sequence File too restrictive
> ----------------------------------------------------------------------
>
> Key: SPARK-25405
> URL: https://issues.apache.org/jira/browse/SPARK-25405
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 2.2.0
> Reporter: Marcin Gasior
> Priority: Major
>
> I tried to transform an HBase export (sequence file) using a Spark job and hit a compilation error:
>
> {code:java}
> val hc = sc.hadoopConfiguration
> val serializers = List(
>   classOf[WritableSerialization].getName,
>   classOf[ResultSerialization].getName
> ).mkString(",")
> hc.set("io.serializations", serializers)
> val c = new Configuration(sc.hadoopConfiguration)
> c.set("mapred.input.dir", sourcePath)
> val subsetRDD = sc.newAPIHadoopRDD(
>   c,
>   classOf[SequenceFileInputFormat[ImmutableBytesWritable, Result]],
>   classOf[ImmutableBytesWritable],
>   classOf[Result])
> subsetRDD.saveAsNewAPIHadoopFile(
>   "output/sequence",
>   classOf[ImmutableBytesWritable],
>   classOf[Result],
>   classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
>   hc
> )
> {code}
>
>
> During compilation I received:
> {code:java}
> Error: type mismatch
> Class[org.apache.hadoop.mapred.SequenceFileOutputFormat[org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.hbase.client.Result]](classOf[org.apache.hadoop.mapred.SequenceFileOutputFormat])
> required: Class[_ <: org.apache.hadoop.mapreduce.OutputFormat[_, _]]
> classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
> {code}
>
> Using the low-level Hadoop API, I could work around the issue as follows:
> {code:java}
> val writer = SequenceFile.createWriter(hc, Writer.file(new Path("sample")),
>   Writer.keyClass(classOf[ImmutableBytesWritable]),
>   Writer.valueClass(classOf[Result]),
>   Writer.bufferSize(fs.getConf().getInt("io.file.buffer.size", 4096)),
>   Writer.replication(fs.getDefaultReplication()),
>   Writer.blockSize(1073741824),
>   Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec()),
>   Writer.progressable(null),
>   Writer.metadata(new Metadata()))
> subset.foreach(p => writer.append(p._1, p._2))
> IOUtils.closeStream(writer)
> {code}
>
> I think the interface is too restrictive and does not allow passing external (de)serializers.
>
>