Posted to issues@spark.apache.org by "Marcin Gasior (JIRA)" <ji...@apache.org> on 2018/09/11 14:24:00 UTC
[jira] [Created] (SPARK-25405) Saving RDD with new Hadoop API file as a Sequence File too restrictive
Marcin Gasior created SPARK-25405:
-------------------------------------
Summary: Saving RDD with new Hadoop API file as a Sequence File too restrictive
Key: SPARK-25405
URL: https://issues.apache.org/jira/browse/SPARK-25405
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 2.2.0
Reporter: Marcin Gasior
I tried to transform an HBase export (a sequence file) using a Spark job, and faced a compilation issue:
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.ResultSerialization
import org.apache.hadoop.io.serializer.WritableSerialization
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
// Old-API (mapred) output format; this import is what triggers the error below
import org.apache.hadoop.mapred.SequenceFileOutputFormat

// Register HBase's Result (de)serializer alongside the default Writable one
val hc = sc.hadoopConfiguration
val serializers = List(
  classOf[WritableSerialization].getName,
  classOf[ResultSerialization].getName
).mkString(",")
hc.set("io.serializations", serializers)

val c = new Configuration(sc.hadoopConfiguration)
c.set("mapred.input.dir", sourcePath) // sourcePath: location of the HBase export

val subsetRDD = sc.newAPIHadoopRDD(
  c,
  classOf[SequenceFileInputFormat[ImmutableBytesWritable, Result]],
  classOf[ImmutableBytesWritable],
  classOf[Result])

subsetRDD.saveAsNewAPIHadoopFile(
  "output/sequence",
  classOf[ImmutableBytesWritable],
  classOf[Result],
  classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
  hc
)
{code}
During compilation I received:
{code:java}
Error: type mismatch;
 found   : Class[org.apache.hadoop.mapred.SequenceFileOutputFormat[org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.hbase.client.Result]](classOf[org.apache.hadoop.mapred.SequenceFileOutputFormat])
 required: Class[_ <: org.apache.hadoop.mapreduce.OutputFormat[_, _]]
classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
{code}
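The found type comes from the old org.apache.hadoop.mapred package, while the method is bounded on the new org.apache.hadoop.mapreduce API. Roughly, the relevant signature in PairRDDFunctions (paraphrased, not verbatim) is:

{code:java}
// Paraphrased signature from org.apache.spark.rdd.PairRDDFunctions (Spark 2.2);
// the bound on outputFormatClass rejects any old-API (mapred) output format.
def saveAsNewAPIHadoopFile(
    path: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[_ <: org.apache.hadoop.mapreduce.OutputFormat[_, _]],
    conf: Configuration): Unit
{code}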
By using the low-level Hadoop API I could work around the issue as follows:
{code:java}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{IOUtils, SequenceFile}
import org.apache.hadoop.io.SequenceFile.{Metadata, Writer}
import org.apache.hadoop.io.compress.DefaultCodec

val fs = FileSystem.get(hc)
val writer = SequenceFile.createWriter(hc, Writer.file(new Path("sample")),
  Writer.keyClass(classOf[ImmutableBytesWritable]),
  Writer.valueClass(classOf[Result]),
  Writer.bufferSize(fs.getConf().getInt("io.file.buffer.size", 4096)),
  Writer.replication(fs.getDefaultReplication()),
  Writer.blockSize(1073741824),
  Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec()),
  Writer.progressable(null),
  Writer.metadata(new Metadata()))

// The writer lives on the driver, so iterate the RDD locally
// rather than appending from a distributed foreach
subsetRDD.toLocalIterator.foreach(p => writer.append(p._1, p._2))
IOUtils.closeStream(writer)
{code}
I think the interface is too restrictive and does not allow passing external (de)serializers.
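For completeness, a minimal sketch that satisfies the type bound, assuming the new-API SequenceFileOutputFormat from org.apache.hadoop.mapreduce.lib.output is substituted (whether the ResultSerialization configured above is honored at runtime is a separate question):

{code:java}
// Hypothetical variant: the mapreduce-package SequenceFileOutputFormat
// extends org.apache.hadoop.mapreduce.OutputFormat, so this call type-checks.
import org.apache.hadoop.mapreduce.lib.output.{SequenceFileOutputFormat => NewSequenceFileOutputFormat}

subsetRDD.saveAsNewAPIHadoopFile(
  "output/sequence",
  classOf[ImmutableBytesWritable],
  classOf[Result],
  classOf[NewSequenceFileOutputFormat[ImmutableBytesWritable, Result]],
  hc
)
{code}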