Posted to issues@spark.apache.org by "Marcin Gasior (JIRA)" <ji...@apache.org> on 2018/09/11 14:24:00 UTC

[jira] [Created] (SPARK-25405) Saving RDD with new Hadoop API file as a Sequence File too restrictive

Marcin Gasior created SPARK-25405:
-------------------------------------

             Summary: Saving RDD with new Hadoop API file as a Sequence File too restrictive
                 Key: SPARK-25405
                 URL: https://issues.apache.org/jira/browse/SPARK-25405
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 2.2.0
            Reporter: Marcin Gasior


I tried to transform an HBase export (a sequence file) using a Spark job, and faced a compilation issue:

 
{code:java}

val hc = sc.hadoopConfiguration

val serializers = List(
  classOf[WritableSerialization].getName,
  classOf[ResultSerialization].getName
).mkString(",")

hc.set("io.serializations", serializers)

val c = new Configuration(sc.hadoopConfiguration)
c.set("mapred.input.dir", sourcePath)
val subsetRDD = sc.newAPIHadoopRDD(
  c,
  classOf[SequenceFileInputFormat[ImmutableBytesWritable, Result]],
  classOf[ImmutableBytesWritable],
  classOf[Result])

subsetRDD.saveAsNewAPIHadoopFile(
  "output/sequence",
  classOf[ImmutableBytesWritable],
  classOf[Result],
  classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
  hc
)
{code}
 

 

During compilation I received:
{code:java}
Error: type mismatch;
 found   : Class[org.apache.hadoop.mapred.SequenceFileOutputFormat[org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.hbase.client.Result]](classOf[org.apache.hadoop.mapred.SequenceFileOutputFormat])
 required: Class[_ <: org.apache.hadoop.mapreduce.OutputFormat[_, _]]
   classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
{code}
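Judging from the error, the compiler resolved SequenceFileOutputFormat to the old-API class in org.apache.hadoop.mapred, while saveAsNewAPIHadoopFile only accepts org.apache.hadoop.mapreduce output formats. A minimal sketch of the same call with the new-API class (assuming org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat is imported instead); this only gets past the type mismatch and does not by itself address the (de)serialization concern described below:

{code:java}
// Sketch: the same save, but with the new-API (mapreduce) sequence file output format.
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

subsetRDD.saveAsNewAPIHadoopFile(
  "output/sequence",
  classOf[ImmutableBytesWritable],
  classOf[Result],
  classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
  hc) // hc still carries the io.serializations setting from above
{code}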
 

By using the low-level Hadoop API I could work around the issue as follows:
{code:java}
val writer = SequenceFile.createWriter(hc, Writer.file(new Path("sample")),
  Writer.keyClass(classOf[ImmutableBytesWritable]),
  Writer.valueClass(classOf[Result]),
  Writer.bufferSize(fs.getConf().getInt("io.file.buffer.size", 4096)),
  Writer.replication(fs.getDefaultReplication()),
  Writer.blockSize(1073741824),
  Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec()),
  Writer.progressable(null),
  Writer.metadata(new Metadata()))

subset.foreach(p => writer.append(p._1, p._2))

IOUtils.closeStream(writer)
{code}
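If subset above is still the RDD from the first snippet, the foreach closure would have to ship the (non-serializable) writer to the executors. A driver-side variant of the same append loop, sketched here under the assumption that the data is small enough to stream back with toLocalIterator:

{code:java}
// Sketch: keep the writer on the driver and pull the pairs back,
// instead of shipping the writer inside the foreach closure.
subsetRDD.toLocalIterator.foreach { case (k, v) => writer.append(k, v) }
IOUtils.closeStream(writer)
{code}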
 

I think the interface is too restrictive and does not allow passing external (de)serializers.
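For reference, the parameter that triggers the mismatch is the output format class of PairRDDFunctions.saveAsNewAPIHadoopFile, whose shape is roughly (paraphrased, not the exact 2.2.0 source):

{code:java}
// Rough shape of the overload used above (paraphrased):
def saveAsNewAPIHadoopFile(
    path: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[_ <: org.apache.hadoop.mapreduce.OutputFormat[_, _]],
    conf: Configuration): Unit
{code}

So only new-API (org.apache.hadoop.mapreduce) output formats are accepted, and any custom (de)serialization can only be expressed indirectly through the Hadoop Configuration (io.serializations), not passed to the save call itself.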

 

 


