Posted to user@spark.apache.org by Enno Shioji <es...@gmail.com> on 2014/12/30 12:26:24 UTC

[SOLVED] Re: Writing and reading sequence file results in trailing extra data

This poor soul had the exact same problem and solution:

http://stackoverflow.com/questions/24083332/write-and-read-raw-byte-arrays-in-spark-using-sequence-file-sequencefile
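
For anyone hitting this later: the cause is that BytesWritable.getBytes returns
the whole backing buffer, which Hadoop reuses and over-allocates, so it can be
longer than the actual record; only the first getLength bytes are valid. Below
is a minimal sketch of the read side with that fix applied (paths, names and
setup are placeholders, not the exact code from the quoted mail; on Hadoop 2.x,
v.copyBytes() should be equivalent to the copyOfRange call):

    import java.util.Arrays

    import javax.xml.bind.DatatypeConverter
    import org.apache.hadoop.io.{BytesWritable, NullWritable}
    import org.apache.spark.{SparkConf, SparkContext}

    object ReadFixed {
      def main(args: Array[String]) {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[4]").setAppName("Read Fixed"))

        val logData = sc.sequenceFile("/tmp/stored", classOf[NullWritable], classOf[BytesWritable])

        logData
          .map { case (_, v) =>
            // Copy only the first getLength bytes; anything past that is
            // leftover padding from the reused, over-allocated buffer.
            val bytes = Arrays.copyOfRange(v.getBytes, 0, v.getLength)
            DatatypeConverter.printBase64Binary(bytes)
          }
          .saveAsTextFile("/tmp/output-fixed")
      }
    }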

On Tue, Dec 30, 2014 at 10:58 AM, Enno Shioji <es...@gmail.com> wrote:

> Hi, I'm facing a weird issue. Any help appreciated.
>
> When I execute the code below and compare "input" and "output", each
> record in the output has some extra trailing data appended to it and is
> hence corrupted. I'm just reading and writing, so the input and output
> should be exactly the same.
>
> I'm using spark-core 1.2.0_2.10 and the Hadoop bundled with it
> (hadoop-common: 2.2.0, hadoop-core: 1.2.1). I also confirmed that the binary
> is fine at the time it's passed to the Hadoop classes, and that it already
> has the extra data while still inside the Hadoop classes (I guess this makes
> it more of a Hadoop question...).
>
> Code:
> =====
>   import javax.xml.bind.DatatypeConverter
>
>   import org.apache.hadoop.io.{BytesWritable, NullWritable}
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.SparkContext._  // pair-RDD implicits (saveAsSequenceFile) in Spark 1.2
>
>   def main(args: Array[String]) {
>     val conf = new SparkConf()
>       .setMaster("local[4]")
>       .setAppName("Simple Application")
>
>     val sc = new SparkContext(conf)
>
>     // input.txt is a text file with some Base64-encoded binaries stored as lines
>     val src = sc
>       .textFile("input.txt")
>       .map(DatatypeConverter.parseBase64Binary)
>       .map(x => (NullWritable.get(), new BytesWritable(x)))
>       .saveAsSequenceFile("s3n://fake-test/stored")
>
>     val file = "s3n://fake-test/stored"
>     val logData = sc.sequenceFile(file, classOf[NullWritable], classOf[BytesWritable])
>
>     val count = logData
>       .map { case (k, v) => v }
>       .map(x => DatatypeConverter.printBase64Binary(x.getBytes))
>       .saveAsTextFile("/tmp/output")
>
>   }
>
>