Posted to user@hive.apache.org by Ted Yu <yu...@gmail.com> on 2016/03/09 23:50:03 UTC

Re: binary file deserialization

bq. there is a varying number of items for that record

If the number of possible item combinations is large, modeling them with
case classes would be tedious.
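
One way around that is to skip case classes entirely and build Rows against
an explicit schema with an array column. A minimal sketch (the "parsed" RDD
and field names are hypothetical, standing in for the output of the custom
deserializer):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// one variable-length array column instead of a case class per record shape
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("items", ArrayType(DoubleType), nullable = true)))

// parsed: RDD[(Long, Array[Double])] from the custom deserializer
val rows = parsed.map { case (id, items) => Row(id, items.toSeq) }
val df = sqlContext.createDataFrame(rows, schema)
df.write.saveAsTable("legacy_records")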

On Wed, Mar 9, 2016 at 9:57 AM, Saurabh Bajaj <ba...@gmail.com>
wrote:

> You can load that binary file as a String RDD, then map over that RDD and
> convert each row to a case class representing the data. Alternatively, the
> map stage could emit each record as a JSON string, giving an RDD of JSON
> values, and then use the following API to convert it into a DataFrame:
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>
> val anotherPeople = sqlContext.read.json(anotherPeopleRDD)
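>
> A rough sketch of that JSON route (the "parsed" RDD and field names are
> made up, assuming your deserializer yields (id, items) pairs):
>
> import org.apache.spark.rdd.RDD
>
> // render each parsed record as a JSON string
> val jsonRDD: RDD[String] = parsed.map { case (id, items) =>
>   s"""{"id": $id, "items": [${items.mkString(",")}]}"""
> }
> val df = sqlContext.read.json(jsonRDD)  // schema is inferred from the JSON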
>
>
> On Wed, Mar 9, 2016 at 9:15 AM, Ruslan Dautkhanov <da...@gmail.com>
> wrote:
>
>> We have a huge binary file in a custom serialization format (e.g. a header
>> gives the length of each record, which is then followed by a varying number
>> of items). It is produced by an old C++ application.
>> What would be the best approach to deserialize it into a Hive table or a
>> Spark RDD?
>> The format is known and well documented.
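>>
>> For concreteness, here is the kind of parsing we have in mind, as a rough
>> sketch (the layout here is hypothetical: a 4-byte big-endian item count
>> followed by that many 8-byte doubles; the real format differs):
>>
>> import java.nio.ByteBuffer
>> import org.apache.spark.sql.Row
>>
>> // one (path, stream) pair per file; binaryFiles is not splittable,
>> // so each file is read whole by a single task
>> val rows = sc.binaryFiles("hdfs:///data/legacy/*.bin").flatMap {
>>   case (_, stream) =>
>>     val buf = ByteBuffer.wrap(stream.toArray())
>>     val recs = scala.collection.mutable.ArrayBuffer[Row]()
>>     while (buf.remaining() >= 4) {
>>       val n = buf.getInt()                       // record header: item count
>>       recs += Row(Seq.fill(n)(buf.getDouble()))  // the record's items
>>     }
>>     recs
>> }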
>>
>>
>> --
>> Ruslan Dautkhanov
>>
>
>