Posted to user@hive.apache.org by "Connell, Chuck" <Ch...@nuance.com> on 2012/12/01 17:50:47 UTC

BINARY column type

I am trying to use BINARY columns and believe I have the perfect use case for them, but I am missing something. Has anyone used this type for true binary data (which may contain newlines)?


Here is the background... I have some files that each contain just one logical field, which is a binary object. (The files are Google Protobuf format.) I want to put these binary files into a larger file, where each protobuf is a logical record. Then I want to define a Hive table that stores each protobuf as one row, with the entire protobuf object in one BINARY column. Then I will use a custom UDF to select/query the binary object.


This is about as simple as can be for putting binary data into Hive.


What file format should I use to package the binary rows? What should the Hive table definition be? Which SerDe option should I use (LazySimpleBinary?)? I cannot use TEXTFILE, since the binary may contain newlines; many of my attempts have choked on them.
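
For concreteness, here is a minimal sketch of the kind of table definition in question. The table and column names are made up, and the storage format is part of what is being asked; RCFile appears here only because a reply further down reports success with it:

    CREATE TABLE protobuf_records (
      pb BINARY            -- one serialized protobuf object per row
    )
    STORED AS RCFILE;      -- a binary-safe format; TEXTFILE breaks on embedded newlines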


Thank you,

Chuck Connell

Nuance

Burlington, MA


RE: BINARY column type

Posted by "Connell, Chuck" <Ch...@nuance.com>.
Thanks. I may be able to use some of your ideas.

I still feel like there ought to be a way to pack true binary records into a larger file, import that file into Hive, store each record as one BINARY field, then select/query the field with a UDF (or a SerDe). This seems to me like the simplest base case for binary data.
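
To illustrate the query end of that pipeline, here is a hedged sketch; the jar, class, and function names are entirely hypothetical, not an existing library:

    ADD JAR protobuf-udf.jar;
    CREATE TEMPORARY FUNCTION parse_pb AS 'com.example.hive.ParseProtobufUDF';

    -- each row's pb column holds one whole protobuf; the UDF pulls a named field out of it
    SELECT parse_pb(pb, 'user_id')
    FROM protobuf_records;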



Re: BINARY column type

Posted by John Omernik <jo...@omernik.com>.
Ya, for me it's pcap data, so I had to take the data and process it out of the pcaps into something serialized for Hive anyhow. In that case, I took the pcaps and loaded them with a transform. My transform script took a single file name in on STDIN, read the PCAP, parsed out the packets in the format I wanted, then took the raw data for each packet and hex-encoded it as it wrote it to STDOUT. My INSERT statement took the results of the pcap parsing script (including the hexed data) and unhexed it at insert. There may be a better way to do this, but for me it works well. *shrug*
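
A rough HiveQL sketch of that flow; every table, column, and script name below is invented for illustration, and the parsing script itself is not shown:

    -- assumes a destination table such as:
    --   CREATE TABLE packets (ts STRING, src_ip STRING, dst_ip STRING, payload BINARY) STORED AS RCFILE;
    ADD FILE parse_pcap.py;    -- hypothetical script: reads a pcap path from STDIN, writes one hexed packet per line

    INSERT OVERWRITE TABLE packets
    SELECT ts, src_ip, dst_ip, unhex(hexed_payload)      -- unhex() turns the hex string back into raw BINARY bytes
    FROM (
      SELECT TRANSFORM (pcap_path)
             USING 'python parse_pcap.py'
             AS (ts, src_ip, dst_ip, hexed_payload)      -- transform output columns default to STRING
      FROM pcap_files
    ) parsed;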




RE: BINARY column type

Posted by "Connell, Chuck" <Ch...@nuance.com>.
The hex idea is clever. But does this mean that the files you brought into Hive (with a LOAD statement) were essentially ASCII (hexed), not raw binary?
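
For reference, a sketch of what such an ASCII (hexed) staging file and LOAD might look like; all table names, column names, and paths are hypothetical:

    -- the staged file is ordinary delimited text; the binary payload travels as a hex string
    CREATE TABLE staging_hexed (
      meta STRING,
      hexed_sourcedata STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    LOAD DATA INPATH '/tmp/packets_hexed.tsv' INTO TABLE staging_hexed;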



Re: BINARY column type

Posted by John Omernik <jo...@omernik.com>.
No, I didn't remove any newline characters; a newline simply became 0a. Using perl or python in a transform, if I had "Hi how are you\n" it would become 486920686f772061726520796f750a.

From there it would pass that to the unhex() function in Hive in the INSERT statement. That allowed me to move the data, newlines included, around easily, but on the final step (on insert) it would unhex it and put it in as actual binary; no bytes were harmed in the hexing (or unhexing) of my data.
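
Since hex() and unhex() are Hive built-in functions, the round trip is easy to sanity-check from the Hive CLI; for example:

    SELECT hex('Hi how are you\n');                    -- the raw bytes as an ASCII hex string, newline included as 0a
    SELECT unhex('486920686f772061726520796f750a');    -- the inverse: the original bytes back, as a BINARY value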




RE: BINARY column type

Posted by "Connell, Chuck" <Ch...@nuance.com>.
Thanks John. When you say "hexed" data, do you mean binary encoded as ASCII hex? This would remove the raw newline characters.

We considered Base64 encoding our data, a similar idea, which would also remove raw newlines. But my preference is to put real binary data into Hive, and find a way to make this work.

Chuck


Re: BINARY column type

Posted by John Omernik <jo...@omernik.com>.
Hi Chuck -

I've used binary columns with newlines in the data. I used the RCFile format for my storage method, and it works great so far. Whether or not this is "the" way to get data in, I use hexed data (my transform script outputs hex-encoded data) and the final insert into the table gets an unhex(sourcedata). That's never been a problem for me; it seems a bit hackish, but it works well.
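
A condensed sketch of that arrangement; every name below is made up, and staging_hexed stands for whatever intermediate table holds the transform script's hex-encoded output:

    CREATE TABLE binary_table (
      meta STRING,
      sourcedata BINARY              -- the real bytes, embedded newlines and all
    )
    STORED AS RCFILE;

    -- final step: convert the hex strings back into raw bytes as the rows go in
    INSERT OVERWRITE TABLE binary_table
    SELECT meta, unhex(hexed_sourcedata)
    FROM staging_hexed;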
