Posted to user@hive.apache.org by Viraj Bhat <vi...@yahoo-inc.com> on 2010/06/08 19:50:47 UTC

Converting types from java HashMap, Long and Array to BytesWritable for RCFileOutputFormat

Hi all,
  I am working on an M/R program to convert Zebra data to Hive RC
format. 

The TableInputFormat (Zebra) returns keys and values in the form of
BytesWritable and (Pig) Tuple.

In order to convert it for the RCFileOutputFormat, whose key is
"BytesWritable" and whose value is "BytesRefArrayWritable", I need to
take in a Pig Tuple, iterate over each of its fields, and convert each
one to a "BytesRefWritable".

The easy part is for Strings, which can be converted to BytesRefWritable
as:

BytesRefArrayWritable myvalue = new BytesRefArrayWritable(10);
// value is a Pig Tuple and get(0) returns a String
String s = (String) value.get(0);
myvalue.set(0, new BytesRefWritable(s.getBytes("UTF-8")));



How do I do it for java "Long", "HashMap" and "Arrays"? This is what I
have tried so far:

// value is a Pig Tuple and get(1) returns a Long
Long l = (Long) value.get(1);
myvalue.set(1, new BytesRefWritable(l.toString().getBytes("UTF-8")));

// get(2) returns a Map; is its toString() something Hive can read back?
HashMap<String, Object> hm =
    new HashMap<String, Object>((Map<String, Object>) value.get(2));
myvalue.set(2, new BytesRefWritable(hm.toString().getBytes("UTF-8")));


Would the toString() method work? If I then re-read the RC format data
through "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe", would it
be interpreted correctly?
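
Or, rather than relying on toString(), do I have to lay out the bytes
myself using Hive's separators? A rough, untested sketch of what I mean
for the map column, assuming the LazySimpleSerDe defaults of '\002'
between map entries and '\003' between a key and its value:

// hm and myvalue are from the snippet above (needs java.util.Map imported)
StringBuilder sb = new StringBuilder();
boolean first = true;
for (Map.Entry<String, Object> e : hm.entrySet()) {
    if (!first) {
        sb.append('\002');   // separates the entries of the map
    }
    sb.append(e.getKey()).append('\003').append(e.getValue());
    first = false;
}
myvalue.set(2, new BytesRefWritable(sb.toString().getBytes("UTF-8")));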

Is there any documentation on this?

Any suggestions would be beneficial.

Viraj

RE: Converting types from java HashMap, Long and Array to BytesWritable for RCFileOutputFormat

Posted by Viraj Bhat <vi...@yahoo-inc.com>.
Hi Yongqiang,
 Thanks again for your help. Using a serde tailored for Zebra is
definitely a convenient way to avoid having to convert the data.
When you mention complex types, do you mean bags and hashmaps?

I was also interested in whether there is a way to do this in M/R by
calling the appropriate serde objects to convert to BytesRefWritable; I
need to investigate something like the sketch below.
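
Something along these lines is what I have in mind (an untested sketch;
I am assuming that ColumnarSerDe.serialize() hands back a
BytesRefArrayWritable and that the "columns"/"columns.types" properties
are enough to initialize it; the column names and types below are made
up for the example):

// classes come from org.apache.hadoop.hive.serde2.*, org.apache.hadoop.conf
// and java.util; SerDeException handling omitted for brevity
Properties tbl = new Properties();
tbl.setProperty("columns", "name,cnt,props");
tbl.setProperty("columns.types", "string:bigint:map<string,string>");

ColumnarSerDe serde = new ColumnarSerDe();
serde.initialize(new Configuration(), tbl);

ObjectInspector rowOI = ObjectInspectorFactory.getStandardStructObjectInspector(
    Arrays.asList("name", "cnt", "props"),
    Arrays.<ObjectInspector>asList(
        PrimitiveObjectInspectorFactory.javaStringObjectInspector,
        PrimitiveObjectInspectorFactory.javaLongObjectInspector,
        ObjectInspectorFactory.getStandardMapObjectInspector(
            PrimitiveObjectInspectorFactory.javaStringObjectInspector,
            PrimitiveObjectInspectorFactory.javaStringObjectInspector)));

// row assembled from the Pig Tuple fields (String, Long, Map)
Object row = Arrays.asList(value.get(0), value.get(1), value.get(2));
BytesRefArrayWritable braw = (BytesRefArrayWritable) serde.serialize(row, rowOI);
// braw would then be emitted as the value for RCFileOutputFormat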

Meanwhile, does anyone else in this group have experience writing M/R
programs that convert data to RC format?
Viraj

-----Original Message-----
From: Yongqiang He [mailto:heyongqiangict@gmail.com] 
Sent: Thursday, June 10, 2010 12:24 AM
To: Viraj Bhat; hive-user@hadoop.apache.org
Cc: Harmeek Singh Bedi
Subject: Re: Converting types from java HashMap, Long and Array to
BytesWritable for RCFileOutputFormat

Please see my inline comments.
Please correct me if I am wrong about the serde layer.

Thanks
Yongqiang
On 6/9/10 11:24 PM, "Viraj Bhat" <vi...@yahoo-inc.com> wrote:

> Hi Yongqiang and Hive users,
>  In my Map Reduce program I have HashMap's and Array of HashMap's,
which
> I need to convert to BytesRefWritable for using the RCFileOutputFormat
> (which uses values as BytesRefWritable). I am then planning to re-read
> this data using the "ROW FORMAT SERDE
> "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
> 
> Here are the questions I have about the steps to be followed:
> 
> 1) Should I take the columnarserde code and write my own serde since I
> have HashMaps and Array of HashMaps?

I think you do not need to write your own serde. Hive's serdes support
complex types and nested complex types.

> 
> 2) Where should I specify the separators I need to use for the
HashMaps
> and Array of HashMaps I am creating?
If you are writing out the data and want to use Hive's serde to read it
back, you can just use Hive's default separators (which are defined in
LazySimpleSerDe.java).
> 
> 3) Should I be using LazyArray, LazyMap objects in my M/R program to
get
> the required serializations?
If you want to use hive's built-in serde, you don't need to.
> 
> 4) If I write out my original data using TextFormat instead of
> RCFileOutputFormat and make Hive read it as an external table and then
> store the corresponding results to RCFormat using Hive DDL commands,
how
> does Hive convert to RC here. A) Can it do that?  b) If it did that
what
> are the separators that are used in this case?
A) Yes, it can do that.
B) The separators used come from the table's metadata. If none are
defined, it will use the defaults defined in LazySimpleSerDe.
  
As long as the data can be parsed by Hive, Hive can convert it into
whatever format you want. So you need Hive to be able to parse your text
format (again, be careful about separators).
Basically, Hive uses a deserializer to de-serialize the input data into
Hive's built-in types and a serializer to serialize the data back out to
HDFS.

I have attached some code that lets Hive parse a Zebra table, which uses
Pig's Tuple as its data type. Right now it works well with primitive Pig
types, but it should not be very difficult to extend it to handle complex
types.
I hope this code is helpful to you. The code most relevant to the serde is
under zebra/serde and in ZebraUtils.java.


Thanks
Yongqiang
> Any insights would be appreciated.
> 
> Thanks Viraj
> 
> 
> -----Original Message-----
> From: Yongqiang He [mailto:heyongqiangict@gmail.com]
> Sent: Tuesday, June 08, 2010 2:25 PM
> To: hive-user@hadoop.apache.org
> Subject: Re: Converting types from java HashMap, Long and Array to
> BytesWritable for RCFileOutputFormat
> 
> Hi Viraj
> 
> I recommend you to use Hive's columnserde/lazyserde's code to
serialize
> and
> deserialize the data. This can help you avoid write your own way to
> serialze/deserialize the data.
> 
> Basically, for primitives, it is easy to serialize and de-serialize.
But
> for
> complex types, you need to use separators.
> 
> Thanks
> Yongqiang
> On 6/8/10 10:50 AM, "Viraj Bhat" <vi...@yahoo-inc.com> wrote:
> 
>> Hi all,
>>   I am working on an M/R program to convert Zebra data to Hive RC
>> format. 
>> 
>> The TableInputFormat (Zebra) returns keys and values in the form of
>> BytesWritable and (Pig) Tuple.
>> 
>> In order to convert it to the RCFileOutputFormat whose key is
>> "BytesWritable and value is "BytesRefArrayWritable" I need to take in
> a
>> Pig Tuple iterate over each of its contents and convert it to
>> "BytesRefWritable".
>> 
>> The easy part is for Strings, which can be converted to
> BytesRefWritable
>> as:
>> 
>> myvalue = new BytesRefArrayWritable(10);
>> //value is a Pig Tuple and get returns a string
>> String s = (String)value.get(0);
>> myvalue.set(0, new BytesRefWritable(s.getBytes("UTF-8")));
>> 
>> 
>> 
>> How do I do it for java "Long", "HashMap" and "Arrays"
>> //value is a Pig tuple
>> Long l = new Long((Long)value.get(1));
>> myvalue.set(iter, new
> BytesRefWritable(l.toString().getBytes("UTF-8")));
>> myvalue.set(1, new BytesRefWritable(l.getBytes("UTF-8")));
>> 
>> 
>> HashMap<String, Object> hm = new
>> HashMap<String,Object>((HashMap)value.get(2));
>> 
>> myvalue.set(iter, new
>> BytesRefWritable(hm.toString().getBytes("UTF-8")));
>> 
>> 
>> Would the toString() method work? If I need to re-read RC format back
>> through the "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
> would
>> it interpret correctly?
>> 
>> Is there any documentation for the same?
>> 
>> Any suggestions would be beneficial.
>> 
>> Viraj
> 
> 


Re: Converting types from java HashMap, Long and Array to BytesWritable for RCFileOutputFormat

Posted by Yongqiang He <he...@gmail.com>.
Please see my inline comments.
Please correct me if I am wrong about the serde layer.

Thanks
Yongqiang
On 6/9/10 11:24 PM, "Viraj Bhat" <vi...@yahoo-inc.com> wrote:

> Hi Yongqiang and Hive users,
>  In my Map Reduce program I have HashMap's and Array of HashMap's, which
> I need to convert to BytesRefWritable for using the RCFileOutputFormat
> (which uses values as BytesRefWritable). I am then planning to re-read
> this data using the "ROW FORMAT SERDE
> "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
> 
> Here are the questions I have about the steps to be followed:
> 
> 1) Should I take the columnarserde code and write my own serde since I
> have HashMaps and Array of HashMaps?

I think you do not need to write your own serde. Hive's serdes support
complex types and nested complex types.

> 
> 2) Where should I specify the separators I need to use for the HashMaps
> and Array of HashMaps I am creating?
If you are writing out the data and want to use Hive's serde to read it
back, you can just use Hive's default separators (which are defined in
LazySimpleSerDe.java).
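
For example (untested, just to show the idea), with the defaults an
array of maps gets one more separator per nesting level: '\002' between
the array elements, '\003' between the entries of each map, and '\004'
between a key and its value. Please double check these against
LazySimpleSerDe.java:

// hypothetical column of type array<map<string,string>>
List<Map<String, String>> arr = new ArrayList<Map<String, String>>();
StringBuilder sb = new StringBuilder();
for (int i = 0; i < arr.size(); i++) {
    if (i > 0) {
        sb.append('\002');       // between array elements
    }
    boolean first = true;
    for (Map.Entry<String, String> e : arr.get(i).entrySet()) {
        if (!first) {
            sb.append('\003');   // between map entries
        }
        sb.append(e.getKey()).append('\004').append(e.getValue());
        first = false;
    }
}
// these bytes would back the BytesRefWritable for this column
byte[] columnBytes = sb.toString().getBytes("UTF-8");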
> 
> 3) Should I be using LazyArray, LazyMap objects in my M/R program to get
> the required serializations?
If you want to use hive's built-in serde, you don't need to.
> 
> 4) If I write out my original data using TextFormat instead of
> RCFileOutputFormat and make Hive read it as an external table and then
> store the corresponding results to RCFormat using Hive DDL commands, how
> does Hive convert to RC here. A) Can it do that?  b) If it did that what
> are the separators that are used in this case?
A) Yes, it can do that.
B) The separators used come from the table's metadata. If none are
defined, it will use the defaults defined in LazySimpleSerDe.
  
As long as the data can be parsed by Hive, Hive can convert it into
whatever format you want. So you need Hive to be able to parse your text
format (again, be careful about separators).
Basically, Hive uses a deserializer to de-serialize the input data into
Hive's built-in types and a serializer to serialize the data back out to
HDFS.
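
For example, the conversion in question 4 can be driven entirely from
Hive. A rough sketch using the Hive JDBC driver (the table names, column
list and path are made up, and I am assuming a Hive server is listening
on port 10000):

// imports: java.sql.Connection, DriverManager, Statement; exception
// handling omitted for brevity
Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
Connection con =
    DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
Statement stmt = con.createStatement();

// external table over the text data, parsed with the default separators
stmt.executeQuery("CREATE EXTERNAL TABLE zebra_text (name STRING, cnt BIGINT, "
    + "props MAP<STRING,STRING>) LOCATION '/user/viraj/zebra_text'");

// RC-format table using ColumnarSerDe and the RCFile input/output formats
stmt.executeQuery("CREATE TABLE zebra_rc (name STRING, cnt BIGINT, "
    + "props MAP<STRING,STRING>) "
    + "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe' "
    + "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat' "
    + "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'");

// Hive de-serializes the text rows and re-serializes them into RCFile
stmt.executeQuery("INSERT OVERWRITE TABLE zebra_rc SELECT * FROM zebra_text");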

I have attached some code that lets Hive parse a Zebra table, which uses
Pig's Tuple as its data type. Right now it works well with primitive Pig
types, but it should not be very difficult to extend it to handle complex
types.
I hope this code is helpful to you. The code most relevant to the serde is
under zebra/serde and in ZebraUtils.java.


Thanks
Yongqiang
> Any insights would be appreciated.
> 
> Thanks Viraj
> 
> 
> -----Original Message-----
> From: Yongqiang He [mailto:heyongqiangict@gmail.com]
> Sent: Tuesday, June 08, 2010 2:25 PM
> To: hive-user@hadoop.apache.org
> Subject: Re: Converting types from java HashMap, Long and Array to
> BytesWritable for RCFileOutputFormat
> 
> Hi Viraj
> 
> I recommend you to use Hive's columnserde/lazyserde's code to serialize
> and
> deserialize the data. This can help you avoid write your own way to
> serialze/deserialize the data.
> 
> Basically, for primitives, it is easy to serialize and de-serialize. But
> for
> complex types, you need to use separators.
> 
> Thanks
> Yongqiang
> On 6/8/10 10:50 AM, "Viraj Bhat" <vi...@yahoo-inc.com> wrote:
> 
>> Hi all,
>>   I am working on an M/R program to convert Zebra data to Hive RC
>> format. 
>> 
>> The TableInputFormat (Zebra) returns keys and values in the form of
>> BytesWritable and (Pig) Tuple.
>> 
>> In order to convert it to the RCFileOutputFormat whose key is
>> "BytesWritable and value is "BytesRefArrayWritable" I need to take in
> a
>> Pig Tuple iterate over each of its contents and convert it to
>> "BytesRefWritable".
>> 
>> The easy part is for Strings, which can be converted to
> BytesRefWritable
>> as:
>> 
>> myvalue = new BytesRefArrayWritable(10);
>> //value is a Pig Tuple and get returns a string
>> String s = (String)value.get(0);
>> myvalue.set(0, new BytesRefWritable(s.getBytes("UTF-8")));
>> 
>> 
>> 
>> How do I do it for java "Long", "HashMap" and "Arrays"
>> //value is a Pig tuple
>> Long l = new Long((Long)value.get(1));
>> myvalue.set(iter, new
> BytesRefWritable(l.toString().getBytes("UTF-8")));
>> myvalue.set(1, new BytesRefWritable(l.getBytes("UTF-8")));
>> 
>> 
>> HashMap<String, Object> hm = new
>> HashMap<String,Object>((HashMap)value.get(2));
>> 
>> myvalue.set(iter, new
>> BytesRefWritable(hm.toString().getBytes("UTF-8")));
>> 
>> 
>> Would the toString() method work? If I need to re-read RC format back
>> through the "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
> would
>> it interpret correctly?
>> 
>> Is there any documentation for the same?
>> 
>> Any suggestions would be beneficial.
>> 
>> Viraj
> 
> 


RE: Converting types from java HashMap, Long and Array to BytesWritable for RCFileOutputFormat

Posted by Viraj Bhat <vi...@yahoo-inc.com>.
Hi Yongqiang and Hive users,
 In my Map Reduce program I have HashMaps and Arrays of HashMaps, which
I need to convert to BytesRefWritable entries for use with the
RCFileOutputFormat (whose value is a BytesRefArrayWritable). I am then
planning to re-read this data using ROW FORMAT SERDE
"org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe".

Here are the questions I have about the steps to be followed:

1) Should I take the columnarserde code and write my own serde since I
have HashMaps and Array of HashMaps?

2) Where should I specify the separators I need to use for the HashMaps
and Array of HashMaps I am creating? 

3) Should I be using LazyArray, LazyMap objects in my M/R program to get
the required serializations?

4) If I write out my original data using TextFormat instead of
RCFileOutputFormat, make Hive read it as an external table, and then
store the corresponding results in RC format using Hive DDL commands, how
does Hive convert to RC here? A) Can it do that? B) If it does, what
separators are used in this case?

Any insights would be appreciated.

Thanks Viraj


-----Original Message-----
From: Yongqiang He [mailto:heyongqiangict@gmail.com] 
Sent: Tuesday, June 08, 2010 2:25 PM
To: hive-user@hadoop.apache.org
Subject: Re: Converting types from java HashMap, Long and Array to
BytesWritable for RCFileOutputFormat

Hi Viraj

I recommend using Hive's columnarserde/lazyserde code to serialize and
deserialize the data. This saves you from writing your own way of
serializing/deserializing the data.

Basically, for primitives, it is easy to serialize and de-serialize. But
for
complex types, you need to use separators.

Thanks
Yongqiang
On 6/8/10 10:50 AM, "Viraj Bhat" <vi...@yahoo-inc.com> wrote:

> Hi all,
>   I am working on an M/R program to convert Zebra data to Hive RC
> format. 
> 
> The TableInputFormat (Zebra) returns keys and values in the form of
> BytesWritable and (Pig) Tuple.
> 
> In order to convert it to the RCFileOutputFormat whose key is
> "BytesWritable and value is "BytesRefArrayWritable" I need to take in
a
> Pig Tuple iterate over each of its contents and convert it to
> "BytesRefWritable".
> 
> The easy part is for Strings, which can be converted to
BytesRefWritable
> as:
> 
> myvalue = new BytesRefArrayWritable(10);
> //value is a Pig Tuple and get returns a string
> String s = (String)value.get(0);
> myvalue.set(0, new BytesRefWritable(s.getBytes("UTF-8")));
> 
> 
> 
> How do I do it for java "Long", "HashMap" and "Arrays"
> //value is a Pig tuple
> Long l = new Long((Long)value.get(1));
> myvalue.set(iter, new
BytesRefWritable(l.toString().getBytes("UTF-8")));
> myvalue.set(1, new BytesRefWritable(l.getBytes("UTF-8")));
> 
> 
> HashMap<String, Object> hm = new
> HashMap<String,Object>((HashMap)value.get(2));
> 
> myvalue.set(iter, new
> BytesRefWritable(hm.toString().getBytes("UTF-8")));
> 
> 
> Would the toString() method work? If I need to re-read RC format back
> through the "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
would
> it interpret correctly?
> 
> Is there any documentation for the same?
> 
> Any suggestions would be beneficial.
> 
> Viraj



Re: Hive Web Interface Error

Posted by Vinithra Varadharajan <vi...@cloudera.com>.
CDH3 beta 1 does have Hive 0.5.0:
http://www.cloudera.com/blog/2010/03/cdh3-beta1-now-available/

-Vinithra

On Tue, Jun 8, 2010 at 12:10 PM, Edward Capriolo <ed...@gmail.com>wrote:

>
>
> On Tue, Jun 8, 2010 at 1:56 PM, Karthik <ka...@yahoo.com> wrote:
>
>> I'm using Hive 4.0 from CDH2 and I get this below error when I click on
>> the "Create Session" link and provide a value for session name and hit the
>> "submit query" button:
>>
>> Unexpected  while processing |-S|-h|-e|-f
>> log4j:WARN No appenders could be found for logger (org.mortbay.log).
>> log4j:WARN Please initialize the log4j system properly.
>>
>> This exception is printed on the server side (Jetty) logs and the page
>> (browser) hangs for ever trying something.  Any quick solution?
>>
>> Regards,
>> Karthik.
>>
>
> That is erroneous output which should be removed even though it does not
> cause a problem.
>
> Question 1? Are you using a JDBC metastore?
>
> http://wiki.apache.org/hadoop/HiveDerbyServerMode
>
> If you are not you can only have one hive session opened at once and the
> CLI will probably lock out the web interface.
>
> Any quick solution? hive 4.0 is an old release. I have not been tracking
> CDH but I bet they offer a hive 5.0 release. Update to that, take a non CDH
> release. or build your own hive from trunk.
>
> Edward
>

Re: What is the CLI eq. of "add jar

Posted by Edward Capriolo <ed...@gmail.com>.
On Tue, Jun 8, 2010 at 3:39 PM, Alex Kozlov <al...@cloudera.com> wrote:

> Hi Karthik,
>
> Do you have access to the cluster?  The simplest way is to put the jars
> into the cluster, $HADOOP_HOME/lib directories on each of the nodes.  This
> will require TTs restart though.
>
> Alex K
>
>
> On Tue, Jun 8, 2010 at 12:30 PM, Karthik <ka...@yahoo.com> wrote:
>
>> I need to pass some custom Java classes that I use as InputFormats and
>> SerDe classes to Hive Queries made from HWI.  I use "add jar <path>" from
>> CLI that works without any issues.  How do I do the same from HWI?
>>
>> I have set the "hive.aux.jars.path" path where the JAR files are, but that
>> is not passed on to the DataNodes as it's used only for Hive SerDe and not
>> for InputFormat classes that is needed by the MR jobs.
>>
>> Please advice.
>>
>> Regards,
>> Karthik.
>>
>
For the web interface, 'add jar' looks for the jar on the node that
started the web interface.

Regards,
Edward

Re: What is the CLI eq. of "add jar

Posted by Alex Kozlov <al...@cloudera.com>.
Hi Karthik,

Do you have access to the cluster?  The simplest way is to put the jars
into the $HADOOP_HOME/lib directory on each of the nodes.  This will
require a TaskTracker restart, though.

Alex K

On Tue, Jun 8, 2010 at 12:30 PM, Karthik <ka...@yahoo.com> wrote:

> I need to pass some custom Java classes that I use as InputFormats and
> SerDe classes to Hive Queries made from HWI.  I use "add jar <path>" from
> CLI that works without any issues.  How do I do the same from HWI?
>
> I have set the "hive.aux.jars.path" path where the JAR files are, but that
> is not passed on to the DataNodes as it's used only for Hive SerDe and not
> for InputFormat classes that is needed by the MR jobs.
>
> Please advice.
>
> Regards,
> Karthik.
>

What is the CLI eq. of "add jar

Posted by Karthik <ka...@yahoo.com>.
I need to pass some custom Java classes that I use as InputFormat and SerDe classes to Hive queries made from HWI.  I use "add jar <path>" from the CLI, which works without any issues.  How do I do the same from HWI?

I have set "hive.aux.jars.path" to the path where the JAR files are, but that is not passed on to the DataNodes, as it is used only for the Hive SerDe classes and not for the InputFormat classes needed by the MR jobs.

Please advise.

Regards,
Karthik.

Re: Hive Web Interface Error

Posted by Edward Capriolo <ed...@gmail.com>.
On Tue, Jun 8, 2010 at 1:56 PM, Karthik <ka...@yahoo.com> wrote:

> I'm using Hive 4.0 from CDH2 and I get this below error when I click on the
> "Create Session" link and provide a value for session name and hit the
> "submit query" button:
>
> Unexpected  while processing |-S|-h|-e|-f
> log4j:WARN No appenders could be found for logger (org.mortbay.log).
> log4j:WARN Please initialize the log4j system properly.
>
> This exception is printed on the server side (Jetty) logs and the page
> (browser) hangs for ever trying something.  Any quick solution?
>
> Regards,
> Karthik.
>

That is erroneous output which should be removed even though it does not
cause a problem.

Question 1: Are you using a JDBC metastore?

http://wiki.apache.org/hadoop/HiveDerbyServerMode

If you are not, you can only have one Hive session open at a time, and the
CLI will probably lock out the web interface.

Any quick solution? Hive 0.4.0 is an old release. I have not been tracking
CDH, but I bet they offer a Hive 0.5.0 release. Update to that, use a
non-CDH release, or build your own Hive from trunk.

Edward

Hive Web Interface Error

Posted by Karthik <ka...@yahoo.com>.
I'm using Hive 0.4.0 from CDH2, and I get the error below when I click on the "Create Session" link, provide a value for the session name, and hit the "submit query" button:

Unexpected  while processing |-S|-h|-e|-f
log4j:WARN No appenders could be found for logger (org.mortbay.log).
log4j:WARN Please initialize the log4j system properly.

This exception is printed in the server-side (Jetty) logs, and the page (browser) hangs forever.  Any quick solution?

Regards,
Karthik.

Re: Converting types from java HashMap, Long and Array to BytesWritable for RCFileOutputFormat

Posted by Yongqiang He <he...@gmail.com>.
Hi Viraj

I recommend using Hive's columnarserde/lazyserde code to serialize and
deserialize the data. This saves you from writing your own way of
serializing/deserializing the data.

Basically, for primitives, it is easy to serialize and de-serialize. But for
complex types, you need to use separators.

Thanks
Yongqiang
On 6/8/10 10:50 AM, "Viraj Bhat" <vi...@yahoo-inc.com> wrote:

> Hi all,
>   I am working on an M/R program to convert Zebra data to Hive RC
> format. 
> 
> The TableInputFormat (Zebra) returns keys and values in the form of
> BytesWritable and (Pig) Tuple.
> 
> In order to convert it to the RCFileOutputFormat whose key is
> "BytesWritable and value is "BytesRefArrayWritable" I need to take in a
> Pig Tuple iterate over each of its contents and convert it to
> "BytesRefWritable".
> 
> The easy part is for Strings, which can be converted to BytesRefWritable
> as:
> 
> myvalue = new BytesRefArrayWritable(10);
> //value is a Pig Tuple and get returns a string
> String s = (String)value.get(0);
> myvalue.set(0, new BytesRefWritable(s.getBytes("UTF-8")));
> 
> 
> 
> How do I do it for java "Long", "HashMap" and "Arrays"
> //value is a Pig tuple
> Long l = new Long((Long)value.get(1));
> myvalue.set(iter, new BytesRefWritable(l.toString().getBytes("UTF-8")));
> myvalue.set(1, new BytesRefWritable(l.getBytes("UTF-8")));
> 
> 
> HashMap<String, Object> hm = new
> HashMap<String,Object>((HashMap)value.get(2));
> 
> myvalue.set(iter, new
> BytesRefWritable(hm.toString().getBytes("UTF-8")));
> 
> 
> Would the toString() method work? If I need to re-read RC format back
> through the "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe" would
> it interpret correctly?
> 
> Is there any documentation for the same?
> 
> Any suggestions would be beneficial.
> 
> Viraj