
Posted to user@hive.apache.org by John Omernik <jo...@omernik.com> on 2012/10/16 17:30:02 UTC

Writing Custom Serdes for Hive

We have a maybe obvious question about a serde. When a serde is invoked,
does it have access to the original Hive query?  Ideally the original query
could provide the serde some hints on how to access the data on the
backend.

Also, are there any good links/documentation on how to write serdes?  Kinda
hard to google for some reason.

Re: Writing Custom Serdes for Hive

Posted by John Omernik <jo...@omernik.com>.

AWESOME This is exactly what we were looking for. Sorry that I was looking
in the wrong spot!



On Tue, Oct 16, 2012 at 11:09 AM, shrikanth shankar <ss...@qubole.com> wrote:

> I think what you need is a custom Input Format/ Record Reader. By the time
> the SerDe is called the row has been fetched. I believe the record reader
> can get access to predicates. The code to access HBase from Hive needs it
> for the same reasons as you would need with Mongo and might be a good place
> to start.
>
> thanks,
> Shrikanth

Re: Writing Custom Serdes for Hive

Posted by shrikanth shankar <ss...@qubole.com>.

I think what you need is a custom Input Format/ Record Reader. By the time the SerDe is called the row has been fetched. I believe the record reader can get access to predicates. The code to access HBase from Hive needs it for the same reasons as you would need with Mongo and might be a good place to start. 

thanks,
Shrikanth
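
A minimal sketch of the idea above, assuming the record reader can see the job configuration. To keep it self-contained, a plain Map stands in for Hadoop's JobConf, and the FilterHint class is hypothetical. The key name "hive.io.filter.text" is, as far as I know, where Hive publishes the pushed-down filter text for storage handlers that opt in to predicate pushdown; treat it as an assumption and verify it against your Hive version.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: looks for a pushed-down filter in the job configuration.
// A real RecordReader would call JobConf.get(key) instead of Map.get(key).
final class FilterHint {
    // Assumption: Hive stores the filter text under this key; check your version.
    static final String FILTER_TEXT_KEY = "hive.io.filter.text";

    // Returns the trimmed filter text, or empty when nothing was pushed down.
    static Optional<String> filterText(Map<String, String> jobConf) {
        String text = jobConf.get(FILTER_TEXT_KEY);
        if (text == null || text.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(text.trim());
    }
}
```

An empty result simply means the reader should fall back to fetching everything, which matches the behavior of the existing connector.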

Re: Writing Custom Serdes for Hive

Posted by John Omernik <jo...@omernik.com>.

The reason I am asking is that YC (who maybe reads this list and can chime
in) has written a connector for MongoDB.  It's simple: basically it
connects to a MongoDB, maps columns (primitives only) to MongoDB fields,
and allows you to select out of Mongo. Pretty sweet actually, and with
Mongo, things are really fast for small tables.

That being said, I noticed that his connector basically gets all rows from
a MongoDB collection every time it's run.  And we wanted to see if we
could extend it to do some simple MongoDB-level filtering based on the
passed query.  Basically have a fail-open approach... if it saw something
it thought it could optimize in the MongoDB query to limit data, it would;
otherwise, it would default to the original approach of getting all the
data.

For example:

select * from mongo_table where name rlike 'Bobby\\sWhite'

Current method: the connector runs db.collection.find(), which gets all the
documents from MongoDB, and then Hive does the regex.

What we want to try: "Oh, one of our defined mongo columns has an rlike, OK,
send this instead: db.collection.find({"name": /Bobby\sWhite/});" so less data
would need to be transferred. Yes, Hive would still run the rlike on
the data... *shrug* at least it's running it on far less data.   Basically,
if we could determine shortcuts, we could use them.

Just trying to understand Serdes and how we are completely not using them
as intended :)
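
The fail-open idea described above could be sketched roughly as follows. This is not the connector's code: the RlikeToMongo class and its methods are hypothetical, and the sketch only recognizes the single simplest shape of WHERE clause (one column, one rlike, one quoted literal). Anything else yields an empty result, so the caller keeps doing a full find(). The output uses MongoDB's $regex query operator.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical translator: turns "<column> rlike '<pattern>'" into a MongoDB
// filter document, and returns empty for anything it does not recognize.
final class RlikeToMongo {
    private static final Pattern SIMPLE_RLIKE = Pattern.compile(
            "^\\s*(\\w+)\\s+rlike\\s+'([^']*)'\\s*$", Pattern.CASE_INSENSITIVE);

    static Optional<String> toMongoFilter(String whereClause) {
        if (whereClause == null) {
            return Optional.empty();
        }
        Matcher m = SIMPLE_RLIKE.matcher(whereClause);
        if (!m.matches()) {
            return Optional.empty(); // fail open: caller fetches everything
        }
        String column = m.group(1);
        // The Hive literal 'Bobby\\sWhite' denotes the regex Bobby\sWhite.
        String regex = m.group(2).replace("\\\\", "\\");
        return Optional.of(
                "{\"" + column + "\": {\"$regex\": \"" + jsonEscape(regex) + "\"}}");
    }

    // Escapes backslashes and double quotes for use inside a JSON string.
    private static String jsonEscape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

One caveat with this approach: Hive's rlike uses Java regular expressions while MongoDB uses PCRE, so a pattern that means something different in the two dialects could make Mongo drop rows Hive would have kept. A cautious version would only push down patterns known to behave the same in both.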


RE: Writing Custom Serdes for Hive

Posted by "Connell, Chuck" <Ch...@nuance.com>.

A serde is actually used the other way around... Hive parses the query, writes MapReduce code to solve the query, and the generated code uses the serde for field access.

The standard way to write a serde is to start from the trunk regex serde, then modify it as needed...

http://svn.apache.org/viewvc/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java?revision=1131106&view=markup

Also, nice article by Roberto Congiu...

http://www.congiu.com/a-json-readwrite-serde-for-hive/
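
To illustrate what "field access" means for a regex-style serde, here is a rough, self-contained sketch of the core deserialization step: one capture group per column. It is not the RegexSerDe source and deliberately leaves out the Hive interfaces (ObjectInspector, Writable and so on); the RegexRowParser class is hypothetical, and returning null for a non-matching line is a choice made for this sketch only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical core of a regex-based deserializer: each capture group in the
// pattern becomes one column value of the row.
final class RegexRowParser {
    private final Pattern pattern;

    RegexRowParser(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    // Returns one string per capture group, or null when the line does not match.
    List<String> deserialize(String line) {
        Matcher m = pattern.matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<String> row = new ArrayList<>(m.groupCount());
        for (int i = 1; i <= m.groupCount(); i++) {
            row.add(m.group(i));
        }
        return row;
    }
}
```

Note that nothing here sees the query: the parser is handed a line that has already been fetched and only splits it into fields.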

Chuck Connell
Nuance R&D Data Team
Burlington, MA


Re: Writing Custom Serdes for Hive

Posted by Ariel Marcus <ar...@openbi.com>.

John,

This article is pretty good:
http://www.congiu.com/a-json-readwrite-serde-for-hive/

It looks like you can get access to the information you are looking for in
the Properties object passed to the SerDe's initialize method.

 public void initialize(Configuration conf, Properties tbl) throws SerDeException

Look here for an example of usage:
https://github.com/rcongiu/Hive-JSON-Serde/blob/master/src/main/java/org/openx/data/jsonserde/JsonSerDe.java
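
As a rough sketch of what can be read from that Properties object: the "columns" key is, to my knowledge, one Hive sets with the comma-separated column names, while "mongo.collection" is a made-up key standing in for whatever a connector might define through SERDEPROPERTIES. The TableProps class is hypothetical too. As far as I know these are table-level properties fixed when the table is defined, so they describe the table rather than the particular query being run.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

// Hypothetical helper showing the kind of lookups a SerDe's initialize()
// might do on the table Properties it receives.
final class TableProps {
    // Reads the comma-separated column list Hive stores under "columns".
    static List<String> columnNames(Properties tbl) {
        String cols = tbl.getProperty("columns", "");
        if (cols.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(cols.split(","));
    }

    // "mongo.collection" is an invented key for illustration only.
    static String mongoCollection(Properties tbl, String fallback) {
        return tbl.getProperty("mongo.collection", fallback);
    }
}
```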

Hope that helps.

Ariel

---------------------------------
Ariel Marcus, Consultant
www.openbi.com | ariel.marcus@openbi.com
150 N Michigan Avenue, Suite 2800, Chicago, IL 60601
Cell: 314-827-4356

-- 

------------------------------

This transmission is confidential and intended solely for the use of the 
recipient named above. It may contain confidential, proprietary, or legally 
privileged information. If you are not the intended recipient, you are 
hereby notified that any unauthorized review, use, disclosure or 
distribution is strictly prohibited. If you have received this transmission 
in error, please contact the sender by reply e-mail and delete the original 
transmission and all copies from your system.