Posted to user@hive.apache.org by Matthieu Labour <ma...@actionx.com> on 2013/04/09 00:49:54 UTC

Should the SQL schema for HIVE be identical to the SQL schema for a relational database?

Hi

It would be terrific to get some advice on migrating a schema from an RDBMS to
Hive.

Should the SQL schema for Hive be identical to the SQL schema for a
PostgreSQL/MySQL database?

Specifically:

I have an application that generates events that look like the following:

{"ts":N+1,"userId":"123","event":"location","payload":{"verticalAccuracy":10,"longitude":-73.99718090313884,"latitude":40.72473278788106,"altitude":27.79653739929199,"horizontalAccuracy":65}}
{"ts":N+2,"userId":"123","event":"addProduct","payload":["cart","osprey-kestrel-48"]}
...

Events are written to persistent storage (AWS S3). A 'worker' wakes
up periodically, reads the newly received events, and inserts them into a
PostgreSQL database. The database has user, product, user_product
(user_id, product_id, action:(viewed|wishlist...), timestamp), location,
etc. tables.

We are migrating to Hive.

Should we also create user, product, user_product, locations, etc. as Hive
tables and have a MapReduce job process the event files to populate them? Or
should/can we implement a different schema that would allow external Hive
tables to map directly to the generated event files? Or a mix
of both?
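[A sketch of the external-table approach, for context. Table names, the S3 path, and the JSON paths are illustrative, not from the original thread. It treats each event line as a single string column and extracts fields with the built-in get_json_object() UDF, which sidesteps the question of which JSON SerDe is installed on the cluster:]

```sql
-- Hypothetical external table mapped directly onto the raw S3 event files.
-- Each line of the files is one JSON event, read as a single string.
CREATE EXTERNAL TABLE raw_events (json STRING)
LOCATION 's3://your-bucket/events/';

-- Materialize a flattened, typed table for one event type.
-- The payload shape differs per event, so each target table
-- extracts only the paths it knows about.
CREATE TABLE user_location AS
SELECT
  CAST(get_json_object(json, '$.ts') AS BIGINT)                AS ts,
  get_json_object(json, '$.userId')                            AS user_id,
  CAST(get_json_object(json, '$.payload.latitude')  AS DOUBLE) AS latitude,
  CAST(get_json_object(json, '$.payload.longitude') AS DOUBLE) AS longitude
FROM raw_events
WHERE get_json_object(json, '$.event') = 'location';
```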

Thank you for your help!

-matt

Insight into Hive script running

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.
Hi

Is there a way we can get insight into what stage of a Hive script is currently running?

We have Hive scripts based on at least 3 sub-queries, so this would be a great monitor to have.

Also, which parts of a Hive script get translated into mappers and reducers? (Possibly that's shown in EXPLAIN EXTENDED?)
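[To illustrate the second question: EXPLAIN (and the more verbose EXPLAIN EXTENDED) prints the stage plan for a query. The table and column names below are made up:]

```sql
-- EXPLAIN shows how this query decomposes into MapReduce stages.
EXPLAIN
SELECT user_id, COUNT(*) AS views
FROM user_product
WHERE action = 'viewed'
GROUP BY user_id;
```

[The output starts with a STAGE DEPENDENCIES section, then for each MapReduce stage shows a Map Operator Tree (the filter on `action` runs there) and a Reduce Operator Tree (the count runs there). At run time, Hive's console output names the stage it is executing, and each launched MapReduce job is visible in the JobTracker web UI.]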
Thanks

sanjay




RE: Should the SQL schema for HIVE be identical to the SQL schema for a relational database?

Posted by Mike Liddell <Mi...@microsoft.com>.
Some general guidance would be to aim for minimal JOINs in your regular queries.
Thus the biggest change from a normal-form RDBMS schema is to denormalize, such that joins do not come into play until a query has already performed data reduction via filtering or aggregation. This implies a star schema comprising one primary fact table that has sufficient data in it to sensibly partition it and support direct filtering and aggregation.
Supplementing the main table will be dimension tables that can provide additional data to flesh out result sets.

It looks like your data will be highly amenable to this.
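[For instance (all names hypothetical), the filter-then-join pattern described above might look like this: reduce the denormalized fact table first, then join the small dimension table onto the already-aggregated result:]

```sql
-- events: denormalized fact table, one row per event, partitioned by day.
-- products: small dimension table providing display data.
SELECT p.title, agg.adds
FROM (
  SELECT product_id, COUNT(*) AS adds
  FROM events
  WHERE dt = '2013-04-08' AND event = 'addProduct'   -- partition pruning + filter
  GROUP BY product_id
) agg
JOIN products p ON agg.product_id = p.product_id;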

-mike.

From: Matthieu Labour [mailto:matthieu@actionx.com]
Sent: Monday, April 08, 2013 3:50 PM
To: user@hive.apache.org
Subject: Should the SQL schema for HIVE be identical to the SQL schema for a relational database?
