Posted to user@hive.apache.org by Jon Palmer <jp...@care.com> on 2012/05/15 14:11:53 UTC

What's the right data storage/representation?

All,

I'm a relative newcomer to Hadoop/Hive. We have a very standard setup of multiple webapp servers backed by a MySQL database. We are evaluating Hive as a high-scale solution for our relatively sophisticated reporting and analytics needs. However, it's not clear what the best practices are around storing and representing the data our application generates. This is probably best explained with an example:

We imagine a Hive deployment that is importing Apache logs and MySQL data from the application db (probably via Sqoop). We would run our analysis daily and output the results somewhere (flat files in S3 or another MySQL reporting database). We have users that have a) a status (Basic or Premium) and b) a location (a zip code). We'd like to be able to ask questions like "How many premium users did we have within ten miles of zip 02110 on Jan 3rd 2012?" Computing these numbers for all dates across all zip codes, and for a number of radii, on a very large set of users seems like a pretty good use of Hadoop/Hive.
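Concretely, the kind of per-day rollup we have in mind looks like this (a toy sketch in plain Python; the zip centroids here are made up, and a real job would join against a zip-code dimension table):

```python
# Toy sketch: count premium users within a radius of a target zip code.
# ZIP_CENTROIDS is hypothetical data standing in for a zip dimension table.
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance between two lat/lon points, in miles.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3959 * 2 * asin(sqrt(a))  # mean Earth radius, ~3959 miles

ZIP_CENTROIDS = {
    "02110": (42.357, -71.052),  # Boston
    "02139": (42.364, -71.104),  # Cambridge
    "10001": (40.751, -73.997),  # New York
}

def premium_users_within(users, target_zip, radius_miles):
    # users: list of (status, zip) pairs as of the date being analyzed.
    tlat, tlon = ZIP_CENTROIDS[target_zip]
    return sum(
        1
        for status, zipcode in users
        if status == "Premium"
        and haversine_miles(*ZIP_CENTROIDS[zipcode], tlat, tlon) <= radius_miles
    )

users = [("Premium", "02139"), ("Basic", "02139"), ("Premium", "10001")]
# Cambridge is within 10 miles of 02110; New York is not.
print(premium_users_within(users, "02110", 10))  # -> 1
```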

However, users can move location and change status. The application database only really cares about the current location and status of a user, not the history of those fields. This presents a challenge to the analytics process. If we run the analysis every day we will naturally pick up the changes in status and location. However, if we were to try to recompute our entire analysis for all dates, we would get different results for users that moved location or changed status. The Apache logs are likely not of much use, as they are unlikely to contain the member ids needed to deduce which requests resulted in the change of status or location for a user.

How is this type of problem typically solved with Hive?

I can see a few potential solutions:

1.       Don't solve it. Accept that you have some artifacts in your reporting data that cannot be recovered from the source data.

2.       Create status and location history tables in the application db and use that during the analytics process.

3.       Log the status and location change 'events' to some other log file and use those logs in the Hive analysis.

Are there any 'best practices' around these kinds of problems and, in particular, suggestions for the simplest implementation of the extra logging and analysis required by option 3?

Thanks
Jon



This email is intended for the person(s) to whom it is addressed and may contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized use, distribution, copying, or disclosure by any person other than the addressee(s) is strictly prohibited. If you have received this email in error, please notify the sender immediately by return email and delete the message and any attachments from your system.

Re: What's the right data storage/representation?

Posted by Mark Grover <mg...@oanda.com>.
Hi Jon,
First off, processing the data from (say, Apache) logs in Hive and storing aggregates in a reporting server, like you mentioned, is a fairly common paradigm.

You have some large scale data (Apache logs) and some dimension data (user data). The problem you really have is how to make use of this dimension data during your analysis.

For each of your options:
1. Running away from your problem isn't really a solution :-)
2. You wouldn't want to connect to the application DB from your Hive/Hadoop jobs. In the worst case, all your Hadoop nodes could be connecting to your application DB at the same time.
3. Now this sounds promising. Have a periodic job that runs and populates/updates a file on S3 with the user status information. Then create an external Hive table on top of this S3 file and use this Hive table in your analysis. Alternatively (depending on the use case, size and other considerations), you could add this S3 file to distributed cache; that way this file becomes available to all mappers and reducers for possible consumption.
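The periodic job for option 3 could be as simple as folding the change events into a one-row-per-user snapshot file that the external Hive table (or the distributed cache copy) points at. A sketch, with assumed tab-separated formats:

```python
# Sketch of the periodic snapshot job: reduce status/location change events
# (user_id \t timestamp \t status \t zip) to the latest row per user.
# The TSV layout here is an assumption, not a prescribed format.
import csv, io

def build_snapshot(event_lines):
    latest = {}
    for user_id, ts, status, zipcode in csv.reader(event_lines, delimiter="\t"):
        # ISO-8601 timestamps compare correctly as strings.
        if user_id not in latest or ts > latest[user_id][0]:
            latest[user_id] = (ts, status, zipcode)
    out = io.StringIO()
    w = csv.writer(out, delimiter="\t", lineterminator="\n")
    for user_id in sorted(latest):
        ts, status, zipcode = latest[user_id]
        w.writerow([user_id, status, zipcode])
    return out.getvalue()

events = [
    "7\t2012-01-02T08:00:00\tBasic\t02139",
    "7\t2012-03-10T12:00:00\tPremium\t02110",
    "9\t2012-02-01T09:30:00\tBasic\t10001",
]
print(build_snapshot(events), end="")
```

In practice the output file would be written to S3 and declared as the location of an external Hive table.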

Here is option 4: What if you analyzed the data for all locations of the user? Then, when you export this aggregate data into the reporting server, drop/ignore the aggregated data for each user for all but their present location. That way, Hive just needs a list of new/changed users to run the periodic analysis on. Since your stakeholders would be interacting only with the reporting server, you just need to make sure that they see the latest location data.
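The export-side filter for option 4 is a one-liner once you have a current-location lookup (the record shapes below are hypothetical):

```python
# Sketch of option 4's export step: keep, for each user, only the aggregate
# row matching their current location. Data shapes are illustrative only.
def export_current_only(aggregates, current_location):
    # aggregates: {(user_id, zip): metric}; current_location: {user_id: zip}
    return {
        (user_id, zipcode): metric
        for (user_id, zipcode), metric in aggregates.items()
        if current_location.get(user_id) == zipcode
    }

aggs = {("7", "02139"): 40, ("7", "02110"): 12, ("9", "10001"): 5}
current = {"7": "02110", "9": "10001"}
print(export_current_only(aggs, current))
```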

Good luck!

Mark

----- Original Message -----
From: "shrikanth shankar" <ss...@qubole.com>
To: user@hive.apache.org
Sent: Tuesday, May 15, 2012 2:41:21 PM
Subject: Re: What's the right data storage/representation?

Hive tables can sit on top of S3 storage, so you don't really need a separate export process.

thanks,
Shrikanth
On May 15, 2012, at 11:35 AM, Jon Palmer wrote:

> That seems like a very reasonable approach. However, if we use a technology like Amazon Elastic Map Reduce my Hive cluster is (potentially) going to be destroyed and recreated. As a result I'd really need to export the update history Hive table to some other store (like S3) so that it can be re-imported on the next spin up of the Hive cluster. Do I have that right?
> 
> Jon
> 
> -----Original Message-----
> From: shrikanth shankar [mailto:sshankar@qubole.com] 
> Sent: Tuesday, May 15, 2012 1:14 PM
> To: user@hive.apache.org
> Subject: Re: What's the right data storage/representation?
> 
> I would agree on keeping track of the history of updates in a separate table in Hive (you may not need to maintain it in the application tier). This pattern looks like the "Slowly Changing Dimension" pattern used in other (more traditional) data warehouses. I suspect the challenge here would be writing an ETL process to maintain the Hive table based on the current status of the application db table.
> 
> Shrikanth
> On May 15, 2012, at 9:41 AM, Owen O'Malley wrote:
> 
>> On Tue, May 15, 2012 at 5:11 AM, Jon Palmer <jp...@care.com> wrote:
>>> I can see a few potential solutions:
>>> 
>>> 1.       Don't solve it. Accept that you have some artifacts in your
>>> reporting data that cannot be recovered from the source data.
>>> 
>>> 2.       Create status and location history tables in the application db and
>>> use that during the analytics process.
>>> 
>>> 3.       Log the status and location change 'events' to some other log file
>>> and use those logs in the Hive analysis.
>> 
>> I would probably create a Hive table that includes the status and 
>> location updates. One of the advantages of Hive & Hadoop is that it is 
>> easy to store the raw information in bulk and continue to process it.
>> Once you have the information, you will likely find new uses for it.
>> 
>> -- Owen
> 
> 
> 


Re: What's the right data storage/representation?

Posted by shrikanth shankar <ss...@qubole.com>.
Hive tables can sit on top of S3 storage, so you don't really need a separate export process.

thanks,
Shrikanth
On May 15, 2012, at 11:35 AM, Jon Palmer wrote:

> That seems like a very reasonable approach. However, if we use a technology like Amazon Elastic Map Reduce my Hive cluster is (potentially) going to be destroyed and recreated. As a result I'd really need to export the update history Hive table to some other store (like S3) so that it can be re-imported on the next spin up of the Hive cluster. Do I have that right?
> 
> Jon
> 
> -----Original Message-----
> From: shrikanth shankar [mailto:sshankar@qubole.com] 
> Sent: Tuesday, May 15, 2012 1:14 PM
> To: user@hive.apache.org
> Subject: Re: What's the right data storage/representation?
> 
> I would agree on keeping track of the history of updates in a separate table in Hive (you may not need to maintain it in the application tier). This pattern looks like the "Slowly Changing Dimension" pattern used in other (more traditional) data warehouses. I suspect the challenge here would be writing an ETL process to maintain the Hive table based on the current status of the application db table.
> 
> Shrikanth
> On May 15, 2012, at 9:41 AM, Owen O'Malley wrote:
> 
>> On Tue, May 15, 2012 at 5:11 AM, Jon Palmer <jp...@care.com> wrote:
>>> I can see a few potential solutions:
>>> 
>>> 1.       Don't solve it. Accept that you have some artifacts in your
>>> reporting data that cannot be recovered from the source data.
>>> 
>>> 2.       Create status and location history tables in the application db and
>>> use that during the analytics process.
>>> 
>>> 3.       Log the status and location change 'events' to some other log file
>>> and use those logs in the Hive analysis.
>> 
>> I would probably create a Hive table that includes the status and 
>> location updates. One of the advantages of Hive & Hadoop is that it is 
>> easy to store the raw information in bulk and continue to process it.
>> Once you have the information, you will likely find new uses for it.
>> 
>> -- Owen
> 
> 
> 


RE: What's the right data storage/representation?

Posted by Jon Palmer <jp...@care.com>.
That seems like a very reasonable approach. However, if we use a technology like Amazon Elastic Map Reduce my Hive cluster is (potentially) going to be destroyed and recreated. As a result I'd really need to export the update history Hive table to some other store (like S3) so that it can be re-imported on the next spin up of the Hive cluster. Do I have that right?

Jon

-----Original Message-----
From: shrikanth shankar [mailto:sshankar@qubole.com] 
Sent: Tuesday, May 15, 2012 1:14 PM
To: user@hive.apache.org
Subject: Re: What's the right data storage/representation?

I would agree on keeping track of the history of updates in a separate table in Hive (you may not need to maintain it in the application tier). This pattern looks like the "Slowly Changing Dimension" pattern used in other (more traditional) data warehouses. I suspect the challenge here would be writing an ETL process to maintain the Hive table based on the current status of the application db table.

Shrikanth
On May 15, 2012, at 9:41 AM, Owen O'Malley wrote:

> On Tue, May 15, 2012 at 5:11 AM, Jon Palmer <jp...@care.com> wrote:
>> I can see a few potential solutions:
>> 
>> 1.       Don't solve it. Accept that you have some artifacts in your
>> reporting data that cannot be recovered from the source data.
>> 
>> 2.       Create status and location history tables in the application db and
>> use that during the analytics process.
>> 
>> 3.       Log the status and location change 'events' to some other log file
>> and use those logs in the Hive analysis.
> 
> I would probably create a Hive table that includes the status and 
> location updates. One of the advantages of Hive & Hadoop is that it is 
> easy to store the raw information in bulk and continue to process it.
> Once you have the information, you will likely find new uses for it.
> 
> -- Owen




Re: What's the right data storage/representation?

Posted by shrikanth shankar <ss...@qubole.com>.
I would agree on keeping track of the history of updates in a separate table in Hive (you may not need to maintain it in the application tier). This pattern looks like the "Slowly Changing Dimension" pattern used in other (more traditional) data warehouses. I suspect the challenge here would be writing an ETL process to maintain the Hive table based on the current status of the application db table.
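For what it's worth, the core of a type-2 slowly-changing-dimension update is small. A sketch (the schema with valid_from/valid_to columns, where valid_to = None marks the current row, is one common convention, not something prescribed by Hive):

```python
# Sketch of a type-2 SCD update: given the current application-db snapshot,
# close out history rows whose status changed and open new intervals.
# Schema (user_id, status, valid_from, valid_to) is an assumed convention.
def apply_snapshot(history, snapshot, as_of):
    current = {r["user_id"]: r for r in history if r["valid_to"] is None}
    for user_id, status in snapshot.items():
        row = current.get(user_id)
        if row is not None and row["status"] == status:
            continue  # unchanged; keep the open interval
        if row is not None:
            row["valid_to"] = as_of  # close the old interval
        history.append({"user_id": user_id, "status": status,
                        "valid_from": as_of, "valid_to": None})
    return history

hist = [{"user_id": "7", "status": "Basic",
         "valid_from": "2012-01-01", "valid_to": None}]
hist = apply_snapshot(hist, {"7": "Premium"}, "2012-05-15")
print([(r["status"], r["valid_from"], r["valid_to"]) for r in hist])
```

The hard part, as noted above, is running this reliably as ETL against the live application table, not the update logic itself.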

Shrikanth
On May 15, 2012, at 9:41 AM, Owen O'Malley wrote:

> On Tue, May 15, 2012 at 5:11 AM, Jon Palmer <jp...@care.com> wrote:
>> I can see a few potential solutions:
>> 
>> 1.       Don’t solve it. Accept that you have some artifacts in your
>> reporting data that cannot be recovered from the source data.
>> 
>> 2.       Create status and location history tables in the application db and
>> use that during the analytics process.
>> 
>> 3.       Log the status and location change ‘events’ to some other log file
>> and use those logs in the Hive analysis.
> 
> I would probably create a Hive table that includes the status and
> location updates. One of the advantages of Hive & Hadoop is that it is
> easy to store the raw information in bulk and continue to process it.
> Once you have the information, you will likely find new uses for it.
> 
> -- Owen


Re: What's the right data storage/representation?

Posted by Owen O'Malley <om...@apache.org>.
On Tue, May 15, 2012 at 5:11 AM, Jon Palmer <jp...@care.com> wrote:
> I can see a few potential solutions:
>
> 1.       Don’t solve it. Accept that you have some artifacts in your
> reporting data that cannot be recovered from the source data.
>
> 2.       Create status and location history tables in the application db and
> use that during the analytics process.
>
> 3.       Log the status and location change ‘events’ to some other log file
> and use those logs in the Hive analysis.

I would probably create a Hive table that includes the status and
location updates. One of the advantages of Hive & Hadoop is that it is
easy to store the raw information in bulk and continue to process it.
Once you have the information, you will likely find new uses for it.
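A point-in-time query over such an append-only updates table reduces to "latest update on or before the date", for example (toy rows; a Hive version would be a join plus a max over the update timestamp):

```python
# Sketch: resolve a user's status as of a given date from an append-only
# updates table of (user_id, date, status) rows. Data is illustrative.
def status_as_of(updates, user_id, date):
    candidates = [(d, s) for u, d, s in updates if u == user_id and d <= date]
    # max() on (date, status) tuples picks the most recent update.
    return max(candidates)[1] if candidates else None

updates = [
    ("7", "2012-01-01", "Basic"),
    ("7", "2012-02-10", "Premium"),
    ("7", "2012-04-01", "Basic"),
]
print(status_as_of(updates, "7", "2012-01-03"))  # -> Basic
```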

-- Owen