You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by qiaoresearcher <qi...@gmail.com> on 2013/08/31 01:34:24 UTC

how to write hive query to solve this problem?

I have three tables:

Table 1: record when and who visited gas station or not, this contains all
the users of interest, name all the users as a set A
date                         |      user name    |   visited gas station?
2013-09-01                         tom                             yes
2013-09-02                         tom                             yes
2013-09-01                         hanks                          yes
2013-09-03                         tomy                            yes
....                                           ...
       ...

Table 2: record when and who visited Bestbuy,   the user in set A appear
here, but not all users of A will appear in this table, also table 2 has
users does not belong to set A
date                         |      user name    |   visited Bestbuy?
2013-09-01                         tom                             yes
2013-09-02                         jacob                           yes
2013-09-01                         hanks                          yes
2013-09-03                         michael                       yes
....                                           ...
       ...

Table 3: record when and who arrives one of three destinations: CA, NY and
DC, the users in table 3 has similar situation as users in table 2
regarding set A.
date                         |      user name    |   visited Bestbuy or not

2013-09-01                         tom                             CA
2013-09-02                         tom                             NY
2013-09-01                         hanks                          DC
2013-09-03                         tomy                            CA
....                                           ...
       ...

Now we want to know,  within a 90 days period, what are the following
numbers:
(1) for any given day, for the users in table 1, how many of them has a
path like:   visited Gas station first, after that date, user went to
Bestbuy, finally arrive CA
(2) for any given day, for the users in table 1, how many of them has a
path like:   visited Gas station first, then did not visit Bestbuy, finally
arrive CA

(3) for any given day, for the users in table 1, how many of them has a
path like:   visited Gas station first, then went Bestbuy, finally arrive NY
(4) for any given day, for the users in table 1, how many of them has a
path like:   visited Gas station first, then did not visit Bestbuy,
finally arrive NY

(5) for any given day, for the users in table 1, how many of them has a
path like:   visited Gas station first, then went Bestbuy, finally arrive DC
(6) for any given day, for the users in table 1, how many of them has a
path like:   visited Gas station first, then did not visit Bestbuy,
finally arrive DC

It is possible in a given day, a user may visit Bestbuy many times for
table 2, this will be simply regarded as a status 'visited Bestbuy'. One
user will be counted once in a given day.
>From the day of user visiting gas station, within next 90 days, this user
has to arrive one of three final destinations in table 3, and arrives only
one station. no multiple arrived stations.

For example,
a user tom may visit gas station on 9/1, then went to Bestbuy on 9/5,
finally went to CA on 9/30, then this user tom can be counted as 1 on 9/1
for path (1).
a user hanks may visit gas station on 9/1, then went to Bestbuy on 9/5 and
on 9/8, finally went to CA on 9/30, then this user tom count as 1 on 9/1
for path (1).
a user ruby may visit gas station on 9/1, and tables 2 does not have
records to show ruby visit Bestbuy till 11/30, and ruby arrives DC in table
3 before 11/30, then ruby contribute 1 for path (6) for the day 9/1.

How to write Hive query to get those numbers for the six paths?
a sample output will be :
9/1: 100 for path (1), 90 for path (2), ...., etc
9/2: 60 for path (1), 80 for path (2), ...., etc
9/3: ................

Any suggestions or suggested reference/readings will be deeply appreciated.

Thanks!
Qiao

Re: Hive Statistics information

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.
Thanks Ravi let me give this a shot
Regards
sanjay

From: Ravi Kiran <ma...@gmail.com>>
Reply-To: "user@hive.apache.org<ma...@hive.apache.org>" <us...@hive.apache.org>>
Date: Friday, August 30, 2013 10:53 PM
To: "user@hive.apache.org<ma...@hive.apache.org>" <us...@hive.apache.org>>
Subject: Re: Hive Statistics information

Hi Sanjay,

   What do the logs say when you fire the ANALYZE TABLE...   statement on a table ?
   One minor correction to the db connectionstring would be to use &amp; for the query parameters.
hive.stats.dbconnectionstring=jdbc:mysql://v-so1.nextagqa.com/hive_vso1_tempstatsstore?user=hive_user_vso1&amp;password=hive_user_vso1<http://v-so1.nextagqa.com/hive_vso1_tempstatsstore?&user=hive_user_vso1&password=hive_user_vso1>

I hope the database hive_vso1_tempstatsstore<http://v-so1.nextagqa.com/hive_vso1_tempstatsstore?&user=hive_user_vso1&password=hive_user_vso1> exists in your MySQL?

Regards
Ravi Magham


On Sat, Aug 31, 2013 at 6:15 AM, Sanjay Subramanian <Sa...@wizecommerce.com>> wrote:
Hi guys

I have configured Hive to use MySQL for all statistics

hive.stats.atomic=false
hive.stats.autogather=true
hive.stats.collect.rawdatasize=true
hive.stats.dbclass=jdbc:mysql
hive.stats.dbconnectionstring=jdbc:mysql://v-so1.nextagqa.com/hive_vso1_tempstatsstore?&user=hive_user_vso1&password=hive_user_vso1<http://v-so1.nextagqa.com/hive_vso1_tempstatsstore?&user=hive_user_vso1&password=hive_user_vso1>
hive.stats.jdbc.timeout=30
hive.stats.jdbcdriver=com.mysql.jdbc.Driver
hive.stats.retries.max=0
hive.stats.retries.wait=3000

However in the MYSQL hive statistics tables , they don't seem to have any data ?

Where does Hive store the statistics information ?

sanjay

CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.


CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.

Re: Hive Statistics information

Posted by Ravi Kiran <ma...@gmail.com>.
Hi Sanjay,

   What do the logs say when you fire the ANALYZE TABLE...   statement on a
table ?
   One minor correction to the db connectionstring would be to use &amp;
for the query parameters.
hive.stats.dbconnectionstring=jdbc:mysql://
v-so1.nextagqa.com/hive_vso1_tempstatsstore?user=hive_user_vso1&amp;password=hive_user_vso1<http://v-so1.nextagqa.com/hive_vso1_tempstatsstore?&user=hive_user_vso1&password=hive_user_vso1>

I hope the database
hive_vso1_tempstatsstore<http://v-so1.nextagqa.com/hive_vso1_tempstatsstore?&user=hive_user_vso1&password=hive_user_vso1>exists
in your MySQL?

Regards
Ravi Magham


On Sat, Aug 31, 2013 at 6:15 AM, Sanjay Subramanian <
Sanjay.Subramanian@wizecommerce.com> wrote:

>  Hi guys
>
>  I have configured Hive to use MySQL for all statistics
>
>  hive.stats.atomic=false
> hive.stats.autogather=true
> hive.stats.collect.rawdatasize=true
> hive.stats.dbclass=jdbc:mysql
> hive.stats.dbconnectionstring=jdbc:mysql://
> v-so1.nextagqa.com/hive_vso1_tempstatsstore?&user=hive_user_vso1&password=hive_user_vso1
> hive.stats.jdbc.timeout=30
> hive.stats.jdbcdriver=com.mysql.jdbc.Driver
> hive.stats.retries.max=0
> hive.stats.retries.wait=3000
>
>  However in the MYSQL hive statistics tables , they don't seem to have
> any data ?
>
>  Where does Hive store the statistics information ?
>
>  sanjay
>
> CONFIDENTIALITY NOTICE
> ======================
> This email message and any attachments are for the exclusive use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or distribution is
> prohibited. If you are not the intended recipient, please contact the
> sender by reply email and destroy all copies of the original message along
> with any attachments, from your computer system. If you are the intended
> recipient, please be advised that the content of this message is subject to
> access, review and disclosure by the sender's Email System Administrator.
>

Hive Statistics information

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.
Hi guys

I have configured Hive to use MySQL for all statistics

hive.stats.atomic=false
hive.stats.autogather=true
hive.stats.collect.rawdatasize=true
hive.stats.dbclass=jdbc:mysql
hive.stats.dbconnectionstring=jdbc:mysql://v-so1.nextagqa.com/hive_vso1_tempstatsstore?&user=hive_user_vso1&password=hive_user_vso1
hive.stats.jdbc.timeout=30
hive.stats.jdbcdriver=com.mysql.jdbc.Driver
hive.stats.retries.max=0
hive.stats.retries.wait=3000

However in the MYSQL hive statistics tables , they don't seem to have any data ?

Where does Hive store the statistics information ?

sanjay

CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.

Re: how to write hive query to solve this problem?

Posted by Stephen Sprague <sp...@gmail.com>.
so this is not particular to Hive is it? You could post this on a DB2,
Oracle, or even Stackflow board i'd imagine.


On Fri, Aug 30, 2013 at 4:34 PM, qiaoresearcher <qi...@gmail.com>wrote:

>
> I have three tables:
>
> Table 1: record when and who visited gas station or not, this contains all
> the users of interest, name all the users as a set A
> date                         |      user name    |   visited gas station?
> 2013-09-01                         tom                             yes
> 2013-09-02                         tom                             yes
> 2013-09-01                         hanks                          yes
> 2013-09-03                         tomy                            yes
> ....                                           ...
>        ...
>
> Table 2: record when and who visited Bestbuy,   the user in set A appear
> here, but not all users of A will appear in this table, also table 2 has
> users does not belong to set A
> date                         |      user name    |   visited Bestbuy?
> 2013-09-01                         tom                             yes
> 2013-09-02                         jacob                           yes
> 2013-09-01                         hanks                          yes
> 2013-09-03                         michael                       yes
> ....                                           ...
>        ...
>
> Table 3: record when and who arrives one of three destinations: CA, NY and
> DC, the users in table 3 has similar situation as users in table 2
> regarding set A.
> date                         |      user name    |   visited Bestbuy or
> not
> 2013-09-01                         tom                             CA
> 2013-09-02                         tom                             NY
> 2013-09-01                         hanks                          DC
> 2013-09-03                         tomy                            CA
> ....                                           ...
>        ...
>
> Now we want to know,  within a 90 days period, what are the following
> numbers:
> (1) for any given day, for the users in table 1, how many of them has a
> path like:   visited Gas station first, after that date, user went to
> Bestbuy, finally arrive CA
> (2) for any given day, for the users in table 1, how many of them has a
> path like:   visited Gas station first, then did not visit Bestbuy, finally
> arrive CA
>
> (3) for any given day, for the users in table 1, how many of them has a
> path like:   visited Gas station first, then went Bestbuy, finally arrive NY
> (4) for any given day, for the users in table 1, how many of them has a
> path like:   visited Gas station first, then did not visit Bestbuy,
> finally arrive NY
>
> (5) for any given day, for the users in table 1, how many of them has a
> path like:   visited Gas station first, then went Bestbuy, finally arrive DC
> (6) for any given day, for the users in table 1, how many of them has a
> path like:   visited Gas station first, then did not visit Bestbuy,
> finally arrive DC
>
> It is possible in a given day, a user may visit Bestbuy many times for
> table 2, this will be simply regarded as a status 'visited Bestbuy'. One
> user will be counted once in a given day.
>  From the day of user visiting gas station, within next 90 days, this user
> has to arrive one of three final destinations in table 3, and arrives only
> one station. no multiple arrived stations.
>
> For example,
> a user tom may visit gas station on 9/1, then went to Bestbuy on 9/5,
> finally went to CA on 9/30, then this user tom can be counted as 1 on 9/1
> for path (1).
> a user hanks may visit gas station on 9/1, then went to Bestbuy on 9/5 and
> on 9/8, finally went to CA on 9/30, then this user tom count as 1 on 9/1
> for path (1).
> a user ruby may visit gas station on 9/1, and tables 2 does not have
> records to show ruby visit Bestbuy till 11/30, and ruby arrives DC in table
> 3 before 11/30, then ruby contribute 1 for path (6) for the day 9/1.
>
> How to write Hive query to get those numbers for the six paths?
> a sample output will be :
> 9/1: 100 for path (1), 90 for path (2), ...., etc
> 9/2: 60 for path (1), 80 for path (2), ...., etc
> 9/3: ................
>
> Any suggestions or suggested reference/readings will be deeply
> appreciated.
>
> Thanks!
> Qiao
>
>