
Posted to user@hive.apache.org by Mark <st...@gmail.com> on 2013/03/29 18:19:46 UTC

Noob question on creating tables

We have existing log data in directories in the format YEAR/MONTH/DAY.

- How can we create a table over this data without Hive modifying and/or moving it?
- How can we tell Hive to partition this data so it knows about each day of logs?
- Does Hive work with compressed files out of the box?

Thanks

Re: Noob question on creating tables

Posted by Nitin Pawar <ni...@gmail.com>.

yes


On Fri, Mar 29, 2013 at 11:46 PM, Mark <st...@gmail.com> wrote:

> Thanks
>
> Does this mean I need to create a partition for each day manually? Is there
> no way to have Hive infer that from my directory structure?
>


-- 
Nitin Pawar

Re: Noob question on creating tables

Posted by Mark <st...@gmail.com>.

Thanks

Does this mean I need to create a partition for each day manually? Is there no way to have Hive infer that from my directory structure?


Re: Noob question on creating tables

Posted by Dean Wampler <de...@thinkbiganalytics.com>.

On Fri, Mar 29, 2013 at 12:19 PM, Mark <st...@gmail.com> wrote:

> We have existing log data in directories in the format of YEAR/MONTH/DAY.
>
> - How can we create a table over this data without Hive modifying and/or
> moving it?

create external table foo (...) partitioned by (year int, month int, day int);
...

> - How can we tell Hive to partition this data so it knows about each day of
> logs?

alter table foo add partition(year = 2013, month = 3, day = 29) location
'/path/to/2013/03/29';

> - Does Hive out of the box work with reading compressed files?

Yes, if you're using a compression scheme supported by Hadoop.

> Thanks




-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330
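
For backfilling many days at once, the single ALTER TABLE statement above can be generated in a loop. The following is a minimal sketch, not something from the thread: the function name emit_month_partitions is hypothetical, foo and /path/to are the placeholders used above, and it assumes the directory names are zero-padded (2013/03/05) while the partition values are plain integers passed without leading zeros. It only prints the statements.

```shell
# Hypothetical helper: print one ALTER TABLE statement per day of a month.
# Usage: emit_month_partitions TABLE BASE_DIR YEAR MONTH DAYS_IN_MONTH
# YEAR, MONTH and DAYS_IN_MONTH must be plain integers (3, not 03).
emit_month_partitions() (
  table=$1 base=$2 year=$3 month=$4 days=$5
  day=1
  while [ "$day" -le "$days" ]; do
    # Directory path is zero-padded; partition values stay unpadded integers.
    path=$(printf '%s/%04d/%02d/%02d' "$base" "$year" "$month" "$day")
    printf "ALTER TABLE %s ADD IF NOT EXISTS PARTITION (year = %d, month = %d, day = %d) LOCATION '%s';\n" \
      "$table" "$year" "$month" "$day" "$path"
    day=$((day + 1))
  done
)

# Statements for March 2013, written to standard output.
emit_month_partitions foo /path/to 2013 3 31
```

The output can be redirected to a file and run with hive -f.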

Re: Noob question on creating tables

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.

Yes. When you use EXTERNAL, you effectively delink the table's management from its data, so if you drop an EXTERNAL table, the data does not get deleted.
Best of all, if you don't like the data, tomorrow you can drop partitions and add partitions pointing to another data LOCATION.

If you create an EXTERNAL table called foo_table_ext without specifying a LOCATION, you will find a subdirectory /user/hive/warehouse/foo_table_ext (under the default warehouse directory).
But if you drop the table, this subdirectory won't be deleted. You also have the freedom to "add" any location and partitions of your choice.

Hope this clarifies…

sanjay
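
Repointing a partition, as described above, comes down to two statements. The following is a minimal sketch that prints the HiveQL rather than running it; the function name repoint_partition_hql and the /path/to/reprocessed location are hypothetical, and the partition columns are the year/month/day ones used elsewhere in this thread.

```shell
# Hypothetical helper: print the statements that drop a partition and
# re-add it at a new location.
# Usage: repoint_partition_hql TABLE 'PARTITION SPEC' NEW_LOCATION
repoint_partition_hql() (
  table=$1 spec=$2 new_location=$3
  printf 'ALTER TABLE %s DROP IF EXISTS PARTITION (%s);\n' "$table" "$spec"
  printf "ALTER TABLE %s ADD PARTITION (%s) LOCATION '%s';\n" "$table" "$spec" "$new_location"
)

repoint_partition_hql foo_table_ext 'year = 2013, month = 3, day = 29' /path/to/reprocessed/2013/03/29
```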


CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.

Re: Noob question on creating tables

Posted by Mark <st...@gmail.com>.

Sweet, this makes perfect sense. We are actually already using an Oozie job that waits for input and does some pre-processing on the logs before finally archiving them in HDFS. It would make sense to just attach a job to import this data and partition it by date at the same time.

Just one last question regarding external tables and tables that Hive manages. Am I correct to assume that when you let Hive manage your table, it moves everything where it wants and takes care of partitioning, but if I want to manage the directory structure myself, or use an existing one like we have, I will need to use the EXTERNAL keyword? Does that sound about right?


Re: Noob question on creating tables

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.

I agree. Bash is super easy for things like this.

I have a daily alter-partition script that I call through a java-action in Oozie (that Java class calls a HiveClient interface implementation).
Example script that I run for us, where date and server are the partitions:

# List the partition directories; the path is column 8 of the "hdfs dfs -ls" output.
for r in $(hdfs dfs -ls /path/to/directory/in/hdfs | awk '{print $8}')
do
    # Fields 10 and 11 of the "/"-separated path are assumed to hold the date
    # and the server name; adjust the field numbers to your directory depth.
    datestr=$(echo "$r" | cut -d "/" -f 10)
    serverstr=$(echo "$r" | cut -d "/" -f 11)
    "$HIVE_HOME"/bin/hive -hiveconf hive.root.logger=INFO,console -e "ALTER TABLE my_table ADD PARTITION (header_date='$datestr', header_servername='$serverstr') LOCATION '$r';"
done
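
The cut calls above depend on a fixed directory depth. As an alternative sketch (not the script's actual method), the last two path components can be taken with shell parameter expansion, assuming every partition path ends in .../<date>/<server> with no trailing slash. The function name and the sample path are hypothetical.

```shell
# Hypothetical helper: print "<date> <server>" taken from the last two
# components of a path, regardless of how deep the path is.
partition_values_from_path() (
  path=$1
  server=${path##*/}     # last component
  parent=${path%/*}      # path without the last component
  datestr=${parent##*/}  # second-to-last component
  printf '%s %s\n' "$datestr" "$server"
)

partition_values_from_path /path/to/directory/in/hdfs/2013-03-29/web01
```

This avoids recounting fields whenever the base directory changes, at the cost of assuming the date and server are always the final two components.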


Re: Noob question on creating tables

Posted by Dean Wampler <de...@thinkbiganalytics.com>.

That's a drawback of external tables, but it's actually not as difficult as
it sounds. It's easy to write a nightly "cron" job that creates the
partition for the next day (or a job per month...), if someone on your team
has some bash experience. Other job scheduling tools should support this
too. Here's an example. First, a hive script that uses parameters for the
date (Hive v0.8 or newer):

-- addlogpartition.hql
ALTER TABLE log ADD IF NOT EXISTS PARTITION (year = ${YEAR}, month = ${MONTH}, day = ${DAY});

Then, run this bash script AFTER MIDNIGHT:

#!/bin/bash
YEAR=$(date +%Y)    # returns the string "2013" today.
MONTH=$(date +%m)   # returns the string "03" today, with the leading zero.
DAY=$(date +%d)     # returns the string "29" today. Will prefix with 0 for dates < 10.

# Assumes /path/to/2013/03/29 is the correct directory name.
# The variable names must match the ${YEAR}, ${MONTH} and ${DAY} used in the .hql file.
/path/to/hive -f /path/to/addlogpartition.hql -d YEAR=$YEAR -d MONTH=$MONTH -d DAY=$DAY


(Of course, all the /path/to will be different...)

So, be careful about how "03" vs. "3" is handled in both the ALTER TABLE
statement and the path. Offhand, I don't know if Hive will complain if you
use 03 as an integer value in the ALTER TABLE statement.
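
One way to sidestep the "03" vs. "3" question is to keep the zero-padded strings for the directory path and strip the leading zero for the partition values. The following is a minimal sketch under that assumption: strip_leading_zero and partition_stmt are hypothetical helpers, the LOCATION clause is added here to tie the partition to the existing directory, and whether Hive would accept 03 unchanged is not tested. Parameter expansion is used instead of arithmetic expansion because bash treats 08 and 09 inside $(( )) as invalid octal numbers.

```shell
# Hypothetical helper: drop a single leading zero ("03" -> "3", "29" -> "29").
strip_leading_zero() {
  printf '%s\n' "${1#0}"
}

# Print the statement for a zero-padded date given as YEAR MONTH DAY.
# LOCATION keeps the padded strings; the partition values use the stripped ones.
partition_stmt() {
  printf "ALTER TABLE log ADD IF NOT EXISTS PARTITION (year = %s, month = %s, day = %s) LOCATION '/path/to/%s/%s/%s';\n" \
    "$1" "$(strip_leading_zero "$2")" "$(strip_leading_zero "$3")" "$1" "$2" "$3"
}

# Statement for today's date.
partition_stmt "$(date +%Y)" "$(date +%m)" "$(date +%d)"
```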




-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330

Re: Noob question on creating tables

Posted by Mark <st...@gmail.com>.

Thanks

Does this mean I need to create a partition for each day manually? Is there no way to have Hive infer that from my directory structure?


Re: Noob question on creating tables

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.

Hi

CREATE EXTERNAL TABLE IF NOT EXISTS log_data (col1 datatype1, col2 datatype2, . . . colN datatypeN)
PARTITIONED BY (YEAR INT, MONTH INT, DAY INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

ALTER TABLE log_data ADD PARTITION (YEAR=2013, MONTH=2, DAY=27)
LOCATION '/path/to/YEAR/MONTH/DAY/directory/ON/HDFS';

Hive will read gzip and bz2 files out of the box (so if you had hourly log
files in gzip format in your /YEAR/MONTH/DAY directory, they would be read).
Snappy and LZO will need some jar installs and configs:
https://github.com/toddlipcon/hadoop-lzo

https://code.google.com/p/snappy/

Note that gzip, for example, is not a splittable format, so huge gzip files,
which cannot be split, are not recommended as input to maps.

Hope this helps

sanjay
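
To find inputs affected by the gzip caveat above, a directory listing can be filtered by extension and size. The following is a minimal sketch: flag_large_gzip is a hypothetical function that reads "size path" pairs on standard input, and the byte limit passed to it is an arbitrary value to tune. It treats only .gz files as non-splittable, matching the note above.

```shell
# Hypothetical helper: report .gz files larger than LIMIT_BYTES.
# Reads lines of the form "<size-in-bytes> <path>" on standard input.
flag_large_gzip() (
  limit_bytes=$1
  while read -r size path; do
    case $path in
      *.gz)
        if [ "$size" -gt "$limit_bytes" ]; then
          printf 'not splittable, %s bytes: %s\n' "$size" "$path"
        fi
        ;;
    esac
  done
)

# Demo with made-up listing lines (no cluster needed).
printf '%s\n' \
  '5000 /path/to/2013/03/29/big.gz' \
  '10 /path/to/2013/03/29/small.gz' \
  '5000 /path/to/2013/03/29/big.bz2' | flag_large_gzip 1000
```

On a cluster, the input could come from something like hdfs dfs -ls /path/to/2013/03/29 | awk 'NF >= 8 {print $5, $8}', assuming the size is column 5 and the path is column 8 of the listing.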
