You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Sai Sai <sa...@yahoo.in> on 2013/05/27 09:38:11 UTC

Re:Partitioning confusion


After creating a partition for a country (USA) and state (IL) and when we go to the the hdfs site to look at the partition in the browser we r seeing  all the records for all the countries and states rather than just for the partition created for US and IL given below, is this correct behavior:
********************
Here is my commands:
********************


CREATE TABLE employees (name STRING, salary FLOAT, subordinates ARRAY<STRING>, deductions MAP<STRING, FLOAT>, address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT, country:STRING> ) PARTITIONED BY (country STRING, state STRING);

LOAD DATA LOCAL INPATH '/home/satish/data/employees/input/employees-country.txt' INTO TABLE employees PARTITION (country='USA',state='IL');

********************

Here is my original data file, where i have a few countries data such as USA, INDIA, UK, AUS:
********************


John Doe100000.0Mary SmithTodd JonesFederal Taxes.2State Taxes.05Insurance.11 Michigan Ave.ChicagoIL60600USA
Mary Smith80000.0Bill KingFederal Taxes.2State Taxes.05Insurance.1100 Ontario St.ChicagoIL60601USA
Todd Jones70000.0Federal Taxes.15State Taxes.03Insurance.1200 Chicago Ave.Oak ParkIL60700USA
Bill King60000.0Federal Taxes.15State Taxes.03Insurance.1300 Obscure Dr.ObscuriaIL60100USA
Boss Man200000.0John DoeFred FinanceFederal Taxes.3State Taxes.07Insurance.051 Pretentious Drive.ChicagoIL60500USA
Fred Finance150000.0Stacy AccountantFederal Taxes.3State Taxes.07Insurance.052 Pretentious Drive.ChicagoIL60500USA
Stacy Accountant60000.0Federal Taxes.15State Taxes.03Insurance.1300 Main St.NapervilleIL60563USA
John Doe 2100000.0Mary SmithTodd JonesFederal Taxes.2State Taxes.05Insurance.11 Michigan Ave.ChicagoIL60600INDIA
Mary Smith 280000.0Bill KingFederal Taxes.2State Taxes.05Insurance.1100 Ontario St.ChicagoIL60601INDIA
Todd Jones 270000.0Federal Taxes.15State Taxes.03Insurance.1200 Chicago Ave.Oak ParkIL60700AUSTRALIA
Bill King 260000.0Federal Taxes.15State Taxes.03Insurance.1300 Obscure Dr.ObscuriaIL60100AUSTRALIA
Boss Man2 200000.0John DoeFred FinanceFederal Taxes.3State Taxes.07Insurance.051 Pretentious Drive.ChicagoIL60500UK
Fred Finance 2150000.0Stacy AccountantFederal Taxes.3State Taxes.07Insurance.052 Pretentious Drive.ChicagoIL60500UK
Stacy Accountant 260000.0Federal Taxes.15State Taxes.03Insurance.1300 Main St.NapervilleIL60563UK
********************

Now when i navigate to:
Contents of directory /user/hive/warehouse/db1.db/employees/country=USA/state=IL

********************

I see all the records and was wondering if it should have only USA & IL records.
Please help.

Re: Partitioning confusion

Posted by Nitin Pawar <ni...@gmail.com>.
if you have a 250GB file, then how did it become 2.5TB ?

if you can not write a mapreduce job to process to write your data into
specific partitions, then the other way around would be load the entire
data into a temporary table and then load data into partitioned tabled and
then drop the temporary table.


approach would be
1) Create a temporary table
CREATE TABLE temp_employees (name STRING, salary FLOAT, subordinates
ARRAY<STRING>, deductions MAP<STRING, FLOAT>, address STRUCT<street:STRING,
city:STRING, state:STRING, zip:INT, country:STRING> );

2) Load data into this table
LOAD DATA LOCAL INPATH
'/home/satish/data/employees/input/employees-country.txt'
INTO TABLE temp_employees;

3) Create partitoned table
CREATE TABLE employees (name STRING, salary FLOAT, subordinates
ARRAY<STRING>, deductions MAP<STRING, FLOAT>, address STRUCT<street:STRING,
city:STRING, state:STRING, zip:INT, country:STRING> ) PARTITIONED BY
(country STRING, state STRING);

2) Load data from temporary data into partitioned tables
   2a) if there are already partitions existing then be careful giving a
blanket query or it will overwrite data into exisiting partitions

insert overwrite employees partition(country='USA',state='IL') select *
from temp_employees where country="USA" and state ="IL";
this you will need to do for all of them


   2b) if your partitioned table is empty currently then you can just
enable dynamic partitioning and hive will take care of adding correct data
into individual partitions
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;

insert overwrite table employees select * from employees;

3) drop the temporary table

by doing above approach you will have your data being replicated 2x times
of replication factor for sometime but then once you drop temporary table
it will be normal HDFS replication factor size data.

Its been sometime I have worked on hive so others may have better
approaches as well, so wait for someone to correct me before going further
:)


On Mon, May 27, 2013 at 2:13 PM, Sai Sai <sa...@yahoo.in> wrote:

> Nitin
> I am still confused, from the below data that  i have given should the
> file which sits in the folder Country=USA and state=IL have only the rows
> where Country=USA and state=IL or will it have rows of other countries also.
> The reason i ask is because if we have a 250GB file and would like to
> create 10 partitions that would end up in 2.5 TB * 3 = 7.5TB. Is this
> expected.
> Thanks
> S
>
>   ------------------------------
>  *From:* Nitin Pawar <ni...@gmail.com>
> *To:* user@hive.apache.org; Sai Sai <sa...@yahoo.in>
> *Sent:* Monday, 27 May 2013 2:08 PM
> *Subject:* Re: Partitioning confusion
>
> when you specify the load data query with specific partition, it will put
> the entire data into that partition.
>
>
>
> On Mon, May 27, 2013 at 1:08 PM, Sai Sai <sa...@yahoo.in> wrote:
>
>
> After creating a partition for a country (USA) and state (IL) and when we
> go to the the hdfs site to look at the partition in the browser we r seeing
>  all the records for all the countries and states rather than just for the
> partition created for US and IL given below, is this correct behavior:
> ********************
> Here is my commands:
> ********************
>
> CREATE TABLE employees (name STRING, salary FLOAT, subordinates
> ARRAY<STRING>, deductions MAP<STRING, FLOAT>, address STRUCT<street:STRING,
> city:STRING, state:STRING, zip:INT, country:STRING> ) PARTITIONED BY
> (country STRING, state STRING);
>
> LOAD DATA LOCAL INPATH
> '/home/satish/data/employees/input/employees-country.txt' INTO TABLE
> employees PARTITION (country='USA',state='IL');
>
> ********************
> Here is my original data file, where i have a few countries data such as
> USA, INDIA, UK, AUS:
> ********************
>
> John Doe100000.0Mary SmithTodd JonesFederal Taxes.2State
> Taxes.05Insurance.11 Michigan Ave.ChicagoIL60600USA
> Mary Smith80000.0Bill KingFederal Taxes.2State Taxes.05Insurance.1100
> Ontario St.ChicagoIL60601USA
> Todd Jones70000.0Federal Taxes.15State Taxes.03Insurance.1200 Chicago
> Ave.Oak ParkIL60700USA
> Bill King60000.0Federal Taxes.15State Taxes.03Insurance.1300 Obscure
> Dr.ObscuriaIL60100USA
> Boss Man200000.0John DoeFred FinanceFederal Taxes.3State
> Taxes.07Insurance.051 Pretentious Drive.ChicagoIL60500USA
> Fred Finance150000.0Stacy AccountantFederal Taxes.3State
> Taxes.07Insurance.052 Pretentious Drive.ChicagoIL60500USA
> Stacy Accountant60000.0Federal Taxes.15State Taxes.03Insurance.1300 Main
> St.NapervilleIL60563USA
> John Doe 2100000.0Mary SmithTodd JonesFederal Taxes.2State
> Taxes.05Insurance.11 Michigan Ave.ChicagoIL60600INDIA
> Mary Smith 280000.0Bill KingFederal Taxes.2State Taxes.05Insurance.1100
> Ontario St.ChicagoIL60601INDIA
> Todd Jones 270000.0Federal Taxes.15State Taxes.03Insurance.1200 Chicago
> Ave.Oak ParkIL60700AUSTRALIA
> Bill King 260000.0Federal Taxes.15State Taxes.03Insurance.1300 Obscure
> Dr.ObscuriaIL60100AUSTRALIA
> Boss Man2 200000.0John DoeFred FinanceFederal Taxes.3State
> Taxes.07Insurance.051 Pretentious Drive.ChicagoIL60500UK
> Fred Finance 2150000.0Stacy AccountantFederal Taxes.3State
> Taxes.07Insurance.052 Pretentious Drive.ChicagoIL60500UK
> Stacy Accountant 260000.0Federal Taxes.15State Taxes.03Insurance.1300 Main
> St.NapervilleIL60563UK
> ********************
> Now when i navigate to:
> Contents of directory
> /user/hive/warehouse/db1.db/employees/country=USA/state=IL
> ********************
> I see all the records and was wondering if it should have only USA & IL
> records.
> Please help.
>
>
>
>
> --
> Nitin Pawar
>
>
>


-- 
Nitin Pawar

Re: Partitioning confusion

Posted by Sai Sai <sa...@yahoo.in>.
Nitin
I am still confused, from the below data that  i have given should the file which sits in the folder Country=USA and state=IL have only the rows where Country=USA and state=IL or will it have rows of other countries also.
The reason i ask is because if we have a 250GB file and would like to create 10 partitions that would end up in 2.5 TB * 3 = 7.5TB. Is this expected.
Thanks
S


________________________________
 From: Nitin Pawar <ni...@gmail.com>
To: user@hive.apache.org; Sai Sai <sa...@yahoo.in> 
Sent: Monday, 27 May 2013 2:08 PM
Subject: Re: Partitioning confusion
 


when you specify the load data query with specific partition, it will put the entire data into that partition. 




On Mon, May 27, 2013 at 1:08 PM, Sai Sai <sa...@yahoo.in> wrote:


>
>After creating a partition for a country (USA) and state (IL) and when we go to the the hdfs site to look at the partition in the browser we r seeing  all the records for all the countries and states rather than just for the partition created for US and IL given below, is this correct behavior:
>********************
>Here is my commands:
>********************
>
>
>
>CREATE TABLE employees (name STRING, salary FLOAT, subordinates ARRAY<STRING>, deductions MAP<STRING, FLOAT>, address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT, country:STRING> ) PARTITIONED BY (country STRING, state STRING);
>
>
>LOAD DATA LOCAL INPATH '/home/satish/data/employees/input/employees-country.txt' INTO TABLE employees PARTITION (country='USA',state='IL');
>
>
>********************
>
>Here is my original data file, where i have a few countries data such as USA, INDIA, UK, AUS:
>********************
>
>
>
>John Doe100000.0Mary SmithTodd JonesFederal Taxes.2State Taxes.05Insurance.11 Michigan Ave.ChicagoIL60600USA
>Mary Smith80000.0Bill KingFederal Taxes.2State Taxes.05Insurance.1100 Ontario St.ChicagoIL60601USA
>Todd Jones70000.0Federal Taxes.15State Taxes.03Insurance.1200 Chicago Ave.Oak ParkIL60700USA
>Bill King60000.0Federal Taxes.15State Taxes.03Insurance.1300 Obscure Dr.ObscuriaIL60100USA
>Boss Man200000.0John DoeFred FinanceFederal Taxes.3State Taxes.07Insurance.051 Pretentious Drive.ChicagoIL60500USA
>Fred Finance150000.0Stacy AccountantFederal Taxes.3State Taxes.07Insurance.052 Pretentious Drive.ChicagoIL60500USA
>Stacy Accountant60000.0Federal Taxes.15State Taxes.03Insurance.1300 Main St.NapervilleIL60563USA
>John Doe 2100000.0Mary SmithTodd JonesFederal Taxes.2State Taxes.05Insurance.11 Michigan Ave.ChicagoIL60600INDIA
>Mary Smith 280000.0Bill KingFederal Taxes.2State Taxes.05Insurance.1100 Ontario St.ChicagoIL60601INDIA
>Todd Jones 270000.0Federal Taxes.15State Taxes.03Insurance.1200 Chicago Ave.Oak ParkIL60700AUSTRALIA
>Bill King 260000.0Federal Taxes.15State Taxes.03Insurance.1300 Obscure Dr.ObscuriaIL60100AUSTRALIA
>Boss Man2 200000.0John DoeFred FinanceFederal Taxes.3State Taxes.07Insurance.051 Pretentious Drive.ChicagoIL60500UK
>Fred Finance 2150000.0Stacy AccountantFederal Taxes.3State Taxes.07Insurance.052 Pretentious Drive.ChicagoIL60500UK
>Stacy Accountant 260000.0Federal Taxes.15State Taxes.03Insurance.1300 Main St.NapervilleIL60563UK
>********************
>
>Now when i navigate to:
>Contents of directory /user/hive/warehouse/db1.db/employees/country=USA/state=IL
>
>********************
>
>I see all the records and was wondering if it should have only USA & IL records.
>Please help.


-- 
Nitin Pawar

Re: Partitioning confusion

Posted by Nitin Pawar <ni...@gmail.com>.
when you specify the load data query with specific partition, it will put
the entire data into that partition.



On Mon, May 27, 2013 at 1:08 PM, Sai Sai <sa...@yahoo.in> wrote:

>
> After creating a partition for a country (USA) and state (IL) and when we
> go to the the hdfs site to look at the partition in the browser we r seeing
>  all the records for all the countries and states rather than just for the
> partition created for US and IL given below, is this correct behavior:
> ********************
> Here is my commands:
> ********************
>
> CREATE TABLE employees (name STRING, salary FLOAT, subordinates
> ARRAY<STRING>, deductions MAP<STRING, FLOAT>, address STRUCT<street:STRING,
> city:STRING, state:STRING, zip:INT, country:STRING> ) PARTITIONED BY
> (country STRING, state STRING);
>
> LOAD DATA LOCAL INPATH
> '/home/satish/data/employees/input/employees-country.txt' INTO TABLE
> employees PARTITION (country='USA',state='IL');
>
> ********************
> Here is my original data file, where i have a few countries data such as
> USA, INDIA, UK, AUS:
> ********************
>
> John Doe100000.0Mary SmithTodd JonesFederal Taxes.2State
> Taxes.05Insurance.11 Michigan Ave.ChicagoIL60600USA
> Mary Smith80000.0Bill KingFederal Taxes.2State Taxes.05Insurance.1100
> Ontario St.ChicagoIL60601USA
> Todd Jones70000.0Federal Taxes.15State Taxes.03Insurance.1200 Chicago
> Ave.Oak ParkIL60700USA
> Bill King60000.0Federal Taxes.15State Taxes.03Insurance.1300 Obscure
> Dr.ObscuriaIL60100USA
> Boss Man200000.0John DoeFred FinanceFederal Taxes.3State
> Taxes.07Insurance.051 Pretentious Drive.ChicagoIL60500USA
> Fred Finance150000.0Stacy AccountantFederal Taxes.3State
> Taxes.07Insurance.052 Pretentious Drive.ChicagoIL60500USA
> Stacy Accountant60000.0Federal Taxes.15State Taxes.03Insurance.1300 Main
> St.NapervilleIL60563USA
> John Doe 2100000.0Mary SmithTodd JonesFederal Taxes.2State
> Taxes.05Insurance.11 Michigan Ave.ChicagoIL60600INDIA
> Mary Smith 280000.0Bill KingFederal Taxes.2State Taxes.05Insurance.1100
> Ontario St.ChicagoIL60601INDIA
> Todd Jones 270000.0Federal Taxes.15State Taxes.03Insurance.1200 Chicago
> Ave.Oak ParkIL60700AUSTRALIA
> Bill King 260000.0Federal Taxes.15State Taxes.03Insurance.1300 Obscure
> Dr.ObscuriaIL60100AUSTRALIA
> Boss Man2 200000.0John DoeFred FinanceFederal Taxes.3State
> Taxes.07Insurance.051 Pretentious Drive.ChicagoIL60500UK
> Fred Finance 2150000.0Stacy AccountantFederal Taxes.3State
> Taxes.07Insurance.052 Pretentious Drive.ChicagoIL60500UK
> Stacy Accountant 260000.0Federal Taxes.15State Taxes.03Insurance.1300 Main
> St.NapervilleIL60563UK
> ********************
> Now when i navigate to:
> Contents of directory
> /user/hive/warehouse/db1.db/employees/country=USA/state=IL
> ********************
> I see all the records and was wondering if it should have only USA & IL
> records.
> Please help.
>



-- 
Nitin Pawar