You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Panshul Whisper <ou...@gmail.com> on 2013/10/10 13:53:44 UTC

Data transformation query

Hello,

I have a data manipulation query.

I have my data in the following format:

*Date   PLZ   Count*
date1   plz1   count1
date1   plz1   count2
date1   plz1   count3
date1   plz2   count4
date1   plz2   count5
date1   plz3   count6
date1   plz3   count7

date2   plz1   count8
date2   plz1   count9
date2   plz3   count10
date2   plz3   count11

date3   plz1   count12
date3   plz1   count13
date3   plz2   count14
date3   plz2   count15
date3   plz2   count16

date4   plz1   count17
date4   plz2   count18
date4   plz3   count19

With Hive queries I have managed to get the data into the following grouped
and aggregated format:

select plz, TO_DATE(time), sum(totalcount) from power_pad_part where
yearfolder='2013' and monthfolder in ('01')
and eco=0 and TO_DATE(time)>='Date1' and TO_DATE(time)<'Date4'
and plz in ('plz1','plz2','plz3')
group by plz, time

*PLZ   Date   TotalCount*
plz1   date1   TC1
plz1   date2   TC2
plz1   date3   TC3
plz1   date4   TC4

plz2   date1   TC5
plz2   date3   TC6
plz2   date4   TC7

plz3   date1   TC8
plz3   date2   TC9
plz3   date4   TC10

The above data is grouped by plz and date. There is no entry of a plz if it
does not exist for a date.
I used the following query to generate the above data:



But I require the data to be in the following format:

*PLZ   Date1   Date2   Date3   Date4 ..... DateN .....*
plz1   TC1       TC2      TC3       TC4   .....  TC N
plz2   TC5       ------      TC6       TC7   .....  TC N
plz3   TC8       TC9      ------       TC10  ....  TC N


This needs to be generated as 1 column for every date of the year.

Any help to generate this format is welcome. Writing my own mapper and
reduce and calling in Hive as function  is to be my last option.

Thanking You,


-- 
Regards,
Ouch Whisper
010101010101

Re: Data transformation query

Posted by Panshul Whisper <ou...@gmail.com>.
Thank you Selva

for the response. But the data is too huge to be handled by Excel, thus I
need to do it with hive.


On Thu, Oct 10, 2013 at 2:31 PM, Selvamohan Neethiraj <
selva.apache@infotekies.com> wrote:

> If it is not so much data, you can use Excel's PivotTable to solve this
> specific requirement:
>
> 1. Select the date and plz columns (w/o header) and create pivotTable on a
> new WorkSheet
> 2. Drag the column name: 'Date' from the 'Pivot Table' Builder  to the
> 'Column Label' section.
> 3. Drag the column name: 'PLZ' from the 'Pivot Table' Builder  to the
> 'Row Labels'
> 4. Drag the column name: 'Date' from the 'Pivot Table' Builder  to the
> 'Values' and ensure the selected summary function is 'Count'
>
> Hope, this helps,
>
> Thanks,
> Selva-
>
>
> On Oct 10, 2013, at 7:53 AM, Panshul Whisper <ou...@gmail.com>
> wrote:
>
> Hello,
>
> I have a data manipulation query.
>
> I have my data in the following format:
>
> *Date   PLZ   Count*
> date1   plz1   count1
> date1   plz1   count2
> date1   plz1   count3
> date1   plz2   count4
> date1   plz2   count5
> date1   plz3   count6
> date1   plz3   count7
>
> date2   plz1   count8
> date2   plz1   count9
> date2   plz3   count10
> date2   plz3   count11
>
> date3   plz1   count12
> date3   plz1   count13
> date3   plz2   count14
> date3   plz2   count15
> date3   plz2   count16
>
> date4   plz1   count17
> date4   plz2   count18
> date4   plz3   count19
>
> With Hive queries I have managed to get the data into the following
> grouped and aggregated format:
>
> select plz, TO_DATE(time), sum(totalcount) from power_pad_part where
> yearfolder='2013' and monthfolder in ('01')
> and eco=0 and TO_DATE(time)>='Date1' and TO_DATE(time)<'Date4'
> and plz in ('plz1','plz2','plz3')
> group by plz, time
>
> *PLZ   Date   TotalCount*
> plz1   date1   TC1
> plz1   date2   TC2
> plz1   date3   TC3
> plz1   date4   TC4
>
> plz2   date1   TC5
> plz2   date3   TC6
> plz2   date4   TC7
>
> plz3   date1   TC8
> plz3   date2   TC9
> plz3   date4   TC10
>
> The above data is grouped by plz and date. There is no entry of a plz if
> it does not exist for a date.
> I used the following query to generate the above data:
>
>
>
> But I require the data to be in the following format:
>
> *PLZ   Date1   Date2   Date3   Date4 ..... DateN .....*
> plz1   TC1       TC2      TC3       TC4   .....  TC N
> plz2   TC5       ------      TC6       TC7   .....  TC N
> plz3   TC8       TC9      ------       TC10  ....  TC N
>
>
> This needs to be generated as 1 column for every date of the year.
>
> Any help to generate this format is welcome. Writing my own mapper and
> reduce and calling in Hive as function  is to be my last option.
>
> Thanking You,
>
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>
>
>


-- 
Regards,
Ouch Whisper
010101010101

Re: Data transformation query

Posted by Selvamohan Neethiraj <se...@infotekies.com>.
If it is not so much data, you can use Excel's PivotTable to solve this specific requirement:

1. Select the date and plz columns (w/o header) and create pivotTable on a new WorkSheet
2. Drag the column name: 'Date' from the 'Pivot Table' Builder  to the 'Column Label' section.
3. Drag the column name: 'PLZ' from the 'Pivot Table' Builder  to the  'Row Labels'
4. Drag the column name: 'Date' from the 'Pivot Table' Builder  to the  'Values' and ensure the selected summary function is 'Count'

Hope, this helps,

Thanks,
Selva-



On Oct 10, 2013, at 7:53 AM, Panshul Whisper <ou...@gmail.com> wrote:

> Hello,
> 
> I have a data manipulation query.
> 
> I have my data in the following format:
> 
> Date   PLZ   Count
> date1   plz1   count1
> date1   plz1   count2
> date1   plz1   count3
> date1   plz2   count4
> date1   plz2   count5
> date1   plz3   count6
> date1   plz3   count7
> 
> date2   plz1   count8
> date2   plz1   count9
> date2   plz3   count10
> date2   plz3   count11
> 
> date3   plz1   count12
> date3   plz1   count13
> date3   plz2   count14
> date3   plz2   count15
> date3   plz2   count16
> 
> date4   plz1   count17
> date4   plz2   count18
> date4   plz3   count19
> 
> With Hive queries I have managed to get the data into the following grouped and aggregated format:
> 
> select plz, TO_DATE(time), sum(totalcount) from power_pad_part where
> yearfolder='2013' and monthfolder in ('01')
> and eco=0 and TO_DATE(time)>='Date1' and TO_DATE(time)<'Date4'
> and plz in ('plz1','plz2','plz3')
> group by plz, time
> 
> PLZ   Date   TotalCount
> plz1   date1   TC1
> plz1   date2   TC2
> plz1   date3   TC3
> plz1   date4   TC4
> 
> plz2   date1   TC5
> plz2   date3   TC6
> plz2   date4   TC7
> 
> plz3   date1   TC8
> plz3   date2   TC9
> plz3   date4   TC10
> 
> The above data is grouped by plz and date. There is no entry of a plz if it does not exist for a date.
> I used the following query to generate the above data:
> 
> 
> 
> But I require the data to be in the following format:
> 
> PLZ   Date1   Date2   Date3   Date4 ..... DateN .....
> plz1   TC1       TC2      TC3       TC4   .....  TC N
> plz2   TC5       ------      TC6       TC7   .....  TC N
> plz3   TC8       TC9      ------       TC10  ....  TC N
> 
> 
> This needs to be generated as 1 column for every date of the year.
> 
> Any help to generate this format is welcome. Writing my own mapper and reduce and calling in Hive as function  is to be my last option.
> 
> Thanking You,
> 
> 
> -- 
> Regards,
> Ouch Whisper
> 010101010101