You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Panshul Whisper <ou...@gmail.com> on 2013/10/10 13:53:44 UTC
Data transformation query
Hello,
I have a data manipulation query.
I have my data in the following format:
*Date PLZ Count*
date1 plz1 count1
date1 plz1 count2
date1 plz1 count3
date1 plz2 count4
date1 plz2 count5
date1 plz3 count6
date1 plz3 count7
date2 plz1 count8
date2 plz1 count9
date2 plz3 count10
date2 plz3 count11
date3 plz1 count12
date3 plz1 count13
date3 plz2 count14
date3 plz2 count15
date3 plz2 count16
date4 plz1 count17
date4 plz2 count18
date4 plz3 count19
With Hive queries I have managed to get the data into the following grouped
and aggregated format:
select plz, TO_DATE(time), sum(totalcount) from power_pad_part where
yearfolder='2013' and monthfolder in ('01')
and eco=0 and TO_DATE(time)>='Date1' and TO_DATE(time)<'Date4'
and plz in ('plz1','plz2','plz3')
group by plz, time
*PLZ Date TotalCount*
plz1 date1 TC1
plz1 date2 TC2
plz1 date3 TC3
plz1 date4 TC4
plz2 date1 TC5
plz2 date3 TC6
plz2 date4 TC7
plz3 date1 TC8
plz3 date2 TC9
plz3 date4 TC10
The above data is grouped by plz and date. There is no entry of a plz if it
does not exist for a date.
I used the following query to generate the above data:
But I require the data to be in the following format:
*PLZ Date1 Date2 Date3 Date4 ..... DateN .....*
plz1 TC1 TC2 TC3 TC4 ..... TC N
plz2 TC5 ------ TC6 TC7 ..... TC N
plz3 TC8 TC9 ------ TC10 .... TC N
This needs to be generated as 1 column for every date of the year.
Any help to generate this format is welcome. Writing my own mapper and
reduce and calling in Hive as function is to be my last option.
Thanking You,
--
Regards,
Ouch Whisper
010101010101
Re: Data transformation query
Posted by Panshul Whisper <ou...@gmail.com>.
Thank you Selva
for the response. But the data is too huge to be handled by Excel, thus I
need to do it with hive.
On Thu, Oct 10, 2013 at 2:31 PM, Selvamohan Neethiraj <
selva.apache@infotekies.com> wrote:
> If it is not so much data, you can use Excel's PivotTable to solve this
> specific requirement:
>
> 1. Select the date and plz columns (w/o header) and create pivotTable on a
> new WorkSheet
> 2. Drag the column name: 'Date' from the 'Pivot Table' Builder to the
> 'Column Label' section.
> 3. Drag the column name: 'PLZ' from the 'Pivot Table' Builder to the
> 'Row Labels'
> 4. Drag the column name: 'Date' from the 'Pivot Table' Builder to the
> 'Values' and ensure the selected summary function is 'Count'
>
> Hope, this helps,
>
> Thanks,
> Selva-
>
>
> On Oct 10, 2013, at 7:53 AM, Panshul Whisper <ou...@gmail.com>
> wrote:
>
> Hello,
>
> I have a data manipulation query.
>
> I have my data in the following format:
>
> *Date PLZ Count*
> date1 plz1 count1
> date1 plz1 count2
> date1 plz1 count3
> date1 plz2 count4
> date1 plz2 count5
> date1 plz3 count6
> date1 plz3 count7
>
> date2 plz1 count8
> date2 plz1 count9
> date2 plz3 count10
> date2 plz3 count11
>
> date3 plz1 count12
> date3 plz1 count13
> date3 plz2 count14
> date3 plz2 count15
> date3 plz2 count16
>
> date4 plz1 count17
> date4 plz2 count18
> date4 plz3 count19
>
> With Hive queries I have managed to get the data into the following
> grouped and aggregated format:
>
> select plz, TO_DATE(time), sum(totalcount) from power_pad_part where
> yearfolder='2013' and monthfolder in ('01')
> and eco=0 and TO_DATE(time)>='Date1' and TO_DATE(time)<'Date4'
> and plz in ('plz1','plz2','plz3')
> group by plz, time
>
> *PLZ Date TotalCount*
> plz1 date1 TC1
> plz1 date2 TC2
> plz1 date3 TC3
> plz1 date4 TC4
>
> plz2 date1 TC5
> plz2 date3 TC6
> plz2 date4 TC7
>
> plz3 date1 TC8
> plz3 date2 TC9
> plz3 date4 TC10
>
> The above data is grouped by plz and date. There is no entry of a plz if
> it does not exist for a date.
> I used the following query to generate the above data:
>
>
>
> But I require the data to be in the following format:
>
> *PLZ Date1 Date2 Date3 Date4 ..... DateN .....*
> plz1 TC1 TC2 TC3 TC4 ..... TC N
> plz2 TC5 ------ TC6 TC7 ..... TC N
> plz3 TC8 TC9 ------ TC10 .... TC N
>
>
> This needs to be generated as 1 column for every date of the year.
>
> Any help to generate this format is welcome. Writing my own mapper and
> reduce and calling in Hive as function is to be my last option.
>
> Thanking You,
>
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>
>
>
--
Regards,
Ouch Whisper
010101010101
Re: Data transformation query
Posted by Selvamohan Neethiraj <se...@infotekies.com>.
If it is not so much data, you can use Excel's PivotTable to solve this specific requirement:
1. Select the date and plz columns (w/o header) and create pivotTable on a new WorkSheet
2. Drag the column name: 'Date' from the 'Pivot Table' Builder to the 'Column Label' section.
3. Drag the column name: 'PLZ' from the 'Pivot Table' Builder to the 'Row Labels'
4. Drag the column name: 'Date' from the 'Pivot Table' Builder to the 'Values' and ensure the selected summary function is 'Count'
Hope, this helps,
Thanks,
Selva-
On Oct 10, 2013, at 7:53 AM, Panshul Whisper <ou...@gmail.com> wrote:
> Hello,
>
> I have a data manipulation query.
>
> I have my data in the following format:
>
> Date PLZ Count
> date1 plz1 count1
> date1 plz1 count2
> date1 plz1 count3
> date1 plz2 count4
> date1 plz2 count5
> date1 plz3 count6
> date1 plz3 count7
>
> date2 plz1 count8
> date2 plz1 count9
> date2 plz3 count10
> date2 plz3 count11
>
> date3 plz1 count12
> date3 plz1 count13
> date3 plz2 count14
> date3 plz2 count15
> date3 plz2 count16
>
> date4 plz1 count17
> date4 plz2 count18
> date4 plz3 count19
>
> With Hive queries I have managed to get the data into the following grouped and aggregated format:
>
> select plz, TO_DATE(time), sum(totalcount) from power_pad_part where
> yearfolder='2013' and monthfolder in ('01')
> and eco=0 and TO_DATE(time)>='Date1' and TO_DATE(time)<'Date4'
> and plz in ('plz1','plz2','plz3')
> group by plz, time
>
> PLZ Date TotalCount
> plz1 date1 TC1
> plz1 date2 TC2
> plz1 date3 TC3
> plz1 date4 TC4
>
> plz2 date1 TC5
> plz2 date3 TC6
> plz2 date4 TC7
>
> plz3 date1 TC8
> plz3 date2 TC9
> plz3 date4 TC10
>
> The above data is grouped by plz and date. There is no entry of a plz if it does not exist for a date.
> I used the following query to generate the above data:
>
>
>
> But I require the data to be in the following format:
>
> PLZ Date1 Date2 Date3 Date4 ..... DateN .....
> plz1 TC1 TC2 TC3 TC4 ..... TC N
> plz2 TC5 ------ TC6 TC7 ..... TC N
> plz3 TC8 TC9 ------ TC10 .... TC N
>
>
> This needs to be generated as 1 column for every date of the year.
>
> Any help to generate this format is welcome. Writing my own mapper and reduce and calling in Hive as function is to be my last option.
>
> Thanking You,
>
>
> --
> Regards,
> Ouch Whisper
> 010101010101