Posted to user@impala.apache.org by Fawze Abujaber <fa...@gmail.com> on 2018/12/15 07:35:35 UTC

Impala table partitions

Hi Community.

I would like to create an external table on top of these HDFS files,
partitioned by year, month and day. Is it possible to create one table on
top of these files?

/tmp/account=aaaa/year=2018/month=01/day=01
/tmp/account=aaaa/year=2018/month=01/day=02
/tmp/account=bbbb/year=2018/month=01/day=01
/tmp/account=bbbb/year=2018/month=01/day=02

Creating a table with:

PARTITIONED BY (
  year INT,
  month INT,
  day INT
)
STORED AS PARQUET
LOCATION '/tmp'

is not working for me.

Adding the account as a partition column creates millions of partitions,
which I want to avoid. In the background I have a compaction job that
compacts the small files under each day partition.
-- 
Take Care
Fawze Abujaber

Re: Impala table partitions

Posted by Fawze Abujaber <fa...@gmail.com>.

Thanks, guys, for your quick responses. Yes, in my case the account is a
folder and not a file, so I need to find another design that can support my
variable retention per account. Hive is not a viable solution; we still
need to use Impala as the query engine.


-- 
Take Care
Fawze Abujaber

Re: Impala table partitions

Posted by Zoltán Borók-Nagy <bo...@cloudera.com>.

Hi,

Yes, and the account column must be present in the data files, otherwise
Impala won't see it.
If that's the case you'll need to write a bit more complex job than a copy.

BR,
     Zoltan




Re: Impala table partitions

Posted by Quanlong Huang <hu...@gmail.com>.

Yes if those are file (not directory) names.

However, if /tmp/table1/year/month/day/account=aaaa is a directory and your
partition location is /tmp/table1/year/month/day, Impala can't read the
underlying files recursively. There's a JIRA for supporting recursive
reads: https://issues.apache.org/jira/browse/IMPALA-4596


Re: Impala table partitions

Posted by Fawze Abujaber <fa...@gmail.com>.

Thanks, Quanlong, for your response. I wrote code that creates these
partitions so that I can define and manage variable retention per
account.

If I lay out my files like this, can I conclude that it will work for me
with partitioning by year, month and day?
/tmp/table1/year/month/day/account=aaaa
/tmp/table1/year/month/day/account=bbbb

Re: Impala table partitions

Posted by Quanlong Huang <hu...@gmail.com>.

Hi Fawze,

A Hive partition can have only one location, so partition
(year=2018/month=01/day=01) can't point to
both /tmp/account=aaaa/year=2018/month=01/day=01
and /tmp/account=bbbb/year=2018/month=01/day=01.
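
To make the conflict concrete, here is a small Python sketch (plain Python,
not Impala code; the helper name partition_key is made up for illustration).
It groups the four directories from the original post by their
(year, month, day) values. Each key ends up with two candidate directories,
while a partition can point to only one.

```python
from collections import defaultdict

# The four directories from the original post.
DIRS = [
    "/tmp/account=aaaa/year=2018/month=01/day=01",
    "/tmp/account=aaaa/year=2018/month=01/day=02",
    "/tmp/account=bbbb/year=2018/month=01/day=01",
    "/tmp/account=bbbb/year=2018/month=01/day=02",
]


def partition_key(path):
    """Return the (year, month, day) values from a key=value style path."""
    parts = dict(p.split("=", 1) for p in path.split("/") if "=" in p)
    return (int(parts["year"]), int(parts["month"]), int(parts["day"]))


# Collect every directory that would have to back the same partition.
locations = defaultdict(list)
for d in DIRS:
    locations[partition_key(d)].append(d)

for key, dirs in sorted(locations.items()):
    print(key, "->", len(dirs), "directories")
```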

For your problem, you need to reorganize the directory hierarchy to match
the partition definition: move all files in
/tmp/account=*/year=2018/month=01/day=01 into
/somewhere/year=2018/month=01/day=01.
As you mentioned you have millions of accounts, you may need to do this in
parallel via a map-only MapReduce job or a Spark job.

For example, to write a MapReduce job for this:
(1) Create a text file listing all these directory names.
(2) Use NLineInputFormat as the InputFormat, with the text file as input.
(3) Each mapper processes N directories, moving the files in
/tmp/account=*/year=YYYY/month=MM/day=DD into
/somewhere/year=YYYY/month=MM/day=DD (creating the directory if it does not
exist).
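
The path mapping in step (3) can be sketched in plain Python as below. This
is only an illustration of how a source path maps to its destination, not the
MapReduce job itself and not an actual HDFS move. The function name
target_path, the file name part-0.parq and the account prefix on the file
name are all assumptions added here; the prefix is one possible way to keep
files from different accounts from overwriting each other.

```python
import posixpath


def target_path(src_file, dest_root="/somewhere"):
    """Map a file under .../account=A/year=Y/month=M/day=D/ to the matching
    day-level partition directory under dest_root, dropping the account level."""
    directory, name = posixpath.split(src_file)

    # Parse the key=value components of the directory part of the path.
    parts = {}
    for component in directory.split("/"):
        if "=" in component:
            key, value = component.split("=", 1)
            parts[key] = value

    day_dir = posixpath.join(
        dest_root,
        "year=" + parts["year"],
        "month=" + parts["month"],
        "day=" + parts["day"],
    )
    # Assumption: prefix the account so that same-named files from
    # different accounts do not collide in the shared day directory.
    return posixpath.join(day_dir, parts["account"] + "-" + name)


# Hypothetical file name, used only for illustration.
print(target_path("/tmp/account=aaaa/year=2018/month=01/day=01/part-0.parq"))
```

The prefix only keeps the account in the file name. As Zoltán notes elsewhere
in this thread, Impala will see account as a column only if it is present in
the data files themselves.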

HTH
Quanlong
