You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@impala.apache.org by Fawze Abujaber <fa...@gmail.com> on 2018/07/04 10:11:41 UTC

SET DDL in impala

Hi Community,

I need a quick consult for using the SET DDL while i'm using impala VIP for
the load balancing.

Below are 2 ETL jobs, the first one is using hive and the second is impala
and both have DDL operations.

I will start with the 2nd as it should stright forward since the data
inserted into the table_name2 by impala, so i think i don't need the
invalidate statement but still need the DDL statement, is it right?
should i add the set DDL after ALTER table statement? do you really need it
as the insert isn't based on the ALTER table statement since it's dropping
old partitions.

For the 1st one where i'm using hive, I know i have to use invalidate
metadata, i was thinking to use the set DDL=1 after each invalidate
metadata statement, should i use after both statements, or it's enough to
be used only in the last one?
I read the documntation about this but wasn't sure if get this point right,
so this is why i'm asking for consultation here, note a side i still think
that i should use  refresh statement after the insert while using
invalidate after the ALTER statement.

Thanks for your help in advance.



RetentionDate=$(date --date="183 days ago" +%Y%m%d)
impala-shell -i "$LPIMPALA" -q "ALTER TABLE table_name DROP IF EXISTS
PARTITION (concat(year,month,day)<'$RetentionDate');"
impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name;"

hive -e"
set parquet.compression=SNAPPY;
set hive.merge.mapredfiles=true;
set hive.mapred.mode=nonstrict;
set hive.exec.dynamic.partition.mode=nonstrict;


INSERT OVERWRITE TABLE table_name PARTITION (year, month, day)

....
....
....
....

;"
impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name;"
impala-shell -i "$LPIMPALA" -q "compute INCREMENTAL STATS table_name;"




=======================================================

2nd one.

impala-shell -i "$LPIMPALA" -q "ALTER TABLE table_name2 DROP IF EXISTS
PARTITION (dt<'$RetentionDate');"
impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name2;"

impala-shell -B -i "$LPIMPALA" -q"


INSERT OVERWRITE TABLE table_name2 PARTITION (dt)
....
....
....
....
"

impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name2;"

-- 
Take Care
Fawze Abujaber

Re:Re: Re: SET DDL in impala

Posted by Quanlong Huang <hu...@126.com>.
Yes. Once you add "SYNC_DDL=1" to the default query option, you don't need to set it anywhere in the queries.


Regards,
Quanlong

At 2018-07-04 23:11:11,"Fawze Abujaber" <fa...@gmail.com> wrote:

Thanks Alot,


Just to close the loop for my self for SYN_DDL i can add  "SYNC_DDL=1" in "Impala Daemon Query Options Advanced Configuration Snippet in Cloudera manager and i shouldn't add it anywhere in the queries ( even the DDL was done in hive or impala) , right?


For the invalidate and REFRESH, i think i got the point and it much clear for me.


Quanlong, so much appreciate your help here.







On Wed, Jul 4, 2018 at 5:34 PM, Quanlong Huang <hu...@126.com> wrote:

> For your 2nd job, you don't need to run INVALIDATE METADATA statements since DDL and DML in impala will refresh the metadata automatically.
> -- For the invalidate i almost sure i don't need, but what about the SET SYNC_DDL=1, should i use it in the 2nd case, and should i add it after the ALTER statement?


You mentioned that you have load balancing for impalads. So you should add "SET SYNC_DDL=1" before the ALTER statement and in the same session. For example, in your 2nd job:
impala-shell -i "$LPIMPALA" -q "SET SYNC_DDL=1; ALTER TABLE table_name2 DROP IF EXISTS PARTITION (dt<'$RetentionDate');"

> For your 1st job, again, you don't need to run the INVALIDATE METADATA statement after your ALTER TABLE statement in Impala. After the INSERT OVERWRITE statement in hive, you just need to run a REFRESH statement instead of the INVALIDATE METADATA statement.
> -- So you mean i don't need after dropping some partitions to run refresh statement even? after the inserti need the refresh.

Yes. The REFRESH statement is for modifications outside Impala (e.g. in hive). If you run ALTER TABLE inside Impala, you don't need the REFRESH statement.

> What about  SET SYNC_DDL=1? where i should set it, is it something that i can do this globally at impala and not bu query?


If you're accessing Impala through a load-balancing proxy, you should SET SYNC_DDL=1 before any DDL or DML statements. There's a more detail doc explaining this: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/impala_sync_ddl.html
You can set it globally by adding --default_query_options=SYNC_DDL=1 in the startup options. If you're using Cloudera Manager, you can add "SYNC_DDL=1" in "Impala Daemon Query Options Advanced Configuration Snippet (Safety Valve)".


> Do you recommend me to use REFRSH or ALTER table x Recover partitions?


I'm not quite sure about this. But I'm sure that if you're just adding new partitions into a table, you can use ALTER TABLE x Recover partitions. Otherwise, I believe you should not use it.

At 2018-07-04 21:39:29,"Fawze Abujaber" <fa...@gmail.com> wrote:

Hi Quanlong,


Thanks for your quick response.


This is an ETLs that i inherent from other engineers, so trying to make sure the changes i'm planning to do will not impact the current work.


Please see my responses inline.






On Wed, Jul 4, 2018 at 4:23 PM, Quanlong Huang <hu...@126.com> wrote:

Hi Fawze,


As you need a quick consult, let me answer your questions first. The official Cloudera folks will correct me if something wrong.


For your 2nd job, you don't need to run INVALIDATE METADATA statements since DDL and DML in impala will refresh the metadata automatically.
-- For the invalidate i almost sure i don't need, but what about the SET SYNC_DDL=1, should i use it in the 2nd case, and should i add it after the ALTER statement?
For your 1st job, again, you don't need to run the INVALIDATE METADATA statement after your ALTER TABLE statement in Impala. After the INSERT OVERWRITE statement in hive, you just need to run a REFRESH statement instead of the INVALIDATE METADATA statement.
-- So you mean i don't need after dropping some partitions to run refresh statement even? after the insert i need the refresh.
What about  SET SYNC_DDL=1? where i should set it, is it something that i can do this globally at impala and not bu query?
Do you recommend me to use REFRSH or ALTER table x Recover partitions?
BTW, what do you mean by "set DDL=1"? If you're mentioning the SYNC_DDL option, it's recommended to set it to true if you're using load balancer above of the impala daemons.


If you're interested in the difference between INVALIDATE METADATA and REFRESH, there's a discussion thread: 
https://lists.apache.org/thread.html/44b4f223b49e8d7cb07865d5dfba50378d6e41fc173369b56c36b652@%3Cdev.impala.apache.org%3E


In a nutshell, there're only two cases you should use the INVALIDATE METADATA statement:
- you create a table in hive and want to use it in Impala
- the block locations of the underlying hive table files changed by HDFS balancer
You can ignore the second case since it's just for performance, not for correctness. In other cases like adding/dropping files/partitions in hive, you just need the REFRESH statement.


Regards,
Quanlong



At 2018-07-04 18:11:41, "Fawze Abujaber" <fa...@gmail.com> wrote:

Hi Community,


I need a quick consult for using the SET DDL while i'm using impala VIP for the load balancing.


Below are 2 ETL jobs, the first one is using hive and the second is impala and both have DDL operations.


I will start with the 2nd as it should stright forward since the data inserted into the table_name2 by impala, so i think i don't need the invalidate statement but still need the DDL statement, is it right?
should i add the set DDL after ALTER table statement? do you really need it as the insert isn't based on the ALTER table statement since it's dropping old partitions.


For the 1st one where i'm using hive, I know i have to use invalidate metadata, i was thinking to use the set DDL=1 after each invalidate metadata statement, should i use after both statements, or it's enough to be used only in the last one?
I read the documntation about this but wasn't sure if get this point right, so this is why i'm asking for consultation here, note a side i still think that i should use  refresh statement after the insert while using invalidate after the ALTER statement.


Thanks for your help in advance.






RetentionDate=$(date --date="183 days ago" +%Y%m%d)
impala-shell -i "$LPIMPALA" -q "ALTER TABLE table_name DROP IF EXISTS PARTITION (concat(year,month,day)<'$RetentionDate');"
impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name;"


hive -e"
set parquet.compression=SNAPPY;
set hive.merge.mapredfiles=true;
set hive.mapred.mode=nonstrict;
set hive.exec.dynamic.partition.mode=nonstrict;




INSERT OVERWRITE TABLE table_name PARTITION (year, month, day)


....
....
....
....


;"
impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name;"
impala-shell -i "$LPIMPALA" -q "compute INCREMENTAL STATS table_name;"








=======================================================


2nd one.


impala-shell -i "$LPIMPALA" -q "ALTER TABLE table_name2 DROP IF EXISTS PARTITION (dt<'$RetentionDate');"
impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name2;"


impala-shell -B -i "$LPIMPALA" -q"




INSERT OVERWRITE TABLE table_name2 PARTITION (dt)
....
....
....
....
"


impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name2;"


--

Take Care
Fawze Abujaber





--

Take Care
Fawze Abujaber





--

Take Care
Fawze Abujaber

Re: Re: SET DDL in impala

Posted by Fawze Abujaber <fa...@gmail.com>.
Thanks Alot,

Just to close the loop for my self for SYN_DDL i can add  "SYNC_DDL=1" in
"Impala Daemon Query Options Advanced Configuration Snippet in Cloudera
manager and i shouldn't add it anywhere in the queries ( even the DDL was
done in hive or impala) , right?

For the invalidate and REFRESH, i think i got the point and it much clear
for me.

Quanlong, so much appreciate your help here.



On Wed, Jul 4, 2018 at 5:34 PM, Quanlong Huang <hu...@126.com>
wrote:

> > For your 2nd job, you don't need to run INVALIDATE METADATA statements
> since DDL and DML in impala will refresh the metadata automatically.
> > -- For the invalidate i almost sure i don't need, but what about
> the SET SYNC_DDL=1, should i use it in the 2nd case, and should i add it
> after the ALTER statement?
>
> You mentioned that you have load balancing for impalads. So you should add
> "SET SYNC_DDL=1" *before* the ALTER statement and in the same session.
> For example, in your 2nd job:
> impala-shell -i "$LPIMPALA" -q "*SET SYNC_DDL=1;* ALTER TABLE table_name2
> DROP IF EXISTS PARTITION (dt<'$RetentionDate');"
>
> > For your 1st job, again, you don't need to run the INVALIDATE METADATA
> statement after your ALTER TABLE statement in Impala. After the INSERT
> OVERWRITE statement in hive, you just need to run a REFRESH statement
> instead of the INVALIDATE METADATA statement.
> > -- So you mean i don't need after dropping some partitions to run
> refresh statement even? after the insert i need the refresh.
>
> Yes. The REFRESH statement is for modifications outside Impala (e.g. in
> hive). If you run ALTER TABLE inside Impala, you don't need the REFRESH
> statement.
>
> > What about  SET SYNC_DDL=1? where i should set it, is it something that
> i can do this globally at impala and not bu query?
>
> If you're accessing Impala through a load-balancing proxy, you should SET
> SYNC_DDL=1 before any DDL or DML statements. There's a more detail doc
> explaining this: https://www.cloudera.com/documentation/enterprise/
> 5-14-x/topics/impala_sync_ddl.html
> You can set it globally by adding --default_query_options=SYNC_DDL=1 in
> the startup options. If you're using Cloudera Manager, you can add
> "SYNC_DDL=1" in "Impala Daemon Query Options Advanced Configuration Snippet
> (Safety Valve)".
>
> > Do you recommend me to use REFRSH or ALTER table x Recover partitions?
>
> I'm not quite sure about this. But I'm sure that if you're just adding new
> partitions into a table, you can use ALTER TABLE x Recover partitions.
> Otherwise, I believe you should not use it.
>
> At 2018-07-04 21:39:29,"Fawze Abujaber" <fa...@gmail.com> wrote:
>
> Hi Quanlong,
>
> Thanks for your quick response.
>
> This is an ETLs that i inherent from other engineers, so trying to make
> sure the changes i'm planning to do will not impact the current work.
>
> Please see my responses inline.
>
>
>
> On Wed, Jul 4, 2018 at 4:23 PM, Quanlong Huang <hu...@126.com>
> wrote:
>
>> Hi Fawze,
>>
>> As you need a quick consult, let me answer your questions first. The
>> official Cloudera folks will correct me if something wrong.
>>
>> For your 2nd job, you don't need to run INVALIDATE METADATA statements
>> since DDL and DML in impala will refresh the metadata automatically.
>> -- For the invalidate i almost sure i don't need, but what about the SET
>> SYNC_DDL=1, should i use it in the 2nd case, and should i add it after the
>> ALTER statement?
>> For your 1st job, again, you don't need to run the INVALIDATE METADATA
>> statement after your ALTER TABLE statement in Impala. After the INSERT
>> OVERWRITE statement in hive, you just need to run a REFRESH statement
>> instead of the INVALIDATE METADATA statement.
>> -- So you mean i don't need after dropping some partitions to run refresh
>> statement even? after the insert i need the refresh.
>>
> What about  SET SYNC_DDL=1? where i should set it, is it something that i
> can do this globally at impala and not bu query?
> Do you recommend me to use REFRSH or ALTER table x Recover partitions?
>
>> BTW, what do you mean by "set DDL=1"? If you're mentioning the SYNC_DDL
>> option, it's recommended to set it to true if you're using load balancer
>> above of the impala daemons.
>>
>> If you're interested in the difference between INVALIDATE METADATA and
>> REFRESH, there's a discussion thread:
>> https://lists.apache.org/thread.html/44b4f223b49e8d7cb07865d
>> 5dfba50378d6e41fc173369b56c36b652@%3Cdev.impala.apache.org%3E
>>
>> In a nutshell, there're only two cases you should use the INVALIDATE
>> METADATA statement:
>> - you create a table in hive and want to use it in Impala
>> - the block locations of the underlying hive table files changed by HDFS
>> balancer
>> You can ignore the second case since it's just for performance, not for
>> correctness. In other cases like adding/dropping files/partitions in hive,
>> you just need the REFRESH statement.
>>
>> Regards,
>> Quanlong
>>
>>
>> At 2018-07-04 18:11:41, "Fawze Abujaber" <fa...@gmail.com> wrote:
>>
>> Hi Community,
>>
>> I need a quick consult for using the SET DDL while i'm using impala VIP
>> for the load balancing.
>>
>> Below are 2 ETL jobs, the first one is using hive and the second is
>> impala and both have DDL operations.
>>
>> I will start with the 2nd as it should stright forward since the data
>> inserted into the table_name2 by impala, so i think i don't need the
>> invalidate statement but still need the DDL statement, is it right?
>> should i add the set DDL after ALTER table statement? do you really need
>> it as the insert isn't based on the ALTER table statement since it's
>> dropping old partitions.
>>
>> For the 1st one where i'm using hive, I know i have to use invalidate
>> metadata, i was thinking to use the set DDL=1 after each invalidate
>> metadata statement, should i use after both statements, or it's enough to
>> be used only in the last one?
>> I read the documntation about this but wasn't sure if get this point
>> right, so this is why i'm asking for consultation here, note a side i still
>> think that i should use  refresh statement after the insert while using
>> invalidate after the ALTER statement.
>>
>> Thanks for your help in advance.
>>
>>
>>
>> RetentionDate=$(date --date="183 days ago" +%Y%m%d)
>> impala-shell -i "$LPIMPALA" -q "ALTER TABLE table_name DROP IF EXISTS
>> PARTITION (concat(year,month,day)<'$RetentionDate');"
>> impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name;"
>>
>> hive -e"
>> set parquet.compression=SNAPPY;
>> set hive.merge.mapredfiles=true;
>> set hive.mapred.mode=nonstrict;
>> set hive.exec.dynamic.partition.mode=nonstrict;
>>
>>
>> INSERT OVERWRITE TABLE table_name PARTITION (year, month, day)
>>
>> ....
>> ....
>> ....
>> ....
>>
>> ;"
>> impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name;"
>> impala-shell -i "$LPIMPALA" -q "compute INCREMENTAL STATS table_name;"
>>
>>
>>
>>
>> =======================================================
>>
>> 2nd one.
>>
>> impala-shell -i "$LPIMPALA" -q "ALTER TABLE table_name2 DROP IF EXISTS
>> PARTITION (dt<'$RetentionDate');"
>> impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name2;"
>>
>> impala-shell -B -i "$LPIMPALA" -q"
>>
>>
>> INSERT OVERWRITE TABLE table_name2 PARTITION (dt)
>> ....
>> ....
>> ....
>> ....
>> "
>>
>> impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name2;"
>>
>> --
>> Take Care
>> Fawze Abujaber
>>
>>
>
>
> --
> Take Care
> Fawze Abujaber
>
>


-- 
Take Care
Fawze Abujaber

Re:Re: SET DDL in impala

Posted by Quanlong Huang <hu...@126.com>.
> For your 2nd job, you don't need to run INVALIDATE METADATA statements since DDL and DML in impala will refresh the metadata automatically.
> -- For the invalidate i almost sure i don't need, but what about the SET SYNC_DDL=1, should i use it in the 2nd case, and should i add it after the ALTER statement?


You mentioned that you have load balancing for impalads. So you should add "SET SYNC_DDL=1" before the ALTER statement and in the same session. For example, in your 2nd job:
impala-shell -i "$LPIMPALA" -q "SET SYNC_DDL=1; ALTER TABLE table_name2 DROP IF EXISTS PARTITION (dt<'$RetentionDate');"

> For your 1st job, again, you don't need to run the INVALIDATE METADATA statement after your ALTER TABLE statement in Impala. After the INSERT OVERWRITE statement in hive, you just need to run a REFRESH statement instead of the INVALIDATE METADATA statement.
> -- So you mean i don't need after dropping some partitions to run refresh statement even? after the inserti need the refresh.

Yes. The REFRESH statement is for modifications outside Impala (e.g. in hive). If you run ALTER TABLE inside Impala, you don't need the REFRESH statement.

> What about  SET SYNC_DDL=1? where i should set it, is it something that i can do this globally at impala and not bu query?


If you're accessing Impala through a load-balancing proxy, you should SET SYNC_DDL=1 before any DDL or DML statements. There's a more detail doc explaining this: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/impala_sync_ddl.html
You can set it globally by adding --default_query_options=SYNC_DDL=1 in the startup options. If you're using Cloudera Manager, you can add "SYNC_DDL=1" in "Impala Daemon Query Options Advanced Configuration Snippet (Safety Valve)".


> Do you recommend me to use REFRSH or ALTER table x Recover partitions?


I'm not quite sure about this. But I'm sure that if you're just adding new partitions into a table, you can use ALTER TABLE x Recover partitions. Otherwise, I believe you should not use it.

At 2018-07-04 21:39:29,"Fawze Abujaber" <fa...@gmail.com> wrote:

Hi Quanlong,


Thanks for your quick response.


This is an ETLs that i inherent from other engineers, so trying to make sure the changes i'm planning to do will not impact the current work.


Please see my responses inline.






On Wed, Jul 4, 2018 at 4:23 PM, Quanlong Huang <hu...@126.com> wrote:

Hi Fawze,


As you need a quick consult, let me answer your questions first. The official Cloudera folks will correct me if something wrong.


For your 2nd job, you don't need to run INVALIDATE METADATA statements since DDL and DML in impala will refresh the metadata automatically.
-- For the invalidate i almost sure i don't need, but what about the SET SYNC_DDL=1, should i use it in the 2nd case, and should i add it after the ALTER statement?
For your 1st job, again, you don't need to run the INVALIDATE METADATA statement after your ALTER TABLE statement in Impala. After the INSERT OVERWRITE statement in hive, you just need to run a REFRESH statement instead of the INVALIDATE METADATA statement.
-- So you mean i don't need after dropping some partitions to run refresh statement even? after the insert i need the refresh.
What about  SET SYNC_DDL=1? where i should set it, is it something that i can do this globally at impala and not bu query?
Do you recommend me to use REFRSH or ALTER table x Recover partitions?
BTW, what do you mean by "set DDL=1"? If you're mentioning the SYNC_DDL option, it's recommended to set it to true if you're using load balancer above of the impala daemons.


If you're interested in the difference between INVALIDATE METADATA and REFRESH, there's a discussion thread: 
https://lists.apache.org/thread.html/44b4f223b49e8d7cb07865d5dfba50378d6e41fc173369b56c36b652@%3Cdev.impala.apache.org%3E


In a nutshell, there're only two cases you should use the INVALIDATE METADATA statement:
- you create a table in hive and want to use it in Impala
- the block locations of the underlying hive table files changed by HDFS balancer
You can ignore the second case since it's just for performance, not for correctness. In other cases like adding/dropping files/partitions in hive, you just need the REFRESH statement.


Regards,
Quanlong



At 2018-07-04 18:11:41, "Fawze Abujaber" <fa...@gmail.com> wrote:

Hi Community,


I need a quick consult for using the SET DDL while i'm using impala VIP for the load balancing.


Below are 2 ETL jobs, the first one is using hive and the second is impala and both have DDL operations.


I will start with the 2nd as it should stright forward since the data inserted into the table_name2 by impala, so i think i don't need the invalidate statement but still need the DDL statement, is it right?
should i add the set DDL after ALTER table statement? do you really need it as the insert isn't based on the ALTER table statement since it's dropping old partitions.


For the 1st one where i'm using hive, I know i have to use invalidate metadata, i was thinking to use the set DDL=1 after each invalidate metadata statement, should i use after both statements, or it's enough to be used only in the last one?
I read the documntation about this but wasn't sure if get this point right, so this is why i'm asking for consultation here, note a side i still think that i should use  refresh statement after the insert while using invalidate after the ALTER statement.


Thanks for your help in advance.






RetentionDate=$(date --date="183 days ago" +%Y%m%d)
impala-shell -i "$LPIMPALA" -q "ALTER TABLE table_name DROP IF EXISTS PARTITION (concat(year,month,day)<'$RetentionDate');"
impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name;"


hive -e"
set parquet.compression=SNAPPY;
set hive.merge.mapredfiles=true;
set hive.mapred.mode=nonstrict;
set hive.exec.dynamic.partition.mode=nonstrict;




INSERT OVERWRITE TABLE table_name PARTITION (year, month, day)


....
....
....
....


;"
impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name;"
impala-shell -i "$LPIMPALA" -q "compute INCREMENTAL STATS table_name;"








=======================================================


2nd one.


impala-shell -i "$LPIMPALA" -q "ALTER TABLE table_name2 DROP IF EXISTS PARTITION (dt<'$RetentionDate');"
impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name2;"


impala-shell -B -i "$LPIMPALA" -q"




INSERT OVERWRITE TABLE table_name2 PARTITION (dt)
....
....
....
....
"


impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name2;"


--

Take Care
Fawze Abujaber





--

Take Care
Fawze Abujaber

Re: SET DDL in impala

Posted by Fawze Abujaber <fa...@gmail.com>.
Hi Quanlong,

Thanks for your quick response.

This is an ETLs that i inherent from other engineers, so trying to make
sure the changes i'm planning to do will not impact the current work.

Please see my responses inline.



On Wed, Jul 4, 2018 at 4:23 PM, Quanlong Huang <hu...@126.com>
wrote:

> Hi Fawze,
>
> As you need a quick consult, let me answer your questions first. The
> official Cloudera folks will correct me if something wrong.
>
> For your 2nd job, you don't need to run INVALIDATE METADATA statements
> since DDL and DML in impala will refresh the metadata automatically.
> -- For the invalidate i almost sure i don't need, but what about the SET
> SYNC_DDL=1, should i use it in the 2nd case, and should i add it after the
> ALTER statement?
> For your 1st job, again, you don't need to run the INVALIDATE METADATA
> statement after your ALTER TABLE statement in Impala. After the INSERT
> OVERWRITE statement in hive, you just need to run a REFRESH statement
> instead of the INVALIDATE METADATA statement.
> -- So you mean i don't need after dropping some partitions to run refresh
> statement even? after the insert i need the refresh.
>
What about  SET SYNC_DDL=1? where i should set it, is it something that i
can do this globally at impala and not bu query?
Do you recommend me to use REFRSH or ALTER table x Recover partitions?

> BTW, what do you mean by "set DDL=1"? If you're mentioning the SYNC_DDL
> option, it's recommended to set it to true if you're using load balancer
> above of the impala daemons.
>
> If you're interested in the difference between INVALIDATE METADATA and
> REFRESH, there's a discussion thread:
> https://lists.apache.org/thread.html/44b4f223b49e8d7cb07865d5dfba50
> 378d6e41fc173369b56c36b652@%3Cdev.impala.apache.org%3E
>
> In a nutshell, there're only two cases you should use the INVALIDATE
> METADATA statement:
> - you create a table in hive and want to use it in Impala
> - the block locations of the underlying hive table files changed by HDFS
> balancer
> You can ignore the second case since it's just for performance, not for
> correctness. In other cases like adding/dropping files/partitions in hive,
> you just need the REFRESH statement.
>
> Regards,
> Quanlong
>
>
> At 2018-07-04 18:11:41, "Fawze Abujaber" <fa...@gmail.com> wrote:
>
> Hi Community,
>
> I need a quick consult for using the SET DDL while i'm using impala VIP
> for the load balancing.
>
> Below are 2 ETL jobs, the first one is using hive and the second is impala
> and both have DDL operations.
>
> I will start with the 2nd as it should stright forward since the data
> inserted into the table_name2 by impala, so i think i don't need the
> invalidate statement but still need the DDL statement, is it right?
> should i add the set DDL after ALTER table statement? do you really need
> it as the insert isn't based on the ALTER table statement since it's
> dropping old partitions.
>
> For the 1st one where i'm using hive, I know i have to use invalidate
> metadata, i was thinking to use the set DDL=1 after each invalidate
> metadata statement, should i use after both statements, or it's enough to
> be used only in the last one?
> I read the documntation about this but wasn't sure if get this point
> right, so this is why i'm asking for consultation here, note a side i still
> think that i should use  refresh statement after the insert while using
> invalidate after the ALTER statement.
>
> Thanks for your help in advance.
>
>
>
> RetentionDate=$(date --date="183 days ago" +%Y%m%d)
> impala-shell -i "$LPIMPALA" -q "ALTER TABLE table_name DROP IF EXISTS
> PARTITION (concat(year,month,day)<'$RetentionDate');"
> impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name;"
>
> hive -e"
> set parquet.compression=SNAPPY;
> set hive.merge.mapredfiles=true;
> set hive.mapred.mode=nonstrict;
> set hive.exec.dynamic.partition.mode=nonstrict;
>
>
> INSERT OVERWRITE TABLE table_name PARTITION (year, month, day)
>
> ....
> ....
> ....
> ....
>
> ;"
> impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name;"
> impala-shell -i "$LPIMPALA" -q "compute INCREMENTAL STATS table_name;"
>
>
>
>
> =======================================================
>
> 2nd one.
>
> impala-shell -i "$LPIMPALA" -q "ALTER TABLE table_name2 DROP IF EXISTS
> PARTITION (dt<'$RetentionDate');"
> impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name2;"
>
> impala-shell -B -i "$LPIMPALA" -q"
>
>
> INSERT OVERWRITE TABLE table_name2 PARTITION (dt)
> ....
> ....
> ....
> ....
> "
>
> impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name2;"
>
> --
> Take Care
> Fawze Abujaber
>
>


-- 
Take Care
Fawze Abujaber

Re:SET DDL in impala

Posted by Quanlong Huang <hu...@126.com>.
Hi Fawze,


As you need a quick consult, let me answer your questions first. The official Cloudera folks will correct me if something wrong.


For your 2nd job, you don't need to run INVALIDATE METADATA statements since DDL and DML in impala will refresh the metadata automatically.


For your 1st job, again, you don't need to run the INVALIDATE METADATA statement after your ALTER TABLE statement in Impala. After the INSERT OVERWRITE statement in hive, you just need to run a REFRESH statement instead of the INVALIDATE METADATA statement.


BTW, what do you mean by "set DDL=1"? If you're mentioning the SYNC_DDL option, it's recommended to set it to true if you're using load balancer above of the impala daemons.


If you're interested in the difference between INVALIDATE METADATA and REFRESH, there's a discussion thread: 
https://lists.apache.org/thread.html/44b4f223b49e8d7cb07865d5dfba50378d6e41fc173369b56c36b652@%3Cdev.impala.apache.org%3E


In a nutshell, there're only two cases you should use the INVALIDATE METADATA statement:
- you create a table in hive and want to use it in Impala
- the block locations of the underlying hive table files changed by HDFS balancer
You can ignore the second case since it's just for performance, not for correctness. In other cases like adding/dropping files/partitions in hive, you just need the REFRESH statement.


Regards,
Quanlong



At 2018-07-04 18:11:41, "Fawze Abujaber" <fa...@gmail.com> wrote:

Hi Community,


I need a quick consult for using the SET DDL while i'm using impala VIP for the load balancing.


Below are 2 ETL jobs, the first one is using hive and the second is impala and both have DDL operations.


I will start with the 2nd as it should stright forward since the data inserted into the table_name2 by impala, so i think i don't need the invalidate statement but still need the DDL statement, is it right?
should i add the set DDL after ALTER table statement? do you really need it as the insert isn't based on the ALTER table statement since it's dropping old partitions.


For the 1st one where i'm using hive, I know i have to use invalidate metadata, i was thinking to use the set DDL=1 after each invalidate metadata statement, should i use after both statements, or it's enough to be used only in the last one?
I read the documntation about this but wasn't sure if get this point right, so this is why i'm asking for consultation here, note a side i still think that i should use  refresh statement after the insert while using invalidate after the ALTER statement.


Thanks for your help in advance.






RetentionDate=$(date --date="183 days ago" +%Y%m%d)
impala-shell -i "$LPIMPALA" -q "ALTER TABLE table_name DROP IF EXISTS PARTITION (concat(year,month,day)<'$RetentionDate');"
impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name;"


hive -e"
set parquet.compression=SNAPPY;
set hive.merge.mapredfiles=true;
set hive.mapred.mode=nonstrict;
set hive.exec.dynamic.partition.mode=nonstrict;




INSERT OVERWRITE TABLE table_name PARTITION (year, month, day)


....
....
....
....


;"
impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name;"
impala-shell -i "$LPIMPALA" -q "compute INCREMENTAL STATS table_name;"








=======================================================


2nd one.


impala-shell -i "$LPIMPALA" -q "ALTER TABLE table_name2 DROP IF EXISTS PARTITION (dt<'$RetentionDate');"
impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name2;"


impala-shell -B -i "$LPIMPALA" -q"




INSERT OVERWRITE TABLE table_name2 PARTITION (dt)
....
....
....
....
"


impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name2;"


--

Take Care
Fawze Abujaber