You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@impala.apache.org by Sri Harsha Chavali <sr...@outlook.com> on 2020/08/10 16:18:29 UTC

Improper Rowresults from Impala query

Hi All,

We recently upgraded from impala 2.12 to 3.2 (CDH Impala). We are facing an issue where one of our queries are returning wrong results when there is a predicate (where condition) on the timestamp field (stored as string in our case). Given below is a sample query which is failing on our end. The table is parquet table and is loaded using hive.

select a.testidid,a.testdate from dbname.tablename a where a.testdate <= now() and a.testdate >= '2018-05-01 00:00:00' and a.type = 'TEST' and a.context != 123 and a.status in ('OPEN','CLOSED') and a.context = 1234 and a.testid = 123456;

I researched further and looked at the plan and found that there might be rowgroup filtering happening in my case. I tried to unset the below property and the rowresults were proper.

set parquet_read_statistics=false;

Do you think this might be related to an existing bug or am I doing something wrong?

Thank you,
Harsha


Sent from Outlook<http://aka.ms/weboutlook>

Re: Improper Rowresults from Impala query

Posted by Aman Sinha <am...@gmail.com>.

If feasible, I suggest also trying that set of parquet files with Impala
3.4 (last released version) to see if it works.  That way, we can at least
confirm whether there is still an issue.
Since this is data dependent, it is hard to debug without the actual data.
I think if you kept the timestamp column and anonymized the rest of the
columns (via a CTAS script), that may be one way to move forward.

-Aman

On Wed, Aug 12, 2020 at 1:56 PM Sri Harsha Chavali <
sriharsha.chavali@outlook.com> wrote:

> Hi Aman,
>
> Unfortunately I cannot provide the dataset as we have PHI data in there. I
> couldn't reproduce the issue even with a 2 Million record dataset that I
> created in Impala 3.2. It's really strange why only this dataset has the
> issue. Is there a way you think I can reproduce the issue?
>
> Thank you,
> Harsha
>
> Sent from Outlook <http://aka.ms/weboutlook>
> ------------------------------
> *From:* Sri Harsha Chavali <sr...@outlook.com>
> *Sent:* Monday, August 10, 2020 6:18 PM
> *To:* user@impala.apache.org <us...@impala.apache.org>
> *Subject:* Re: Improper Rowresults from Impala query
>
> Also another observation is that the below query returns same resultset
> with or without the property set. See how I have to_date() function around
> now().
>
> select count(1)
> from dbname.tablename a
>   where a.testdate <= to_date(now())
>   and a.testdate >= '2018-05-01 00:00:00';
>
>
> Thank you,
> Harsha
>
> Sent from Outlook <http://aka.ms/weboutlook>
> ------------------------------
> *From:* Sri Harsha Chavali <sr...@outlook.com>
> *Sent:* Monday, August 10, 2020 5:03 PM
> *To:* user@impala.apache.org <us...@impala.apache.org>
> *Subject:* Re: Improper Rowresults from Impala query
>
> Hi Aman,
>
> Thank you for the quick response. I tried three things.
> 1. Removed all filters and only had a.testdate <= now() and it's a perfect
> match.
> select count(1)
> from dbname.tablename a
>   where a.testdate <= now();
> set parquet_read_statistics=false;
> 5879452
> set parquet_read_statistics=true;
> 5879452
>
> 2. Removed all filters and only had a.testdate >='2018-05-01 00:00:00';
> and it's a perfect match.
> select count(1)
> from dbname.tablename a
>   where a.testdate >= '2018-05-01 00:00:00';
> set parquet_read_statistics=false;
> 12906263
> set parquet_read_statistics=true;
> 12906263
>
> 3. Removed all filters and had   a.testdate <= now()  and a.testdate >=
> '2018-05-01 00:00:00' and I found the discrepancy.
> select count(1)
> from dbname.tablename a
>   where a.testdate <= now()
>   and a.testdate >= '2018-05-01 00:00:00';
> set parquet_read_statistics=false;
> 1687250
> set parquet_read_statistics=true;
> 12892421
>
> I eliminated the parquet files one after the other and the issue existed
> in all files. I also used parquet-tools command line tool to debug the
> files and they looked good.
>
> I also created duplicate table using hive and impala (using CTAS) and
> still face the issue with the newly created tables. Any inputs on  why the
> combination of filters might cause the issue?
>
> Thank you,
> Harsha
>
> Sent from Outlook <http://aka.ms/weboutlook>
> ------------------------------
> *From:* Aman Sinha <am...@gmail.com>
> *Sent:* Monday, August 10, 2020 3:52 PM
> *To:* user@impala.apache.org <us...@impala.apache.org>
> *Subject:* Re: Improper Rowresults from Impala query
>
> Harsha,
> to eliminate issues with other data types, could you check just with the
> testdate column ?
> i.e. SELECT COUNT(*) FROM dbname.tablename a WHERE a.testdate >=
> '2018-05-01 00:00:00'
> Is the result different with and without the parquet_read_statistics ?
>
> There could be 2 possibilities: (a) the parquet stats for one or more of
> those files may be corrupted (not sure how they were created) ,.. can you
> narrow down the set of parquet files ?  Does it happen even with 1 parquet
> file ?
>  (b) there could be a timestamp related bug with pruning using the parquet
> stats.
> Either way, you may want to file a JIRA and provide a sample file if
> possible at https://issues.apache.org/jira/projects/IMPALA/
>
> -Aman
>
> On Mon, Aug 10, 2020 at 9:18 AM Sri Harsha Chavali <
> sriharsha.chavali@outlook.com> wrote:
>
> Hi All,
>
> We recently upgraded from impala 2.12 to 3.2 (CDH Impala). We are facing
> an issue where one of our queries are returning wrong results when there is
> a predicate (where condition) on the timestamp field (stored as string in
> our case). Given below is a sample query which is failing on our end. The
> table is parquet table and is loaded using hive.
>
> select a.testidid,a.testdate from dbname.tablename a where a.testdate <=
> now() and a.testdate >= '2018-05-01 00:00:00' and a.type = 'TEST' and
> a.context != 123 and a.status in ('OPEN','CLOSED') and a.context = 1234 and
> a.testid = 123456;
>
> I researched further and looked at the plan and found that there might be
> rowgroup filtering happening in my case. I tried to unset the below
> property and the rowresults were proper.
>
> set parquet_read_statistics=false;
>
> Do you think this might be related to an existing bug or am I doing
> something wrong?
>
> Thank you,
> Harsha
>
> Sent from Outlook <http://aka.ms/weboutlook>
>
>

Re: Improper Rowresults from Impala query

Posted by Sri Harsha Chavali <sr...@outlook.com>.

Hi Aman,

Unfortunately I cannot provide the dataset as we have PHI data in there. I couldn't reproduce the issue even with a 2 Million record dataset that I created in Impala 3.2. It's really strange why only this dataset has the issue. Is there a way you think I can reproduce the issue?

Thank you,
Harsha

Sent from Outlook<http://aka.ms/weboutlook>

________________________________
From: Sri Harsha Chavali <sr...@outlook.com>
Sent: Monday, August 10, 2020 6:18 PM
To: user@impala.apache.org <us...@impala.apache.org>
Subject: Re: Improper Rowresults from Impala query

Also another observation is that the below query returns same resultset with or without the property set. See how I have to_date() function around now().

select count(1)
from dbname.tablename a
  where a.testdate <= to_date(now())
  and a.testdate >= '2018-05-01 00:00:00';


Thank you,
Harsha

Sent from Outlook<http://aka.ms/weboutlook>

________________________________
From: Sri Harsha Chavali <sr...@outlook.com>
Sent: Monday, August 10, 2020 5:03 PM
To: user@impala.apache.org <us...@impala.apache.org>
Subject: Re: Improper Rowresults from Impala query

Hi Aman,

Thank you for the quick response. I tried three things.
1. Removed all filters and only had a.testdate <= now() and it's a perfect match.
select count(1)
from dbname.tablename a
  where a.testdate <= now();
set parquet_read_statistics=false;
5879452
set parquet_read_statistics=true;
5879452

2. Removed all filters and only had a.testdate >='2018-05-01 00:00:00'; and it's a perfect match.
select count(1)
from dbname.tablename a
  where a.testdate >= '2018-05-01 00:00:00';
set parquet_read_statistics=false;
12906263
set parquet_read_statistics=true;
12906263

3. Removed all filters and had   a.testdate <= now()  and a.testdate >= '2018-05-01 00:00:00' and I found the discrepancy.
select count(1)
from dbname.tablename a
  where a.testdate <= now()
  and a.testdate >= '2018-05-01 00:00:00';
set parquet_read_statistics=false;
1687250
set parquet_read_statistics=true;
12892421

I eliminated the parquet files one after the other and the issue existed in all files. I also used parquet-tools command line tool to debug the files and they looked good.

I also created duplicate table using hive and impala (using CTAS) and still face the issue with the newly created tables. Any inputs on  why the combination of filters might cause the issue?

Thank you,
Harsha


Sent from Outlook<http://aka.ms/weboutlook>

________________________________
From: Aman Sinha <am...@gmail.com>
Sent: Monday, August 10, 2020 3:52 PM
To: user@impala.apache.org <us...@impala.apache.org>
Subject: Re: Improper Rowresults from Impala query

Harsha,
to eliminate issues with other data types, could you check just with the testdate column ?
i.e. SELECT COUNT(*) FROM dbname.tablename a WHERE a.testdate >= '2018-05-01 00:00:00'
Is the result different with and without the parquet_read_statistics ?

There could be 2 possibilities: (a) the parquet stats for one or more of those files may be corrupted (not sure how they were created) ,.. can you narrow down the set of parquet files ?  Does it happen even with 1 parquet file ?
 (b) there could be a timestamp related bug with pruning using the parquet stats.
Either way, you may want to file a JIRA and provide a sample file if possible at https://issues.apache.org/jira/projects/IMPALA/

-Aman

On Mon, Aug 10, 2020 at 9:18 AM Sri Harsha Chavali <sr...@outlook.com>> wrote:
Hi All,

We recently upgraded from impala 2.12 to 3.2 (CDH Impala). We are facing an issue where one of our queries are returning wrong results when there is a predicate (where condition) on the timestamp field (stored as string in our case). Given below is a sample query which is failing on our end. The table is parquet table and is loaded using hive.

select a.testidid,a.testdate from dbname.tablename a where a.testdate <= now() and a.testdate >= '2018-05-01 00:00:00' and a.type = 'TEST' and a.context != 123 and a.status in ('OPEN','CLOSED') and a.context = 1234 and a.testid = 123456;

I researched further and looked at the plan and found that there might be rowgroup filtering happening in my case. I tried to unset the below property and the rowresults were proper.

set parquet_read_statistics=false;

Do you think this might be related to an existing bug or am I doing something wrong?

Thank you,
Harsha


Sent from Outlook<http://aka.ms/weboutlook>

Re: Improper Rowresults from Impala query

Posted by Aman Sinha <am...@gmail.com>.

It sounds like a bug on the version that you are using (3.2) but on the
current 3.4 version,  I tried with a simpler test case with some sample
data with timestamps and could not reproduce the problem using the same
query pattern you ran.  It is probable that your Timestamp type issues may
have been fixed but in order to confirm that you will have to provide a
sample data file (assuming it does not contain sensitive data) and attach
it to the JIRA using the link I sent earlier.

Aman

On Mon, Aug 10, 2020 at 3:19 PM Sri Harsha Chavali <
sriharsha.chavali@outlook.com> wrote:

> Also another observation is that the below query returns same resultset
> with or without the property set. See how I have to_date() function around
> now().
>
> select count(1)
> from dbname.tablename a
>   where a.testdate <= to_date(now())
>   and a.testdate >= '2018-05-01 00:00:00';
>
>
> Thank you,
> Harsha
>
> Sent from Outlook <http://aka.ms/weboutlook>
> ------------------------------
> *From:* Sri Harsha Chavali <sr...@outlook.com>
> *Sent:* Monday, August 10, 2020 5:03 PM
> *To:* user@impala.apache.org <us...@impala.apache.org>
> *Subject:* Re: Improper Rowresults from Impala query
>
> Hi Aman,
>
> Thank you for the quick response. I tried three things.
> 1. Removed all filters and only had a.testdate <= now() and it's a perfect
> match.
> select count(1)
> from dbname.tablename a
>   where a.testdate <= now();
> set parquet_read_statistics=false;
> 5879452
> set parquet_read_statistics=true;
> 5879452
>
> 2. Removed all filters and only had a.testdate >='2018-05-01 00:00:00';
> and it's a perfect match.
> select count(1)
> from dbname.tablename a
>   where a.testdate >= '2018-05-01 00:00:00';
> set parquet_read_statistics=false;
> 12906263
> set parquet_read_statistics=true;
> 12906263
>
> 3. Removed all filters and had   a.testdate <= now()  and a.testdate >=
> '2018-05-01 00:00:00' and I found the discrepancy.
> select count(1)
> from dbname.tablename a
>   where a.testdate <= now()
>   and a.testdate >= '2018-05-01 00:00:00';
> set parquet_read_statistics=false;
> 1687250
> set parquet_read_statistics=true;
> 12892421
>
> I eliminated the parquet files one after the other and the issue existed
> in all files. I also used parquet-tools command line tool to debug the
> files and they looked good.
>
> I also created duplicate table using hive and impala (using CTAS) and
> still face the issue with the newly created tables. Any inputs on  why the
> combination of filters might cause the issue?
>
> Thank you,
> Harsha
>
> Sent from Outlook <http://aka.ms/weboutlook>
> ------------------------------
> *From:* Aman Sinha <am...@gmail.com>
> *Sent:* Monday, August 10, 2020 3:52 PM
> *To:* user@impala.apache.org <us...@impala.apache.org>
> *Subject:* Re: Improper Rowresults from Impala query
>
> Harsha,
> to eliminate issues with other data types, could you check just with the
> testdate column ?
> i.e. SELECT COUNT(*) FROM dbname.tablename a WHERE a.testdate >=
> '2018-05-01 00:00:00'
> Is the result different with and without the parquet_read_statistics ?
>
> There could be 2 possibilities: (a) the parquet stats for one or more of
> those files may be corrupted (not sure how they were created) ,.. can you
> narrow down the set of parquet files ?  Does it happen even with 1 parquet
> file ?
>  (b) there could be a timestamp related bug with pruning using the parquet
> stats.
> Either way, you may want to file a JIRA and provide a sample file if
> possible at https://issues.apache.org/jira/projects/IMPALA/
>
> -Aman
>
> On Mon, Aug 10, 2020 at 9:18 AM Sri Harsha Chavali <
> sriharsha.chavali@outlook.com> wrote:
>
> Hi All,
>
> We recently upgraded from impala 2.12 to 3.2 (CDH Impala). We are facing
> an issue where one of our queries are returning wrong results when there is
> a predicate (where condition) on the timestamp field (stored as string in
> our case). Given below is a sample query which is failing on our end. The
> table is parquet table and is loaded using hive.
>
> select a.testidid,a.testdate from dbname.tablename a where a.testdate <=
> now() and a.testdate >= '2018-05-01 00:00:00' and a.type = 'TEST' and
> a.context != 123 and a.status in ('OPEN','CLOSED') and a.context = 1234 and
> a.testid = 123456;
>
> I researched further and looked at the plan and found that there might be
> rowgroup filtering happening in my case. I tried to unset the below
> property and the rowresults were proper.
>
> set parquet_read_statistics=false;
>
> Do you think this might be related to an existing bug or am I doing
> something wrong?
>
> Thank you,
> Harsha
>
> Sent from Outlook <http://aka.ms/weboutlook>
>
>

Re: Improper Rowresults from Impala query

Posted by Sri Harsha Chavali <sr...@outlook.com>.

Also another observation is that the below query returns same resultset with or without the property set. See how I have to_date() function around now().

select count(1)
from dbname.tablename a
  where a.testdate <= to_date(now())
  and a.testdate >= '2018-05-01 00:00:00';


Thank you,
Harsha

Sent from Outlook<http://aka.ms/weboutlook>

________________________________
From: Sri Harsha Chavali <sr...@outlook.com>
Sent: Monday, August 10, 2020 5:03 PM
To: user@impala.apache.org <us...@impala.apache.org>
Subject: Re: Improper Rowresults from Impala query

Hi Aman,

Thank you for the quick response. I tried three things.
1. Removed all filters and only had a.testdate <= now() and it's a perfect match.
select count(1)
from dbname.tablename a
  where a.testdate <= now();
set parquet_read_statistics=false;
5879452
set parquet_read_statistics=true;
5879452

2. Removed all filters and only had a.testdate >='2018-05-01 00:00:00'; and it's a perfect match.
select count(1)
from dbname.tablename a
  where a.testdate >= '2018-05-01 00:00:00';
set parquet_read_statistics=false;
12906263
set parquet_read_statistics=true;
12906263

3. Removed all filters and had   a.testdate <= now()  and a.testdate >= '2018-05-01 00:00:00' and I found the discrepancy.
select count(1)
from dbname.tablename a
  where a.testdate <= now()
  and a.testdate >= '2018-05-01 00:00:00';
set parquet_read_statistics=false;
1687250
set parquet_read_statistics=true;
12892421

I eliminated the parquet files one after the other and the issue existed in all files. I also used parquet-tools command line tool to debug the files and they looked good.

I also created duplicate table using hive and impala (using CTAS) and still face the issue with the newly created tables. Any inputs on  why the combination of filters might cause the issue?

Thank you,
Harsha


Sent from Outlook<http://aka.ms/weboutlook>

________________________________
From: Aman Sinha <am...@gmail.com>
Sent: Monday, August 10, 2020 3:52 PM
To: user@impala.apache.org <us...@impala.apache.org>
Subject: Re: Improper Rowresults from Impala query

Harsha,
to eliminate issues with other data types, could you check just with the testdate column ?
i.e. SELECT COUNT(*) FROM dbname.tablename a WHERE a.testdate >= '2018-05-01 00:00:00'
Is the result different with and without the parquet_read_statistics ?

There could be 2 possibilities: (a) the parquet stats for one or more of those files may be corrupted (not sure how they were created) ,.. can you narrow down the set of parquet files ?  Does it happen even with 1 parquet file ?
 (b) there could be a timestamp related bug with pruning using the parquet stats.
Either way, you may want to file a JIRA and provide a sample file if possible at https://issues.apache.org/jira/projects/IMPALA/

-Aman

On Mon, Aug 10, 2020 at 9:18 AM Sri Harsha Chavali <sr...@outlook.com>> wrote:
Hi All,

We recently upgraded from impala 2.12 to 3.2 (CDH Impala). We are facing an issue where one of our queries are returning wrong results when there is a predicate (where condition) on the timestamp field (stored as string in our case). Given below is a sample query which is failing on our end. The table is parquet table and is loaded using hive.

select a.testidid,a.testdate from dbname.tablename a where a.testdate <= now() and a.testdate >= '2018-05-01 00:00:00' and a.type = 'TEST' and a.context != 123 and a.status in ('OPEN','CLOSED') and a.context = 1234 and a.testid = 123456;

I researched further and looked at the plan and found that there might be rowgroup filtering happening in my case. I tried to unset the below property and the rowresults were proper.

set parquet_read_statistics=false;

Do you think this might be related to an existing bug or am I doing something wrong?

Thank you,
Harsha


Sent from Outlook<http://aka.ms/weboutlook>

Re: Improper Rowresults from Impala query

Posted by Sri Harsha Chavali <sr...@outlook.com>.

Hi Aman,

Thank you for the quick response. I tried three things.
1. Removed all filters and only had a.testdate <= now() and it's a perfect match.
select count(1)
from dbname.tablename a
  where a.testdate <= now();
set parquet_read_statistics=false;
5879452
set parquet_read_statistics=true;
5879452

2. Removed all filters and only had a.testdate >='2018-05-01 00:00:00'; and it's a perfect match.
select count(1)
from dbname.tablename a
  where a.testdate >= '2018-05-01 00:00:00';
set parquet_read_statistics=false;
12906263
set parquet_read_statistics=true;
12906263

3. Removed all filters and had   a.testdate <= now()  and a.testdate >= '2018-05-01 00:00:00' and I found the discrepancy.
select count(1)
from dbname.tablename a
  where a.testdate <= now()
  and a.testdate >= '2018-05-01 00:00:00';
set parquet_read_statistics=false;
1687250
set parquet_read_statistics=true;
12892421

I eliminated the parquet files one after the other and the issue existed in all files. I also used parquet-tools command line tool to debug the files and they looked good.

I also created duplicate table using hive and impala (using CTAS) and still face the issue with the newly created tables. Any inputs on  why the combination of filters might cause the issue?

Thank you,
Harsha


Sent from Outlook<http://aka.ms/weboutlook>

________________________________
From: Aman Sinha <am...@gmail.com>
Sent: Monday, August 10, 2020 3:52 PM
To: user@impala.apache.org <us...@impala.apache.org>
Subject: Re: Improper Rowresults from Impala query

Harsha,
to eliminate issues with other data types, could you check just with the testdate column ?
i.e. SELECT COUNT(*) FROM dbname.tablename a WHERE a.testdate >= '2018-05-01 00:00:00'
Is the result different with and without the parquet_read_statistics ?

There could be 2 possibilities: (a) the parquet stats for one or more of those files may be corrupted (not sure how they were created) ,.. can you narrow down the set of parquet files ?  Does it happen even with 1 parquet file ?
 (b) there could be a timestamp related bug with pruning using the parquet stats.
Either way, you may want to file a JIRA and provide a sample file if possible at https://issues.apache.org/jira/projects/IMPALA/

-Aman

On Mon, Aug 10, 2020 at 9:18 AM Sri Harsha Chavali <sr...@outlook.com>> wrote:
Hi All,

We recently upgraded from impala 2.12 to 3.2 (CDH Impala). We are facing an issue where one of our queries are returning wrong results when there is a predicate (where condition) on the timestamp field (stored as string in our case). Given below is a sample query which is failing on our end. The table is parquet table and is loaded using hive.

select a.testidid,a.testdate from dbname.tablename a where a.testdate <= now() and a.testdate >= '2018-05-01 00:00:00' and a.type = 'TEST' and a.context != 123 and a.status in ('OPEN','CLOSED') and a.context = 1234 and a.testid = 123456;

I researched further and looked at the plan and found that there might be rowgroup filtering happening in my case. I tried to unset the below property and the rowresults were proper.

set parquet_read_statistics=false;

Do you think this might be related to an existing bug or am I doing something wrong?

Thank you,
Harsha


Sent from Outlook<http://aka.ms/weboutlook>

Re: Improper Rowresults from Impala query

Posted by Aman Sinha <am...@gmail.com>.

Harsha,
to eliminate issues with other data types, could you check just with the
testdate column ?
i.e. SELECT COUNT(*) FROM dbname.tablename a WHERE a.testdate >=
'2018-05-01 00:00:00'
Is the result different with and without the parquet_read_statistics ?

There could be 2 possibilities: (a) the parquet stats for one or more of
those files may be corrupted (not sure how they were created) ,.. can you
narrow down the set of parquet files ?  Does it happen even with 1 parquet
file ?
 (b) there could be a timestamp related bug with pruning using the parquet
stats.
Either way, you may want to file a JIRA and provide a sample file if
possible at https://issues.apache.org/jira/projects/IMPALA/

-Aman

On Mon, Aug 10, 2020 at 9:18 AM Sri Harsha Chavali <
sriharsha.chavali@outlook.com> wrote:

> Hi All,
>
> We recently upgraded from impala 2.12 to 3.2 (CDH Impala). We are facing
> an issue where one of our queries are returning wrong results when there is
> a predicate (where condition) on the timestamp field (stored as string in
> our case). Given below is a sample query which is failing on our end. The
> table is parquet table and is loaded using hive.
>
> select a.testidid,a.testdate from dbname.tablename a where a.testdate <=
> now() and a.testdate >= '2018-05-01 00:00:00' and a.type = 'TEST' and
> a.context != 123 and a.status in ('OPEN','CLOSED') and a.context = 1234 and
> a.testid = 123456;
>
> I researched further and looked at the plan and found that there might be
> rowgroup filtering happening in my case. I tried to unset the below
> property and the rowresults were proper.
>
> set parquet_read_statistics=false;
>
> Do you think this might be related to an existing bug or am I doing
> something wrong?
>
> Thank you,
> Harsha
>
> Sent from Outlook <http://aka.ms/weboutlook>
>