You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Felix Bier (Jira)" <ji...@apache.org> on 2023/06/28 19:48:00 UTC

[jira] [Updated] (IMPALA-12254) Inconsistent empty string handling in CSV parser

     [ https://issues.apache.org/jira/browse/IMPALA-12254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Bier updated IMPALA-12254:
--------------------------------
    Description: 
h3. Steps to reproduce:

CSV table:

 
{code:java}
,
,, {code}
DDL:

 

 
{code:java}
CREATE EXTERNAL TABLE tab1
(
   col_1 VARCHAR(256),
   col_2 VARCHAR(256),
   col_3 VARCHAR(256)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/tmp/tab1'; {code}
h3. Expected behavior:

Querying tab1 should consistently either give all nulls or all empty strings.
h3. Actual behavior:

Some fields are NULL and some fields are empty string.

 
{code:java}
[localhost.localdomain:21050] default> select col1 is NULL, col2 is NULL, col3 is NULL from tab1;
+---------------+---------------+---------------+
| col_1 is null | col_2 is null | col_3 is null |
+---------------+---------------+---------------+
| false         | true          | false         |
| false         | false         | true          |
| false         | false         | true          |
+---------------+---------------+---------------+
 {code}
h3. Analysis:

Problem 1:

 
DelimitedTextParser has function FillColumns for filling missing
columns. This function creates a local variable for representing the fill value:
{code:java}
  char* dummy = NULL;
  if (last_column == NULL) last_column = &dummy; {code}
The last_column pointer is passed to AddColumns, which will derefence the pointer and perform increment. As last_column points to dummy, the increment is applied to dummy, so after the call to AddColumns, dummy is NULL +1. Each further iteration in the loop in FillColumns will further increment dummy. This results in non-null pointers being set as the column value, which results in WriteSlot not entering the NULL case and writing an empty string instead.
 
Problem 2:
{color:#268bd2}ParseFieldLocations{color} makes calls to AddColumn with len=0, but non-null
column pointers. These are interpreted as empty string instead of SQL-NULL
in the result.
 
h3. Suggested fix:

 
Problem 1:  The pointer increment in AddColumn could be guarded so that it is only performed on non-NULL pointers:
{code:java}
diff --git a/be/src/exec/delimited-text-parser.inline.h b/be/src/exec/delimited-text-parser.inline.h
index 266fa06dd..868f4e9bc 100644
--- a/be/src/exec/delimited-text-parser.inline.h
+++ b/be/src/exec/delimited-text-parser.inline.h
@@ -70,7 +70,9 @@ inline Status DelimitedTextParser<DELIMITED_TUPLES>::AddColumn(int64_t len,
     ++(*num_fields);
   }
   if (PROCESS_ESCAPES) current_column_has_escape_ = false;
-  *next_column_start += len + 1;
+  if (*next_column_start != NULL) {
+    *next_column_start += len + 1;
+  }
   ++column_idx_;
   return Status::OK();
 }
{code}
With this fix, querying the example tables gives:
{code:java}
[localhost.localdomain:21050] default> select col_1 is NULL, col_2 is NULL, col_3 is NULL from tab1;
+---------------+---------------+---------------+
| col_1 is null | col_2 is null | col_3 is null |
+---------------+---------------+---------------+
| false         | true          | true          |
| false         | false         | true          |
| false         | false         | true          |
+---------------+---------------+---------------+
 {code}
Problem 2: Here it is not clear to what degree that is expected behavior. If this is not expected behavior, I would expect that some calls to AddColumn would have to be changed such that they pass a nullptr for the current column position if len=0.

If the behavior is intended, the rule that is applied is not fully clear to me. For example, if the rule is that each empty token should be represented as empty string and each delimiter starts a new token, then the first two lines in the output (after applying fix for problem 1) make sense. First line in CSV file is empty, so it is interpreted as one empty token and then two NULL values are filled in. Second line contains "<empty token> <delimiter> <empty token>", so one NULL value gets filled in. But then the third line is not consistent, as it contains "<empty token> <delimiter> <empty token> <delimiter> <empty token> <eof>", so I would expect all three values to be non-NULL if this rule is followed.
 
 

  was:
h3. Steps to reproduce:

CSV table:

 
{code:java}

,
,, {code}
DDL:

 

 
{code:java}
CREATE EXTERNAL TABLE tab1
(
   col_1 VARCHAR(256),
   col_2 VARCHAR(256),
   col_3 VARCHAR(256)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/tmp/tab1'; {code}
h3. Expected behavior:

Querying tab1 should consistently either give all nulls or all empty strings.
h3. Actual behavior:

Some fields are NULL and some fields are empty string.

 
{code:java}
[localhost.localdomain:21050] default> select col1 is NULL, col2 is NULL, col3 is NULL from tab1;
+---------------+---------------+---------------+
| col_1 is null | col_2 is null | col_3 is null |
+---------------+---------------+---------------+
| false         | true          | false         |
| false         | false         | true          |
| false         | false         | true          |
+---------------+---------------+---------------+
 {code}
h3. Analysis:

Problem 1:

 
DelimitedTextParser has function FillColumns for filling missing
columns. This function creates a local variable for representing the fill value:
{code:java}
  char* dummy = NULL;
  if (last_column == NULL) last_column = &dummy; {code}
The last_column pointer is passed to AddColumns, which will derefence the pointer and perform increment. As last_column points to dummy, the increment is applied to dummy, so after the call to AddColumns, dummy is NULL +1. Each further iteration in the loop in FillColumns will further increment dummy. This results in non-null pointers being set as the column value, which results in WriteSlot not entering the NULL case and writing an empty string instead.
 
Problem 2:
{color:#268bd2}ParseFieldLocations{color} makes calls to AddColumn with len=0, but non-null
column pointers. These are interpreted as empty string instead of SQL-NULL
in the result.
 
h3. Suggested fix:
 
Problem 1:  The pointer increment in AddColumn could be guarded so that it is only performed on non-NULL pointers:
{code:java}
diff --git a/be/src/exec/delimited-text-parser.inline.h b/be/src/exec/delimited-text-parser.inline.h
index 266fa06dd..868f4e9bc 100644
--- a/be/src/exec/delimited-text-parser.inline.h
+++ b/be/src/exec/delimited-text-parser.inline.h
@@ -70,7 +70,9 @@ inline Status DelimitedTextParser<DELIMITED_TUPLES>::AddColumn(int64_t len,
     ++(*num_fields);
   }
   if (PROCESS_ESCAPES) current_column_has_escape_ = false;
-  *next_column_start += len + 1;
+  if (*next_column_start != NULL) {
+    *next_column_start += len + 1;
+  }
   ++column_idx_;
   return Status::OK();
 }
{code}
With this fix, querying the example tables gives:

{code:java}
[localhost.localdomain:21050] default> select col_1 is NULL, col_2 is NULL, col_3 is NULL from tab1;
+---------------+---------------+---------------+
| col_1 is null | col_2 is null | col_3 is null |
+---------------+---------------+---------------+
| false         | true          | true          |
| false         | false         | true          |
| false         | false         | true          |
+---------------+---------------+---------------+
 {code}
Problem 2: Here it is not clear to what degree that is expected behavior. If this is not expected behavior, I would expect that some calls to AddColumn would have to be changed such that they pass a nullptr for the current column position if len=0.
 
 


> Inconsistent empty string handling in CSV parser
> ------------------------------------------------
>
>                 Key: IMPALA-12254
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12254
>             Project: IMPALA
>          Issue Type: Bug
>          Components: be
>            Reporter: Felix Bier
>            Priority: Minor
>
> h3. Steps to reproduce:
> CSV table:
>  
> {code:java}
> ,
> ,, {code}
> DDL:
>  
>  
> {code:java}
> CREATE EXTERNAL TABLE tab1
> (
>    col_1 VARCHAR(256),
>    col_2 VARCHAR(256),
>    col_3 VARCHAR(256)
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> LOCATION '/tmp/tab1'; {code}
> h3. Expected behavior:
> Querying tab1 should consistently either give all nulls or all empty strings.
> h3. Actual behavior:
> Some fields are NULL and some fields are empty string.
>  
> {code:java}
> [localhost.localdomain:21050] default> select col1 is NULL, col2 is NULL, col3 is NULL from tab1;
> +---------------+---------------+---------------+
> | col_1 is null | col_2 is null | col_3 is null |
> +---------------+---------------+---------------+
> | false         | true          | false         |
> | false         | false         | true          |
> | false         | false         | true          |
> +---------------+---------------+---------------+
>  {code}
> h3. Analysis:
> Problem 1:
>  
> DelimitedTextParser has function FillColumns for filling missing
> columns. This function creates a local variable for representing the fill value:
> {code:java}
>   char* dummy = NULL;
>   if (last_column == NULL) last_column = &dummy; {code}
> The last_column pointer is passed to AddColumns, which will derefence the pointer and perform increment. As last_column points to dummy, the increment is applied to dummy, so after the call to AddColumns, dummy is NULL +1. Each further iteration in the loop in FillColumns will further increment dummy. This results in non-null pointers being set as the column value, which results in WriteSlot not entering the NULL case and writing an empty string instead.
>  
> Problem 2:
> {color:#268bd2}ParseFieldLocations{color} makes calls to AddColumn with len=0, but non-null
> column pointers. These are interpreted as empty string instead of SQL-NULL
> in the result.
>  
> h3. Suggested fix:
>  
> Problem 1:  The pointer increment in AddColumn could be guarded so that it is only performed on non-NULL pointers:
> {code:java}
> diff --git a/be/src/exec/delimited-text-parser.inline.h b/be/src/exec/delimited-text-parser.inline.h
> index 266fa06dd..868f4e9bc 100644
> --- a/be/src/exec/delimited-text-parser.inline.h
> +++ b/be/src/exec/delimited-text-parser.inline.h
> @@ -70,7 +70,9 @@ inline Status DelimitedTextParser<DELIMITED_TUPLES>::AddColumn(int64_t len,
>      ++(*num_fields);
>    }
>    if (PROCESS_ESCAPES) current_column_has_escape_ = false;
> -  *next_column_start += len + 1;
> +  if (*next_column_start != NULL) {
> +    *next_column_start += len + 1;
> +  }
>    ++column_idx_;
>    return Status::OK();
>  }
> {code}
> With this fix, querying the example tables gives:
> {code:java}
> [localhost.localdomain:21050] default> select col_1 is NULL, col_2 is NULL, col_3 is NULL from tab1;
> +---------------+---------------+---------------+
> | col_1 is null | col_2 is null | col_3 is null |
> +---------------+---------------+---------------+
> | false         | true          | true          |
> | false         | false         | true          |
> | false         | false         | true          |
> +---------------+---------------+---------------+
>  {code}
> Problem 2: Here it is not clear to what degree that is expected behavior. If this is not expected behavior, I would expect that some calls to AddColumn would have to be changed such that they pass a nullptr for the current column position if len=0.
> If the behavior is intended, the rule that is applied is not fully clear to me. For example, if the rule is that each empty token should be represented as empty string and each delimiter starts a new token, then the first two lines in the output (after applying fix for problem 1) make sense. First line in CSV file is empty, so it is interpreted as one empty token and then two NULL values are filled in. Second line contains "<empty token> <delimiter> <empty token>", so one NULL value gets filled in. But then the third line is not consistent, as it contains "<empty token> <delimiter> <empty token> <delimiter> <empty token> <eof>", so I would expect all three values to be non-NULL if this rule is followed.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org