You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Felix Bier (Jira)" <ji...@apache.org> on 2023/06/28 19:31:00 UTC
[jira] [Created] (IMPALA-12254) Inconsistent empty string handling in CSV parser
Felix Bier created IMPALA-12254:
-----------------------------------
Summary: Inconsistent empty string handling in CSV parser
Key: IMPALA-12254
URL: https://issues.apache.org/jira/browse/IMPALA-12254
Project: IMPALA
Issue Type: Bug
Components: be
Reporter: Felix Bier
h3. Steps to reproduce:
CSV table:
{code:java}
,
,, {code}
DDL:
{code:java}
CREATE EXTERNAL TABLE tab1
(
col_1 VARCHAR(256),
col_2 VARCHAR(256),
col_3 VARCHAR(256)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/tmp/tab1'; {code}
h3. Expected behavior:
Querying tab1 should consistently either give all nulls or all empty strings.
h3. Actual behavior:
Some fields are NULL and some fields are empty string.
{code:java}
[localhost.localdomain:21050] default> select col1 is NULL, col2 is NULL, col3 is NULL from tab1;
+---------------+---------------+---------------+
| col_1 is null | col_2 is null | col_3 is null |
+---------------+---------------+---------------+
| false | true | false |
| false | false | true |
| false | false | true |
+---------------+---------------+---------------+
{code}
h3. Analysis:
Problem 1:
DelimitedTextParser has function FillColumns for filling missing
columns. This function creates a local variable for representing the fill value:
{code:java}
char* dummy = NULL;
if (last_column == NULL) last_column = &dummy; {code}
The last_column pointer is passed to AddColumns, which will derefence the pointer and perform increment. As last_column points to dummy, the increment is applied to dummy, so after the call to AddColumns, dummy is NULL +1. Each further iteration in the loop in FillColumns will further increment dummy. This results in non-null pointers being set as the column value, which results in WriteSlot not entering the NULL case and writing an empty string instead.
Problem 2:
{color:#268bd2}ParseFieldLocations{color} makes calls to AddColumn with len=0, but non-null
column pointers. These are interpreted as empty string instead of SQL-NULL
in the result.
h3. Suggested fix:
Problem 1: The pointer increment in AddColumn could be guarded so that it is only performed on non-NULL pointers:
{code:java}
diff --git a/be/src/exec/delimited-text-parser.inline.h b/be/src/exec/delimited-text-parser.inline.h
index 266fa06dd..868f4e9bc 100644
--- a/be/src/exec/delimited-text-parser.inline.h
+++ b/be/src/exec/delimited-text-parser.inline.h
@@ -70,7 +70,9 @@ inline Status DelimitedTextParser<DELIMITED_TUPLES>::AddColumn(int64_t len,
++(*num_fields);
}
if (PROCESS_ESCAPES) current_column_has_escape_ = false;
- *next_column_start += len + 1;
+ if (*next_column_start != NULL) {
+ *next_column_start += len + 1;
+ }
++column_idx_;
return Status::OK();
}
{code}
With this fix, querying the example tables gives:
{code:java}
[localhost.localdomain:21050] default> select col_1 is NULL, col_2 is NULL, col_3 is NULL from tab1;
+---------------+---------------+---------------+
| col_1 is null | col_2 is null | col_3 is null |
+---------------+---------------+---------------+
| false | true | true |
| false | false | true |
| false | false | true |
+---------------+---------------+---------------+
{code}
Problem 2: Here it is not clear to what degree that is expected behavior. If this is not expected behavior, I would expect that some calls to AddColumn would have to be changed such that they pass a nullptr for the current column position if len=0.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org