You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Khurram Faraaz (JIRA)" <ji...@apache.org> on 2017/07/06 07:11:00 UTC
[jira] [Updated] (DRILL-5486) Compliant text reader tries, but
fails, to ignore empty rows
[ https://issues.apache.org/jira/browse/DRILL-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Khurram Faraaz updated DRILL-5486:
----------------------------------
Component/s: Storage - Text & CSV
> Compliant text reader tries, but fails, to ignore empty rows
> ------------------------------------------------------------
>
> Key: DRILL-5486
> URL: https://issues.apache.org/jira/browse/DRILL-5486
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Text & CSV
> Affects Versions: 1.10.0
> Reporter: Paul Rogers
> Priority: Minor
>
> The "compliant" text reader (for CSV files, etc.) has two modes: reading with headers (which creates a scalar vector per file column), or without headers (that creates a single {{columns}} vector with an array of all columns.
> When run in array mode, the code uses a class called {{RepeatedVarCharOutput}}. This class attempts to ignore empty records:
> {code}
> public void finishRecord() {
> ...
> // if there were no defined fields, skip.
> if (fieldIndex > -1) {
> batchIndex++;
> recordCount++;
> }
> ...
> {code}
> As it turns out, this works only on the *first* row. On the first row, the {{fieldIndex}} has its initial value of -1. But, for subsequent records, {{fieldIndex}} is never reset and retains its value from the last field set.
> The result is that the code skips the first empty row, but not empty rows elsewhere in the file.
> Further, on the first row, the logic is flawed. The part shown above is only for the row counts. Here is more of the logic:
> {code}
> @Override
> public void finishRecord() {
> recordStart = characterData;
> ...
> int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4;
> PlatformDependent.putInt(repeatedOffset, newOffset);
> repeatedOffset += 4;
> // if there were no defined fields, skip.
> if (fieldIndex > -1) {
> batchIndex++;
> recordCount++;
> }
> }
> {code}
> Note that the underlying vector *is* bumped for the record, even though the row count is not. Later, this will cause the container to have one less record.
> This would be a problem if the row count ({{batchIndex}}) was used for the batch row count. But, it is not, the next level of code keeps a separate count, so the "skip empty row" logic causes the counts to get out of sync between the {{RepeatedVarCharOutput}} and the next level of reader.
> In sense, there are multiple self-cancelling bugs:
> * The skip-empty-row logic is not synchronized with the upper layer.
> * That bug is partly masked because the logic is not triggered except on the first row.
> * That bug is masked because we always set the row offset, even if we ignore the row in the row count.
> * That bug is masked because the incorrect row count is ignored when setting the container row count.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)