You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/05/08 20:57:04 UTC

[jira] [Updated] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows

     [ https://issues.apache.org/jira/browse/DRILL-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers updated DRILL-5486:
-------------------------------
    Description: 
The "compliant" text reader (for CSV files, etc.) has two modes: reading with headers (which creates a scalar vector per file column), or without headers (that creates a single {{columns}} vector with an array of all columns.

When run in array mode, the code uses a class called {{RepeatedVarCharOutput}}. This class attempts to ignore empty records:

{code}
  public void finishRecord() {
    ...
    // if there were no defined fields, skip.
    if (fieldIndex > -1) {
      batchIndex++;
      recordCount++;
    }
    ...
{code}

As it turns out, this works only on the *first* row. On the first row, the {{fieldIndex}} has its initial value of -1. But, for subsequent records, {{fieldIndex}} is never reset and retains its value from the last field set.

The result is that the code skips the first empty row, but not empty rows elsewhere in the file.

Further, on the first row, the logic is flawed. The part shown above is only for the row counts. Here is more of the logic:

{code}
  @Override
  public void finishRecord() {
    recordStart = characterData;
    ...
    int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4;
    PlatformDependent.putInt(repeatedOffset, newOffset);
    repeatedOffset += 4;

    // if there were no defined fields, skip.
    if (fieldIndex > -1) {
      batchIndex++;
      recordCount++;
    }
  }
{code}

Note that the underlying vector *is* bumped for the record, even though the row count is not. Later, this will cause the container to have one less record.

This would be a problem if the row count ({{batchIndex}}) was used for the batch row count. But, it is not, the next level of code keeps a separate count, so the "skip empty row" logic causes the counts to get out of sync between the {{RepeatedVarCharOutput}} and the next level of reader.

In sense, there are multiple self-cancelling bugs:

* The skip-empty-row logic is not synchronized with the upper layer.
* That bug is partly masked because the logic is not triggered except on the first row.
* That bug is masked because we always set the row offset, even if we ignore the row in the row count.
* That bug is masked because the incorrect row count is ignored when setting the container row count.


  was:
The "compliant" text reader (for CSV files, etc.) has two modes: reading with headers (which creates a scalar vector per file column), or without headers (that creates a single {{columns}} vector with an array of all columns.

When run in array mode, the code uses a class called {{RepeatedVarCharOutput}}. This class attempts to ignore empty records:

{code}
  public void finishRecord() {
    ...
    // if there were no defined fields, skip.
    if (fieldIndex > -1) {
      batchIndex++;
      recordCount++;
    }
    ...
{code}

As it turns out, this works only on the *first* row. On the first row, the {{fieldIndex}} has its initial value of -1. But, for subsequent records, {{fieldIndex}} is never reset and retains its value from the last field set.

The result is that the code skips the first empty row, but not empty rows elsewhere in the file.

Further, on the first row, the logic is flawed. The part shown above is only for the row counts. Here is more of the logic:

{code}
  @Override
  public void finishRecord() {
    recordStart = characterData;
    ...
    int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4;
    PlatformDependent.putInt(repeatedOffset, newOffset);
    repeatedOffset += 4;

    // if there were no defined fields, skip.
    if (fieldIndex > -1) {
      batchIndex++;
      recordCount++;
    }
  }
{code}

Note that the underlying vector *is* bumped for the record, even though the row count is not. Later, this will cause the container to have one less record.

Suppose we have this data:
[]
[abc, def]

The above bug leaves the data in this form:

[]

Why? We counted only one record (the second), but we created an entry for the first (empty) record. When we set row count, we set it to 1, which is only the first. The result is the loss of the second row.

Or, that would be true if the row count ({{batchIndex}}) was used for the batch row count. But, it is not, the next level of code keeps a separate count, so the "skip empty row" logic causes the counts to get out of sync between the {{RepeatedVarCharOutput}} and the next level of reader.

In sense, there are multiple self-cancelling bugs:

* The skip-empty-row logic is not synchronized with the upper layer.
* That bug is partly masked because the logic is not triggered except on the first row.
* That bug is masked because we always set the row offset, even if we ignore the row in the row count.
* That bug is masked because the incorrect row count is ignored when setting the container row count.



> Compliant text reader tries, but fails, to ignore empty rows
> ------------------------------------------------------------
>
>                 Key: DRILL-5486
>                 URL: https://issues.apache.org/jira/browse/DRILL-5486
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> The "compliant" text reader (for CSV files, etc.) has two modes: reading with headers (which creates a scalar vector per file column), or without headers (that creates a single {{columns}} vector with an array of all columns.
> When run in array mode, the code uses a class called {{RepeatedVarCharOutput}}. This class attempts to ignore empty records:
> {code}
>   public void finishRecord() {
>     ...
>     // if there were no defined fields, skip.
>     if (fieldIndex > -1) {
>       batchIndex++;
>       recordCount++;
>     }
>     ...
> {code}
> As it turns out, this works only on the *first* row. On the first row, the {{fieldIndex}} has its initial value of -1. But, for subsequent records, {{fieldIndex}} is never reset and retains its value from the last field set.
> The result is that the code skips the first empty row, but not empty rows elsewhere in the file.
> Further, on the first row, the logic is flawed. The part shown above is only for the row counts. Here is more of the logic:
> {code}
>   @Override
>   public void finishRecord() {
>     recordStart = characterData;
>     ...
>     int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4;
>     PlatformDependent.putInt(repeatedOffset, newOffset);
>     repeatedOffset += 4;
>     // if there were no defined fields, skip.
>     if (fieldIndex > -1) {
>       batchIndex++;
>       recordCount++;
>     }
>   }
> {code}
> Note that the underlying vector *is* bumped for the record, even though the row count is not. Later, this will cause the container to have one less record.
> This would be a problem if the row count ({{batchIndex}}) was used for the batch row count. But, it is not, the next level of code keeps a separate count, so the "skip empty row" logic causes the counts to get out of sync between the {{RepeatedVarCharOutput}} and the next level of reader.
> In sense, there are multiple self-cancelling bugs:
> * The skip-empty-row logic is not synchronized with the upper layer.
> * That bug is partly masked because the logic is not triggered except on the first row.
> * That bug is masked because we always set the row offset, even if we ignore the row in the row count.
> * That bug is masked because the incorrect row count is ignored when setting the container row count.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)