Posted to issues@drill.apache.org by "Anton Gozhiy (JIRA)" <ji...@apache.org> on 2019/05/21 15:01:00 UTC
[jira] [Closed] (DRILL-5487) Vector corruption in CSV with headers and truncated last row
[ https://issues.apache.org/jira/browse/DRILL-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anton Gozhiy closed DRILL-5487.
-------------------------------
Resolution: Fixed
Verified with Drill version 1.17.0-SNAPSHOT (commit id 0195d1f34be7fd385ba76d2fd3e14a9fa13bd375)
The issue is fixed in V3 Text Reader.
> Vector corruption in CSV with headers and truncated last row
> ------------------------------------------------------------
>
> Key: DRILL-5487
> URL: https://issues.apache.org/jira/browse/DRILL-5487
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Text & CSV
> Affects Versions: 1.10.0
> Reporter: Paul Rogers
> Priority: Major
> Fix For: Future
>
>
> The CSV format plugin allows two ways of reading data:
> * As named columns
> * As a single array, called {{columns}}, that holds all columns for a row
> The named columns feature will corrupt the offset vectors if the last row of the file is truncated, that is, if it leaves off one or more columns.
> To illustrate the CSV data corruption, I created a CSV file, test4.csv, of the following form:
> {code}
> h,u
> abc,def
> ghi
> {code}
> Note that the file is truncated: the comma and second field are missing on the last line.
> Then, I created a simple test using the "cluster fixture" framework:
> {code}
> @Test
> public void readerTest() throws Exception {
>   FixtureBuilder builder = ClusterFixture.builder()
>       .maxParallelization(1);
>   try (ClusterFixture cluster = builder.build();
>        ClientFixture client = cluster.clientFixture()) {
>     TextFormatConfig csvFormat = new TextFormatConfig();
>     csvFormat.fieldDelimiter = ',';
>     csvFormat.skipFirstLine = false;
>     csvFormat.extractHeader = true;
>     cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);
>     String sql = "SELECT * FROM `dfs.data`.`csv/test4.csv` LIMIT 10";
>     client.queryBuilder().sql(sql).printCsv();
>   }
> }
> {code}
> The results show we've got a problem:
> {code}
> Exception (no rows returned): org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> IllegalArgumentException: length: -3 (expected: >= 0)
> {code}
> If the last line were:
> {code}
> ghi,
> {code}
> Then the offset vector should look like this:
> {code}
> [0, 3, 3]
> {code}
> Very likely we have an offset vector that looks like this instead:
> {code}
> [0, 3, 0]
> {code}
> When we compute the second column of the second row, we should compute:
> {code}
> length = offset[2] - offset[1] = 3 - 3 = 0
> {code}
> Instead we get:
> {code}
> length = offset[2] - offset[1] = 0 - 3 = -3
> {code}
> The summary is that a premature EOF appears to cause the "missing" columns to be skipped; they are not filled with a blank value to "bump" the offset vector entries for the last row. Instead, those entries are left at 0, causing havoc downstream in the query.
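The offset-vector arithmetic above can be illustrated with a minimal, self-contained Java sketch. This is a hypothetical helper, not Drill's actual ValueVector API: it only shows how a variable-width value's length is derived from adjacent offsets, and how an unfilled last entry yields the negative length in the reported error.

```java
// Hypothetical sketch of offset-vector length computation; Drill's real
// VarCharVector logic is more involved, but the arithmetic is the same.
public class OffsetVectorDemo {

  // Length of the value in the given row = difference of adjacent offsets.
  static int valueLength(int[] offsets, int row) {
    return offsets[row + 1] - offsets[row];
  }

  public static void main(String[] args) {
    int[] correct   = {0, 3, 3}; // missing column filled with an empty value
    int[] corrupted = {0, 3, 0}; // last entry never bumped after premature EOF

    System.out.println(valueLength(correct, 1));   // 0: empty second column
    System.out.println(valueLength(corrupted, 1)); // -3: the reported error
  }
}
```

With the correct vector, the empty trailing column simply has length 0; with the corrupted vector, downstream code computes a negative length, matching the `IllegalArgumentException: length: -3` seen in the query.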
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)