
Posted to jira@arrow.apache.org by "Dominik Moritz (Jira)" <ji...@apache.org> on 2021/03/28 19:19:00 UTC

[jira] [Created] (ARROW-12124) [Rust] Parquet writer creates invalid parquet files

Dominik Moritz created ARROW-12124:
--------------------------------------

             Summary: [Rust] Parquet writer creates invalid parquet files
                 Key: ARROW-12124
                 URL: https://issues.apache.org/jira/browse/ARROW-12124
             Project: Apache Arrow
          Issue Type: Bug
          Components: Rust
            Reporter: Dominik Moritz


I wrote a simple CSV to Parquet converter at https://github.com/domoritz/csv2parquet/blob/f53feb5bd995eab41dee09f2c4d722512052d7ca/src/main.rs. 

I ran it (`csv2parquet test.txt test.parquet`) on a simple file such as:

```
a,b,c
0,1,hello world
0,1,hello world
0,1,hello world
0,1,hello world
0,1,hello world
0,1,hello world
0,1,hello world
```

I then tried to read the output in Python with:

```
import pandas as pd
df = pd.read_parquet('test.parquet')
df.to_csv('test2.csv')
```

This results in the following error:

```
OSError: Could not open parquet input source '<Buffer>': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
```
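
To narrow down what the reader is complaining about, the file can be inspected directly. The Parquet format places the 4-byte marker `PAR1` at both the start and the end of an unencrypted file. The following is a minimal sketch of such a check; the helper name `check_parquet_magic` is hypothetical and not part of pandas, pyarrow, or the converter.

```python
import os

# 4-byte marker that the Parquet format puts at the start and end of a file.
PARQUET_MAGIC = b"PAR1"


def check_parquet_magic(path):
    """Return (header_ok, footer_ok) for the file at `path`.

    Hypothetical diagnostic helper: it only looks at the first and last
    four bytes and does not validate anything else about the file.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        header = f.read(4)
        if size >= 4:
            # Jump to the last four bytes, where the footer marker belongs.
            f.seek(-4, os.SEEK_END)
            footer = f.read(4)
        else:
            # Too short to contain a footer marker at all.
            footer = b""
    return header == PARQUET_MAGIC, footer == PARQUET_MAGIC
```

Calling `check_parquet_magic('test.parquet')` returns a pair of booleans. If the second value is `False`, the trailing marker is missing, which would be consistent with the "magic bytes not found in footer" message above.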

The schema seems to be inferred correctly:

```
Inferred Schema:
{
  "fields": [
    {
      "name": "a",
      "nullable": false,
      "type": {
        "name": "int",
        "bitWidth": 64,
        "isSigned": true
      },
      "children": []
    },
    {
      "name": "b",
      "nullable": false,
      "type": {
        "name": "int",
        "bitWidth": 64,
        "isSigned": true
      },
      "children": []
    },
    {
      "name": "c",
      "nullable": false,
      "type": {
        "name": "utf8"
      },
      "children": []
    }
  ],
  "metadata": {}
}
```


