Posted to issues@arrow.apache.org by "Jeffrey Wong (JIRA)" <ji...@apache.org> on 2018/12/14 04:22:00 UTC
[jira] [Updated] (ARROW-4027) Reading partitioned datasets using RecordBatchFileReader into R
[ https://issues.apache.org/jira/browse/ARROW-4027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeffrey Wong updated ARROW-4027:
--------------------------------
Description:
I have a Parquet dataset (which originally came from Hive) stored locally in the directory `data/`. It contains 4 files:
```
data/foo1
data/foo2
data/foo3
data/foo4
```
Using pyarrow, I can read each file via
`pq.read_table("data/foo1").to_pandas()`
I am trying to read them into R using `read_table("data/foo1")`, but I receive this error:
```
Error in ipc___RecordBatchFileReader__Open(file) :
Invalid: Not an Arrow file
```
From debugging, I've traced it to this line https://github.com/apache/arrow/blob/d54a154263e1dba5515ebe1a8423a676a01e3951/r/R/RecordBatchReader.R#L112, which then calls into this Rcpp code https://github.com/apache/arrow/blob/d54a154263e1dba5515ebe1a8423a676a01e3951/r/src/recordbatchreader.cpp#L85. It seems that this C++ function expects a single file-like object (https://arrow.apache.org/docs/cpp/classarrow_1_1ipc_1_1_record_batch_file_reader.html#a7e6c66ca32d75bc8d4ee905982d9819e); I think because my data is split across files, the footer that is supposed to contain the file layout and schema cannot be found, hence the error "Not an Arrow file".
If I pass the whole directory using `read_table("data/")`, I get:
```
Error in ipc___RecordBatchFileReader__Open(file) :
IOError: Error reading bytes from file: Is a directory
```
So, how can I use the R package to correctly read multiple Parquet files? If I need to call RecordBatchFileReader with a pointer to the footer, file layout, and schema, how do I find the footer of the dataset?
I cannot post the original dataset online, and I don't know what aspect of my data causes the code to break, so I don't quite know how to post a reproducible example. Tips on how to generate a partitioned dataset would be great.
> Reading partitioned datasets using RecordBatchFileReader into R
> ---------------------------------------------------------------
>
> Key: ARROW-4027
> URL: https://issues.apache.org/jira/browse/ARROW-4027
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 0.11.1
> Environment: Ubuntu 16.04, building R package from master on github
> Reporter: Jeffrey Wong
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)