
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/20 13:50:41 UTC

[GitHub] [arrow] romainfrancois opened a new pull request #7807: ARROW-6537 [R]: Pass column_types to CSV reader

romainfrancois opened a new pull request #7807:
URL: https://github.com/apache/arrow/pull/7807


   This passes `column_types` down to the CSV reader as either NULL or a Schema.
   
   A schema may be confusing here, though: the only thing it controls is the column types, not their order etc., which I believe is implied when you supply a schema.
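
To make that concern concrete, here is a minimal sketch of what "types only, not order" means. It does not use the Arrow API: type names are plain strings, and `TypeMap` and `ResolveColumnTypes` are hypothetical names introduced only for this illustration. The assumption is that explicit types pin the columns they mention, while ordering always follows the file.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical model: type names are plain strings, not Arrow DataType objects.
using TypeMap = std::unordered_map<std::string, std::string>;

// Returns (column name, type) pairs in the order the columns appear in the file.
// `overrides` plays the role of column_types: it pins the type of the columns
// it mentions, and has no say over ordering or over which columns exist.
std::vector<std::pair<std::string, std::string>> ResolveColumnTypes(
    const std::vector<std::string>& file_columns,
    const TypeMap& overrides,
    const TypeMap& inferred) {
  std::vector<std::pair<std::string, std::string>> result;
  result.reserve(file_columns.size());
  for (const auto& name : file_columns) {
    auto explicit_type = overrides.find(name);
    if (explicit_type != overrides.end()) {
      // An explicit type was given, so inference is skipped for this column.
      result.emplace_back(name, explicit_type->second);
    } else {
      // Otherwise fall back on whatever type inference produced.
      auto guessed = inferred.find(name);
      result.emplace_back(
          name, guessed != inferred.end() ? guessed->second : std::string("unknown"));
    }
  }
  return result;
}
```

With file columns `a, b, c` and overrides given for `c` and `a` (in that order), the result still comes back as `a, b, c`, which is why a full schema can suggest more control than it actually has here.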


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] romainfrancois commented on pull request #7807: ARROW-6537 [R]: Pass column_types to CSV reader

Posted by GitBox <gi...@apache.org>.

romainfrancois commented on pull request #7807:
URL: https://github.com/apache/arrow/pull/7807#issuecomment-691148776

It feels complicated to bend the various options from `ParseOptions`, `ConvertOptions` and `ReadOptions` to something that looks like `readr::` as they mean different things.

e.g. `ConvertOptions/column_types`, which we now handle with a `schema`, is only used to specify the types of some columns:

```
/// Optional per-column types (disabling type inference on those columns)
```

and then `ReadOptions/column_names` gives the names of all the columns:

```
/// Column names for the target table.
/// If empty, fall back on autogenerate_column_names.
std::vector<std::string> column_names;
/// Whether to autogenerate column names if `column_names` is empty.
/// If true, column names will be of the form "f0", "f1"...
/// If false, column names will be read from the first CSV row after `skip_rows`.
bool autogenerate_column_names = false;
```
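
The fallback order described in those comments can be sketched as follows. This is a simplified stand-in, not `arrow::csv::ReadOptions` itself: `ReadOptionsSketch` and `ResolveColumnNames` are hypothetical names, and `first_row` is assumed to be the first CSV row after `skip_rows`, already split into fields.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for the two ReadOptions fields quoted above.
struct ReadOptionsSketch {
  std::vector<std::string> column_names;
  bool autogenerate_column_names = false;
};

// Hypothetical helper: decide which names the resulting table would carry.
std::vector<std::string> ResolveColumnNames(
    const ReadOptionsSketch& opts,
    const std::vector<std::string>& first_row) {
  if (!opts.column_names.empty()) {
    return opts.column_names;  // explicit names take precedence
  }
  if (opts.autogenerate_column_names) {
    std::vector<std::string> names;
    names.reserve(first_row.size());
    for (std::size_t i = 0; i < first_row.size(); ++i) {
      names.push_back("f" + std::to_string(i));  // "f0", "f1", ...
    }
    return names;
  }
  return first_row;  // otherwise the header row supplies the names
}
```

In the last two branches the names are only known once the file has been opened, which is what makes a positional type specification hard to attach to them up front.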

There is also `ConvertOptions/include_columns` to control which columns to keep:

```
/// If non-empty, indicates the names of columns from the CSV file that should
/// be actually read and converted (in the vector's order).
/// Columns not in this vector will be ignored.
std::vector<std::string> include_columns;
```
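
As a rough illustration of the `include_columns` behaviour quoted above, the sketch below selects columns in the order of the vector rather than the order of the file. `SelectColumns` is a hypothetical helper, not an Arrow function, and it simply skips requested names that are absent from the file; that is a simplification made for this example, since the real reader has its own handling of that case.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper modelling the documented include_columns semantics.
std::vector<std::string> SelectColumns(
    const std::vector<std::string>& file_columns,
    const std::vector<std::string>& include_columns) {
  if (include_columns.empty()) {
    return file_columns;  // empty means "keep everything, in file order"
  }
  std::vector<std::string> selected;
  for (const auto& name : include_columns) {
    // Keep the requested order; columns not requested are dropped.
    if (std::find(file_columns.begin(), file_columns.end(), name) !=
        file_columns.end()) {
      selected.push_back(name);  // simplification: missing names are skipped
    }
  }
  return selected;
}
```

So, unlike `column_types`, this option does affect both which columns appear and their order, which is part of why the three option sets do not map cleanly onto a single readr-style argument.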


[GitHub] [arrow] nealrichardson closed pull request #7807: ARROW-6537 [R]: Pass column_types to CSV reader

Posted by GitBox <gi...@apache.org>.

nealrichardson closed pull request #7807:
URL: https://github.com/apache/arrow/pull/7807


   



[GitHub] [arrow] github-actions[bot] commented on pull request #7807: ARROW-6537 [R]: Pass column_types to CSV reader

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #7807:
URL: https://github.com/apache/arrow/pull/7807#issuecomment-661060579


   https://issues.apache.org/jira/browse/ARROW-6537



[GitHub] [arrow] nealrichardson commented on pull request #7807: ARROW-6537 [R]: Pass column_types to CSV reader

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on pull request #7807:
URL: https://github.com/apache/arrow/pull/7807#issuecomment-705301068


   @romainfrancois PTAL, I adjusted a few things, wrote docs, and then adjusted a little more based on what made sense to document. I think the new section in the docs for read_csv_arrow summarizes it well, and I think this is the best we can do without ARROW-10219.



[GitHub] [arrow] romainfrancois commented on pull request #7807: ARROW-6537 [R]: Pass column_types to CSV reader

Posted by GitBox <gi...@apache.org>.

romainfrancois commented on pull request #7807:
URL: https://github.com/apache/arrow/pull/7807#issuecomment-661717723


   Added a `schema=` argument that, when specified, overrules `col_names` and `col_types`.
   
   I'm still uncertain about the compact readr specification, because this needs col_names as well, i.e. we can't make the compact spec relevant to guessed or autogenerated names. 



[GitHub] [arrow] nealrichardson commented on pull request #7807: ARROW-6537 [R]: Pass column_types to CSV reader

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on pull request #7807:
URL: https://github.com/apache/arrow/pull/7807#issuecomment-662045381


   > I'm still uncertain about the compact readr specification, because this needs col_names as well, i.e. we can't make the compact spec relevant to guessed or autogenerated names.
   
   I see the logic where the column names are inferred/generated in cpp/src/arrow/csv/reader.cc, it's just not exposed publicly. I could see adding a `column_names` attribute to `arrow::csv::TableReader`, so we could instantiate a reader, get column names, then make a new reader with the appropriate `*Options` objects. I can make a JIRA but I don't think we need to block this PR on that. 
   
   I get that the compact readr specification isn't all that useful as is since you also have to provide the col_names, but if we are planning to expose column names on TableReader, would it make sense to keep it in this PR for now? Or would you rather delete/stash it completely until we can support it without requiring col_names?
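
A minimal sketch of why the compact specification depends on the column names: the spec is positional, so it can only become a name-to-type mapping once the names are known, whether they come from `col_names` or from a first pass over the file as proposed above. `ExpandCompactSpec` is a hypothetical helper, not part of Arrow or readr, and it keeps the raw one-letter codes instead of translating them into Arrow types.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: pair each character of a readr-style compact spec
// (e.g. "i?d") with the column name at the same position.
// '?' means "guess", so that column gets no entry and is left to inference.
std::map<std::string, char> ExpandCompactSpec(
    const std::string& spec,
    const std::vector<std::string>& column_names) {
  if (spec.size() != column_names.size()) {
    throw std::invalid_argument(
        "compact spec must have one character per column");
  }
  std::map<std::string, char> types;
  for (std::size_t i = 0; i < spec.size(); ++i) {
    if (spec[i] != '?') {
      types[column_names[i]] = spec[i];
    }
  }
  return types;
}
```

Given names `a, b, c` and the spec `"i?d"`, this yields entries for `a` and `c` only. Without a way to learn the inferred or autogenerated names first, there is nothing to key the entries on.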


