You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "eitsupi (via GitHub)" <gi...@apache.org> on 2023/02/24 18:44:23 UTC

[GitHub] [arrow] eitsupi opened a new pull request, #34340: GH-34339: [R] Add `skip_rows_after_names` option to `read_csv`'s options

eitsupi opened a new pull request, #34340:
URL: https://github.com/apache/arrow/pull/34340

   ### Rationale for this change
   
   The `skip_rows_after_names` option implemented in C++ was not exposed to R.
   
   ### What changes are included in this PR?
   
   The new `skip_rows_after_names` option is now available for the `read_csv_arrow` function.
   
   ``` r
   > csv <- "a,b\n1,2\n3,4"
   > arrow::read_csv_arrow(I(csv), read_options = list(skip_rows_after_names = 1))
   # A tibble: 1 × 2
         a     b
     <int> <int>
   1     3     4
   ```
   
   <sup>Created on 2023-02-24 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
   
   ### Are these changes tested?
   
   Tests are added for the new option.
   
   ### Are there any user-facing changes?
   
   The new `skip_rows_after_names` option can be used.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] eitsupi commented on pull request #34340: GH-34339: [R] Add `skip_rows_after_names` option to `read_csv_arrow`'s options

Posted by "eitsupi (via GitHub)" <gi...@apache.org>.

eitsupi commented on PR #34340:
URL: https://github.com/apache/arrow/pull/34340#issuecomment-1448281835

   > I absolutely agree that it would make a lot more sense if they were side-by-side in documentation such as vignettes. I don't think we currently mention these options in the vignettes at all, but that could make for a really helpful follow-up issue/PR.
   
   Thanks for the suggestion. I have created a follow-up issue #34383 and would be happy to discuss it here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #34340: GH-34339: [R] Add `skip_rows_after_names` option to `read_csv`'s options

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.

thisisnic commented on code in PR #34340:
URL: https://github.com/apache/arrow/pull/34340#discussion_r1118616783


##########
r/R/csv.R:
##########
@@ -385,7 +385,13 @@ CsvTableReader$create <- function(file,
 #'
 #' `CsvReadOptions$create()` further accepts these additional arguments:
 #'
-#' - `skip_rows` Number of lines to skip before reading data (default 0)
+#' - `skip_rows` Number of lines to skip before reading data (default 0).
+#' - `skip_rows_after_names` Number of lines to skip after the column names (default 0).
+#' This number can be larger than the number of rows in one block, and empty rows are counted.
+#' The order of application is as follows:
+#'   - `skip_rows` is applied (if non-zero);
+#'   - column names aread (unless `column_names` is set);
+#'   - `skip_rows_after_names` is applied (if non-zero).

Review Comment:
   ```suggestion
   #' - `skip_rows_after_names` Number of lines to skip after the column names (default 0).
   #'    This number can be larger than the number of rows in one block, and empty rows are counted.
   #'    The order of application is as follows:
   #'      - `skip_rows` is applied (if non-zero);
   #'      - column names are read (unless `column_names` is set);
   #'      - `skip_rows_after_names` is applied (if non-zero).
   ```



##########
r/R/csv.R:
##########
@@ -467,6 +473,7 @@ CsvReadOptions <- R6Class("CsvReadOptions",
 CsvReadOptions$create <- function(use_threads = option_use_threads(),
                                   block_size = 1048576L,
                                   skip_rows = 0L,
+                                  skip_rows_after_names = 0L,

Review Comment:
   Might it be better to add new parameters at the end of the list, so that this won't break code of users who are already using `CsvReadOptions$create()` and passing in parameters without names?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic merged pull request #34340: GH-34339: [R] Add `skip_rows_after_names` option to `read_csv_arrow`'s options

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.

thisisnic merged PR #34340:
URL: https://github.com/apache/arrow/pull/34340


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] eitsupi commented on pull request #34340: GH-34339: [R] Add `skip_rows_after_names` option to `read_csv`'s options

Posted by "eitsupi (via GitHub)" <gi...@apache.org>.

eitsupi commented on PR #34340:
URL: https://github.com/apache/arrow/pull/34340#issuecomment-1446270506

   Thank you for your review.
   
   I have changed the order of the arguments and have changed the order of the list in the documentation too.
   However, it may be easier to understand if `skip` and `skip_rows_after_names` were side by side in documents.
   If you think so, please let me know. (I have no strong opinions about it)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot commented on pull request #34340: GH-34339: [R] Add `skip_rows_after_names` option to `read_csv_arrow`'s options

Posted by "ursabot (via GitHub)" <gi...@apache.org>.

ursabot commented on PR #34340:
URL: https://github.com/apache/arrow/pull/34340#issuecomment-1450150295

   Benchmark runs are scheduled for baseline = 762329b73bac05edfaf0a1e229ef21d801b57048 and contender = 18e255ed758f3e953196ed2cdbfc7424b9861afd. 18e255ed758f3e953196ed2cdbfc7424b9861afd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/05dbc3fb25094099b230ffc9c21052e3...60681eb6238c4b8b999e7fcc4ce86609/)
   [Failed :arrow_down:0.67% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/b046ccc6f58d4a508013a6b42039611a...eb41a1d3c90d4c0a8bfb92580b6136cd/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/d1ab2ecccaf04ec28b2715d877c44fbb...098ec3b5a5164d4799edc77ab78d04d0/)
   [Finished :arrow_down:0.29% :arrow_up:0.06%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/d937ba0a9c154b1aa4206c71033c1fe9...42078be4780243d1bc67530192da9b6d/)
   Buildkite builds:
   [Finished] [`18e255ed` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2452)
   [Finished] [`18e255ed` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2482)
   [Failed] [`18e255ed` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2450)
   [Finished] [`18e255ed` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2473)
   [Finished] [`762329b7` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2451)
   [Failed] [`762329b7` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2481)
   [Finished] [`762329b7` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2449)
   [Finished] [`762329b7` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2472)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #34340: GH-34339: [R] Add `skip_rows_after_names` option to `read_csv`'s options

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #34340:
URL: https://github.com/apache/arrow/pull/34340#issuecomment-1444242482

   * Closes: #34339


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] eitsupi commented on a diff in pull request #34340: GH-34339: [R] Add `skip_rows_after_names` option to `read_csv`'s options

Posted by "eitsupi (via GitHub)" <gi...@apache.org>.

eitsupi commented on code in PR #34340:
URL: https://github.com/apache/arrow/pull/34340#discussion_r1118672911


##########
r/R/csv.R:
##########
@@ -385,7 +385,13 @@ CsvTableReader$create <- function(file,
 #'
 #' `CsvReadOptions$create()` further accepts these additional arguments:
 #'
-#' - `skip_rows` Number of lines to skip before reading data (default 0)
+#' - `skip_rows` Number of lines to skip before reading data (default 0).
+#' - `skip_rows_after_names` Number of lines to skip after the column names (default 0).
+#' This number can be larger than the number of rows in one block, and empty rows are counted.
+#' The order of application is as follows:
+#'   - `skip_rows` is applied (if non-zero);
+#'   - column names aread (unless `column_names` is set);
+#'   - `skip_rows_after_names` is applied (if non-zero).

Review Comment:
   Thank you for finding this. This is obviously copied from the following, so the following line also needs to be corrected too.
   (I will send a PR later)
   
   https://github.com/apache/arrow/blob/6a4bcb36c091fea07c03c57b2e31dd29f9846ac2/python/pyarrow/_csv.pyx#L113



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] eitsupi commented on a diff in pull request #34340: GH-34339: [R] Add `skip_rows_after_names` option to `read_csv`'s options

Posted by "eitsupi (via GitHub)" <gi...@apache.org>.

eitsupi commented on code in PR #34340:
URL: https://github.com/apache/arrow/pull/34340#discussion_r1118672911


##########
r/R/csv.R:
##########
@@ -385,7 +385,13 @@ CsvTableReader$create <- function(file,
 #'
 #' `CsvReadOptions$create()` further accepts these additional arguments:
 #'
-#' - `skip_rows` Number of lines to skip before reading data (default 0)
+#' - `skip_rows` Number of lines to skip before reading data (default 0).
+#' - `skip_rows_after_names` Number of lines to skip after the column names (default 0).
+#' This number can be larger than the number of rows in one block, and empty rows are counted.
+#' The order of application is as follows:
+#'   - `skip_rows` is applied (if non-zero);
+#'   - column names aread (unless `column_names` is set);
+#'   - `skip_rows_after_names` is applied (if non-zero).

Review Comment:
   Thank you for finding this. This is obviously copied from the following, so the following line also needs to be corrected too.
   (I will send you a PR later)
   
   https://github.com/apache/arrow/blob/6a4bcb36c091fea07c03c57b2e31dd29f9846ac2/python/pyarrow/_csv.pyx#L113



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on pull request #34340: GH-34339: [R] Add `skip_rows_after_names` option to `read_csv`'s options

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.

thisisnic commented on PR #34340:
URL: https://github.com/apache/arrow/pull/34340#issuecomment-1446490563

   > Thank you for your review.
   > 
   > I have changed the order of the arguments and have changed the order of the list in the documentation too in [94b41eb](https://github.com/apache/arrow/commit/94b41eb56518e478109e226bd138f5057056479d). However, it may be easier to understand if `skip` and `skip_rows_after_names` were side by side in documents. If you think so, please let me know. (I have no strong opinions about it)
   
   I absolutely agree that it would make a lot more sense if they were side-by-side in documentation such as vignettes.  I don't think we currently mention these options in the vignettes at all, but that could make for a really helpful follow-up issue/PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #34340: GH-34339: [R] Add `skip_rows_after_names` option to `read_csv`'s options

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #34340:
URL: https://github.com/apache/arrow/pull/34340#issuecomment-1444242618

   :warning: GitHub issue #34339 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org