You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/05 21:27:17 UTC

[GitHub] [arrow] n3world opened a new pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

n3world opened a new pull request #10255:
URL: https://github.com/apache/arrow/pull/10255


   Add a new csv reader option which allows the reader to skip rows after reading the column names from the csv.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r631076862



##########
File path: cpp/src/arrow/csv/reader_test.cc
##########
@@ -216,5 +216,83 @@ TEST(StreamingReaderTests, NestedParallelism) {
   TestNestedParallelism(thread_pool, table_factory);
 }
 
+TEST(ReaderOptionsTests, SkipRowsAfterNames) {

Review comment:
       My guess is that the reason it is simple is because of the comment that it is intended to skip corrupt rows so for that it probably has to be a bit brute force.
   
   While adding a more csv aware skip does add some more complexity that parsing knowledge is already contained in the BoundryFinder implementations, so it already exists. Also, as a user specifying the number of rows to skip I would expect that csv rows would be skipped and not file lines.
   
   If it sways your opinion any, I did get the implementation working that uses the BlockReaders, Chunker and BoundryFinders to skip over the lines and the parser and everything downstream are unaware. Also, it can skip lines beyond a single block, satisfying ARROW-8527.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r646677174



##########
File path: cpp/src/arrow/csv/chunker.cc
##########
@@ -235,10 +235,10 @@ class LexingBoundaryFinder : public BoundaryFinder {
     return Status::OK();
   }
 
-  Status FindNth(util::string_view partial, util::string_view block, uint64_t count,
-                 int64_t* out_pos, uint64_t* num_found) override {
+  Status FindNth(util::string_view partial, util::string_view block, int64_t count,

Review comment:
       You're right about options validation. Do you want to open a separate JIRA for it?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r643232930



##########
File path: cpp/src/arrow/csv/reader.cc
##########
@@ -254,22 +271,12 @@ class ThreadedBlockReader : public BlockReader {
  public:
   using BlockReader::BlockReader;
 
-  static Iterator<CSVBlock> MakeIterator(

Review comment:
       Yes. I think it was in just for symmetry with BlockReader




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r630994114



##########
File path: cpp/src/arrow/csv/reader_test.cc
##########
@@ -216,5 +216,83 @@ TEST(StreamingReaderTests, NestedParallelism) {
   TestNestedParallelism(thread_pool, table_factory);
 }
 
+TEST(ReaderOptionsTests, SkipRowsAfterNames) {

Review comment:
       The fact that SkipRows is a bit dumb is a separate concern. We don't seem to have had any user complaints about it, so no need to complicate it for now, IMHO.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#issuecomment-855945821


   Thanks for the update @n3world ! I will merge this once CI is green.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world closed pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world closed pull request #10255:
URL: https://github.com/apache/arrow/pull/10255


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r626978925



##########
File path: cpp/src/arrow/csv/reader_test.cc
##########
@@ -216,5 +216,42 @@ TEST(StreamingReaderTests, NestedParallelism) {
   TestNestedParallelism(thread_pool, table_factory);
 }
 
+TEST(ReaderOptionsTests, SkipRowsAfterNames) {

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r643231248



##########
File path: cpp/src/arrow/csv/chunker_test.cc
##########
@@ -261,5 +291,43 @@ TEST_P(BaseChunkerTest, EscapingNewline) {
   }
 }
 
+TEST_P(BaseChunkerTest, ParseSkip) {
+  {
+    auto csv = MakeCSVData({"ab,c,\n", "def,,gh\n", ",ij,kl\n"});
+    ASSERT_NO_FATAL_FAILURE(AssertSkip(csv, 1, 0, 15));

Review comment:
       Without ASSERT_NO_FATAL_FAILURE the test will continue to execute even if AssertSkip fails.
   
   With gtest assert failures in called functions do not cause the test to exit unless it is wrapped by ASSERT_NO_FATAL_FAILURE




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r646615349



##########
File path: cpp/src/arrow/csv/chunker.cc
##########
@@ -235,10 +235,10 @@ class LexingBoundaryFinder : public BoundaryFinder {
     return Status::OK();
   }
 
-  Status FindNth(util::string_view partial, util::string_view block, uint64_t count,
-                 int64_t* out_pos, uint64_t* num_found) override {
+  Status FindNth(util::string_view partial, util::string_view block, int64_t count,

Review comment:
       That's the convention we use across the Arrow codebase (for example all lengths and offsets are signed, even though they are positive). Given that C++ (and especially some compilers) can be annoying with implicit casts or mixed signed-unsigned operations, it's better not to introduce unsigned variables gratuitously.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r643237969



##########
File path: cpp/src/arrow/csv/chunker_test.cc
##########
@@ -261,5 +291,43 @@ TEST_P(BaseChunkerTest, EscapingNewline) {
   }
 }
 
+TEST_P(BaseChunkerTest, ParseSkip) {
+  {
+    auto csv = MakeCSVData({"ab,c,\n", "def,,gh\n", ",ij,kl\n"});
+    ASSERT_NO_FATAL_FAILURE(AssertSkip(csv, 1, 0, 15));

Review comment:
       Ok, but how is that a problem? It would just accumulate failures instead of stopping at the first one.
   Regardless, this is not important :-)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r646631692



##########
File path: cpp/src/arrow/csv/chunker.cc
##########
@@ -235,10 +235,10 @@ class LexingBoundaryFinder : public BoundaryFinder {
     return Status::OK();
   }
 
-  Status FindNth(util::string_view partial, util::string_view block, uint64_t count,
-                 int64_t* out_pos, uint64_t* num_found) override {
+  Status FindNth(util::string_view partial, util::string_view block, int64_t count,

Review comment:
       This is an internal API. We may want to do it at a higher-level, for example in the CSV reader.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r630155517



##########
File path: cpp/src/arrow/csv/reader_test.cc
##########
@@ -216,5 +216,83 @@ TEST(StreamingReaderTests, NestedParallelism) {
   TestNestedParallelism(thread_pool, table_factory);
 }
 
+TEST(ReaderOptionsTests, SkipRowsAfterNames) {

Review comment:
       Actually after looking at it a bit more it doesn't have to be moved out of the reader but I don't think it can use SkipRows. SkipRows is very simple in its implementation it doesn't actually skip rows but lines in the file. I make the distinction here because a row can contain values with new lines. If any of the rows contain a quoted or escaped new line skip rows will consider that two lines and not one.
   
   I was thinking it might be better to add add a FirstN method to Chunker to be able to get the Nth occurrence of the line endings. I was thinking this could be integrated into the BlockReader implementations to be able to skip over rows even beyond the first block. This could also solve ARROW-8527.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world edited a comment on pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world edited a comment on pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#issuecomment-833185925


   > That would be good. Eventually the dataset scanner will probably be getting a skip operation of some kind as well so that'll increase the pressure on [ARROW-8527](https://issues.apache.org/jira/browse/ARROW-8527). [ARROW-12598](https://issues.apache.org/jira/browse/ARROW-12598) is also (admittedly tangentially) related since you seem to be on a roll smile
   
   The only tricky part about a count(*) implementation with this is that skip_rows documented that it was skipping header rows which shouldn't be counted as part of a data row count. I feel like the row count operation would have to be a little different and maybe give an indicator for on which line the actual data rows start so that the header rows before that point could be skipped.
   
   Maybe a simpler solution would a set of two indexes: column names and first data row . While this doesn't allow arbitrary row skipping in the middle this would allow for the most common use cases, including skipping over valid rows to first desired row. With another option or operation could be used to count the number of data rows starting at first data row. The defaults would be 0, 1 for when column names are part of the csv or -1, 0 when they are not.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r646676681



##########
File path: cpp/src/arrow/csv/chunker.cc
##########
@@ -235,10 +235,10 @@ class LexingBoundaryFinder : public BoundaryFinder {
     return Status::OK();
   }
 
-  Status FindNth(util::string_view partial, util::string_view block, uint64_t count,
-                 int64_t* out_pos, uint64_t* num_found) override {
+  Status FindNth(util::string_view partial, util::string_view block, int64_t count,

Review comment:
       We can keep it a `int32_t` for consistency with `skip_rows`. There is zero use case for skipping more than 2**31 rows :-)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r629540920



##########
File path: cpp/src/arrow/csv/reader_test.cc
##########
@@ -216,5 +216,83 @@ TEST(StreamingReaderTests, NestedParallelism) {
   TestNestedParallelism(thread_pool, table_factory);
 }
 
+TEST(ReaderOptionsTests, SkipRowsAfterNames) {

Review comment:
       OK. I think the skip still needs to be moved to the parser though so that escaped or quoted new lines are properly handled. Also, should it be an error to provide `skip_rows` and  `skip_rows_after_names` at the same time if names are not being parsed from the csv? There is no reason they both cannot be provided it just seems strange to use both and might indicate a lack of understanding by the user.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r626935893



##########
File path: cpp/src/arrow/csv/reader_test.cc
##########
@@ -216,5 +216,42 @@ TEST(StreamingReaderTests, NestedParallelism) {
   TestNestedParallelism(thread_pool, table_factory);
 }
 
+TEST(ReaderOptionsTests, SkipRowsAfterNames) {

Review comment:
       Can you add a test case where `autogenerate_column_names` is set to `true` (i.e. there is no header in the CSV file)?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r644463979



##########
File path: cpp/src/arrow/csv/chunker.cc
##########
@@ -234,6 +235,38 @@ class LexingBoundaryFinder : public BoundaryFinder {
     return Status::OK();
   }
 
+  Result<uint64_t> FindN(util::string_view partial, util::string_view block,
+                         uint64_t count, int64_t* out_pos) override {
+    Lexer<quoting, escaping> lexer(options_);
+    uint64_t found = 0;
+    const char* data = block.data();
+    const char* const data_end = block.data() + block.size();
+
+    const char* line_end;
+    if (partial.size()) {

Review comment:
       Yes because the lexer cannot be called with an empty block of data. It asserts that it is passed data and doesn't handle when `data == data_end`

##########
File path: cpp/src/arrow/csv/chunker_test.cc
##########
@@ -261,5 +291,43 @@ TEST_P(BaseChunkerTest, EscapingNewline) {
   }
 }
 
+TEST_P(BaseChunkerTest, ParseSkip) {
+  {
+    auto csv = MakeCSVData({"ab,c,\n", "def,,gh\n", ",ij,kl\n"});
+    ASSERT_NO_FATAL_FAILURE(AssertSkip(csv, 1, 0, 15));

Review comment:
       It can be a problem for some tests which might cause a segfault or have bad behavior if test execution fails after an ASSERT fails, I know this is not one of those types of tests. Normally ASSERT failures halt test execution for non halting checks it is standard to use the `EXPECT_` macros




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#issuecomment-833018730


   https://issues.apache.org/jira/browse/ARROW-12661


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r645735637



##########
File path: python/pyarrow/tests/test_csv.py
##########
@@ -319,6 +321,79 @@ def test_header_skip_rows(self):
             "kl": ["op"],
         }
 
+    def test_skip_rows_after_names(self):

Review comment:
       Done. Would you like more is just the one good?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r629383462



##########
File path: cpp/src/arrow/csv/reader_test.cc
##########
@@ -216,5 +216,83 @@ TEST(StreamingReaderTests, NestedParallelism) {
   TestNestedParallelism(thread_pool, table_factory);
 }
 
+TEST(ReaderOptionsTests, SkipRowsAfterNames) {

Review comment:
       I can look into writing them in python. I just wrote the test in C++ because I implemented the feature in C++.
   Right now I am actually not working on this because after the above discussions I was under the impression that the desired implementation would be to have skip_rows be a list of ranges of row indexes to skip. With that in mind I actually started to work on https://issues.apache.org/jira/browse/ARROW-12675 so that the parser could know what absolute line number it was parsing and could skip the line.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
westonpace commented on pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#issuecomment-833115352


   Thanks for submitting these PRs!  I'm still brushing up on my parser knowledge but I'll take a look at the other one too.
   
   This one looks pretty straightforward to me.  It's a bit unfortunate that we have two `skip_rows` but it does seem that there are cases for skipping before and after.  I do prefer this over the before/after boolean.  Another option would be to take in a list of row indices.  This is what pandas does.  So [0:1] skips two rows before the header (if any) and [1:2] would skip two rows after the header or [0, 2] would skip a row before and after.
   
   @pitrou should probably take a look at this as he's got the most experience with the CSV reader.  He's out for the rest of the week so it might be next week.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r631184057



##########
File path: cpp/src/arrow/csv/reader_test.cc
##########
@@ -216,5 +216,83 @@ TEST(StreamingReaderTests, NestedParallelism) {
   TestNestedParallelism(thread_pool, table_factory);
 }
 
+TEST(ReaderOptionsTests, SkipRowsAfterNames) {

Review comment:
       Well, if it's not too much code and you want to do it, then why not :-)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r645736085



##########
File path: cpp/src/arrow/util/delimiting.cc
##########
@@ -138,4 +167,25 @@ Status Chunker::ProcessFinal(std::shared_ptr<Buffer> partial,
   return Status::OK();
 }
 
+Status Chunker::ProcessSkip(std::shared_ptr<Buffer> partial,
+                            std::shared_ptr<Buffer> block, bool final, uint64_t* count,
+                            std::shared_ptr<Buffer>* rest) {
+  DCHECK_GT(*count, 0);
+  int64_t pos;
+  ARROW_ASSIGN_OR_RAISE(auto skipped,
+                        boundary_finder_->FindN(util::string_view(*partial),
+                                                util::string_view(*block), *count, &pos));
+  if (pos == BoundaryFinder::kNoDelimiterFound) {
+    return StraddlingTooLarge();
+  }
+  if (ARROW_PREDICT_FALSE(final && *count > skipped && block->size() != pos)) {
+    ++skipped;

Review comment:
       Done. Is the comment descriptive enough?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r629390150



##########
File path: cpp/src/arrow/csv/reader_test.cc
##########
@@ -216,5 +216,83 @@ TEST(StreamingReaderTests, NestedParallelism) {
   TestNestedParallelism(thread_pool, table_factory);
 }
 
+TEST(ReaderOptionsTests, SkipRowsAfterNames) {

Review comment:
       IMHO a list of ranges would be overkill. `skip_rows_after_names` seems ok to me.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r640631129



##########
File path: cpp/src/arrow/csv/reader_test.cc
##########
@@ -216,5 +216,83 @@ TEST(StreamingReaderTests, NestedParallelism) {
   TestNestedParallelism(thread_pool, table_factory);
 }
 
+TEST(ReaderOptionsTests, SkipRowsAfterNames) {

Review comment:
       I updated the PR with a version which uses the BoundryFinder and can skip rows in more than one block




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r646612914



##########
File path: cpp/src/arrow/csv/chunker.cc
##########
@@ -235,10 +235,10 @@ class LexingBoundaryFinder : public BoundaryFinder {
     return Status::OK();
   }
 
-  Status FindNth(util::string_view partial, util::string_view block, uint64_t count,
-                 int64_t* out_pos, uint64_t* num_found) override {
+  Status FindNth(util::string_view partial, util::string_view block, int64_t count,

Review comment:
       Out of curiosity why use signed types for values which should never be negative?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r646666716



##########
File path: cpp/src/arrow/csv/chunker.cc
##########
@@ -235,10 +235,10 @@ class LexingBoundaryFinder : public BoundaryFinder {
     return Status::OK();
   }
 
-  Status FindNth(util::string_view partial, util::string_view block, uint64_t count,
-                 int64_t* out_pos, uint64_t* num_found) override {
+  Status FindNth(util::string_view partial, util::string_view block, int64_t count,

Review comment:
       Looking at the csv options there currently is no validation on any of those values other than asserts. It might make sense to add a options validation to the ReaderMixin as part of another PR.
   
   Mildly related, I also noticed I probably should have made `skip_rows_after_names` an `int64_t` to match the type of the variable in the chunker. Do you want me to make that change quickly or do you want to do it?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r629351629



##########
File path: cpp/src/arrow/csv/reader_test.cc
##########
@@ -216,5 +216,83 @@ TEST(StreamingReaderTests, NestedParallelism) {
   TestNestedParallelism(thread_pool, table_factory);
 }
 
+TEST(ReaderOptionsTests, SkipRowsAfterNames) {

Review comment:
       These options are currently tested on the Python side, so I would appreciate if you wrote your tests in Python instead (also, they'll be easier to write :-)). This implies that you need to expose the new option in Python as well. If that is difficult for you, please say so and we can help.
   
   See https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_csv.py#L286




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r646629832



##########
File path: cpp/src/arrow/csv/chunker.cc
##########
@@ -235,10 +235,10 @@ class LexingBoundaryFinder : public BoundaryFinder {
     return Status::OK();
   }
 
-  Status FindNth(util::string_view partial, util::string_view block, uint64_t count,
-                 int64_t* out_pos, uint64_t* num_found) override {
+  Status FindNth(util::string_view partial, util::string_view block, int64_t count,

Review comment:
       Should I add asserts or checks that negative values are never passed in then because right now the code does not handle negative values?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r646590868



##########
File path: python/pyarrow/tests/test_csv.py
##########
@@ -319,6 +321,79 @@ def test_header_skip_rows(self):
             "kl": ["op"],
         }
 
+    def test_skip_rows_after_names(self):

Review comment:
       It's good, thank you!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#issuecomment-833140856


   > Thanks for submitting these PRs! I'm still brushing up on my parser knowledge but I'll take a look at the other one too.
   > 
   > This one looks pretty straightforward to me. It's a bit unfortunate that we have two `skip_rows` but it does seem that there are cases for skipping before and after. I do prefer this over the before/after boolean. Another option would be to take in a list of row indices. This is what pandas does. So [0:1] skips two rows before the header (if any) and [1:2] would skip two rows after the header or [0, 2] would skip a row before and after.
   > 
   > @pitrou should probably take a look at this as he's got the most experience with the CSV reader. He's out for the rest of the week so it might be next week.
   
   I like the idea of replacing the current skip_rows with a list of ranges to skip. This would be a more complex but a more robust solution. I assume that skip_rows would have to be kept but deprecated. The one difficulty would be if `ParseOptions::newlines_in_values` were true then the reader would have to be aware of the rows that are parsed to have a correct row index. I can start working on this solution ignoring the newlines_in_values issue now if that is the consensus.
   
   I might make it so any range starting with 0 is handled by the reader, as skip_row does now, and any range not starting with 0 is handled by the parser so that potentially quoted new lines are handled correctly.
   
   There might be a way to also tackle https://issues.apache.org/jira/browse/ARROW-8527 with this change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
westonpace commented on pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#issuecomment-833710546


   Valid points.  In that case I wouldn't worry much about ARROW-8527 in this PR at all.  I think 8527 is really looking for "skip data rows" (even though the title says otherwise).  I'll add a comment over there.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r629977397



##########
File path: cpp/src/arrow/csv/reader_test.cc
##########
@@ -216,5 +216,83 @@ TEST(StreamingReaderTests, NestedParallelism) {
   TestNestedParallelism(thread_pool, table_factory);
 }
 
+TEST(ReaderOptionsTests, SkipRowsAfterNames) {

Review comment:
       I don't really understand why the skip needs to be moved to the parser? Currently, `skip_rows` is handled in the reader. I see no reason why `skip_rows_after_names` should be different.
   (also, it sounds ok to me to provide both at the same time)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #10255:
URL: https://github.com/apache/arrow/pull/10255


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r644463979



##########
File path: cpp/src/arrow/csv/chunker.cc
##########
@@ -234,6 +235,38 @@ class LexingBoundaryFinder : public BoundaryFinder {
     return Status::OK();
   }
 
+  Result<uint64_t> FindN(util::string_view partial, util::string_view block,
+                         uint64_t count, int64_t* out_pos) override {
+    Lexer<quoting, escaping> lexer(options_);
+    uint64_t found = 0;
+    const char* data = block.data();
+    const char* const data_end = block.data() + block.size();
+
+    const char* line_end;
+    if (partial.size()) {

Review comment:
       Yes because the lexer cannot be called with an empty block of data. It asserts that it is passed data and doesn't handle when `data == data_end`

##########
File path: cpp/src/arrow/csv/chunker_test.cc
##########
@@ -261,5 +291,43 @@ TEST_P(BaseChunkerTest, EscapingNewline) {
   }
 }
 
+TEST_P(BaseChunkerTest, ParseSkip) {
+  {
+    auto csv = MakeCSVData({"ab,c,\n", "def,,gh\n", ",ij,kl\n"});
+    ASSERT_NO_FATAL_FAILURE(AssertSkip(csv, 1, 0, 15));

Review comment:
       It can be a problem for some tests which might cause a segfault or have bad behavior if test execution fails after an ASSERT fails, I know this is not one of those types of tests. Normally ASSERT failures halt test execution for non halting checks it is standard to use the `EXPECT_` macros




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r630155517



##########
File path: cpp/src/arrow/csv/reader_test.cc
##########
@@ -216,5 +216,83 @@ TEST(StreamingReaderTests, NestedParallelism) {
   TestNestedParallelism(thread_pool, table_factory);
 }
 
+TEST(ReaderOptionsTests, SkipRowsAfterNames) {

Review comment:
       Actually after looking at it a bit more it doesn't have to be moved out of the reader but I don't think it can use SkipRows. SkipRows is very simple in its implementation it doesn't actually skip rows but lines in the file. I make the distinction here because a row can contain values with new lines. If any of the rows contain a quoted or escaped new line skip rows will consider that two lines and not one.
   
   I was thinking it might be better to add add a FirstN method to BoundaryFind to be able to get the Nth occurrence of the line endings and use that in Chunker. I was thinking this could be integrated into the BlockReader implementations to be able to skip over rows even beyond the first block. This could also solve ARROW-8527.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
westonpace commented on pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#issuecomment-833171709


   Ah, I hadn't realized we had `newlines_in_values`.  Yes, sounds like there is a bit of complexity but you have a good grip on it.
   
   > I assume that skip_rows would have to be kept but deprecated.
   
   I don't think so but I'll let @pitrou weigh in.  As long as there is a good reason for it the project seems to be in a place where breaking changes are ok.  Especially since this one would be pretty trivial to understand and adapt to.
   
   > There might be a way to also tackle https://issues.apache.org/jira/browse/ARROW-8527 with this change.
   
   That would be good.  Eventually the dataset scanner will probably be getting a skip operation of some kind as well so that'll increase the pressure on ARROW-8527.  ARROW-12598 is also (admittedly tangentially) related since you seem to be on a roll :smile: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r645736575



##########
File path: cpp/src/arrow/util/delimiting.h
##########
@@ -53,6 +54,20 @@ class ARROW_EXPORT BoundaryFinder {
   /// `out_pos` will be -1 if no delimiter is found.
   virtual Status FindLast(util::string_view block, int64_t* out_pos) = 0;
 
+  /// \brief Find the position of the Nth delimiter inside the block
+  ///
+  /// `partial` is taken to be the beginning of the block, and `block`
+  /// its continuation.  Also, `partial` doesn't contain a delimiter.
+  ///
+  /// The returned `out_pos` is relative to `block`'s start and should point
+  /// to the first character after the first delimiter.
+  /// `out_pos` will be -1 if no delimiter is found.
+  /// \return the number of delimiters found
+  virtual Result<uint64_t> FindN(util::string_view partial, util::string_view block,
+                                 uint64_t count, int64_t* out_pos) {
+    return Status::NotImplemented("BoundaryFinder::FindN");

Review comment:
       Used two out parameters




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#discussion_r643030430



##########
File path: cpp/src/arrow/util/delimiting.h
##########
@@ -138,6 +153,26 @@ class ARROW_EXPORT Chunker {
   Status ProcessFinal(std::shared_ptr<Buffer> partial, std::shared_ptr<Buffer> block,
                       std::shared_ptr<Buffer>* completion, std::shared_ptr<Buffer>* rest);
 
+  /// \brief Skip count number of rows
+  /// Pre-conditions:
+  /// - `partial` is the start of a valid block of delimited data
+  ///   (i.e. starts just after a delimiter)
+  /// - `block` follows `partial` in file order
+  ///
+  /// Post-conditions:
+  /// - Count is updated to indicate the number of rows that still need to be skipped

Review comment:
       Nit, but can you quote variable names as above: e.g. ``` `count` ```, etc.

##########
File path: cpp/src/arrow/csv/chunker_test.cc
##########
@@ -71,6 +71,36 @@ class BaseChunkerTest : public ::testing::TestWithParam<bool> {
 
   void MakeChunker() { chunker_ = ::arrow::csv::MakeChunker(options_); }
 
+  void AssertSkip(const std::string& str, const uint64_t count, const uint64_t rem_count,
+                  const int64_t rest_size) {
+    MakeChunker();
+
+    {
+      auto test_count = count;
+      auto partial = std::make_shared<Buffer>("");
+      auto block = std::make_shared<Buffer>(reinterpret_cast<const uint8_t*>(str.data()),
+                                            static_cast<int64_t>(str.size()));
+      std::shared_ptr<Buffer> rest;
+      ASSERT_OK(chunker_->ProcessSkip(partial, block, true, &test_count, &rest));
+      ASSERT_EQ(rem_count, test_count);
+      ASSERT_EQ(rest_size, rest->size());

Review comment:
       Can you also check the contents of `rest`?

##########
File path: cpp/src/arrow/csv/reader.cc
##########
@@ -166,14 +166,17 @@ namespace {
 // iterator APIs (e.g. Visit)) even though an empty optional is never used in this code.
 class BlockReader {
  public:
-  BlockReader(std::unique_ptr<Chunker> chunker, std::shared_ptr<Buffer> first_buffer)
+  BlockReader(std::unique_ptr<Chunker> chunker, std::shared_ptr<Buffer> first_buffer,
+              uint64_t skip)

Review comment:
       Can you use a more precise name? Is it `skip_rows`?

##########
File path: python/pyarrow/tests/test_csv.py
##########
@@ -319,6 +321,79 @@ def test_header_skip_rows(self):
             "kl": ["op"],
         }
 
+    def test_skip_rows_after_names(self):

Review comment:
       Can you perhaps add a test with both `skip_rows` and `skip_rows_after_names`?

##########
File path: cpp/src/arrow/csv/reader.cc
##########
@@ -254,22 +271,12 @@ class ThreadedBlockReader : public BlockReader {
  public:
   using BlockReader::BlockReader;
 
-  static Iterator<CSVBlock> MakeIterator(

Review comment:
       I presume this was unused?

##########
File path: cpp/src/arrow/util/delimiting.h
##########
@@ -53,6 +54,20 @@ class ARROW_EXPORT BoundaryFinder {
   /// `out_pos` will be -1 if no delimiter is found.
   virtual Status FindLast(util::string_view block, int64_t* out_pos) = 0;
 
+  /// \brief Find the position of the Nth delimiter inside the block
+  ///
+  /// `partial` is taken to be the beginning of the block, and `block`
+  /// its continuation.  Also, `partial` doesn't contain a delimiter.
+  ///
+  /// The returned `out_pos` is relative to `block`'s start and should point
+  /// to the first character after the first delimiter.
+  /// `out_pos` will be -1 if no delimiter is found.
+  /// \return the number of delimiters found
+  virtual Result<uint64_t> FindN(util::string_view partial, util::string_view block,
+                                 uint64_t count, int64_t* out_pos) {
+    return Status::NotImplemented("BoundaryFinder::FindN");

Review comment:
       Suggestions:
   * call this `FindNth`
   * make this a pure virtual method
   * either use two out-parameters, or return the two values as a result, e.g.:
   ```c++
   struct FindNthResult {
     int64_t num_found;
     int64_t pos;
   };
   virtual Result<FindNthResult> FindNth(...) = 0;
   ```

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -72,6 +72,13 @@ cdef class ReadOptions(_Weakrefable):
     encoding: str, optional (default 'utf8')
         The character encoding of the CSV data.  Columns that cannot
         decode using this encoding can still be read as Binary.
+    skip_rows_after_names: int, optional (default 0)

Review comment:
       Can you move this up just after `skip_rows`?

##########
File path: cpp/src/arrow/util/delimiting.cc
##########
@@ -138,4 +167,25 @@ Status Chunker::ProcessFinal(std::shared_ptr<Buffer> partial,
   return Status::OK();
 }
 
+Status Chunker::ProcessSkip(std::shared_ptr<Buffer> partial,
+                            std::shared_ptr<Buffer> block, bool final, uint64_t* count,
+                            std::shared_ptr<Buffer>* rest) {
+  DCHECK_GT(*count, 0);
+  int64_t pos;
+  ARROW_ASSIGN_OR_RAISE(auto skipped,
+                        boundary_finder_->FindN(util::string_view(*partial),
+                                                util::string_view(*block), *count, &pos));
+  if (pos == BoundaryFinder::kNoDelimiterFound) {
+    return StraddlingTooLarge();
+  }
+  if (ARROW_PREDICT_FALSE(final && *count > skipped && block->size() != pos)) {
+    ++skipped;

Review comment:
       Why? Can you add a comment?

##########
File path: cpp/src/arrow/csv/chunker.cc
##########
@@ -234,6 +235,38 @@ class LexingBoundaryFinder : public BoundaryFinder {
     return Status::OK();
   }
 
+  Result<uint64_t> FindN(util::string_view partial, util::string_view block,
+                         uint64_t count, int64_t* out_pos) override {
+    Lexer<quoting, escaping> lexer(options_);
+    uint64_t found = 0;
+    const char* data = block.data();
+    const char* const data_end = block.data() + block.size();
+
+    const char* line_end;
+    if (partial.size()) {

Review comment:
       Is this conditional useful?

##########
File path: cpp/src/arrow/csv/chunker_test.cc
##########
@@ -261,5 +291,43 @@ TEST_P(BaseChunkerTest, EscapingNewline) {
   }
 }
 
+TEST_P(BaseChunkerTest, ParseSkip) {
+  {
+    auto csv = MakeCSVData({"ab,c,\n", "def,,gh\n", ",ij,kl\n"});
+    ASSERT_NO_FATAL_FAILURE(AssertSkip(csv, 1, 0, 15));

Review comment:
       I don't think `ASSERT_NO_FATAL_FAILURE` is useful, but we can keep it if you want.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#issuecomment-849764577


   Sorry meant to delete my other branch but accidentally deleted this one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on pull request #10255: ARROW-12661: [C++] Add ReaderOptions::skip_rows_after_names

Posted by GitBox <gi...@apache.org>.
n3world commented on pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#issuecomment-833185925


   > That would be good. Eventually the dataset scanner will probably be getting a skip operation of some kind as well so that'll increase the pressure on [ARROW-8527](https://issues.apache.org/jira/browse/ARROW-8527). [ARROW-12598](https://issues.apache.org/jira/browse/ARROW-12598) is also (admittedly tangentially) related since you seem to be on a roll smile
   
   The only tricky part about a count(*) implementation with this is that skip_rows documented that it was skipping header rows which shouldn't be counted as part of a data row count. I fee like the row count operation would have to be a little different and maybe give an indicator for on which line the actual data rows start so that the header rows before that point could be skipped.
   
   Maybe a simpler solution would a set of two indexes: column names and first data row . While this doesn't allow arbitrary row skipping in the middle this would allow for the most common use cases, including skipping over valid rows to first desired row. With another option or operation could be used to count the number of data rows starting at first data row. The defaults would be 0, 1 for when column names are part of the csv or -1, 0 when they are not.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org