Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/25 11:27:30 UTC

[GitHub] [arrow] vibhatha opened a new pull request #12252: ARROW-15438: [Python] Flaky test test_write_dataset_max_open_files

vibhatha opened a new pull request #12252:
URL: https://github.com/apache/arrow/pull/12252


   Adding a fix: the flakiness could be caused by the small amount of data used in the test.
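For context, a minimal sketch of the kind of check the test performs: counting the data files a partitioned write produces. The `count_generated_files` helper and the placeholder files below are illustrative only, not the actual code from `test_dataset.py`.

```python
import os
import tempfile

# Hypothetical helper mirroring what the test's _get_compare_pair checks:
# count the data files written under a partitioned dataset directory.
def count_generated_files(root, suffix=".parquet"):
    return sum(
        1
        for _, _, files in os.walk(root)
        for name in files
        if name.endswith(suffix)
    )

# Tiny demo: empty placeholder files stand in for real parquet output,
# one per Hive-style partition directory.
with tempfile.TemporaryDirectory() as root:
    for part in ["c2=a", "c2=b"]:
        d = os.path.join(root, part)
        os.makedirs(d)
        open(os.path.join(d, "part-0.parquet"), "w").close()
    n = count_generated_files(root)

print(n)  # 2
```

If the writer is forced to close and reopen files, a partition ends up with more than one part file, which is what the flaky assertion is sensitive to.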


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] vibhatha commented on a change in pull request #12252: ARROW-15438: [Python] Flaky test test_write_dataset_max_open_files

Posted by GitBox <gi...@apache.org>.
vibhatha commented on a change in pull request #12252:
URL: https://github.com/apache/arrow/pull/12252#discussion_r791843026



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -3804,6 +3807,8 @@ def _get_compare_pair(data_source, record_batch, file_format, col_id):
     num_of_files_generated, number_of_partitions \
         = _get_compare_pair(data_source_2, record_batch_1, file_format,
                             partition_column_id)
+    print(f"Generated File Count: {num_of_files_generated}")

Review comment:
       Removed the logs.







[GitHub] [arrow] raulcd commented on a change in pull request #12252: ARROW-15438: [Python] Flaky test test_write_dataset_max_open_files

Posted by GitBox <gi...@apache.org>.
raulcd commented on a change in pull request #12252:
URL: https://github.com/apache/arrow/pull/12252#discussion_r791743995



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -3804,6 +3807,8 @@ def _get_compare_pair(data_source, record_batch, file_format, col_id):
     num_of_files_generated, number_of_partitions \
         = _get_compare_pair(data_source_2, record_batch_1, file_format,
                             partition_column_id)
+    print(f"Generated File Count: {num_of_files_generated}")

Review comment:
       Hi, thanks for the PR!
   This is the first time I am contributing to the project, reviewing some PRs and trying to understand it a little better. From what I can see in the rest of the test files, there are not many print statements. Was this one for debugging? I would remove the print unless there is a reason to keep it.
   Thanks







[GitHub] [arrow] vibhatha commented on a change in pull request #12252: ARROW-15438: [Python] Flaky test test_write_dataset_max_open_files

Posted by GitBox <gi...@apache.org>.
vibhatha commented on a change in pull request #12252:
URL: https://github.com/apache/arrow/pull/12252#discussion_r791749029



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -3804,6 +3807,8 @@ def _get_compare_pair(data_source, record_batch, file_format, col_id):
     num_of_files_generated, number_of_partitions \
         = _get_compare_pair(data_source_2, record_batch_1, file_format,
                             partition_column_id)
+    print(f"Generated File Count: {num_of_files_generated}")

Review comment:
       Yeah, I also left the print statement in for future debugging purposes. We can remove it too.







[GitHub] [arrow] github-actions[bot] commented on pull request #12252: ARROW-15438: [Python] Flaky test test_write_dataset_max_open_files

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12252:
URL: https://github.com/apache/arrow/pull/12252#issuecomment-1021091474


   https://issues.apache.org/jira/browse/ARROW-15438






[GitHub] [arrow] kszucs commented on pull request #12252: ARROW-15438: [Python] Flaky test test_write_dataset_max_open_files

Posted by GitBox <gi...@apache.org>.
kszucs commented on pull request #12252:
URL: https://github.com/apache/arrow/pull/12252#issuecomment-1021396494


   @lidavidm can you verify that the patch resolves the issue?





[GitHub] [arrow] lidavidm commented on pull request #12252: ARROW-15438: [Python] Flaky test test_write_dataset_max_open_files

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #12252:
URL: https://github.com/apache/arrow/pull/12252#issuecomment-1021488945


   I would probably just remove the second half of the test (it's all tested in C++ anyways)





[GitHub] [arrow] pitrou closed pull request #12252: ARROW-15438: [Python] Flaky test test_write_dataset_max_open_files

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #12252:
URL: https://github.com/apache/arrow/pull/12252


   







[GitHub] [arrow] vibhatha commented on pull request #12252: ARROW-15438: [Python] Flaky test test_write_dataset_max_open_files

Posted by GitBox <gi...@apache.org>.
vibhatha commented on pull request #12252:
URL: https://github.com/apache/arrow/pull/12252#issuecomment-1021491520


   @lidavidm yes, it is tested in C++. The C++ API is richer than Python's, so the exact case intended can be written there more naturally. We could remove this part, but then an important case would go untested from Python. Does this also point to a limitation in the Python API?
   
   @westonpace any thoughts on this? 





[GitHub] [arrow] lidavidm commented on pull request #12252: ARROW-15438: [Python] Flaky test test_write_dataset_max_open_files

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #12252:
URL: https://github.com/apache/arrow/pull/12252#issuecomment-1021487828


   I'm not sure why use_threads doesn't affect this (though the GLog output below shows various thread IDs):
   
   ```
   pyarrow/tests/test_dataset.py::test_write_dataset_max_open_files WARNING: Logging before InitGoogleLogging() is written to STDERR
   W20220125 13:26:12.820377 27754 file_base.cc:297] Next batch: c1:   [
       1,
       2,
       3,
       4,
       0,
       10,
       20,
       30,
       40
     ]
   c2:   [
       "a",
       "b",
       "c",
       "d",
       "e",
       "a",
       "a",
       "b",
       "c"
     ]
   W20220125 13:26:12.820533 27755 file_base.cc:297] Next batch: c1:   [
       5,
       6,
       7,
       8,
       0,
       1,
       10,
       20,
       30
     ]
   c2:   [
       "a",
       "b",
       "c",
       "d",
       "e",
       "c",
       "a",
       "b",
       "c"
     ]
   W20220125 13:26:12.820778 27757 file_base.cc:297] Next batch: c1:   [
       9,
       10,
       11,
       12,
       0,
       1,
       30,
       40,
       50
     ]
   c2:   [
       "a",
       "b",
       "c",
       "d",
       "e",
       "d",
       "a",
       "b",
       "c"
     ]
   W20220125 13:26:12.820991 27752 file_base.cc:297] Next batch: c1:   [
       13,
       14,
       15,
       16,
       0,
       1,
       10,
       20,
       30,
       40,
       50
     ]
   c2:   [
       "a",
       "b",
       "c",
       "d",
       "e",
       "b",
       "b",
       "a",
       "d",
       "b",
       "a"
     ]
   W20220125 13:26:12.821352 27754 file_base.cc:319] Next partition c1:   [
       1,
       10,
       20
     ]
   W20220125 13:26:12.821383 27754 file_base.cc:320] Next destination c2=a
   W20220125 13:26:12.821465 27754 dataset_writer.cc:499] Need to open file
   W20220125 13:26:12.821472 27754 dataset_writer.cc:309] Opening file /tmp/pytest-of-lidavidm/pytest-232/test_write_dataset_max_open_fi0/ds/max_1/c2=a/part-0.parquet
   W20220125 13:26:12.821590 27755 file_base.cc:319] Next partition c1:   [
       5,
       10
     ]
   W20220125 13:26:12.821632 27755 file_base.cc:320] Next destination c2=a
   W20220125 13:26:12.821770 27757 file_base.cc:319] Next partition c1:   [
       9,
       30
     ]
   W20220125 13:26:12.821805 27757 file_base.cc:320] Next destination c2=a
   W20220125 13:26:12.821910 27752 file_base.cc:319] Next partition c1:   [
       13,
       20,
       50
     ]
   W20220125 13:26:12.821943 27752 file_base.cc:320] Next destination c2=a
   W20220125 13:26:12.822041 27755 file_base.cc:319] Next partition c1:   [
       6,
       20
     ]
   W20220125 13:26:12.822079 27755 file_base.cc:320] Next destination c2=b
   W20220125 13:26:12.822149 27755 dataset_writer.cc:499] Need to open file
   W20220125 13:26:12.822168 27755 dataset_writer.cc:502] Need to close file
   W20220125 13:26:12.824494 27761 dataset_writer.cc:321] Closing file /tmp/pytest-of-lidavidm/pytest-232/test_write_dataset_max_open_fi0/ds/max_1/c2=a/part-0.parquet
   W20220125 13:26:12.824517 27761 dataset_writer.cc:499] Need to open file
   W20220125 13:26:12.824522 27761 dataset_writer.cc:309] Opening file /tmp/pytest-of-lidavidm/pytest-232/test_write_dataset_max_open_fi0/ds/max_1/c2=b/part-0.parquet
   W20220125 13:26:12.824643 27761 file_base.cc:319] Next partition c1:   [
       2,
       30
     ]
   W20220125 13:26:12.824677 27761 file_base.cc:320] Next destination c2=b
   W20220125 13:26:12.824752 27761 file_base.cc:319] Next partition c1:   [
       10,
       40
     ]
   W20220125 13:26:12.824780 27761 file_base.cc:320] Next destination c2=b
   W20220125 13:26:12.824851 27761 file_base.cc:319] Next partition c1:   [
       14,
       1,
       10,
       40
     ]
   W20220125 13:26:12.824880 27761 file_base.cc:320] Next destination c2=b
   W20220125 13:26:12.824952 27761 file_base.cc:319] Next partition c1:   [
       7,
       1,
       30
     ]
   W20220125 13:26:12.824980 27761 file_base.cc:320] Next destination c2=c
   W20220125 13:26:12.825034 27761 dataset_writer.cc:499] Need to open file
   W20220125 13:26:12.825044 27761 dataset_writer.cc:502] Need to close file
   W20220125 13:26:12.826997 27762 dataset_writer.cc:321] Closing file /tmp/pytest-of-lidavidm/pytest-232/test_write_dataset_max_open_fi0/ds/max_1/c2=b/part-0.parquet
   W20220125 13:26:12.827020 27762 dataset_writer.cc:499] Need to open file
   W20220125 13:26:12.827025 27762 dataset_writer.cc:309] Opening file /tmp/pytest-of-lidavidm/pytest-232/test_write_dataset_max_open_fi0/ds/max_1/c2=c/part-0.parquet
   W20220125 13:26:12.827148 27762 file_base.cc:319] Next partition c1:   [
       3,
       40
     ]
   W20220125 13:26:12.827184 27762 file_base.cc:320] Next destination c2=c
   W20220125 13:26:12.827260 27762 file_base.cc:319] Next partition c1:   [
       11,
       50
     ]
   W20220125 13:26:12.827288 27762 file_base.cc:320] Next destination c2=c
   W20220125 13:26:12.827361 27762 file_base.cc:319] Next partition c1:   [
       15
     ]
   W20220125 13:26:12.827389 27762 file_base.cc:320] Next destination c2=c
   W20220125 13:26:12.827461 27762 file_base.cc:319] Next partition c1:   [
       4
     ]
   W20220125 13:26:12.827487 27762 file_base.cc:320] Next destination c2=d
   W20220125 13:26:12.827543 27762 dataset_writer.cc:499] Need to open file
   W20220125 13:26:12.827553 27762 dataset_writer.cc:502] Need to close file
   W20220125 13:26:12.829382 27761 dataset_writer.cc:321] Closing file /tmp/pytest-of-lidavidm/pytest-232/test_write_dataset_max_open_fi0/ds/max_1/c2=c/part-0.parquet
   W20220125 13:26:12.829403 27761 dataset_writer.cc:499] Need to open file
   W20220125 13:26:12.829409 27761 dataset_writer.cc:309] Opening file /tmp/pytest-of-lidavidm/pytest-232/test_write_dataset_max_open_fi0/ds/max_1/c2=d/part-0.parquet
   W20220125 13:26:12.829531 27761 file_base.cc:319] Next partition c1:   [
       8
     ]
   W20220125 13:26:12.829560 27761 file_base.cc:320] Next destination c2=d
   W20220125 13:26:12.829636 27761 file_base.cc:319] Next partition c1:   [
       12,
       1
     ]
   W20220125 13:26:12.829663 27761 file_base.cc:320] Next destination c2=d
   W20220125 13:26:12.829735 27761 file_base.cc:319] Next partition c1:   [
       16,
       30
     ]
   W20220125 13:26:12.829762 27761 file_base.cc:320] Next destination c2=d
   W20220125 13:26:12.829834 27761 file_base.cc:319] Next partition c1:   [
       0
     ]
   W20220125 13:26:12.829860 27761 file_base.cc:320] Next destination c2=e
   W20220125 13:26:12.829912 27761 dataset_writer.cc:499] Need to open file
   W20220125 13:26:12.829921 27761 dataset_writer.cc:502] Need to close file
   W20220125 13:26:12.831791 27762 dataset_writer.cc:321] Closing file /tmp/pytest-of-lidavidm/pytest-232/test_write_dataset_max_open_fi0/ds/max_1/c2=d/part-0.parquet
   W20220125 13:26:12.831813 27762 dataset_writer.cc:499] Need to open file
   W20220125 13:26:12.831820 27762 dataset_writer.cc:309] Opening file /tmp/pytest-of-lidavidm/pytest-232/test_write_dataset_max_open_fi0/ds/max_1/c2=e/part-0.parquet
   W20220125 13:26:12.831940 27762 file_base.cc:319] Next partition c1:   [
       0
     ]
   W20220125 13:26:12.831971 27762 file_base.cc:320] Next destination c2=e
   W20220125 13:26:12.832048 27762 file_base.cc:319] Next partition c1:   [
       0
     ]
   W20220125 13:26:12.832075 27762 file_base.cc:320] Next destination c2=e
   W20220125 13:26:12.832150 27762 file_base.cc:319] Next partition c1:   [
       0
     ]
   W20220125 13:26:12.832176 27762 file_base.cc:320] Next destination c2=e
   W20220125 13:26:12.834204 27762 dataset_writer.cc:321] Closing file /tmp/pytest-of-lidavidm/pytest-232/test_write_dataset_max_open_fi0/ds/max_1/c2=e/part-0.parquet
   ```





[GitHub] [arrow] lidavidm commented on pull request #12252: ARROW-15438: [Python] Flaky test test_write_dataset_max_open_files

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #12252:
URL: https://github.com/apache/arrow/pull/12252#issuecomment-1021399420









[GitHub] [arrow] pitrou commented on pull request #12252: ARROW-15438: [Python] Flaky test test_write_dataset_max_open_files

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #12252:
URL: https://github.com/apache/arrow/pull/12252#issuecomment-1023284508


   Should this be closed now that #12263 has been merged?
   







[GitHub] [arrow] lidavidm commented on pull request #12252: ARROW-15438: [Python] Flaky test test_write_dataset_max_open_files

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #12252:
URL: https://github.com/apache/arrow/pull/12252#issuecomment-1021485326


   Basically you are getting "very unlucky": ConsumingSinkNode does not serialize calls to Consume, and all the data in the batches is ordered by the partition column. Each batch gets partitioned, and the threads then _happen_ to line up such that each thread writes out the batch for partition A, then for partition B, and so on. So even if only one file can be open at a time, we never need to open a second file for any partition.
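The "Need to open file" / "Need to close file" behavior in the logs above can be mimicked with a toy model. This is purely illustrative (`DatasetWriterSim` is not real Arrow code): a writer that keeps at most `max_open_files` files open and closes the least-recently-used one at the limit produces a per-partition file count that depends entirely on the order in which batches arrive.

```python
from collections import OrderedDict

class DatasetWriterSim:
    """Toy model (not the real Arrow implementation) of a dataset writer
    that keeps at most `max_open_files` files open, closing the
    least-recently-used one when the limit would be exceeded."""

    def __init__(self, max_open_files):
        self.max_open_files = max_open_files
        self.open_files = OrderedDict()  # partition -> currently-open marker
        self.files_created = {}          # partition -> number of part files

    def write_batch(self, partition):
        if partition in self.open_files:
            # File already open: just append, mark as recently used.
            self.open_files.move_to_end(partition)
            return
        if len(self.open_files) >= self.max_open_files:
            # "Need to close file": evict the least-recently-used partition.
            self.open_files.popitem(last=False)
        # "Need to open file": a (re-)opened partition gets a new part file.
        self.open_files[partition] = True
        self.files_created[partition] = self.files_created.get(partition, 0) + 1

writer = DatasetWriterSim(max_open_files=1)
# Batches arriving grouped by partition: each partition needs only one file.
for p in ["a", "a", "b", "b", "c", "c"]:
    writer.write_batch(p)
grouped_counts = dict(writer.files_created)

writer2 = DatasetWriterSim(max_open_files=1)
# Interleaved arrival: every partition switch forces a close + reopen.
for p in ["a", "b", "a", "b"]:
    writer2.write_batch(p)
interleaved_counts = dict(writer2.files_created)

print(grouped_counts)      # {'a': 1, 'b': 1, 'c': 1}
print(interleaved_counts)  # {'a': 2, 'b': 2}
```

With max_open_files=1 and batches effectively grouped by partition, no partition ever needs a second file, which matches the observation that the threads "happen to line up" and the test's extra-files assertion never fires.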





[GitHub] [arrow] lidavidm commented on pull request #12252: ARROW-15438: [Python] Flaky test test_write_dataset_max_open_files

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #12252:
URL: https://github.com/apache/arrow/pull/12252#issuecomment-1021485822


   Oh, cross that part out about the data ordering. The partition step makes that moot.

