You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/16 00:26:28 UTC

[GitHub] [arrow] nealrichardson opened a new pull request, #13625: ARROW-16612: [R] parquet files with compression extensions should use parquet writer for compression

nealrichardson opened a new pull request, #13625:
URL: https://github.com/apache/arrow/pull/13625

   Needs documentation update. Also should do the same for feather since it handles compression in the writer and not by wrapping the stream. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] eitsupi commented on a diff in pull request #13625: ARROW-16612: [R] Support inferring compression from filename for all readers/writers

Posted by GitBox <gi...@apache.org>.
eitsupi commented on code in PR #13625:
URL: https://github.com/apache/arrow/pull/13625#discussion_r927476810


##########
r/R/feather.R:
##########
@@ -67,11 +70,22 @@ write_feather <- function(x,
                           sink,
                           version = 2,
                           chunk_size = 65536L,
-                          compression = c("default", "lz4", "uncompressed", "zstd"),
+                          compression = c("default", "lz4", "lz4_frame", "uncompressed", "zstd"),

Review Comment:
   I think the same change is needed in L145.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] nealrichardson commented on pull request #13625: ARROW-16612: [R] Support inferring compression from filename for all readers/writers

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on PR #13625:
URL: https://github.com/apache/arrow/pull/13625#issuecomment-1192545063

   > after this PR we get a file with a .gz extension that is not gzipped
   
   The file isn't gzipped but gzip compression is used internally in compressing the Parquet file contents. I agree that that is odd, but it is consistent with my understanding of how compression filename extensions are used with Parquet customarily.
   
   Weirdness aside, the bigger issue IMO is that this PR fixes at least 3 bugs where the current code fails. On master, `write_parquet("XXXX.gz")` writes a Parquet file and then compresses it with gzip around it, but then `read_parquet("XXXX.gz")` can't read it. Moreover, `write_parquet("XXXX.zst")` would also write a gzipped file, and `write_parquet("XXXX.snappy")` wouldn't compress at all. If we think that write_parquet shouldn't infer compression from the filename at all, that's fine, we can make that change on top of this PR, but we should move forward with the rest of the changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kszucs commented on pull request #13625: ARROW-16612: [R] Support inferring compression from filename for all readers/writers

Posted by GitBox <gi...@apache.org>.
kszucs commented on PR #13625:
URL: https://github.com/apache/arrow/pull/13625#issuecomment-1195263557

   @nealrichardson could you please finalize this so we can include it in the release?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] paleolimbot commented on a diff in pull request #13625: ARROW-16612: [R] Support inferring compression from filename for all readers/writers

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on code in PR #13625:
URL: https://github.com/apache/arrow/pull/13625#discussion_r927720953


##########
r/R/io.R:
##########
@@ -305,45 +284,36 @@ make_output_stream <- function(x, filesystem = NULL, compression = NULL) {
 
   if (inherits(x, "SubTreeFileSystem")) {
     filesystem <- x$base_fs
-    # SubTreeFileSystem adds a slash to base_path, but filesystems will reject file names
-    # with trailing slashes, so we need to remove it here.
-    x <- sub("/$", "", x$base_path)
+    # SubTreeFileSystem adds a slash to base_path, but filesystems will reject
+    # file names with trailing slashes, so we need to remove it here.
+    path <- sub("/$", "", x$base_path)
+    filesystem$OpenOutputStream(path)
   } else if (is_url(x)) {
     fs_and_path <- FileSystem$from_uri(x)
-    filesystem <- fs_and_path$fs
-    x <- fs_and_path$path
-  }
-
-  if (is.null(compression)) {
-    # Infer compression from sink
-    compression <- detect_compression(x)
-  }
-
-  assert_that(is.string(x))
-  if (is.null(filesystem) && is_compressed(compression)) {
-    CompressedOutputStream$create(x) ## compressed local
-  } else if (is.null(filesystem) && !is_compressed(compression)) {
-    FileOutputStream$create(x) ## uncompressed local
-  } else if (!is.null(filesystem) && is_compressed(compression)) {
-    CompressedOutputStream$create(filesystem$OpenOutputStream(x)) ## compressed remote
+    fs_and_path$fs$OpenOutputStream(fs_and_path$path)
   } else {
-    filesystem$OpenOutputStream(x) ## uncompressed remote
+    assert_that(is.string(x))
+    FileOutputStream$create(x)
   }
 }
 
 detect_compression <- function(path) {

Review Comment:
   If the .gz.parquet / .snappy.parquet naming convention is actually a thing, you could do
   
   ```r
   detect_internal_compression <- function(path, format) {
     if (detect_compression(path) != "uncompressed") warn("ignoring .whatever extension because that's not a thing")
     detect_compression(gsub(paste0("\\.", format, "$"), path)
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13625: ARROW-16612: [R] parquet files with compression extensions should use parquet writer for compression

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13625:
URL: https://github.com/apache/arrow/pull/13625#issuecomment-1186037348

   https://issues.apache.org/jira/browse/ARROW-16612


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] nealrichardson merged pull request #13625: ARROW-16612: [R] Fix compression inference from filename

Posted by GitBox <gi...@apache.org>.
nealrichardson merged PR #13625:
URL: https://github.com/apache/arrow/pull/13625


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] nealrichardson commented on pull request #13625: ARROW-16612: [R] Support inferring compression from filename for all readers/writers

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on PR #13625:
URL: https://github.com/apache/arrow/pull/13625#issuecomment-1193401806

   > Something about the `filesystem` change in the readable/writable file creation seems to have caused failures on Windows although I don't know why this would be the case only on Windows.
   
   I think I just need to write the test differently. 
   
   I'll remove the compression inference from parquet and feather, fix the test, and ship with the rest of the changes. 🤞 this can get in before 9.0 is branched.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] nealrichardson commented on pull request #13625: ARROW-16612: [R] Support inferring compression from filename for all readers/writers

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on PR #13625:
URL: https://github.com/apache/arrow/pull/13625#issuecomment-1195904195

   @paleolimbot I addressed your feedback in https://github.com/apache/arrow/pull/13625/commits/9d05c9f704652091ffe584eb24b7eafb2c79bddb, PTAL


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #13625: ARROW-16612: [R] Fix compression inference from filename

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13625:
URL: https://github.com/apache/arrow/pull/13625#issuecomment-1196914876

   Benchmark runs are scheduled for baseline = 95aec82bd6080d8fcfedb4fa558d306e2a3dd7ec and contender = cc63a5da02984954b91a512da42d37916d5efd01. cc63a5da02984954b91a512da42d37916d5efd01 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/ae0904089ee3438caee8ff4e0af3f2f7...4b4ddabd63634972820d55c6748c0071/)
   [Finished :arrow_down:0.48% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/274a37afbcd24528aa0d9ffb1d02035d...88612ea0678c4732a0ceda9dd056572b/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/8bffd1120a8e4970a823aea8d04dd3a6...39597a2276b046548b5af470f2123544/)
   [Finished :arrow_down:0.25% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b6dac1103d234b2c83678edd94682d23...1db20772b1c74fa4a95c16f0f3fd266c/)
   Buildkite builds:
   [Failed] [`cc63a5da` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1204)
   [Finished] [`cc63a5da` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1216)
   [Failed] [`cc63a5da` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1198)
   [Finished] [`cc63a5da` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1218)
   [Failed] [`95aec82b` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1203)
   [Finished] [`95aec82b` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1215)
   [Finished] [`95aec82b` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1197)
   [Finished] [`95aec82b` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1217)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org