Posted to jira@arrow.apache.org by "Sam Albers (Jira)" <ji...@apache.org> on 2022/09/07 18:54:00 UTC
[jira] [Updated] (ARROW-17448) [R][Python] Fix cloud storage paths in some documentation
[ https://issues.apache.org/jira/browse/ARROW-17448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sam Albers updated ARROW-17448:
-------------------------------
Description:
Several cloud storage examples in the documentation use incorrect paths. For example, in this vignette: [https://arrow.apache.org/docs/r/articles/fs.html]
This doesn't work:
{code:java}
df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/data.parquet"))
{code}
Rather, it should be:
{code:java}
df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/part-0.parquet"))
{code}
which I think makes sense, as part-0 is the default naming convention used by write_dataset and is therefore something users are likely to see. Indeed, this is the way the file structure was written:
{code:java}
library(arrow)
bucket <- s3_bucket("voltrondata-labs-datasets")
bucket$ls(path = "nyc-taxi/year=2011", recursive = TRUE)
#> [1] "nyc-taxi/year=2011/month=1"
#> [2] "nyc-taxi/year=2011/month=1/part-0.parquet"
#> [3] "nyc-taxi/year=2011/month=10"
#> [4] "nyc-taxi/year=2011/month=10/part-0.parquet"
#> [5] "nyc-taxi/year=2011/month=11"
#> [6] "nyc-taxi/year=2011/month=11/part-0.parquet"
#> [7] "nyc-taxi/year=2011/month=12"
#> [8] "nyc-taxi/year=2011/month=12/part-0.parquet"
#> [9] "nyc-taxi/year=2011/month=2"
#> [10] "nyc-taxi/year=2011/month=2/part-0.parquet"
#> [11] "nyc-taxi/year=2011/month=3"
#> [12] "nyc-taxi/year=2011/month=3/part-0.parquet"
#> [13] "nyc-taxi/year=2011/month=4"
#> [14] "nyc-taxi/year=2011/month=4/part-0.parquet"
#> [15] "nyc-taxi/year=2011/month=5"
#> [16] "nyc-taxi/year=2011/month=5/part-0.parquet"
#> [17] "nyc-taxi/year=2011/month=6"
#> [18] "nyc-taxi/year=2011/month=6/part-0.parquet"
#> [19] "nyc-taxi/year=2011/month=7"
#> [20] "nyc-taxi/year=2011/month=7/part-0.parquet"
#> [21] "nyc-taxi/year=2011/month=8"
#> [22] "nyc-taxi/year=2011/month=8/part-0.parquet"
#> [23] "nyc-taxi/year=2011/month=9"
#> [24] "nyc-taxi/year=2011/month=9/part-0.parquet"
{code}
was:
Several cloud storage examples in the documentation use incorrect paths. For example, in this vignette: [https://arrow.apache.org/docs/r/articles/fs.html]
This doesn't work:
{code:java}
df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/data.parquet"))
{code}
Rather, it should be:
{code:java}
df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/part-0.parquet"))
{code}
which I think makes sense, as part-0 is the default naming convention used by write_dataset and is therefore something users are likely to see. Indeed, this is the way the file structure was written:
{code:java}
library(arrow)
bucket <- s3_bucket("voltrondata-labs-datasets")
bucket$ls(path = "nyc-taxi/year=2011", recursive = TRUE)
#> [1] "nyc-taxi/year=2011/month=1"
#> [2] "nyc-taxi/year=2011/month=1/part-0.parquet"
#> [3] "nyc-taxi/year=2011/month=10"
#> [4] "nyc-taxi/year=2011/month=10/part-0.parquet"
#> [5] "nyc-taxi/year=2011/month=11"
#> [6] "nyc-taxi/year=2011/month=11/part-0.parquet"
#> [7] "nyc-taxi/year=2011/month=12"
#> [8] "nyc-taxi/year=2011/month=12/part-0.parquet"
#> [9] "nyc-taxi/year=2011/month=2"
#> [10] "nyc-taxi/year=2011/month=2/part-0.parquet"
#> [11] "nyc-taxi/year=2011/month=3"
#> [12] "nyc-taxi/year=2011/month=3/part-0.parquet"
#> [13] "nyc-taxi/year=2011/month=4"
#> [14] "nyc-taxi/year=2011/month=4/part-0.parquet"
#> [15] "nyc-taxi/year=2011/month=5"
#> [16] "nyc-taxi/year=2011/month=5/part-0.parquet"
#> [17] "nyc-taxi/year=2011/month=6"
#> [18] "nyc-taxi/year=2011/month=6/part-0.parquet"
#> [19] "nyc-taxi/year=2011/month=7"
#> [20] "nyc-taxi/year=2011/month=7/part-0.parquet"
#> [21] "nyc-taxi/year=2011/month=8"
#> [22] "nyc-taxi/year=2011/month=8/part-0.parquet"
#> [23] "nyc-taxi/year=2011/month=9"
#> [24] "nyc-taxi/year=2011/month=9/part-0.parquet"
{code}
I also see some examples that need updating in the cookbooks here:
[https://arrow.apache.org/cookbook/r/reading-and-writing-data.html#read-a-parquet-file-from-s3]
and here:
[https://arrow.apache.org/cookbook/py/io.html#reading-partitioned-data-from-s3]
> [R][Python] Fix cloud storage paths in some documentation
> ---------------------------------------------------------
>
> Key: ARROW-17448
> URL: https://issues.apache.org/jira/browse/ARROW-17448
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python, R
> Affects Versions: 9.0.0
> Reporter: Sam Albers
> Priority: Minor