Posted to issues@spark.apache.org by "Leona Yoda (Jira)" <ji...@apache.org> on 2021/07/06 05:04:00 UTC
[jira] [Updated] (SPARK-36024) Switch the datasource example due to the deprecation of the dataset

     [ https://issues.apache.org/jira/browse/SPARK-36024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leona Yoda updated SPARK-36024:
-------------------------------
    Description: 
The S3 bucket that is used for an example in the "Integration with Cloud Infrastructures" document will be deleted on Jul 1, 2021. [https://registry.opendata.aws/landsat-8/]

The dataset will move to another bucket, but that bucket requires the `--request-payer requester` option, so users would have to pay the S3 cost. [https://registry.opendata.aws/usgs-landsat/]

So I think it's better to change the datasource like this.

[https://github.com/yoda-mon/spark/commit/cdb24acdbb57a429e5bf1729502653b91a600022]

I chose [NYC Taxi data|https://registry.opendata.aws/nyc-tlc-trip-records-pds/] here as an example.
Unlike the Landsat data, it's not compressed, but it is just an example, and there are several tutorials that use it with Spark (e.g. [https://github.com/aws-samples/amazon-eks-apache-spark-etl-sample]).

 

Read test result
{code:java}
scala> sc.textFile("s3a://nyc-tlc/misc/taxi _zone_lookup.csv").take(10).foreach(println)
"LocationID","Borough","Zone","service_zone"
1,"EWR","Newark Airport","EWR"
2,"Queens","Jamaica Bay","Boro Zone"
3,"Bronx","Allerton/Pelham Gardens","Boro Zone"
4,"Manhattan","Alphabet City","Yellow Zone"
5,"Staten Island","Arden Heights","Boro Zone"
6,"Staten Island","Arrochar/Fort Wadsworth","Boro Zone"
7,"Queens","Astoria","Boro Zone"
8,"Queens","Astoria Park","Boro Zone"
9,"Queens","Auburndale","Boro Zone"
{code}

  was:
The S3 bucket that is used for an example in the "Integration with Cloud Infrastructures" document will be deleted on Jul 1, 2021. [https://registry.opendata.aws/landsat-8/]

The dataset will move to another bucket, but that bucket requires the `--request-payer requester` option, so users would have to pay the S3 cost. [https://registry.opendata.aws/usgs-landsat/]

So I think it's better to change the datasource like this.

[https://github.com/yoda-mon/spark/commit/cdb24acdbb57a429e5bf1729502653b91a600022]

I chose NYC Taxi data ([https://registry.opendata.aws/nyc-tlc-trip-records-pds/]) here as an example.
Unlike the Landsat data, it's not compressed, but it is just an example, and there are several tutorials that use it with Spark (e.g. [https://github.com/aws-samples/amazon-eks-apache-spark-etl-sample]).

 

Read test result
{code:java}
scala> sc.textFile("s3a://nyc-tlc/misc/taxi _zone_lookup.csv").take(10).foreach(println)
"LocationID","Borough","Zone","service_zone"
1,"EWR","Newark Airport","EWR"
2,"Queens","Jamaica Bay","Boro Zone"
3,"Bronx","Allerton/Pelham Gardens","Boro Zone"
4,"Manhattan","Alphabet City","Yellow Zone"
5,"Staten Island","Arden Heights","Boro Zone"
6,"Staten Island","Arrochar/Fort Wadsworth","Boro Zone"
7,"Queens","Astoria","Boro Zone"
8,"Queens","Astoria Park","Boro Zone"
9,"Queens","Auburndale","Boro Zone"
{code}


> Switch the datasource example due to the deprecation of the dataset
> -------------------------------------------------------------------
>
>                 Key: SPARK-36024
>                 URL: https://issues.apache.org/jira/browse/SPARK-36024
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 3.1.2
>            Reporter: Leona Yoda
>            Priority: Trivial
>
> The S3 bucket that is used for an example in the "Integration with Cloud Infrastructures" document will be deleted on Jul 1, 2021. [https://registry.opendata.aws/landsat-8/]
> The dataset will move to another bucket, but that bucket requires the `--request-payer requester` option, so users would have to pay the S3 cost. [https://registry.opendata.aws/usgs-landsat/]
> So I think it's better to change the datasource like this.
> [https://github.com/yoda-mon/spark/commit/cdb24acdbb57a429e5bf1729502653b91a600022]
> I chose [NYC Taxi data|https://registry.opendata.aws/nyc-tlc-trip-records-pds/] here as an example.
> Unlike the Landsat data, it's not compressed, but it is just an example, and there are several tutorials that use it with Spark (e.g. [https://github.com/aws-samples/amazon-eks-apache-spark-etl-sample]).
>  
> Read test result
> {code:java}
> scala> sc.textFile("s3a://nyc-tlc/misc/taxi _zone_lookup.csv").take(10).foreach(println)
> "LocationID","Borough","Zone","service_zone"
> 1,"EWR","Newark Airport","EWR"
> 2,"Queens","Jamaica Bay","Boro Zone"
> 3,"Bronx","Allerton/Pelham Gardens","Boro Zone"
> 4,"Manhattan","Alphabet City","Yellow Zone"
> 5,"Staten Island","Arden Heights","Boro Zone"
> 6,"Staten Island","Arrochar/Fort Wadsworth","Boro Zone"
> 7,"Queens","Astoria","Boro Zone"
> 8,"Queens","Astoria Park","Boro Zone"
> 9,"Queens","Auburndale","Boro Zone"
> {code}


