Posted to common-issues@hadoop.apache.org by "Leona Yoda (Jira)" <ji...@apache.org> on 2021/07/02 06:01:00 UTC
[jira] [Comment Edited] (HADOOP-17784) hadoop-aws landsat-pds test bucket will be deleted after Jul 1, 2021
[ https://issues.apache.org/jira/browse/HADOOP-17784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17373245#comment-17373245 ]
Leona Yoda edited comment on HADOOP-17784 at 7/2/21, 6:00 AM:
--------------------------------------------------------------
I checked the Registry of Open Data on AWS ([https://registry.opendata.aws/]); there are several datasets available in csv.gz format.
* NOAA Global Historical Climatology Network Daily
[https://registry.opendata.aws/noaa-ghcn/]
{code:bash}
$ aws s3 ls noaa-ghcn-pds/csv.gz/ --no-sign-request --human-readable
2021-07-02 04:08:17 3.3 KiB 1763.csv.gz
2021-07-02 04:08:27 3.2 KiB 1764.csv.gz
...
2021-07-02 04:09:04 143.1 MiB 2019.csv.gz
2021-07-02 04:09:04 138.8 MiB 2020.csv.gz
2021-07-02 04:09:04 66.6 MiB 2021.csv.gz
$ filename="2020.csv.gz"
$ aws s3 cp s3://noaa-ghcn-pds/csv.gz/$filename /tmp --no-sign-request && cat /tmp/$filename | gzip -d | head
AE000041196,20200101,TMIN,168,,,S,
AE000041196,20200101,PRCP,0,D,,S,
AE000041196,20200101,TAVG,211,H,,S,
...
$ wc -l /tmp/$filename
698966 /tmp/2020.csv.gz{code}
The datasets for recent years seem to be of sufficient size.
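As a quick sanity check, the GHCN-Daily rows above are plain comma-separated records (station id, date, element, value, then flag columns), so they decompress and parse with standard tooling. A minimal Python sketch, using the sample rows from the {{head}} output above written to a local gzip file (the /tmp path is just for illustration):

```python
import csv
import gzip

# Sample GHCN-Daily rows copied from the `head` output above:
# station id, date (YYYYMMDD), element, value, then flag/time columns.
sample = (
    "AE000041196,20200101,TMIN,168,,,S,\n"
    "AE000041196,20200101,PRCP,0,D,,S,\n"
    "AE000041196,20200101,TAVG,211,H,,S,\n"
)

path = "/tmp/ghcn-sample.csv.gz"  # illustrative local path
with gzip.open(path, "wt") as f:
    f.write(sample)

with gzip.open(path, "rt") as f:
    rows = list(csv.reader(f))

# Temperature elements (TMIN/TAVG) are reported in tenths of a degree C.
tmin = next(r for r in rows if r[2] == "TMIN")
print(tmin[0], int(tmin[3]) / 10.0)  # AE000041196 16.8
```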
* NOAA Integrated Surface Database
[https://registry.opendata.aws/noaa-isd/]
{code:bash}
$ aws s3 ls s3://noaa-isd-pds/ --no-sign-request --human-readable
...
2021-07-02 09:57:30 12.1 MiB isd-inventory.csv.z
2020-07-04 09:24:18 428 Bytes isd-inventory.txt
2021-07-02 09:57:14 13.1 MiB isd-inventory.txt.z
...
$ filename="isd-inventory.csv.z"
$ aws s3 cp s3://noaa-isd-pds/$filename /tmp --no-sign-request && cat /tmp/$filename | gzip -d | head
"USAF","WBAN","YEAR","JAN","FEB","MAR","APR","MAY","JUN","JUL","AUG","SEP","OCT","NOV","DEC"
"007018","99999","2011","0","0","2104","2797","2543","2614","382","0","0","0","0","0"
"007018","99999","2013","0","0","0","0","0","0","710","0","0","0","0","0"
...
$ wc -l /tmp/$filename
44296 /tmp/isd-inventory.csv.z{code}
Under the subpath s3://noaa-isd-pds/data/, there are a lot of gzipped files, but they are separated by spaces.
* iNaturalist Licensed Observation Images
[https://registry.opendata.aws/inaturalist-open-data/]
{code:bash}
$ aws s3 ls s3://inaturalist-open-data/ --no-sign-request --human-readable
PRE metadata/
PRE photos/
2021-05-20 15:59:08 1.8 GiB observations.csv.gz
2021-05-20 15:54:47 3.8 MiB observers.csv.gz
2021-05-20 16:02:14 3.1 GiB photos.csv.gz
2021-05-20 15:54:52 25.9 MiB taxa.csv.gz
$ filename="taxa.csv.gz"
$ aws s3 cp s3://inaturalist-open-data/$filename /tmp --no-sign-request && cat /tmp/$filename | gzip -d | head
taxon_id ancestry rank_level rank name active
3736 48460/1/2/355675/3/67566/3727/3735 10 species Phimosus infuscatus true
8742 48460/1/2/355675/3/7251/8659/8741 10 species Snowornis cryptolophus true
...
$ wc -l /tmp/$filename
108058 /tmp/taxa.csv.gz
$ filename="observations.csv.gz"
$ aws s3 cp s3://inaturalist-open-data/$filename /tmp --no-sign-request && cat /tmp/$filename | gzip -d | head
observation_uuid observer_id latitude longitude positional_accuracy taxon_id quality_grade observed_on
7d59cfce-7602-4877-a027-80008481466f 354 38.0127535059 -122.5013941526 76553 research 2011-09-03
b5d3c525-2bff-4ab4-ac4d-21c655d0a4d2 505 38.6113711142 -122.7838897705 52854 research 2011-09-04
...
$ wc -l /tmp/$filename
8692639 /tmp/observations.csv.gz{code}
The files at the top level seem good, but they're separated by tabs.
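Whether the tab delimiter is a blocker depends on whether the test asserts comma-separated content; standard CSV tooling reads TSV by switching the delimiter. A small Python sketch (the row below is modeled on the observations.csv.gz header above, trimmed to a few columns for brevity):

```python
import csv
import gzip
import io

# Row modeled on observations.csv.gz above (tab-separated, columns trimmed).
tsv = (
    "observation_uuid\tobserver_id\tlatitude\tlongitude\n"
    "7d59cfce-7602-4877-a027-80008481466f\t354\t38.0127535059\t-122.5013941526\n"
)

# Round-trip through an in-memory gzip stream instead of hitting S3.
buf = io.BytesIO()
with gzip.open(buf, "wt") as f:
    f.write(tsv)

buf.seek(0)
with gzip.open(buf, "rt") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

print(rows[0]["observer_id"])  # -> "354"
```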
cf. Landsat-8
{code:bash}
$ aws s3 ls s3://landsat-pds/ --no-sign-request --human-readable
PRE 4ac2fe6f-99c0-4940-81ea-2accba9370b9/
PRE L8/
PRE a96cb36b-1e0d-4245-854f-399ad968d6d3/
PRE c1/
PRE e6acf117-1cbf-4e88-af62-2098f464effe/
PRE runs/
PRE tarq/
PRE tarq_corrupt/
PRE test/
2017-05-17 22:42:27 23.2 KiB index.html
2016-08-20 02:12:04 105 Bytes robots.txt
2021-07-02 14:52:06 39 Bytes run_info.json
2021-07-02 14:02:06 3.2 KiB run_list.txt
2018-08-29 09:45:15 43.5 MiB scene_list.gz
$ filename="scene_list.gz"
$ aws s3 cp s3://landsat-pds/$filename /tmp --no-sign-request && cat /tmp/$filename | gzip -d | head
entityId,acquisitionDate,cloudCover,processingLevel,path,row,min_lat,min_lon,max_lat,max_lon,download_url
LC80101172015002LGN00,2015-01-02 15:49:05.571384,80.81,L1GT,10,117,-79.09923,-139.66082,-77.7544,-125.09297,https://s3-us-west-2.amazonaws.com/landsat-pds/L8/010/117/LC80101172015002LGN00/index.html
LC80260392015002LGN00,2015-01-02 16:56:51.399666,90.84,L1GT,26,39,29.23106,-97.48576,31.36421,-95.16029,https://s3-us-west-2.amazonaws.com/landsat-pds/L8/026/039/LC80260392015002LGN00/index.html
...
$ wc -l /tmp/$filename
183059 /tmp/scene_list.gz{code}
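If the tests do need to move off landsat-pds, the file could (as far as I understand the hadoop-aws test setup) be switched per-run via the fs.s3a.scale.test.csvfile property in the test configuration. The value below is just one of the candidates listed above, not a decided replacement:

```xml
<property>
  <name>fs.s3a.scale.test.csvfile</name>
  <!-- hypothetical replacement for s3a://landsat-pds/scene_list.gz -->
  <value>s3a://noaa-ghcn-pds/csv.gz/2020.csv.gz</value>
</property>
```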
> hadoop-aws landsat-pds test bucket will be deleted after Jul 1, 2021
> --------------------------------------------------------------------
>
> Key: HADOOP-17784
> URL: https://issues.apache.org/jira/browse/HADOOP-17784
> Project: Hadoop Common
> Issue Type: Test
> Components: fs/s3, test
> Reporter: Leona Yoda
> Priority: Major
>
> I found an announcement that the landsat-pds bucket will be deleted on July 1, 2021
> (https://registry.opendata.aws/landsat-8/)
> and I think this bucket is used in the tests of the hadoop-aws module:
> [https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/S3ATestConstants.java#L93]
>
> At this time I can still access the bucket, but we might have to change the test bucket someday.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org