Posted to mapreduce-user@hadoop.apache.org by Bing Li <lb...@gmail.com> on 2012/02/12 08:14:54 UTC

The NCDC Weather Data for Hadoop the Definitive Guide

Dear all,

I am following the book Hadoop: The Definitive Guide. However, I got stuck
because I could not get the NCDC weather data that is used by the source
code in the book. Appendix C told me I could follow some instructions
at www.hadoopbook.com, but I didn't get the instructions there. Could you
give me a hand?

Thanks so much!

Best regards,
Bing

Re: The NCDC Weather Data for Hadoop the Definitive Guide

Posted by Sujit Dhamale <su...@gmail.com>.
To avoid creating folders recursively, follow the steps below.


1. Create a folder on your local drive.
  I created "/home/sujit/Desktop/Data/".

2. Create the script below and run it:

for i in {1901..2012}
do
cd /home/sujit/Desktop/Data/
wget -r --no-parent --reject "index.html*" http://ftp3.ncdc.noaa.gov/pub/data/noaa/$i/
done





On Fri, Nov 16, 2012 at 1:01 PM, Sujit Dhamale <su...@gmail.com> wrote:

> Hi,
> If needed, you can run the script below to store the data on your local system:
>
> # Target folder for the collected .gz files
> mkdir -p /home/ubuntu/work/files
> for i in {1901..2012}
> do
> cd /home/ubuntu/work/
> # -nH omits the host name, so files are saved under pub/data/noaa/$i/
> wget -r -np -nH -R index.html http://ftp3.ncdc.noaa.gov/pub/data/noaa/$i/
> cd pub/data/noaa/$i/
> cp *.gz /home/ubuntu/work/files
> cd /home/ubuntu/work/
> rm -r pub/
> done

Re: The NCDC Weather Data for Hadoop the Definitive Guide

Posted by Sujit Dhamale <su...@gmail.com>.
Hi,
If needed, you can run the script below to store the data on your local system:

# Target folder for the collected .gz files
mkdir -p /home/ubuntu/work/files
for i in {1901..2012}
do
cd /home/ubuntu/work/
# -nH omits the host name, so files are saved under pub/data/noaa/$i/
wget -r -np -nH -R index.html http://ftp3.ncdc.noaa.gov/pub/data/noaa/$i/
cd pub/data/noaa/$i/
cp *.gz /home/ubuntu/work/files
cd /home/ubuntu/work/
rm -r pub/
done
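
The copy-and-delete steps in the loop above could probably be avoided by letting wget write each file straight into the target folder. The following is only a sketch under assumptions: it assumes GNU wget, the names fetch_year, DEST, BASE_URL and DRY_RUN are hypothetical, and the URL is simply the one used in the script above (it may no longer be reachable).

```shell
# Sketch: download one year's .gz files into a single flat directory.
# DEST, BASE_URL, DRY_RUN and fetch_year are illustrative names, not from the thread.
DEST="${DEST:-$HOME/ncdc/files}"
BASE_URL="http://ftp3.ncdc.noaa.gov/pub/data/noaa"

fetch_year() {
  year="$1"
  url="$BASE_URL/$year/"
  # -r   recurse into the year directory
  # -np  never ascend to the parent directory
  # -nd  do not recreate the remote directory tree locally
  # -A   keep only files matching the pattern
  # -P   directory to save files into
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it
    echo "wget -r -np -nd -A '*.gz' -P $DEST $url"
  else
    mkdir -p "$DEST" && wget -r -np -nd -A '*.gz' -P "$DEST" "$url"
  fi
}
```

Setting DRY_RUN=1 before calling fetch_year 1990 prints the wget command without touching the network; calling fetch_year inside the same {1901..2012} loop would fetch every year.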



On Mon, Feb 13, 2012 at 3:43 PM, Andy Doddington <an...@doddington.net> wrote:

> OK, well for starters, I think you can safely ignore the PDF data; to
> paraphrase Star Wars: “that isn’t the data
> in which you are interested”.
>
> Page 16 of the book describes the data format and refers to a data store
> that contains directories for each year from
> 1901 to 2001. It also shows the naming of .gz files within a sample
> directory (1990). The files in this directory have
> names "010010-99999-1990.gz", "010014-99999-1990.gz",
> "010015-99999-1990.gz", and so on…
>
> Referring back to the NCDC web site, at the link below (
> http://www.ncdc.noaa.gov) and clicking on the ‘Free Data’
> link on the left-hand side of the screen brings up a new screen, as shown
> below:
>
>
> Clicking again on the ‘Free Data’ link in the middle section of this page
> brings up another page, listing the available
> data sets:
>
>
> As this page notes, although some of this data needs to be paid for, there
> is at least one ‘free’ option within
> each section. For simplicity, I went for the first one - the one labelled
> “3505 FTP data access” - which the comment
> says is free. I used anonymous FTP and found that this site contained
> directories for each year from 1901 to 2012.
> I expect the additional directories reflect the fact that time has moved
> on since the book was written :-)
>
> There are also several text or pdf files that provide further information
> on the contents of the site. I suggest you
> read some of these to get more details. One of these is called
> "ish-format-document.pdf" and it seems to describe
> the document format in some detail. If you open this, you can check
> whether it matches the format expected by
> the Hadoop sample code. There is also a ‘software’ directory, which
> contains various bits of code that might
> prove useful.
>
> On drilling down into the directory for 1990, I get the following list of
> files:
>
>
> Which looks close enough to the file names in the Hadoop book - I’d
> guess that these are the correct files.
>
> Given the passage of time, it is still possible that the file format has
> changed to make it incompatible with the
> Hadoop code. However, it shouldn’t be that difficult to modify the code to
> suit the new format (which is very
> well documented, as already noted).
>
> Good luck!
>
> Andy

Re: The NCDC Weather Data for Hadoop the Definitive Guide

Posted by Andy Doddington <an...@doddington.net>.
OK, well for starters, I think you can safely ignore the PDF data; to paraphrase Star Wars: “that isn’t the data
in which you are interested”.

Page 16 of the book describes the data format and refers to a data store that contains directories for each year from
1901 to 2001. It also shows the naming of .gz files within a sample directory (1990). The files in this directory have
names "010010-99999-1990.gz", "010014-99999-1990.gz", "010015-99999-1990.gz", and so on…
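
The file names listed above appear to follow a three-part pattern ending in the year. A minimal sketch of splitting one apart is shown here; the thread does not say what the first two fields mean (NOAA's documentation describes them as two station identifiers), and the function name split_ncdc_name is hypothetical.

```shell
# Sketch: split an NCDC file name such as 010010-99999-1990.gz into its parts.
# split_ncdc_name is an illustrative helper, not part of the book's code.
split_ncdc_name() {
  name="${1##*/}"        # drop any leading directory
  name="${name%.gz}"     # drop the .gz suffix
  year="${name##*-}"     # text after the last hyphen
  rest="${name%-*}"      # text before the last hyphen
  first="${rest%%-*}"    # first identifier
  second="${rest#*-}"    # second identifier
  echo "$first $second $year"
}
```

Calling split_ncdc_name on a downloaded file's path prints the two identifiers and the year separated by spaces, which makes it easy to confirm that a file sits in the directory for the right year.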

Referring back to the NCDC web site, at the link below (http://www.ncdc.noaa.gov) and clicking on the ‘Free Data’
link on the left-hand side of the screen brings up a new screen, as shown below:



Clicking again on the ‘Free Data’ link in the middle section of this page brings up another page, listing the available
data sets:



As this page notes, although some of this data needs to be paid for, there is at least one ‘free’ option within
each section. For simplicity, I went for the first one - the one labelled “3505 FTP data access” - which the comment
says is free. I used anonymous FTP and found that this site contained directories for each year from 1901 to 2012.
I expect the additional directories reflect the fact that time has moved on since the book was written :-)

There are also several text or pdf files that provide further information on the contents of the site. I suggest you
read some of these to get more details. One of these is called "ish-format-document.pdf" and it seems to describe
the document format in some detail. If you open this, you can check whether it matches the format expected by
the Hadoop sample code. There is also a ‘software’ directory, which contains various bits of code that might
prove useful.
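
One quick way to run the check described above is to decompress a file and look at the fixed-width fields that the book's sample mapper reads. This is a sketch under assumptions: it assumes the column positions used in the book's MaxTemperatureMapper (year in columns 16-19, signed air temperature in columns 88-92, quality code in column 93), which should be verified against ish-format-document.pdf. The function name ncdc_fields is hypothetical.

```shell
# Sketch: print year, air temperature and quality code for each record on stdin.
# Column positions are assumed from the book's sample mapper; ncdc_fields is an
# illustrative name.
ncdc_fields() {
  # awk's substr() is 1-indexed: substr(string, start, length)
  awk '{ print substr($0, 16, 4), substr($0, 88, 5), substr($0, 93, 1) }'
}
```

For example, piping the output of gunzip -c for one downloaded file through head -n 5 and then ncdc_fields shows the three fields for the first five records; if the first column is not the expected year, the format has probably changed.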

On drilling down into the directory for 1990, I get the following list of files:



Which looks close enough to the file names in the Hadoop book - I’d guess that these are the correct files.

Given the passage of time, it is still possible that the file format has changed to make it incompatible with the
Hadoop code. However, it shouldn’t be that difficult to modify the code to suit the new format (which is very
well documented, as already noted).

Good luck!

	Andy

——————————————

On 12 Feb 2012, at 08:50, Bing Li wrote:

> Andy,
> 
> Since there is a lot of data in the free data section of the site, I cannot figure
> out which data set is the one discussed in the book. Any format differences might
> cause the source code to get exceptions. Some data is even in PDF format!
> 
> Thanks so much!
> Bing


Re: The NCDC Weather Data for Hadoop the Definitive Guide

Posted by Bing Li <lb...@gmail.com>.
Andy,

Since there is a lot of data in the free data section of the site, I cannot figure
out which data set is the one discussed in the book. Any format differences might
cause the source code to get exceptions. Some data is even in PDF format!

Thanks so much!
Bing

On Sun, Feb 12, 2012 at 4:35 PM, Andy Doddington <an...@doddington.net> wrote:

> According to Page 15 of the book, this data is available from the US
> National Climatic Data Center, at
> http://www.ncdc.noaa.gov. Once you get to this site, there is a menu of
> links on the left-hand side of the
> page, listed under the heading ‘Data & Products’. I suspect that the entry
> labelled ‘Free Data’ is the most
> likely area you need to investigate :-)
>
> Good Luck
>
> Andy D

Re: The NCDC Weather Data for Hadoop the Definitive Guide

Posted by Andy Doddington <an...@doddington.net>.
According to Page 15 of the book, this data is available from the US National Climatic Data Center, at
http://www.ncdc.noaa.gov. Once you get to this site, there is a menu of links on the left-hand side of the
page, listed under the heading ‘Data & Products’. I suspect that the entry labelled ‘Free Data’ is the most
likely area you need to investigate :-)

Good Luck

Andy D

————————————————————

On 12 Feb 2012, at 07:14, Bing Li wrote:

> Dear all,
> 
> I am following the book Hadoop: The Definitive Guide. However, I got stuck
> because I could not get the NCDC weather data that is used by the source
> code in the book. Appendix C told me I could follow some instructions
> at www.hadoopbook.com, but I didn't get the instructions there. Could you
> give me a hand?
> 
> Thanks so much!
> 
> Best regards,
> Bing