Posted to hdfs-user@hadoop.apache.org by Yin Steve <st...@gmail.com> on 2012/12/07 16:01:33 UTC

I need some raw big data

Hello, I'm Steve. I need some raw big data for studying MapReduce
programming. Where can I find some, especially weblog or traffic data?
My English is not very good, so a URL that lets me download a big file
directly would be great.
Waiting for your reply...

Re: I need some raw big data

Posted by Sujit Dhamale <su...@gmail.com>.
Hi,
You can use National Climatic Data Center (NCDC) data, which is a good
candidate for Hadoop. Below are the steps to download the data.

1. Create a folder on your local drive.
   I created "/home/sujit/Desktop/Data/".

2. Create the script below and run it:

# one pass per year of raw NOAA weather files
for i in {1901..2012}
do
cd /home/sujit/Desktop/Data/
# mirror the year's directory, skipping the generated index pages
wget -r --no-parent --reject "index.html*" http://ftp3.ncdc.noaa.gov/pub/data/noaa/$i/
done
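
Once the files are downloaded, they still have to be copied into HDFS before a
MapReduce job can read them. A minimal sketch of that step (not from the
original post; the /user/sujit/ncdc target path is hypothetical), using the
standard FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadNcdc {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster settings (fs.default.name) from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Recursively copy the locally downloaded NCDC tree into HDFS
        fs.copyFromLocalFile(new Path("/home/sujit/Desktop/Data"),
                             new Path("/user/sujit/ncdc"));
        fs.close();
    }
}

The same thing can be done from the shell with
"hadoop fs -put /home/sujit/Desktop/Data /user/sujit/ncdc".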

Kind Regards
Sujit Dhamale
(+91 9970086652)

Re: I need some raw big data

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Yin,

       You may find this interesting:
https://github.com/unitedstates

Regards,
    Mohammad Tariq

Re: I need some raw big data

Posted by Chris Nauroth <cn...@hortonworks.com>.
Another suggestion is Google Books Ngrams:

http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

Re: "attempt*" directories in user logs

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
However, in the case Oleg is talking about, the attempts are:
attempt_201212051224_0021_m_000000_0
attempt_201212051224_0021_m_000002_0
attempt_201212051224_0021_m_000003_0

These aren't multiple attempts of a single task, are they? They are
actually different tasks. If they were multiple attempts, I would expect
the last digit to be incremented, like attempt_201212051224_0021_m_000000_0
and attempt_201212051224_0021_m_000000_1, for instance.

It looks like at least 3 different tasks were launched on this node. One of
them could be the setup task. Oleg, how many map tasks does the JobTracker UI
show for this job?

Thanks
hemanth
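
The ID structure Hemanth describes (job ID, task type, task number, attempt
number) can also be pulled apart programmatically. A small sketch, not from
the thread, using the old-API TaskAttemptID class:

import org.apache.hadoop.mapred.TaskAttemptID;
import org.apache.hadoop.mapred.TaskID;

public class AttemptIdDemo {
    public static void main(String[] args) {
        TaskAttemptID attempt =
                TaskAttemptID.forName("attempt_201212051224_0021_m_000003_0");
        TaskID task = attempt.getTaskID();
        System.out.println("job     = " + attempt.getJobID()); // job_201212051224_0021
        System.out.println("map?    = " + task.isMap());       // true (the "m")
        System.out.println("task#   = " + task.getId());       // 3 (a different task, not a retry)
        System.out.println("attempt = " + attempt.getId());    // 0 (increments on retries)
    }
}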

Re: "attempt*" directories in user logs

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
MR launches multiple attempts for a single Task in case of TaskAttempt failures or when speculative execution is turned on. In either case, a given Task will only ever have one successful TaskAttempt whose output will be accepted (committed).

Number of reduces is set to 1 by default in mapred-default.xml - you should explicitly set it to zero if you don't want reducers.
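
A minimal sketch of doing that with the new-API Job class (a hypothetical
job setup, not from Vinod's message):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlySetup {
    public static Job configure() throws Exception {
        Job job = new Job(new Configuration(), "map-only example");
        // Override the default of 1 from mapred-default.xml: with zero
        // reduce tasks, map output is written directly by the map tasks.
        job.setNumReduceTasks(0);
        return job;
    }
}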

By master, I suppose you mean the JobTracker. The JobTracker doesn't show all the attempts for a given Task; you should navigate to the per-task page to see that.


Thanks,
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/

Re: "attempt*" directories in user logs

Posted by Tsuyoshi OZAWA <oz...@gmail.com>.
Hi Oleg,

Speculative tasks can be launched as extra TaskAttempts in MR jobs.
And if no reducer class is set, MR launches the default Reducer
class (IdentityReducer).

Thanks,
Tsuyoshi

-- 
OZAWA Tsuyoshi
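
If those extra speculative attempts are not wanted, speculative execution can
be switched off in the job configuration. A small sketch (not from Tsuyoshi's
message) using the old-API JobConf:

import org.apache.hadoop.mapred.JobConf;

public class NoSpeculation {
    public static JobConf configure() {
        JobConf conf = new JobConf();
        // Both settings default to true (mapred.map.tasks.speculative.execution
        // and mapred.reduce.tasks.speculative.execution in MR1).
        conf.setMapSpeculativeExecution(false);
        conf.setReduceSpeculativeExecution(false);
        return conf;
    }
}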

"attempt*" directories in user logs

Posted by Oleg Zhurakousky <ol...@gmail.com>.
I am studying user logs on the two-node cluster that I have set up, and I was wondering if anyone can shed some light on these "attempt*" directories:

$ ls
attempt_201212051224_0021_m_000000_0  attempt_201212051224_0021_m_000003_0  job-acls.xml
attempt_201212051224_0021_m_000002_0  attempt_201212051224_0021_r_000000_0

It's obvious that this is talking about 3 attempts for the map task and 1 attempt for the reduce task. However, my current MR job only writes output to "attempt_201212051224_0021_m_000000_0", and nothing in the reduce part (understandably, since I don't even have a reducer), so my questions are:

1. The two extra map attempts... what are they?
2. Why was there an attempt to do a reduce when no reducer was provided/implemented?
3. Why did my master node only have 1 attempt for a map task, while the slave had everything displayed and questioned above (the 'ls' output above is from the slave node)?

Re: Input path with no Output path

Posted by Oleg Zhurakousky <ol...@gmail.com>.
Perfect! Thanks

Re: Input path with no Output path

Posted by Peyman Mohajerian <mo...@gmail.com>.
I think this does it:
http://hadoop.apache.org/docs/r0.20.1/api/org/apache/hadoop/mapreduce/lib/output/NullOutputFormat.html
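
In job-setup terms that looks roughly like the sketch below (the FilterMapper
class is a hypothetical stand-in for Oleg's side-effect-only mapper):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class NoOutputJob {
    // Emits nothing; only acts on the records it is interested in.
    public static class FilterMapper
            extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            if (value.toString().contains("INTERESTING")) {
                // send the notification here; no output is ever written
            }
        }
    }

    public static Job configure() throws IOException {
        Job job = new Job(new Configuration(), "filter-and-notify");
        job.setJarByClass(NoOutputJob.class);
        job.setMapperClass(FilterMapper.class);
        job.setNumReduceTasks(0);                          // map-only job
        job.setOutputFormatClass(NullOutputFormat.class);  // no output path needed
        return job;
    }
}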

Input path with no Output path

Posted by Oleg Zhurakousky <ol...@gmail.com>.
Guys

I have a simple mapper that reads records and sends out a message when it encounters the ones it is interested in (no reducer). So no output is ever written, but it seems like a job cannot be submitted unless an output path is specified. It's not a big deal to specify a dummy one, but I was wondering if it could be avoided.

Thanks
Oleg

Re: I need some raw big data

Posted by Phillip Rhodes <mo...@gmail.com>.
Try some of the links off of this Quora thread:

http://www.quora.com/Data/Where-can-I-find-large-datasets-for-modeling-confidence-during-the-financial-crisis-which-is-open-to-the-public

You might also try googling "Enron corpus". Or check out CommonCrawl.org.


Phil

Re: I need some raw big data

Posted by Bruce Durling <bl...@otfrom.com>.
You can also have a play with some open data from the UK:

COINS

http://data.gov.uk/dataset/coins

or have a look around the NHS Information Centre

http://www.ic.nhs.uk/

cheers,
Bruce

-- 
@otfrom | CTO & co-founder @MastodonC | mastodonc.com

Re: I need some raw big data

Posted by Harsh J <ha...@cloudera.com>.
You can find some real-world data samples at InfoChimps' data
marketplace: http://www.infochimps.com/marketplace

-- 
Harsh J
