Posted to user@hive.apache.org by Adarsh Sharma <ad...@orkash.com> on 2011/01/04 08:31:07 UTC

Data for Testing in Hadoop

Dear all,

Designing the architecture is very important for Hadoop in
production clusters.

We are researching running Hadoop both on individual nodes and in a
cloud environment (VMs).

For this, I require some data for testing. Would anyone send me some
links to datasets of different sizes (10 GB, 20 GB, 30 GB, 50 GB)?
I shall be grateful for this kindness.


Thanks & Regards

Adarsh Sharma


Re: Data for Testing in Hadoop

Posted by Dave Viner <da...@gmail.com>.
How about http://aws.amazon.com/datasets?_encoding=UTF8&jiveRedirect=1 ?

Just the first one (WestburyLab USENET corpus) is 40GB.  I suspect you can
find different formats and data sizes there.
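
If whichever set you pick is hosted in S3, you can copy it straight into
HDFS with DistCp rather than downloading it locally first. A rough sketch
(the bucket and paths below are placeholders; the credentials can also go
in core-site.xml as fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey
instead of being embedded in the URI):

   hadoop distcp \
       s3n://ACCESS_KEY:SECRET_KEY@some-public-bucket/some-dataset \
       /user/adarsh/testdata

The destination is a plain path, so it resolves against your cluster's
default filesystem.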

Dave Viner


On Mon, Jan 3, 2011 at 11:31 PM, Adarsh Sharma <ad...@orkash.com> wrote:

> Dear all,
>
> Designing the architecture is very important for Hadoop in production
> clusters.
>
> We are researching running Hadoop both on individual nodes and in a cloud
> environment (VMs).
>
> For this, I require some data for testing. Would anyone send me some links
> to datasets of different sizes (10 GB, 20 GB, 30 GB, 50 GB)?
> I shall be grateful for this kindness.
>
>
> Thanks & Regards
>
> Adarsh Sharma
>
>

Re: Data for Testing in Hadoop

Posted by Ranjit Mathew <ra...@yahoo-inc.com>.
On Tuesday 04 January 2011 01:01 PM, Adarsh Sharma wrote:
> For this, I require some data for testing. Would anyone send me some
> links to datasets of different sizes (10 GB, 20 GB, 30 GB, 50 GB)?
> I shall be grateful for this kindness.

If you just want random data of a specific size, you can use "dd" on
Linux with the /dev/urandom pseudo-file. For example, to generate 10 MiB
of random data:

   dd if=/dev/urandom of=data.bin bs=1024 count=10240
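
For the larger sizes you asked about, the same approach scales up, though
/dev/urandom is fairly slow at tens of gigabytes; /dev/zero is much faster
if you don't mind the output being trivially compressible. For roughly
10 GiB (this assumes GNU dd, which accepts the "M" block-size suffix; the
output file names are just examples):

   dd if=/dev/urandom of=data-10g.bin  bs=1M count=10240   # ~10 GiB, random
   dd if=/dev/zero    of=zeros-10g.bin bs=1M count=10240   # ~10 GiB, zeros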

For more structured and "Hadoop-enabled" random data-generation, you
can use the data-generator from PigMix2:

   http://wiki.apache.org/pig/DataGeneratorHadoop
   https://issues.apache.org/jira/browse/PIG-200

HTH,
Ranjit

Re: Data for Testing in Hadoop

Posted by Dave Viner <da...@gmail.com>.
Also, Amazon offers free public data sets at:

http://aws.amazon.com/datasets?_encoding=UTF8&jiveRedirect=1




On Tue, Jan 4, 2011 at 7:28 PM, Lance Norskog <go...@gmail.com> wrote:

> https://cwiki.apache.org/confluence/display/MAHOUT/Collections
>
> All the collections you can imagine.
>
> On Tue, Jan 4, 2011 at 12:28 AM, Harsh J <qw...@gmail.com> wrote:
> > You can use MR to generate the data itself. Check out GridMix in
> > Hadoop, or PigMix from Pig, for examples of general load tests.
> >
> > On Tue, Jan 4, 2011 at 1:01 PM, Adarsh Sharma <ad...@orkash.com> wrote:
> >> Dear all,
> >>
> >> Designing the architecture is very important for Hadoop in production
> >> clusters.
> >>
> >> We are researching running Hadoop both on individual nodes and in a
> >> cloud environment (VMs).
> >>
> >> For this, I require some data for testing. Would anyone send me some
> >> links to datasets of different sizes (10 GB, 20 GB, 30 GB, 50 GB)?
> >> I shall be grateful for this kindness.
> >>
> >>
> >> Thanks & Regards
> >>
> >> Adarsh Sharma
> >>
> >>
> >
> >
> >
> > --
> > Harsh J
> > www.harshj.com
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: Data for Testing in Hadoop

Posted by Lance Norskog <go...@gmail.com>.
https://cwiki.apache.org/confluence/display/MAHOUT/Collections

All the collections you can imagine.
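
Once you have downloaded one, getting it into the cluster is just a copy
into HDFS. For example (the URL is a placeholder for whichever collection
you choose):

   wget http://example.org/some-collection.tar.gz
   hadoop fs -mkdir /user/adarsh/collections
   hadoop fs -put some-collection.tar.gz /user/adarsh/collections/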

On Tue, Jan 4, 2011 at 12:28 AM, Harsh J <qw...@gmail.com> wrote:
> You can use MR to generate the data itself. Check out GridMix in
> Hadoop, or PigMix from Pig, for examples of general load tests.
>
> On Tue, Jan 4, 2011 at 1:01 PM, Adarsh Sharma <ad...@orkash.com> wrote:
>> Dear all,
>>
>> Designing the architecture is very important for Hadoop in production
>> clusters.
>>
>> We are researching running Hadoop both on individual nodes and in a cloud
>> environment (VMs).
>>
>> For this, I require some data for testing. Would anyone send me some links
>> to datasets of different sizes (10 GB, 20 GB, 30 GB, 50 GB)?
>> I shall be grateful for this kindness.
>>
>>
>> Thanks & Regards
>>
>> Adarsh Sharma
>>
>>
>
>
>
> --
> Harsh J
> www.harshj.com
>



-- 
Lance Norskog
goksron@gmail.com

Re: Data for Testing in Hadoop

Posted by Harsh J <qw...@gmail.com>.
You can use MR to generate the data itself. Check out GridMix in
Hadoop, or PigMix from Pig, for examples of general load tests.
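
For instance, the example jobs that ship with Hadoop can write arbitrary
amounts of data straight into HDFS (the examples jar name varies between
releases, so treat the path below as a placeholder, and the output
directories are just examples, resolved against your HDFS home directory):

   # TeraGen writes N rows of 100 bytes each; 100,000,000 rows is ~10 GB.
   hadoop jar $HADOOP_HOME/hadoop-*examples*.jar teragen 100000000 teragen-10g

   # RandomWriter writes random binary data per map task; the amount written
   # per map and per node is configurable.
   hadoop jar $HADOOP_HOME/hadoop-*examples*.jar randomwriter randomwriter-out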

On Tue, Jan 4, 2011 at 1:01 PM, Adarsh Sharma <ad...@orkash.com> wrote:
> Dear all,
>
> Designing the architecture is very important for Hadoop in production
> clusters.
>
> We are researching running Hadoop both on individual nodes and in a cloud
> environment (VMs).
>
> For this, I require some data for testing. Would anyone send me some links
> to datasets of different sizes (10 GB, 20 GB, 30 GB, 50 GB)?
> I shall be grateful for this kindness.
>
>
> Thanks & Regards
>
> Adarsh Sharma
>
>



-- 
Harsh J
www.harshj.com