Posted to common-user@hadoop.apache.org by David Erickson <ha...@gmail.com> on 2012/04/14 22:53:10 UTC

Is TeraGen's generated data deterministic?

Hi, we are benchmarking some of our infrastructure using
TeraGen/TeraSort.  I am wondering whether the data generated by
TeraGen is deterministic: if I repeat the same experiment multiple
times with the same configuration options, will it generate and sort
exactly the same data each time?  If not, is there an easy mod to
make this happen?

Thanks!
David

Re: Is TeraGen's generated data deterministic?

Posted by Owen O'Malley <om...@apache.org>.
Yes, both versions of TeraGen are completely deterministic. They each use a random number generator with a fixed seed.

-- Owen
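
A minimal sketch of the fixed-seed principle described above. This is not TeraGen's actual generator: the 10-byte key layout and the names SEED and generate_rows are assumptions made purely for illustration. The point is only that a generator seeded with a constant yields the same sequence on every run:

```python
import random

SEED = 42  # hypothetical fixed seed; TeraGen's real constants differ


def generate_rows(num_rows, seed=SEED):
    """Return num_rows pseudo-random 10-byte keys from a fixed-seed generator."""
    rng = random.Random(seed)  # private generator, unaffected by global state
    return [bytes(rng.randrange(256) for _ in range(10)) for _ in range(num_rows)]


first_run = generate_rows(1000)
second_run = generate_rows(1000)
print(first_run == second_run)  # prints True: same seed, same sequence
```

Because the seed is a constant rather than something like the current time, nothing about the cluster or the time of day can change the output.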


Re: Is TeraGen's generated data deterministic?

Posted by David Erickson <ha...@gmail.com>.
Thanks, Raj.  Unfortunately I have to tear down Hadoop completely
between runs, including the backing data store, so if possible I need
to figure out a way to generate the same data repeatedly by providing
a single seed, or similar.
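
One way to check whether a rebuilt cluster regenerated the same input is to compare a digest of the output from each run. The sketch below assumes the generated files have been copied to a local directory (for example with hadoop fs -get) and that they follow a part-* naming pattern; the function name dataset_digest is hypothetical:

```python
import hashlib
from pathlib import Path


def dataset_digest(directory):
    """Hash every part file under directory, in name order, into one digest."""
    digest = hashlib.sha256()
    for part in sorted(Path(directory).glob("part-*")):
        # Include the file name so that data moving between parts is detected.
        digest.update(part.name.encode())
        with part.open("rb") as handle:
            # Read in 1 MiB chunks so large files are never held in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Recording the digest from one run and comparing it with the digest after the next teardown and rebuild would show whether the two inputs are byte-for-byte identical.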

On Sat, Apr 14, 2012 at 2:15 PM, Raj Vishwanathan <ra...@yahoo.com> wrote:
> David
>
> Since data generation and sorting are separate Hadoop jobs, you can generate the data once and sort the same data as many times as you want.
>
> I don't think TeraGen is deterministic (or rather, the keys are random but the text is deterministic, if I remember correctly).
>
>
>
> Raj

Re: Is TeraGen's generated data deterministic?

Posted by Raj Vishwanathan <ra...@yahoo.com>.
David

Since data generation and sorting are separate Hadoop jobs, you can generate the data once and sort the same data as many times as you want.

I don't think TeraGen is deterministic (or rather, the keys are random but the text is deterministic, if I remember correctly).



Raj


