Posted to dev@bigtop.apache.org by Jay Vyas <ja...@gmail.com> on 2014/02/15 15:19:43 UTC

Bigtop: generating fake data

Hi Bigtop. Are we interested in maintaining our own infrastructure for generating fake data, rather than relying on and downloading external data sources for smoke tests? Fake data is great for testing, I think.

In bigpetstore I'm generating fake data and have written a lot of code to do this in custom input formats, but I just found:

http://codearte.github.io/jfairy/

which is a Groovy tool for doing the same.

I wonder whether generating fake data for testing big data should be a first-class part of Bigtop. Would others use such a utility, or just me?

It might be another useful artifact for the community, especially for bigpetstore, but also for testing a variety of other machine-learning-related projects.

I think it's bad to rely on external websites for our tests. Maybe in time we could move over to our own internally curated/generated data sets, and a data generation tool like the one above moves us in that direction.

Re: Bigtop: generating fake data

Posted by Bruno Mahé <bm...@apache.org>.
Hi Jay,

Generating fake data is an interesting idea and I don't see any reason
not to use it when appropriate.

Regarding having our own framework vs. re-using a library, it depends.
Writing our own framework is an option if there is no existing APLv2
(or APLv2-compatible?) library we can use or extend for our needs.
But writing code to facilitate such a task would be welcome in any case.
For example: map/reduce jobs that use jfairy to generate TBs of data.
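
A minimal sketch of how such a job might be laid out, assuming a map-only
design in which each task generates its own slice of records from a seed
derived from its task index. This is plain Python rather than a real
map/reduce job, and every name in it (generate_split, generate_all,
base_seed, the record fields) is hypothetical. A library such as jfairy
would take the place of the hand-rolled record builder.

```python
import random

# Hypothetical value pools; a real generator library would supply these.
FIRST_NAMES = ["alice", "bob", "carol", "dave"]
STATES = ["AZ", "CA", "CO", "NY"]


def generate_split(task_index, records_per_task, base_seed=42):
    """Generate the records owned by one task (one 'mapper').

    Seeding from (base_seed, task_index) makes every split reproducible
    and independent of the others, so tasks can run in parallel.
    """
    rng = random.Random(base_seed * 1_000_003 + task_index)
    first_id = task_index * records_per_task
    for offset in range(records_per_task):
        yield {
            "id": first_id + offset,
            "name": rng.choice(FIRST_NAMES),
            "state": rng.choice(STATES),
        }


def generate_all(num_tasks, records_per_task, base_seed=42):
    """Run every split in turn; a cluster would run them concurrently."""
    for task_index in range(num_tasks):
        yield from generate_split(task_index, records_per_task, base_seed)
```

Because a split depends only on its task index and the base seed, a failed
task can be re-run and will produce exactly the same records, and the total
volume scales with the number of tasks.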


Thanks,
Bruno

Re: Bigtop: generating fake data

Posted by Konstantin Boudnik <co...@apache.org>.
On Sat, Feb 15, 2014 at 10:24PM, Jay Vyas wrote:
> Glad to hear there is some interest.  Here is a JIRA to take it further.
> 
> https://issues.apache.org/jira/browse/BIGTOP-1212
> 
> @Cos, we need something flexible enough to generate different types of data
> sets, and possibly embed patterns in the data. Do you know of any place to
> start? Is GridMix, for example, or SLive, pluggable in that way?

I don't think either of these would work really. Let's investigate.

> If not we might have to hack our own together.
> 
> Maybe respond in BIGTOP-1212 above.

Re: Bigtop: generating fake data

Posted by Jay Vyas <ja...@gmail.com>.
Glad to hear there is some interest.  Here is a JIRA to take it further.

https://issues.apache.org/jira/browse/BIGTOP-1212

@Cos, we need something flexible enough to generate different types of data
sets, and possibly embed patterns in the data. Do you know of any place to
start? Is GridMix, for example, or SLive, pluggable in that way?

If not, we might have to hack our own together.

Maybe respond in BIGTOP-1212 above.
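
As a rough illustration of what "embedding patterns in the data" could mean,
here is a minimal Python sketch. It assumes a pet-store-like data set of
(state, product) transactions in which each state favours one product with
a tunable probability. The product names, the STATE_FAVOURITE table, and
both function names are hypothetical and not taken from bigpetstore.

```python
import random
from collections import Counter

PRODUCTS = ["dog-food", "cat-food", "fish-food", "bird-seed"]

# Hypothetical pattern to embed: each state favours one product.
STATE_FAVOURITE = {"AZ": "dog-food", "CA": "cat-food", "NY": "fish-food"}


def generate_transactions(count, bias=0.7, seed=7):
    """Yield (state, product) pairs with a known, tunable pattern.

    With probability `bias` a transaction buys the state's favourite
    product; otherwise the product is drawn uniformly at random.
    """
    rng = random.Random(seed)
    states = sorted(STATE_FAVOURITE)
    for _ in range(count):
        state = rng.choice(states)
        if rng.random() < bias:
            product = STATE_FAVOURITE[state]
        else:
            product = rng.choice(PRODUCTS)
        yield state, product


def most_common_product_by_state(transactions):
    """Recover the embedded pattern by counting purchases per state."""
    counts = {}
    for state, product in transactions:
        counts.setdefault(state, Counter())[product] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}
```

Since the pattern is known up front, a smoke test can generate the data, run
the job under test, and check that the job finds the same favourites. The
fixed seed keeps the data set identical from run to run.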


-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: Bigtop: generating fake data

Posted by Konstantin Boudnik <co...@apache.org>.
Neat idea! I think the answer depends on what kind of data we want to generate.
 - I had a good run with GridMix for a variety of longevity loads (too bad
   Cloudera never released the code as open source).
 - For HDFS testing we can use SLive and DFSIO; BIGTOP-1208 and BIGTOP-1209
   are pretty much ready, it seems.

At any rate, I'd prefer to incorporate something readily available that
has a good community behind it, so we won't end up supporting a big chunk of
specialized software.

So, what do you have in mind? Any details?
  Cos
