Posted to dev@bigtop.apache.org by Konstantin Boudnik <co...@apache.org> on 2015/09/01 01:34:36 UTC

Re: Proposal for "BigTop Data Generators"

I don't think it should be an either/or choice between Docker and a Linux
package. Docker assumes a certain isolation (albeit poorly defined and
implemented). So, if the sample is generated inside a Docker container and is
needed elsewhere, it's a hassle. But if the package is available, it could
easily be installed inside a container and used, as Nate has pointed out.

Cos

On Mon, Aug 31, 2015 at 06:24AM, Jay Vyas wrote:
> Nate: Good idea to abstract the interface one level higher....
> 
> How about a docker run command? That is probably the easiest way for Linux
> folks to run one-off Java apps nowadays.
> 
> docker run bigtop/bigtop-data-gen --scheme weather --size 5GB --output
> data-dir --etc  foo --etc bar
> 
> I'm happy to curate such a docker image, I already am doing something like
> this in kube for bigtop-transaction-queue, which continuously pumps data
> generator outputs into a REST endpoint or file Queue... So it could be
> extended to support other generators.
> 
> 
> > <na...@reactor8.com> wrote:
> > 
> > Could picture at some point supporting something like this for non-jvm folk just looking for test/demo data:
> > 
> > apt-get install bigtop-data-gen
> > ~/ $ bigtop-data-gen --scheme weather --size 5GB --output data-dir --etc  foo --etc bar
> > 
> > 
> > 
> > -----Original Message-----
> > From: jay vyas [mailto:jayunit100.apache@gmail.com] 
> > Sent: Sunday, August 30, 2015 5:11 PM
> > To: dev@bigtop.apache.org
> > Subject: Re: Proposal for "BigTop Data Generators"
> > 
> > Hi Nate.  Well, here are the use cases I know of for which I have used the data generators.
> > 
> > Dockerfile:
> > 
> > (1) for testing kubernetes.  For this, I just use transaction-queue docker file.
> > (2) for testing GlusterFS small file workloads, maybe with other analytics tools...
> > 
> > Maven repo
> > 
> > (3) Java MapReduce/Ignite/Spark applications, which can just add a mvn repo when compiling.  Java developers never add jars through RPM repos.
> > 
> > RPM/DEB packages:
> > 
> > I could see people using an RPM/DEB data generator, and I'm not against it.  But I simply don't know of any real-world projects which *currently* need RPM/DEB packages, which is why I haven't bothered to propose it as a requirement.  Nevertheless, Linux packages are always a welcome addition if someone wants to create them!
> > 
> > 
> > 
> > 
> >> On Sun, Aug 30, 2015 at 4:34 PM, <na...@reactor8.com> wrote:
> >> 
> >> Would the container be in addition to deb/rpm, or instead of it?  If the 
> >> latter, can we do deb/rpm as the base and then have the container created 
> >> either from them or directly from the artifacts?
> >> 
> >> On the test usage side, it seems we could break up tests into 
> >> base/required and optional/add-on tests or test suites.  I think I 
> >> remember seeing mention of certain tests that fail at times on 
> >> certain components in the core builds but don't mean that 
> >> the build is broken, so it would make sense to have some cleanup around those anyway.
> >> 
> >> -----Original Message-----
> >> From: RJ Nowling [mailto:rnowling@gmail.com]
> >> Sent: Sunday, August 30, 2015 1:11 PM
> >> To: dev@bigtop.apache.org
> >> Subject: Re: Proposal for "BigTop Data Generators"
> >> 
> >> I agree with the above. :)
> >> 
> >> On Sun, Aug 30, 2015 at 11:19 AM, Jay Vyas <ja...@gmail.com> wrote:
> >> 
> >>> Hi RJ.
> >>> 
> >>> Maven repositories and docker containers for the transaction queue 
> >>> are good enough IMO.  That will give people a way to compose them in 
> >>> different idioms (one for Java folks, another for broader Linux 
> >>> audience).
> >>> 
> >>> I think the lib designs are fairly intuitive.  I would say that we 
> >>> should constrain them all to being written in Java or Groovy to keep 
> >>> the bigtop theme of "JVM for everything" :).
> >>> 
> >>> Any particular questions you have around technical design can be 
> >>> followed in a JIRA or else maybe a Readme spec that goes in a  top 
> >>> level of the data-generators dir...
> >>> 
> >>>> On Aug 30, 2015, at 1:51 AM, RJ Nowling <rn...@gmail.com> wrote:
> >>>> 
> >>>> I'd like to keep this conversation going.
> >>>> 
> >>>> So here are a few discussion points:
> >>>> 
> >>>> 1. How do we want to make the data generators available?  Maven?
> >>>> RPMs and Debs?
> >>>> 
> >>>> For now, I'm using a gradle multi-project build to easily build and 
> >>>> install the BPS data generators and its libraries into a local maven 
> >>>> repo.  This makes development easy.  Eventually, I would like to post 
> >>>> binaries through Maven for easy integration by users.  RPMs / Debs 
> >>>> could be interesting since I use a pattern where the data generators 
> >>>> are libraries (to support application integration / parallelization by 
> >>>> the host framework) but also provide CLI drivers for local testing.
> >>>> 
> >>>> 2.  The idea of using the data generators as part of the smoke 
> >>>> tests came up.  Since there is concern about making the data 
> >>>> generators required, we could offer the blueprints (BigPetStore) 
> >>>> as optional smoke tests.  Would that be a good compromise?
> >>>> 
> >>>> 3.  How will they be maintained?
> >>>> 
> >>>> I'll certainly add myself to the maintainers list and will be 
> >>>> taking responsibility.  I'm happy to have others help as well if 
> >>>> anyone wants to
> >>>> -- if not, that's cool, too.
> >>>> 
> >>>> 4. Is anyone interested at all in discussing library APIs and designs?
> >>>> What about internal interfaces and such?
> >>>> 
> >>>> 
> >>>> My plan was to add at least one more data generator (weather
> >>>> simulator) to bigtop-data-generators in the short term.  However, given 
> >>>> the concerns raised by Cos (more discussion needed) and Olaf (don't 
> >>>> want to force data generators on unsuspecting users ;) ), I would 
> >>>> like to reach some consensus on what people are concerned about and 
> >>>> solutions.
> >>>> 
> >>>> On Thu, Aug 27, 2015 at 12:38 PM, Konstantin Boudnik <co...@apache.org> wrote:
> >>>> 
> >>>>> Fine by me. I have linked this thread to the JIRA ticket that RJ
> >>>>> created, so we have a way to connect one to another ;)
> >>>>> 
> >>>>>> On Thu, Aug 27, 2015 at 01:02PM, Olaf Flebbe wrote:
> >>>>>> Hi,
> >>>>>> 
> >>>>>> I am not confident that moving important design discussions with 
> >>>>>> impact to the whole project to jira is a good idea.
> >>>>>> 
> >>>>>> In the current JIRA Traffic storm it is not easy to identify and 
> >>>>>> follow important tickets.
> >>>>>> 
> >>>>>> Please keep discussions on the list or at least, please state on 
> >>>>>> this list which Ticket to follow ...
> >>>>>> 
> >>>>>> Olaf
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>>> On 26.08.2015 at 22:56, Konstantin Boudnik <co...@apache.org> wrote:
> >>>>>>> 
> >>>>>>> On Wed, Aug 26, 2015 at 10:38PM, Olaf Flebbe wrote:
> >>>>>>>> Hi,
> >>>>>>>> 
> >>>>>>>> Nice to have data generators in Bigtop.
> >>>>>>>> 
> >>>>>>>> But please do not include it in bigtop_utils, since this 
> >>>>>>>> package is mandatory. Not everyone needs a data generator.
> >>>>>>> 
> >>>>>>> Yup. And let's move further design discussion to the JIRA!
> >>>>>>> 
> >>>>>>>> Olaf
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>>> On 26.08.2015 at 11:25, Jay Vyas <jayunit100.apache@gmail.com> wrote:
> >>>>>>>>> 
> >>>>>>>>> Publishing the jar to Bigtop's maven is probably a good first 
> >>>>>>>>> step. Then apps can just include it as needed...?
> >>>>>>>>> 
> >>>>>>>>> I'm not against packaging if someone wants packages for this.
> >>>>>>>>> Maybe even include it in bigtop util?
> >>>>>>>>> 
> >>>>>>>>> Let's move to jira,
> >>>>>>>>> 
> >>>>>>>>>> On Aug 25, 2015, at 9:41 PM, Konstantin Boudnik <co...@apache.org> wrote:
> >>>>>>>>>> 
> >>>>>>>>>> It is pretty cool indeed!
> >>>>>>>>>> 
> >>>>>>>>>> I wonder how it needs to be structured to be:
> >>>>>>>>>> - easy to access/use from other components wherever it is 
> >>>>>>>>>> needed
> >>>>>>>>>> - doesn't interfere with the rest of the stack
> >>>>>>>>>> 
> >>>>>>>>>> I guess one possible way would be to implement the generator 
> >>>>>>>>>> as a set of maven artifacts, that could be installed/consumed 
> >>>>>>>>>> transparently by just declaring a dependency e.g as proposed via 
> >>>>>>>>>> top-level component.
> >>>>>>>>>> 
> >>>>>>>>>> Another way is to have a new package like we do for 
> >>>>>>>>>> bigtop-utils and such.
> >>>>>>>>>> 
> >>>>>>>>>> Perhaps this discussion should be moved to JIRA or shall we
> >>>>>>>>>> continue on the dev@ ??
> >>>>>>>>>> 
> >>>>>>>>>> Cos
> >>>>>>>>>> 
> >>>>>>>>>>> On Sun, Aug 23, 2015 at 11:53AM, RJ Nowling wrote:
> >>>>>>>>>>> Hi BigTop,
> >>>>>>>>>>> 
> >>>>>>>>>>> I had a discussion with Jay yesterday, we'd like to propose 
> >>>>>>>>>>> a new component for BigTop: BigTop Data Generators.
> >>>>>>>>>>> 
> >>>>>>>>>>> BigTop Data Generators would consist of a common set of 
> >>>>>>>>>>> libraries for building data generators and three example data 
> >>>>>>>>>>> generators:
> >>>>>>>>>>> 
> >>>>>>>>>>> * BigPetStore transaction generator (moved from BigPetStore)
> >>>>>>>>>>> * BigTop Bazaar -- attendee movement and interactions with 
> >>>>>>>>>>> booths on a showroom floor, at a conference, or at a mall
> >>>>>>>>>>> * BigTop Weatherman -- stochastic weather simulation
> >>>>>>>>>>> (temperature, wind speed, wind chill, rainfall, etc.) per zip 
> >>>>>>>>>>> code.  (From a model trained on NOAA historical weather data)
> >>>>>>>>>>> 
> >>>>>>>>>>> We believe that creating a common set of libraries will 
> >>>>>>>>>>> have several benefits including:
> >>>>>>>>>>> 
> >>>>>>>>>>> * Easier for others to build their own data generators
> >>>>>>>>>>> * Make data generators smaller and easier to maintain
> >>>>>>>>>>> * Share improvements across the data generators
> >>>>>>>>>>> 
> >>>>>>>>>>> More details on the libraries are below.
> >>>>>>>>>>> 
> >>>>>>>>>>> BigPetStore will continue to focus on building and 
> >>>>>>>>>>> maintaining blueprints, powered by the BigTop Data Generators.
> >>>>>>>>>>> 
> >>>>>>>>>>> Our vision is that we get all of Apache coming to BigTop 
> >>>>>>>>>>> for tools for building better, more comprehensive blueprints.  
> >>>>>>>>>>> We want to support these efforts through data generators and the 
> >>>>>>>>>>> initial set of blueprints we've been building.
> >>>>>>>>>>> 
> >>>>>>>>>>> If the community is generally in support of this, I can 
> >>>>>>>>>>> create a top-level "bigtop-data-generators" directory and put the 
> >>>>>>>>>>> data generators and libraries in there.
> >>>>>>>>>>> 
> >>>>>>>>>>> Thanks!
> >>>>>>>>>>> 
> >>>>>>>>>>> RJ
> >>>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>>> -------
> >>>>>>>>>>> Library details:
> >>>>>>>>>>> 
> >>>>>>>>>>> So far, I've extracted the following common libraries:
> >>>>>>>>>>> 
> >>>>>>>>>>> * Samplers -- provides classes for PDFs and various samplers
> >>>>>>>>>>> * Name generator -- data set and samplers for generating names
> >>>>>>>>>>> * Location data set -- data set and classes for US zip 
> >>>>>>>>>>> codes, their GPS coordinates, median household incomes, and 
> >>>>>>>>>>> population sizes
> >>>>>>>>>>> * Product generator -- library for enumerating products 
> >>>>>>>>>>> from a specification file.  Comes with default 
> >>>>>>>>>>> specifications for BigPetStore
> >>>>>>>>>>> 
> >>>>>>>>>>> I also expect that I'll add libraries for:
> >>>>>>>>>>> 
> >>>>>>>>>>>  * Particle simulation -- customer movement in a room
> >>>>>>>>>>>  * Latent factor model generation -- generate latent 
> >>>>>>>>>>> factors and customer weights to create something like MovieLens 
> >>>>>>>>>>> data.  Used in Bazaar for booth preferences and potentially in 
> >>>>>>>>>>> BigPetStore for customer item preferences
> >>>>>>>>>>> 
> >>>>>>>>>>> Most of these libraries came out of the BigPetStore data 
> >>>>>>>>>>> generator but the other generators have been refactored to be 
> >>>>>>>>>>> based off the standard set of libraries.
> > 
> > 
> > --
> > jay vyas
> >
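
For illustration only: the original proposal describes a "Samplers" library that provides classes for PDFs and various samplers, and the thread suggests keeping such libraries in Java or Groovy. A minimal Java sketch of one kind of sampler, a weighted discrete sampler, might look like the following. The class name WeightedSampler, its constructor, and its sample() method are hypothetical and are not taken from the BigTop code base; the actual library may be structured quite differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

/**
 * Hypothetical sketch of a weighted discrete sampler.
 * Not the actual BigTop data generator API.
 */
final class WeightedSampler<T> {
    private final List<T> items = new ArrayList<>();
    private final double[] cumulative; // running totals of the weights
    private final Random random;

    /**
     * @param weights item -> positive, unnormalized weight
     * @param random  source of randomness (pass a seeded instance for repeatable data)
     */
    WeightedSampler(Map<T, Double> weights, Random random) {
        if (weights.isEmpty()) {
            throw new IllegalArgumentException("at least one item is required");
        }
        this.random = random;
        this.cumulative = new double[weights.size()];
        double total = 0.0;
        int i = 0;
        for (Map.Entry<T, Double> entry : weights.entrySet()) {
            double w = entry.getValue();
            if (!(w > 0.0)) { // also rejects NaN
                throw new IllegalArgumentException("weights must be positive");
            }
            total += w;
            items.add(entry.getKey());
            cumulative[i++] = total;
        }
    }

    /** Draws one item with probability proportional to its weight. */
    T sample() {
        double u = random.nextDouble() * cumulative[cumulative.length - 1];
        for (int i = 0; i < cumulative.length; i++) {
            if (u < cumulative[i]) {
                return items.get(i);
            }
        }
        // Guard against floating-point rounding at the upper edge.
        return items.get(items.size() - 1);
    }
}
```

Passing the Random in from outside is a deliberate choice in this sketch: a host framework could give each parallel task its own seeded instance, which fits the library-plus-CLI-driver pattern RJ describes, though nothing in the thread confirms that the real samplers work this way.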