Posted to dev@crunch.apache.org by Dmitriy Lyubimov <dl...@gmail.com> on 2012/11/11 22:38:48 UTC

Re: Flume R -- any interest?

Question.

So in the Crunch API, initialize() doesn't get an emitter, and process() gets
an emitter every time.

However, my guess is that any single incarnation of a DoFn object in the
backend will always get the same emitter through its lifecycle. Is that an
admissible assumption, or is there currently a counterexample?

The problem is that as I implement the two-way pipeline of input and
emitter data between R and Java, I am batching these calls together for
performance reasons. The individual data in these chunks will not have
emitter information attached to them in any way. (Well, they could, but that
would be a performance killer, and I bet the emitter never changes.)

So, thoughts? Can I assume the emitter never changes between the first and
last call to a DoFn instance?

thanks.
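To make the assumption concrete, here is a minimal, self-contained sketch of the batching scheme being described. The Emitter interface and the class names are stand-ins, not the actual Crunch API; the sketch is only correct if the assumption holds, i.e. the emitter captured on the first process() call is the one in effect for the instance's whole lifecycle.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchingDemo {

    // Stand-in for Crunch's Emitter (hypothetical, not the real API).
    interface Emitter<T> {
        void emit(T value);
    }

    // A DoFn-like class that captures the emitter on the first process()
    // call and batches inputs. This is only safe under the assumption
    // discussed above: the backend hands the same emitter to every
    // process() call over the instance's lifecycle.
    static class BatchingDoFn {
        static final int BATCH_SIZE = 3;
        private final List<String> batch = new ArrayList<>();
        private Emitter<String> emitter; // cached on first process() call

        void process(String input, Emitter<String> em) {
            if (emitter == null) {
                emitter = em; // valid only if the emitter never changes
            }
            batch.add(input);
            if (batch.size() >= BATCH_SIZE) {
                flush();
            }
        }

        void cleanup() {
            flush(); // drain whatever is left before the task exits
        }

        private void flush() {
            for (String s : batch) {
                emitter.emit(s.toUpperCase());
            }
            batch.clear();
        }
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        Emitter<String> em = out::add;
        BatchingDoFn fn = new BatchingDoFn();
        fn.process("a", em);
        fn.process("b", em); // nothing emitted yet: batch not full
        fn.cleanup();        // flush goes through the cached emitter
        System.out.println(out); // prints [A, B]
    }
}
```

If the backend ever handed a different emitter per call, the records flushed from cleanup() would go through a stale emitter -- which is exactly the risk the question is about.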


On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> yes...
>
> i think it worked for me before, although just adding all jars from R
> package distribution would be a somewhat more appropriate approach
> -- but it creates a problem with jars in dependent R packages. I think
> it would be much easier to just compile a hadoop-job file and stick it
> in rather than doing cherry-picking of individual jars from who knows
> how many locations.
>
> i think i used the hadoop job format with distributed cache before and
> it worked... at least with Pig "register jar" functionality.
>
> ok, I guess I will just try it and see if it works.
>
> On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <jw...@cloudera.com> wrote:
> > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> >
> >> Great! so it is in Crunch.
> >>
> >> does it support hadoop-job jar format or only pure java jars?
> >>
> >
> > I think just pure jars-- you're referring to hadoop-job format as having
> > all the dependencies in a lib/ directory within the jar?
> >
> >
> >>
> >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <jw...@cloudera.com>
> wrote:
> >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <dl...@gmail.com>
> >> wrote:
> >> >
> >> >> I think I need functionality to add more jars (or an external
> >> >> hadoop-jar) to drive that from an R package. Just setting the job jar
> >> >> by class is not enough. I can push the overall job-jar as an
> >> >> additional jar to the R package; however, I cannot really run the
> >> >> hadoop command line on it, I need to set up the classpath through
> >> >> RJava.
> >> >>
> >> >> A traditional single hadoop job jar will likely not work here since we
> >> >> cannot hardcode pipelines in java code but rather have to construct
> >> >> them on the fly. (Well, we could serialize pipeline definitions from R
> >> >> and then replay them in a driver -- but that's too cumbersome and more
> >> >> work than it has to be.) There's no reason why I shouldn't be able to
> >> >> do a pig-like "register jar" or "setJobJar" (mahout-like) when kicking
> >> >> off a pipeline.
> >> >>
> >> >
> >> > o.a.c.util.DistCache.addJarToDistributedCache?
> >> >
> >> >
> >> >>
> >> >>
> >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <
> dlieu.7@gmail.com>
> >> >> wrote:
> >> >> > Ok, sounds very promising...
> >> >> >
> >> >> > i'll try to start digging on the driver part this week then
> (Pipeline
> >> >> > wrapper in R5).
> >> >> >
> >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <josh.wills@gmail.com
> >
> >> >> wrote:
> >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <
> dlieu.7@gmail.com
> >> >
> >> >> wrote:
> >> >> >>> Ok, cool.
> >> >> >>>
> >> >> >>> So what state is Crunch in? I take it it is in a fairly advanced
> >> >> >>> state. So every api mentioned in the FlumeJava paper is working,
> >> >> >>> right? Or is there something that is specifically not working?
> >> >> >>
> >> >> >> I think the only thing in the paper that we don't have in a working
> >> >> >> state is MSCR fusion. It's mostly just a question of prioritizing it
> >> >> >> and getting the work done.
> >> >> >>
> >> >> >>>
> >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <jwills@cloudera.com
> >
> >> >> wrote:
> >> >> >>>> Hey Dmitriy,
> >> >> >>>>
> >> >> >>>> Got a fork going and looking forward to playing with crunchR
> this
> >> >> weekend--
> >> >> >>>> thanks!
> >> >> >>>>
> >> >> >>>> J
> >> >> >>>>
> >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <
> >> dlieu.7@gmail.com>
> >> >> wrote:
> >> >> >>>>
> >> >> >>>>> Project template https://github.com/dlyubimov/crunchR
> >> >> >>>>>
> >> >> >>>>> The default profile does not compile the R artifact; the R
> >> >> >>>>> profile does. For convenience, it is enabled by supplying -DR
> >> >> >>>>> on the mvn command line, e.g.
> >> >> >>>>>
> >> >> >>>>> mvn install -DR
> >> >> >>>>>
> >> >> >>>>> there's also a helper that installs the snapshot version of the
> >> >> >>>>> package in the crunchR module.
> >> >> >>>>>
> >> >> >>>>> There are RJava and JRI java dependencies which I did not find
> >> >> >>>>> anywhere in public maven repos, so they are installed into my
> >> >> >>>>> github maven repo so far. Should compile for 3rd parties.
> >> >> >>>>>
> >> >> >>>>> -DR compilation requires R, RJava and, optionally, RProtoBuf.
> >> >> >>>>> R Doc compilation requires roxygen2 (I think).
> >> >> >>>>>
> >> >> >>>>> For some reason RProtoBuf fails to import into another package; I
> >> >> >>>>> got a weird exception when I put @import RProtoBuf into crunchR,
> >> >> >>>>> so RProtoBuf is now in the "Suggests" category. Down the road
> >> >> >>>>> that may be a problem though...
> >> >> >>>>>
> >> >> >>>>> other than the template, not much else has been done so far...
> >> >> >>>>> finding the hadoop libraries and adding them to the package path
> >> >> >>>>> on initialization via "hadoop classpath"... adding the Crunch
> >> >> >>>>> jars and their non-"provided" transitives to crunchR's java
> >> >> >>>>> part...
> >> >> >>>>>
> >> >> >>>>> No legal stuff...
> >> >> >>>>>
> >> >> >>>>> No readmes... complete stealth at this point.
> >> >> >>>>>
> >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <
> >> >> dlieu.7@gmail.com>
> >> >> >>>>> wrote:
> >> >> >>>>> > Ok, cool. I will try to roll a project template by some time
> >> >> >>>>> > next week. We can start with prototyping and benchmarking
> >> >> >>>>> > something really simple, such as parallelDo().
> >> >> >>>>> >
> >> >> >>>>> > My interim goal is to perhaps take some more or less simple
> >> >> >>>>> > algorithm from Mahout and demonstrate it can be solved with
> >> >> >>>>> > Rcrunch (or whatever name it has to be) in a comparable time
> >> >> >>>>> > (performance) but with much fewer lines of code. (say one of
> >> >> >>>>> > the factorization or clustering things)
> >> >> >>>>> >
> >> >> >>>>> >
> >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <rs...@xebia.com>
> >> wrote:
> >> >> >>>>> >> I am not much of an R user but I am interested to see how
> >> >> >>>>> >> well we can integrate the two. I would be happy to help.
> >> >> >>>>> >>
> >> >> >>>>> >> regards,
> >> >> >>>>> >> Rahul
> >> >> >>>>> >>
> >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
> >> >> >>>>> >>>
> >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <
> >> >> dlieu.7@gmail.com>
> >> >> >>>>> >>> wrote:
> >> >> >>>>> >>>>
> >> >> >>>>> >>>> Yep, ok.
> >> >> >>>>> >>>>
> >> >> >>>>> >>>> I imagine it has to be an R module so I can set up a maven
> >> >> >>>>> >>>> project with a java/R code tree (I have been doing that a
> >> >> >>>>> >>>> lot lately). Or if you have a template to look at, it would
> >> >> >>>>> >>>> be useful I guess too.
> >> >> >>>>> >>>
> >> >> >>>>> >>> No, please go right ahead.
> >> >> >>>>> >>>
> >> >> >>>>> >>>>
> >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <
> >> >> josh.wills@gmail.com>
> >> >> >>>>> wrote:
> >> >> >>>>> >>>>>
> >> >> >>>>> >>>>> I'd like it to be separate at first, but I am happy to
> help.
> >> >> Github
> >> >> >>>>> >>>>> repo?
> >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <
> >> dlieu.7@gmail.com
> >> >> >
> >> >> >>>>> wrote:
> >> >> >>>>> >>>>>
> >> >> >>>>> >>>>>> Ok, maybe there's a benefit to trying a JRI/RJava
> >> >> >>>>> >>>>>> prototype on top of Crunch for something simple. This
> >> >> >>>>> >>>>>> should both save time and prove or disprove whether
> >> >> >>>>> >>>>>> Crunch-via-RJava integration is viable.
> >> >> >>>>> >>>>>>
> >> >> >>>>> >>>>>> On my part I can try to do it within the Crunch framework
> >> >> >>>>> >>>>>> or we can keep it completely separate.
> >> >> >>>>> >>>>>>
> >> >> >>>>> >>>>>> -d
> >> >> >>>>> >>>>>>
> >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <
> >> >> jwills@cloudera.com>
> >> >> >>>>> >>>>>> wrote:
> >> >> >>>>> >>>>>>>
> >> >> >>>>> >>>>>>> I am an avid R user and would be into it-- who gave the
> >> >> >>>>> >>>>>>> talk? Was it Murray Stokely?
> >> >> >>>>> >>>>>>>
> >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <
> >> >> >>>>> dlieu.7@gmail.com>
> >> >> >>>>> >>>>>>
> >> >> >>>>> >>>>>> wrote:
> >> >> >>>>> >>>>>>>>
> >> >> >>>>> >>>>>>>> Hello,
> >> >> >>>>> >>>>>>>>
> >> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's experience
> >> >> >>>>> >>>>>>>> with an R mapping of flume java at one of the recent
> >> >> >>>>> >>>>>>>> BARUGs. I think a lot of applications similar to what we
> >> >> >>>>> >>>>>>>> do in Mahout could be prototyped using flume R.
> >> >> >>>>> >>>>>>>>
> >> >> >>>>> >>>>>>>> I did not quite get the details of Google's
> >> >> >>>>> >>>>>>>> implementation of the R mapping, but I am not sure if
> >> >> >>>>> >>>>>>>> just a direct mapping from R to Crunch would be
> >> >> >>>>> >>>>>>>> sufficient (and, for the most part, efficient).
> >> >> >>>>> >>>>>>>> RJava/JRI and jni seem to be pretty terrible performers
> >> >> >>>>> >>>>>>>> for doing that directly.
> >> >> >>>>> >>>>>>>>
> >> >> >>>>> >>>>>>>>
> >> >> >>>>> >>>>>>>> on top of it, I am thinking that if this project could
> >> >> >>>>> >>>>>>>> have a contributed adapter to Mahout's distributed
> >> >> >>>>> >>>>>>>> matrices, that would be just a very good synergy.
> >> >> >>>>> >>>>>>>>
> >> >> >>>>> >>>>>>>> Is there anyone interested in contributing/advising for
> >> >> >>>>> >>>>>>>> an open source version of flume R support? Just gauging
> >> >> >>>>> >>>>>>>> interest; the Crunch list seems like a natural place to
> >> >> >>>>> >>>>>>>> poke.
> >> >> >>>>> >>>>>>>>
> >> >> >>>>> >>>>>>>> Thanks .
> >> >> >>>>> >>>>>>>>
> >> >> >>>>> >>>>>>>> -Dmitriy
> >> >> >>>>> >>>>>>>
> >> >> >>>>> >>>>>>>
> >> >> >>>>> >>>>>>>
> >> >> >>>>> >>>>>>> --
> >> >> >>>>> >>>>>>> Director of Data Science
> >> >> >>>>> >>>>>>> Cloudera
> >> >> >>>>> >>>>>>> Twitter: @josh_wills
> >> >> >>>>> >>>
> >> >> >>>>> >>>
> >> >> >>>>> >>>
> >> >> >>>>> >>
> >> >> >>>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>> --
> >> >> >>>> Director of Data Science
> >> >> >>>> Cloudera <http://www.cloudera.com>
> >> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Director of Data Science
> >> > Cloudera <http://www.cloudera.com>
> >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >>
> >
> >
> >
> > --
> > Director of Data Science
> > Cloudera <http://www.cloudera.com>
> > Twitter: @josh_wills <http://twitter.com/josh_wills>
>
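As an aside on the "register jar" discussion in the thread above: for a pipeline constructed on the fly, the registration side can be modeled as nothing more than accumulating jar paths and folding them into one comma-separated value of the kind Hadoop's distributed-cache job properties take. The class name, the example paths, and the tmpjars convention mentioned in the comment are illustrative assumptions, not Crunch's API.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal model of a pig-like "register jar" for a dynamically constructed
// pipeline: accumulate jar paths (e.g. discovered inside an installed R
// package) and emit them as one comma-separated value, the shape Hadoop's
// distributed-cache job properties conventionally take ("tmpjars" is an
// illustrative key; this is not Crunch's actual API).
public class JarRegistry {
    private final List<String> jars = new ArrayList<>();

    public void registerJar(String path) {
        if (!jars.contains(path)) { // don't ship the same jar twice
            jars.add(path);
        }
    }

    public String toTmpJarsValue() {
        return String.join(",", jars);
    }

    public static void main(String[] args) {
        JarRegistry r = new JarRegistry();
        // Hypothetical install locations, for illustration only.
        r.registerJar("/opt/R/library/crunchR/java/crunchR.jar");
        r.registerJar("/opt/R/library/rJava/jri/JRI.jar");
        r.registerJar("/opt/R/library/crunchR/java/crunchR.jar"); // deduped
        System.out.println(r.toTmpJarsValue());
    }
}
```

The same accumulated list could just as well be fed to whatever per-jar call the framework exposes (the thread mentions o.a.c.util.DistCache.addJarToDistributedCache) instead of a single config value.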

Re: Flume R -- any interest?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Well, I figured out a way not to need it. So no, total node shutdown for the
R side is not implemented, but that is actually fine.

Basically, on the backend, the two-way pipeline is never explicitly shut
down, which is fine as long as DoFn cleanup is synchronized.

Indeed, the problem of course is that in general we can't assume any
particular lifecycle for a DoFn and have to assume they may spring to life
in arbitrary order, and clean up in arbitrary order as well.

The solution is just to flush the pipelines once DoFn cleanup is
encountered and wait for a cleanup receipt from the DoFn's R doppelganger
before exiting the DoFn's cleanup.

Once Crunch has cleaned up all the DoFns, it is thus ensured that all R
processing is also flushed and the queues are empty. It may be a little
less optimal than a single-stage cleanup of everything, but hopefully
cleanup is a smaller part compared to process(). (Actually, in Mahout SSVD
the cleanup emissions are often just as big as, or even larger than, all
the process() emissions, so I took care that these scenarios work just as
well.)
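The flush-and-receipt handshake described above can be modeled with two blocking queues. Everything here (names, sentinel strings, the echo worker standing in for the R-side doppelganger) is illustrative, not crunchR's actual protocol; the point is only the ordering guarantee: cleanup() returns only after the R side has drained everything it was sent.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Toy model of the cleanup handshake: cleanup() flushes the outbound queue
// and blocks until the R-side worker acknowledges that it has drained
// everything for this DoFn. Names and sentinels are illustrative.
public class CleanupHandshake {
    private final BlockingQueue<String> toR = new LinkedBlockingQueue<>();
    private final BlockingQueue<String> fromR = new LinkedBlockingQueue<>();

    public CleanupHandshake() {
        // Stand-in for the R doppelganger: echoes work items, acks CLEANUP.
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String msg = toR.take();
                    if (msg.equals("CLEANUP")) {
                        fromR.put("CLEANUP_ACK");
                        return;
                    }
                    fromR.put("done:" + msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    public void process(String datum) {
        // Batched in the real pipeline; one-at-a-time here for clarity.
        try {
            toR.put(datum);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Flush, then wait for the receipt before letting the framework close
    // the outputs; any in-flight results are drained before the ack.
    public String cleanup() {
        try {
            toR.put("CLEANUP");
            String msg;
            while (!(msg = fromR.take()).equals("CLEANUP_ACK")) {
                // drain remaining process() results that precede the ack
            }
            return msg;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        CleanupHandshake h = new CleanupHandshake();
        h.process("row1");
        System.out.println(h.cleanup()); // prints CLEANUP_ACK
    }
}
```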



On Sun, Nov 18, 2012 at 9:38 AM, Josh Wills <jo...@gmail.com> wrote:

> Curious-- did you figure out a hack to make this work, or is this still an
> open issue?
>
>
> On Fri, Nov 16, 2012 at 3:08 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > Or RTNode? I guess I am not sure what the difference is.
> >
> > Bottom line, I need to do some task startup routines (e.g. establish
> > exchange queues between the task and R) and also a final cleanup before
> > the MR task exits and _before all outputs are closed_ (a kind of "flush
> > all" thing).
> >
> > Thanks.
> > -d
> >
> >
> > On Fri, Nov 16, 2012 at 3:04 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> > > How do I hook into CrunchTaskContext to do a task cleanup (as opposed to
> > > a DoFn etc.)?
> > >
> > >
> > > On Fri, Nov 16, 2012 at 2:52 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> > >wrote:
> > >
> > >> no it is fully distributed testing.
> > >>
> > >> It is ok, StatET handles log4j logging for me so I see the logs. I was
> > >> wondering if any end-to-end diagnostics is already embedded in Crunch,
> > >> but reporting backend errors to the front end is notoriously hard (and
> > >> sometimes impossible) with hadoop, so I assume it doesn't make sense to
> > >> report client-only stuff through an exception while the other stuff
> > >> still requires checking isSucceeded().
> > >>
> > >>
> > >>
> > >> On Fri, Nov 16, 2012 at 11:07 AM, Josh Wills <jw...@cloudera.com>
> > wrote:
> > >>
> > >>> Are you running this using LocalJobRunner? Does calling
> > >>> Pipeline.enableDebug() before run() help? If it doesn't, it'll help
> > >>> settle a debate I'm having w/Matthias. ;-)
> > >>>
> > >>> On Fri, Nov 16, 2012 at 10:22 AM, Dmitriy Lyubimov <
> dlieu.7@gmail.com>
> > >>> wrote:
> > >>> > I see the error in the logs but Pipeline.run() has never thrown
> > >>> > anything. isSucceeded() subsequently returns false. Is there any way
> > >>> > to extract the client-side problem rather than just being able to
> > >>> > state that the job failed? Or is this ok and the only diagnostics by
> > >>> > design?
> > >>> >
> > >>> > ============
> > >>> > 68124 [Thread-8] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  -
> > >>> > org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
> > >>> > does not exist: hdfs://localhost:11010/crunchr-example/input
> > >>> >   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
> > >>> >   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
> > >>> >   at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944)
> > >>> >   at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961)
> > >>> >   at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
> > >>> >   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
> > >>> >   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
> > >>> >   at java.security.AccessController.doPrivileged(Native Method)
> > >>> >   at javax.security.auth.Subject.doAs(Subject.java:396)
> > >>> >   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
> > >>> >   at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
> > >>> >   at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
> > >>> >   at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:331)
> > >>> >   at org.apache.crunch.impl.mr.exec.CrunchJob.submit(CrunchJob.java:135)
> > >>> >   at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:251)
> > >>> >   at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.run(CrunchJobControl.java:279)
> > >>> >   at java.lang.Thread.run(Thread.java:662)
> > >>> >
> > >>> >
> > >>> > On Mon, Nov 12, 2012 at 5:41 PM, Dmitriy Lyubimov <
> dlieu.7@gmail.com
> > >
> > >>> wrote:
> > >>> >
> > >>> >> for hadoop nodes I guess yet another option is to soft-link the .so
> > >>> >> into hadoop's native lib folder
> > >>> >>
> > >>> >>
> > >>> >> On Mon, Nov 12, 2012 at 5:37 PM, Dmitriy Lyubimov <
> > dlieu.7@gmail.com
> > >>> >wrote:
> > >>> >>
> > >>> >>> I actually want to defer this to the hadoop admins; we just need
> > >>> >>> to create a procedure for setting up nodes, ideally as simple as
> > >>> >>> possible. Something like:
> > >>> >>>
> > >>> >>> 1) setup R
> > >>> >>> 2) install.packages("rJava","RProtoBuf","crunchR")
> > >>> >>> 3) R CMD javareconf
> > >>> >>> 4) add the result of R --vanilla <<< 'system.file("jri",
> > >>> >>> package="rJava")' to either the mapred command lines or
> > >>> >>> LD_LIBRARY_PATH...
> > >>> >>>
> > >>> >>> but it will depend on their versions of hadoop, jre etc. I hoped
> > >>> >>> crunch might have something to hide a lot of that complexity
> > >>> >>> (since it is about hiding complexities, for the most part :) ).
> > >>> >>> Besides, hadoop has a way to ship .so's to the backend, so if
> > >>> >>> crunch had an api to do something similar, it is conceivable that
> > >>> >>> the driver might yank and ship it too, to hide that complexity as
> > >>> >>> well. But then there's a host of issues around how to handle
> > >>> >>> potentially different rJava versions installed on different
> > >>> >>> nodes... So, it increasingly looks like something we might want to
> > >>> >>> defer to sysops to do, with an approximate set of requirements.
> > >>> >>>
> > >>> >>>
> > >>> >>> On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <jwills@cloudera.com
> >
> > >>> wrote:
> > >>> >>>
> > >>> >>>> On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <
> > >>> dlieu.7@gmail.com>
> > >>> >>>> wrote:
> > >>> >>>>
> > >>> >>>> > so java tasks need to be able to load libjri.so from
> > >>> >>>> > whatever system.file("jri", package="rJava") says.
> > >>> >>>> >
> > >>> >>>> > Traditionally, these issues were handled with
> > >>> >>>> > -Djava.library.path. Apparently there's nothing a java task can
> > >>> >>>> > do to enable the loadLibrary() command to see the damn library
> > >>> >>>> > once started. But -Djava.library.path requires the nodes to
> > >>> >>>> > configure and lock the jvm command line against modifications
> > >>> >>>> > by the client, which is fine.
> > >>> >>>> >
> > >>> >>>> > I also discovered that LD_LIBRARY_PATH actually works with jre
> > >>> >>>> > 1.6 (again).
> > >>> >>>> >
> > >>> >>>> > but... any other suggestions about best practices for
> > >>> >>>> > configuring crunch to run user .so's?
> > >>> >>>> >
> > >>> >>>>
> > >>> >>>> Not off the top of my head. I suspect that whatever you come up
> > >>> >>>> with will become the "best practice." :)
> > >>> >>>>
> > >>> >>>> >
> > >>> >>>> > thanks.
> > >>> >>>> >
> > >>> >>>> >
> > >>> >>>> >
> > >>> >>>> >
> > >>> >>>> >
> > >>> >>>> >
> > >>> >>>> > On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <
> > josh.wills@gmail.com
> > >>> >
> > >>> >>>> wrote:
> > >>> >>>> >
> > >>> >>>> > > I believe that is a safe assumption, at least right now.
> > >>> >>>> > > > > >> >> if you
> > >>> >>>> > > > > >> >> >>>>> >>>> have a template to look at, it would be
> > >>> useful i
> > >>> >>>> > guess
> > >>> >>>> > > > > too.
> > >>> >>>> > > > > >> >> >>>>> >>>
> > >>> >>>> > > > > >> >> >>>>> >>> No, please go right ahead.
> > >>> >>>> > > > > >> >> >>>>> >>>
> > >>> >>>> > > > > >> >> >>>>> >>>>
> > >>> >>>> > > > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh
> > >>> Wills <
> > >>> >>>> > > > > >> >> josh.wills@gmail.com>
> > >>> >>>> > > > > >> >> >>>>> wrote:
> > >>> >>>> > > > > >> >> >>>>> >>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>> I'd like it to be separate at first,
> but
> > >>> I am
> > >>> >>>> happy
> > >>> >>>> > > to
> > >>> >>>> > > > > help.
> > >>> >>>> > > > > >> >> Github
> > >>> >>>> > > > > >> >> >>>>> >>>>> repo?
> > >>> >>>> > > > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy
> > >>> Lyubimov" <
> > >>> >>>> > > > > >> dlieu.7@gmail.com
> > >>> >>>> > > > > >> >> >
> > >>> >>>> > > > > >> >> >>>>> wrote:
> > >>> >>>> > > > > >> >> >>>>> >>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>> Ok maybe there's a benefit to try a
> > >>> JRI/RJava
> > >>> >>>> > > > prototype
> > >>> >>>> > > > > on
> > >>> >>>> > > > > >> >> top of
> > >>> >>>> > > > > >> >> >>>>> >>>>>> Crunch for something simple. This
> > should
> > >>> both
> > >>> >>>> save
> > >>> >>>> > > > time
> > >>> >>>> > > > > and
> > >>> >>>> > > > > >> >> prove or
> > >>> >>>> > > > > >> >> >>>>> >>>>>> disprove if Crunch via RJava
> > integration
> > >>> is
> > >>> >>>> > viable.
> > >>> >>>> > > > > >> >> >>>>> >>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>> On my part i can try to do it within
> > >>> Crunch
> > >>> >>>> > > framework
> > >>> >>>> > > > > or we
> > >>> >>>> > > > > >> >> can keep
> > >>> >>>> > > > > >> >> >>>>> >>>>>> it completely separate.
> > >>> >>>> > > > > >> >> >>>>> >>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>> -d
> > >>> >>>> > > > > >> >> >>>>> >>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh
> > >>> Wills <
> > >>> >>>> > > > > >> >> jwills@cloudera.com>
> > >>> >>>> > > > > >> >> >>>>> >>>>>> wrote:
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>> I am an avid R user and would be
> into
> > >>> it--
> > >>> >>>> who
> > >>> >>>> > gave
> > >>> >>>> > > > the
> > >>> >>>> > > > > >> >> talk? Was
> > >>> >>>> > > > > >> >> >>>>> it
> > >>> >>>> > > > > >> >> >>>>> >>>>>>> Murray Stokely?
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM,
> > Dmitriy
> > >>> >>>> > Lyubimov <
> > >>> >>>> > > > > >> >> >>>>> dlieu.7@gmail.com>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>> wrote:
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> Hello,
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> I was pretty excited to learn of
> > >>> Google's
> > >>> >>>> > > experience
> > >>> >>>> > > > > of R
> > >>> >>>> > > > > >> >> mapping
> > >>> >>>> > > > > >> >> >>>>> of
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> flume java on one of recent
> BARUGs. I
> > >>> think
> > >>> >>>> a
> > >>> >>>> > lot
> > >>> >>>> > > of
> > >>> >>>> > > > > >> >> applications
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> similar to what we do in Mahout
> could
> > >>> be
> > >>> >>>> > > prototyped
> > >>> >>>> > > > > using
> > >>> >>>> > > > > >> >> flume R.
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> I did not quite get the details of
> > >>> Google
> > >>> >>>> > > > > implementation
> > >>> >>>> > > > > >> of
> > >>> >>>> > > > > >> >> R
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> mapping,
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> but i am not sure if just a direct
> > >>> mapping
> > >>> >>>> from
> > >>> >>>> > R
> > >>> >>>> > > to
> > >>> >>>> > > > > >> Crunch
> > >>> >>>> > > > > >> >> would
> > >>> >>>> > > > > >> >> >>>>> be
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> sufficient (and, for most part,
> > >>> efficient).
> > >>> >>>> > > > RJava/JRI
> > >>> >>>> > > > > and
> > >>> >>>> > > > > >> >> jni
> > >>> >>>> > > > > >> >> >>>>> seem to
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> be a pretty terrible performer to
> do
> > >>> that
> > >>> >>>> > > directly.
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> on top of it, I am thinknig if this
> > >>> project
> > >>> >>>> > could
> > >>> >>>> > > > > have a
> > >>> >>>> > > > > >> >> >>>>> contributed
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> adapter to Mahout's distributed
> > >>> matrices,
> > >>> >>>> that
> > >>> >>>> > > would
> > >>> >>>> > > > > be
> > >>> >>>> > > > > >> >> just a
> > >>> >>>> > > > > >> >> >>>>> very
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> good synergy.
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> Is there anyone interested in
> > >>> >>>> > > contributing/advising
> > >>> >>>> > > > > for
> > >>> >>>> > > > > >> open
> > >>> >>>> > > > > >> >> >>>>> source
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> version of flume R support? Just
> > >>> gauging
> > >>> >>>> > interest,
> > >>> >>>> > > > > Crunch
> > >>> >>>> > > > > >> >> list
> > >>> >>>> > > > > >> >> >>>>> seems
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> like a natural place to poke.
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> Thanks .
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> -Dmitriy
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>> --
> > >>> >>>> > > > > >> >> >>>>> >>>>>>> Director of Data Science
> > >>> >>>> > > > > >> >> >>>>> >>>>>>> Cloudera
> > >>> >>>> > > > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
> > >>> >>>> > > > > >> >> >>>>> >>>
> > >>> >>>> > > > > >> >> >>>>> >>>
> > >>> >>>> > > > > >> >> >>>>> >>>
> > >>> >>>> > > > > >> >> >>>>> >>
> > >>> >>>> > > > > >> >> >>>>>
> > >>> >>>> > > > > >> >> >>>>
> > >>> >>>> > > > > >> >> >>>>
> > >>> >>>> > > > > >> >> >>>>
> > >>> >>>> > > > > >> >> >>>> --
> > >>> >>>> > > > > >> >> >>>> Director of Data Science
> > >>> >>>> > > > > >> >> >>>> Cloudera <http://www.cloudera.com>
> > >>> >>>> > > > > >> >> >>>> Twitter: @josh_wills <
> > >>> http://twitter.com/josh_wills>
> > >>> >>>> > > > > >> >>
> > >>> >>>> > > > > >> >
> > >>> >>>> > > > > >> >
> > >>> >>>> > > > > >> >
> > >>> >>>> > > > > >> > --
> > >>> >>>> > > > > >> > Director of Data Science
> > >>> >>>> > > > > >> > Cloudera <http://www.cloudera.com>
> > >>> >>>> > > > > >> > Twitter: @josh_wills <
> http://twitter.com/josh_wills>
> > >>> >>>> > > > > >>
> > >>> >>>> > > > > >
> > >>> >>>> > > > > >
> > >>> >>>> > > > > >
> > >>> >>>> > > > > > --
> > >>> >>>> > > > > > Director of Data Science
> > >>> >>>> > > > > > Cloudera <http://www.cloudera.com>
> > >>> >>>> > > > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
> > >>> >>>> > > > >
> > >>> >>>> > > >
> > >>> >>>> > >
> > >>> >>>> >
> > >>> >>>>
> > >>> >>>>
> > >>> >>>>
> > >>> >>>> --
> > >>> >>>> Director of Data Science
> > >>> >>>> Cloudera <http://www.cloudera.com>
> > >>> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
> > >>> >>>>
> > >>> >>>
> > >>> >>>
> > >>> >>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Director of Data Science
> > >>> Cloudera
> > >>> Twitter: @josh_wills
> > >>>
> > >>
> > >>
> > >
> >
>

Re: Flume R -- any interest?

Posted by Josh Wills <jo...@gmail.com>.
Curious-- did you figure out a hack to make this work, or is this still an
open issue?


On Fri, Nov 16, 2012 at 3:08 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Or RTNode? I guess i am not sure what the difference is.
>
> Bottom line, i need to do some task startup routines (e.g. establish
> exchange queues between the task and R) and also a last-thing cleanup before
> the MR task exits and _before all outputs are closed_ (kind of a "flush all"
> thing).
>
> Thanks.
> -d
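The "flush everything before the task's outputs close" requirement above can be sketched independently of Crunch's actual hook points (which is exactly what the question is probing). The class and names below are illustrative, not the Crunch API; a `Consumer` stands in for the emitter/output:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: a processor that batches records for a sidecar
// process (e.g. R) and must flush the final partial batch in a cleanup
// hook that runs before the task's outputs are closed.
class BatchingProcessor {
    private final List<String> buffer = new ArrayList<>();
    private final int batchSize;
    private final Consumer<List<String>> sink; // stands in for the emitter

    BatchingProcessor(int batchSize, Consumer<List<String>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    void process(String record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Must run before outputs close, or the tail batch is silently lost.
    void cleanup() {
        if (!buffer.isEmpty()) {
            flush();
        }
    }

    private void flush() {
        sink.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

Whatever task-level hook ends up being used, the key property is that `cleanup()` is guaranteed to run after the last `process()` call and before output close.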
>
>
> On Fri, Nov 16, 2012 at 3:04 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > How do I hook into CrunchTaskContext to do a task cleanup (as opposed to
> a
> > DoFn etc.) ?
> >
> >
> > On Fri, Nov 16, 2012 at 2:52 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >wrote:
> >
> >> no it is fully distributed testing.
> >>
> >> It is ok, StatET handles log4j logging for me so i see the logs. I was
> >> wondering whether any end-to-end diagnostics are already embedded in Crunch,
> >> but reporting backend errors to the front end is notoriously hard (and
> >> sometimes impossible) with hadoop, so I assume it doesn't make sense to
> >> report client-only stuff thru an exception while the other stuff still
> >> requires checking isSucceeded().
> >>
> >>
> >>
> >> On Fri, Nov 16, 2012 at 11:07 AM, Josh Wills <jw...@cloudera.com>
> wrote:
> >>
> >>> Are you running this using LocalJobRunner? Does calling
> >>> Pipeline.enableDebug() before run() help? If it doesn't, it'll help
> >>> settle a debate I'm having w/Matthias. ;-)
> >>>
> >>> On Fri, Nov 16, 2012 at 10:22 AM, Dmitriy Lyubimov <dl...@gmail.com>
> >>> wrote:
> >>> > I see the error in the logs but Pipeline.run() has never thrown anything.
> >>> > isSucceeded() subsequently returns false. Is there any way to extract the
> >>> > client-side problem rather than just being able to state that the job
> >>> > failed? Or is it ok, and that is the only diagnostics by design?
> >>> >
> >>> > ============
> >>> > 68124 [Thread-8] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  -
> >>> > org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input
> path
> >>> > does not exist: hdfs://localhost:11010/crunchr-example/input
> >>> > at
> >>> >
> >>>
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
> >>> > at
> >>> >
> >>>
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
> >>> > at
> >>> org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944)
> >>> > at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961)
> >>> > at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
> >>> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
> >>> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
> >>> > at java.security.AccessController.doPrivileged(Native Method)
> >>> > at javax.security.auth.Subject.doAs(Subject.java:396)
> >>> > at
> >>> >
> >>>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
> >>> > at
> >>>
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
> >>> > at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
> >>> > at
> >>> >
> >>>
> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:331)
> >>> > at
> org.apache.crunch.impl.mr.exec.CrunchJob.submit(CrunchJob.java:135)
> >>> > at
> >>> >
> >>>
> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:251)
> >>> > at
> >>> >
> >>>
> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.run(CrunchJobControl.java:279)
> >>> > at java.lang.Thread.run(Thread.java:662)
> >>> >
> >>> >
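The InvalidInputException above surfaces only at submission time; a client-side pre-flight check gives a clearer failure earlier. On a real cluster this would go through `org.apache.hadoop.fs.FileSystem#exists` against the `hdfs://` URI; the sketch below uses `java.nio` on the local filesystem only to stay self-contained:

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of a client-side pre-flight check for pipeline inputs.
// (With Hadoop, substitute FileSystem.get(conf).exists(new Path(uri)).)
class InputCheck {
    static void requireInput(Path input) {
        if (!Files.exists(input)) {
            throw new IllegalArgumentException(
                "Input path does not exist: " + input);
        }
    }
}
```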
> >>> > On Mon, Nov 12, 2012 at 5:41 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >
> >>> wrote:
> >>> >
> >>> >> for hadoop nodes i guess yet another option is to soft-link the .so into
> >>> >> hadoop's native lib folder
> >>> >>
> >>> >>
> >>> >> On Mon, Nov 12, 2012 at 5:37 PM, Dmitriy Lyubimov <
> dlieu.7@gmail.com
> >>> >wrote:
> >>> >>
> >>> >>> I actually want to defer this to hadoop admins; we just need to create a
> >>> >>> procedure for setting up nodes. Ideally as simple as possible, something
> >>> >>> like:
> >>> >>>
> >>> >>> 1) setup R
> >>> >>> 2) install.packages("rJava","RProtoBuf","crunchR")
> >>> >>> 3) R CMD javareconf
> >>> >>> 4) add the result of R --vanilla <<< 'system.file("jri", package="rJava")'
> >>> >>> to either mapred command lines or LD_LIBRARY_PATH...
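Once the jri directory from the last step is known, the JVM side can verify it is actually visible before attempting `System.loadLibrary("jri")`. A small self-contained sketch (the helper name is illustrative):

```java
import java.io.File;
import java.util.Arrays;

// Sketch: check whether a directory (e.g. rJava's jri folder) is on
// java.library.path before calling System.loadLibrary("jri"). This is
// set either via -Djava.library.path or inherited from LD_LIBRARY_PATH.
class LibPathCheck {
    static boolean onLibraryPath(String dir) {
        String path = System.getProperty("java.library.path", "");
        return Arrays.asList(path.split(File.pathSeparator)).contains(dir);
    }
}
```

Failing fast with a descriptive error here is friendlier than the bare `UnsatisfiedLinkError` the JVM throws otherwise.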
> >>> >>>
> >>> >>> but it will depend on their versions of hadoop, jre etc. I hoped
> >>> crunch
> >>> >>> might have something to hide a lot of that complexity (since it is
> >>> about
> >>> >>> hiding complexities, for the most part :)  ) besides hadoop has a
> >>> way to
> >>> >>> ship .so's to the backend so if crunch had an api to do something
> >>> similar
> >>> >>> it is conceivable that driver might yank and ship it too to hide
> that
> >>> >>> complexity as well. But then there's a host of issues how to handle
> >>> >>> potentially different rJava versions installed on different
> nodes...
> >>> So, it
> >>> >>> increasingly looks like something we might want to defer to sysops
> >>> to do
> >>> >>> with approximate set of requirements .
> >>> >>>
> >>> >>>
> >>> >>> On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <jw...@cloudera.com>
> >>> wrote:
> >>> >>>
> >>> >>>> On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <
> >>> dlieu.7@gmail.com>
> >>> >>>> wrote:
> >>> >>>>
> >>> >>>> > so java tasks need to be able to load libjri.so from
> >>> >>>> > whatever system.file("jri", package="rJava") says.
> >>> >>>> >
> >>> >>>> > Traditionally, these issues were handled with
> -Djava.library.path.
> >>> >>>> > Apparently there's nothing the java task can do to enable the
> >>> >>>> > loadLibrary() command to see the damn library once started. But
> >>> >>>> > -Djava.library.path requires the nodes to configure and lock the jvm
> >>> >>>> > command line against modifications by the client, which is fine.
> >>> >>>> >
> >>> >>>> > I also discovered that LD_LIBRARY_PATH actually works with jre 1.6 (again).
> >>> >>>> >
> >>> >>>> > but... any other suggestions about best practice configuring crunch to
> >>> >>>> > run user's .so's?
> >>> >>>> >
> >>> >>>>
> >>> >>>> Not off the top of my head. I suspect that whatever you come up with will
> >>> >>>> become the "best practice." :)
> >>> >>>>
> >>> >>>> >
> >>> >>>> > thanks.
> >>> >>>> >
> >>> >>>> >
> >>> >>>> >
> >>> >>>> >
> >>> >>>> >
> >>> >>>> >
> >>> >>>> > On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <
> josh.wills@gmail.com
> >>> >
> >>> >>>> wrote:
> >>> >>>> >
> >>> >>>> > > I believe that is a safe assumption, at least right now.
> >>> >>>> > >
> >>> >>>> > >
> >>> >>>> > > On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <
> >>> dlieu.7@gmail.com
> >>> >>>> >
> >>> >>>> > > wrote:
> >>> >>>> > >
> >>> >>>> > > > Question.
> >>> >>>> > > >
> >>> >>>> > > > So in Crunch api, initialize() doesn't get an emitter. and
> the
> >>> >>>> process
> >>> >>>> > > gets
> >>> >>>> > > > emitter every time.
> >>> >>>> > > >
> >>> >>>> > > > However, my guess is any single reincarnation of a DoFn object in the
> >>> >>>> > > > backend will always be getting the same emitter thru its lifecycle. Is
> >>> >>>> > > > that an admissible assumption, or is there currently a counterexample?
> >>> >>>> > > >
> >>> >>>> > > > The problem is that as i implement the two-way pipeline of input and
> >>> >>>> > > > emitter data between R and Java, I am bulking these calls together for
> >>> >>>> > > > performance reasons. Each individual datum in these chunks of data will
> >>> >>>> > > > not have emitter function information attached to it in any way. (well,
> >>> >>>> > > > it could, but it would be a performance killer and i bet the emitter
> >>> >>>> > > > never changes).
> >>> >>>> > > >
> >>> >>>> > > > So, thoughts? can i assume the emitter never changes between the
> >>> >>>> > > > first and last call to a DoFn instance?
> >>> >>>> > > >
> >>> >>>> > > > thanks.
> >>> >>>> > > >
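The assumption asked about above can be encoded defensively: cache the emitter reference on the first `process()` call, verify later calls pass the same instance, and batch emits through the cached reference. A plain-Java sketch (not the Crunch API; a `Consumer` stands in for Crunch's `Emitter`):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the "cache the emitter once" assumption: remember the
// emitter from the first process() call, check later calls pass the
// same instance, and emit batched results through the cached reference.
class CachingFn {
    private Consumer<String> cachedEmitter;
    private final List<String> pending = new ArrayList<>();

    void process(String input, Consumer<String> emitter) {
        if (cachedEmitter == null) {
            cachedEmitter = emitter;           // first call: cache it
        } else if (cachedEmitter != emitter) {
            // the assumption the thread asks about would be violated here
            throw new IllegalStateException("emitter changed mid-lifecycle");
        }
        pending.add(input.toUpperCase());
        if (pending.size() >= 3) {             // pretend batch boundary
            pending.forEach(cachedEmitter);
            pending.clear();
        }
    }
}
```

If the assumption ever breaks, the `IllegalStateException` makes the failure loud instead of silently routing records to a stale emitter.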
> >>> >>>> > > >
> >>> >>>> > > > On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <
> >>> >>>> dlieu.7@gmail.com>
> >>> >>>> > > > wrote:
> >>> >>>> > > >
> >>> >>>> > > > > yes...
> >>> >>>> > > > >
> >>> >>>> > > > > i think it worked for me before, although just adding all
> >>> jars
> >>> >>>> from R
> >>> >>>> > > > > package distribution would be a little bit more
> appropriate
> >>> >>>> approach
> >>> >>>> > > > > -- but it creates a problem with jars in dependent R
> >>> packages. I
> >>> >>>> > think
> >>> >>>> > > > > it would be much easier to just compile a hadoop-job file
> >>> and
> >>> >>>> stick
> >>> >>>> > it
> >>> >>>> > > > > in rather than doing cherry-picking of individual jars
> from
> >>> who
> >>> >>>> knows
> >>> >>>> > > > > how many locations.
> >>> >>>> > > > >
> >>> >>>> > > > > i think i used the hadoop job format with distributed
> cache
> >>> >>>> before
> >>> >>>> > and
> >>> >>>> > > > > it worked... at least with Pig "register jar"
> functionality.
> >>> >>>> > > > >
> >>> >>>> > > > > ok i guess i will just try if it works.
> >>> >>>> > > > >
> >>> >>>> > > > > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <
> >>> jwills@cloudera.com
> >>> >>>> >
> >>> >>>> > > wrote:
> >>> >>>> > > > > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <
> >>> >>>> > dlieu.7@gmail.com
> >>> >>>> > > >
> >>> >>>> > > > > wrote:
> >>> >>>> > > > > >
> >>> >>>> > > > > >> Great! so it is in Crunch.
> >>> >>>> > > > > >>
> >>> >>>> > > > > >> does it support hadoop-job jar format or only pure java
> >>> jars?
> >>> >>>> > > > > >>
> >>> >>>> > > > > >
> >>> >>>> > > > > > I think just pure jars-- you're referring to hadoop-job
> >>> format
> >>> >>>> as
> >>> >>>> > > > having
> >>> >>>> > > > > > all the dependencies in a lib/ directory within the jar?
> >>> >>>> > > > > >
> >>> >>>> > > > > >
> >>> >>>> > > > > >>
> >>> >>>> > > > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <
> >>> >>>> jwills@cloudera.com>
> >>> >>>> > > > > wrote:
> >>> >>>> > > > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <
> >>> >>>> > > > dlieu.7@gmail.com>
> >>> >>>> > > > > >> wrote:
> >>> >>>> > > > > >> >
> >>> >>>> > > > > >> >> I think i need functionality to add more jars (or an external
> >>> >>>> > > > > >> >> hadoop-jar) to drive that from an R package. Just setting the job
> >>> >>>> > > > > >> >> jar by class is not enough. I can push the overall job-jar as an
> >>> >>>> > > > > >> >> additional jar to the R package; however, i cannot really run the
> >>> >>>> > > > > >> >> hadoop command line on it, i need to set up the classpath thru RJava.
> >>> >>>> > > > > >> >>
> >>> >>>> > > > > >> >> A traditional single hadoop job jar will unlikely work here since we
> >>> >>>> > > > > >> >> cannot hardcode pipelines in java code but rather have to construct
> >>> >>>> > > > > >> >> them on the fly. (well, we could serialize pipeline definitions from
> >>> >>>> > > > > >> >> R and then replay them in a driver -- but that's too cumbersome and
> >>> >>>> > > > > >> >> more work than it has to be.) There's no reason why i shouldn't be
> >>> >>>> > > > > >> >> able to do pig-like "register jar" or "setJobJar" (mahout-like) when
> >>> >>>> > > > > >> >> kicking off a pipeline.
> >>> >>>> > > > > >> >>
> >>> >>>> > > > > >> >
> >>> >>>> > > > > >> > o.a.c.util.DistCache.addJarToDistributedCache?
> >>> >>>> > > > > >> >
> >>> >>>> > > > > >> >
> >>> >>>> > > > > >> >>
> >>> >>>> > > > > >> >>
> >>> >>>> > > > > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <
> >>> >>>> > > > > dlieu.7@gmail.com>
> >>> >>>> > > > > >> >> wrote:
> >>> >>>> > > > > >> >> > Ok, sounds very promising...
> >>> >>>> > > > > >> >> >
> >>> >>>> > > > > >> >> > i'll try to start digging on the driver part this week then
> >>> >>>> > > > > >> >> > (Pipeline wrapper in R5).
> >>> >>>> > > > > >> >> >
> >>> >>>> > > > > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <
> >>> >>>> > > > josh.wills@gmail.com
> >>> >>>> > > > > >
> >>> >>>> > > > > >> >> wrote:
> >>> >>>> > > > > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy
> Lyubimov <
> >>> >>>> > > > > dlieu.7@gmail.com
> >>> >>>> > > > > >> >
> >>> >>>> > > > > >> >> wrote:
> >>> >>>> > > > > >> >> >>> Ok, cool.
> >>> >>>> > > > > >> >> >>>
> >>> >>>> > > > > >> >> >>> So what state is Crunch in? I take it is in a
> >>> fairly
> >>> >>>> > advanced
> >>> >>>> > > > > state.
> >>> >>>> > > > > >> >> >>> So every api mentioned in the  FlumeJava paper
> is
> >>> >>>> working ,
> >>> >>>> > > > > right?
> >>> >>>> > > > > >> Or
> >>> >>>> > > > > >> >> >>> there's something that is not working
> >>> specifically?
> >>> >>>> > > > > >> >> >>
> >>> >>>> > > > > >> >> >> I think the only thing in the paper that we don't have in a
> >>> >>>> > > > > >> >> >> working state is MSCR fusion. It's mostly just a question of
> >>> >>>> > > > > >> >> >> prioritizing it and getting the work done.
> >>> >>>> > > > > >> >> >>
> >>> >>>> > > > > >> >> >>>
> >>> >>>> > > > > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <
> >>> >>>> > > > jwills@cloudera.com
> >>> >>>> > > > > >
> >>> >>>> > > > > >> >> wrote:
> >>> >>>> > > > > >> >> >>>> Hey Dmitriy,
> >>> >>>> > > > > >> >> >>>>
> >>> >>>> > > > > >> >> >>>> Got a fork going and looking forward to playing
> >>> with
> >>> >>>> > crunchR
> >>> >>>> > > > > this
> >>> >>>> > > > > >> >> weekend--
> >>> >>>> > > > > >> >> >>>> thanks!
> >>> >>>> > > > > >> >> >>>>
> >>> >>>> > > > > >> >> >>>> J
> >>> >>>> > > > > >> >> >>>>
> >>> >>>> > > > > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy
> >>> Lyubimov <
> >>> >>>> > > > > >> dlieu.7@gmail.com>
> >>> >>>> > > > > >> >> wrote:
> >>> >>>> > > > > >> >> >>>>
> >>> >>>> > > > > >> >> >>>>> Project template: https://github.com/dlyubimov/crunchR
> >>> >>>> > > > > >> >> >>>>>
> >>> >>>> > > > > >> >> >>>>> The default profile does not compile the R artifact. The R
> >>> >>>> > > > > >> >> >>>>> profile compiles the R artifact. For convenience, it is enabled
> >>> >>>> > > > > >> >> >>>>> by supplying -DR to the mvn command line, e.g.
> >>> >>>> > > > > >> >> >>>>>
> >>> >>>> > > > > >> >> >>>>> mvn install -DR
> >>> >>>> > > > > >> >> >>>>>
> >>> >>>> > > > > >> >> >>>>> there's also a helper that installs the snapshot version of the
> >>> >>>> > > > > >> >> >>>>> package in the crunchR module.
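The -DR trick described above is typically implemented with Maven's property-based profile activation: the mere presence of the `R` system property turns the profile on, no value required. A minimal pom.xml fragment (profile id and plugin contents are illustrative, not taken from the crunchR repo):

```xml
<!-- Activated by `mvn install -DR`: the presence of the "R" system
     property enables the profile; no value is needed. -->
<profile>
  <id>r-artifacts</id>
  <activation>
    <property>
      <name>R</name>
    </property>
  </activation>
  <build>
    <plugins>
      <!-- plugins that build/install the R package would go here -->
    </plugins>
  </build>
</profile>
```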
> >>> >>>> > > > > >> >> >>>>>
> >>> >>>> > > > > >> >> >>>>> There are RJava and JRI java dependencies which i did not find
> >>> >>>> > > > > >> >> >>>>> anywhere in public maven repos; so they are installed into my
> >>> >>>> > > > > >> >> >>>>> github maven repo so far. Should compile for 3rd party.
> >>> >>>> > > > > >> >> >>>>>
> >>> >>>> > > > > >> >> >>>>> -DR compilation requires R, RJava and, optionally, RProtoBuf.
> >>> >>>> > > > > >> >> >>>>> R doc compilation requires roxygen2 (i think).
> >>> >>>> > > > > be
> >>> >>>> > > > > >> >> just a
> >>> >>>> > > > > >> >> >>>>> very
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> good synergy.
> >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> Is there anyone interested in
> >>> >>>> > > contributing/advising
> >>> >>>> > > > > for
> >>> >>>> > > > > >> open
> >>> >>>> > > > > >> >> >>>>> source
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> version of flume R support? Just
> >>> gauging
> >>> >>>> > interest,
> >>> >>>> > > > > Crunch
> >>> >>>> > > > > >> >> list
> >>> >>>> > > > > >> >> >>>>> seems
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> like a natural place to poke.
> >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> Thanks .
> >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> -Dmitriy
> >>> >>>> > > > > >> >> >>>>> >>>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>>> --
> >>> >>>> > > > > >> >> >>>>> >>>>>>> Director of Data Science
> >>> >>>> > > > > >> >> >>>>> >>>>>>> Cloudera
> >>> >>>> > > > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
> >>> >>>> > > > > >> >> >>>>> >>>
> >>> >>>> > > > > >> >> >>>>> >>>
> >>> >>>> > > > > >> >> >>>>> >>>
> >>> >>>> > > > > >> >> >>>>> >>
> >>> >>>> > > > > >> >> >>>>>
> >>> >>>> > > > > >> >> >>>>
> >>> >>>> > > > > >> >> >>>>
> >>> >>>> > > > > >> >> >>>>
> >>> >>>> > > > > >> >> >>>> --
> >>> >>>> > > > > >> >> >>>> Director of Data Science
> >>> >>>> > > > > >> >> >>>> Cloudera <http://www.cloudera.com>
> >>> >>>> > > > > >> >> >>>> Twitter: @josh_wills <
> >>> http://twitter.com/josh_wills>
> >>> >>>> > > > > >> >>
> >>> >>>> > > > > >> >
> >>> >>>> > > > > >> >
> >>> >>>> > > > > >> >
> >>> >>>> > > > > >> > --
> >>> >>>> > > > > >> > Director of Data Science
> >>> >>>> > > > > >> > Cloudera <http://www.cloudera.com>
> >>> >>>> > > > > >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >>> >>>> > > > > >>
> >>> >>>> > > > > >
> >>> >>>> > > > > >
> >>> >>>> > > > > >
> >>> >>>> > > > > > --
> >>> >>>> > > > > > Director of Data Science
> >>> >>>> > > > > > Cloudera <http://www.cloudera.com>
> >>> >>>> > > > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >>> >>>> > > > >
> >>> >>>> > > >
> >>> >>>> > >
> >>> >>>> >
> >>> >>>>
> >>> >>>>
> >>> >>>>
> >>> >>>> --
> >>> >>>> Director of Data Science
> >>> >>>> Cloudera <http://www.cloudera.com>
> >>> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
> >>> >>>>
> >>> >>>
> >>> >>>
> >>> >>
> >>>
> >>>
> >>>
> >>> --
> >>> Director of Data Science
> >>> Cloudera
> >>> Twitter: @josh_wills
> >>>
> >>
> >>
> >
>

Re: Flume R -- any interest?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Or RTNode? I guess I am not sure what the difference is.

Bottom line, I need to run some task startup routines (e.g. establish
exchange queues between the task and R) and also a final cleanup before the
MR task exits and _before all outputs are closed_ (a kind of "flush all"
step).

Thanks.
-d
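
Since a DoFn instance can be assumed to see the same emitter for its whole
lifecycle (per the confirmation earlier in this thread), the batching plus
flush-on-cleanup idea can be sketched roughly as below. The types here are
simplified stand-ins for Crunch's DoFn/Emitter, and the chunk size and class
names are illustrative only, not crunchR code:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for org.apache.crunch.Emitter; the real interface
// has a richer signature.
interface Emitter<T> {
    void emit(T emitted);
}

// Sketch of the batching pattern: buffer inputs, cache the emitter passed
// to process() (relying on the assumption that a DoFn instance gets the
// same emitter on every call), and flush any remainder in the cleanup
// hook before the task's outputs are closed.
class BatchingDoFn {
    private static final int CHUNK_SIZE = 3;          // arbitrary, for illustration
    private final List<String> buffer = new ArrayList<>();
    private Emitter<String> cachedEmitter;            // set on first process() call

    public void process(String input, Emitter<String> emitter) {
        cachedEmitter = emitter;                      // same object every call
        buffer.add(input);
        if (buffer.size() >= CHUNK_SIZE) {
            flush();                                  // ship a whole chunk at once
        }
    }

    // the "flush all" step: run once, before outputs close
    public void cleanup() {
        if (cachedEmitter != null) {
            flush();
        }
    }

    private void flush() {
        for (String datum : buffer) {
            cachedEmitter.emit(datum);
        }
        buffer.clear();
    }
}
```

The point of caching the emitter is that the R side can then return bulked
chunks with no per-datum emitter bookkeeping attached.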


On Fri, Nov 16, 2012 at 3:04 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> How do I hook into CrunchTaskContext to do a task cleanup (as opposed to a
> DoFn etc.) ?
>
>
> On Fri, Nov 16, 2012 at 2:52 PM, Dmitriy Lyubimov <dl...@gmail.com>wrote:
>
>> no it is fully distributed testing.
>>
>> It is ok, StatEt handles log4j logging for me so i see the logs. I was
>> wondering if any end-to-end diagnostics is already embedded in Crunch  but
>> reporting backend errors to front end is notoriously hard (and sometimes,
>> impossible) with hadoop, so I assume it doesn't make sense to report
>> client-only stuff thru exception while the other stuff still requires
>> checking isSucceeded().
>>
>>
>>
>> On Fri, Nov 16, 2012 at 11:07 AM, Josh Wills <jw...@cloudera.com> wrote:
>>
>>> Are you running this using LocalJobRunner? Does calling
>>> Pipeline.enableDebug() before run() help? If it doesn't, it'll help
>>> settle a debate I'm having w/Matthias. ;-)
>>>
>>> On Fri, Nov 16, 2012 at 10:22 AM, Dmitriy Lyubimov <dl...@gmail.com>
>>> wrote:
>>> > I see the error in the logs but Pipeline.run() has never thrown
>>> anything.
>>> > isSucceeded() subsequently returns false. Is there any way to extract
>>> > client-side problem rather than just being able to state that job
>>> failed?
>>> > or it is ok and the only diagnostics by design?
>>> >
>>> > ============
>>> > 68124 [Thread-8] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  -
>>> > org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
>>> > does not exist: hdfs://localhost:11010/crunchr-example/input
>>> > at
>>> >
>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
>>> > at
>>> >
>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
>>> > at
>>> org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944)
>>> > at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961)
>>> > at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>>> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
>>> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
>>> > at java.security.AccessController.doPrivileged(Native Method)
>>> > at javax.security.auth.Subject.doAs(Subject.java:396)
>>> > at
>>> >
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>>> > at
>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
>>> > at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
>>> > at
>>> >
>>> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:331)
>>> > at org.apache.crunch.impl.mr.exec.CrunchJob.submit(CrunchJob.java:135)
>>> > at
>>> >
>>> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:251)
>>> > at
>>> >
>>> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.run(CrunchJobControl.java:279)
>>> > at java.lang.Thread.run(Thread.java:662)
>>> >
>>> >
>>> > On Mon, Nov 12, 2012 at 5:41 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>> wrote:
>>> >
>>> >> for hadoop nodes i guess yet another option to soft-link the .so into
>>> >> hadoop's native lib folder
>>> >>
>>> >>
>>> >> On Mon, Nov 12, 2012 at 5:37 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>>> >wrote:
>>> >>
>>> >>> I actually want to defer this to hadoop admins, we just need to
>>> create a
>>> >>> procedure for setting up nodes. Ideally as simple as possible.
>>> something
>>> >>> like
>>> >>>
>>> >>> 1) setup R
>>> >>> 2) install.packages("rJava","RProtoBuf","crunchR")
>>> >>> 3) R CMD javareconf
>>> >>> 3) add result of R --vanilla <<< 'system.file("jri",
>>> package="rJava") to
>>> >>> either mapred command lines or LD_LIBRARY_PATH...
>>> >>>
>>> >>> but it will depend on their versions of hadoop, jre etc. I hoped
>>> crunch
>>> >>> might have something to hide a lot of that complexity (since it is
>>> about
>>> >>> hiding complexities, for the most part :)  ) besides hadoop has a
>>> way to
>>> >>> ship .so's to the backend so if crunch had an api to do something
>>> similar
>>> >>> it is conceivable that driver might yank and ship it too to hide that
>>> >>> complexity as well. But then there's a host of issues how to handle
>>> >>> potentially different rJava versions installed on different nodes...
>>> So, it
>>> >>> increasingly looks like something we might want to defer to sysops
>>> to do
>>> >>> with approximate set of requirements .
>>> >>>
>>> >>>
>>> >>> On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <jw...@cloudera.com>
>>> wrote:
>>> >>>
>>> >>>> On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <
>>> dlieu.7@gmail.com>
>>> >>>> wrote:
>>> >>>>
>>> >>>> > so java tasks need to be able to load libjri.so from
>>> >>>> > whatever system.file("jri", package="rJava") says.
>>> >>>> >
>>> >>>> > Traditionally, these issues were handled with -Djava.library.path.
>>> >>>> > Apparently there's nothing java task can do to enable
>>> loadLibrary()
>>> >>>> command
>>> >>>> > to see the damn library once started. But -Djava.library.path
>>> requires
>>> >>>> for
>>> >>>> > nodes to configure and lock jvm command line from modifications
>>> of the
>>> >>>> > client.  which is fine.
>>> >>>> >
>>> >>>> > I also discovered that LD_LIBRARY_PATH actually works with jre 1.6
>>> >>>> (again).
>>> >>>> >
>>> >>>> > but... any other suggestions about best practice configuring
>>> crunch to
>>> >>>> run
>>> >>>> > user's .so's?
>>> >>>> >
>>> >>>>
>>> >>>> Not off the top of my head. I suspect that whatever you come up
>>> with will
>>> >>>> become the "best practice." :)
>>> >>>>
>>> >>>> >
>>> >>>> > thanks.
>>> >>>> >
>>> >>>> >
>>> >>>> >
>>> >>>> >
>>> >>>> >
>>> >>>> >
>>> >>>> > On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <josh.wills@gmail.com
>>> >
>>> >>>> wrote:
>>> >>>> >
>>> >>>> > > I believe that is a safe assumption, at least right now.
>>> >>>> > >
>>> >>>> > >
>>> >>>> > > On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <
>>> dlieu.7@gmail.com
>>> >>>> >
>>> >>>> > > wrote:
>>> >>>> > >
>>> >>>> > > > Question.
>>> >>>> > > >
>>> >>>> > > > So in Crunch api, initialize() doesn't get an emitter. and the
>>> >>>> process
>>> >>>> > > gets
>>> >>>> > > > emitter every time.
>>> >>>> > > >
> >>> >>>> > > > However, my guess any single reincarnation of a DoFn object
>>> in the
>>> >>>> > > backend
>>> >>>> > > > will always be getting the same emitter thru its lifecycle.
>>> Is it
>>> >>>> an
>>> >>>> > > > admissible assumption or there's currently a counter example
>>> to
>>> >>>> that?
>>> >>>> > > >
>>> >>>> > > > The problem is that as i implement the two way pipeline of
>>> input
>>> >>>> and
>>> >>>> > > > emitter data between R and Java, I am bulking these calls
>>> together
>>> >>>> for
>>> >>>> > > > performance reasons. Each individual datum in these chunks of
>>> data
>>> >>>> will
>>> >>>> > > not
>>> >>>> > > > have attached emitter function information to them in any way.
>>> >>>> (well it
>>> >>>> > > > could but it would be a performance killer and i bet emitter
>>> never
>>> >>>> > > > changes).
>>> >>>> > > >
>>> >>>> > > > So, thoughts? can i assume emitter never changes between
>>> first and
> >>> >>>> last
>>> >>>> > > > call to DoFn instance?
>>> >>>> > > >
>>> >>>> > > > thanks.
>>> >>>> > > >
>>> >>>> > > >
>>> >>>> > > > On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <
>>> >>>> dlieu.7@gmail.com>
>>> >>>> > > > wrote:
>>> >>>> > > >
>>> >>>> > > > > yes...
>>> >>>> > > > >
>>> >>>> > > > > i think it worked for me before, although just adding all
>>> jars
>>> >>>> from R
>>> >>>> > > > > package distribution would be a little bit more appropriate
>>> >>>> approach
>>> >>>> > > > > -- but it creates a problem with jars in dependent R
>>> packages. I
>>> >>>> > think
>>> >>>> > > > > it would be much easier to just compile a hadoop-job file
>>> and
>>> >>>> stick
>>> >>>> > it
>>> >>>> > > > > in rather than doing cherry-picking of individual jars from
>>> who
>>> >>>> knows
>>> >>>> > > > > how many locations.
>>> >>>> > > > >
>>> >>>> > > > > i think i used the hadoop job format with distributed cache
>>> >>>> before
>>> >>>> > and
>>> >>>> > > > > it worked... at least with Pig "register jar" functionality.
>>> >>>> > > > >
>>> >>>> > > > > ok i guess i will just try if it works.
>>> >>>> > > > >
>>> >>>> > > > > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <
>>> jwills@cloudera.com
>>> >>>> >
>>> >>>> > > wrote:
>>> >>>> > > > > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <
>>> >>>> > dlieu.7@gmail.com
>>> >>>> > > >
>>> >>>> > > > > wrote:
>>> >>>> > > > > >
>>> >>>> > > > > >> Great! so it is in Crunch.
>>> >>>> > > > > >>
>>> >>>> > > > > >> does it support hadoop-job jar format or only pure java
>>> jars?
>>> >>>> > > > > >>
>>> >>>> > > > > >
>>> >>>> > > > > > I think just pure jars-- you're referring to hadoop-job
>>> format
>>> >>>> as
>>> >>>> > > > having
>>> >>>> > > > > > all the dependencies in a lib/ directory within the jar?
>>> >>>> > > > > >
>>> >>>> > > > > >
>>> >>>> > > > > >>
>>> >>>> > > > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <
>>> >>>> jwills@cloudera.com>
>>> >>>> > > > > wrote:
>>> >>>> > > > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <
>>> >>>> > > > dlieu.7@gmail.com>
>>> >>>> > > > > >> wrote:
>>> >>>> > > > > >> >
>>> >>>> > > > > >> >> I think i need functionality to add more jars (or
>>> external
>>> >>>> > > > > hadoop-jar)
>>> >>>> > > > > >> >> to drive that from an R package. Just setting job jar
>>> by
>>> >>>> class
>>> >>>> > is
>>> >>>> > > > not
>>> >>>> > > > > >> >> enough. I can push overall job-jar as an addiitonal
>>> jar to
>>> >>>> R
>>> >>>> > > > package;
>>> >>>> > > > > >> >> however, i cannot really run hadoop command line on
>>> it, i
>>> >>>> need
>>> >>>> > to
>>> >>>> > > > set
>>> >>>> > > > > >> >> up classpath thru RJava.
>>> >>>> > > > > >> >>
>>> >>>> > > > > >> >> Traditional single hadoop job jar will unlikely work
>>> here
>>> >>>> since
>>> >>>> > > we
>>> >>>> > > > > >> >> cannot hardcode pipelines in java code but rather
>>> have to
>>> >>>> > > construct
>>> >>>> > > > > >> >> them on the fly. (well, we could serialize pipeline
>>> >>>> definitions
>>> >>>> > > > from
>>> >>>> > > > > R
>>> >>>> > > > > >> >> and then replay them in a driver -- but that's too
>>> >>>> cumbersome
>>> >>>> > and
>>> >>>> > > > > more
>>> >>>> > > > > >> >> work than it has to be.) There's no reason why i
>>> shouldn't
>>> >>>> be
>>> >>>> > > able
>>> >>>> > > > to
>>> >>>> > > > > >> >> do pig-like "register jar" or "setJobJar"
>>> (mahout-like)
>>> >>>> when
>>> >>>> > > > kicking
>>> >>>> > > > > >> >> off a pipeline.
>>> >>>> > > > > >> >>
>>> >>>> > > > > >> >
>>> >>>> > > > > >> > o.a.c.util.DistCache.addJarToDistributedCache?
>>> >>>> > > > > >> >
>>> >>>> > > > > >> >
>>> >>>> > > > > >> >>
>>> >>>> > > > > >> >>
>>> >>>> > > > > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <
>>> >>>> > > > > dlieu.7@gmail.com>
>>> >>>> > > > > >> >> wrote:
>>> >>>> > > > > >> >> > Ok, sounds very promising...
>>> >>>> > > > > >> >> >
>>> >>>> > > > > >> >> > i'll try to start digging on the driver part this
>>> week
>>> >>>> then
>>> >>>> > > > > (Pipeline
>>> >>>> > > > > >> >> > wrapper in R5).
>>> >>>> > > > > >> >> >
>>> >>>> > > > > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <
>>> >>>> > > > josh.wills@gmail.com
>>> >>>> > > > > >
>>> >>>> > > > > >> >> wrote:
>>> >>>> > > > > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <
>>> >>>> > > > > dlieu.7@gmail.com
>>> >>>> > > > > >> >
>>> >>>> > > > > >> >> wrote:
>>> >>>> > > > > >> >> >>> Ok, cool.
>>> >>>> > > > > >> >> >>>
>>> >>>> > > > > >> >> >>> So what state is Crunch in? I take it is in a
>>> fairly
>>> >>>> > advanced
>>> >>>> > > > > state.
>>> >>>> > > > > >> >> >>> So every api mentioned in the  FlumeJava paper is
>>> >>>> working ,
>>> >>>> > > > > right?
>>> >>>> > > > > >> Or
>>> >>>> > > > > >> >> >>> there's something that is not working
>>> specifically?
>>> >>>> > > > > >> >> >>
>>> >>>> > > > > >> >> >> I think the only thing in the paper that we don't
>>> have
>>> >>>> in a
>>> >>>> > > > > working
>>> >>>> > > > > >> >> >> state is MSCR fusion. It's mostly just a question
>>> of
>>> >>>> > > > prioritizing
>>> >>>> > > > > it
>>> >>>> > > > > >> >> >> and getting the work done.
>>> >>>> > > > > >> >> >>
>>> >>>> > > > > >> >> >>>
>>> >>>> > > > > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <
>>> >>>> > > > jwills@cloudera.com
>>> >>>> > > > > >
>>> >>>> > > > > >> >> wrote:
>>> >>>> > > > > >> >> >>>> Hey Dmitriy,
>>> >>>> > > > > >> >> >>>>
>>> >>>> > > > > >> >> >>>> Got a fork going and looking forward to playing
>>> with
>>> >>>> > crunchR
>>> >>>> > > > > this
>>> >>>> > > > > >> >> weekend--
>>> >>>> > > > > >> >> >>>> thanks!
>>> >>>> > > > > >> >> >>>>
>>> >>>> > > > > >> >> >>>> J
>>> >>>> > > > > >> >> >>>>
>>> >>>> > > > > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy
>>> Lyubimov <
>>> >>>> > > > > >> dlieu.7@gmail.com>
>>> >>>> > > > > >> >> wrote:
>>> >>>> > > > > >> >> >>>>
>>> >>>> > > > > >> >> >>>>> Project template
>>> >>>> https://github.com/dlyubimov/crunchR
>>> >>>> > > > > >> >> >>>>>
>>> >>>> > > > > >> >> >>>>> Default profile does not compile R artifact . R
>>> >>>> profile
>>> >>>> > > > > compiles R
>>> >>>> > > > > >> >> >>>>> artifact. for convenience, it is enabled by
>>> >>>> supplying -DR
>>> >>>> > > to
>>> >>>> > > > > mvn
>>> >>>> > > > > >> >> >>>>> command line, e.g.
>>> >>>> > > > > >> >> >>>>>
>>> >>>> > > > > >> >> >>>>> mvn install -DR
>>> >>>> > > > > >> >> >>>>>
>>> >>>> > > > > >> >> >>>>> there's also a helper that installs the snapshot
>>> >>>> version
>>> >>>> > of
>>> >>>> > > > the
>>> >>>> > > > > >> >> >>>>> package in the crunchR module.
>>> >>>> > > > > >> >> >>>>>
>>> >>>> > > > > >> >> >>>>> There's RJava and JRI java dependencies which i
>>> did
>>> >>>> not
>>> >>>> > > find
>>> >>>> > > > > >> anywhere
>>> >>>> > > > > >> >> >>>>> in public maven repos; so it is installed into
>>> my
>>> >>>> github
>>> >>>> > > > maven
>>> >>>> > > > > >> repo
>>> >>>> > > > > >> >> so
>>> >>>> > > > > >> >> >>>>> far. Should compile for 3rd party.
>>> >>>> > > > > >> >> >>>>>
>>> >>>> > > > > >> >> >>>>> -DR compilation requires R, RJava and
>>> optionally,
>>> >>>> > > RProtoBuf.
>>> >>>> > > > R
>>> >>>> > > > > Doc
>>> >>>> > > > > >> >> >>>>> compilation requires roxygen2 (i think).
>>> >>>> > > > > >> >> >>>>>
>>> >>>> > > > > >> >> >>>>> For some reason RProtoBuf fails to import into
>>> >>>> another
>>> >>>> > > > package,
>>> >>>> > > > > >> got a
>>> >>>> > > > > >> >> >>>>> weird exception when i put @import RProtoBuf
>>> into
>>> >>>> > crunchR,
>>> >>>> > > so
>>> >>>> > > > > >> >> >>>>> RProtoBuf is now in "Suggests" category. Down
>>> the
>>> >>>> road
>>> >>>> > that
>>> >>>> > > > may
>>> >>>> > > > > >> be a
>>> >>>> > > > > >> >> >>>>> problem though...
>>> >>>> > > > > >> >> >>>>>
>>> >>>> > > > > >> >> >>>>> other than the template, not much else has been
>>> done
>>> >>>> so
>>> >>>> > > > far...
>>> >>>> > > > > >> >> finding
>>> >>>> > > > > >> >> >>>>> hadoop libraries and adding it to the package
>>> path on
>>> >>>> > > > > >> initialization
>>> >>>> > > > > >> >> >>>>> via "hadoop classpath"... adding Crunch jars
>>> and its
>>> >>>> > > > > >> non-"provided"
>>> >>>> > > > > >> >> >>>>> transitives to the crunchR's java part...
>>> >>>> > > > > >> >> >>>>>
>>> >>>> > > > > >> >> >>>>> No legal stuff...
>>> >>>> > > > > >> >> >>>>>
>>> >>>> > > > > >> >> >>>>> No readmes... complete stealth at this point.
>>> >>>> > > > > >> >> >>>>>
>>> >>>> > > > > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy
>>> Lyubimov <
>>> >>>> > > > > >> >> dlieu.7@gmail.com>
>>> >>>> > > > > >> >> >>>>> wrote:
>>> >>>> > > > > >> >> >>>>> > Ok, cool. I will try to roll project template
>>> by
>>> >>>> some
>>> >>>> > > time
>>> >>>> > > > > next
>>> >>>> > > > > >> >> week.
>>> >>>> > > > > >> >> >>>>> > we can start with prototyping and benchmarking
>>> >>>> > something
>>> >>>> > > > > really
>>> >>>> > > > > >> >> >>>>> > simple, such as parallelDo().
>>> >>>> > > > > >> >> >>>>> >
>>> >>>> > > > > >> >> >>>>> > My interim goal is to perhaps take some more
>>> or
>>> >>>> less
>>> >>>> > > simple
>>> >>>> > > > > >> >> algorithm
>>> >>>> > > > > >> >> >>>>> > from Mahout and demonstrate it can be solved
>>> with
>>> >>>> > Rcrunch
>>> >>>> > > > (or
>>> >>>> > > > > >> >> whatever
>>> >>>> > > > > >> >> >>>>> > name it has to be) in a comparable time
>>> >>>> (performance)
>>> >>>> > but
>>> >>>> > > > > with
>>> >>>> > > > > >> much
>>> >>>> > > > > >> >> >>>>> > fewer lines of code. (say one of
>>> factorization or
>>> >>>> > > > clustering
>>> >>>> > > > > >> >> things)
>>> >>>> > > > > >> >> >>>>> >
>>> >>>> > > > > >> >> >>>>> >
>>> >>>> > > > > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <
>>> >>>> > > rsharma@xebia.com
>>> >>>> > > > >
>>> >>>> > > > > >> wrote:
>>> >>>> > > > > >> >> >>>>> >> I am not much of R user but I am interested
>>> to
>>> >>>> see how
>>> >>>> > > > well
>>> >>>> > > > > we
>>> >>>> > > > > >> can
>>> >>>> > > > > >> >> >>>>> integrate
>>> >>>> > > > > >> >> >>>>> >> the two. I would be happy to help.
>>> >>>> > > > > >> >> >>>>> >>
>>> >>>> > > > > >> >> >>>>> >> regards,
>>> >>>> > > > > >> >> >>>>> >> Rahul
>>> >>>> > > > > >> >> >>>>> >>
>>> >>>> > > > > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
>>> >>>> > > > > >> >> >>>>> >>>
>>> >>>> > > > > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy
>>> >>>> Lyubimov <
>>> >>>> > > > > >> >> dlieu.7@gmail.com>
>>> >>>> > > > > >> >> >>>>> >>> wrote:
>>> >>>> > > > > >> >> >>>>> >>>>
>>> >>>> > > > > >> >> >>>>> >>>> Yep, ok.
>>> >>>> > > > > >> >> >>>>> >>>>
>>> >>>> > > > > >> >> >>>>> >>>> I imagine it has to be an R module so I
>>> can set
>>> >>>> up a
>>> >>>> > > > maven
>>> >>>> > > > > >> >> project
>>> >>>> > > > > >> >> >>>>> >>>> with java/R code tree (I have been doing
>>> that a
>>> >>>> lot
>>> >>>> > > > > lately).
>>> >>>> > > > > >> Or
>>> >>>> > > > > >> >> if you
>>> >>>> > > > > >> >> >>>>> >>>> have a template to look at, it would be
>>> useful i
>>> >>>> > guess
>>> >>>> > > > > too.
>>> >>>> > > > > >> >> >>>>> >>>
>>> >>>> > > > > >> >> >>>>> >>> No, please go right ahead.
>>> >>>> > > > > >> >> >>>>> >>>
>>> >>>> > > > > >> >> >>>>> >>>>
>>> >>>> > > > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh
>>> Wills <
>>> >>>> > > > > >> >> josh.wills@gmail.com>
>>> >>>> > > > > >> >> >>>>> wrote:
>>> >>>> > > > > >> >> >>>>> >>>>>
>>> >>>> > > > > >> >> >>>>> >>>>> I'd like it to be separate at first, but
>>> I am
>>> >>>> happy
>>> >>>> > > to
>>> >>>> > > > > help.
>>> >>>> > > > > >> >> Github
>>> >>>> > > > > >> >> >>>>> >>>>> repo?
>>> >>>> > > > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy
>>> Lyubimov" <
>>> >>>> > > > > >> dlieu.7@gmail.com
>>> >>>> > > > > >> >> >
>>> >>>> > > > > >> >> >>>>> wrote:
>>> >>>> > > > > >> >> >>>>> >>>>>
>>> >>>> > > > > >> >> >>>>> >>>>>> Ok maybe there's a benefit to try a
>>> JRI/RJava
>>> >>>> > > > prototype
>>> >>>> > > > > on
>>> >>>> > > > > >> >> top of
>>> >>>> > > > > >> >> >>>>> >>>>>> Crunch for something simple. This should
>>> both
>>> >>>> save
>>> >>>> > > > time
>>> >>>> > > > > and
>>> >>>> > > > > >> >> prove or
>>> >>>> > > > > >> >> >>>>> >>>>>> disprove if Crunch via RJava integration
>>> is
>>> >>>> > viable.
>>> >>>> > > > > >> >> >>>>> >>>>>>
>>> >>>> > > > > >> >> >>>>> >>>>>> On my part i can try to do it within
>>> Crunch
>>> >>>> > > framework
>>> >>>> > > > > or we
>>> >>>> > > > > >> >> can keep
>>> >>>> > > > > >> >> >>>>> >>>>>> it completely separate.
>>> >>>> > > > > >> >> >>>>> >>>>>>
>>> >>>> > > > > >> >> >>>>> >>>>>> -d
>>> >>>> > > > > >> >> >>>>> >>>>>>
>>> >>>> > > > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh
>>> Wills <
>>> >>>> > > > > >> >> jwills@cloudera.com>
>>> >>>> > > > > >> >> >>>>> >>>>>> wrote:
>>> >>>> > > > > >> >> >>>>> >>>>>>>
>>> >>>> > > > > >> >> >>>>> >>>>>>> I am an avid R user and would be into
>>> it--
>>> >>>> who
>>> >>>> > gave
>>> >>>> > > > the
>>> >>>> > > > > >> >> talk? Was
>>> >>>> > > > > >> >> >>>>> it
>>> >>>> > > > > >> >> >>>>> >>>>>>> Murray Stokely?
>>> >>>> > > > > >> >> >>>>> >>>>>>>
>>> >>>> > > > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy
>>> >>>> > Lyubimov <
>>> >>>> > > > > >> >> >>>>> dlieu.7@gmail.com>
>>> >>>> > > > > >> >> >>>>> >>>>>>
>>> >>>> > > > > >> >> >>>>> >>>>>> wrote:
>>> >>>> > > > > >> >> >>>>> >>>>>>>>
>>> >>>> > > > > >> >> >>>>> >>>>>>>> Hello,
>>> >>>> > > > > >> >> >>>>> >>>>>>>>
>>> >>>> > > > > >> >> >>>>> >>>>>>>> I was pretty excited to learn of
>>> Google's
>>> >>>> > > experience
>>> >>>> > > > > of R
>>> >>>> > > > > >> >> mapping
>>> >>>> > > > > >> >> >>>>> of
>>> >>>> > > > > >> >> >>>>> >>>>>>>> flume java on one of recent BARUGs. I
>>> think
>>> >>>> a
>>> >>>> > lot
>>> >>>> > > of
>>> >>>> > > > > >> >> applications
>>> >>>> > > > > >> >> >>>>> >>>>>>>> similar to what we do in Mahout could
>>> be
>>> >>>> > > prototyped
>>> >>>> > > > > using
>>> >>>> > > > > >> >> flume R.
>>> >>>> > > > > >> >> >>>>> >>>>>>>>
>>> >>>> > > > > >> >> >>>>> >>>>>>>> I did not quite get the details of
>>> Google
>>> >>>> > > > > implementation
>>> >>>> > > > > >> of
>>> >>>> > > > > >> >> R
>>> >>>> > > > > >> >> >>>>> >>>>>>>> mapping,
>>> >>>> > > > > >> >> >>>>> >>>>>>>> but i am not sure if just a direct
>>> mapping
>>> >>>> from
>>> >>>> > R
>>> >>>> > > to
>>> >>>> > > > > >> Crunch
>>> >>>> > > > > >> >> would
>>> >>>> > > > > >> >> >>>>> be
>>> >>>> > > > > >> >> >>>>> >>>>>>>> sufficient (and, for most part,
>>> efficient).
>>> >>>> > > > RJava/JRI
>>> >>>> > > > > and
>>> >>>> > > > > >> >> jni
>>> >>>> > > > > >> >> >>>>> seem to
>>> >>>> > > > > >> >> >>>>> >>>>>>>> be a pretty terrible performer to do
>>> that
>>> >>>> > > directly.
>>> >>>> > > > > >> >> >>>>> >>>>>>>>
>>> >>>> > > > > >> >> >>>>> >>>>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> on top of it, I am thinking if this
>>> project
>>> >>>> > could
>>> >>>> > > > > have a
>>> >>>> > > > > >> >> >>>>> contributed
>>> >>>> > > > > >> >> >>>>> >>>>>>>> adapter to Mahout's distributed
>>> matrices,
>>> >>>> that
>>> >>>> > > would
>>> >>>> > > > > be
>>> >>>> > > > > >> >> just a
>>> >>>> > > > > >> >> >>>>> very
>>> >>>> > > > > >> >> >>>>> >>>>>>>> good synergy.
>>> >>>> > > > > >> >> >>>>> >>>>>>>>
>>> >>>> > > > > >> >> >>>>> >>>>>>>> Is there anyone interested in
>>> >>>> > > contributing/advising
>>> >>>> > > > > for
>>> >>>> > > > > >> open
>>> >>>> > > > > >> >> >>>>> source
>>> >>>> > > > > >> >> >>>>> >>>>>>>> version of flume R support? Just
>>> gauging
>>> >>>> > interest,
>>> >>>> > > > > Crunch
>>> >>>> > > > > >> >> list
>>> >>>> > > > > >> >> >>>>> seems
>>> >>>> > > > > >> >> >>>>> >>>>>>>> like a natural place to poke.
>>> >>>> > > > > >> >> >>>>> >>>>>>>>
>>> >>>> > > > > >> >> >>>>> >>>>>>>> Thanks .
>>> >>>> > > > > >> >> >>>>> >>>>>>>>
>>> >>>> > > > > >> >> >>>>> >>>>>>>> -Dmitriy
>>> >>>> > > > > >> >> >>>>> >>>>>>>
>>> >>>> > > > > >> >> >>>>> >>>>>>>
>>> >>>> > > > > >> >> >>>>> >>>>>>>
>>> >>>> > > > > >> >> >>>>> >>>>>>> --
>>> >>>> > > > > >> >> >>>>> >>>>>>> Director of Data Science
>>> >>>> > > > > >> >> >>>>> >>>>>>> Cloudera
>>> >>>> > > > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
>>> >>>> > > > > >> >> >>>>> >>>
>>> >>>> > > > > >> >> >>>>> >>>
>>> >>>> > > > > >> >> >>>>> >>>
>>> >>>> > > > > >> >> >>>>> >>
>>> >>>> > > > > >> >> >>>>>
>>> >>>> > > > > >> >> >>>>
>>> >>>> > > > > >> >> >>>>
>>> >>>> > > > > >> >> >>>>
>>> >>>> > > > > >> >> >>>> --
>>> >>>> > > > > >> >> >>>> Director of Data Science
>>> >>>> > > > > >> >> >>>> Cloudera <http://www.cloudera.com>
>>> >>>> > > > > >> >> >>>> Twitter: @josh_wills <
>>> http://twitter.com/josh_wills>
>>> >>>> > > > > >> >>
>>> >>>> > > > > >> >
>>> >>>> > > > > >> >
>>> >>>> > > > > >> >
>>> >>>> > > > > >> > --
>>> >>>> > > > > >> > Director of Data Science
>>> >>>> > > > > >> > Cloudera <http://www.cloudera.com>
>>> >>>> > > > > >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
>>> >>>> > > > > >>
>>> >>>> > > > > >
>>> >>>> > > > > >
>>> >>>> > > > > >
>>> >>>> > > > > > --
>>> >>>> > > > > > Director of Data Science
>>> >>>> > > > > > Cloudera <http://www.cloudera.com>
>>> >>>> > > > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
>>> >>>> > > > >
>>> >>>> > > >
>>> >>>> > >
>>> >>>> >
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> Director of Data Science
>>> >>>> Cloudera <http://www.cloudera.com>
>>> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>> >>>>
>>> >>>
>>> >>>
>>> >>
>>>
>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera
>>> Twitter: @josh_wills
>>>
>>
>>
>
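
On the native-library question quoted in this thread (making libjri.so
visible to task JVMs via -Djava.library.path or LD_LIBRARY_PATH), the
search-path mechanics can be sketched as follows. This is illustrative
only, with a hypothetical class name; the actual loading step is just
System.loadLibrary("jri") once the path is configured:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: compute the candidate file locations a task JVM
// would probe for a native library, given a java.library.path-style value.
// System.mapLibraryName turns "jri" into the platform-specific name
// (libjri.so on Linux), which is why configuring the search path is all
// that is needed before calling System.loadLibrary("jri").
class NativeLibProbe {
    static List<String> candidates(String searchPath, String libName) {
        String mapped = System.mapLibraryName(libName);
        List<String> paths = new ArrayList<>();
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (!dir.isEmpty()) {
                paths.add(dir + File.separator + mapped);
            }
        }
        return paths;
    }
}
```

This is also why appending the output of system.file("jri", package =
"rJava") to LD_LIBRARY_PATH on each node is sufficient: it simply adds one
more directory to this probe list.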

Re: Flume R -- any interest?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
How do I hook into CrunchTaskContext to do a task cleanup (as opposed to a
DoFn etc.)?
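
The caveat quoted below, that Pipeline.run() does not throw on backend
failures and the driver must check the result status, suggests a defensive
driver pattern like the sketch here. The types are simplified stand-ins for
illustration, not the actual Crunch result API:

```java
// Simplified stand-in for the result object a pipeline run returns; the
// real Crunch API differs, but the thread's point holds: backend failures
// do not surface as client-side exceptions, so the driver must check
// status explicitly after run() returns.
class RunResult {
    private final boolean succeeded;
    RunResult(boolean succeeded) { this.succeeded = succeeded; }
    boolean isSucceeded() { return succeeded; }
}

class Driver {
    // Defensive pattern: never assume run() throws on failure; inspect the
    // result and point the user at the backend task logs for details.
    static String report(RunResult result) {
        if (!result.isSucceeded()) {
            return "pipeline failed -- check backend task logs";
        }
        return "pipeline succeeded";
    }
}
```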


On Fri, Nov 16, 2012 at 2:52 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> no it is fully distributed testing.
>
> It is ok, StatEt handles log4j logging for me so i see the logs. I was
> wondering if any end-to-end diagnostics is already embedded in Crunch  but
> reporting backend errors to front end is notoriously hard (and sometimes,
> impossible) with hadoop, so I assume it doesn't make sense to report
> client-only stuff thru exception while the other stuff still requires
> checking isSucceeded().
>
>
>
> On Fri, Nov 16, 2012 at 11:07 AM, Josh Wills <jw...@cloudera.com> wrote:
>
>> Are you running this using LocalJobRunner? Does calling
>> Pipeline.enableDebug() before run() help? If it doesn't, it'll help
>> settle a debate I'm having w/Matthias. ;-)
>>
>> On Fri, Nov 16, 2012 at 10:22 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> > I see the error in the logs but Pipeline.run() has never thrown
>> anything.
>> > isSucceeded() subsequently returns false. Is there any way to extract
>> > client-side problem rather than just being able to state that job
>> failed?
>> > or it is ok and the only diagnostics by design?
>> >
>> > ============
>> > 68124 [Thread-8] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  -
>> > org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
>> > does not exist: hdfs://localhost:11010/crunchr-example/input
>> > at
>> >
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
>> > at
>> >
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
>> > at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944)
>> > at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961)
>> > at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
>> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
>> > at java.security.AccessController.doPrivileged(Native Method)
>> > at javax.security.auth.Subject.doAs(Subject.java:396)
>> > at
>> >
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>> > at
>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
>> > at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
>> > at
>> >
>> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:331)
>> > at org.apache.crunch.impl.mr.exec.CrunchJob.submit(CrunchJob.java:135)
>> > at
>> >
>> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:251)
>> > at
>> >
>> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.run(CrunchJobControl.java:279)
>> > at java.lang.Thread.run(Thread.java:662)
>> >
>> >
>> > On Mon, Nov 12, 2012 at 5:41 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> >
>> >> for hadoop nodes i guess yet another option to soft-link the .so into
>> >> hadoop's native lib folder
>> >>
>> >>
>> >> On Mon, Nov 12, 2012 at 5:37 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>> >wrote:
>> >>
>> >>> I actually want to defer this to hadoop admins, we just need to
>> create a
>> >>> procedure for setting up nodes. Ideally as simple as possible.
>> something
>> >>> like
>> >>>
>> >>> 1) setup R
>> >>> 2) install.packages("rJava","RProtoBuf","crunchR")
>> >>> 3) R CMD javareconf
>> >>> 4) add result of R --vanilla <<< 'system.file("jri", package="rJava")'
>> to
>> >>> either mapred command lines or LD_LIBRARY_PATH...
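Concretely, the node-setup procedure sketched in the quoted list might look like this (package names and the LD_LIBRARY_PATH approach are illustrative assumptions; the JRI path reported by rJava varies per installation):

```sh
# Hypothetical node-setup script; assumes R itself is already installed.
# 2) install the R-side dependencies
Rscript -e 'install.packages(c("rJava", "RProtoBuf", "crunchR"))'
# 3) reconcile R's Java configuration with this node's JRE
R CMD javareconf
# 4) expose libjri.so to task JVMs via LD_LIBRARY_PATH
#    (or add -Djava.library.path to the mapred child JVM options)
JRI_DIR=$(Rscript -e 'cat(system.file("jri", package="rJava"))')
export LD_LIBRARY_PATH="$JRI_DIR:$LD_LIBRARY_PATH"
```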
>> >>>
>> >>> but it will depend on their versions of hadoop, jre etc. I hoped
>> crunch
>> >>> might have something to hide a lot of that complexity (since it is
>> about
>> >>> hiding complexities, for the most part :)  ) besides hadoop has a way
>> to
>> >>> ship .so's to the backend so if crunch had an api to do something
>> similar
>> >>> it is conceivable that driver might yank and ship it too to hide that
>> >>> complexity as well. But then there's a host of issues how to handle
>> >>> potentially different rJava versions installed on different nodes...
>> So, it
>> >>> increasingly looks like something we might want to defer to sysops to
>> do
>> >>> with approximate set of requirements .
>> >>>
>> >>>
>> >>> On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <jw...@cloudera.com>
>> wrote:
>> >>>
>> >>>> On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>> >
>> >>>> wrote:
>> >>>>
>> >>>> > so java tasks need to be able to load libjri.so from
>> >>>> > whatever system.file("jri", package="rJava") says.
>> >>>> >
>> >>>> > Traditionally, these issues were handled with -Djava.library.path.
>> >>>> > Apparently there's nothing java task can do to enable loadLibrary()
>> >>>> command
>> >>>> > to see the damn library once started. But -Djava.library.path
>> requires
>> >>>> for
>> >>>> > nodes to configure and lock jvm command line from modifications of
>> the
>> >>>> > client.  which is fine.
>> >>>> >
>> >>>> > I also discovered that LD_LIBRARY_PATH actually works with jre 1.6
>> >>>> (again).
>> >>>> >
>> >>>> > but... any other suggestions about best practice configuring
>> crunch to
>> >>>> run
>> >>>> > user's .so's?
>> >>>> >
>> >>>>
>> >>>> Not off the top of my head. I suspect that whatever you come up with
>> will
>> >>>> become the "best practice." :)
>> >>>>
>> >>>> >
>> >>>> > thanks.
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <jo...@gmail.com>
>> >>>> wrote:
>> >>>> >
>> >>>> > > I believe that is a safe assumption, at least right now.
>> >>>> > >
>> >>>> > >
>> >>>> > > On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <
>> dlieu.7@gmail.com
>> >>>> >
>> >>>> > > wrote:
>> >>>> > >
>> >>>> > > > Question.
>> >>>> > > >
>> >>>> > > > So in Crunch api, initialize() doesn't get an emitter. and the
>> >>>> process
>> >>>> > > gets
>> >>>> > > > emitter every time.
>> >>>> > > >
>> >>>> > > > However, my guess is any single reincarnation of a DoFn object in
>> the
>> >>>> > > backend
>> >>>> > > > will always be getting the same emitter thru its lifecycle. Is
>> it
>> >>>> an
>> >>>> > > > admissible assumption or there's currently a counter example to
>> >>>> that?
>> >>>> > > >
>> >>>> > > > The problem is that as i implement the two way pipeline of
>> input
>> >>>> and
>> >>>> > > > emitter data between R and Java, I am bulking these calls
>> together
>> >>>> for
>> >>>> > > > performance reasons. Each individual datum in these chunks of
>> data
>> >>>> will
>> >>>> > > not
>> >>>> > > > have attached emitter function information to them in any way.
>> >>>> (well it
>> >>>> > > > could but it would be a performance killer and i bet emitter
>> never
>> >>>> > > > changes).
>> >>>> > > >
>> >>>> > > > So, thoughts? can i assume emitter never changes between first
>> and
>> >>>> last
>> >>>> > > > call to DoFn instance?
>> >>>> > > >
>> >>>> > > > thanks.
>> >>>> > > >
>> >>>> > > >
>> >>>> > > > On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <
>> >>>> dlieu.7@gmail.com>
>> >>>> > > > wrote:
>> >>>> > > >
>> >>>> > > > > yes...
>> >>>> > > > >
>> >>>> > > > > i think it worked for me before, although just adding all
>> jars
>> >>>> from R
>> >>>> > > > > package distribution would be a little bit more appropriate
>> >>>> approach
>> >>>> > > > > -- but it creates a problem with jars in dependent R
>> packages. I
>> >>>> > think
>> >>>> > > > > it would be much easier to just compile a hadoop-job file and
>> >>>> stick
>> >>>> > it
>> >>>> > > > > in rather than doing cherry-picking of individual jars from
>> who
>> >>>> knows
>> >>>> > > > > how many locations.
>> >>>> > > > >
>> >>>> > > > > i think i used the hadoop job format with distributed cache
>> >>>> before
>> >>>> > and
>> >>>> > > > > it worked... at least with Pig "register jar" functionality.
>> >>>> > > > >
>> >>>> > > > > ok i guess i will just try if it works.
>> >>>> > > > >
>> >>>> > > > > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <
>> jwills@cloudera.com
>> >>>> >
>> >>>> > > wrote:
>> >>>> > > > > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <
>> >>>> > dlieu.7@gmail.com
>> >>>> > > >
>> >>>> > > > > wrote:
>> >>>> > > > > >
>> >>>> > > > > >> Great! so it is in Crunch.
>> >>>> > > > > >>
>> >>>> > > > > >> does it support hadoop-job jar format or only pure java
>> jars?
>> >>>> > > > > >>
>> >>>> > > > > >
>> >>>> > > > > > I think just pure jars-- you're referring to hadoop-job
>> format
>> >>>> as
>> >>>> > > > having
>> >>>> > > > > > all the dependencies in a lib/ directory within the jar?
>> >>>> > > > > >
>> >>>> > > > > >
>> >>>> > > > > >>
>> >>>> > > > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <
>> >>>> jwills@cloudera.com>
>> >>>> > > > > wrote:
>> >>>> > > > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <
>> >>>> > > > dlieu.7@gmail.com>
>> >>>> > > > > >> wrote:
>> >>>> > > > > >> >
>> >>>> > > > > >> >> I think i need functionality to add more jars (or
>> external
>> >>>> > > > > hadoop-jar)
>> >>>> > > > > >> >> to drive that from an R package. Just setting job jar
>> by
>> >>>> class
>> >>>> > is
>> >>>> > > > not
>> >>>> > > > > >> >> enough. I can push overall job-jar as an additional
>> jar to
>> >>>> R
>> >>>> > > > package;
>> >>>> > > > > >> >> however, i cannot really run hadoop command line on
>> it, i
>> >>>> need
>> >>>> > to
>> >>>> > > > set
>> >>>> > > > > >> >> up classpath thru RJava.
>> >>>> > > > > >> >>
>> >>>> > > > > >> >> Traditional single hadoop job jar will unlikely work
>> here
>> >>>> since
>> >>>> > > we
>> >>>> > > > > >> >> cannot hardcode pipelines in java code but rather have
>> to
>> >>>> > > construct
>> >>>> > > > > >> >> them on the fly. (well, we could serialize pipeline
>> >>>> definitions
>> >>>> > > > from
>> >>>> > > > > R
>> >>>> > > > > >> >> and then replay them in a driver -- but that's too
>> >>>> cumbersome
>> >>>> > and
>> >>>> > > > > more
>> >>>> > > > > >> >> work than it has to be.) There's no reason why i
>> shouldn't
>> >>>> be
>> >>>> > > able
>> >>>> > > > to
>> >>>> > > > > >> >> do pig-like "register jar" or "setJobJar" (mahout-like)
>> >>>> when
>> >>>> > > > kicking
>> >>>> > > > > >> >> off a pipeline.
>> >>>> > > > > >> >>
>> >>>> > > > > >> >
>> >>>> > > > > >> > o.a.c.util.DistCache.addJarToDistributedCache?
>> >>>> > > > > >> >
>> >>>> > > > > >> >
>> >>>> > > > > >> >>
>> >>>> > > > > >> >>
>> >>>> > > > > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <
>> >>>> > > > > dlieu.7@gmail.com>
>> >>>> > > > > >> >> wrote:
>> >>>> > > > > >> >> > Ok, sounds very promising...
>> >>>> > > > > >> >> >
>> >>>> > > > > >> >> > i'll try to start digging on the driver part this
>> week
>> >>>> then
>> >>>> > > > > (Pipeline
>> >>>> > > > > >> >> > wrapper in R5).
>> >>>> > > > > >> >> >
>> >>>> > > > > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <
>> >>>> > > > josh.wills@gmail.com
>> >>>> > > > > >
>> >>>> > > > > >> >> wrote:
>> >>>> > > > > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <
>> >>>> > > > > dlieu.7@gmail.com
>> >>>> > > > > >> >
>> >>>> > > > > >> >> wrote:
>> >>>> > > > > >> >> >>> Ok, cool.
>> >>>> > > > > >> >> >>>
>> >>>> > > > > >> >> >>> So what state is Crunch in? I take it is in a
>> fairly
>> >>>> > advanced
>> >>>> > > > > state.
>> >>>> > > > > >> >> >>> So every api mentioned in the  FlumeJava paper is
>> >>>> working ,
>> >>>> > > > > right?
>> >>>> > > > > >> Or
>> >>>> > > > > >> >> >>> there's something that is not working specifically?
>> >>>> > > > > >> >> >>
>> >>>> > > > > >> >> >> I think the only thing in the paper that we don't
>> have
>> >>>> in a
>> >>>> > > > > working
>> >>>> > > > > >> >> >> state is MSCR fusion. It's mostly just a question of
>> >>>> > > > prioritizing
>> >>>> > > > > it
>> >>>> > > > > >> >> >> and getting the work done.
>> >>>> > > > > >> >> >>
>> >>>> > > > > >> >> >>>
>> >>>> > > > > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <
>> >>>> > > > jwills@cloudera.com
>> >>>> > > > > >
>> >>>> > > > > >> >> wrote:
>> >>>> > > > > >> >> >>>> Hey Dmitriy,
>> >>>> > > > > >> >> >>>>
>> >>>> > > > > >> >> >>>> Got a fork going and looking forward to playing
>> with
>> >>>> > crunchR
>> >>>> > > > > this
>> >>>> > > > > >> >> weekend--
>> >>>> > > > > >> >> >>>> thanks!
>> >>>> > > > > >> >> >>>>
>> >>>> > > > > >> >> >>>> J
>> >>>> > > > > >> >> >>>>
>> >>>> > > > > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov
>> <
>> >>>> > > > > >> dlieu.7@gmail.com>
>> >>>> > > > > >> >> wrote:
>> >>>> > > > > >> >> >>>>
>> >>>> > > > > >> >> >>>>> Project template
>> >>>> https://github.com/dlyubimov/crunchR
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> Default profile does not compile R artifact . R
>> >>>> profile
>> >>>> > > > > compiles R
>> >>>> > > > > >> >> >>>>> artifact. for convenience, it is enabled by
>> >>>> supplying -DR
>> >>>> > > to
>> >>>> > > > > mvn
>> >>>> > > > > >> >> >>>>> command line, e.g.
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> mvn install -DR
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> there's also a helper that installs the snapshot
>> >>>> version
>> >>>> > of
>> >>>> > > > the
>> >>>> > > > > >> >> >>>>> package in the crunchR module.
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> There's RJava and JRI java dependencies which i
>> did
>> >>>> not
>> >>>> > > find
>> >>>> > > > > >> anywhere
>> >>>> > > > > >> >> >>>>> in public maven repos; so it is installed into my
>> >>>> github
>> >>>> > > > maven
>> >>>> > > > > >> repo
>> >>>> > > > > >> >> so
>> >>>> > > > > >> >> >>>>> far. Should compile for 3rd party.
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> -DR compilation requires R, RJava and optionally,
>> >>>> > > RProtoBuf.
>> >>>> > > > R
>> >>>> > > > > Doc
>> >>>> > > > > >> >> >>>>> compilation requires roxygen2 (i think).
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> For some reason RProtoBuf fails to import into
>> >>>> another
>> >>>> > > > package,
>> >>>> > > > > >> got a
>> >>>> > > > > >> >> >>>>> weird exception when i put @import RProtoBuf into
>> >>>> > crunchR,
>> >>>> > > so
>> >>>> > > > > >> >> >>>>> RProtoBuf is now in "Suggests" category. Down the
>> >>>> road
>> >>>> > that
>> >>>> > > > may
>> >>>> > > > > >> be a
>> >>>> > > > > >> >> >>>>> problem though...
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> other than the template, not much else has been
>> done
>> >>>> so
>> >>>> > > > far...
>> >>>> > > > > >> >> finding
>> >>>> > > > > >> >> >>>>> hadoop libraries and adding it to the package
>> path on
>> >>>> > > > > >> initialization
>> >>>> > > > > >> >> >>>>> via "hadoop classpath"... adding Crunch jars and
>> its
>> >>>> > > > > >> non-"provided"
>> >>>> > > > > >> >> >>>>> transitives to the crunchR's java part...
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> No legal stuff...
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> No readmes... complete stealth at this point.
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy
>> Lyubimov <
>> >>>> > > > > >> >> dlieu.7@gmail.com>
>> >>>> > > > > >> >> >>>>> wrote:
>> >>>> > > > > >> >> >>>>> > Ok, cool. I will try to roll project template
>> by
>> >>>> some
>> >>>> > > time
>> >>>> > > > > next
>> >>>> > > > > >> >> week.
>> >>>> > > > > >> >> >>>>> > we can start with prototyping and benchmarking
>> >>>> > something
>> >>>> > > > > really
>> >>>> > > > > >> >> >>>>> > simple, such as parallelDo().
>> >>>> > > > > >> >> >>>>> >
>> >>>> > > > > >> >> >>>>> > My interim goal is to perhaps take some more or
>> >>>> less
>> >>>> > > simple
>> >>>> > > > > >> >> algorithm
>> >>>> > > > > >> >> >>>>> > from Mahout and demonstrate it can be solved
>> with
>> >>>> > Rcrunch
>> >>>> > > > (or
>> >>>> > > > > >> >> whatever
>> >>>> > > > > >> >> >>>>> > name it has to be) in a comparable time
>> >>>> (performance)
>> >>>> > but
>> >>>> > > > > with
>> >>>> > > > > >> much
>> >>>> > > > > >> >> >>>>> > fewer lines of code. (say one of factorization
>> or
>> >>>> > > > clustering
>> >>>> > > > > >> >> things)
>> >>>> > > > > >> >> >>>>> >
>> >>>> > > > > >> >> >>>>> >
>> >>>> > > > > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <
>> >>>> > > rsharma@xebia.com
>> >>>> > > > >
>> >>>> > > > > >> wrote:
>> >>>> > > > > >> >> >>>>> >> I am not much of R user but I am interested to
>> >>>> see how
>> >>>> > > > well
>> >>>> > > > > we
>> >>>> > > > > >> can
>> >>>> > > > > >> >> >>>>> integrate
>> >>>> > > > > >> >> >>>>> >> the two. I would be happy to help.
>> >>>> > > > > >> >> >>>>> >>
>> >>>> > > > > >> >> >>>>> >> regards,
>> >>>> > > > > >> >> >>>>> >> Rahul
>> >>>> > > > > >> >> >>>>> >>
>> >>>> > > > > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
>> >>>> > > > > >> >> >>>>> >>>
>> >>>> > > > > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy
>> >>>> Lyubimov <
>> >>>> > > > > >> >> dlieu.7@gmail.com>
>> >>>> > > > > >> >> >>>>> >>> wrote:
>> >>>> > > > > >> >> >>>>> >>>>
>> >>>> > > > > >> >> >>>>> >>>> Yep, ok.
>> >>>> > > > > >> >> >>>>> >>>>
>> >>>> > > > > >> >> >>>>> >>>> I imagine it has to be an R module so I can
>> set
>> >>>> up a
>> >>>> > > > maven
>> >>>> > > > > >> >> project
>> >>>> > > > > >> >> >>>>> >>>> with java/R code tree (I have been doing
>> that a
>> >>>> lot
>> >>>> > > > > lately).
>> >>>> > > > > >> Or
>> >>>> > > > > >> >> if you
>> >>>> > > > > >> >> >>>>> >>>> have a template to look at, it would be
>> useful i
>> >>>> > guess
>> >>>> > > > > too.
>> >>>> > > > > >> >> >>>>> >>>
>> >>>> > > > > >> >> >>>>> >>> No, please go right ahead.
>> >>>> > > > > >> >> >>>>> >>>
>> >>>> > > > > >> >> >>>>> >>>>
>> >>>> > > > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills
>> <
>> >>>> > > > > >> >> josh.wills@gmail.com>
>> >>>> > > > > >> >> >>>>> wrote:
>> >>>> > > > > >> >> >>>>> >>>>>
>> >>>> > > > > >> >> >>>>> >>>>> I'd like it to be separate at first, but I
>> am
>> >>>> happy
>> >>>> > > to
>> >>>> > > > > help.
>> >>>> > > > > >> >> Github
>> >>>> > > > > >> >> >>>>> >>>>> repo?
>> >>>> > > > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy
>> Lyubimov" <
>> >>>> > > > > >> dlieu.7@gmail.com
>> >>>> > > > > >> >> >
>> >>>> > > > > >> >> >>>>> wrote:
>> >>>> > > > > >> >> >>>>> >>>>>
>> >>>> > > > > >> >> >>>>> >>>>>> Ok maybe there's a benefit to try a
>> JRI/RJava
>> >>>> > > > prototype
>> >>>> > > > > on
>> >>>> > > > > >> >> top of
>> >>>> > > > > >> >> >>>>> >>>>>> Crunch for something simple. This should
>> both
>> >>>> save
>> >>>> > > > time
>> >>>> > > > > and
>> >>>> > > > > >> >> prove or
>> >>>> > > > > >> >> >>>>> >>>>>> disprove if Crunch via RJava integration
>> is
>> >>>> > viable.
>> >>>> > > > > >> >> >>>>> >>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>> On my part i can try to do it within
>> Crunch
>> >>>> > > framework
>> >>>> > > > > or we
>> >>>> > > > > >> >> can keep
>> >>>> > > > > >> >> >>>>> >>>>>> it completely separate.
>> >>>> > > > > >> >> >>>>> >>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>> -d
>> >>>> > > > > >> >> >>>>> >>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh
>> Wills <
>> >>>> > > > > >> >> jwills@cloudera.com>
>> >>>> > > > > >> >> >>>>> >>>>>> wrote:
>> >>>> > > > > >> >> >>>>> >>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>> I am an avid R user and would be into
>> it--
>> >>>> who
>> >>>> > gave
>> >>>> > > > the
>> >>>> > > > > >> >> talk? Was
>> >>>> > > > > >> >> >>>>> it
>> >>>> > > > > >> >> >>>>> >>>>>>> Murray Stokely?
>> >>>> > > > > >> >> >>>>> >>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy
>> >>>> > Lyubimov <
>> >>>> > > > > >> >> >>>>> dlieu.7@gmail.com>
>> >>>> > > > > >> >> >>>>> >>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>> wrote:
>> >>>> > > > > >> >> >>>>> >>>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>> Hello,
>> >>>> > > > > >> >> >>>>> >>>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>> I was pretty excited to learn of
>> Google's
>> >>>> > > experience
>> >>>> > > > > of R
>> >>>> > > > > >> >> mapping
>> >>>> > > > > >> >> >>>>> of
>> >>>> > > > > >> >> >>>>> >>>>>>>> flume java on one of recent BARUGs. I
>> think
>> >>>> a
>> >>>> > lot
>> >>>> > > of
>> >>>> > > > > >> >> applications
>> >>>> > > > > >> >> >>>>> >>>>>>>> similar to what we do in Mahout could be
>> >>>> > > prototyped
>> >>>> > > > > using
>> >>>> > > > > >> >> flume R.
>> >>>> > > > > >> >> >>>>> >>>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>> I did not quite get the details of
>> Google
>> >>>> > > > > implementation
>> >>>> > > > > >> of
>> >>>> > > > > >> >> R
>> >>>> > > > > >> >> >>>>> >>>>>>>> mapping,
>> >>>> > > > > >> >> >>>>> >>>>>>>> but i am not sure if just a direct
>> mapping
>> >>>> from
>> >>>> > R
>> >>>> > > to
>> >>>> > > > > >> Crunch
>> >>>> > > > > >> >> would
>> >>>> > > > > >> >> >>>>> be
>> >>>> > > > > >> >> >>>>> >>>>>>>> sufficient (and, for most part,
>> efficient).
>> >>>> > > > RJava/JRI
>> >>>> > > > > and
>> >>>> > > > > >> >> jni
>> >>>> > > > > >> >> >>>>> seem to
>> >>>> > > > > >> >> >>>>> >>>>>>>> be a pretty terrible performer to do
>> that
>> >>>> > > directly.
>> >>>> > > > > >> >> >>>>> >>>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>> on top of it, I am thinknig if this
>> project
>> >>>> > could
>> >>>> > > > > have a
>> >>>> > > > > >> >> >>>>> contributed
>> >>>> > > > > >> >> >>>>> >>>>>>>> adapter to Mahout's distributed
>> matrices,
>> >>>> that
>> >>>> > > would
>> >>>> > > > > be
>> >>>> > > > > >> >> just a
>> >>>> > > > > >> >> >>>>> very
>> >>>> > > > > >> >> >>>>> >>>>>>>> good synergy.
>> >>>> > > > > >> >> >>>>> >>>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>> Is there anyone interested in
>> >>>> > > contributing/advising
>> >>>> > > > > for
>> >>>> > > > > >> open
>> >>>> > > > > >> >> >>>>> source
>> >>>> > > > > >> >> >>>>> >>>>>>>> version of flume R support? Just gauging
>> >>>> > interest,
>> >>>> > > > > Crunch
>> >>>> > > > > >> >> list
>> >>>> > > > > >> >> >>>>> seems
>> >>>> > > > > >> >> >>>>> >>>>>>>> like a natural place to poke.
>> >>>> > > > > >> >> >>>>> >>>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>> Thanks .
>> >>>> > > > > >> >> >>>>> >>>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>> -Dmitriy
>> >>>> > > > > >> >> >>>>> >>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>> --
>> >>>> > > > > >> >> >>>>> >>>>>>> Director of Data Science
>> >>>> > > > > >> >> >>>>> >>>>>>> Cloudera
>> >>>> > > > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
>> >>>> > > > > >> >> >>>>> >>>
>> >>>> > > > > >> >> >>>>> >>>
>> >>>> > > > > >> >> >>>>> >>>
>> >>>> > > > > >> >> >>>>> >>
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>
>> >>>> > > > > >> >> >>>>
>> >>>> > > > > >> >> >>>>
>> >>>> > > > > >> >> >>>> --
>> >>>> > > > > >> >> >>>> Director of Data Science
>> >>>> > > > > >> >> >>>> Cloudera <http://www.cloudera.com>
>> >>>> > > > > >> >> >>>> Twitter: @josh_wills <
>> http://twitter.com/josh_wills>
>> >>>> > > > > >> >>
>> >>>> > > > > >> >
>> >>>> > > > > >> >
>> >>>> > > > > >> >
>> >>>> > > > > >> > --
>> >>>> > > > > >> > Director of Data Science
>> >>>> > > > > >> > Cloudera <http://www.cloudera.com>
>> >>>> > > > > >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
>> >>>> > > > > >>
>> >>>> > > > > >
>> >>>> > > > > >
>> >>>> > > > > >
>> >>>> > > > > > --
>> >>>> > > > > > Director of Data Science
>> >>>> > > > > > Cloudera <http://www.cloudera.com>
>> >>>> > > > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
>> >>>> > > > >
>> >>>> > > >
>> >>>> > >
>> >>>> >
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Director of Data Science
>> >>>> Cloudera <http://www.cloudera.com>
>> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>> >>>>
>> >>>
>> >>>
>> >>
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera
>> Twitter: @josh_wills
>>
>
>
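Given the "safe assumption" confirmed in the thread above, that a DoFn instance sees a single emitter for its whole lifecycle, the R bridge can capture the emitter once and push bulked results through the cached reference. A self-contained sketch with illustrative stand-ins (RecordEmitter, CachingFn), not Crunch's actual API; the identity guard exists only to fail fast if the assumption ever breaks:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of relying on "one emitter per DoFn instance": capture the
// emitter on the first process() call, fail fast if the runtime ever
// hands us a different one, and flush batched records through the
// cached reference so individual data need no attached emitter info.
interface RecordEmitter<T> {
  void emit(T value);
}

class CachingFn {
  private RecordEmitter<String> cached;
  private final List<String> batch = new ArrayList<>();
  private final int batchSize;

  CachingFn(int batchSize) {
    this.batchSize = batchSize;
  }

  void process(String input, RecordEmitter<String> emitter) {
    if (cached == null) {
      cached = emitter; // capture once for the whole task lifecycle
    } else if (cached != emitter) {
      // Guard against the assumption breaking in a future runtime.
      throw new IllegalStateException("emitter changed mid-task");
    }
    batch.add(input);
    if (batch.size() >= batchSize) {
      flush();
    }
  }

  // Flush whatever remains at task cleanup time.
  void cleanup() {
    flush();
  }

  private void flush() {
    for (String s : batch) {
      cached.emit(s);
    }
    batch.clear();
  }
}
```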

Re: Flume R -- any interest?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
No, it is fully distributed testing.

It is OK; StatET handles log4j logging for me, so I see the logs. I was
wondering whether any end-to-end diagnostics are already embedded in Crunch,
but reporting backend errors to the front end is notoriously hard (and
sometimes impossible) with Hadoop, so I assume it doesn't make sense to
report client-only problems through an exception while backend status still
requires checking isSucceeded().



On Fri, Nov 16, 2012 at 11:07 AM, Josh Wills <jw...@cloudera.com> wrote:

> Are you running this using LocalJobRunner? Does calling
> Pipeline.enableDebug() before run() help? If it doesn't, it'll help
> settle a debate I'm having w/Matthias. ;-)
>
> On Fri, Nov 16, 2012 at 10:22 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> > I see the error in the logs, but Pipeline.run() never threw anything;
> > isSucceeded() subsequently returns false. Is there any way to extract
> > the problem client-side rather than just being able to state that the
> > job failed? Or is this OK as the only diagnostics by design?
> >
> > ============
> > 68124 [Thread-8] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  -
> > org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
> > does not exist: hdfs://localhost:11010/crunchr-example/input
> > at
> >
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
> > at
> >
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
> > at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944)
> > at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961)
> > at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at javax.security.auth.Subject.doAs(Subject.java:396)
> > at
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
> > at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
> > at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
> > at
> >
> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:331)
> > at org.apache.crunch.impl.mr.exec.CrunchJob.submit(CrunchJob.java:135)
> > at
> >
> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:251)
> > at
> >
> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.run(CrunchJobControl.java:279)
> > at java.lang.Thread.run(Thread.java:662)
> >
> >
> > On Mon, Nov 12, 2012 at 5:41 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> >
> >> for hadoop nodes i guess yet another option to soft-link the .so into
> >> hadoop's native lib folder
> >>
> >>
> >> On Mon, Nov 12, 2012 at 5:37 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >wrote:
> >>
> >>> I actually want to defer this to hadoop admins, we just need to create
> a
> >>> procedure for setting up nodes. Ideally as simple as possible.
> something
> >>> like
> >>>
> >>> 1) setup R
> >>> 2) install.packages("rJava","RProtoBuf","crunchR")
> >>> 3) R CMD javareconf
> >>> 4) add result of R --vanilla <<< 'system.file("jri", package="rJava")'
> to
> >>> either mapred command lines or LD_LIBRARY_PATH...
> >>>
> >>> but it will depend on their versions of hadoop, jre etc. I hoped crunch
> >>> might have something to hide a lot of that complexity (since it is
> about
> >>> hiding complexities, for the most part :)  ) besides hadoop has a way
> to
> >>> ship .so's to the backend so if crunch had an api to do something
> similar
> >>> it is conceivable that driver might yank and ship it too to hide that
> >>> complexity as well. But then there's a host of issues how to handle
> >>> potentially different rJava versions installed on different nodes...
> So, it
> >>> increasingly looks like something we might want to defer to sysops to
> do
> >>> with approximate set of requirements .
> >>>
> >>>
> >>> On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <jw...@cloudera.com>
> wrote:
> >>>
> >>>> On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <dl...@gmail.com>
> >>>> wrote:
> >>>>
> >>>> > so java tasks need to be able to load libjri.so from
> >>>> > whatever system.file("jri", package="rJava") says.
> >>>> >
> >>>> > Traditionally, these issues were handled with -Djava.library.path.
> >>>> > Apparently there's nothing java task can do to enable loadLibrary()
> >>>> command
> >>>> > to see the damn library once started. But -Djava.library.path
> requires
> >>>> for
> >>>> > nodes to configure and lock jvm command line from modifications of
> the
> >>>> > client.  which is fine.
> >>>> >
> >>>> > I also discovered that LD_LIBRARY_PATH actually works with jre 1.6
> >>>> (again).
> >>>> >
> >>>> > but... any other suggestions about best practice configuring crunch
> to
> >>>> run
> >>>> > user's .so's?
> >>>> >
> >>>>
> >>>> Not off the top of my head. I suspect that whatever you come up with
> will
> >>>> become the "best practice." :)
> >>>>
> >>>> >
> >>>> > thanks.
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>> > On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <jo...@gmail.com>
> >>>> wrote:
> >>>> >
> >>>> > > I believe that is a safe assumption, at least right now.
> >>>> > >
> >>>> > >
> >>>> > > On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <
> dlieu.7@gmail.com
> >>>> >
> >>>> > > wrote:
> >>>> > >
> >>>> > > > Question.
> >>>> > > >
> >>>> > > > So in Crunch api, initialize() doesn't get an emitter. and the
> >>>> process
> >>>> > > gets
> >>>> > > > emitter every time.
> >>>> > > >
> >>>> > > > However, my guess is any single reincarnation of a DoFn object in
> the
> >>>> > > backend
> >>>> > > > will always be getting the same emitter thru its lifecycle. Is
> it
> >>>> an
> >>>> > > > admissible assumption or there's currently a counter example to
> >>>> that?
> >>>> > > >
> >>>> > > > The problem is that as i implement the two way pipeline of input
> >>>> and
> >>>> > > > emitter data between R and Java, I am bulking these calls
> together
> >>>> for
> >>>> > > > performance reasons. Each individual datum in these chunks of
> data
> >>>> will
> >>>> > > not
> >>>> > > > have attached emitter function information to them in any way.
> >>>> (well it
> >>>> > > > could but it would be a performance killer and i bet emitter
> never
> >>>> > > > changes).
> >>>> > > >
> >>>> > > > So, thoughts? can i assume emitter never changes between first
> and
> >>>> last
> >>>> > > > call to DoFn instance?
> >>>> > > >
> >>>> > > > thanks.
> >>>> > > >
> >>>> > > >
> >>>> > > > On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <
> >>>> dlieu.7@gmail.com>
> >>>> > > > wrote:
> >>>> > > >
> >>>> > > > > yes...
> >>>> > > > >
> >>>> > > > > i think it worked for me before, although just adding all jars
> >>>> from R
> >>>> > > > > package distribution would be a little bit more appropriate
> >>>> approach
> >>>> > > > > -- but it creates a problem with jars in dependent R
> packages. I
> >>>> > think
> >>>> > > > > it would be much easier to just compile a hadoop-job file and
> >>>> stick
> >>>> > it
> >>>> > > > > in rather than doing cherry-picking of individual jars from
> who
> >>>> knows
> >>>> > > > > how many locations.
> >>>> > > > >
> >>>> > > > > i think i used the hadoop job format with distributed cache
> >>>> before
> >>>> > and
> >>>> > > > > it worked... at least with Pig "register jar" functionality.
> >>>> > > > >
> >>>> > > > > ok i guess i will just try if it works.
> >>>> > > > >
> >>>> > > > > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <
> jwills@cloudera.com
> >>>> >
> >>>> > > wrote:
> >>>> > > > > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <
> >>>> > dlieu.7@gmail.com
> >>>> > > >
> >>>> > > > > wrote:
> >>>> > > > > >
> >>>> > > > > >> Great! so it is in Crunch.
> >>>> > > > > >>
> >>>> > > > > >> does it support hadoop-job jar format or only pure java
> jars?
> >>>> > > > > >>
> >>>> > > > > >
> >>>> > > > > > I think just pure jars-- you're referring to hadoop-job
> format
> >>>> as
> >>>> > > > having
> >>>> > > > > > all the dependencies in a lib/ directory within the jar?
> >>>> > > > > >
> >>>> > > > > >
> >>>> > > > > >>
> >>>> > > > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <
> >>>> jwills@cloudera.com>
> >>>> > > > > wrote:
> >>>> > > > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <
> >>>> > > > dlieu.7@gmail.com>
> >>>> > > > > >> wrote:
> >>>> > > > > >> >
> >>>> > > > > >> >> I think i need functionality to add more jars (or
> external
> >>>> > > > > hadoop-jar)
> >>>> > > > > >> >> to drive that from an R package. Just setting job jar by
> >>>> class
> >>>> > is
> >>>> > > > not
> >>>> > > > > >> >> enough. I can push overall job-jar as an additional jar
> to
> >>>> R
> >>>> > > > package;
> >>>> > > > > >> >> however, i cannot really run hadoop command line on it,
> i
> >>>> need
> >>>> > to
> >>>> > > > set
> >>>> > > > > >> >> up classpath thru RJava.
> >>>> > > > > >> >>
> >>>> > > > > >> >> Traditional single hadoop job jar will unlikely work
> here
> >>>> since
> >>>> > > we
> >>>> > > > > >> >> cannot hardcode pipelines in java code but rather have
> to
> >>>> > > construct
> >>>> > > > > >> >> them on the fly. (well, we could serialize pipeline
> >>>> definitions
> >>>> > > > from
> >>>> > > > > R
> >>>> > > > > >> >> and then replay them in a driver -- but that's too
> >>>> cumbersome
> >>>> > and
> >>>> > > > > more
> >>>> > > > > >> >> work than it has to be.) There's no reason why i
> shouldn't
> >>>> be
> >>>> > > able
> >>>> > > > to
> >>>> > > > > >> >> do pig-like "register jar" or "setJobJar" (mahout-like)
> >>>> when
> >>>> > > > kicking
> >>>> > > > > >> >> off a pipeline.
> >>>> > > > > >> >>
> >>>> > > > > >> >
> >>>> > > > > >> > o.a.c.util.DistCache.addJarToDistributedCache?
> >>>> > > > > >> >
> >>>> > > > > >> >
> >>>> > > > > >> >>
> >>>> > > > > >> >>
> >>>> > > > > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <
> >>>> > > > > dlieu.7@gmail.com>
> >>>> > > > > >> >> wrote:
> >>>> > > > > >> >> > Ok, sounds very promising...
> >>>> > > > > >> >> >
> >>>> > > > > >> >> > i'll try to start digging on the driver part this week
> >>>> then
> >>>> > > > > (Pipeline
> >>>> > > > > >> >> > wrapper in R5).
> >>>> > > > > >> >> >
> >>>> > > > > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <
> >>>> > > > josh.wills@gmail.com
> >>>> > > > > >
> >>>> > > > > >> >> wrote:
> >>>> > > > > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <
> >>>> > > > > dlieu.7@gmail.com
> >>>> > > > > >> >
> >>>> > > > > >> >> wrote:
> >>>> > > > > >> >> >>> Ok, cool.
> >>>> > > > > >> >> >>>
> >>>> > > > > >> >> >>> So what state is Crunch in? I take it it is in a fairly
> >>>> > advanced
> >>>> > > > > state.
> >>>> > > > > >> >> >>> So every api mentioned in the  FlumeJava paper is
> >>>> working ,
> >>>> > > > > right?
> >>>> > > > > >> Or
> >>>> > > > > >> >> >>> there's something that is not working specifically?
> >>>> > > > > >> >> >>
> >>>> > > > > >> >> >> I think the only thing in the paper that we don't
> have
> >>>> in a
> >>>> > > > > working
> >>>> > > > > >> >> >> state is MSCR fusion. It's mostly just a question of
> >>>> > > > prioritizing
> >>>> > > > > it
> >>>> > > > > >> >> >> and getting the work done.
> >>>> > > > > >> >> >>
> >>>> > > > > >> >> >>>
> >>>> > > > > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <
> >>>> > > > jwills@cloudera.com
> >>>> > > > > >
> >>>> > > > > >> >> wrote:
> >>>> > > > > >> >> >>>> Hey Dmitriy,
> >>>> > > > > >> >> >>>>
> >>>> > > > > >> >> >>>> Got a fork going and looking forward to playing
> with
> >>>> > crunchR
> >>>> > > > > this
> >>>> > > > > >> >> weekend--
> >>>> > > > > >> >> >>>> thanks!
> >>>> > > > > >> >> >>>>
> >>>> > > > > >> >> >>>> J
> >>>> > > > > >> >> >>>>
> >>>> > > > > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <
> >>>> > > > > >> dlieu.7@gmail.com>
> >>>> > > > > >> >> wrote:
> >>>> > > > > >> >> >>>>
> >>>> > > > > >> >> >>>>> Project template
> >>>> https://github.com/dlyubimov/crunchR
> >>>> > > > > >> >> >>>>>
> >>>> > > > > >> >> >>>>> Default profile does not compile R artifact . R
> >>>> profile
> >>>> > > > > compiles R
> >>>> > > > > >> >> >>>>> artifact. for convenience, it is enabled by
> >>>> supplying -DR
> >>>> > > to
> >>>> > > > > mvn
> >>>> > > > > >> >> >>>>> command line, e.g.
> >>>> > > > > >> >> >>>>>
> >>>> > > > > >> >> >>>>> mvn install -DR
> >>>> > > > > >> >> >>>>>
> >>>> > > > > >> >> >>>>> there's also a helper that installs the snapshot
> >>>> version
> >>>> > of
> >>>> > > > the
> >>>> > > > > >> >> >>>>> package in the crunchR module.
> >>>> > > > > >> >> >>>>>
> >>>> > > > > >> >> >>>>> There's RJava and JRI java dependencies which i
> did
> >>>> not
> >>>> > > find
> >>>> > > > > >> anywhere
> >>>> > > > > >> >> >>>>> in public maven repos; so it is installed into my
> >>>> github
> >>>> > > > maven
> >>>> > > > > >> repo
> >>>> > > > > >> >> so
> >>>> > > > > >> >> >>>>> far. Should compile for 3rd party.
> >>>> > > > > >> >> >>>>>
> >>>> > > > > >> >> >>>>> -DR compilation requires R, RJava and optionally,
> >>>> > > RProtoBuf.
> >>>> > > > R
> >>>> > > > > Doc
> >>>> > > > > >> >> >>>>> compilation requires roxygen2 (i think).
> >>>> > > > > >> >> >>>>>
> >>>> > > > > >> >> >>>>> For some reason RProtoBuf fails to import into
> >>>> another
> >>>> > > > package,
> >>>> > > > > >> got a
> >>>> > > > > >> >> >>>>> weird exception when i put @import RProtoBuf into
> >>>> > crunchR,
> >>>> > > so
> >>>> > > > > >> >> >>>>> RProtoBuf is now in "Suggests" category. Down the
> >>>> road
> >>>> > that
> >>>> > > > may
> >>>> > > > > >> be a
> >>>> > > > > >> >> >>>>> problem though...
> >>>> > > > > >> >> >>>>>
> >>>> > > > > >> >> >>>>> other than the template, not much else has been
> done
> >>>> so
> >>>> > > > far...
> >>>> > > > > >> >> finding
> >>>> > > > > >> >> >>>>> hadoop libraries and adding it to the package
> path on
> >>>> > > > > >> initialization
> >>>> > > > > >> >> >>>>> via "hadoop classpath"... adding Crunch jars and
> its
> >>>> > > > > >> non-"provided"
> >>>> > > > > >> >> >>>>> transitives to the crunchR's java part...
> >>>> > > > > >> >> >>>>>
> >>>> > > > > >> >> >>>>> No legal stuff...
> >>>> > > > > >> >> >>>>>
> >>>> > > > > >> >> >>>>> No readmes... complete stealth at this point.
> >>>> > > > > >> >> >>>>>
> >>>> > > > > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy
> Lyubimov <
> >>>> > > > > >> >> dlieu.7@gmail.com>
> >>>> > > > > >> >> >>>>> wrote:
> >>>> > > > > >> >> >>>>> > Ok, cool. I will try to roll project template by
> >>>> some
> >>>> > > time
> >>>> > > > > next
> >>>> > > > > >> >> week.
> >>>> > > > > >> >> >>>>> > we can start with prototyping and benchmarking
> >>>> > something
> >>>> > > > > really
> >>>> > > > > >> >> >>>>> > simple, such as parallelDo().
> >>>> > > > > >> >> >>>>> >
> >>>> > > > > >> >> >>>>> > My interim goal is to perhaps take some more or
> >>>> less
> >>>> > > simple
> >>>> > > > > >> >> algorithm
> >>>> > > > > >> >> >>>>> > from Mahout and demonstrate it can be solved
> with
> >>>> > Rcrunch
> >>>> > > > (or
> >>>> > > > > >> >> whatever
> >>>> > > > > >> >> >>>>> > name it has to be) in a comparable time
> >>>> (performance)
> >>>> > but
> >>>> > > > > with
> >>>> > > > > >> much
> >>>> > > > > >> >> >>>>> > fewer lines of code. (say one of factorization
> or
> >>>> > > > clustering
> >>>> > > > > >> >> things)
> >>>> > > > > >> >> >>>>> >
> >>>> > > > > >> >> >>>>> >
> >>>> > > > > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <
> >>>> > > rsharma@xebia.com
> >>>> > > > >
> >>>> > > > > >> wrote:
> >>>> > > > > >> >> >>>>> >> I am not much of R user but I am interested to
> >>>> see how
> >>>> > > > well
> >>>> > > > > we
> >>>> > > > > >> can
> >>>> > > > > >> >> >>>>> integrate
> >>>> > > > > >> >> >>>>> >> the two. I would be happy to help.
> >>>> > > > > >> >> >>>>> >>
> >>>> > > > > >> >> >>>>> >> regards,
> >>>> > > > > >> >> >>>>> >> Rahul
> >>>> > > > > >> >> >>>>> >>
> >>>> > > > > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
> >>>> > > > > >> >> >>>>> >>>
> >>>> > > > > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy
> >>>> Lyubimov <
> >>>> > > > > >> >> dlieu.7@gmail.com>
> >>>> > > > > >> >> >>>>> >>> wrote:
> >>>> > > > > >> >> >>>>> >>>>
> >>>> > > > > >> >> >>>>> >>>> Yep, ok.
> >>>> > > > > >> >> >>>>> >>>>
> >>>> > > > > >> >> >>>>> >>>> I imagine it has to be an R module so I can
> set
> >>>> up a
> >>>> > > > maven
> >>>> > > > > >> >> project
> >>>> > > > > >> >> >>>>> >>>> with java/R code tree (I have been doing
> that a
> >>>> lot
> >>>> > > > > lately).
> >>>> > > > > >> Or
> >>>> > > > > >> >> if you
> >>>> > > > > >> >> >>>>> >>>> have a template to look at, it would be
> useful i
> >>>> > guess
> >>>> > > > > too.
> >>>> > > > > >> >> >>>>> >>>
> >>>> > > > > >> >> >>>>> >>> No, please go right ahead.
> >>>> > > > > >> >> >>>>> >>>
> >>>> > > > > >> >> >>>>> >>>>
> >>>> > > > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <
> >>>> > > > > >> >> josh.wills@gmail.com>
> >>>> > > > > >> >> >>>>> wrote:
> >>>> > > > > >> >> >>>>> >>>>>
> >>>> > > > > >> >> >>>>> >>>>> I'd like it to be separate at first, but I
> am
> >>>> happy
> >>>> > > to
> >>>> > > > > help.
> >>>> > > > > >> >> Github
> >>>> > > > > >> >> >>>>> >>>>> repo?
> >>>> > > > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov"
> <
> >>>> > > > > >> dlieu.7@gmail.com
> >>>> > > > > >> >> >
> >>>> > > > > >> >> >>>>> wrote:
> >>>> > > > > >> >> >>>>> >>>>>
> >>>> > > > > >> >> >>>>> >>>>>> Ok maybe there's a benefit to try a
> JRI/RJava
> >>>> > > > prototype
> >>>> > > > > on
> >>>> > > > > >> >> top of
> >>>> > > > > >> >> >>>>> >>>>>> Crunch for something simple. This should
> both
> >>>> save
> >>>> > > > time
> >>>> > > > > and
> >>>> > > > > >> >> prove or
> >>>> > > > > >> >> >>>>> >>>>>> disprove if Crunch via RJava integration is
> >>>> > viable.
> >>>> > > > > >> >> >>>>> >>>>>>
> >>>> > > > > >> >> >>>>> >>>>>> On my part i can try to do it within Crunch
> >>>> > > framework
> >>>> > > > > or we
> >>>> > > > > >> >> can keep
> >>>> > > > > >> >> >>>>> >>>>>> it completely separate.
> >>>> > > > > >> >> >>>>> >>>>>>
> >>>> > > > > >> >> >>>>> >>>>>> -d
> >>>> > > > > >> >> >>>>> >>>>>>
> >>>> > > > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh
> Wills <
> >>>> > > > > >> >> jwills@cloudera.com>
> >>>> > > > > >> >> >>>>> >>>>>> wrote:
> >>>> > > > > >> >> >>>>> >>>>>>>
> >>>> > > > > >> >> >>>>> >>>>>>> I am an avid R user and would be into it--
> >>>> who
> >>>> > gave
> >>>> > > > the
> >>>> > > > > >> >> talk? Was
> >>>> > > > > >> >> >>>>> it
> >>>> > > > > >> >> >>>>> >>>>>>> Murray Stokely?
> >>>> > > > > >> >> >>>>> >>>>>>>
> >>>> > > > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy
> >>>> > Lyubimov <
> >>>> > > > > >> >> >>>>> dlieu.7@gmail.com>
> >>>> > > > > >> >> >>>>> >>>>>>
> >>>> > > > > >> >> >>>>> >>>>>> wrote:
> >>>> > > > > >> >> >>>>> >>>>>>>>
> >>>> > > > > >> >> >>>>> >>>>>>>> Hello,
> >>>> > > > > >> >> >>>>> >>>>>>>>
> >>>> > > > > >> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's
> >>>> > > experience
> >>>> > > > > of R
> >>>> > > > > >> >> mapping
> >>>> > > > > >> >> >>>>> of
> >>>> > > > > >> >> >>>>> >>>>>>>> flume java on one of recent BARUGs. I
> think
> >>>> a
> >>>> > lot
> >>>> > > of
> >>>> > > > > >> >> applications
> >>>> > > > > >> >> >>>>> >>>>>>>> similar to what we do in Mahout could be
> >>>> > > prototyped
> >>>> > > > > using
> >>>> > > > > >> >> flume R.
> >>>> > > > > >> >> >>>>> >>>>>>>>
> >>>> > > > > >> >> >>>>> >>>>>>>> I did not quite get the details of Google
> >>>> > > > > implementation
> >>>> > > > > >> of
> >>>> > > > > >> >> R
> >>>> > > > > >> >> >>>>> >>>>>>>> mapping,
> >>>> > > > > >> >> >>>>> >>>>>>>> but i am not sure if just a direct
> mapping
> >>>> from
> >>>> > R
> >>>> > > to
> >>>> > > > > >> Crunch
> >>>> > > > > >> >> would
> >>>> > > > > >> >> >>>>> be
> >>>> > > > > >> >> >>>>> >>>>>>>> sufficient (and, for most part,
> efficient).
> >>>> > > > RJava/JRI
> >>>> > > > > and
> >>>> > > > > >> >> jni
> >>>> > > > > >> >> >>>>> seem to
> >>>> > > > > >> >> >>>>> >>>>>>>> be a pretty terrible performer to do that
> >>>> > > directly.
> >>>> > > > > >> >> >>>>> >>>>>>>>
> >>>> > > > > >> >> >>>>> >>>>>>>>
> >>>> > > > > >> >> >>>>> >>>>>>>> on top of it, I am thinking if this
> project
> >>>> > could
> >>>> > > > > have a
> >>>> > > > > >> >> >>>>> contributed
> >>>> > > > > >> >> >>>>> >>>>>>>> adapter to Mahout's distributed matrices,
> >>>> that
> >>>> > > would
> >>>> > > > > be
> >>>> > > > > >> >> just a
> >>>> > > > > >> >> >>>>> very
> >>>> > > > > >> >> >>>>> >>>>>>>> good synergy.
> >>>> > > > > >> >> >>>>> >>>>>>>>
> >>>> > > > > >> >> >>>>> >>>>>>>> Is there anyone interested in
> >>>> > > contributing/advising
> >>>> > > > > for
> >>>> > > > > >> open
> >>>> > > > > >> >> >>>>> source
> >>>> > > > > >> >> >>>>> >>>>>>>> version of flume R support? Just gauging
> >>>> > interest,
> >>>> > > > > Crunch
> >>>> > > > > >> >> list
> >>>> > > > > >> >> >>>>> seems
> >>>> > > > > >> >> >>>>> >>>>>>>> like a natural place to poke.
> >>>> > > > > >> >> >>>>> >>>>>>>>
> >>>> > > > > >> >> >>>>> >>>>>>>> Thanks .
> >>>> > > > > >> >> >>>>> >>>>>>>>
> >>>> > > > > >> >> >>>>> >>>>>>>> -Dmitriy
> >>>> > > > > >> >> >>>>> >>>>>>>
> >>>> > > > > >> >> >>>>> >>>>>>>
> >>>> > > > > >> >> >>>>> >>>>>>>
> >>>> > > > > >> >> >>>>> >>>>>>> --
> >>>> > > > > >> >> >>>>> >>>>>>> Director of Data Science
> >>>> > > > > >> >> >>>>> >>>>>>> Cloudera
> >>>> > > > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
> >>>> > > > > >> >> >>>>> >>>
> >>>> > > > > >> >> >>>>> >>>
> >>>> > > > > >> >> >>>>> >>>
> >>>> > > > > >> >> >>>>> >>
> >>>> > > > > >> >> >>>>>
> >>>> > > > > >> >> >>>>
> >>>> > > > > >> >> >>>>
> >>>> > > > > >> >> >>>>
> >>>> > > > > >> >> >>>> --
> >>>> > > > > >> >> >>>> Director of Data Science
> >>>> > > > > >> >> >>>> Cloudera <http://www.cloudera.com>
> >>>> > > > > >> >> >>>> Twitter: @josh_wills <
> http://twitter.com/josh_wills>
> >>>> > > > > >> >>
> >>>> > > > > >> >
> >>>> > > > > >> >
> >>>> > > > > >> >
> >>>> > > > > >> > --
> >>>> > > > > >> > Director of Data Science
> >>>> > > > > >> > Cloudera <http://www.cloudera.com>
> >>>> > > > > >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >>>> > > > > >>
> >>>> > > > > >
> >>>> > > > > >
> >>>> > > > > >
> >>>> > > > > > --
> >>>> > > > > > Director of Data Science
> >>>> > > > > > Cloudera <http://www.cloudera.com>
> >>>> > > > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >>>> > > > >
> >>>> > > >
> >>>> > >
> >>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Director of Data Science
> >>>> Cloudera <http://www.cloudera.com>
> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
> >>>>
> >>>
> >>>
> >>
>
>
>
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills
>

Re: Flume R -- any interest?

Posted by Josh Wills <jw...@cloudera.com>.
Are you running this using LocalJobRunner? Does calling
Pipeline.enableDebug() before run() help? If it doesn't, it'll help
settle a debate I'm having w/Matthias. ;-)

On Fri, Nov 16, 2012 at 10:22 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> I see the error in the logs, but Pipeline.run() never throws anything;
> isSucceeded() subsequently returns false. Is there any way to extract the
> problem client-side, rather than just being able to tell that the job failed?
> Or is the boolean the only diagnostic by design?
>
> ============
> 68124 [Thread-8] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  -
> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
> does not exist: hdfs://localhost:11010/crunchr-example/input
> at
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
> at
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
> at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944)
> at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961)
> at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
> at
> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:331)
> at org.apache.crunch.impl.mr.exec.CrunchJob.submit(CrunchJob.java:135)
> at
> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:251)
> at
> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.run(CrunchJobControl.java:279)
> at java.lang.Thread.run(Thread.java:662)
>
>
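The diagnostics being asked for above can be sketched as follows. This is a hypothetical illustration of a result object that carries failure causes client-side, not Crunch's actual API; the `JobResult` class and its methods are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a job result that records the backend failure
// causes so the client can do more than test a boolean. Not Crunch's API.
public class JobResult {
    private final List<Throwable> failures = new ArrayList<Throwable>();

    public void recordFailure(Throwable t) {
        failures.add(t);
    }

    public boolean isSucceeded() {
        return failures.isEmpty();
    }

    // The missing piece in the question above: expose the causes.
    public List<Throwable> getFailures() {
        return failures;
    }

    public static void main(String[] args) {
        JobResult result = new JobResult();
        result.recordFailure(
            new RuntimeException("Input path does not exist: /crunchr-example/input"));
        // Prints "false", then the recorded cause message.
        System.out.println(result.isSucceeded());
        for (Throwable t : result.getFailures()) {
            System.out.println(t.getMessage());
        }
    }
}
```

With something of this shape, the driver could report the InvalidInputException above directly instead of only a failed status.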
> On Mon, Nov 12, 2012 at 5:41 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> for hadoop nodes, i guess yet another option is to soft-link the .so into
>> hadoop's native lib folder
>>
>>
>> On Mon, Nov 12, 2012 at 5:37 PM, Dmitriy Lyubimov <dl...@gmail.com>wrote:
>>
>>> I actually want to defer this to hadoop admins; we just need to create a
>>> procedure for setting up nodes, ideally as simple as possible. Something
>>> like:
>>>
>>> 1) set up R
>>> 2) install.packages(c("rJava", "RProtoBuf", "crunchR"))
>>> 3) R CMD javareconf
>>> 4) add the result of R --vanilla <<< 'system.file("jri", package="rJava")' to
>>> either the mapred command lines or LD_LIBRARY_PATH...
>>>
>>> but it will depend on their versions of hadoop, the jre, etc. I hoped crunch
>>> might have something to hide a lot of that complexity (since it is about
>>> hiding complexities, for the most part :) ). Besides, hadoop has a way to
>>> ship .so's to the backend, so if crunch had an api to do something similar,
>>> it is conceivable that the driver might yank and ship the library too,
>>> hiding that complexity as well. But then there's a host of issues around
>>> handling potentially different rJava versions installed on different
>>> nodes... So it increasingly looks like something we may want to defer to
>>> sysops, with an approximate set of requirements.
>>>
>>>
>>> On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <jw...@cloudera.com> wrote:
>>>
>>>> On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>>> wrote:
>>>>
>>>> > so java tasks need to be able to load libjri.so from whatever
>>>> > system.file("jri", package="rJava") reports.
>>>> >
>>>> > Traditionally, these issues were handled with -Djava.library.path.
>>>> > Apparently there is nothing a java task can do to let loadLibrary()
>>>> > see the library once the jvm has started. But -Djava.library.path
>>>> > requires the nodes to configure the jvm command line and lock it
>>>> > against modifications by the client, which is fine.
>>>> >
>>>> > I also discovered that LD_LIBRARY_PATH actually works with jre 1.6
>>>> > (again).
>>>> >
>>>> > but... any other suggestions on best practice for configuring crunch
>>>> > to run user .so's?
>>>> >
>>>>
>>>> Not off the top of my head. I suspect that whatever you come up with will
>>>> become the "best practice." :)
>>>>
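The library-path point above can be checked with a small self-contained program; nothing Crunch-specific, just how a task JVM resolves a native library such as libjri.so.

```java
// Sketch: inspect the native-library search path a task JVM will use.
// System.loadLibrary("jri") only searches java.library.path, which is
// fixed at JVM launch (seeded from -Djava.library.path and, on Linux
// JREs of this era, LD_LIBRARY_PATH) -- it cannot be extended afterward.
public class NativePathCheck {
    public static void main(String[] args) {
        String path = System.getProperty("java.library.path", "");
        // If the rJava "jri" directory is absent from these entries,
        // loadLibrary("jri") will throw UnsatisfiedLinkError.
        for (String dir : path.split(java.io.File.pathSeparator)) {
            System.out.println(dir);
        }
    }
}
```

Running this inside a map task would show whether the node's configured -Djava.library.path (or LD_LIBRARY_PATH) actually reached the child JVM.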
>>>> >
>>>> > thanks.
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <jo...@gmail.com>
>>>> wrote:
>>>> >
>>>> > > I believe that is a safe assumption, at least right now.
>>>> > >
>>>> > >
>>>> > > On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>>>> >
>>>> > > wrote:
>>>> > >
>>>> > > > Question.
>>>> > > >
>>>> > > > So in Crunch api, initialize() doesn't get an emitter. and the
>>>> process
>>>> > > gets
>>>> > > > emitter every time.
>>>> > > >
>>>> > > > However, my guess any single reincranation of a DoFn object in the
>>>> > > backend
>>>> > > > will always be getting the same emitter thru its lifecycle. Is it
>>>> an
>>>> > > > admissible assumption or there's currently a counter example to
>>>> that?
>>>> > > >
>>>> > > > The problem is that as i implement the two way pipeline of input
>>>> and
>>>> > > > emitter data between R and Java, I am bulking these calls together
>>>> for
>>>> > > > performance reasons. Each individual datum in these chunks of data
>>>> will
>>>> > > not
>>>> > > > have attached emitter function information to them in any way.
>>>> (well it
>>>> > > > could but it would be a performance killer and i bet emitter never
>>>> > > > changes).
>>>> > > >
>>>> > > > So, thoughts? can i assume emitter never changes between first and
>>>> lass
>>>> > > > call to DoFn instance?
>>>> > > >
>>>> > > > thanks.
>>>> > > >
>>>> > > >
>>>> > > > On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <
>>>> dlieu.7@gmail.com>
>>>> > > > wrote:
>>>> > > >
>>>> > > > > yes...
>>>> > > > >
>>>> > > > > i think it worked for me before, although just adding all jars
>>>> from R
>>>> > > > > package distribution would be a little bit more appropriate
>>>> approach
>>>> > > > > -- but it creates a problem with jars in dependent R packages. I
>>>> > think
>>>> > > > > it would be much easier to just compile a hadoop-job file and
>>>> stick
>>>> > it
>>>> > > > > in rather than doing cherry-picking of individual jars from who
>>>> knows
>>>> > > > > how many locations.
>>>> > > > >
>>>> > > > > i think i used the hadoop job format with distributed cache
>>>> before
>>>> > and
>>>> > > > > it worked... at least with Pig "register jar" functionality.
>>>> > > > >
>>>> > > > > ok i guess i will just try if it works.
>>>> > > > >
>>>> > > > > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <jwills@cloudera.com
>>>> >
>>>> > > wrote:
>>>> > > > > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <
>>>> > dlieu.7@gmail.com
>>>> > > >
>>>> > > > > wrote:
>>>> > > > > >
>>>> > > > > >> Great! so it is in Crunch.
>>>> > > > > >>
>>>> > > > > >> does it support hadoop-job jar format or only pure java jars?
>>>> > > > > >>
>>>> > > > > >
>>>> > > > > > I think just pure jars-- you're referring to hadoop-job format
>>>> as
>>>> > > > having
>>>> > > > > > all the dependencies in a lib/ directory within the jar?
>>>> > > > > >
>>>> > > > > >
>>>> > > > > >>
>>>> > > > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <
>>>> jwills@cloudera.com>
>>>> > > > > wrote:
>>>> > > > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <
>>>> > > > dlieu.7@gmail.com>
>>>> > > > > >> wrote:
>>>> > > > > >> >
>>>> > > > > >> >> I think i need functionality to add more jars (or external
>>>> > > > > hadoop-jar)
>>>> > > > > >> >> to drive that from an R package. Just setting job jar by
>>>> class
>>>> > is
>>>> > > > not
>>>> > > > > >> >> enough. I can push overall job-jar as an addiitonal jar to
>>>> R
>>>> > > > package;
>>>> > > > > >> >> however, i cannot really run hadoop command line on it, i
>>>> need
>>>> > to
>>>> > > > set
>>>> > > > > >> >> up classpath thru RJava.
>>>> > > > > >> >>
>>>> > > > > >> >> Traditional single hadoop job jar will unlikely work here
>>>> since
>>>> > > we
>>>> > > > > >> >> cannot hardcode pipelines in java code but rather have to
>>>> > > construct
>>>> > > > > >> >> them on the fly. (well, we could serialize pipeline
>>>> definitions
>>>> > > > from
>>>> > > > > R
>>>> > > > > >> >> and then replay them in a driver -- but that's too
>>>> cumbersome
>>>> > and
>>>> > > > > more
>>>> > > > > >> >> work than it has to be.) There's no reason why i shouldn't
>>>> be
>>>> > > able
>>>> > > > to
>>>> > > > > >> >> do pig-like "register jar" or "setJobJar" (mahout-like)
>>>> when
>>>> > > > kicking
>>>> > > > > >> >> off a pipeline.
>>>> > > > > >> >>
>>>> > > > > >> >
>>>> > > > > >> > o.a.c.util.DistCache.addJarToDistributedCache?
>>>> > > > > >> >
>>>> > > > > >> >
>>>> > > > > >> >>
>>>> > > > > >> >>
>>>> > > > > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <
>>>> > > > > dlieu.7@gmail.com>
>>>> > > > > >> >> wrote:
>>>> > > > > >> >> > Ok, sounds very promising...
>>>> > > > > >> >> >
>>>> > > > > >> >> > i'll try to start digging on the driver part this week
>>>> then
>>>> > > > > (Pipeline
>>>> > > > > >> >> > wrapper in R5).
>>>> > > > > >> >> >
>>>> > > > > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <
>>>> > > > josh.wills@gmail.com
>>>> > > > > >
>>>> > > > > >> >> wrote:
>>>> > > > > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <
>>>> > > > > dlieu.7@gmail.com
>>>> > > > > >> >
>>>> > > > > >> >> wrote:
>>>> > > > > >> >> >>> Ok, cool.
>>>> > > > > >> >> >>>
>>>> > > > > >> >> >>> So what state is Crunch in? I take it is in a fairly
>>>> > advanced
>>>> > > > > state.
>>>> > > > > >> >> >>> So every api mentioned in the  FlumeJava paper is
>>>> working ,
>>>> > > > > right?
>>>> > > > > >> Or
>>>> > > > > >> >> >>> there's something that is not working specifically?
>>>> > > > > >> >> >>
>>>> > > > > >> >> >> I think the only thing in the paper that we don't have
>>>> in a
>>>> > > > > working
>>>> > > > > >> >> >> state is MSCR fusion. It's mostly just a question of
>>>> > > > prioritizing
>>>> > > > > it
>>>> > > > > >> >> >> and getting the work done.
>>>> > > > > >> >> >>
>>>> > > > > >> >> >>>
>>>> > > > > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <
>>>> > > > jwills@cloudera.com
>>>> > > > > >
>>>> > > > > >> >> wrote:
>>>> > > > > >> >> >>>> Hey Dmitriy,
>>>> > > > > >> >> >>>>
>>>> > > > > >> >> >>>> Got a fork going and looking forward to playing with
>>>> > crunchR
>>>> > > > > this
>>>> > > > > >> >> weekend--
>>>> > > > > >> >> >>>> thanks!
>>>> > > > > >> >> >>>>
>>>> > > > > >> >> >>>> J
>>>> > > > > >> >> >>>>
>>>> > > > > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <
>>>> > > > > >> dlieu.7@gmail.com>
>>>> > > > > >> >> wrote:
>>>> > > > > >> >> >>>>
>>>> > > > > >> >> >>>>> Project template
>>>> https://github.com/dlyubimov/crunchR
>>>> > > > > >> >> >>>>>
>>>> > > > > >> >> >>>>> Default profile does not compile R artifact . R
>>>> profile
>>>> > > > > compiles R
>>>> > > > > >> >> >>>>> artifact. for convenience, it is enabled by
>>>> supplying -DR
>>>> > > to
>>>> > > > > mvn
>>>> > > > > >> >> >>>>> command line, e.g.
>>>> > > > > >> >> >>>>>
>>>> > > > > >> >> >>>>> mvn install -DR
>>>> > > > > >> >> >>>>>
>>>> > > > > >> >> >>>>> there's also a helper that installs the snapshot
>>>> version
>>>> > of
>>>> > > > the
>>>> > > > > >> >> >>>>> package in the crunchR module.
>>>> > > > > >> >> >>>>>
>>>> > > > > >> >> >>>>> There's RJava and JRI java dependencies which i did
>>>> not
>>>> > > find
>>>> > > > > >> anywhere
>>>> > > > > >> >> >>>>> in public maven repos; so it is installed into my
>>>> github
>>>> > > > maven
>>>> > > > > >> repo
>>>> > > > > >> >> so
>>>> > > > > >> >> >>>>> far. Should compile for 3rd party.
>>>> > > > > >> >> >>>>>
>>>> > > > > >> >> >>>>> -DR compilation requires R, RJava and optionally,
>>>> > > RProtoBuf.
>>>> > > > R
>>>> > > > > Doc
>>>> > > > > >> >> >>>>> compilation requires roxygen2 (i think).
>>>> > > > > >> >> >>>>>
>>>> > > > > >> >> >>>>> For some reason RProtoBuf fails to import into
>>>> another
>>>> > > > package,
>>>> > > > > >> got a
>>>> > > > > >> >> >>>>> weird exception when i put @import RProtoBuf into
>>>> > crunchR,
>>>> > > so
>>>> > > > > >> >> >>>>> RProtoBuf is now in "Suggests" category. Down the
>>>> road
>>>> > that
>>>> > > > may
>>>> > > > > >> be a
>>>> > > > > >> >> >>>>> problem though...
>>>> > > > > >> >> >>>>>
>>>> > > > > >> >> >>>>> other than the template, not much else has been done
>>>> so
>>>> > > > far...
>>>> > > > > >> >> finding
>>>> > > > > >> >> >>>>> hadoop libraries and adding it to the package path on
>>>> > > > > >> initialization
>>>> > > > > >> >> >>>>> via "hadoop classpath"... adding Crunch jars and its
>>>> > > > > >> non-"provided"
>>>> > > > > >> >> >>>>> transitives to the crunchR's java part...
>>>> > > > > >> >> >>>>>
>>>> > > > > >> >> >>>>> No legal stuff...
>>>> > > > > >> >> >>>>>
>>>> > > > > >> >> >>>>> No readmes... complete stealth at this point.
>>>> > > > > >> >> >>>>>
>>>> > > > > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <
>>>> > > > > >> >> dlieu.7@gmail.com>
>>>> > > > > >> >> >>>>> wrote:
>>>> > > > > >> >> >>>>> > Ok, cool. I will try to roll project template by
>>>> some
>>>> > > time
>>>> > > > > next
>>>> > > > > >> >> week.
>>>> > > > > >> >> >>>>> > we can start with prototyping and benchmarking
>>>> > something
>>>> > > > > really
>>>> > > > > >> >> >>>>> > simple, such as parallelDo().
>>>> > > > > >> >> >>>>> >
>>>> > > > > >> >> >>>>> > My interim goal is to perhaps take some more or
>>>> less
>>>> > > simple
>>>> > > > > >> >> algorithm
>>>> > > > > >> >> >>>>> > from Mahout and demonstrate it can be solved with
>>>> > Rcrunch
>>>> > > > (or
>>>> > > > > >> >> whatever
>>>> > > > > >> >> >>>>> > name it has to be) in a comparable time
>>>> (performance)
>>>> > but
>>>> > > > > with
>>>> > > > > >> much
>>>> > > > > >> >> >>>>> > fewer lines of code. (say one of factorization or
>>>> > > > clustering
>>>> > > > > >> >> things)
>>>> > > > > >> >> >>>>> >
>>>> > > > > >> >> >>>>> >
>>>> > > > > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <
>>>> > > rsharma@xebia.com
>>>> > > > >
>>>> > > > > >> wrote:
>>>> > > > > >> >> >>>>> >> I am not much of R user but I am interested to
>>>> see how
>>>> > > > well
>>>> > > > > we
>>>> > > > > >> can
>>>> > > > > >> >> >>>>> integrate
>>>> > > > > >> >> >>>>> >> the two. I would be happy to help.
>>>> > > > > >> >> >>>>> >>
>>>> > > > > >> >> >>>>> >> regards,
>>>> > > > > >> >> >>>>> >> Rahul
>>>> > > > > >> >> >>>>> >>
>>>> > > > > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
>>>> > > > > >> >> >>>>> >>>
>>>> > > > > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy
>>>> Lyubimov <
>>>> > > > > >> >> dlieu.7@gmail.com>
>>>> > > > > >> >> >>>>> >>> wrote:
>>>> > > > > >> >> >>>>> >>>>
>>>> > > > > >> >> >>>>> >>>> Yep, ok.
>>>> > > > > >> >> >>>>> >>>>
>>>> > > > > >> >> >>>>> >>>> I imagine it has to be an R module so I can set
>>>> up a
>>>> > > > maven
>>>> > > > > >> >> project
>>>> > > > > >> >> >>>>> >>>> with java/R code tree (I have been doing that a
>>>> lot
>>>> > > > > lately).
>>>> > > > > >> Or
>>>> > > > > >> >> if you
>>>> > > > > >> >> >>>>> >>>> have a template to look at, it would be useful i
>>>> > guess
>>>> > > > > too.
>>>> > > > > >> >> >>>>> >>>
>>>> > > > > >> >> >>>>> >>> No, please go right ahead.
>>>> > > > > >> >> >>>>> >>>
>>>> > > > > >> >> >>>>> >>>>
>>>> > > > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <
>>>> > > > > >> >> josh.wills@gmail.com>
>>>> > > > > >> >> >>>>> wrote:
>>>> > > > > >> >> >>>>> >>>>>
>>>> > > > > >> >> >>>>> >>>>> I'd like it to be separate at first, but I am
>>>> happy
>>>> > > to
>>>> > > > > help.
>>>> > > > > >> >> Github
>>>> > > > > >> >> >>>>> >>>>> repo?
>>>> > > > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <
>>>> > > > > >> dlieu.7@gmail.com
>>>> > > > > >> >> >
>>>> > > > > >> >> >>>>> wrote:
>>>> > > > > >> >> >>>>> >>>>>
>>>> > > > > >> >> >>>>> >>>>>> Ok maybe there's a benefit to try a JRI/RJava
>>>> > > > prototype
>>>> > > > > on
>>>> > > > > >> >> top of
>>>> > > > > >> >> >>>>> >>>>>> Crunch for something simple. This should both
>>>> save
>>>> > > > time
>>>> > > > > and
>>>> > > > > >> >> prove or
>>>> > > > > >> >> >>>>> >>>>>> disprove if Crunch via RJava integration is
>>>> > viable.
>>>> > > > > >> >> >>>>> >>>>>>
>>>> > > > > >> >> >>>>> >>>>>> On my part i can try to do it within Crunch
>>>> > > framework
>>>> > > > > or we
>>>> > > > > >> >> can keep
>>>> > > > > >> >> >>>>> >>>>>> it completely separate.
>>>> > > > > >> >> >>>>> >>>>>>
>>>> > > > > >> >> >>>>> >>>>>> -d
>>>> > > > > >> >> >>>>> >>>>>>
>>>> > > > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <
>>>> > > > > >> >> jwills@cloudera.com>
>>>> > > > > >> >> >>>>> >>>>>> wrote:
>>>> > > > > >> >> >>>>> >>>>>>>
>>>> > > > > >> >> >>>>> >>>>>>> I am an avid R user and would be into it--
>>>> who
>>>> > gave
>>>> > > > the
>>>> > > > > >> >> talk? Was
>>>> > > > > >> >> >>>>> it
>>>> > > > > >> >> >>>>> >>>>>>> Murray Stokely?
>>>> > > > > >> >> >>>>> >>>>>>>
>>>> > > > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy
>>>> > Lyubimov <
>>>> > > > > >> >> >>>>> dlieu.7@gmail.com>
>>>> > > > > >> >> >>>>> >>>>>>
>>>> > > > > >> >> >>>>> >>>>>> wrote:
>>>> > > > > >> >> >>>>> >>>>>>>>
>>>> > > > > >> >> >>>>> >>>>>>>> Hello,
>>>> > > > > >> >> >>>>> >>>>>>>>
>>>> > > > > >> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's
>>>> > > experience
>>>> > > > > of R
>>>> > > > > >> >> mapping
>>>> > > > > >> >> >>>>> of
>>>> > > > > >> >> >>>>> >>>>>>>> flume java on one of recent BARUGs. I think
>>>> a
>>>> > lot
>>>> > > of
>>>> > > > > >> >> applications
>>>> > > > > >> >> >>>>> >>>>>>>> similar to what we do in Mahout could be
>>>> > > prototyped
>>>> > > > > using
>>>> > > > > >> >> flume R.
>>>> > > > > >> >> >>>>> >>>>>>>>
>>>> > > > > >> >> >>>>> >>>>>>>> I did not quite get the details of Google
>>>> > > > > implementation
>>>> > > > > >> of
>>>> > > > > >> >> R
>>>> > > > > >> >> >>>>> >>>>>>>> mapping,
>>>> > > > > >> >> >>>>> >>>>>>>> but i am not sure if just a direct mapping
>>>> from
>>>> > R
>>>> > > to
>>>> > > > > >> Crunch
>>>> > > > > >> >> would
>>>> > > > > >> >> >>>>> be
>>>> > > > > >> >> >>>>> >>>>>>>> sufficient (and, for most part, efficient).
>>>> > > > RJava/JRI
>>>> > > > > and
>>>> > > > > >> >> jni
>>>> > > > > >> >> >>>>> seem to
>>>> > > > > >> >> >>>>> >>>>>>>> be a pretty terrible performer to do that
>>>> > > directly.
>>>> > > > > >> >> >>>>> >>>>>>>>
>>>> > > > > >> >> >>>>> >>>>>>>>
>>>> > > > > >> >> >>>>> >>>>>>>> on top of it, I am thinking if this project
>>>> > could
>>>> > > > > have a
>>>> > > > > >> >> >>>>> contributed
>>>> > > > > >> >> >>>>> >>>>>>>> adapter to Mahout's distributed matrices,
>>>> that
>>>> > > would
>>>> > > > > be
>>>> > > > > >> >> just a
>>>> > > > > >> >> >>>>> very
>>>> > > > > >> >> >>>>> >>>>>>>> good synergy.
>>>> > > > > >> >> >>>>> >>>>>>>>
>>>> > > > > >> >> >>>>> >>>>>>>> Is there anyone interested in
>>>> > > contributing/advising
>>>> > > > > for
>>>> > > > > >> open
>>>> > > > > >> >> >>>>> source
>>>> > > > > >> >> >>>>> >>>>>>>> version of flume R support? Just gauging
>>>> > interest,
>>>> > > > > Crunch
>>>> > > > > >> >> list
>>>> > > > > >> >> >>>>> seems
>>>> > > > > >> >> >>>>> >>>>>>>> like a natural place to poke.
>>>> > > > > >> >> >>>>> >>>>>>>>
>>>> > > > > >> >> >>>>> >>>>>>>> Thanks .
>>>> > > > > >> >> >>>>> >>>>>>>>
>>>> > > > > >> >> >>>>> >>>>>>>> -Dmitriy
>>>> > > > > >> >> >>>>> >>>>>>>
>>>> > > > > >> >> >>>>> >>>>>>>
>>>> > > > > >> >> >>>>> >>>>>>>
>>>> > > > > >> >> >>>>> >>>>>>> --
>>>> > > > > >> >> >>>>> >>>>>>> Director of Data Science
>>>> > > > > >> >> >>>>> >>>>>>> Cloudera
>>>> > > > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
>>>> > > > > >> >> >>>>> >>>
>>>> > > > > >> >> >>>>> >>>
>>>> > > > > >> >> >>>>> >>>
>>>> > > > > >> >> >>>>> >>
>>>> > > > > >> >> >>>>>
>>>> > > > > >> >> >>>>
>>>> > > > > >> >> >>>>
>>>> > > > > >> >> >>>>
>>>> > > > > >> >> >>>> --
>>>> > > > > >> >> >>>> Director of Data Science
>>>> > > > > >> >> >>>> Cloudera <http://www.cloudera.com>
>>>> > > > > >> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>> > > > > >> >>
>>>> > > > > >> >
>>>> > > > > >> >
>>>> > > > > >> >
>>>> > > > > >> > --
>>>> > > > > >> > Director of Data Science
>>>> > > > > >> > Cloudera <http://www.cloudera.com>
>>>> > > > > >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>> > > > > >>
>>>> > > > > >
>>>> > > > > >
>>>> > > > > >
>>>> > > > > > --
>>>> > > > > > Director of Data Science
>>>> > > > > > Cloudera <http://www.cloudera.com>
>>>> > > > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>> > > > >
>>>> > > >
>>>> > >
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Director of Data Science
>>>> Cloudera <http://www.cloudera.com>
>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>
>>>
>>>
>>



-- 
Director of Data Science
Cloudera
Twitter: @josh_wills

Re: Flume R -- any interest?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I see the error in the logs, but Pipeline.run() never throws anything;
isSucceeded() subsequently returns false. Is there any way to surface the
underlying problem on the client side, rather than just being able to state
that the job failed? Or is that the only diagnostic available, by design?

============
68124 [Thread-8] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  -
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
does not exist: hdfs://localhost:11010/crunchr-example/input
at
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
at
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961)
at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
at
org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:331)
at org.apache.crunch.impl.mr.exec.CrunchJob.submit(CrunchJob.java:135)
at
org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:251)
at
org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.run(CrunchJobControl.java:279)
at java.lang.Thread.run(Thread.java:662)


On Mon, Nov 12, 2012 at 5:41 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> for hadoop nodes i guess yet another option to soft-link the .so into
> hadoop's native lib folder
>
>
> On Mon, Nov 12, 2012 at 5:37 PM, Dmitriy Lyubimov <dl...@gmail.com>wrote:
>
>> I actually want to defer this to hadoop admins, we just need to create a
>> procedure for setting up nodes. Ideally as simple as possible. something
>> like
>>
>> 1) setup R
>> 2) install.packages("rJava","RProtoBuf","crunchR")
>> 3) R CMD javareconf
>> 4) add result of R --vanilla <<< 'system.file("jri", package="rJava")' to
>> either mapred command lines or LD_LIBRARY_PATH...
>>
>> but it will depend on their versions of hadoop, jre etc. I hoped crunch
>> might have something to hide a lot of that complexity (since it is about
>> hiding complexities, for the most part :)  ) besides hadoop has a way to
>> ship .so's to the backend so if crunch had an api to do something similar
>> it is conceivable that driver might yank and ship it too to hide that
>> complexity as well. But then there's a host of issues how to handle
>> potentially different rJava versions installed on different nodes... So, it
>> increasingly looks like something we might want to defer to sysops to do
>> with approximate set of requirements .
>>
>>
>> On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <jw...@cloudera.com> wrote:
>>
>>> On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>> wrote:
>>>
>>> > so java tasks need to be able to load libjri.so from
>>> > whatever system.file("jri", package="rJava") says.
>>> >
>>> > Traditionally, these issues were handled with -Djava.library.path.
>>> > Apparently there's nothing java task can do to enable loadLibrary()
>>> command
>>> > to see the damn library once started. But -Djava.library.path requires
>>> for
>>> > nodes to configure and lock jvm command line from modifications of the
>>> > client.  which is fine.
>>> >
>>> > I also discovered that LD_LIBRARY_PATH actually works with jre 1.6
>>> (again).
>>> >
>>> > but... any other suggestions about best practice configuring crunch to
>>> run
>>> > user's .so's?
>>> >
>>>
>>> Not off the top of my head. I suspect that whatever you come up with will
>>> become the "best practice." :)
>>>
>>> >
>>> > thanks.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <jo...@gmail.com>
>>> wrote:
>>> >
>>> > > I believe that is a safe assumption, at least right now.
>>> > >
>>> > >
>>> > > On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>>> >
>>> > > wrote:
>>> > >
>>> > > > Question.
>>> > > >
>>> > > > So in Crunch api, initialize() doesn't get an emitter. and the
>>> process
>>> > > gets
>>> > > > emitter every time.
>>> > > >
>>> > > > However, my guess is that any single reincarnation of a DoFn object in the
>>> > > backend
>>> > > > will always be getting the same emitter thru its lifecycle. Is it
>>> an
>>> > > > admissible assumption or there's currently a counter example to
>>> that?
>>> > > >
>>> > > > The problem is that as i implement the two way pipeline of input
>>> and
>>> > > > emitter data between R and Java, I am bulking these calls together
>>> for
>>> > > > performance reasons. Each individual datum in these chunks of data
>>> will
>>> > > not
>>> > > > have attached emitter function information to them in any way.
>>> (well it
>>> > > > could but it would be a performance killer and i bet emitter never
>>> > > > changes).
>>> > > >
>>> > > > So, thoughts? can i assume emitter never changes between first and
>>> last
>>> > > > call to DoFn instance?
>>> > > >
>>> > > > thanks.
>>> > > >
>>> > > >
>>> > > > On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <
>>> dlieu.7@gmail.com>
>>> > > > wrote:
>>> > > >
>>> > > > > yes...
>>> > > > >
>>> > > > > i think it worked for me before, although just adding all jars
>>> from R
>>> > > > > package distribution would be a little bit more appropriate
>>> approach
>>> > > > > -- but it creates a problem with jars in dependent R packages. I
>>> > think
>>> > > > > it would be much easier to just compile a hadoop-job file and
>>> stick
>>> > it
>>> > > > > in rather than doing cherry-picking of individual jars from who
>>> knows
>>> > > > > how many locations.
>>> > > > >
>>> > > > > i think i used the hadoop job format with distributed cache
>>> before
>>> > and
>>> > > > > it worked... at least with Pig "register jar" functionality.
>>> > > > >
>>> > > > > ok i guess i will just try if it works.
>>> > > > >
>>> > > > > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <jwills@cloudera.com
>>> >
>>> > > wrote:
>>> > > > > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <
>>> > dlieu.7@gmail.com
>>> > > >
>>> > > > > wrote:
>>> > > > > >
>>> > > > > >> Great! so it is in Crunch.
>>> > > > > >>
>>> > > > > >> does it support hadoop-job jar format or only pure java jars?
>>> > > > > >>
>>> > > > > >
>>> > > > > > I think just pure jars-- you're referring to hadoop-job format
>>> as
>>> > > > having
>>> > > > > > all the dependencies in a lib/ directory within the jar?
>>> > > > > >
>>> > > > > >
>>> > > > > >>
>>> > > > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <
>>> jwills@cloudera.com>
>>> > > > > wrote:
>>> > > > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <
>>> > > > dlieu.7@gmail.com>
>>> > > > > >> wrote:
>>> > > > > >> >
>>> > > > > >> >> I think i need functionality to add more jars (or external
>>> > > > > hadoop-jar)
>>> > > > > >> >> to drive that from an R package. Just setting job jar by
>>> class
>>> > is
>>> > > > not
>>> > > > > >> >> enough. I can push overall job-jar as an additional jar to
>>> R
>>> > > > package;
>>> > > > > >> >> however, i cannot really run hadoop command line on it, i
>>> need
>>> > to
>>> > > > set
>>> > > > > >> >> up classpath thru RJava.
>>> > > > > >> >>
>>> > > > > >> >> Traditional single hadoop job jar will unlikely work here
>>> since
>>> > > we
>>> > > > > >> >> cannot hardcode pipelines in java code but rather have to
>>> > > construct
>>> > > > > >> >> them on the fly. (well, we could serialize pipeline
>>> definitions
>>> > > > from
>>> > > > > R
>>> > > > > >> >> and then replay them in a driver -- but that's too
>>> cumbersome
>>> > and
>>> > > > > more
>>> > > > > >> >> work than it has to be.) There's no reason why i shouldn't
>>> be
>>> > > able
>>> > > > to
>>> > > > > >> >> do pig-like "register jar" or "setJobJar" (mahout-like)
>>> when
>>> > > > kicking
>>> > > > > >> >> off a pipeline.
>>> > > > > >> >>
>>> > > > > >> >
>>> > > > > >> > o.a.c.util.DistCache.addJarToDistributedCache?
>>> > > > > >> >
>>> > > > > >> >
>>> > > > > >> >>
>>> > > > > >> >>
>>> > > > > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <
>>> > > > > dlieu.7@gmail.com>
>>> > > > > >> >> wrote:
>>> > > > > >> >> > Ok, sounds very promising...
>>> > > > > >> >> >
>>> > > > > >> >> > i'll try to start digging on the driver part this week
>>> then
>>> > > > > (Pipeline
>>> > > > > >> >> > wrapper in R5).
>>> > > > > >> >> >
>>> > > > > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <
>>> > > > josh.wills@gmail.com
>>> > > > > >
>>> > > > > >> >> wrote:
>>> > > > > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <
>>> > > > > dlieu.7@gmail.com
>>> > > > > >> >
>>> > > > > >> >> wrote:
>>> > > > > >> >> >>> Ok, cool.
>>> > > > > >> >> >>>
>>> > > > > >> >> >>> So what state is Crunch in? I take it is in a fairly
>>> > advanced
>>> > > > > state.
>>> > > > > >> >> >>> So every api mentioned in the  FlumeJava paper is
>>> working ,
>>> > > > > right?
>>> > > > > >> Or
>>> > > > > >> >> >>> there's something that is not working specifically?
>>> > > > > >> >> >>
>>> > > > > >> >> >> I think the only thing in the paper that we don't have
>>> in a
>>> > > > > working
>>> > > > > >> >> >> state is MSCR fusion. It's mostly just a question of
>>> > > > prioritizing
>>> > > > > it
>>> > > > > >> >> >> and getting the work done.
>>> > > > > >> >> >>
>>> > > > > >> >> >>>
>>> > > > > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <
>>> > > > jwills@cloudera.com
>>> > > > > >
>>> > > > > >> >> wrote:
>>> > > > > >> >> >>>> Hey Dmitriy,
>>> > > > > >> >> >>>>
>>> > > > > >> >> >>>> Got a fork going and looking forward to playing with
>>> > crunchR
>>> > > > > this
>>> > > > > >> >> weekend--
>>> > > > > >> >> >>>> thanks!
>>> > > > > >> >> >>>>
>>> > > > > >> >> >>>> J
>>> > > > > >> >> >>>>
>>> > > > > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <
>>> > > > > >> dlieu.7@gmail.com>
>>> > > > > >> >> wrote:
>>> > > > > >> >> >>>>
>>> > > > > >> >> >>>>> Project template
>>> https://github.com/dlyubimov/crunchR
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> Default profile does not compile R artifact . R
>>> profile
>>> > > > > compiles R
>>> > > > > >> >> >>>>> artifact. for convenience, it is enabled by
>>> supplying -DR
>>> > > to
>>> > > > > mvn
>>> > > > > >> >> >>>>> command line, e.g.
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> mvn install -DR
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> there's also a helper that installs the snapshot
>>> version
>>> > of
>>> > > > the
>>> > > > > >> >> >>>>> package in the crunchR module.
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> There's RJava and JRI java dependencies which i did
>>> not
>>> > > find
>>> > > > > >> anywhere
>>> > > > > >> >> >>>>> in public maven repos; so it is installed into my
>>> github
>>> > > > maven
>>> > > > > >> repo
>>> > > > > >> >> so
>>> > > > > >> >> >>>>> far. Should compile for 3rd party.
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> -DR compilation requires R, RJava and optionally,
>>> > > RProtoBuf.
>>> > > > R
>>> > > > > Doc
>>> > > > > >> >> >>>>> compilation requires roxygen2 (i think).
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> For some reason RProtoBuf fails to import into
>>> another
>>> > > > package,
>>> > > > > >> got a
>>> > > > > >> >> >>>>> weird exception when i put @import RProtoBuf into
>>> > crunchR,
>>> > > so
>>> > > > > >> >> >>>>> RProtoBuf is now in "Suggests" category. Down the
>>> road
>>> > that
>>> > > > may
>>> > > > > >> be a
>>> > > > > >> >> >>>>> problem though...
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> other than the template, not much else has been done
>>> so
>>> > > > far...
>>> > > > > >> >> finding
>>> > > > > >> >> >>>>> hadoop libraries and adding it to the package path on
>>> > > > > >> initialization
>>> > > > > >> >> >>>>> via "hadoop classpath"... adding Crunch jars and its
>>> > > > > >> non-"provided"
>>> > > > > >> >> >>>>> transitives to the crunchR's java part...
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> No legal stuff...
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> No readmes... complete stealth at this point.
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <
>>> > > > > >> >> dlieu.7@gmail.com>
>>> > > > > >> >> >>>>> wrote:
>>> > > > > >> >> >>>>> > Ok, cool. I will try to roll project template by
>>> some
>>> > > time
>>> > > > > next
>>> > > > > >> >> week.
>>> > > > > >> >> >>>>> > we can start with prototyping and benchmarking
>>> > something
>>> > > > > really
>>> > > > > >> >> >>>>> > simple, such as parallelDo().
>>> > > > > >> >> >>>>> >
>>> > > > > >> >> >>>>> > My interim goal is to perhaps take some more or
>>> less
>>> > > simple
>>> > > > > >> >> algorithm
>>> > > > > >> >> >>>>> > from Mahout and demonstrate it can be solved with
>>> > Rcrunch
>>> > > > (or
>>> > > > > >> >> whatever
>>> > > > > >> >> >>>>> > name it has to be) in a comparable time
>>> (performance)
>>> > but
>>> > > > > with
>>> > > > > >> much
>>> > > > > >> >> >>>>> > fewer lines of code. (say one of factorization or
>>> > > > clustering
>>> > > > > >> >> things)
>>> > > > > >> >> >>>>> >
>>> > > > > >> >> >>>>> >
>>> > > > > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <
>>> > > rsharma@xebia.com
>>> > > > >
>>> > > > > >> wrote:
>>> > > > > >> >> >>>>> >> I am not much of R user but I am interested to
>>> see how
>>> > > > well
>>> > > > > we
>>> > > > > >> can
>>> > > > > >> >> >>>>> integrate
>>> > > > > >> >> >>>>> >> the two. I would be happy to help.
>>> > > > > >> >> >>>>> >>
>>> > > > > >> >> >>>>> >> regards,
>>> > > > > >> >> >>>>> >> Rahul
>>> > > > > >> >> >>>>> >>
>>> > > > > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
>>> > > > > >> >> >>>>> >>>
>>> > > > > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy
>>> Lyubimov <
>>> > > > > >> >> dlieu.7@gmail.com>
>>> > > > > >> >> >>>>> >>> wrote:
>>> > > > > >> >> >>>>> >>>>
>>> > > > > >> >> >>>>> >>>> Yep, ok.
>>> > > > > >> >> >>>>> >>>>
>>> > > > > >> >> >>>>> >>>> I imagine it has to be an R module so I can set
>>> up a
>>> > > > maven
>>> > > > > >> >> project
>>> > > > > >> >> >>>>> >>>> with java/R code tree (I have been doing that a
>>> lot
>>> > > > > lately).
>>> > > > > >> Or
>>> > > > > >> >> if you
>>> > > > > >> >> >>>>> >>>> have a template to look at, it would be useful i
>>> > guess
>>> > > > > too.
>>> > > > > >> >> >>>>> >>>
>>> > > > > >> >> >>>>> >>> No, please go right ahead.
>>> > > > > >> >> >>>>> >>>
>>> > > > > >> >> >>>>> >>>>
>>> > > > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <
>>> > > > > >> >> josh.wills@gmail.com>
>>> > > > > >> >> >>>>> wrote:
>>> > > > > >> >> >>>>> >>>>>
>>> > > > > >> >> >>>>> >>>>> I'd like it to be separate at first, but I am
>>> happy
>>> > > to
>>> > > > > help.
>>> > > > > >> >> Github
>>> > > > > >> >> >>>>> >>>>> repo?
>>> > > > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <
>>> > > > > >> dlieu.7@gmail.com
>>> > > > > >> >> >
>>> > > > > >> >> >>>>> wrote:
>>> > > > > >> >> >>>>> >>>>>
>>> > > > > >> >> >>>>> >>>>>> Ok maybe there's a benefit to try a JRI/RJava
>>> > > > prototype
>>> > > > > on
>>> > > > > >> >> top of
>>> > > > > >> >> >>>>> >>>>>> Crunch for something simple. This should both
>>> save
>>> > > > time
>>> > > > > and
>>> > > > > >> >> prove or
>>> > > > > >> >> >>>>> >>>>>> disprove if Crunch via RJava integration is
>>> > viable.
>>> > > > > >> >> >>>>> >>>>>>
>>> > > > > >> >> >>>>> >>>>>> On my part i can try to do it within Crunch
>>> > > framework
>>> > > > > or we
>>> > > > > >> >> can keep
>>> > > > > >> >> >>>>> >>>>>> it completely separate.
>>> > > > > >> >> >>>>> >>>>>>
>>> > > > > >> >> >>>>> >>>>>> -d
>>> > > > > >> >> >>>>> >>>>>>
>>> > > > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <
>>> > > > > >> >> jwills@cloudera.com>
>>> > > > > >> >> >>>>> >>>>>> wrote:
>>> > > > > >> >> >>>>> >>>>>>>
>>> > > > > >> >> >>>>> >>>>>>> I am an avid R user and would be into it--
>>> who
>>> > gave
>>> > > > the
>>> > > > > >> >> talk? Was
>>> > > > > >> >> >>>>> it
>>> > > > > >> >> >>>>> >>>>>>> Murray Stokely?
>>> > > > > >> >> >>>>> >>>>>>>
>>> > > > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy
>>> > Lyubimov <
>>> > > > > >> >> >>>>> dlieu.7@gmail.com>
>>> > > > > >> >> >>>>> >>>>>>
>>> > > > > >> >> >>>>> >>>>>> wrote:
>>> > > > > >> >> >>>>> >>>>>>>>
>>> > > > > >> >> >>>>> >>>>>>>> Hello,
>>> > > > > >> >> >>>>> >>>>>>>>
>>> > > > > >> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's
>>> > > experience
>>> > > > > of R
>>> > > > > >> >> mapping
>>> > > > > >> >> >>>>> of
>>> > > > > >> >> >>>>> >>>>>>>> flume java on one of recent BARUGs. I think
>>> a
>>> > lot
>>> > > of
>>> > > > > >> >> applications
>>> > > > > >> >> >>>>> >>>>>>>> similar to what we do in Mahout could be
>>> > > prototyped
>>> > > > > using
>>> > > > > >> >> flume R.
>>> > > > > >> >> >>>>> >>>>>>>>
>>> > > > > >> >> >>>>> >>>>>>>> I did not quite get the details of Google
>>> > > > > implementation
>>> > > > > >> of
>>> > > > > >> >> R
>>> > > > > >> >> >>>>> >>>>>>>> mapping,
>>> > > > > >> >> >>>>> >>>>>>>> but i am not sure if just a direct mapping
>>> from
>>> > R
>>> > > to
>>> > > > > >> Crunch
>>> > > > > >> >> would
>>> > > > > >> >> >>>>> be
>>> > > > > >> >> >>>>> >>>>>>>> sufficient (and, for most part, efficient).
>>> > > > RJava/JRI
>>> > > > > and
>>> > > > > >> >> jni
>>> > > > > >> >> >>>>> seem to
>>> > > > > >> >> >>>>> >>>>>>>> be a pretty terrible performer to do that
>>> > > directly.
>>> > > > > >> >> >>>>> >>>>>>>>
>>> > > > > >> >> >>>>> >>>>>>>>
>>> > > > > >> >> >>>>> >>>>>>>> on top of it, I am thinking if this project
>>> > could
>>> > > > > have a
>>> > > > > >> >> >>>>> contributed
>>> > > > > >> >> >>>>> >>>>>>>> adapter to Mahout's distributed matrices,
>>> that
>>> > > would
>>> > > > > be
>>> > > > > >> >> just a
>>> > > > > >> >> >>>>> very
>>> > > > > >> >> >>>>> >>>>>>>> good synergy.
>>> > > > > >> >> >>>>> >>>>>>>>
>>> > > > > >> >> >>>>> >>>>>>>> Is there anyone interested in
>>> > > contributing/advising
>>> > > > > for
>>> > > > > >> open
>>> > > > > >> >> >>>>> source
>>> > > > > >> >> >>>>> >>>>>>>> version of flume R support? Just gauging
>>> > interest,
>>> > > > > Crunch
>>> > > > > >> >> list
>>> > > > > >> >> >>>>> seems
>>> > > > > >> >> >>>>> >>>>>>>> like a natural place to poke.
>>> > > > > >> >> >>>>> >>>>>>>>
>>> > > > > >> >> >>>>> >>>>>>>> Thanks .
>>> > > > > >> >> >>>>> >>>>>>>>
>>> > > > > >> >> >>>>> >>>>>>>> -Dmitriy
>>> > > > > >> >> >>>>> >>>>>>>
>>> > > > > >> >> >>>>> >>>>>>>
>>> > > > > >> >> >>>>> >>>>>>>
>>> > > > > >> >> >>>>> >>>>>>> --
>>> > > > > >> >> >>>>> >>>>>>> Director of Data Science
>>> > > > > >> >> >>>>> >>>>>>> Cloudera
>>> > > > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
>>> > > > > >> >> >>>>> >>>
>>> > > > > >> >> >>>>> >>>
>>> > > > > >> >> >>>>> >>>
>>> > > > > >> >> >>>>> >>
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>
>>> > > > > >> >> >>>>
>>> > > > > >> >> >>>>
>>> > > > > >> >> >>>> --
>>> > > > > >> >> >>>> Director of Data Science
>>> > > > > >> >> >>>> Cloudera <http://www.cloudera.com>
>>> > > > > >> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>> > > > > >> >>
>>> > > > > >> >
>>> > > > > >> >
>>> > > > > >> >
>>> > > > > >> > --
>>> > > > > >> > Director of Data Science
>>> > > > > >> > Cloudera <http://www.cloudera.com>
>>> > > > > >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
>>> > > > > >>
>>> > > > > >
>>> > > > > >
>>> > > > > >
>>> > > > > > --
>>> > > > > > Director of Data Science
>>> > > > > > Cloudera <http://www.cloudera.com>
>>> > > > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>
>>
>>
>

Re: Flume R -- any interest?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
for hadoop nodes i guess yet another option is to soft-link the .so into
hadoop's native lib folder
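In code terms, the soft-link option amounts to something like the sketch below.
The class name and every path here are illustrative assumptions, not anything
crunchR ships: on a real node the source directory would come from
`R --vanilla --slave -e 'cat(system.file("jri", package="rJava"))'` and the
target would be hadoop's native lib folder.

```java
// Sketch of the soft-link approach: native_dir/libjri.so -> jri_dir/libjri.so.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class JriLinkSketch {
    /** Creates (or replaces) a symlink to libjri.so inside nativeDir. */
    public static Path linkJri(Path jriDir, Path nativeDir) throws IOException {
        Path link = nativeDir.resolve("libjri.so");
        Files.deleteIfExists(link); // make the operation idempotent
        return Files.createSymbolicLink(link, jriDir.resolve("libjri.so"));
    }

    public static void main(String[] args) throws IOException {
        // Demo against temp directories so the sketch runs anywhere; on a real
        // node the arguments would be rJava's JRI dir and hadoop's native dir.
        Path jri = Files.createTempDirectory("jri");
        Path nativeDir = Files.createTempDirectory("native");
        Files.createFile(jri.resolve("libjri.so")); // stand-in for the real lib
        Path link = linkJri(jri, nativeDir);
        System.out.println(Files.isSymbolicLink(link));
    }
}
```

Once the link is in hadoop's native lib folder, task JVMs would pick up
libjri.so without touching -Djava.library.path or LD_LIBRARY_PATH.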


On Mon, Nov 12, 2012 at 5:37 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I actually want to defer this to hadoop admins, we just need to create a
> procedure for setting up nodes. Ideally as simple as possible. something
> like
>
> 1) setup R
> 2) install.packages("rJava","RProtoBuf","crunchR")
> 3) R CMD javareconf
> 4) add result of R --vanilla <<< 'system.file("jri", package="rJava")' to
> either mapred command lines or LD_LIBRARY_PATH...
>
> but it will depend on their versions of hadoop, jre etc. I hoped crunch
> might have something to hide a lot of that complexity (since it is about
> hiding complexities, for the most part :)  ) besides hadoop has a way to
> ship .so's to the backend so if crunch had an api to do something similar
> it is conceivable that driver might yank and ship it too to hide that
> complexity as well. But then there's a host of issues how to handle
> potentially different rJava versions installed on different nodes... So, it
> increasingly looks like something we might want to defer to sysops to do
> with approximate set of requirements .
>
>
> On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <jw...@cloudera.com> wrote:
>
>> On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>>
>> > so java tasks need to be able to load libjri.so from
>> > whatever system.file("jri", package="rJava") says.
>> >
>> > Traditionally, these issues were handled with -Djava.library.path.
>> > Apparently there's nothing java task can do to enable loadLibrary()
>> command
>> > to see the damn library once started. But -Djava.library.path requires
>> for
>> > nodes to configure and lock jvm command line from modifications of the
>> > client.  which is fine.
>> >
>> > I also discovered that LD_LIBRARY_PATH actually works with jre 1.6
>> (again).
>> >
>> > but... any other suggestions about best practice configuring crunch to
>> run
>> > user's .so's?
>> >
>>
>> Not off the top of my head. I suspect that whatever you come up with will
>> become the "best practice." :)
>>
>> >
>> > thanks.
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <jo...@gmail.com>
>> wrote:
>> >
>> > > I believe that is a safe assumption, at least right now.
>> > >
>> > >
>> > > On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> > > wrote:
>> > >
>> > > > Question.
>> > > >
>> > > > So in Crunch api, initialize() doesn't get an emitter. and the
>> process
>> > > gets
>> > > > emitter every time.
>> > > >
>> > > > However, my guess is that any single reincarnation of a DoFn object in the
>> > > backend
>> > > > will always be getting the same emitter thru its lifecycle. Is it an
>> > > > admissible assumption or there's currently a counter example to
>> that?
>> > > >
>> > > > The problem is that as i implement the two way pipeline of input and
>> > > > emitter data between R and Java, I am bulking these calls together
>> for
>> > > > performance reasons. Each individual datum in these chunks of data
>> will
>> > > not
>> > > > have attached emitter function information to them in any way.
>> (well it
>> > > > could but it would be a performance killer and i bet emitter never
>> > > > changes).
>> > > >
>> > > > So, thoughts? can i assume emitter never changes between first and
>> last
>> > > > call to DoFn instance?
>> > > >
>> > > > thanks.
>> > > >
>> > > >
>> > > > On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <
>> dlieu.7@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > yes...
>> > > > >
>> > > > > i think it worked for me before, although just adding all jars
>> from R
>> > > > > package distribution would be a little bit more appropriate
>> approach
>> > > > > -- but it creates a problem with jars in dependent R packages. I
>> > think
>> > > > > it would be much easier to just compile a hadoop-job file and
>> stick
>> > it
>> > > > > in rather than doing cherry-picking of individual jars from who
>> knows
>> > > > > how many locations.
>> > > > >
>> > > > > i think i used the hadoop job format with distributed cache before
>> > and
>> > > > > it worked... at least with Pig "register jar" functionality.
>> > > > >
>> > > > > ok i guess i will just try if it works.
>> > > > >
>> > > > > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <jw...@cloudera.com>
>> > > wrote:
>> > > > > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <
>> > dlieu.7@gmail.com
>> > > >
>> > > > > wrote:
>> > > > > >
>> > > > > >> Great! so it is in Crunch.
>> > > > > >>
>> > > > > >> does it support hadoop-job jar format or only pure java jars?
>> > > > > >>
>> > > > > >
>> > > > > > I think just pure jars-- you're referring to hadoop-job format
>> as
>> > > > having
>> > > > > > all the dependencies in a lib/ directory within the jar?
>> > > > > >
>> > > > > >
>> > > > > >>
>> > > > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <
>> jwills@cloudera.com>
>> > > > > wrote:
>> > > > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <
>> > > > dlieu.7@gmail.com>
>> > > > > >> wrote:
>> > > > > >> >
>> > > > > >> >> I think i need functionality to add more jars (or external
>> > > > > hadoop-jar)
>> > > > > >> >> to drive that from an R package. Just setting job jar by
>> class
>> > is
>> > > > not
>> > > > > >> >> enough. I can push overall job-jar as an additional jar to R
>> > > > package;
>> > > > > >> >> however, i cannot really run hadoop command line on it, i
>> need
>> > to
>> > > > set
>> > > > > >> >> up classpath thru RJava.
>> > > > > >> >>
>> > > > > >> >> Traditional single hadoop job jar will unlikely work here
>> since
>> > > we
>> > > > > >> >> cannot hardcode pipelines in java code but rather have to
>> > > construct
>> > > > > >> >> them on the fly. (well, we could serialize pipeline
>> definitions
>> > > > from
>> > > > > R
>> > > > > >> >> and then replay them in a driver -- but that's too
>> cumbersome
>> > and
>> > > > > more
>> > > > > >> >> work than it has to be.) There's no reason why i shouldn't
>> be
>> > > able
>> > > > to
>> > > > > >> >> do pig-like "register jar" or "setJobJar" (mahout-like) when
>> > > > kicking
>> > > > > >> >> off a pipeline.
>> > > > > >> >>
>> > > > > >> >
>> > > > > >> > o.a.c.util.DistCache.addJarToDistributedCache?
>> > > > > >> >
>> > > > > >> >
>> > > > > >> >>
>> > > > > >> >>
>> > > > > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <
>> > > > > dlieu.7@gmail.com>
>> > > > > >> >> wrote:
>> > > > > >> >> > Ok, sounds very promising...
>> > > > > >> >> >
>> > > > > >> >> > i'll try to start digging on the driver part this week
>> then
>> > > > > (Pipeline
>> > > > > >> >> > wrapper in R5).
>> > > > > >> >> >
>> > > > > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <
>> > > > josh.wills@gmail.com
>> > > > > >
>> > > > > >> >> wrote:
>> > > > > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <
>> > > > > dlieu.7@gmail.com
>> > > > > >> >
>> > > > > >> >> wrote:
>> > > > > >> >> >>> Ok, cool.
>> > > > > >> >> >>>
>> > > > > >> >> >>> So what state is Crunch in? I take it is in a fairly
>> > advanced
>> > > > > state.
>> > > > > >> >> >>> So every api mentioned in the  FlumeJava paper is
>> working ,
>> > > > > right?
>> > > > > >> Or
>> > > > > >> >> >>> there's something that is not working specifically?
>> > > > > >> >> >>
>> > > > > >> >> >> I think the only thing in the paper that we don't have
>> in a
>> > > > > working
>> > > > > >> >> >> state is MSCR fusion. It's mostly just a question of
>> > > > prioritizing
>> > > > > it
>> > > > > >> >> >> and getting the work done.
>> > > > > >> >> >>
>> > > > > >> >> >>>
>> > > > > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <
>> > > > jwills@cloudera.com
>> > > > > >
>> > > > > >> >> wrote:
>> > > > > >> >> >>>> Hey Dmitriy,
>> > > > > >> >> >>>>
>> > > > > >> >> >>>> Got a fork going and looking forward to playing with
>> > crunchR
>> > > > > this
>> > > > > >> >> weekend--
>> > > > > >> >> >>>> thanks!
>> > > > > >> >> >>>>
>> > > > > >> >> >>>> J
>> > > > > >> >> >>>>
>> > > > > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <
>> > > > > >> dlieu.7@gmail.com>
>> > > > > >> >> wrote:
>> > > > > >> >> >>>>
>> > > > > >> >> >>>>> Project template https://github.com/dlyubimov/crunchR
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> Default profile does not compile R artifact . R
>> profile
>> > > > > compiles R
>> > > > > >> >> >>>>> artifact. for convenience, it is enabled by supplying
>> -DR
>> > > to
>> > > > > mvn
>> > > > > >> >> >>>>> command line, e.g.
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> mvn install -DR
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> there's also a helper that installs the snapshot
>> version
>> > of
>> > > > the
>> > > > > >> >> >>>>> package in the crunchR module.
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> There's RJava and JRI java dependencies which i did
>> not
>> > > find
>> > > > > >> anywhere
>> > > > > >> >> >>>>> in public maven repos; so it is installed into my
>> github
>> > > > maven
>> > > > > >> repo
>> > > > > >> >> so
>> > > > > >> >> >>>>> far. Should compile for 3rd party.
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> -DR compilation requires R, RJava and optionally,
>> > > RProtoBuf.
>> > > > R
>> > > > > Doc
>> > > > > >> >> >>>>> compilation requires roxygen2 (i think).
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> For some reason RProtoBuf fails to import into another
>> > > > package,
>> > > > > >> got a
>> > > > > >> >> >>>>> weird exception when i put @import RProtoBuf into
>> > crunchR,
>> > > so
>> > > > > >> >> >>>>> RProtoBuf is now in "Suggests" category. Down the road
>> > that
>> > > > may
>> > > > > >> be a
>> > > > > >> >> >>>>> problem though...
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> other than the template, not much else has been done
>> so
>> > > > far...
>> > > > > >> >> finding
>> > > > > >> >> >>>>> hadoop libraries and adding it to the package path on
>> > > > > >> initialization
>> > > > > >> >> >>>>> via "hadoop classpath"... adding Crunch jars and its
>> > > > > >> non-"provided"
>> > > > > >> >> >>>>> transitives to the crunchR's java part...
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> No legal stuff...
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> No readmes... complete stealth at this point.
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <
>> > > > > >> >> dlieu.7@gmail.com>
>> > > > > >> >> >>>>> wrote:
>> > > > > >> >> >>>>> > Ok, cool. I will try to roll project template by
>> some
>> > > time
>> > > > > next
>> > > > > >> >> week.
>> > > > > >> >> >>>>> > we can start with prototyping and benchmarking
>> > something
>> > > > > really
>> > > > > >> >> >>>>> > simple, such as parallelDo().
>> > > > > >> >> >>>>> >
>> > > > > >> >> >>>>> > My interim goal is to perhaps take some more or less
>> > > simple
>> > > > > >> >> algorithm
>> > > > > >> >> >>>>> > from Mahout and demonstrate it can be solved with
>> > Rcrunch
>> > > > (or
>> > > > > >> >> whatever
>> > > > > >> >> >>>>> > name it has to be) in a comparable time
>> (performance)
>> > but
>> > > > > with
>> > > > > >> much
>> > > > > >> >> >>>>> > fewer lines of code. (say one of factorization or
>> > > > clustering
>> > > > > >> >> things)
>> > > > > >> >> >>>>> >
>> > > > > >> >> >>>>> >
>> > > > > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <
>> > > rsharma@xebia.com
>> > > > >
>> > > > > >> wrote:
>> > > > > >> >> >>>>> >> I am not much of R user but I am interested to see
>> how
>> > > > well
>> > > > > we
>> > > > > >> can
>> > > > > >> >> >>>>> integrate
>> > > > > >> >> >>>>> >> the two. I would be happy to help.
>> > > > > >> >> >>>>> >>
>> > > > > >> >> >>>>> >> regards,
>> > > > > >> >> >>>>> >> Rahul
>> > > > > >> >> >>>>> >>
>> > > > > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
>> > > > > >> >> >>>>> >>>
>> > > > > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov
>> <
>> > > > > >> >> dlieu.7@gmail.com>
>> > > > > >> >> >>>>> >>> wrote:
>> > > > > >> >> >>>>> >>>>
>> > > > > >> >> >>>>> >>>> Yep, ok.
>> > > > > >> >> >>>>> >>>>
>> > > > > >> >> >>>>> >>>> I imagine it has to be an R module so I can set
>> up a
>> > > > maven
>> > > > > >> >> project
>> > > > > >> >> >>>>> >>>> with java/R code tree (I have been doing that a
>> lot
>> > > > > lately).
>> > > > > >> Or
>> > > > > >> >> if you
>> > > > > >> >> >>>>> >>>> have a template to look at, it would be useful i
>> > guess
>> > > > > too.
>> > > > > >> >> >>>>> >>>
>> > > > > >> >> >>>>> >>> No, please go right ahead.
>> > > > > >> >> >>>>> >>>
>> > > > > >> >> >>>>> >>>>
>> > > > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <
>> > > > > >> >> josh.wills@gmail.com>
>> > > > > >> >> >>>>> wrote:
>> > > > > >> >> >>>>> >>>>>
>> > > > > >> >> >>>>> >>>>> I'd like it to be separate at first, but I am
>> happy
>> > > to
>> > > > > help.
>> > > > > >> >> Github
>> > > > > >> >> >>>>> >>>>> repo?
>> > > > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <
>> > > > > >> dlieu.7@gmail.com
>> > > > > >> >> >
>> > > > > >> >> >>>>> wrote:
>> > > > > >> >> >>>>> >>>>>
>> > > > > >> >> >>>>> >>>>>> Ok maybe there's a benefit to try a JRI/RJava
>> > > > prototype
>> > > > > on
>> > > > > >> >> top of
>> > > > > >> >> >>>>> >>>>>> Crunch for something simple. This should both
>> save
>> > > > time
>> > > > > and
>> > > > > >> >> prove or
>> > > > > >> >> >>>>> >>>>>> disprove if Crunch via RJava integration is
>> > viable.
>> > > > > >> >> >>>>> >>>>>>
>> > > > > >> >> >>>>> >>>>>> On my part i can try to do it within Crunch
>> > > framework
>> > > > > or we
>> > > > > >> >> can keep
>> > > > > >> >> >>>>> >>>>>> it completely separate.
>> > > > > >> >> >>>>> >>>>>>
>> > > > > >> >> >>>>> >>>>>> -d
>> > > > > >> >> >>>>> >>>>>>
>> > > > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <
>> > > > > >> >> jwills@cloudera.com>
>> > > > > >> >> >>>>> >>>>>> wrote:
>> > > > > >> >> >>>>> >>>>>>>
>> > > > > >> >> >>>>> >>>>>>> I am an avid R user and would be into it-- who
>> > gave
>> > > > the
>> > > > > >> >> talk? Was
>> > > > > >> >> >>>>> it
>> > > > > >> >> >>>>> >>>>>>> Murray Stokely?
>> > > > > >> >> >>>>> >>>>>>>
>> > > > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy
>> > Lyubimov <
>> > > > > >> >> >>>>> dlieu.7@gmail.com>
>> > > > > >> >> >>>>> >>>>>>
>> > > > > >> >> >>>>> >>>>>> wrote:
>> > > > > >> >> >>>>> >>>>>>>>
>> > > > > >> >> >>>>> >>>>>>>> Hello,
>> > > > > >> >> >>>>> >>>>>>>>
>> > > > > >> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's
>> > > experience
>> > > > > of R
>> > > > > >> >> mapping
>> > > > > >> >> >>>>> of
>> > > > > >> >> >>>>> >>>>>>>> flume java on one of recent BARUGs. I think a
>> > lot
>> > > of
>> > > > > >> >> applications
>> > > > > >> >> >>>>> >>>>>>>> similar to what we do in Mahout could be
>> > > prototyped
>> > > > > using
>> > > > > >> >> flume R.
>> > > > > >> >> >>>>> >>>>>>>>
>> > > > > >> >> >>>>> >>>>>>>> I did not quite get the details of Google
>> > > > > implementation
>> > > > > >> of
>> > > > > >> >> R
>> > > > > >> >> >>>>> >>>>>>>> mapping,
>> > > > > >> >> >>>>> >>>>>>>> but i am not sure if just a direct mapping
>> from
>> > R
>> > > to
>> > > > > >> Crunch
>> > > > > >> >> would
>> > > > > >> >> >>>>> be
>> > > > > >> >> >>>>> >>>>>>>> sufficient (and, for most part, efficient).
>> > > > RJava/JRI
>> > > > > and
>> > > > > >> >> jni
>> > > > > >> >> >>>>> seem to
>> > > > > >> >> >>>>> >>>>>>>> be a pretty terrible performer to do that
>> > > directly.
>> > > > > >> >> >>>>> >>>>>>>>
>> > > > > >> >> >>>>> >>>>>>>>
>> > > > > >> >> >>>>> >>>>>>>> on top of it, I am thinking if this project
>> > could
>> > > > > have a
>> > > > > >> >> >>>>> contributed
>> > > > > >> >> >>>>> >>>>>>>> adapter to Mahout's distributed matrices,
>> that
>> > > would
>> > > > > be
>> > > > > >> >> just a
>> > > > > >> >> >>>>> very
>> > > > > >> >> >>>>> >>>>>>>> good synergy.
>> > > > > >> >> >>>>> >>>>>>>>
>> > > > > >> >> >>>>> >>>>>>>> Is there anyone interested in
>> > > contributing/advising
>> > > > > for
>> > > > > >> open
>> > > > > >> >> >>>>> source
>> > > > > >> >> >>>>> >>>>>>>> version of flume R support? Just gauging
>> > interest,
>> > > > > Crunch
>> > > > > >> >> list
>> > > > > >> >> >>>>> seems
>> > > > > >> >> >>>>> >>>>>>>> like a natural place to poke.
>> > > > > >> >> >>>>> >>>>>>>>
>> > > > > >> >> >>>>> >>>>>>>> Thanks .
>> > > > > >> >> >>>>> >>>>>>>>
>> > > > > >> >> >>>>> >>>>>>>> -Dmitriy
>> > > > > >> >> >>>>> >>>>>>>
>> > > > > >> >> >>>>> >>>>>>>
>> > > > > >> >> >>>>> >>>>>>>
>> > > > > >> >> >>>>> >>>>>>> --
>> > > > > >> >> >>>>> >>>>>>> Director of Data Science
>> > > > > >> >> >>>>> >>>>>>> Cloudera
>> > > > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
>> > > > > >> >> >>>>> >>>
>> > > > > >> >> >>>>> >>>
>> > > > > >> >> >>>>> >>>
>> > > > > >> >> >>>>> >>
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>
>> > > > > >> >> >>>>
>> > > > > >> >> >>>>
>> > > > > >> >> >>>> --
>> > > > > >> >> >>>> Director of Data Science
>> > > > > >> >> >>>> Cloudera <http://www.cloudera.com>
>> > > > > >> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>> > > > > >> >>
>> > > > > >> >
>> > > > > >> >
>> > > > > >> >
>> > > > > >> > --
>> > > > > >> > Director of Data Science
>> > > > > >> > Cloudera <http://www.cloudera.com>
>> > > > > >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
>> > > > > >>
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > Director of Data Science
>> > > > > > Cloudera <http://www.cloudera.com>
>> > > > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
>> > > > >
>> > > >
>> > >
>> >
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>

Re: Flume R -- any interest?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I actually want to defer this to hadoop admins; we just need to create a
procedure for setting up nodes, ideally as simple as possible. Something
like:

1) setup R
2) install.packages(c("rJava", "RProtoBuf", "crunchR"))
3) R CMD javareconf
4) add the result of R --vanilla <<< 'system.file("jri", package="rJava")' to
either mapred command lines or LD_LIBRARY_PATH...
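For reference, the steps above might be scripted roughly as follows. This is
a provisioning sketch only: the package-manager commands, CRAN repo, and
paths are assumptions that will differ per node OS and hadoop/jre versions
(and crunchR itself would come from wherever the package is published).

```shell
#!/usr/bin/env bash
# Hypothetical node-setup sketch for the four steps above; adjust the
# package manager and repo for your distribution.
set -euo pipefail

# 1) set up R (Debian/Ubuntu flavor shown as an example)
sudo apt-get install -y r-base r-base-dev

# 2) install the R-side dependencies
Rscript -e 'install.packages(c("rJava", "RProtoBuf", "crunchR"),
                             repos = "https://cran.r-project.org")'

# 3) regenerate R's Java configuration against the JVM the tasks will use
sudo R CMD javareconf

# 4) locate libjri.so and expose it to the task JVMs
JRI_DIR="$(Rscript -e 'cat(system.file("jri", package = "rJava"))')"
export LD_LIBRARY_PATH="${JRI_DIR}:${LD_LIBRARY_PATH:-}"
echo "JRI native dir: ${JRI_DIR}"
```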

but it will depend on their versions of hadoop, jre etc. I hoped crunch
might have something to hide a lot of that complexity (since it is about
hiding complexities, for the most part :) ). Besides, hadoop has a way to
ship .so's to the backend, so if crunch had an api to do something similar,
it is conceivable that the driver might yank and ship it too, hiding that
complexity as well. But then there's a host of issues around handling
potentially different rJava versions installed on different nodes... So it
increasingly looks like something we might want to defer to sysops, with an
approximate set of requirements.


On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <jw...@cloudera.com> wrote:

> On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > so java tasks need to be able to load libjri.so from
> > whatever system.file("jri", package="rJava") says.
> >
> > Traditionally, these issues were handled with -Djava.library.path.
> > Apparently there's nothing a java task can do to enable the loadLibrary()
> > call to see the damn library once started. But -Djava.library.path
> > requires nodes to configure and lock the jvm command line against
> > modifications by the client, which is fine.
> >
> > I also discovered that LD_LIBRARY_PATH actually works with jre 1.6
> (again).
> >
> > but... any other suggestions about best practice configuring crunch to
> run
> > user's .so's?
> >
>
> Not off the top of my head. I suspect that whatever you come up with will
> become the "best practice." :)
>
> >
> > thanks.
> >
> >
> >
> >
> >
> >
> > On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <jo...@gmail.com>
> wrote:
> >
> > > I believe that is a safe assumption, at least right now.
> > >
> > >
> > > On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > > wrote:
> > >
> > > > Question.
> > > >
> > > > So in the Crunch api, initialize() doesn't get an emitter, and the
> > > > process() call gets an emitter every time.
> > > >
> > > > However, my guess is any single reincarnation of a DoFn object in the
> > > backend
> > > > will always be getting the same emitter thru its lifecycle. Is it an
> > > > admissible assumption, or is there currently a counterexample to that?
> > > >
> > > > The problem is that as i implement the two way pipeline of input and
> > > > emitter data between R and Java, I am bulking these calls together
> for
> > > > performance reasons. Each individual datum in these chunks of data
> will
> > > not
> > > > have attached emitter function information to them in any way. (well
> it
> > > > could but it would be a performance killer and i bet emitter never
> > > > changes).
> > > >
> > > > So, thoughts? can i assume the emitter never changes between the
> > > > first and last call to a DoFn instance?
> > > >
> > > > thanks.
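A minimal sketch of the batching approach described in the quoted question,
assuming (as confirmed in this thread) that a DoFn instance sees a single
emitter for its whole lifecycle. The Emitter interface and class shape here
are simplified stand-ins, not the actual Crunch or crunchR code; the chunk
size and the upper-casing "R round trip" are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchingDoFnSketch {

  // Minimal stand-in for Crunch's Emitter interface.
  interface Emitter<T> {
    void emit(T value);
  }

  // Caches the emitter on first use and buffers inputs so they can be
  // shipped to the R side in chunks instead of one JNI call per datum.
  static class BatchingDoFn {
    private final int chunkSize;
    private final List<String> buffer = new ArrayList<>();
    private Emitter<String> cachedEmitter; // set once, reused thereafter

    BatchingDoFn(int chunkSize) { this.chunkSize = chunkSize; }

    void process(String input, Emitter<String> emitter) {
      if (cachedEmitter == null) {
        cachedEmitter = emitter; // safe per the assumption above
      }
      buffer.add(input);
      if (buffer.size() >= chunkSize) {
        flush();
      }
    }

    // Called at end-of-stream, mirroring DoFn.cleanup(Emitter).
    void cleanup(Emitter<String> emitter) {
      if (cachedEmitter == null) {
        cachedEmitter = emitter;
      }
      flush();
    }

    // Pretend round trip through R: here we just upper-case the chunk.
    private void flush() {
      for (String s : buffer) {
        cachedEmitter.emit(s.toUpperCase());
      }
      buffer.clear();
    }
  }

  public static void main(String[] args) {
    List<String> out = new ArrayList<>();
    Emitter<String> em = out::add;
    BatchingDoFn fn = new BatchingDoFn(2);
    fn.process("a", em);
    fn.process("b", em); // triggers a flush of the first chunk
    fn.process("c", em);
    fn.cleanup(em);      // flushes the trailing partial chunk
    System.out.println(out); // prints [A, B, C]
  }
}
```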
> > > >
> > > >
> > > > On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >
> > > > wrote:
> > > >
> > > > > yes...
> > > > >
> > > > > i think it worked for me before, although just adding all jars
> from R
> > > > > package distribution would be a little bit more appropriate
> approach
> > > > > -- but it creates a problem with jars in dependent R packages. I
> > think
> > > > > it would be much easier to just compile a hadoop-job file and stick
> > it
> > > > > in rather than doing cherry-picking of individual jars from who
> knows
> > > > > how many locations.
> > > > >
> > > > > i think i used the hadoop job format with distributed cache before
> > and
> > > > > it worked... at least with Pig "register jar" functionality.
> > > > >
> > > > > ok i guess i will just try if it works.
> > > > >
> > > > > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <jw...@cloudera.com>
> > > wrote:
> > > > > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <
> > dlieu.7@gmail.com
> > > >
> > > > > wrote:
> > > > > >
> > > > > >> Great! so it is in Crunch.
> > > > > >>
> > > > > >> does it support hadoop-job jar format or only pure java jars?
> > > > > >>
> > > > > >
> > > > > > I think just pure jars-- you're referring to hadoop-job format as
> > > > having
> > > > > > all the dependencies in a lib/ directory within the jar?
> > > > > >
> > > > > >
> > > > > >>
> > > > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <
> jwills@cloudera.com>
> > > > > wrote:
> > > > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <
> > > > dlieu.7@gmail.com>
> > > > > >> wrote:
> > > > > >> >
> > > > > >> >> I think i need functionality to add more jars (or external
> > > > > hadoop-jar)
> > > > > >> >> to drive that from an R package. Just setting job jar by
> class
> > is
> > > > not
>> >> enough. I can push overall job-jar as an additional jar to R
> > > > package;
> > > > > >> >> however, i cannot really run hadoop command line on it, i
> need
> > to
> > > > set
> > > > > >> >> up classpath thru RJava.
> > > > > >> >>
> > > > > >> >> Traditional single hadoop job jar will unlikely work here
> since
> > > we
> > > > > >> >> cannot hardcode pipelines in java code but rather have to
> > > construct
> > > > > >> >> them on the fly. (well, we could serialize pipeline
> definitions
> > > > from
> > > > > R
> > > > > >> >> and then replay them in a driver -- but that's too cumbersome
> > and
> > > > > more
> > > > > >> >> work than it has to be.) There's no reason why i shouldn't be
> > > able
> > > > to
> > > > > >> >> do pig-like "register jar" or "setJobJar" (mahout-like) when
> > > > kicking
> > > > > >> >> off a pipeline.
> > > > > >> >>
> > > > > >> >
> > > > > >> > o.a.c.util.DistCache.addJarToDistributedCache?
> > > > > >> >
> > > > > >> >
> > > > > >> >>
> > > > > >> >>
> > > > > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <
> > > > > dlieu.7@gmail.com>
> > > > > >> >> wrote:
> > > > > >> >> > Ok, sounds very promising...
> > > > > >> >> >
> > > > > >> >> > i'll try to start digging on the driver part this week then
> > > > > (Pipeline
> > > > > >> >> > wrapper in R5).
> > > > > >> >> >
> > > > > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <
> > > > josh.wills@gmail.com
> > > > > >
> > > > > >> >> wrote:
> > > > > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <
> > > > > dlieu.7@gmail.com
> > > > > >> >
> > > > > >> >> wrote:
> > > > > >> >> >>> Ok, cool.
> > > > > >> >> >>>
> > > > > >> >> >>> So what state is Crunch in? I take it is in a fairly
> > advanced
> > > > > state.
> > > > > >> >> >>> So every api mentioned in the  FlumeJava paper is
> working ,
> > > > > right?
> > > > > >> Or
> > > > > >> >> >>> there's something that is not working specifically?
> > > > > >> >> >>
> > > > > >> >> >> I think the only thing in the paper that we don't have in
> a
> > > > > working
> > > > > >> >> >> state is MSCR fusion. It's mostly just a question of
> > > > prioritizing
> > > > > it
> > > > > >> >> >> and getting the work done.
> > > > > >> >> >>
> > > > > >> >> >>>
> > > > > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <
> > > > jwills@cloudera.com
> > > > > >
> > > > > >> >> wrote:
> > > > > >> >> >>>> Hey Dmitriy,
> > > > > >> >> >>>>
> > > > > >> >> >>>> Got a fork going and looking forward to playing with
> > crunchR
> > > > > this
> > > > > >> >> weekend--
> > > > > >> >> >>>> thanks!
> > > > > >> >> >>>>
> > > > > >> >> >>>> J
> > > > > >> >> >>>>
> > > > > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <
> > > > > >> dlieu.7@gmail.com>
> > > > > >> >> wrote:
> > > > > >> >> >>>>
> > > > > >> >> >>>>> Project template https://github.com/dlyubimov/crunchR
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> Default profile does not compile R artifact . R profile
> > > > > compiles R
> > > > > >> >> >>>>> artifact. for convenience, it is enabled by supplying
> -DR
> > > to
> > > > > mvn
> > > > > >> >> >>>>> command line, e.g.
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> mvn install -DR
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> there's also a helper that installs the snapshot
> version
> > of
> > > > the
> > > > > >> >> >>>>> package in the crunchR module.
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> There's RJava and JRI java dependencies which i did not
> > > find
> > > > > >> anywhere
> > > > > >> >> >>>>> in public maven repos; so it is installed into my
> github
> > > > maven
> > > > > >> repo
> > > > > >> >> so
> > > > > >> >> >>>>> far. Should compile for 3rd party.
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> -DR compilation requires R, RJava and optionally,
> > > RProtoBuf.
> > > > R
> > > > > Doc
> > > > > >> >> >>>>> compilation requires roxygen2 (i think).
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> For some reason RProtoBuf fails to import into another
> > > > package,
> > > > > >> got a
> > > > > >> >> >>>>> weird exception when i put @import RProtoBuf into
> > crunchR,
> > > so
> > > > > >> >> >>>>> RProtoBuf is now in "Suggests" category. Down the road
> > that
> > > > may
> > > > > >> be a
> > > > > >> >> >>>>> problem though...
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> other than the template, not much else has been done so
> > > > far...
> > > > > >> >> finding
> > > > > >> >> >>>>> hadoop libraries and adding it to the package path on
> > > > > >> initialization
> > > > > >> >> >>>>> via "hadoop classpath"... adding Crunch jars and its
> > > > > >> non-"provided"
> > > > > >> >> >>>>> transitives to the crunchR's java part...
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> No legal stuff...
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> No readmes... complete stealth at this point.
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <
> > > > > >> >> dlieu.7@gmail.com>
> > > > > >> >> >>>>> wrote:
> > > > > >> >> >>>>> > Ok, cool. I will try to roll project template by some
> > > time
> > > > > next
> > > > > >> >> week.
> > > > > >> >> >>>>> > we can start with prototyping and benchmarking
> > something
> > > > > really
> > > > > >> >> >>>>> > simple, such as parallelDo().
> > > > > >> >> >>>>> >
> > > > > >> >> >>>>> > My interim goal is to perhaps take some more or less
> > > simple
> > > > > >> >> algorithm
> > > > > >> >> >>>>> > from Mahout and demonstrate it can be solved with
> > Rcrunch
> > > > (or
> > > > > >> >> whatever
> > > > > >> >> >>>>> > name it has to be) in a comparable time (performance)
> > but
> > > > > with
> > > > > >> much
> > > > > >> >> >>>>> > fewer lines of code. (say one of factorization or
> > > > clustering
> > > > > >> >> things)
> > > > > >> >> >>>>> >
> > > > > >> >> >>>>> >
> > > > > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <
> > > rsharma@xebia.com
> > > > >
> > > > > >> wrote:
> > > > > >> >> >>>>> >> I am not much of R user but I am interested to see
> how
> > > > well
> > > > > we
> > > > > >> can
> > > > > >> >> >>>>> integrate
> > > > > >> >> >>>>> >> the two. I would be happy to help.
> > > > > >> >> >>>>> >>
> > > > > >> >> >>>>> >> regards,
> > > > > >> >> >>>>> >> Rahul
> > > > > >> >> >>>>> >>
> > > > > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
> > > > > >> >> >>>>> >>>
> > > > > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <
> > > > > >> >> dlieu.7@gmail.com>
> > > > > >> >> >>>>> >>> wrote:
> > > > > >> >> >>>>> >>>>
> > > > > >> >> >>>>> >>>> Yep, ok.
> > > > > >> >> >>>>> >>>>
> > > > > >> >> >>>>> >>>> I imagine it has to be an R module so I can set
> up a
> > > > maven
> > > > > >> >> project
> > > > > >> >> >>>>> >>>> with java/R code tree (I have been doing that a
> lot
> > > > > lately).
> > > > > >> Or
> > > > > >> >> if you
> > > > > >> >> >>>>> >>>> have a template to look at, it would be useful i
> > guess
> > > > > too.
> > > > > >> >> >>>>> >>>
> > > > > >> >> >>>>> >>> No, please go right ahead.
> > > > > >> >> >>>>> >>>
> > > > > >> >> >>>>> >>>>
> > > > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <
> > > > > >> >> josh.wills@gmail.com>
> > > > > >> >> >>>>> wrote:
> > > > > >> >> >>>>> >>>>>
> > > > > >> >> >>>>> >>>>> I'd like it to be separate at first, but I am
> happy
> > > to
> > > > > help.
> > > > > >> >> Github
> > > > > >> >> >>>>> >>>>> repo?
> > > > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <
> > > > > >> dlieu.7@gmail.com
> > > > > >> >> >
> > > > > >> >> >>>>> wrote:
> > > > > >> >> >>>>> >>>>>
> > > > > >> >> >>>>> >>>>>> Ok maybe there's a benefit to try a JRI/RJava
> > > > prototype
> > > > > on
> > > > > >> >> top of
> > > > > >> >> >>>>> >>>>>> Crunch for something simple. This should both
> save
> > > > time
> > > > > and
> > > > > >> >> prove or
> > > > > >> >> >>>>> >>>>>> disprove if Crunch via RJava integration is
> > viable.
> > > > > >> >> >>>>> >>>>>>
> > > > > >> >> >>>>> >>>>>> On my part i can try to do it within Crunch
> > > framework
> > > > > or we
> > > > > >> >> can keep
> > > > > >> >> >>>>> >>>>>> it completely separate.
> > > > > >> >> >>>>> >>>>>>
> > > > > >> >> >>>>> >>>>>> -d
> > > > > >> >> >>>>> >>>>>>
> > > > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <
> > > > > >> >> jwills@cloudera.com>
> > > > > >> >> >>>>> >>>>>> wrote:
> > > > > >> >> >>>>> >>>>>>>
> > > > > >> >> >>>>> >>>>>>> I am an avid R user and would be into it-- who
> > gave
> > > > the
> > > > > >> >> talk? Was
> > > > > >> >> >>>>> it
> > > > > >> >> >>>>> >>>>>>> Murray Stokely?
> > > > > >> >> >>>>> >>>>>>>
> > > > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy
> > Lyubimov <
> > > > > >> >> >>>>> dlieu.7@gmail.com>
> > > > > >> >> >>>>> >>>>>>
> > > > > >> >> >>>>> >>>>>> wrote:
> > > > > >> >> >>>>> >>>>>>>>
> > > > > >> >> >>>>> >>>>>>>> Hello,
> > > > > >> >> >>>>> >>>>>>>>
> > > > > >> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's
> > > experience
> > > > > of R
> > > > > >> >> mapping
> > > > > >> >> >>>>> of
> > > > > >> >> >>>>> >>>>>>>> flume java on one of recent BARUGs. I think a
> > lot
> > > of
> > > > > >> >> applications
> > > > > >> >> >>>>> >>>>>>>> similar to what we do in Mahout could be
> > > prototyped
> > > > > using
> > > > > >> >> flume R.
> > > > > >> >> >>>>> >>>>>>>>
> > > > > >> >> >>>>> >>>>>>>> I did not quite get the details of Google
> > > > > implementation
> > > > > >> of
> > > > > >> >> R
> > > > > >> >> >>>>> >>>>>>>> mapping,
> > > > > >> >> >>>>> >>>>>>>> but i am not sure if just a direct mapping
> from
> > R
> > > to
> > > > > >> Crunch
> > > > > >> >> would
> > > > > >> >> >>>>> be
> > > > > >> >> >>>>> >>>>>>>> sufficient (and, for most part, efficient).
> > > > RJava/JRI
> > > > > and
> > > > > >> >> jni
> > > > > >> >> >>>>> seem to
> > > > > >> >> >>>>> >>>>>>>> be a pretty terrible performer to do that
> > > directly.
> > > > > >> >> >>>>> >>>>>>>>
> > > > > >> >> >>>>> >>>>>>>>
> > > > > >> >> >>>>> >>>>>>>> on top of it, I am thinking if this project
> > could
> > > > > have a
> > > > > >> >> >>>>> contributed
> > > > > >> >> >>>>> >>>>>>>> adapter to Mahout's distributed matrices, that
> > > would
> > > > > be
> > > > > >> >> just a
> > > > > >> >> >>>>> very
> > > > > >> >> >>>>> >>>>>>>> good synergy.
> > > > > >> >> >>>>> >>>>>>>>
> > > > > >> >> >>>>> >>>>>>>> Is there anyone interested in
> > > contributing/advising
> > > > > for
> > > > > >> open
> > > > > >> >> >>>>> source
> > > > > >> >> >>>>> >>>>>>>> version of flume R support? Just gauging
> > interest,
> > > > > Crunch
> > > > > >> >> list
> > > > > >> >> >>>>> seems
> > > > > >> >> >>>>> >>>>>>>> like a natural place to poke.
> > > > > >> >> >>>>> >>>>>>>>
> > > > > >> >> >>>>> >>>>>>>> Thanks .
> > > > > >> >> >>>>> >>>>>>>>
> > > > > >> >> >>>>> >>>>>>>> -Dmitriy
> > > > > >> >> >>>>> >>>>>>>
> > > > > >> >> >>>>> >>>>>>>
> > > > > >> >> >>>>> >>>>>>>
> > > > > >> >> >>>>> >>>>>>> --
> > > > > >> >> >>>>> >>>>>>> Director of Data Science
> > > > > >> >> >>>>> >>>>>>> Cloudera
> > > > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
> > > > > >> >> >>>>> >>>
> > > > > >> >> >>>>> >>>
> > > > > >> >> >>>>> >>>
> > > > > >> >> >>>>> >>
> > > > > >> >> >>>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>> --
> > > > > >> >> >>>> Director of Data Science
> > > > > >> >> >>>> Cloudera <http://www.cloudera.com>
> > > > > >> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > > > >> >>
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > --
> > > > > >> > Director of Data Science
> > > > > >> > Cloudera <http://www.cloudera.com>
> > > > > >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > > > >>
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Director of Data Science
> > > > > > Cloudera <http://www.cloudera.com>
> > > > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>

Re: Flume R -- any interest?

Posted by Josh Wills <jw...@cloudera.com>.
On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> so java tasks need to be able to load libjri.so from
> whatever system.file("jri", package="rJava") says.
>
> Traditionally, these issues were handled with -Djava.library.path.
> Apparently there's nothing a java task can do to enable the loadLibrary()
> call to see the damn library once started. But -Djava.library.path requires
> nodes to configure and lock the jvm command line against modifications by
> the client, which is fine.
>
> I also discovered that LD_LIBRARY_PATH actually works with jre 1.6 (again).
>
> but... any other suggestions about best practice configuring crunch to run
> user's .so's?
>

Not off the top of my head. I suspect that whatever you come up with will
become the "best practice." :)
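For what it's worth, a sketch of wiring the LD_LIBRARY_PATH and
-Djava.library.path ideas above into a job configuration. java.util.Properties
stands in for Hadoop's Configuration so the sketch stays self-contained;
"mapred.child.env" and "mapred.child.java.opts" are the Hadoop 1.x property
names for child-task environment and JVM flags, and the JRI path would come
from system.file("jri", package="rJava") on each node.

```java
import java.util.Properties;

public class NativeLibEnvSketch {

  // Sets the child-task environment and JVM flags so that
  // System.loadLibrary("jri") can resolve libjri.so on the backend.
  // In a real pipeline these keys would be set on the job's
  // org.apache.hadoop.conf.Configuration instead of Properties.
  public static Properties withJriPath(Properties conf, String jriPath) {
    // Format is "NAME=value,NAME2=value2"; "X=$X:extra" appends on Unix.
    conf.setProperty("mapred.child.env",
        "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:" + jriPath);
    // Alternative route: pass -Djava.library.path on the child command line.
    conf.setProperty("mapred.child.java.opts",
        "-Xmx512m -Djava.library.path=" + jriPath);
    return conf;
  }

  public static void main(String[] args) {
    Properties conf =
        withJriPath(new Properties(), "/usr/lib/R/site-library/rJava/jri");
    System.out.println(conf.getProperty("mapred.child.env"));
  }
}
```

Note this still assumes rJava is installed at the same path on every node,
which is exactly the versioning concern raised earlier in the thread.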

>
> thanks.
>
>
>
>
>
>
> On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <jo...@gmail.com> wrote:
>
> > I believe that is a safe assumption, at least right now.
> >
> >
> > On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> > > Question.
> > >
> > > So in the Crunch api, initialize() doesn't get an emitter, and the
> > > process() call gets an emitter every time.
> > >
> > > However, my guess is any single reincarnation of a DoFn object in the
> > backend
> > > will always be getting the same emitter thru its lifecycle. Is it an
> > > admissible assumption, or is there currently a counterexample to that?
> > >
> > > The problem is that as i implement the two way pipeline of input and
> > > emitter data between R and Java, I am bulking these calls together for
> > > performance reasons. Each individual datum in these chunks of data will
> > not
> > > have attached emitter function information to them in any way. (well it
> > > could but it would be a performance killer and i bet emitter never
> > > changes).
> > >
> > > So, thoughts? can i assume the emitter never changes between the first
> > > and last call to a DoFn instance?
> > >
> > > thanks.
> > >
> > >
> > > On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > > wrote:
> > >
> > > > yes...
> > > >
> > > > i think it worked for me before, although just adding all jars from R
> > > > package distribution would be a little bit more appropriate approach
> > > > -- but it creates a problem with jars in dependent R packages. I
> think
> > > > it would be much easier to just compile a hadoop-job file and stick
> it
> > > > in rather than doing cherry-picking of individual jars from who knows
> > > > how many locations.
> > > >
> > > > i think i used the hadoop job format with distributed cache before
> and
> > > > it worked... at least with Pig "register jar" functionality.
> > > >
> > > > ok i guess i will just try if it works.
> > > >
> > > > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <jw...@cloudera.com>
> > wrote:
> > > > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <
> dlieu.7@gmail.com
> > >
> > > > wrote:
> > > > >
> > > > >> Great! so it is in Crunch.
> > > > >>
> > > > >> does it support hadoop-job jar format or only pure java jars?
> > > > >>
> > > > >
> > > > > I think just pure jars-- you're referring to hadoop-job format as
> > > having
> > > > > all the dependencies in a lib/ directory within the jar?
> > > > >
> > > > >
> > > > >>
> > > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <jw...@cloudera.com>
> > > > wrote:
> > > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <
> > > dlieu.7@gmail.com>
> > > > >> wrote:
> > > > >> >
> > > > >> >> I think i need functionality to add more jars (or external
> > > > hadoop-jar)
> > > > >> >> to drive that from an R package. Just setting job jar by class
> is
> > > not
> > > > >> >> enough. I can push overall job-jar as an additional jar to R
> > > package;
> > > > >> >> however, i cannot really run hadoop command line on it, i need
> to
> > > set
> > > > >> >> up classpath thru RJava.
> > > > >> >>
> > > > >> >> Traditional single hadoop job jar will unlikely work here since
> > we
> > > > >> >> cannot hardcode pipelines in java code but rather have to
> > construct
> > > > >> >> them on the fly. (well, we could serialize pipeline definitions
> > > from
> > > > R
> > > > >> >> and then replay them in a driver -- but that's too cumbersome
> and
> > > > more
> > > > >> >> work than it has to be.) There's no reason why i shouldn't be
> > able
> > > to
> > > > >> >> do pig-like "register jar" or "setJobJar" (mahout-like) when
> > > kicking
> > > > >> >> off a pipeline.
> > > > >> >>
> > > > >> >
> > > > >> > o.a.c.util.DistCache.addJarToDistributedCache?
> > > > >> >
> > > > >> >
> > > > >> >>
> > > > >> >>
> > > > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <
> > > > dlieu.7@gmail.com>
> > > > >> >> wrote:
> > > > >> >> > Ok, sounds very promising...
> > > > >> >> >
> > > > >> >> > i'll try to start digging on the driver part this week then
> > > > (Pipeline
> > > > >> >> > wrapper in R5).
> > > > >> >> >
> > > > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <
> > > josh.wills@gmail.com
> > > > >
> > > > >> >> wrote:
> > > > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <
> > > > dlieu.7@gmail.com
> > > > >> >
> > > > >> >> wrote:
> > > > >> >> >>> Ok, cool.
> > > > >> >> >>>
> > > > >> >> >>> So what state is Crunch in? I take it is in a fairly
> advanced
> > > > state.
> > > > >> >> >>> So every api mentioned in the  FlumeJava paper is working ,
> > > > right?
> > > > >> Or
> > > > >> >> >>> there's something that is not working specifically?
> > > > >> >> >>
> > > > >> >> >> I think the only thing in the paper that we don't have in a
> > > > working
> > > > >> >> >> state is MSCR fusion. It's mostly just a question of
> > > prioritizing
> > > > it
> > > > >> >> >> and getting the work done.
> > > > >> >> >>
> > > > >> >> >>>
> > > > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <
> > > jwills@cloudera.com
> > > > >
> > > > >> >> wrote:
> > > > >> >> >>>> Hey Dmitriy,
> > > > >> >> >>>>
> > > > >> >> >>>> Got a fork going and looking forward to playing with
> crunchR
> > > > this
> > > > >> >> weekend--
> > > > >> >> >>>> thanks!
> > > > >> >> >>>>
> > > > >> >> >>>> J
> > > > >> >> >>>>
> > > > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <
> > > > >> dlieu.7@gmail.com>
> > > > >> >> wrote:
> > > > >> >> >>>>
> > > > >> >> >>>>> Project template https://github.com/dlyubimov/crunchR
> > > > >> >> >>>>>
> > > > >> >> >>>>> Default profile does not compile R artifact . R profile
> > > > compiles R
> > > > >> >> >>>>> artifact. for convenience, it is enabled by supplying -DR
> > to
> > > > mvn
> > > > >> >> >>>>> command line, e.g.
> > > > >> >> >>>>>
> > > > >> >> >>>>> mvn install -DR
> > > > >> >> >>>>>
> > > > >> >> >>>>> there's also a helper that installs the snapshot version
> of
> > > the
> > > > >> >> >>>>> package in the crunchR module.
> > > > >> >> >>>>>
> > > > >> >> >>>>> There's RJava and JRI java dependencies which i did not
> > find
> > > > >> anywhere
> > > > >> >> >>>>> in public maven repos; so it is installed into my github
> > > maven
> > > > >> repo
> > > > >> >> so
> > > > >> >> >>>>> far. Should compile for 3rd party.
> > > > >> >> >>>>>
> > > > >> >> >>>>> -DR compilation requires R, RJava and optionally,
> > RProtoBuf.
> > > R
> > > > Doc
> > > > >> >> >>>>> compilation requires roxygen2 (i think).
> > > > >> >> >>>>>
> > > > >> >> >>>>> For some reason RProtoBuf fails to import into another
> > > package,
> > > > >> got a
> > > > >> >> >>>>> weird exception when i put @import RProtoBuf into
> crunchR,
> > so
> > > > >> >> >>>>> RProtoBuf is now in "Suggests" category. Down the road
> that
> > > may
> > > > >> be a
> > > > >> >> >>>>> problem though...
> > > > >> >> >>>>>
> > > > >> >> >>>>> other than the template, not much else has been done so
> > > far...
> > > > >> >> finding
> > > > >> >> >>>>> hadoop libraries and adding it to the package path on
> > > > >> initialization
> > > > >> >> >>>>> via "hadoop classpath"... adding Crunch jars and its
> > > > >> non-"provided"
> > > > >> >> >>>>> transitives to the crunchR's java part...
> > > > >> >> >>>>>
> > > > >> >> >>>>> No legal stuff...
> > > > >> >> >>>>>
> > > > >> >> >>>>> No readmes... complete stealth at this point.
> > > > >> >> >>>>>
> > > > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <
> > > > >> >> dlieu.7@gmail.com>
> > > > >> >> >>>>> wrote:
> > > > >> >> >>>>> > Ok, cool. I will try to roll project template by some
> > time
> > > > next
> > > > >> >> week.
> > > > >> >> >>>>> > we can start with prototyping and benchmarking
> something
> > > > really
> > > > >> >> >>>>> > simple, such as parallelDo().
> > > > >> >> >>>>> >
> > > > >> >> >>>>> > My interim goal is to perhaps take some more or less
> > simple
> > > > >> >> algorithm
> > > > >> >> >>>>> > from Mahout and demonstrate it can be solved with
> Rcrunch
> > > (or
> > > > >> >> whatever
> > > > >> >> >>>>> > name it has to be) in a comparable time (performance)
> but
> > > > with
> > > > >> much
> > > > >> >> >>>>> > fewer lines of code. (say one of factorization or
> > > clustering
> > > > >> >> things)
> > > > >> >> >>>>> >
> > > > >> >> >>>>> >
> > > > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <
> > rsharma@xebia.com
> > > >
> > > > >> wrote:
> > > > >> >> >>>>> >> I am not much of R user but I am interested to see how
> > > well
> > > > we
> > > > >> can
> > > > >> >> >>>>> integrate
> > > > >> >> >>>>> >> the two. I would be happy to help.
> > > > >> >> >>>>> >>
> > > > >> >> >>>>> >> regards,
> > > > >> >> >>>>> >> Rahul
> > > > >> >> >>>>> >>
> > > > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
> > > > >> >> >>>>> >>>
> > > > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <
> > > > >> >> dlieu.7@gmail.com>
> > > > >> >> >>>>> >>> wrote:
> > > > >> >> >>>>> >>>>
> > > > >> >> >>>>> >>>> Yep, ok.
> > > > >> >> >>>>> >>>>
> > > > >> >> >>>>> >>>> I imagine it has to be an R module so I can set up a
> > > maven
> > > > >> >> project
> > > > >> >> >>>>> >>>> with java/R code tree (I have been doing that a lot
> > > > lately).
> > > > >> Or
> > > > >> >> if you
> > > > >> >> >>>>> >>>> have a template to look at, it would be useful i
> guess
> > > > too.
> > > > >> >> >>>>> >>>
> > > > >> >> >>>>> >>> No, please go right ahead.
> > > > >> >> >>>>> >>>
> > > > >> >> >>>>> >>>>
> > > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <
> > > > >> >> josh.wills@gmail.com>
> > > > >> >> >>>>> wrote:
> > > > >> >> >>>>> >>>>>
> > > > >> >> >>>>> >>>>> I'd like it to be separate at first, but I am happy
> > to
> > > > help.
> > > > >> >> Github
> > > > >> >> >>>>> >>>>> repo?
> > > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <
> > > > >> dlieu.7@gmail.com
> > > > >> >> >
> > > > >> >> >>>>> wrote:
> > > > >> >> >>>>> >>>>>
> > > > >> >> >>>>> >>>>>> Ok maybe there's a benefit to try a JRI/RJava
> > > prototype
> > > > on
> > > > >> >> top of
> > > > >> >> >>>>> >>>>>> Crunch for something simple. This should both save
> > > time
> > > > and
> > > > >> >> prove or
> > > > >> >> >>>>> >>>>>> disprove if Crunch via RJava integration is
> viable.
> > > > >> >> >>>>> >>>>>>
> > > > >> >> >>>>> >>>>>> On my part i can try to do it within Crunch
> > framework
> > > > or we
> > > > >> >> can keep
> > > > >> >> >>>>> >>>>>> it completely separate.
> > > > >> >> >>>>> >>>>>>
> > > > >> >> >>>>> >>>>>> -d
> > > > >> >> >>>>> >>>>>>
> > > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <
> > > > >> >> jwills@cloudera.com>
> > > > >> >> >>>>> >>>>>> wrote:
> > > > >> >> >>>>> >>>>>>>
> > > > >> >> >>>>> >>>>>>> I am an avid R user and would be into it-- who
> gave
> > > the
> > > > >> >> talk? Was
> > > > >> >> >>>>> it
> > > > >> >> >>>>> >>>>>>> Murray Stokely?
> > > > >> >> >>>>> >>>>>>>
> > > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy
> Lyubimov <
> > > > >> >> >>>>> dlieu.7@gmail.com>
> > > > >> >> >>>>> >>>>>>
> > > > >> >> >>>>> >>>>>> wrote:
> > > > >> >> >>>>> >>>>>>>>
> > > > >> >> >>>>> >>>>>>>> Hello,
> > > > >> >> >>>>> >>>>>>>>
> > > > >> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's
> > experience
> > > > of R
> > > > >> >> mapping
> > > > >> >> >>>>> of
> > > > >> >> >>>>> >>>>>>>> flume java on one of recent BARUGs. I think a
> lot
> > of
> > > > >> >> applications
> > > > >> >> >>>>> >>>>>>>> similar to what we do in Mahout could be
> > prototyped
> > > > using
> > > > >> >> flume R.
> > > > >> >> >>>>> >>>>>>>>
> > > > >> >> >>>>> >>>>>>>> I did not quite get the details of Google
> > > > implementation
> > > > >> of
> > > > >> >> R
> > > > >> >> >>>>> >>>>>>>> mapping,
> > > > >> >> >>>>> >>>>>>>> but i am not sure if just a direct mapping from
> R
> > to
> > > > >> Crunch
> > > > >> >> would
> > > > >> >> >>>>> be
> > > > >> >> >>>>> >>>>>>>> sufficient (and, for most part, efficient).
> > > RJava/JRI
> > > > and
> > > > >> >> jni
> > > > >> >> >>>>> seem to
> > > > >> >> >>>>> >>>>>>>> be a pretty terrible performer to do that
> > directly.
> > > > >> >> >>>>> >>>>>>>>
> > > > >> >> >>>>> >>>>>>>>
> > > > >> >> >>>>> >>>>>>>> on top of it, I am thinking if this project
> could
> > > > have a
> > > > >> >> >>>>> contributed
> > > > >> >> >>>>> >>>>>>>> adapter to Mahout's distributed matrices, that
> > would
> > > > be
> > > > >> >> just a
> > > > >> >> >>>>> very
> > > > >> >> >>>>> >>>>>>>> good synergy.
> > > > >> >> >>>>> >>>>>>>>
> > > > >> >> >>>>> >>>>>>>> Is there anyone interested in
> > contributing/advising
> > > > for
> > > > >> open
> > > > >> >> >>>>> source
> > > > >> >> >>>>> >>>>>>>> version of flume R support? Just gauging
> interest,
> > > > Crunch
> > > > >> >> list
> > > > >> >> >>>>> seems
> > > > >> >> >>>>> >>>>>>>> like a natural place to poke.
> > > > >> >> >>>>> >>>>>>>>
> > > > >> >> >>>>> >>>>>>>> Thanks .
> > > > >> >> >>>>> >>>>>>>>
> > > > >> >> >>>>> >>>>>>>> -Dmitriy
> > > > >> >> >>>>> >>>>>>>
> > > > >> >> >>>>> >>>>>>>
> > > > >> >> >>>>> >>>>>>>
> > > > >> >> >>>>> >>>>>>> --
> > > > >> >> >>>>> >>>>>>> Director of Data Science
> > > > >> >> >>>>> >>>>>>> Cloudera
> > > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
> > > > >> >> >>>>> >>>
> > > > >> >> >>>>> >>>
> > > > >> >> >>>>> >>>
> > > > >> >> >>>>> >>
> > > > >> >> >>>>>
> > > > >> >> >>>>
> > > > >> >> >>>>
> > > > >> >> >>>>
> > > > >> >> >>>> --
> > > > >> >> >>>> Director of Data Science
> > > > >> >> >>>> Cloudera <http://www.cloudera.com>
> > > > >> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > > >> >>
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> > Director of Data Science
> > > > >> > Cloudera <http://www.cloudera.com>
> > > > >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Director of Data Science
> > > > > Cloudera <http://www.cloudera.com>
> > > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > >
> > >
> >
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: Flume R -- any interest?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
so Java tasks need to be able to load libjri.so from
wherever system.file("jri", package="rJava") points.

Traditionally, these issues were handled with -Djava.library.path.
Apparently there's nothing a Java task can do to let a loadLibrary() call
see the library once the JVM has started. But -Djava.library.path requires
the nodes to configure the JVM command line and lock it against
modification by the client, which is fine.

I also discovered that LD_LIBRARY_PATH actually works with JRE 1.6 (again).

but... any other suggestions on best practices for configuring Crunch to
run a user's .so's?

thanks.
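
For what it's worth, the property-based route can be sketched as follows.
This is a hypothetical sketch, not tested against a cluster: the property
names are the MR1-era ones ("mapred.child.java.opts", "mapred.child.env"),
and the JRI directory is an assumed, node-specific path.

```java
// Hypothetical sketch: hand the JRI native-library location to task JVMs
// through the job configuration. The path below is assumed; real code
// would derive it from system.file("jri", package="rJava") on the nodes.
public class JriTaskConfig {
  static final String JRI_DIR = "/usr/lib/R/site-library/rJava/jri"; // assumed

  static String childOpts() {
    // MR1-era value for "mapred.child.java.opts"
    return "-Xmx512m -Djava.library.path=" + JRI_DIR;
  }

  static String childEnv() {
    // MR1-era value for "mapred.child.env" (comma-separated KEY=VALUE pairs)
    return "LD_LIBRARY_PATH=" + JRI_DIR;
  }

  public static void main(String[] args) {
    // With Hadoop/Crunch on the classpath, the wiring would look roughly like:
    //   Configuration conf = new Configuration();
    //   conf.set("mapred.child.java.opts", childOpts());
    //   conf.set("mapred.child.env", childEnv());
    //   Pipeline pipeline = new MRPipeline(Driver.class, conf);
    System.out.println(childOpts());
    System.out.println(childEnv());
  }
}
```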

On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <jo...@gmail.com> wrote:

> I believe that is a safe assumption, at least right now.
>
>
> On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > Question.
> >
> > So in Crunch api, initialize() doesn't get an emitter. and the process
> gets
> > emitter every time.
> >
> > However, my guess any single reincarnation of a DoFn object in the
> backend
> > will always be getting the same emitter thru its lifecycle. Is it an
> > admissible assumption or there's currently a counter example to that?
> >
> > The problem is that as i implement the two way pipeline of input and
> > emitter data between R and Java, I am bulking these calls together for
> > performance reasons. Each individual datum in these chunks of data will
> not
> > have attached emitter function information to them in any way. (well it
> > could but it would be a performance killer and i bet emitter never
> > changes).
> >
> > So, thoughts? can i assume emitter never changes between first and last
> > call to DoFn instance?
> >
> > thanks.
> >
> >
> > On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> > > yes...
> > >
> > > i think it worked for me before, although just adding all jars from R
> > > package distribution would be a little bit more appropriate approach
> > > -- but it creates a problem with jars in dependent R packages. I think
> > > it would be much easier to just compile a hadoop-job file and stick it
> > > in rather than doing cherry-picking of individual jars from who knows
> > > how many locations.
> > >
> > > i think i used the hadoop job format with distributed cache before and
> > > it worked... at least with Pig "register jar" functionality.
> > >
> > > ok i guess i will just try if it works.
> > >
> > > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <jw...@cloudera.com>
> wrote:
> > > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >
> > > wrote:
> > > >
> > > >> Great! so it is in Crunch.
> > > >>
> > > >> does it support hadoop-job jar format or only pure java jars?
> > > >>
> > > >
> > > > I think just pure jars-- you're referring to hadoop-job format as
> > having
> > > > all the dependencies in a lib/ directory within the jar?
> > > >
> > > >
> > > >>
> > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <jw...@cloudera.com>
> > > wrote:
> > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <
> > dlieu.7@gmail.com>
> > > >> wrote:
> > > >> >
> > > >> >> I think i need functionality to add more jars (or external
> > > hadoop-jar)
> > > >> >> to drive that from an R package. Just setting job jar by class is
> > not
> > > >> >> enough. I can push overall job-jar as an additional jar to R
> > package;
> > > >> >> however, i cannot really run hadoop command line on it, i need to
> > set
> > > >> >> up classpath thru RJava.
> > > >> >>
> > > >> >> Traditional single hadoop job jar will unlikely work here since
> we
> > > >> >> cannot hardcode pipelines in java code but rather have to
> construct
> > > >> >> them on the fly. (well, we could serialize pipeline definitions
> > from
> > > R
> > > >> >> and then replay them in a driver -- but that's too cumbersome and
> > > more
> > > >> >> work than it has to be.) There's no reason why i shouldn't be
> able
> > to
> > > >> >> do pig-like "register jar" or "setJobJar" (mahout-like) when
> > kicking
> > > >> >> off a pipeline.
> > > >> >>
> > > >> >
> > > >> > o.a.c.util.DistCache.addJarToDistributedCache?
> > > >> >
> > > >> >
> > > >> >>
> > > >> >>
> > > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <
> > > dlieu.7@gmail.com>
> > > >> >> wrote:
> > > >> >> > Ok, sounds very promising...
> > > >> >> >
> > > >> >> > i'll try to start digging on the driver part this week then
> > > (Pipeline
> > > >> >> > wrapper in R5).
> > > >> >> >
> > > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <
> > josh.wills@gmail.com
> > > >
> > > >> >> wrote:
> > > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <
> > > dlieu.7@gmail.com
> > > >> >
> > > >> >> wrote:
> > > >> >> >>> Ok, cool.
> > > >> >> >>>
> > > >> >> >>> So what state is Crunch in? I take it is in a fairly advanced
> > > state.
> > > >> >> >>> So every api mentioned in the  FlumeJava paper is working ,
> > > right?
> > > >> Or
> > > >> >> >>> there's something that is not working specifically?
> > > >> >> >>
> > > >> >> >> I think the only thing in the paper that we don't have in a
> > > working
> > > >> >> >> state is MSCR fusion. It's mostly just a question of
> > prioritizing
> > > it
> > > >> >> >> and getting the work done.
> > > >> >> >>
> > > >> >> >>>
> > > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <
> > jwills@cloudera.com
> > > >
> > > >> >> wrote:
> > > >> >> >>>> Hey Dmitriy,
> > > >> >> >>>>
> > > >> >> >>>> Got a fork going and looking forward to playing with crunchR
> > > this
> > > >> >> weekend--
> > > >> >> >>>> thanks!
> > > >> >> >>>>
> > > >> >> >>>> J
> > > >> >> >>>>
> > > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <
> > > >> dlieu.7@gmail.com>
> > > >> >> wrote:
> > > >> >> >>>>
> > > >> >> >>>>> Project template https://github.com/dlyubimov/crunchR
> > > >> >> >>>>>
> > > >> >> >>>>> Default profile does not compile R artifact . R profile
> > > compiles R
> > > >> >> >>>>> artifact. for convenience, it is enabled by supplying -DR
> to
> > > mvn
> > > >> >> >>>>> command line, e.g.
> > > >> >> >>>>>
> > > >> >> >>>>> mvn install -DR
> > > >> >> >>>>>
> > > >> >> >>>>> there's also a helper that installs the snapshot version of
> > the
> > > >> >> >>>>> package in the crunchR module.
> > > >> >> >>>>>
> > > >> >> >>>>> There's RJava and JRI java dependencies which i did not
> find
> > > >> anywhere
> > > >> >> >>>>> in public maven repos; so it is installed into my github
> > maven
> > > >> repo
> > > >> >> so
> > > >> >> >>>>> far. Should compile for 3rd party.
> > > >> >> >>>>>
> > > >> >> >>>>> -DR compilation requires R, RJava and optionally,
> RProtoBuf.
> > R
> > > Doc
> > > >> >> >>>>> compilation requires roxygen2 (i think).
> > > >> >> >>>>>
> > > >> >> >>>>> For some reason RProtoBuf fails to import into another
> > package,
> > > >> got a
> > > >> >> >>>>> weird exception when i put @import RProtoBuf into crunchR,
> so
> > > >> >> >>>>> RProtoBuf is now in "Suggests" category. Down the road that
> > may
> > > >> be a
> > > >> >> >>>>> problem though...
> > > >> >> >>>>>
> > > >> >> >>>>> other than the template, not much else has been done so
> > far...
> > > >> >> finding
> > > >> >> >>>>> hadoop libraries and adding it to the package path on
> > > >> initialization
> > > >> >> >>>>> via "hadoop classpath"... adding Crunch jars and its
> > > >> non-"provided"
> > > >> >> >>>>> transitives to the crunchR's java part...
> > > >> >> >>>>>
> > > >> >> >>>>> No legal stuff...
> > > >> >> >>>>>
> > > >> >> >>>>> No readmes... complete stealth at this point.
> > > >> >> >>>>>
> > > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <
> > > >> >> dlieu.7@gmail.com>
> > > >> >> >>>>> wrote:
> > > >> >> >>>>> > Ok, cool. I will try to roll project template by some
> time
> > > next
> > > >> >> week.
> > > >> >> >>>>> > we can start with prototyping and benchmarking something
> > > really
> > > >> >> >>>>> > simple, such as parallelDo().
> > > >> >> >>>>> >
> > > >> >> >>>>> > My interim goal is to perhaps take some more or less
> simple
> > > >> >> algorithm
> > > >> >> >>>>> > from Mahout and demonstrate it can be solved with Rcrunch
> > (or
> > > >> >> whatever
> > > >> >> >>>>> > name it has to be) in a comparable time (performance) but
> > > with
> > > >> much
> > > >> >> >>>>> > fewer lines of code. (say one of factorization or
> > clustering
> > > >> >> things)
> > > >> >> >>>>> >
> > > >> >> >>>>> >
> > > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <
> rsharma@xebia.com
> > >
> > > >> wrote:
> > > >> >> >>>>> >> I am not much of R user but I am interested to see how
> > well
> > > we
> > > >> can
> > > >> >> >>>>> integrate
> > > >> >> >>>>> >> the two. I would be happy to help.
> > > >> >> >>>>> >>
> > > >> >> >>>>> >> regards,
> > > >> >> >>>>> >> Rahul
> > > >> >> >>>>> >>
> > > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
> > > >> >> >>>>> >>>
> > > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <
> > > >> >> dlieu.7@gmail.com>
> > > >> >> >>>>> >>> wrote:
> > > >> >> >>>>> >>>>
> > > >> >> >>>>> >>>> Yep, ok.
> > > >> >> >>>>> >>>>
> > > >> >> >>>>> >>>> I imagine it has to be an R module so I can set up a
> > maven
> > > >> >> project
> > > >> >> >>>>> >>>> with java/R code tree (I have been doing that a lot
> > > lately).
> > > >> Or
> > > >> >> if you
> > > >> >> >>>>> >>>> have a template to look at, it would be useful i guess
> > > too.
> > > >> >> >>>>> >>>
> > > >> >> >>>>> >>> No, please go right ahead.
> > > >> >> >>>>> >>>
> > > >> >> >>>>> >>>>
> > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <
> > > >> >> josh.wills@gmail.com>
> > > >> >> >>>>> wrote:
> > > >> >> >>>>> >>>>>
> > > >> >> >>>>> >>>>> I'd like it to be separate at first, but I am happy
> to
> > > help.
> > > >> >> Github
> > > >> >> >>>>> >>>>> repo?
> > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <
> > > >> dlieu.7@gmail.com
> > > >> >> >
> > > >> >> >>>>> wrote:
> > > >> >> >>>>> >>>>>
> > > >> >> >>>>> >>>>>> Ok maybe there's a benefit to try a JRI/RJava
> > prototype
> > > on
> > > >> >> top of
> > > >> >> >>>>> >>>>>> Crunch for something simple. This should both save
> > time
> > > and
> > > >> >> prove or
> > > >> >> >>>>> >>>>>> disprove if Crunch via RJava integration is viable.
> > > >> >> >>>>> >>>>>>
> > > >> >> >>>>> >>>>>> On my part i can try to do it within Crunch
> framework
> > > or we
> > > >> >> can keep
> > > >> >> >>>>> >>>>>> it completely separate.
> > > >> >> >>>>> >>>>>>
> > > >> >> >>>>> >>>>>> -d
> > > >> >> >>>>> >>>>>>
> > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <
> > > >> >> jwills@cloudera.com>
> > > >> >> >>>>> >>>>>> wrote:
> > > >> >> >>>>> >>>>>>>
> > > >> >> >>>>> >>>>>>> I am an avid R user and would be into it-- who gave
> > the
> > > >> >> talk? Was
> > > >> >> >>>>> it
> > > >> >> >>>>> >>>>>>> Murray Stokely?
> > > >> >> >>>>> >>>>>>>
> > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <
> > > >> >> >>>>> dlieu.7@gmail.com>
> > > >> >> >>>>> >>>>>>
> > > >> >> >>>>> >>>>>> wrote:
> > > >> >> >>>>> >>>>>>>>
> > > >> >> >>>>> >>>>>>>> Hello,
> > > >> >> >>>>> >>>>>>>>
> > > >> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's
> experience
> > > of R
> > > >> >> mapping
> > > >> >> >>>>> of
> > > >> >> >>>>> >>>>>>>> flume java on one of recent BARUGs. I think a lot
> of
> > > >> >> applications
> > > >> >> >>>>> >>>>>>>> similar to what we do in Mahout could be
> prototyped
> > > using
> > > >> >> flume R.
> > > >> >> >>>>> >>>>>>>>
> > > >> >> >>>>> >>>>>>>> I did not quite get the details of Google
> > > implementation
> > > >> of
> > > >> >> R
> > > >> >> >>>>> >>>>>>>> mapping,
> > > >> >> >>>>> >>>>>>>> but i am not sure if just a direct mapping from R
> to
> > > >> Crunch
> > > >> >> would
> > > >> >> >>>>> be
> > > >> >> >>>>> >>>>>>>> sufficient (and, for most part, efficient).
> > RJava/JRI
> > > and
> > > >> >> jni
> > > >> >> >>>>> seem to
> > > >> >> >>>>> >>>>>>>> be a pretty terrible performer to do that
> directly.
> > > >> >> >>>>> >>>>>>>>
> > > >> >> >>>>> >>>>>>>>
> > > >> >> >>>>> >>>>>>>> on top of it, I am thinking if this project could
> > > have a
> > > >> >> >>>>> contributed
> > > >> >> >>>>> >>>>>>>> adapter to Mahout's distributed matrices, that
> would
> > > be
> > > >> >> just a
> > > >> >> >>>>> very
> > > >> >> >>>>> >>>>>>>> good synergy.
> > > >> >> >>>>> >>>>>>>>
> > > >> >> >>>>> >>>>>>>> Is there anyone interested in
> contributing/advising
> > > for
> > > >> open
> > > >> >> >>>>> source
> > > >> >> >>>>> >>>>>>>> version of flume R support? Just gauging interest,
> > > Crunch
> > > >> >> list
> > > >> >> >>>>> seems
> > > >> >> >>>>> >>>>>>>> like a natural place to poke.
> > > >> >> >>>>> >>>>>>>>
> > > >> >> >>>>> >>>>>>>> Thanks .
> > > >> >> >>>>> >>>>>>>>
> > > >> >> >>>>> >>>>>>>> -Dmitriy
> > > >> >> >>>>> >>>>>>>
> > > >> >> >>>>> >>>>>>>
> > > >> >> >>>>> >>>>>>>
> > > >> >> >>>>> >>>>>>> --
> > > >> >> >>>>> >>>>>>> Director of Data Science
> > > >> >> >>>>> >>>>>>> Cloudera
> > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
> > > >> >> >>>>> >>>
> > > >> >> >>>>> >>>
> > > >> >> >>>>> >>>
> > > >> >> >>>>> >>
> > > >> >> >>>>>
> > > >> >> >>>>
> > > >> >> >>>>
> > > >> >> >>>>
> > > >> >> >>>> --
> > > >> >> >>>> Director of Data Science
> > > >> >> >>>> Cloudera <http://www.cloudera.com>
> > > >> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > >> >>
> > > >> >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > Director of Data Science
> > > >> > Cloudera <http://www.cloudera.com>
> > > >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Director of Data Science
> > > > Cloudera <http://www.cloudera.com>
> > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
> > >
> >
>

Re: Flume R -- any interest?

Posted by Josh Wills <jo...@gmail.com>.
I believe that is a safe assumption, at least right now.
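
For context, the batching pattern the question describes can be sketched
with stand-in types (Emitter/DoFn below only mimic the shape of the Crunch
interfaces; BatchingDoFn and its batch size are illustrative, not part of
any API):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins that mimic the shape of Crunch's Emitter/DoFn for illustration.
interface Emitter<T> { void emit(T value); }

abstract class DoFn<S, T> {
  public abstract void process(S input, Emitter<T> emitter);
  public void cleanup(Emitter<T> emitter) {}
}

// Captures the emitter on the first process() call and emits in chunks,
// relying on the emitter staying the same for the DoFn's whole lifecycle.
class BatchingDoFn extends DoFn<String, String> {
  private final int batchSize;
  private final List<String> buffer = new ArrayList<>();
  private Emitter<String> cachedEmitter;

  BatchingDoFn(int batchSize) { this.batchSize = batchSize; }

  @Override
  public void process(String input, Emitter<String> emitter) {
    if (cachedEmitter == null) {
      cachedEmitter = emitter; // captured once, per the assumption above
    }
    buffer.add(input);
    if (buffer.size() >= batchSize) {
      flush(); // in crunchR, a bulk round-trip to R would happen here
    }
  }

  @Override
  public void cleanup(Emitter<String> emitter) {
    flush(); // drain whatever is still buffered at end of life
  }

  private void flush() {
    for (String s : buffer) {
      cachedEmitter.emit(s);
    }
    buffer.clear();
  }
}
```

If the emitter ever did change between calls, buffered records would be
flushed through a stale emitter, which is exactly why the stability
guarantee matters here.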


On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Question.
>
> So in Crunch api, initialize() doesn't get an emitter. and the process gets
> emitter every time.
>
> However, my guess any single reincarnation of a DoFn object in the backend
> will always be getting the same emitter thru its lifecycle. Is it an
> admissible assumption or there's currently a counter example to that?
>
> The problem is that as i implement the two way pipeline of input and
> emitter data between R and Java, I am bulking these calls together for
> performance reasons. Each individual datum in these chunks of data will not
> have attached emitter function information to them in any way. (well it
> could but it would be a performance killer and i bet emitter never
> changes).
>
> So, thoughts? can i assume emitter never changes between first and last
> call to DoFn instance?
>
> thanks.
>
>
> On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > yes...
> >
> > i think it worked for me before, although just adding all jars from R
> > package distribution would be a little bit more appropriate approach
> > -- but it creates a problem with jars in dependent R packages. I think
> > it would be much easier to just compile a hadoop-job file and stick it
> > in rather than doing cherry-picking of individual jars from who knows
> > how many locations.
> >
> > i think i used the hadoop job format with distributed cache before and
> > it worked... at least with Pig "register jar" functionality.
> >
> > ok i guess i will just try if it works.
> >
> > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <jw...@cloudera.com> wrote:
> > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> > >
> > >> Great! so it is in Crunch.
> > >>
> > >> does it support hadoop-job jar format or only pure java jars?
> > >>
> > >
> > > I think just pure jars-- you're referring to hadoop-job format as
> having
> > > all the dependencies in a lib/ directory within the jar?
> > >
> > >
> > >>
> > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <jw...@cloudera.com>
> > wrote:
> > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <
> dlieu.7@gmail.com>
> > >> wrote:
> > >> >
> > >> >> I think i need functionality to add more jars (or external
> > hadoop-jar)
> > >> >> to drive that from an R package. Just setting job jar by class is
> not
> > >> >> enough. I can push overall job-jar as an additional jar to R package;
> package;
> > >> >> however, i cannot really run hadoop command line on it, i need to
> set
> > >> >> up classpath thru RJava.
> > >> >>
> > >> >> Traditional single hadoop job jar will unlikely work here since we
> > >> >> cannot hardcode pipelines in java code but rather have to construct
> > >> >> them on the fly. (well, we could serialize pipeline definitions
> from
> > R
> > >> >> and then replay them in a driver -- but that's too cumbersome and
> > more
> > >> >> work than it has to be.) There's no reason why i shouldn't be able
> to
> > >> >> do pig-like "register jar" or "setJobJar" (mahout-like) when
> kicking
> > >> >> off a pipeline.
> > >> >>
> > >> >
> > >> > o.a.c.util.DistCache.addJarToDistributedCache?
> > >> >
> > >> >
> > >> >>
> > >> >>
> > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <
> > dlieu.7@gmail.com>
> > >> >> wrote:
> > >> >> > Ok, sounds very promising...
> > >> >> >
> > >> >> > i'll try to start digging on the driver part this week then
> > (Pipeline
> > >> >> > wrapper in R5).
> > >> >> >
> > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <
> josh.wills@gmail.com
> > >
> > >> >> wrote:
> > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <
> > dlieu.7@gmail.com
> > >> >
> > >> >> wrote:
> > >> >> >>> Ok, cool.
> > >> >> >>>
> > >> >> >>> So what state is Crunch in? I take it is in a fairly advanced
> > state.
> > >> >> >>> So every api mentioned in the  FlumeJava paper is working ,
> > right?
> > >> Or
> > >> >> >>> there's something that is not working specifically?
> > >> >> >>
> > >> >> >> I think the only thing in the paper that we don't have in a
> > working
> > >> >> >> state is MSCR fusion. It's mostly just a question of
> prioritizing
> > it
> > >> >> >> and getting the work done.
> > >> >> >>
> > >> >> >>>
> > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <
> jwills@cloudera.com
> > >
> > >> >> wrote:
> > >> >> >>>> Hey Dmitriy,
> > >> >> >>>>
> > >> >> >>>> Got a fork going and looking forward to playing with crunchR
> > this
> > >> >> weekend--
> > >> >> >>>> thanks!
> > >> >> >>>>
> > >> >> >>>> J
> > >> >> >>>>
> > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <
> > >> dlieu.7@gmail.com>
> > >> >> wrote:
> > >> >> >>>>
> > >> >> >>>>> Project template https://github.com/dlyubimov/crunchR
> > >> >> >>>>>
> > >> >> >>>>> Default profile does not compile R artifact . R profile
> > compiles R
> > >> >> >>>>> artifact. for convenience, it is enabled by supplying -DR to
> > mvn
> > >> >> >>>>> command line, e.g.
> > >> >> >>>>>
> > >> >> >>>>> mvn install -DR
> > >> >> >>>>>
> > >> >> >>>>> there's also a helper that installs the snapshot version of
> the
> > >> >> >>>>> package in the crunchR module.
> > >> >> >>>>>
> > >> >> >>>>> There's RJava and JRI java dependencies which i did not find
> > >> anywhere
> > >> >> >>>>> in public maven repos; so it is installed into my github
> maven
> > >> repo
> > >> >> so
> > >> >> >>>>> far. Should compile for 3rd party.
> > >> >> >>>>>
> > >> >> >>>>> -DR compilation requires R, RJava and optionally, RProtoBuf.
> R
> > Doc
> > >> >> >>>>> compilation requires roxygen2 (i think).
> > >> >> >>>>>
> > >> >> >>>>> For some reason RProtoBuf fails to import into another
> package,
> > >> got a
> > >> >> >>>>> weird exception when i put @import RProtoBuf into crunchR, so
> > >> >> >>>>> RProtoBuf is now in "Suggests" category. Down the road that
> may
> > >> be a
> > >> >> >>>>> problem though...
> > >> >> >>>>>
> > >> >> >>>>> other than the template, not much else has been done so
> far...
> > >> >> finding
> > >> >> >>>>> hadoop libraries and adding it to the package path on
> > >> initialization
> > >> >> >>>>> via "hadoop classpath"... adding Crunch jars and its
> > >> non-"provided"
> > >> >> >>>>> transitives to the crunchR's java part...
> > >> >> >>>>>
> > >> >> >>>>> No legal stuff...
> > >> >> >>>>>
> > >> >> >>>>> No readmes... complete stealth at this point.
> > >> >> >>>>>
> > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <
> > >> >> dlieu.7@gmail.com>
> > >> >> >>>>> wrote:
> > >> >> >>>>> > Ok, cool. I will try to roll project template by some time
> > next
> > >> >> week.
> > >> >> >>>>> > we can start with prototyping and benchmarking something
> > really
> > >> >> >>>>> > simple, such as parallelDo().
> > >> >> >>>>> >
> > >> >> >>>>> > My interim goal is to perhaps take some more or less simple
> > >> >> algorithm
> > >> >> >>>>> > from Mahout and demonstrate it can be solved with Rcrunch
> (or
> > >> >> whatever
> > >> >> >>>>> > name it has to be) in a comparable time (performance) but
> > with
> > >> much
> > >> >> >>>>> > fewer lines of code. (say one of factorization or
> clustering
> > >> >> things)
> > >> >> >>>>> >
> > >> >> >>>>> >
> > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <rsharma@xebia.com
> >
> > >> wrote:
> > >> >> >>>>> >> I am not much of R user but I am interested to see how
> well
> > we
> > >> can
> > >> >> >>>>> integrate
> > >> >> >>>>> >> the two. I would be happy to help.
> > >> >> >>>>> >>
> > >> >> >>>>> >> regards,
> > >> >> >>>>> >> Rahul
> > >> >> >>>>> >>
> > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
> > >> >> >>>>> >>>
> > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <
> > >> >> dlieu.7@gmail.com>
> > >> >> >>>>> >>> wrote:
> > >> >> >>>>> >>>>
> > >> >> >>>>> >>>> Yep, ok.
> > >> >> >>>>> >>>>
> > >> >> >>>>> >>>> I imagine it has to be an R module, so I can set up a
> > >> >> >>>>> >>>> Maven project with a java/R code tree (I have been doing
> > >> >> >>>>> >>>> that a lot lately). Or if you have a template to look
> > >> >> >>>>> >>>> at, that would be useful too, I guess.
> > >> >> >>>>> >>>
> > >> >> >>>>> >>> No, please go right ahead.
> > >> >> >>>>> >>>
> > >> >> >>>>> >>>>
> > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills
> > >> >> >>>>> >>>> <josh.wills@gmail.com> wrote:
> > >> >> >>>>> >>>>>
> > >> >> >>>>> >>>>> I'd like it to be separate at first, but I am happy to
> > >> >> >>>>> >>>>> help. Github repo?
> > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov"
> > >> >> >>>>> >>>>> <dlieu.7@gmail.com> wrote:
> > >> >> >>>>> >>>>>
> > >> >> >>>>> >>>>>> Ok, maybe there's a benefit to trying a JRI/RJava
> > >> >> >>>>> >>>>>> prototype on top of Crunch for something simple. This
> > >> >> >>>>> >>>>>> should both save time and prove or disprove whether
> > >> >> >>>>> >>>>>> Crunch-via-RJava integration is viable.
> > >> >> >>>>> >>>>>>
> > >> >> >>>>> >>>>>> On my part, I can try to do it within the Crunch
> > >> >> >>>>> >>>>>> framework, or we can keep it completely separate.
> > >> >> >>>>> >>>>>>
> > >> >> >>>>> >>>>>> -d
> > >> >> >>>>> >>>>>>
> > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills
> > >> >> >>>>> >>>>>> <jwills@cloudera.com> wrote:
> > >> >> >>>>> >>>>>>>
> > >> >> >>>>> >>>>>>> I am an avid R user and would be into it. Who gave
> > >> >> >>>>> >>>>>>> the talk? Was it Murray Stokely?
> > >> >> >>>>> >>>>>>>
> > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov
> > >> >> >>>>> >>>>>>> <dlieu.7@gmail.com> wrote:
> > >> >> >>>>> >>>>>>>>
> > >> >> >>>>> >>>>>>>> Hello,
> > >> >> >>>>> >>>>>>>>
> > >> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's
> > >> >> >>>>> >>>>>>>> experience with an R mapping of FlumeJava at one
> > >> >> >>>>> >>>>>>>> of the recent BARUG meetups. I think a lot of
> > >> >> >>>>> >>>>>>>> applications similar to what we do in Mahout could
> > >> >> >>>>> >>>>>>>> be prototyped using Flume R.
> > >> >> >>>>> >>>>>>>>
> > >> >> >>>>> >>>>>>>> I did not quite get the details of Google's
> > >> >> >>>>> >>>>>>>> implementation of the R mapping, but I am not sure
> > >> >> >>>>> >>>>>>>> that just a direct mapping from R to Crunch would
> > >> >> >>>>> >>>>>>>> be sufficient (and, for the most part, efficient).
> > >> >> >>>>> >>>>>>>> RJava/JRI and JNI seem to be pretty terrible
> > >> >> >>>>> >>>>>>>> performers for doing that directly.
> > >> >> >>>>> >>>>>>>>
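[Editor's note: one way around the per-record JNI/RJava cost, sketched under the assumption raised elsewhere in this thread that a DoFn instance keeps the same emitter for its whole lifecycle: buffer records on the Java side and cross the boundary one chunk at a time. The class and method names below are hypothetical, not Crunch API; the chunkSink stands in for a single call into the R runtime.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical batching bridge: instead of one boundary crossing per
// record, buffer records and hand whole chunks to chunkSink, which
// would represent one JNI/RJava call carrying many records at once.
public class BatchingEmitter<T> {
    private final int chunkSize;
    private final Consumer<List<T>> chunkSink;
    private final List<T> buffer = new ArrayList<>();

    public BatchingEmitter(int chunkSize, Consumer<List<T>> chunkSink) {
        this.chunkSize = chunkSize;
        this.chunkSink = chunkSink;
    }

    public void emit(T value) {
        buffer.add(value);
        if (buffer.size() >= chunkSize) {
            flush();
        }
    }

    // Must be called once at the end of the DoFn's lifecycle (e.g. from
    // its cleanup hook) so the partial last chunk is not lost.
    public void flush() {
        if (!buffer.isEmpty()) {
            chunkSink.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```

[With 250 records and a chunk size of 100, the sink sees three calls of sizes 100, 100, and 50 instead of 250 individual crossings, which is the amortization the batching argument relies on.]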
> > >> >> >>>>> >>>>>>>>
> > >> >> >>>>> >>>>>>>> On top of that, I am thinking that if this project
> > >> >> >>>>> >>>>>>>> could have a contributed adapter to Mahout's
> > >> >> >>>>> >>>>>>>> distributed matrices, that would be a very good
> > >> >> >>>>> >>>>>>>> synergy.
> > >> >> >>>>> >>>>>>>>
> > >> >> >>>>> >>>>>>>> Is there anyone interested in contributing/advising
> > >> >> >>>>> >>>>>>>> on an open source version of Flume R support? Just
> > >> >> >>>>> >>>>>>>> gauging interest; the Crunch list seems like a
> > >> >> >>>>> >>>>>>>> natural place to poke.
> > >> >> >>>>> >>>>>>>>
> > >> >> >>>>> >>>>>>>> Thanks.
> > >> >> >>>>> >>>>>>>>
> > >> >> >>>>> >>>>>>>> -Dmitriy
> > >> >> >>>>> >>>>>>>
> > >> >> >>>>> >>>>>>>
> > >> >> >>>>> >>>>>>>
> > >> >> >>>>> >>>>>>> --
> > >> >> >>>>> >>>>>>> Director of Data Science
> > >> >> >>>>> >>>>>>> Cloudera
> > >> >> >>>>> >>>>>>> Twitter: @josh_wills
> > >> >> >>>>> >>>
> > >> >> >>>>> >>>
> > >> >> >>>>> >>>
> > >> >> >>>>> >>
> > >> >> >>>>>
> > >> >> >>>>
> > >> >> >>>>
> > >> >> >>>>
> > >> >> >>>> --
> > >> >> >>>> Director of Data Science
> > >> >> >>>> Cloudera <http://www.cloudera.com>
> > >> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
> > >> >>
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > Director of Data Science
> > >> > Cloudera <http://www.cloudera.com>
> > >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> > >>
> > >
> > >
> > >
> > > --
> > > Director of Data Science
> > > Cloudera <http://www.cloudera.com>
> > > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >
>