Posted to user@crunch.apache.org by Benjamin Mears <be...@gmail.com> on 2015/01/21 18:01:10 UTC

In memory PCollection for use in MRPipeline

Hi,

I'm trying to write a Crunch job to generate a large amount of simulated
data.  To kick the job off, I need inputs into a do function.  These inputs
are essentially dummy values that the do fn will ignore.  To
accomplish this, I'd like to create an in-memory PCollection that can then
be passed into an MR pipeline, but if I do this with MemPipeline.collectionOf
I get an error:

Exception in thread "main" java.lang.IllegalStateException:  named 'null' cannot be serialized
	at org.apache.crunch.impl.mem.collect.MemCollection.verifySerializable(MemCollection.java:110)
	at org.apache.crunch.impl.mem.collect.MemCollection.parallelDo(MemCollection.java:129)

Is it possible to explicitly declare/instantiate a PCollection to pass
into an MRPipeline?

Thanks!

-Ben

Re: In memory PCollection for use in MRPipeline

Posted by Benjamin Mears <be...@gmail.com>.

Great, thanks!

-Ben

On Thu, Jan 22, 2015 at 10:12 AM, Josh Wills <jw...@cloudera.com> wrote:

> The in-memory and Spark versions are pretty easy, the MR one will be a bit
> more work. Will track this at
> https://issues.apache.org/jira/browse/CRUNCH-489
>
> J

Re: In memory PCollection for use in MRPipeline

Posted by Josh Wills <jw...@cloudera.com>.

The in-memory and Spark versions are pretty easy; the MR one will be a bit
more work. I'll track this at
https://issues.apache.org/jira/browse/CRUNCH-489

J

On Wed, Jan 21, 2015 at 9:24 PM, Benjamin Mears <be...@gmail.com>
wrote:

> Hi Josh,
>
> 1) Yes, having a version that allowed a specification of parallelism would
> be very useful!  I had been thinking of using scaleFactor to try to force a
> higher degree of parallelism but not sure if that would have worked and
> being able to explicitly specify the parallelism is much cleaner.
>
> 2) Yes, the difference would be a varargs array vs. an iterable as the
> argument so having the analogous overloaded methods to
> MemPipeline.typedCollectionOf would probably be best (sorry, I didn't
> initially notice typedCollectionOf and collectionOf each had two overloaded
> versions).
>
> Thanks again!
>
> -Ben


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: In memory PCollection for use in MRPipeline

Posted by Benjamin Mears <be...@gmail.com>.

Hi Josh,

1) Yes, having a version that allowed a specification of parallelism would
be very useful!  I had been thinking of using scaleFactor to try to force a
higher degree of parallelism, but I'm not sure that would have worked, and
being able to specify the parallelism explicitly is much cleaner.

2) Yes, the difference is a varargs array vs. an Iterable as the
argument, so having overloads analogous to
MemPipeline.typedCollectionOf would probably be best (sorry, I didn't
initially notice that typedCollectionOf and collectionOf each have two
overloaded versions).

Thanks again!

-Ben


On Wed, Jan 21, 2015 at 8:58 PM, Josh Wills <jw...@cloudera.com> wrote:

> Hey Ben,
>
> Couple of questions:
>
> 1) If one potential use case for this was running simulations, wouldn't
> you want a version of collectionOf that allowed you to specify parallelism,
> like via NLineFileSource?
> 2) collectionOf vs. collectionFrom: do you just mean like a varargs array
> vs. an Iterable as the argument difference here? I also think that whatever
> version of this I did would have to take a PType so we knew how to
> serialize the data, so they would look more like typedCollectionOf on
> MemPipeline.
>
> Thanks!
> J

Re: In memory PCollection for use in MRPipeline

Posted by Josh Wills <jw...@cloudera.com>.

Hey Ben,

Couple of questions:

1) If one potential use case for this was running simulations, wouldn't you
want a version of collectionOf that allowed you to specify parallelism,
like via NLineFileSource?
2) collectionOf vs. collectionFrom: do you just mean the difference between
a varargs array and an Iterable as the argument? I also think that whatever
version of this I did would have to take a PType so we knew how to
serialize the data, so they would look more like typedCollectionOf on
MemPipeline.

Thanks!
J
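
The API shape discussed in point 2 can be sketched in plain Java. This is only
an illustration under stated assumptions: ElementType is a hypothetical
stand-in for the role PType plays (knowing how to serialize an element), and
neither the class nor the methods below are Crunch API. It shows a varargs
overload delegating to an Iterable overload, both taking the type first.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these types or methods exist in Crunch.
class CollectionFactorySketch {

    // Stand-in for the role of PType: it knows how to serialize one element.
    interface ElementType<T> {
        byte[] serialize(T value);
    }

    // Iterable overload: does the actual work and serializes each element.
    static <T> List<byte[]> typedCollectionOf(ElementType<T> type, Iterable<T> values) {
        List<byte[]> out = new ArrayList<>();
        for (T value : values) {
            out.add(type.serialize(value));
        }
        return out;
    }

    // Varargs overload: a convenience wrapper that delegates to the Iterable overload.
    @SafeVarargs
    static <T> List<byte[]> typedCollectionOf(ElementType<T> type, T... values) {
        return typedCollectionOf(type, Arrays.asList(values));
    }

    public static void main(String[] args) {
        ElementType<String> utf8 = s -> s.getBytes(StandardCharsets.UTF_8);
        List<byte[]> fromVarargs = typedCollectionOf(utf8, "a", "b", "c");
        List<byte[]> fromIterable = typedCollectionOf(utf8, Arrays.asList("a", "b", "c"));
        System.out.println(fromVarargs.size() + " and " + fromIterable.size() + " elements");
    }
}
```

Requiring the type argument up front means the elements can be serialized
without relying on Java serialization of whatever objects the caller passes in.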

On Wed, Jan 21, 2015 at 7:19 PM, Benjamin Mears <be...@gmail.com>
wrote:

> Hi Josh,
>
> Thanks for the quick reply!
>
> For me, I think a useful API would be to have an analogous MRPipeline.collectionOf
> and also potentially a method like MRPipeline.collectionFrom that takes in
> a Java Iterable and returns a PCollection compatible with MRPipeline.
>
> -Ben



Re: In memory PCollection for use in MRPipeline

Posted by Benjamin Mears <be...@gmail.com>.

Hi Josh,

Thanks for the quick reply!

For me, a useful API would be an MRPipeline.collectionOf analogous to
MemPipeline.collectionOf, and potentially also a method like
MRPipeline.collectionFrom that takes a Java Iterable and returns a
PCollection compatible with MRPipeline.

-Ben

On Wed, Jan 21, 2015 at 11:19 AM, Josh Wills <jw...@cloudera.com> wrote:

> Hey Ben,
>
> No easy way to do it right now besides writing the data yourself, though
> that sort of simulation-based use case has been in the back of my mind ever
> since we added the NLineFileSource. What would your ideal API look like
> here?
>
> Thanks,
> J

Re: In memory PCollection for use in MRPipeline

Posted by Josh Wills <jw...@cloudera.com>.

Hey Ben,

No easy way to do it right now besides writing the data yourself, though
that sort of simulation-based use case has been in the back of my mind ever
since we added the NLineFileSource. What would your ideal API look like
here?

Thanks,
J
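
A minimal sketch of the "writing the data yourself" workaround, assuming the
placeholder inputs can be plain text lines. SeedFile, writeSeeds, and
expectedSplits are hypothetical names, not Crunch API. The file is written
locally here and would need to be copied somewhere the cluster can read. The
idea is that a line-based source such as NLineFileSource, which gives each
task a fixed number of lines, could then read it, so the lines-per-task
setting decides how many tasks run.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Crunch: writes placeholder input lines to a text file.
class SeedFile {

    // Writes `count` lines ("0", "1", ...) to `target`. The values are dummies
    // that the do function is free to ignore.
    static void writeSeeds(Path target, int count) throws IOException {
        List<String> lines = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            lines.add(Integer.toString(i));
        }
        Files.write(target, lines, StandardCharsets.UTF_8);
    }

    // Number of input splits when each split holds at most `linesPerTask` lines
    // (ceiling division).
    static int expectedSplits(int lineCount, int linesPerTask) {
        if (linesPerTask <= 0) {
            throw new IllegalArgumentException("linesPerTask must be positive");
        }
        return (lineCount + linesPerTask - 1) / linesPerTask;
    }

    public static void main(String[] args) throws IOException {
        Path seeds = Files.createTempFile("seeds", ".txt");
        writeSeeds(seeds, 1000);
        System.out.println(seeds + ": " + expectedSplits(1000, 50) + " splits at 50 lines per task");
    }
}
```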

On Wed, Jan 21, 2015 at 9:01 AM, Benjamin Mears <be...@gmail.com>
wrote:

> Hi,
>
> I'm trying to write a Crunch job to generate a large amount of simulated
> data.  To kick the job off, I need inputs into a do function.  These inputs
> are essentially dummy values that will be ignored in the do fn.  To
> accomplish this, I'd like to create an inmemory PCollection that can then
> be passed into a MR pipeline, but if I do this with MemPipeline.collectionOf
> I get an error:
>
> Exception in thread "main" java.lang.IllegalStateException:  named 'null' cannot be serialized
> 	at org.apache.crunch.impl.mem.collect.MemCollection.verifySerializable(MemCollection.java:110)
> 	at org.apache.crunch.impl.mem.collect.MemCollection.parallelDo(MemCollection.java:129)
>
> Is it possible to explicitly declare/instantiate a PCollection to pass into an MRPipeline?
>
> Thanks!
>
> -Ben
>
>

