Posted to dev@crunch.apache.org by "Josh Wills (JIRA)" <ji...@apache.org> on 2014/01/07 19:51:50 UTC

[jira] [Updated] (CRUNCH-320) Materialize several PObject & PCollection objects in parallel (deferred materialization)

     [ https://issues.apache.org/jira/browse/CRUNCH-320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Wills updated CRUNCH-320:
------------------------------

    Attachment: CRUNCH-320.patch

Here's a patch for this. Thanks for digging this up, and sorry for the trouble.

As a workaround for your example, you can call materialize() on rawInput and on Sample.sample(rawInput, 0.5) directly, and then call the PObject methods to get their lengths. We'll only materialize the collection once, and that should signal the outputs to the planner. If you're using Crunch 0.9.0 or 0.8.2, we added a cache() method to PCollection that makes this process more readable, so that you could do:

rawInput.cache().length();
Sample.sample(rawInput, 0.5).cache().length();

to make the workaround a little bit cleaner.
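
For intuition on why declaring both outputs before asking for either value helps, here is a toy model in plain Java. It does not use the Crunch API and is not how the planner is implemented: ToyPipeline, ToyResult, defer(), and runs() are hypothetical names made up for this sketch. It only shows that results registered before the first getValue() call can share a single run, while interleaving registration and getValue() costs one run per value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Toy model only: none of these classes exist in Crunch.
final class ToyPipeline {
    private final List<String> input;
    private final List<ToyResult<?>> pending = new ArrayList<>();
    private int runs = 0; // number of passes ("jobs") executed so far

    ToyPipeline(List<String> input) {
        this.input = input;
    }

    /** Registers a computation over the input without executing it. */
    <T> ToyResult<T> defer(Function<List<String>, T> fn) {
        ToyResult<T> result = new ToyResult<>(this, fn);
        pending.add(result);
        return result;
    }

    /** Executes every pending computation as part of one run. */
    void run() {
        if (pending.isEmpty()) {
            return;
        }
        runs++;
        for (ToyResult<?> result : pending) {
            result.compute(input);
        }
        pending.clear();
    }

    int runs() {
        return runs;
    }
}

final class ToyResult<T> {
    private final ToyPipeline pipeline;
    private final Function<List<String>, T> fn;
    private T value;
    private boolean done = false;

    ToyResult(ToyPipeline pipeline, Function<List<String>, T> fn) {
        this.pipeline = pipeline;
        this.fn = fn;
    }

    void compute(List<String> input) {
        value = fn.apply(input);
        done = true;
    }

    /** Returns the value, triggering a pipeline run only if it is not computed yet. */
    T getValue() {
        if (!done) {
            pipeline.run();
        }
        return value;
    }
}

final class ToyDemo {
    public static void main(String[] args) {
        ToyPipeline p = new ToyPipeline(Arrays.asList("a", "b", "c", "d"));
        // Register both outputs first, then ask for the values.
        ToyResult<Long> total = p.defer(in -> (long) in.size());
        ToyResult<Long> subset = p.defer(in -> in.stream().filter(s -> s.compareTo("c") < 0).count());
        System.out.println(total.getValue() + " " + subset.getValue() + " runs=" + p.runs());
    }
}
```

In this toy, registering total and subset before the first getValue() leaves runs() at 1, whereas calling getValue() right after each defer() would bring it to 2.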

> Materialize several PObject & PCollection objects in parallel (deferred materialization)
> ----------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-320
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-320
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jason Gauci
>            Assignee: Josh Wills
>         Attachments: CRUNCH-320.patch
>
>
> Currently, Crunch blocks and materializes PCollections (through foo.materialize()) and PObjects (through foo.getValue()) on demand, but it would be a significant performance improvement if we could mark several of these objects as to be materialized, and then materialize all of them in parallel as part of a pipeline.run() call.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)