Posted to user@crunch.apache.org by "Patel,Stephen" <St...@Cerner.com> on 2015/05/21 19:48:11 UTC

ExtractKeyFn scaleFactor seems to be incorrect

I was looking at the PCollectionImpl.by method[0] today, and I think that the ExtractKeyFn[1] it's using may not be calculating scaleFactor correctly.  The ExtractKeyFn is using the default scaleFactor for a MapFn (1.0), but shouldn't it have a scaleFactor of 1 + the input MapFn's scaleFactor?

As an example, if you had a PCollection<T> and you called by with the IdentityFn, every record would be emitted twice, once as the key and once as the value, so the returned table should have a size of 2 * the original collection's size. As it stands now, it is estimated to have the same size as the original.
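
To make the arithmetic concrete, here is a minimal, self-contained sketch of the proposed calculation. SizedFn, KeyExtractor, and estimatedSize are hypothetical stand-ins written for this illustration; they are not the Crunch classes and only mirror the idea that a map function reports a scaleFactor (1.0 by default) and that a key extractor would report 1 + the key function's scaleFactor.

```java
import java.util.AbstractMap;
import java.util.Map;

// Stand-in types for illustration only; these are NOT the Crunch classes.
public class ScaleFactorSketch {

    /** Hypothetical one-to-one map function that reports a size estimate. */
    public interface SizedFn<S, T> {
        T map(S input);

        // Estimated ratio of output size to input size; 1.0 unless overridden.
        default float scaleFactor() {
            return 1.0f;
        }
    }

    /** Hypothetical key extractor: emits a (key, value) pair for every value. */
    public static final class KeyExtractor<K, V> implements SizedFn<V, Map.Entry<K, V>> {
        private final SizedFn<V, K> keyFn;

        public KeyExtractor(SizedFn<V, K> keyFn) {
            this.keyFn = keyFn;
        }

        @Override
        public Map.Entry<K, V> map(V input) {
            return new AbstractMap.SimpleImmutableEntry<>(keyFn.map(input), input);
        }

        @Override
        public float scaleFactor() {
            // 1.0 for the value, which passes through unchanged,
            // plus whatever the key function adds on top of it.
            return 1.0f + keyFn.scaleFactor();
        }
    }

    /** Scales an input size estimate by the function's scale factor. */
    public static long estimatedSize(long inputBytes, SizedFn<?, ?> fn) {
        return (long) (inputBytes * (double) fn.scaleFactor());
    }

    public static void main(String[] args) {
        SizedFn<String, String> identity = s -> s;
        KeyExtractor<String, String> byIdentity = new KeyExtractor<>(identity);

        System.out.println("identity alone: " + estimatedSize(1_000L, identity));
        System.out.println("keyed by identity: " + estimatedSize(1_000L, byIdentity));
    }
}
```

With an identity key function the keyed estimate comes out at twice the input estimate, which matches the 2 * figure above.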

Assuming we later group a table that we constructed with by, and the number of reducers is derived from the table's estimated size, won't we use (potentially) far fewer reducers than we actually should?
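
The effect on reducer counts can be sketched as follows. This assumes a planner that divides the estimated size by a bytes-per-reducer target; the 1 GB target, the 50 GB input, and the recommendedReducers helper are illustrative assumptions for this example, not values or methods taken from the Crunch source.

```java
// Illustrative arithmetic only; the constants and method are assumptions.
public class ReducerEstimateSketch {

    // Hypothetical target amount of data per reducer (1 GB).
    static final long BYTES_PER_REDUCER = 1_000_000_000L;

    /**
     * Recommends a reducer count from an input size estimate and the
     * scale factor of the function that produced the grouped table.
     */
    public static int recommendedReducers(long inputBytes, float scaleFactor) {
        long estimatedBytes = (long) (inputBytes * (double) scaleFactor);
        return 1 + (int) (estimatedBytes / BYTES_PER_REDUCER);
    }

    public static void main(String[] args) {
        long inputBytes = 50_000_000_000L; // assumed 50 GB input collection

        // Key extraction reported with the default scale factor of 1.0.
        System.out.println("scaleFactor 1.0: " + recommendedReducers(inputBytes, 1.0f));

        // Key extraction reported as 1 + the key function's scale factor.
        System.out.println("scaleFactor 2.0: " + recommendedReducers(inputBytes, 2.0f));
    }
}
```

Under these assumptions, underestimating the table's size by half roughly halves the recommended reducer count, so each reducer would receive about twice the data the target intends.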

[0]: https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/impl/dist/collect/PCollectionImpl.java#L270
[1]: https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/fn/ExtractKeyFn.java

CONFIDENTIALITY NOTICE This message and any included attachments are from Cerner Corporation and are intended only for the addressee. The information contained in this message is confidential and may constitute inside or non-public information under international, federal, or state securities laws. Unauthorized forwarding, printing, copying, distribution, or use of such information is strictly prohibited and may be unlawful. If you are not the addressee, please promptly delete this message and notify the sender of the delivery error by e-mail or you may call Cerner's corporate offices in Kansas City, Missouri, U.S.A at (+1) (816)221-1024.

Re: ExtractKeyFn scaleFactor seems to be incorrect

Posted by "Patel,Stephen" <St...@Cerner.com>.
Thanks!

https://issues.apache.org/jira/browse/CRUNCH-525

From: Josh Wills <jo...@gmail.com>
Reply-To: "user@crunch.apache.org" <us...@crunch.apache.org>
Date: Thursday, May 21, 2015 2:42 PM
To: "user@crunch.apache.org" <us...@crunch.apache.org>
Subject: Re: ExtractKeyFn scaleFactor seems to be incorrect

Yes, you're right-- file a JIRA for it?

J



Re: ExtractKeyFn scaleFactor seems to be incorrect

Posted by Josh Wills <jo...@gmail.com>.
Yes, you're right-- file a JIRA for it?

J
