Posted to dev@datafu.apache.org by "Matthew Hayes (JIRA)" <ji...@apache.org> on 2018/07/09 22:22:00 UTC

[jira] [Commented] (DATAFU-127) New macro - sample by keys

    [ https://issues.apache.org/jira/browse/DATAFU-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537696#comment-16537696 ] 

Matthew Hayes commented on DATAFU-127:
--------------------------------------

Okay, the name seems reasonable.  I suggest dropping the {{sample_by_keys_with_date}} macro, however, since it doesn't provide much value over the other macro.  The user can add the filter if needed, and nothing about this filter makes it specific to dates.  What are your thoughts on this?  I went ahead and removed it and merged the remaining code for the sake of getting the bulk of this checked in.

By the way, I checked in a fix for {{test.single}}, so the following works again:
{code:java}
./gradlew :datafu-pig:test -Dtest.single=SamplingTests
{code}
 

> New macro - sample by keys
> --------------------------
>
>                 Key: DATAFU-127
>                 URL: https://issues.apache.org/jira/browse/DATAFU-127
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>            Priority: Major
>              Labels: macro
>         Attachments: DATAFU-127.patch
>
>
> Two macros that return a sample of a larger table based on a list of keys, with the schema of the larger table. One of the macros filters by dates, the other doesn't.
> If there are multiple rows with a key that appears in the key list, all of them will be returned (no deduplication is done). The results are returned ordered by the key field in a single file.
> The implementation uses a replicated join for efficiency, which means the key list must be small enough to fit in memory.
> The first macro's definition looks as follows:
> DEFINE sample_by_keys(table, sample_set, join_key_table, join_key_sample) returns out {
> - table            - name of the table to sample
> - sample_set       - a set of keys
> - join_key_table   - join column name in the table
> - join_key_sample  - join column name in the sample
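
The behavior described above can be modeled in a few lines of Python. This is only a sketch of the semantics (key list held in memory, every matching row kept, output ordered by key), not the Pig implementation from the patch. The row layout, the {{day}} field, and the sample data are hypothetical, and the sketch assumes the key list has no repeated keys, since the set below would collapse them.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def sample_by_keys(table: Iterable[Row], sample_set: Iterable[Row],
                   join_key_table: str, join_key_sample: str) -> List[Row]:
    """Model of the macro's described semantics (not the Pig code itself)."""
    # The small side is loaded fully into memory, as a replicated join does.
    wanted = {row[join_key_sample] for row in sample_set}
    # Stream the large table once and keep every matching row (no deduplication).
    matches = [row for row in table if row[join_key_table] in wanted]
    # Rows keep the table's schema and come back ordered by the key field.
    return sorted(matches, key=lambda row: row[join_key_table])


# Hypothetical data: "id" and "day" are illustrative field names.
table = [
    {"id": 3, "day": "2018-07-02"},
    {"id": 1, "day": "2018-07-01"},
    {"id": 3, "day": "2018-07-03"},
    {"id": 2, "day": "2018-07-01"},
]
keys = [{"key": 3}, {"key": 1}]

sampled = sample_by_keys(table, keys, "id", "key")

# A caller-side filter composes on top of the result; nothing about it
# is specific to dates.
recent = [row for row in sampled if row["day"] >= "2018-07-02"]
```

Because the filter is applied by the caller to the macro's output, the same pattern works for any predicate, which is the argument for keeping only the single {{sample_by_keys}} macro.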



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)