Posted to dev@falcon.apache.org by "Venkatesh Seetharam (JIRA)" <ji...@apache.org> on 2014/08/22 21:32:11 UTC

[jira] [Comment Edited] (FALCON-630) late data rerun for process broken in trunk

    [ https://issues.apache.org/jira/browse/FALCON-630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107260#comment-14107260 ] 

Venkatesh Seetharam edited comment on FALCON-630 at 8/22/14 7:31 PM:
---------------------------------------------------------------------

I have a few questions:

* Why do you need this? How is this different from feedNames? This same property can be overloaded with input names in the process and feed names in replication, no?

{noformat}
+    INPUT_NAMES("falconInputs", "name of the inputs", false),
+    INPUT_STORAGE_TYPES("falconInputFeedStorageTypes", "input storage types", false),
 {noformat}

The code is already doing this:
{noformat}
org.apache.falcon.oozie.process.ProcessExecutionCoordinatorBuilder#propagateLateDataProperties
{noformat}
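
As a rough illustration of the overloading idea above (not Falcon's actual code), the sketch below resolves a single {{feedNames}} value differently per lifecycle: input names for a process, feed names for replication. {{FeedNamesSketch}}, {{Lifecycle}}, and {{feedNamesValue}} are hypothetical names invented for this example, and the {{#}} separator is an assumption taken from the {{falconInputFeeds}} value quoted in the issue description.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names are Falcon's actual classes or methods.
final class FeedNamesSketch {

    // Assumed lifecycles, only the two mentioned in the comment.
    enum Lifecycle { PROCESS, REPLICATION }

    // Property key taken from the patch; the separator is assumed from the logged value.
    static final String FEED_NAMES_KEY = "feedNames";
    static final String SEPARATOR = "#";

    // One property, overloaded: input names for a process, feed names for replication.
    static String feedNamesValue(Lifecycle lifecycle,
                                 List<String> inputNames,
                                 List<String> feedNames) {
        List<String> names = (lifecycle == Lifecycle.PROCESS) ? inputNames : feedNames;
        return String.join(SEPARATOR, names);
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("input1", "input2");
        List<String> feeds = Arrays.asList("feedA");
        System.out.println(FEED_NAMES_KEY + "="
                + feedNamesValue(Lifecycle.PROCESS, inputs, feeds));
        System.out.println(FEED_NAMES_KEY + "="
                + feedNamesValue(Lifecycle.REPLICATION, inputs, feeds));
    }
}
```

With this shape, consumers would read one key regardless of lifecycle, at the cost of the key's meaning depending on context.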

* +1 for this.

{noformat}
     // what outputs
-    FEED_NAMES("feedNames", "name of the feeds which are generated/replicated/deleted"),
-    FEED_INSTANCE_PATHS("feedInstancePaths", "comma separated feed instance paths"),
+    OUTPUT_FEED_NAMES("feedNames", "name of the feeds which are generated/replicated/deleted"),
+    OUTPUT_FEED_PATHS("feedInstancePaths", "comma separated feed instance paths"),
 {noformat}

I'd like to reduce the payload and make it easier to evolve, but this is not solved in FALCON-327. Also, in replication we will have a cluster pair, but only one cluster for all other lifecycles. How do we make this seamless instead of having to add this in 100 places?


was (Author: svenkat):
I have a few questions:

* Why do you need this? How is this different from feedNames?
{code}
+    INPUT_NAMES("falconInputs", "name of the inputs", false),
+    INPUT_STORAGE_TYPES("falconInputFeedStorageTypes", "input storage types", false),
 {code}

* +1 for this.
 {code}
     // what outputs
-    FEED_NAMES("feedNames", "name of the feeds which are generated/replicated/deleted"),
-    FEED_INSTANCE_PATHS("feedInstancePaths", "comma separated feed instance paths"),
+    OUTPUT_FEED_NAMES("feedNames", "name of the feeds which are generated/replicated/deleted"),
+    OUTPUT_FEED_PATHS("feedInstancePaths", "comma separated feed instance paths"),
 {code}

I'd like to reduce the payload and make it easier to evolve but this is not solved in FALCON-327. Also, in replication, we will have a cluster pair but will only have one cluster for all other lifecycles. How do we make this seamless instead of having to add this at 100 places?

> late data rerun for process broken in trunk 
> --------------------------------------------
>
>                 Key: FALCON-630
>                 URL: https://issues.apache.org/jira/browse/FALCON-630
>             Project: Falcon
>          Issue Type: Bug
>          Components: rerun
>    Affects Versions: 0.5
>            Reporter: Samarth Gupta
>            Assignee: Shwetha G S
>            Priority: Blocker
>             Fix For: 0.4
>
>         Attachments: FALCON-630.patch
>
>
> Late data rerun for process is not working. It seems that in pre-processing the record size is stored by feed name and not by input name, due to which late data is never detected.
> {code}
>                     -falconInputFeeds
>                     FETL2-RRLog#FETL-RTBS-PRLog#FETL-RTBS-NPRLog
> {code}
> In the example above, even though the param in the tasktracker logs says InputFeeds, the values are actually feed names.


