Posted to dev@falcon.apache.org by "Venkatesh Seetharam (JIRA)" <ji...@apache.org> on 2014/08/22 21:32:11 UTC
[jira] [Comment Edited] (FALCON-630) late data rerun for process broken in trunk
[ https://issues.apache.org/jira/browse/FALCON-630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107260#comment-14107260 ]
Venkatesh Seetharam edited comment on FALCON-630 at 8/22/14 7:31 PM:
---------------------------------------------------------------------
I have a few questions:
* Why do you need this? How is this different from feedNames? This same property can be overloaded with input names in the process and feed names in replication, no?
{noformat}
+ INPUT_NAMES("falconInputs", "name of the inputs", false),
+ INPUT_STORAGE_TYPES("falconInputFeedStorageTypes", "input storage types", false),
{noformat}
The code is already doing this:
{noformat}
org.apache.falcon.oozie.process.ProcessExecutionCoordinatorBuilder#propagateLateDataProperties
{noformat}
* +1 for this.
{noformat}
// what outputs
- FEED_NAMES("feedNames", "name of the feeds which are generated/replicated/deleted"),
- FEED_INSTANCE_PATHS("feedInstancePaths", "comma separated feed instance paths"),
+ OUTPUT_FEED_NAMES("feedNames", "name of the feeds which are generated/replicated/deleted"),
+ OUTPUT_FEED_PATHS("feedInstancePaths", "comma separated feed instance paths"),
{noformat}
I'd like to reduce the payload and make it easier to evolve but this is not solved in FALCON-327. Also, in replication, we will have a cluster pair but will only have one cluster for all other lifecycles. How do we make this seamless instead of having to add this at 100 places?
was (Author: svenkat):
I have a few questions:
* Why do you need this? How is this different from feedNames?
{code}
+ INPUT_NAMES("falconInputs", "name of the inputs", false),
+ INPUT_STORAGE_TYPES("falconInputFeedStorageTypes", "input storage types", false),
{code}
* +1 for this.
{code}
// what outputs
- FEED_NAMES("feedNames", "name of the feeds which are generated/replicated/deleted"),
- FEED_INSTANCE_PATHS("feedInstancePaths", "comma separated feed instance paths"),
+ OUTPUT_FEED_NAMES("feedNames", "name of the feeds which are generated/replicated/deleted"),
+ OUTPUT_FEED_PATHS("feedInstancePaths", "comma separated feed instance paths"),
{code}
I'd like to reduce the payload and make it easier to evolve but this is not solved in FALCON-327. Also, in replication, we will have a cluster pair but will only have one cluster for all other lifecycles. How do we make this seamless instead of having to add this at 100 places?
> late data rerun for process broken in trunk
> --------------------------------------------
>
> Key: FALCON-630
> URL: https://issues.apache.org/jira/browse/FALCON-630
> Project: Falcon
> Issue Type: Bug
> Components: rerun
> Affects Versions: 0.5
> Reporter: Samarth Gupta
> Assignee: Shwetha G S
> Priority: Blocker
> Fix For: 0.4
>
> Attachments: FALCON-630.patch
>
>
> Late data rerun for process is not working. It seems that, in pre-processing, the record size is stored by feed name rather than by input name, so late data is never detected.
> {code}
> -falconInputFeeds
> FETL2-RRLog#FETL-RTBS-PRLog#FETL-RTBS-NPRLog
> {code}
> Above, even though the parameter in the tasktracker logs says InputFeeds, the values are actually feed names.
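
To illustrate the mismatch described in the issue, here is a minimal, hypothetical sketch. It is not Falcon code: the class `LateDataKeySketch`, the methods `recordSizes` and `isLate`, the input name `input1`, and the sizes are all assumptions made for illustration. It only shows how sizes recorded under a feed name would be invisible to a check that looks them up by input name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Falcon code: shows why a key mismatch hides late data.
public class LateDataKeySketch {

    // Pre-processing step (assumed): remember the size seen for each key.
    static Map<String, Long> recordSizes(String[] keys, long[] sizes) {
        Map<String, Long> recorded = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            recorded.put(keys[i], sizes[i]);
        }
        return recorded;
    }

    // Late-data check (assumed): an input counts as late only if a size was
    // recorded under its name and that size has changed since.
    static boolean isLate(Map<String, Long> recorded, String inputName, long currentSize) {
        Long previous = recorded.get(inputName);
        return previous != null && previous.longValue() != currentSize;
    }

    public static void main(String[] args) {
        long[] sizes = {100L};
        // Sizes keyed by feed name, as the issue describes.
        Map<String, Long> byFeedName = recordSizes(new String[] {"FETL2-RRLog"}, sizes);
        // Sizes keyed by input name ("input1" is a made-up input name).
        Map<String, Long> byInputName = recordSizes(new String[] {"input1"}, sizes);

        System.out.println(isLate(byFeedName, "input1", 250L));
        System.out.println(isLate(byInputName, "input1", 250L));
    }
}
```

In this sketch the first check returns false because nothing was recorded under `input1`, even though the data grew from 100 to 250; the second returns true once the recorded key matches the lookup key.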
--
This message was sent by Atlassian JIRA
(v6.2#6252)