Posted to user@oozie.apache.org by Huiting Li <hu...@autodesk.com> on 2013/12/12 11:12:29 UTC

Data Pipeline - Does oozie support the newly created partitions from step 1 as the input events and parameters for step 2?

Hi,

In the Oozie coordinator, we can use ${coord:current(int n)} to create a data pipeline with a coordinator application. It's said that "${coord:current(int n)} represents the nth dataset instance for a synchronous dataset, relative to the coordinator action creation (materialization) time. The coordinator action creation (materialization) time is computed based on the coordinator job start time and its frequency. The nth dataset instance is computed based on the dataset's initial-instance datetime, its frequency and the (current) coordinator action creation (materialization) time."
However, our case is: the coordinator starts at, for example, 2013-12-12-02, and step 1 outputs multiple partitions, like /data/dth=2013-12-11-22, /data/dth=2013-12-11-23, and /data/dth=2013-12-12-02. We want to process all of these newly generated partitions in step 2. That is, step 2 takes the output of step 1 as its input and will process the data in the new partitions one by one. So if we define a dataset like the one below in step 2, how could we define the input events (in <data-in>) and pass parameters (as configuration properties) to step 2?
          <uri-template>
                 hdfs://xxx:8020/data/dth=${YEAR}-${MONTH}-${DAY}-${HOUR}
          </uri-template>

Does Oozie support this kind of pipeline?
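
(For illustration: if the new partitions always fell in a fixed window relative to the materialization time, a <data-in> instance range could cover them. This is a sketch only, with an assumed 5-hour window that is not guaranteed in our case:)

```xml
<!-- Sketch only: assumes the new partitions always fall in the 5 hours up to
     the materialization time, which our step 1 does not guarantee. -->
<datasets>
  <dataset name="hourlyData" frequency="${coord:hours(1)}"
           initial-instance="2013-12-01T00:00Z" timezone="UTC">
    <uri-template>hdfs://xxx:8020/data/dth=${YEAR}-${MONTH}-${DAY}-${HOUR}</uri-template>
  </dataset>
</datasets>
<input-events>
  <data-in name="recentHours" dataset="hourlyData">
    <start-instance>${coord:current(-4)}</start-instance>
    <end-instance>${coord:current(0)}</end-instance>
  </data-in>
</input-events>
<!-- ${coord:dataIn('recentHours')} then resolves to all five instance URIs
     and can be passed to the workflow as a configuration property. -->
```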

Thanks,
Huiting

Re: Data Pipeline - Does oozie support the newly created partitions from step 1 as the input events and parameters for step 2?

Posted by Mohammad Islam <mi...@yahoo.com>.
Huiting,
A few questions to clarify:
1. What is your step 2? Is it a separate coordinator, or a workflow action node?
2. Does the output of step 1 follow any fixed pattern, such as always creating 3 output directories? Please give an example.

Regards,
Mohammad




RE: Data Pipeline - Does oozie support the newly created partitions from step 1 as the input events and parameters for step 2?

Posted by Huiting Li <hu...@autodesk.com>.
I think that couldn't achieve exactly what we want, as we also need the workflow to detect and process the dynamically generated partitions iteratively. We may need to implement this logic some other way, instead of using Oozie directly.
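
(For instance, the detection step could come down to comparing partition names against the last one processed. A rough Python sketch, not from Oozie, assuming the dth=YYYY-MM-DD-HH naming from the example above:)

```python
# Rough sketch: pick out partitions newer than the last one already processed.
# With zero-padded dth=YYYY-MM-DD-HH names, plain string comparison orders
# partitions chronologically, so no date parsing is needed.

def new_partitions(listed, last_processed):
    """Return partition names newer than last_processed, oldest first."""
    return sorted(p for p in listed if p > last_processed)

# Example: three partitions found on HDFS, one already handled.
print(new_partitions(
    ["dth=2013-12-12-02", "dth=2013-12-11-22", "dth=2013-12-11-23"],
    "dth=2013-12-11-22"))
# -> ['dth=2013-12-11-23', 'dth=2013-12-12-02']
```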

Anyway, thanks all the same, Rohini!

Thanks,
Huiting


Re: Data Pipeline - Does oozie support the newly created partitions from step 1 as the input events and parameters for step 2?

Posted by Rohini Palaniswamy <ro...@gmail.com>.
You can specify more than one instance in data-out. But if the instances
produced are random, then the only thing I can think of is passing the
partitions created by one action to the next action in the workflow through
action output. You can write any data in a java action and pass it on to
the next action. Or you can write the partitions to a file in HDFS and let
the other action pick it up.

https://cwiki.apache.org/confluence/display/OOZIE/Java+Cookbook - Check out
capture-output
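
A minimal sketch of that capture-output handoff (action names, the main class, and the property key are hypothetical):

```xml
<!-- Sketch: the first action writes a java-properties payload (e.g.
     "partitions=dth=2013-12-11-22,dth=2013-12-11-23") to the file named by
     the oozie.action.output.properties system property; <capture-output/>
     makes it available to later actions via wf:actionData(). -->
<action name="find-partitions">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <main-class>com.example.FindNewPartitions</main-class>
        <capture-output/>
    </java>
    <ok to="process-partitions"/>
    <error to="fail"/>
</action>
<!-- A later action can then read the captured value, e.g. -->
<!-- ${wf:actionData('find-partitions')['partitions']} -->
```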

Regards,
Rohini




RE: Data Pipeline - Does oozie support the newly created partitions from step 1 as the input events and parameters for step 2?

Posted by Huiting Li <hu...@autodesk.com>.
It's said that coord:dataOut() resolves to all the URIs for the dataset instance specified in an output-event dataset section. From my understanding, the output event is a kind of pre-determined value, as usually coord:current(0) is used in the output event. Taking the Oozie doc example below, for the first run coord:dataOut('outputLogs') will resolve to "hdfs://bar:8020/app/daily-logs/2009/01/02", instead of the actual output of the last step, which may be a few random partitions, right?

So how to specify the output event in my case? Thanks a lot!


====oozie example=====
<coordinator-app name="app-coord" frequency="${coord:days(1)}"
                 start="2009-01-01T24:00Z" end="2009-12-31T24:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <dataset name="dailyLogs" frequency="${coord:days(1)}"
             initial-instance="2009-01-01T24:00Z" timezone="UTC">
      <uri-template>hdfs://bar:8020/app/daily-logs/${YEAR}/${MONTH}/${DAY}</uri-template>
    </dataset>
  </datasets>
  <input-events>... </input-events>
  <output-events>
    <data-out name="outputLogs" dataset="dailyLogs">
      <instance>${coord:current(0)}</instance>
    </data-out>
  </output-events>
  <action>.....
    <property>
      <name>wfOutput</name>
      <value>${coord:dataOut('outputLogs')}</value>
    </property>
  </action>
</coordinator-app>

Thanks,
Huiting


Re: Data Pipeline - Does oozie support the newly created partitions from step 1 as the input events and parameters for step 2?

Posted by Rohini Palaniswamy <ro...@gmail.com>.
The newly generated partitions should be part of data-out. You can pass the
partitions using the coord:dataOut() EL function.

Regards,
Rohini


