Posted to dev@falcon.apache.org by John Yu <jo...@gmail.com> on 2014/07/24 02:16:34 UTC

Partitions in Feed definition

Hey all,

A few questions about partitions.

Partitions in the feed XML, as below:

    <partitions>
        <partition name="colo"/>
        <partition name="country"/>
    </partitions>


   1. I see these are partition keys. Do the partition key values
(say country=us or country=uk) need to be defined beforehand, or are
they unbounded?
   2. Does the storage location need to have the partition keys in
it? Like below (see the colo and country partition keys):

   <location path="/data/${colo}/${country}/${YEAR}/${MONTH}/${DAY}" type="data"/>

   3. If the partition keys are not in the FileSystem path, how does
Falcon identify a feed partition's physical location (actually,
how/where is it used)? I understand that if it were HCAT, the feed
definition has the partition key-values.
   4. Are these partition keys and values validated against the
FileSystem or HCAT locations?



Partition attribute in the Cluster reference:

Using the example from the documentation page
<http://falcon.incubator.apache.org/docs/FalconArchitecture.html#Replication>


   1. What does it mean to specify partitions in a source cluster?
   2. What about in a target cluster? (Does it act like a filter to pull
only a subset of data from the source? If so, how does Falcon know which
subset to read in a FileSystem feed?)
   3. What data is in sourceCluster1 and sourceCluster2, and at what location?
   4. Which path does the replicated data end up at in the backupCluster (target)?


A few questions.  Hopefully it's something straightforward about
partitions that I have missed.


Thanks for your answers,
John

Re: Partitions in Feed definition

Posted by John Yu <jo...@gmail.com>.
Hey Srikanth,

Thanks a lot for the reply!
This clarifies much of our understanding of partitions. With this
in mind, we will try to come up with a proposal to tackle
https://issues.apache.org/jira/browse/FALCON-511.

Thanks,
John


2014-07-23 20:36 GMT-07:00 Srikanth Sundarrajan <sr...@hotmail.com>:

> > are the partition keys values (say country=us or country=uk) need to be
> > defined before-hand or unbounded?
>
> Yes, the partition values themselves are unbounded.
>
> > does the storage location need to have the partition key in them
>
> In most cases there are time partitions. Besides the time partitions, there
> can be other partitions, which are declared in the partitions section. So
> the partitions ought to be in the path as variables. This can be skipped if
> no consumer is interested in filtering and selecting a section of the data
> through the dataIn(input, partitionSpec) function.
>
> > if the partition keys are not in the FileSystem path, how does Falcon
> > identify a feed partition physical location
>
> If the partition keys aren't specified, then Falcon can't use them in the
> file system version of the input either. Partitions are only used by Falcon
> in two scenarios:
>
> 1) When data is partitioned across multiple clusters, the partitions can be
> merged into a single location using replication (single target, multiple
> sources). For this to work, each source should own a partition exclusively.
> 2) Data can be selectively consumed by filtering for specific partitions
> through the dataIn() EL expression.
>
> Regards,
> Srikanth Sundarrajan



-- 
余守中  John Yu (Yu, Shoou-Jong)
Mobile: 650-691-3314

RE: Partitions in Feed definition

Posted by Srikanth Sundarrajan <sr...@hotmail.com>.
> are the partition keys values (say country=us or country=uk) need to be defined before-hand or unbounded?

Yes, the partition values themselves are unbounded.

> does the storage location need to have the partition key in them

In most cases there are time partitions. Besides the time partitions, there can be other partitions, which are declared in the partitions section. So the partitions ought to be in the path as variables. This can be skipped if no consumer is interested in filtering and selecting a section of the data through the dataIn(input, partitionSpec) function.

> if the partition keys are not in the FileSystem path, how does Falcon identify a feed partition physical location

If the partition keys aren't specified, then Falcon can't use them in the file system version of the input either. Partitions are only used by Falcon in two scenarios:

1) When data is partitioned across multiple clusters, the partitions can be merged into a single location using replication (single target, multiple sources). For this to work, each source should own a partition exclusively.
2) Data can be selectively consumed by filtering for specific partitions through the dataIn() EL expression.

Regards,
Srikanth Sundarrajan
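
For readers following the thread, the two uses of partitions described in the reply above can be modelled in a few lines of Python. This is only an illustrative sketch and not Falcon's implementation: the function names, the sample values (sjc, ams) and the component-wise "*/us" matching rule are assumptions made for the example. Only the path template and the cluster names are taken from the thread.

```python
from fnmatch import fnmatchcase
from string import Template

# Location template modelled on the path quoted in the question.
LOCATION = Template("/data/${colo}/${country}/${YEAR}/${MONTH}/${DAY}")
# Partition keys in the order they are declared in <partitions>.
PARTITION_KEYS = ("colo", "country")


def instance_path(partition_values, year, month, day):
    """Expand the location template for one feed instance."""
    return LOCATION.substitute(partition_values, YEAR=year, MONTH=month, DAY=day)


def select_partitions(instances, partition_spec):
    """Keep instances whose partition values match a spec such as '*/us'.

    Assumption: the spec has one component per partition key, in
    declaration order, and '*' matches any value.
    """
    patterns = partition_spec.split("/")
    selected = []
    for values in instances:
        ordered = [values[key] for key in PARTITION_KEYS]
        if len(patterns) == len(ordered) and all(
            fnmatchcase(value, pattern)
            for value, pattern in zip(ordered, patterns)
        ):
            selected.append(values)
    return selected


def merge_sources(sources):
    """Map every partition to the single source cluster that owns it.

    Raises ValueError if two clusters claim the same partition, since
    both would then replicate into the same target path.
    """
    owner = {}
    for cluster, partitions in sources.items():
        for values in partitions:
            key = tuple(values[k] for k in PARTITION_KEYS)
            if key in owner:
                raise ValueError(
                    f"{key} is owned by both {owner[key]} and {cluster}"
                )
            owner[key] = cluster
    return owner
```

Because the partition values appear as variables in the path, a consumer-side filter reduces to matching path components, and merging several sources into one target is only safe when no two sources claim the same partition. For example, select_partitions(instances, "*/us") keeps every colo but only the us country, and merge_sources raises an error as soon as sourceCluster1 and sourceCluster2 both list the same (colo, country) pair.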
