Posted to dev@falcon.apache.org by John Yu <jo...@gmail.com> on 2014/08/06 02:39:01 UTC

Re: Partitions in Feed definition

Hey Srikanth,

Thanks a lot for the reply!
This clarifies much of our understanding of partitions.  With this
in mind, we will try to come up with a proposal to tackle
https://issues.apache.org/jira/browse/FALCON-511.

Thanks,
John


2014-07-23 20:36 GMT-07:00 Srikanth Sundarrajan <sr...@hotmail.com>:

> > do the partition key values (say country=us or country=uk) need to be
> > defined beforehand, or are they unbounded?
>
> Yes, the partition values themselves are unbounded.
>
> > does the storage location need to have the partition keys in it
>
> In most cases there are time partitions. Besides the time partition, there
> can be other partitions, which are declared in the partitions section. So
> the partitions ought to be in the path as variables. They can be skipped if
> no consumer has interest in filtering and selecting a section of the data
> through the dataIn(input, partitionSpec) function.
>
> > if the partition keys are not in the FileSystem path, how does Falcon
> > identify a feed partition's physical location
>
> If partition keys aren't specified, then Falcon can't use them in the
> file system version of the input either. Partitions are only used in two
> scenarios by Falcon:
>
> 1) When data is partitioned across multiple clusters, it can be merged
>    into a single location using replication (single target, multiple
>    sources). For this to work, each source should own a partition
>    exclusively.
> 2) Data can be selectively consumed by filtering on a specific partition
>    through the dataIn() EL expression.
>
> Regards,
> Srikanth Sundarrajan
>
> > From: johnyu0520@gmail.com
> > Date: Wed, 23 Jul 2014 17:16:34 -0700
> > Subject: Partitions in Feed definition
> > To: dev@falcon.incubator.apache.org
> >
> > Hey all,
> >
> > Few questions about Partitions:
> >
> > Partitions in the FEED xml like below:
> >
> >     <partitions>
> >         <partition name="colo"/>
> >         <partition name="country"/>
> >     </partitions>
> >
> >
> >    1. I see these are partition keys; do the partition key values
> > (say country=us or country=uk) need to be defined beforehand, or are
> > they unbounded?
> >    2. Does the storage location need to have the partition keys in
> > it? Like below (see the colo and country partition keys):
> >
> >    <location path="/data/${colo}/${country}/${YEAR}/${MONTH}/${DAY}" type="data"/>
> >
> >    3. If the partition keys are not in the FileSystem path, how does
> > Falcon identify a feed partition's physical location (actually,
> > how/where is it used)? I understand that if it were HCAT, the Feed
> > definition has the partition key-values.
> >    4. Are these partition keys and values validated against the
> > FileSystem or HCAT locations?
> >
> >
> >
> > Partition attribute in the Cluster reference:
> >
> > Using the example from the documentation page
> > <http://falcon.incubator.apache.org/docs/FalconArchitecture.html#Replication>
> >
> >
> >    1. What does it mean to specify partitions in a source cluster?
> >    2. vs target cluster? (Does it act like a filter to pull only a
> > subset of data from source? If so, how does Falcon know to read the
> > subset in a Filesystem feed?)
> >    3. What data is in sourceCluster1 and sourceCluster2, and in what
> > location?
> >    4. Which path does the replicated data end up at in the backupCluster
> > (target)?
> >
> >
> > A few questions.  Hopefully it's something straightforward about
> > partitions that I have missed.
> >
> >
> > Thanks for your answers,
> > John
>
>
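
The two uses of partitions described in the quoted reply (selecting data with a partition filter, and merging sources that each own a partition exclusively) can be sketched roughly as below. This is a plain Python illustration, not Falcon code. The function names, the glob-style spec "*/us", and the colo values "west" and "east" are all assumptions made up for the example. Only the location template and the cluster names come from the thread.

```python
from fnmatch import fnmatchcase
from string import Template

# Location template copied from the feed example in the quoted question.
LOCATION = Template("/data/${colo}/${country}/${YEAR}/${MONTH}/${DAY}")


def resolve(colo, country, year, month, day):
    """Expand the location template for one feed instance."""
    return LOCATION.substitute(colo=colo, country=country,
                               YEAR=year, MONTH=month, DAY=day)


def select(instances, spec):
    """Keep (colo, country) pairs matching a glob-style spec.

    The spec syntax is hypothetical; it only mimics the idea of a
    partition filter and is not Falcon's partitionSpec grammar.
    """
    return [pair for pair in instances if fnmatchcase("/".join(pair), spec)]


def owners(source_partitions):
    """Map each partition to its single owning source.

    Raises ValueError if two sources claim the same partition, which
    mirrors the "each source owns a partition exclusively" requirement.
    """
    owner = {}
    for source, partitions in source_partitions.items():
        for partition in partitions:
            if partition in owner:
                raise ValueError(
                    f"{partition} is claimed by {owner[partition]} and {source}")
            owner[partition] = source
    return owner


# Hypothetical partition values present on the file system.
instances = [("west", "us"), ("west", "uk"), ("east", "us")]

# Scenario 2: consume only the country=us partitions.
us_paths = [resolve(colo, country, "2014", "07", "23")
            for colo, country in select(instances, "*/us")]
print(us_paths)

# Scenario 1: each source cluster owns exactly one colo partition.
print(owners({"sourceCluster1": ["west"], "sourceCluster2": ["east"]}))
```

Because the partition values appear in the path as variables, the filter can be applied to the resolved paths alone. That is consistent with the point in the reply that partitions left out of the path cannot be used for file system inputs.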



-- 
余守中  John Yu (Yu, Shoou-Jong)
Mobile: 650-691-3314