Posted to dev@hive.apache.org by Namit Jain <nj...@fb.com> on 2012/05/21 21:30:39 UTC

new feature in hive: links


There is a requirement for a new feature, for which Sambavi has written a
detailed overview: https://cwiki.apache.org/confluence/display/Hive/Links.

We would like to get some core concepts out of it and implement them in
open source Hive, so that the whole community benefits.
There are new concepts, so please comment, and we can take it forward
accordingly.


Thanks,
-namit





Re: new feature in hive: links

Posted by Carl Steinbach <ca...@cloudera.com>.
I added the comments/questions to the wiki (
https://cwiki.apache.org/confluence/display/Hive/Links). I'm also copying
them here:

The first draft of this proposal is very hard to decipher because it relies
on terms that aren't well defined. For example, here's the second sentence
from the motivations section:

bq. Growth beyond a single warehouse (or) separation of capacity usage and
allocation requires the creation of multiple physical warehouses, i.e.,
separate Hive instances.

What's the difference between a warehouse and a physical warehouse? How do
you define a Hive instance? In the requirements section the term virtual
warehouse is introduced and equated to a namespace, but clearly it's more
than that because otherwise DBs/Schemas would suffice. Can you please
update the proposal to include definitions for these terms?


bq. Prevent access using two part name syntax (Y.T) if namespaces feature
is "on" in a Hive instance. This ensures the database is self-contained.

The cross-namespace HiveConf ACL proposed in HIVE-3016 doesn't prevent
anyone from doing anything because there is no way to keep users from
disabling it. I'm surprised to see this ticket mentioned here since three
committers have already gone on record saying that this is the wrong
approach, and one committer even -1'd it. If preventing cross-db references
in queries is a requirement for this project, then I think Hive's
authorization mechanism will need to be extended to support this
privilege/restriction.

From the design section:

bq. We are building a namespace service external to Hive that has metadata
on namespace location across the Hive instances, and allows importing data
across Hive instances using replication.

Does the work proposed in HIVE-2989 also include adding this Db/Table
replication infrastructure to Hive? If so, what is the timeline for adding
it?

Thanks.

Carl

On Tue, May 22, 2012 at 9:18 AM, Ashutosh Chauhan <ha...@apache.org> wrote:

> [quoted message trimmed; Ashutosh's full message appears below]

Re: new feature in hive: links

Posted by Ashutosh Chauhan <ha...@apache.org>.
To kick-start the review, I did a quick pass over the doc. A few questions
popped out at me, which I asked. Sambavi was kind enough to come back with
replies to them. I am continuing to look into it, and would encourage other
folks to look into it as well.


Thanks,

Ashutosh


<Begin Forward Message>


Hi Ashutosh,

Thanks for looking through the design and providing your feedback!

Responses below:

* What exactly is contained in tracking capacity usage? One part is disk
space, which I presume you are going to track by summing sizes under the
database directory. Are you also thinking of tracking resource usage in
terms of CPU/memory/network utilization for different teams?

Right now the capacity usage we will track in Hive is disk space (managed
tables that belong to the namespace + imported tables). We will track the
mappers and reducers that the namespace utilizes directly from Hadoop.


* Each namespace (ns) will have exactly one database. If so, then users are
not allowed to create/use additional databases in such a deployment? Not
necessarily a problem, just trying to understand the design.

Yes, you are correct – this is a limitation of the design. Introducing a
new concept seemed heavyweight, so you can instead think of this as
“self-contained” databases. But it means that a given namespace cannot have
sub-databases in it.


* How are you going to keep metadata consistent across two ns? If metadata
gets updated in the remote ns, will it get automatically updated in the
user's local ns? If yes, how will this be implemented? If no, then every
time the user needs to use data from the remote ns, she has to bring the
metadata up to date in her ns. How will she do that?

Metadata will be kept in sync for linked tables. We will make alter table
on the remote table (the source of the link) cause an update to the target
of the link. Note that from a Hive perspective, the metadata for the source
and target of a link is in the same metastore.


* Is it even required that the metadata of two linked tables be consistent?
It seems the user has to run "alter link add partition" herself for each
partition. She can choose to add only a few partitions. In this case, the
tables in the two ns have a different number of partitions and thus
different data.

What you say above is true for static links. For dynamic links, add and
drop partition on the source of the link will cause the target to get those
partitions as well (we trap alter table add/drop partition to provide this
behavior).

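To make the static/dynamic distinction concrete, here is a rough sketch of
how the two flavors of link might look in DDL. The syntax is illustrative
only (taken loosely from the proposal; nothing below is implemented Hive
syntax), and the database/table names are made up:

    -- Hypothetical sketch only; not valid in any released Hive.

    -- Static link: partitions are imported explicitly, one at a time,
    -- so the source and target can legitimately diverge.
    CREATE LINK TO TABLE sales_db.daily_clicks;
    ALTER LINK sales_db.daily_clicks ADD PARTITION (ds = '2012-05-21');

    -- Dynamic link: add/drop partition on the source propagates to the
    -- target automatically (alter table add/drop partition is trapped).
    CREATE DYNAMIC LINK TO TABLE sales_db.daily_clicks;
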

* Who is allowed to create links?

Any user who has create/all privileges on the database. We could
potentially create a new privilege for this, but I think the create
privilege should suffice. We can similarly map the alter and drop
privileges to the appropriate operations.

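As a hedged sketch of how the existing privilege model could gate link
creation (the GRANT statement follows Hive's existing authorization syntax;
the CREATE LINK statement and all names are hypothetical, from the proposal
only):

    -- Give alice the existing create privilege on the target database:
    GRANT CREATE ON DATABASE target_db TO USER alice;
    -- alice could then create links into target_db, e.g. (hypothetical):
    -- CREATE LINK TO TABLE source_db.clicks;
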

* Once a user creates a link, who can use it? If everyone is allowed
access, then I don't see how this is different from the problem you outline
in the first alternative design option, wherein a user having access to two
ns via roles has access to the data in both ns.

The link creates metadata in the target database, so you can only access
data that has been linked into this database (access is via the T@Y or Y.T
syntax, depending on the chosen design option). Note that this is different
from having a role that a user maps to, since in that case there is no
local metadata in the target database specifying whether the imported data
is accessible from this database.

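As an illustration of the two candidate access syntaxes (neither form is
settled or implemented, and the names are made up), a query against a table
daily_clicks linked in from database sales_db might look like:

    SELECT COUNT(*) FROM daily_clicks@sales_db;   -- the T@Y option
    SELECT COUNT(*) FROM sales_db.daily_clicks;   -- the Y.T option
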

* If links are first-class concepts, then the authorization model also
needs to understand them? I don't see any mention of that.

Yes, you are correct. We need to account for the authorization model.


* I see there is an HDFS jira for implementing hard links of files at the
HDFS layer, so that takes care of linking physical data on HDFS. What about
tables whose data is stored in external systems, for example HBase? Does
HBase also need to implement a hard-linking feature for its tables for Hive
to make use of this feature? What about other storage handlers like
Cassandra, MongoDB, etc.?

The link does not create a link on HDFS. It just points to the source
table/partitions. You can think of it as a Hive-level link, so there is no
need for any changes/features from the other storage handlers.


* Migration will involve a two-step process of distcp'ing data from one
cluster to another and then replicating one MySQL instance to another. Are
there any other steps? Do you plan to (later) build tools to automate this
migration process?

Yes, we will be building tools to enable migration of a namespace.
Migration will involve replicating the metadata and the data as you mention
above.


* When migrating a ns from one datacenter to another, will links be dropped
or are they also preserved?

We will preserve them – by copying the data for the links to the other
datacenter.


Hope that helps. Please ask any more questions that come up as you read the
design.

Thanks!

Sambavi

On Mon, May 21, 2012 at 3:34 PM, Namit Jain <nj...@fb.com> wrote:

> [quoted message trimmed; Namit's reply appears below]

Re: new feature in hive: links

Posted by Namit Jain <nj...@fb.com>.
Yes, regions have been abandoned.
The cross-data-center writes for the metastores (in the case of multiple
regions)
turned out to be a deal-breaker.



On 5/21/12 2:30 PM, "Edward Capriolo" <ed...@gmail.com> wrote:

>[quoted message trimmed; Edward's question appears below]

Re: new feature in hive: links

Posted by Edward Capriolo <ed...@gmail.com>.
Can I ask a possibly related question? How does this fit in with hive
regions? Have regions been abandoned?

On Mon, May 21, 2012 at 3:30 PM, Namit Jain <nj...@fb.com> wrote:

> [quoted message trimmed; the original post appears at the top of the
> thread]