Posted to user@hive.apache.org by "kulkarni.swarnim@gmail.com" <ku...@gmail.com> on 2012/05/10 23:52:30 UTC

Managed vs external tables in hive

I am pretty new to hive and was trying to clearly understand the difference
between a managed and an external table.

As my current understanding stands, a managed table is a table whose data
is completely owned by hive, whereas an external table is usually created to
provide a hive frontend for data managed in external systems. I would
suppose this means that a query on an external table goes out to fetch
data from the given external system, deserializes it according to the
given/suitable SerDe, and then shows the output of the query in hive format.

So does this mean that the cost of using external tables is much higher than
that of the native ones? Or is there some caching that comes into play that
I am not seeing right now?

Thanks for the help.

-- 
Swarnim

Re: Managed vs external tables in hive

Posted by Edward Capriolo <ed...@gmail.com>.
The only actual difference is:

If you drop a managed table the LOCATION it refers to will be deleted.
If you drop an external table the LOCATION it refers to will not be deleted.

Confusion happens because when hive creates a managed table, its location defaults to:

fs.default.name + /user/hive/warehouse/ + tablename
e.g.
hdfs://myserver:9091/user/hive/warehouse/tablename

So people make the leap that EXTERNAL tables have a location and
managed tables do not. In fact, MANAGED tables can have a location outside
the warehouse, and EXTERNAL tables can have a location inside the
warehouse, depending on how the tables/partitions were defined.
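
A minimal Python sketch of the default-location rule described above, assuming the default database and the stock /user/hive/warehouse directory. The function name and its parameters are made up for illustration; this is not a Hive API, just string joining.

```python
def default_managed_location(fs_default_name, table_name,
                             warehouse_dir="/user/hive/warehouse"):
    """Join the filesystem URI, the warehouse directory and the table name.

    Hypothetical helper: it only mirrors the rule
    fs.default.name + /user/hive/warehouse/ + tablename.
    """
    base = fs_default_name.rstrip("/")           # e.g. hdfs://myserver:9091
    warehouse = "/" + warehouse_dir.strip("/")   # normalise the slashes
    return f"{base}{warehouse}/{table_name}"


print(default_managed_location("hdfs://myserver:9091", "tablename"))
# hdfs://myserver:9091/user/hive/warehouse/tablename
```

A table created with an explicit LOCATION would simply not go through this rule, whether it is managed or external.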



Re: Managed vs external tables in hive

Posted by Ranjith <ra...@gmail.com>.
Starting in 0.7, hive introduced indexing (https://issues.apache.org/jira/browse/HIVE-417), so indexes are available in hive.

Thanks,
Ranjith

On May 12, 2012, at 11:13 PM, Raja Thiruvathuru <th...@gmail.com> wrote:

> No indexing in hive.

Re: Managed vs external tables in hive

Posted by Raja Thiruvathuru <th...@gmail.com>.
No indexing in hive.


On Sunday, May 13, 2012, Ranjith wrote:

> Indexes can be built on tables managed by hive. For external tables I do
> not believe that to be true. Please feel to correct if I am wrong.
>
> Thanks,
> Ranjith
>

-- 

Raja Thiruvathuru

Re: Managed vs external tables in hive

Posted by Ranjith <ra...@gmail.com>.
Good info Edward. Thanks.

Thanks,
Ranjith

On May 13, 2012, at 2:33 PM, Edward Capriolo <ed...@gmail.com> wrote:

> The original design docs say you can not build indexes on external tables but I tried it in 0.8.x and confirmed you can.
> 

Re: Managed vs external tables in hive

Posted by Mark Grover <mg...@oanda.com>.
Ranjith,
If the schema of the data changes and you are using external tables, you can drop the table and re-create it over the same dataset with the new schema (hopefully maintaining backwards compatibility).

I think you can still achieve that using ALTER TABLE commands with managed tables; however, I find external tables just easier to manage, so I almost always end up making all my HDFS tables external.

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 

----- Original Message -----
From: "Ranjith" <ra...@gmail.com>
To: user@hive.apache.org
Cc: user@hive.apache.org
Sent: Sunday, May 13, 2012 8:54:35 PM
Subject: Re: Managed vs external tables in hive

Thanks Mark and Edward. This is good info to keep in mind. So is it fair to say that external tables offer flexibility, in that, you can have multiple schemas on the same data asset without data duplication. Is there anything else that an external table may offer versus a hive managed table or vice versa?

Thanks,
Ranjith


Re: Managed vs external tables in hive

Posted by Ranjith <ra...@gmail.com>.
Thanks Mark and Edward. This is good info to keep in mind. So is it fair to say that external tables offer flexibility, in that you can have multiple schemas on the same data asset without data duplication? Is there anything else that an external table may offer versus a hive managed table, or vice versa?

Thanks,
Ranjith

On May 13, 2012, at 6:47 PM, Edward Capriolo <ed...@gmail.com> wrote:

> I believe I walked through the entire process.
> 
> You can ALTER TABLE a table and change it from external to managed. So
> someone could always change the table to MANAGED do the index thing
> and then change it back. Just be aware of the tables current status
> before it is dropped.
> 
> Edward
> 

Re: Managed vs external tables in hive

Posted by Edward Capriolo <ed...@gmail.com>.
I believe I walked through the entire process.

You can use ALTER TABLE to change a table from external to managed. So
someone could always change the table to MANAGED, do the index thing,
and then change it back. Just be aware of the table's current status
before it is dropped.
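
For anyone wanting to script that switch, here is a small Python sketch that only builds the DDL strings. It assumes the usual way to flip the flag is the EXTERNAL table property (ALTER TABLE ... SET TBLPROPERTIES); the message above does not spell out the syntax, so check it against your Hive version. The helper name and the table name page_views are made up for illustration.

```python
def set_external_ddl(table_name, external):
    """Build a statement that marks a table as external or managed.

    Assumption: the EXTERNAL table property controls the table type.
    This helper is hypothetical and does not talk to Hive at all.
    """
    flag = "TRUE" if external else "FALSE"
    return f"ALTER TABLE {table_name} SET TBLPROPERTIES ('EXTERNAL'='{flag}');"


# Switch to managed, do the index work, then switch back to external.
print(set_external_ddl("page_views", external=False))
print(set_external_ddl("page_views", external=True))
```

Whatever the flag says at the moment of DROP TABLE decides whether the files under the table's LOCATION are removed.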

Edward

On Sun, May 13, 2012 at 4:07 PM, Ranjith <ra...@gmail.com> wrote:
> Edward,
> Did you confirm this through the explain plan or through the execution of
> the ddl alone. And have you tried buckets with external tables?
>
> Thanks,
> Ranjith
>

Re: Managed vs external tables in hive

Posted by Mark Grover <mg...@oanda.com>.
Hi Ranjith,
I use buckets with external tables, no problem.

I concur with other people on the thread. Having an external table vs. a managed table on HDFS should have minimal impact on what operations you can perform on those tables.

Mark

----- Original Message -----
From: "Ranjith" <ra...@gmail.com>
To: user@hive.apache.org
Cc: user@hive.apache.org
Sent: Sunday, May 13, 2012 4:07:48 PM
Subject: Re: Managed vs external tables in hive


Edward, 
Did you confirm this through the explain plan or through the execution of the ddl alone. And have you tried buckets with external tables? 

Thanks, 
Ranjith 


Re: Managed vs external tables in hive

Posted by Ranjith <ra...@gmail.com>.
Edward,
Did you confirm this through the explain plan or through the execution of the DDL alone? And have you tried buckets with external tables?

Thanks,
Ranjith

On May 13, 2012, at 2:33 PM, Edward Capriolo <ed...@gmail.com> wrote:

> The original design docs say you can not build indexes on external tables but I tried it in 0.8.x and confirmed you can.
> 

Re: Managed vs external tables in hive

Posted by Edward Capriolo <ed...@gmail.com>.
The original design docs say you cannot build indexes on external tables,
but I tried it in 0.8.x and confirmed that you can.

On Sunday, May 13, 2012, Ranjith <ranjith.raghunat h1@gmail.com> wrote:
> Indexes can be built on tables managed by hive. For external tables I do
> not believe that to be true. Please feel to correct if I am wrong.
>
> Thanks,
> Ranjith

Re: Managed vs external tables in hive

Posted by Ranjith <ra...@gmail.com>.
Indexes can be built on tables managed by hive. For external tables I do not believe that to be true. Please feel free to correct me if I am wrong.

Thanks,
Ranjith

On May 12, 2012, at 9:24 PM, Nanda Vijaydev <na...@gmail.com> wrote:

> In hive, the raw data is in HDFS and there is a metadata layer that defines the structure of the raw data. Table is usually a reference to metadata, probably in a mySQL server and it contains a reference to the location of the data in HDFS, type of delimiter or serde to use and so on.  
> 1. With hive managed tables, when you drop a table, both the metadata in mysql and raw data on the cluster gets deleted. 
> 2. With external tables, when you drop a table, just the metadata gets deleted and the raw data continues to exist on the cluster. 

Re: Managed vs external tables in hive

Posted by Nanda Vijaydev <na...@gmail.com>.
In hive, the raw data lives in HDFS, and a metadata layer defines the
structure of that raw data. A table is essentially a metadata entry,
probably kept in a MySQL server, that records the location of the data in
HDFS, the type of delimiter or SerDe to use, and so on.
1. With hive managed tables, when you drop a table, both the metadata in
MySQL and the raw data on the cluster get deleted.
2. With external tables, when you drop a table, only the metadata gets
deleted and the raw data continues to exist on the cluster.
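
The two cases above can be sketched with a toy Python model. Everything here (the ToyWarehouse class, its fields, the paths) is invented for illustration: a dict stands in for the metastore and a set of paths stands in for HDFS. It is not how hive is implemented, only a picture of the drop semantics.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWarehouse:
    """Toy stand-in for the metastore plus HDFS; not a Hive API."""
    metastore: dict = field(default_factory=dict)  # name -> (location, is_external)
    filesystem: set = field(default_factory=set)   # paths that currently hold data

    def create_table(self, name, location, external=False):
        self.metastore[name] = (location, external)
        self.filesystem.add(location)

    def drop_table(self, name):
        # The metadata entry is removed in both cases.
        location, external = self.metastore.pop(name)
        # The data is removed only when the table is managed.
        if not external:
            self.filesystem.discard(location)


wh = ToyWarehouse()
wh.create_table("managed_t", "/user/hive/warehouse/managed_t")
wh.create_table("external_t", "/data/logs", external=True)
wh.drop_table("managed_t")
wh.drop_table("external_t")
print(sorted(wh.filesystem))  # ['/data/logs']
```

After both drops the metastore is empty, but the files behind the external table are still there.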


On Thu, May 10, 2012 at 3:02 PM, David Kulp <dk...@fiksu.com> wrote:

> It's simpler than this.  All files look the same -- and are often very
> simple delimited text -- whether managed or external.  The only difference
> is that the files associated with a managed table are dropped when the
> table is dropped and files that are loaded into a managed table are moved
> into hive's private path.  External tables never move or remove files.
>  Performance is the same.
>

Re: Managed vs external tables in hive

Posted by David Kulp <dk...@fiksu.com>.
It's simpler than this.  All files look the same -- and are often very simple delimited text -- whether managed or external.  The only difference is that the files associated with a managed table are dropped when the table is dropped and files that are loaded into a managed table are moved into hive's private path.  External tables never move or remove files.  Performance is the same.
