Posted to user@hive.apache.org by Ashok Kumar <as...@yahoo.com> on 2016/01/03 12:03:16 UTC

Re: Immutable data in Hive

Any comments on ELT will be greatly appreciated gurus.
With warmest greetings 

    On Wednesday, 30 December 2015, 18:20, Ashok Kumar <as...@yahoo.com> wrote:
 

 Thank you sir, very helpful. Could you also briefly describe, from your experience, the major differences between traditional ETL in DW and ELT in Hive? Why is there an emphasis on taking data from traditional transactional databases into Hive tables in the same format and doing the transformation in Hive afterwards? Is it because Hive is meant to be efficient at data transformation? Regards

    On Wednesday, 30 December 2015, 18:00, Alan Gates <al...@gmail.com> wrote:
 

 Traditionally data in Hive was write once (insert), read many. You could append to tables and partitions, add new partitions, etc. You could remove data by dropping tables or partitions. But there were no updates of data or deletes of particular rows. This is what was meant by immutable. Hive was originally built this way because it was based on MapReduce and HDFS, and these were the natural semantics given those underlying systems.

For many use cases (e.g. ETL) this is sufficient, and the vast majority of people still run Hive this way.

We added transactions, updates and deletes to Hive because some use cases require these features. Hive is being used more and more as a data warehouse, and while updates and deletes are less common there, they are still required (slowly changing dimensions, fixing wrong data, deleting records for compliance, etc.). Also, streaming data into warehouses from transactional systems is a common use case.

Alan.
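
To make Alan's point concrete, here is a minimal HiveQL sketch of a transactional (ACID) table of the kind he describes, assuming Hive 1.2 or later with the transaction manager enabled on the cluster; the table and column names are invented for illustration:

  SET hive.support.concurrency=true;
  SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

  -- In these releases an ACID table must be bucketed and stored as ORC.
  CREATE TABLE customer_dim (
    id      INT,
    name    STRING,
    address STRING
  )
  CLUSTERED BY (id) INTO 4 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true');

  -- Row-level changes that a classic immutable Hive table does not allow:
  UPDATE customer_dim SET address = '10 New Street' WHERE id = 42;
  DELETE FROM customer_dim WHERE id = 99;

On a plain (non-transactional) table the UPDATE and DELETE statements would fail; the only options would be to append new data or overwrite the whole table or partition.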


    Ashok Kumar  December 29, 2015 at 14:59
Hi,
Can someone please clarify what "immutable data" in Hive means?
I have been told that data in Hive is/should be immutable, but in that case why do we need transactional tables in Hive that allow updates to data?
thanks and greetings 






   

  

RE: Immutable data in Hive

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Guys,

 

Thank you for your kind words.

 

I also created a couple of diagrams for ETL and ELT respectively.

 

Please feel free to have a look, comment and criticize. All welcome.

 

Regards,

 

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Ashok Kumar [mailto:ashok34668@yahoo.com] 
Sent: 04 January 2016 20:06
To: user@hive.apache.org
Subject: Re: Immutable data in Hive

 

I second that. Many thanks Mich for your reply.

 

Regards

 

On Monday, 4 January 2016, 10:58, "Singh, Abhijeet" <absingh@informatica.com> wrote:

 

Very well answered by Mich.

 

Thanks Mich !!

 

From: Mich Talebzadeh [mailto:mich@peridale.co.uk] 
Sent: Sunday, January 03, 2016 8:35 PM
To: user@hive.apache.org; 'Ashok Kumar'
Subject: RE: Immutable data in Hive

 

Hi Ashok.

 

I will have a go at this on top of Alan’s very valuable clarification.

 

Extraction, Transformation and Load (ETL) is a very common method in Data Warehousing (DW) and Business Analytics projects. It can be performed with custom programming such as shell scripts, Java or .NET tools, or a combination of these, to get the data from internal or external sources and put it into the DW.

 

In general only data of value ends up in the DW. What this means is that in, say, a Banking environment you collect and feed (Extract) data into a staging area (in relational terms, often staging tables or so-called global temporary tables in a staging database, cleared daily for the next cycle), prune it of unwanted data, do some manipulation (Transformation), often into another set of staging tables, and finally Load it into target tables in the Data Warehouse. The analysts then use appropriate tools like Tableau to look at macroscopic trends in the data. Remember a Data Warehouse is still a relational database, most probably a columnar implementation of the relational model such as SAP Sybase IQ.
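
A rough sketch of that staging flow in plain SQL may help; the staging and target table names here are hypothetical and the syntax is generic rather than tied to any particular warehouse product:

  -- (E)xtract: the nightly feed is bulk loaded into a staging table that is
  -- cleared for each cycle.
  TRUNCATE TABLE stg_transactions_raw;
  -- ... bulk load from the source system into stg_transactions_raw ...

  -- (T)ransform: prune unwanted rows and reshape into a second staging table.
  INSERT INTO stg_transactions_clean (account_id, txn_date, amount)
  SELECT account_id, txn_date, amount
  FROM stg_transactions_raw
  WHERE amount IS NOT NULL;

  -- (L)oad: only the data of value reaches the warehouse target table.
  INSERT INTO fact_transactions (account_id, txn_date, amount)
  SELECT account_id, txn_date, amount
  FROM stg_transactions_clean;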

 

 

There are many examples of DW repositories used for Business Intelligence (BI, another fancy term for Analytics), such as working out global trading positions (I did one of these by bolting Oracle TimesTen IMDB onto an Oracle DW for fast extraction) or analysing data gathered from algorithmic trading using Complex Event Processing. Although DWs store larger amounts of data (large being a relative term) and have impressive compression, as in Sybase IQ (every column is stored as an index, so columnar compression is far more effective, all data in a column being of the same type, as opposed to row compression in OLTP databases), they still require additional space, SAN storage and expensive horizontal scaling (adding another multiplex requires an additional license).

 

ELT (Extraction, Load and Transform) is a similar concept used in the Big Data world. The fundamental difference is that it is not confined to data deemed to be of specific value, i.e. where you know what you are looking for in advance. In Hadoop one can store everything, from structured data (transactional databases) to unstructured data (data coming from the internet, Excel sheets, email, logs and others). This means you can store potentially all data to be exploited later. The Hadoop ecosystem provides that flexibility by means of horizontal scaling on cheap commodity disks (AKA JBOD), and the lack of licensing restrictions reduces Total Cost of Ownership (TCO) considerably. In summary, you (E)xtract and (L)oad all data as is (not caring whether it is exactly what you want) into HDFS, and then you do the (T)ransformation later through Schema on Read (you decide your data needs at exploration time).
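
In Hive terms the ELT pattern can look like this minimal sketch, where the HDFS path, table and column names are invented for illustration: the raw extract is loaded as is, and the typing and cleansing happen later via Schema on Read and a follow-on CREATE TABLE AS SELECT:

  -- (E)xtract and (L)oad: raw delimited files are landed in HDFS and an
  -- external table simply projects a schema onto them at read time.
  CREATE EXTERNAL TABLE trades_raw (
    trade_id STRING,
    trade_ts STRING,
    symbol   STRING,
    price    STRING,
    quantity STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/data/raw/trades';

  -- (T)ransform later, once you know what you need, into a typed ORC table.
  -- Assumes trade_ts arrives in Hive's default 'yyyy-MM-dd HH:mm:ss' format.
  CREATE TABLE trades_curated STORED AS ORC AS
  SELECT CAST(trade_id AS BIGINT)     AS trade_id,
         CAST(trade_ts AS TIMESTAMP)  AS trade_ts,
         symbol,
         CAST(price AS DECIMAL(18,4)) AS price,
         CAST(quantity AS INT)        AS quantity
  FROM trades_raw
  WHERE trade_id IS NOT NULL;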

 

HDFS is great for storing large amounts of data, but on top of it you will need tools like Hive, Spark, Cassandra and others to explore your data lake.

 

 

HTH

 

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

http://talebzadehmich.wordpress.com/

 

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Ashok Kumar [mailto:ashok34668@yahoo.com] 
Sent: 03 January 2016 11:03
To: user@hive.apache.org; Ashok Kumar <ashok34668@yahoo.com>
Subject: Re: Immutable data in Hive

 

Any comments on ELT will be greatly appreciated gurus.

 

With warmest greetings

 

On Wednesday, 30 December 2015, 18:20, Ashok Kumar <ashok34668@yahoo.com> wrote:

 

Thank you sir, very helpful.

 

Could you also briefly describe, from your experience, the major differences between traditional ETL in DW and ELT in Hive? Why is there an emphasis on taking data from traditional transactional databases into Hive tables in the same format and doing the transformation in Hive afterwards? Is it because Hive is meant to be efficient at data transformation?

 

Regards 

 

 

 

On Wednesday, 30 December 2015, 18:00, Alan Gates <alanfgates@gmail.com> wrote:

 

Traditionally data in Hive was write once (insert), read many. You could append to tables and partitions, add new partitions, etc. You could remove data by dropping tables or partitions. But there were no updates of data or deletes of particular rows. This is what was meant by immutable. Hive was originally built this way because it was based on MapReduce and HDFS, and these were the natural semantics given those underlying systems.

For many use cases (e.g. ETL) this is sufficient, and the vast majority of people still run Hive this way.

We added transactions, updates and deletes to Hive because some use cases require these features. Hive is being used more and more as a data warehouse, and while updates and deletes are less common there, they are still required (slowly changing dimensions, fixing wrong data, deleting records for compliance, etc.). Also, streaming data into warehouses from transactional systems is a common use case.

Alan.

 

Ashok Kumar

December 29, 2015 at 14:59

Hi,

 

Can someone please clarify what  "immutable data" in Hive means?

 

I have been told that data in Hive is/should be immutable, but in that case why do we need transactional tables in Hive that allow updates to data?

 

thanks and greetings 

 

 

 

 

 

 

 


Re: Immutable data in Hive

Posted by Ashok Kumar <as...@yahoo.com>.
I second that. Many thanks Mich for your reply.
Regards 

    On Monday, 4 January 2016, 10:58, "Singh, Abhijeet" <ab...@informatica.com> wrote:
 

Very well answered by Mich.

Thanks Mich !!

From: Mich Talebzadeh [mailto:mich@peridale.co.uk]
Sent: Sunday, January 03, 2016 8:35 PM
To: user@hive.apache.org; 'Ashok Kumar'
Subject: RE: Immutable data in Hive

Hi Ashok.

I will have a go at this on top of Alan’s very valuable clarification.

Extraction, Transformation and Load (ETL) is a very common method in Data Warehousing (DW) and Business Analytics projects. It can be performed with custom programming such as shell scripts, Java or .NET tools, or a combination of these, to get the data from internal or external sources and put it into the DW.

In general only data of value ends up in the DW. What this means is that in, say, a Banking environment you collect and feed (Extract) data into a staging area (in relational terms, often staging tables or so-called global temporary tables in a staging database, cleared daily for the next cycle), prune it of unwanted data, do some manipulation (Transformation), often into another set of staging tables, and finally Load it into target tables in the Data Warehouse. The analysts then use appropriate tools like Tableau to look at macroscopic trends in the data. Remember a Data Warehouse is still a relational database, most probably a columnar implementation of the relational model such as SAP Sybase IQ.

There are many examples of DW repositories used for Business Intelligence (BI, another fancy term for Analytics), such as working out global trading positions (I did one of these by bolting Oracle TimesTen IMDB onto an Oracle DW for fast extraction) or analysing data gathered from algorithmic trading using Complex Event Processing. Although DWs store larger amounts of data (large being a relative term) and have impressive compression, as in Sybase IQ (every column is stored as an index, so columnar compression is far more effective, all data in a column being of the same type, as opposed to row compression in OLTP databases), they still require additional space, SAN storage and expensive horizontal scaling (adding another multiplex requires an additional license).

ELT (Extraction, Load and Transform) is a similar concept used in the Big Data world. The fundamental difference is that it is not confined to data deemed to be of specific value, i.e. where you know what you are looking for in advance. In Hadoop one can store everything, from structured data (transactional databases) to unstructured data (data coming from the internet, Excel sheets, email, logs and others). This means you can store potentially all data to be exploited later. The Hadoop ecosystem provides that flexibility by means of horizontal scaling on cheap commodity disks (AKA JBOD), and the lack of licensing restrictions reduces Total Cost of Ownership (TCO) considerably. In summary, you (E)xtract and (L)oad all data as is (not caring whether it is exactly what you want) into HDFS, and then you do the (T)ransformation later through Schema on Read (you decide your data needs at exploration time).

HDFS is great for storing large amounts of data, but on top of it you will need tools like Hive, Spark, Cassandra and others to explore your data lake.

HTH

Mich Talebzadeh

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

http://talebzadehmich.wordpress.com/

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

From: Ashok Kumar [mailto:ashok34668@yahoo.com]
Sent: 03 January 2016 11:03
To: user@hive.apache.org; Ashok Kumar <as...@yahoo.com>
Subject: Re: Immutable data in Hive

Any comments on ELT will be greatly appreciated gurus.

With warmest greetings

On Wednesday, 30 December 2015, 18:20, Ashok Kumar <as...@yahoo.com> wrote:

Thank you sir, very helpful.

Could you also briefly describe, from your experience, the major differences between traditional ETL in DW and ELT in Hive? Why is there an emphasis on taking data from traditional transactional databases into Hive tables in the same format and doing the transformation in Hive afterwards? Is it because Hive is meant to be efficient at data transformation?

Regards

On Wednesday, 30 December 2015, 18:00, Alan Gates <al...@gmail.com> wrote:

Traditionally data in Hive was write once (insert), read many. You could append to tables and partitions, add new partitions, etc. You could remove data by dropping tables or partitions. But there were no updates of data or deletes of particular rows. This is what was meant by immutable. Hive was originally built this way because it was based on MapReduce and HDFS, and these were the natural semantics given those underlying systems.

For many use cases (e.g. ETL) this is sufficient, and the vast majority of people still run Hive this way.

We added transactions, updates and deletes to Hive because some use cases require these features. Hive is being used more and more as a data warehouse, and while updates and deletes are less common there, they are still required (slowly changing dimensions, fixing wrong data, deleting records for compliance, etc.). Also, streaming data into warehouses from transactional systems is a common use case.

Alan.

Ashok Kumar
December 29, 2015 at 14:59

Hi,

Can someone please clarify what "immutable data" in Hive means?

I have been told that data in Hive is/should be immutable, but in that case why do we need transactional tables in Hive that allow updates to data?

thanks and greetings

  

RE: Immutable data in Hive

Posted by "Singh, Abhijeet" <ab...@informatica.com>.
Very well answered by Mich.

Thanks Mich !!

From: Mich Talebzadeh [mailto:mich@peridale.co.uk]
Sent: Sunday, January 03, 2016 8:35 PM
To: user@hive.apache.org; 'Ashok Kumar'
Subject: RE: Immutable data in Hive

Hi Ashok.

I will have a go at this on top of Alan’s very valuable clarification.

Extraction, Transformation and Load (ETL) is a very common method in Data Warehousing (DW) and Business Analytics projects. It can be performed with custom programming such as shell scripts, Java or .NET tools, or a combination of these, to get the data from internal or external sources and put it into the DW.

In general only data of value ends up in the DW. What this means is that in, say, a Banking environment you collect and feed (Extract) data into a staging area (in relational terms, often staging tables or so-called global temporary tables in a staging database, cleared daily for the next cycle), prune it of unwanted data, do some manipulation (Transformation), often into another set of staging tables, and finally Load it into target tables in the Data Warehouse. The analysts then use appropriate tools like Tableau to look at macroscopic trends in the data. Remember a Data Warehouse is still a relational database, most probably a columnar implementation of the relational model such as SAP Sybase IQ.

There are many examples of DW repositories used for Business Intelligence (BI, another fancy term for Analytics), such as working out global trading positions (I did one of these by bolting Oracle TimesTen IMDB onto an Oracle DW for fast extraction) or analysing data gathered from algorithmic trading using Complex Event Processing. Although DWs store larger amounts of data (large being a relative term) and have impressive compression, as in Sybase IQ (every column is stored as an index, so columnar compression is far more effective, all data in a column being of the same type, as opposed to row compression in OLTP databases), they still require additional space, SAN storage and expensive horizontal scaling (adding another multiplex requires an additional license).

ELT (Extraction, Load and Transform) is a similar concept used in the Big Data world. The fundamental difference is that it is not confined to data deemed to be of specific value, i.e. where you know what you are looking for in advance. In Hadoop one can store everything, from structured data (transactional databases) to unstructured data (data coming from the internet, Excel sheets, email, logs and others). This means you can store potentially all data to be exploited later. The Hadoop ecosystem provides that flexibility by means of horizontal scaling on cheap commodity disks (AKA JBOD), and the lack of licensing restrictions reduces Total Cost of Ownership (TCO) considerably. In summary, you (E)xtract and (L)oad all data as is (not caring whether it is exactly what you want) into HDFS, and then you do the (T)ransformation later through Schema on Read (you decide your data needs at exploration time).

HDFS is great for storing large amounts of data, but on top of it you will need tools like Hive, Spark, Cassandra and others to explore your data lake.


HTH


Mich Talebzadeh

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

From: Ashok Kumar [mailto:ashok34668@yahoo.com]
Sent: 03 January 2016 11:03
To: user@hive.apache.org; Ashok Kumar <as...@yahoo.com>
Subject: Re: Immutable data in Hive

Any comments on ELT will be greatly appreciated gurus.

With warmest greetings

On Wednesday, 30 December 2015, 18:20, Ashok Kumar <as...@yahoo.com> wrote:

Thank you sir, very helpful.

Could you also briefly describe, from your experience, the major differences between traditional ETL in DW and ELT in Hive? Why is there an emphasis on taking data from traditional transactional databases into Hive tables in the same format and doing the transformation in Hive afterwards? Is it because Hive is meant to be efficient at data transformation?

Regards



On Wednesday, 30 December 2015, 18:00, Alan Gates <al...@gmail.com> wrote:

Traditionally data in Hive was write once (insert), read many. You could append to tables and partitions, add new partitions, etc. You could remove data by dropping tables or partitions. But there were no updates of data or deletes of particular rows. This is what was meant by immutable. Hive was originally built this way because it was based on MapReduce and HDFS, and these were the natural semantics given those underlying systems.

For many use cases (e.g. ETL) this is sufficient, and the vast majority of people still run Hive this way.

We added transactions, updates and deletes to Hive because some use cases require these features. Hive is being used more and more as a data warehouse, and while updates and deletes are less common there, they are still required (slowly changing dimensions, fixing wrong data, deleting records for compliance, etc.). Also, streaming data into warehouses from transactional systems is a common use case.

Alan.


Ashok Kumar
December 29, 2015 at 14:59
Hi,

Can someone please clarify what  "immutable data" in Hive means?

I have been told that data in Hive is/should be immutable, but in that case why do we need transactional tables in Hive that allow updates to data?

thanks and greetings







RE: Immutable data in Hive

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Hi Ashok.

 

I will have a go at this on top of Alan’s very valuable clarification.

 

Extraction, Transformation and Load (ETL) is a very common method in Data Warehousing (DW) and Business Analytics projects. It can be performed with custom programming such as shell scripts, Java or .NET tools, or a combination of these, to get the data from internal or external sources and put it into the DW.

In general only data of value ends up in the DW. What this means is that in, say, a Banking environment you collect and feed (Extract) data into a staging area (in relational terms, often staging tables or so-called global temporary tables in a staging database, cleared daily for the next cycle), prune it of unwanted data, do some manipulation (Transformation), often into another set of staging tables, and finally Load it into target tables in the Data Warehouse. The analysts then use appropriate tools like Tableau to look at macroscopic trends in the data. Remember a Data Warehouse is still a relational database, most probably a columnar implementation of the relational model such as SAP Sybase IQ.

There are many examples of DW repositories used for Business Intelligence (BI, another fancy term for Analytics), such as working out global trading positions (I did one of these by bolting Oracle TimesTen IMDB onto an Oracle DW for fast extraction) or analysing data gathered from algorithmic trading using Complex Event Processing. Although DWs store larger amounts of data (large being a relative term) and have impressive compression, as in Sybase IQ (every column is stored as an index, so columnar compression is far more effective, all data in a column being of the same type, as opposed to row compression in OLTP databases), they still require additional space, SAN storage and expensive horizontal scaling (adding another multiplex requires an additional license).

ELT (Extraction, Load and Transform) is a similar concept used in the Big Data world. The fundamental difference is that it is not confined to data deemed to be of specific value, i.e. where you know what you are looking for in advance. In Hadoop one can store everything, from structured data (transactional databases) to unstructured data (data coming from the internet, Excel sheets, email, logs and others). This means you can store potentially all data to be exploited later. The Hadoop ecosystem provides that flexibility by means of horizontal scaling on cheap commodity disks (AKA JBOD), and the lack of licensing restrictions reduces Total Cost of Ownership (TCO) considerably. In summary, you (E)xtract and (L)oad all data as is (not caring whether it is exactly what you want) into HDFS, and then you do the (T)ransformation later through Schema on Read (you decide your data needs at exploration time).

HDFS is great for storing large amounts of data, but on top of it you will need tools like Hive, Spark, Cassandra and others to explore your data lake.

 

 

HTH

 

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Ashok Kumar [mailto:ashok34668@yahoo.com] 
Sent: 03 January 2016 11:03
To: user@hive.apache.org; Ashok Kumar <as...@yahoo.com>
Subject: Re: Immutable data in Hive

 

Any comments on ELT will be greatly appreciated gurus.

 

With warmest greetings

 

On Wednesday, 30 December 2015, 18:20, Ashok Kumar <ashok34668@yahoo.com> wrote:

 

Thank you sir, very helpful.

 

Could you also briefly describe, from your experience, the major differences between traditional ETL in DW and ELT in Hive? Why is there an emphasis on taking data from traditional transactional databases into Hive tables in the same format and doing the transformation in Hive afterwards? Is it because Hive is meant to be efficient at data transformation?

 

Regards 

 

 

 

On Wednesday, 30 December 2015, 18:00, Alan Gates <alanfgates@gmail.com> wrote:

 

Traditionally data in Hive was write once (insert), read many. You could append to tables and partitions, add new partitions, etc. You could remove data by dropping tables or partitions. But there were no updates of data or deletes of particular rows. This is what was meant by immutable. Hive was originally built this way because it was based on MapReduce and HDFS, and these were the natural semantics given those underlying systems.

For many use cases (e.g. ETL) this is sufficient, and the vast majority of people still run Hive this way.

We added transactions, updates and deletes to Hive because some use cases require these features. Hive is being used more and more as a data warehouse, and while updates and deletes are less common there, they are still required (slowly changing dimensions, fixing wrong data, deleting records for compliance, etc.). Also, streaming data into warehouses from transactional systems is a common use case.

Alan.






Ashok Kumar

December 29, 2015 at 14:59

Hi,

 

Can someone please clarify what  "immutable data" in Hive means?

 

I have been told that data in Hive is/should be immutable, but in that case why do we need transactional tables in Hive that allow updates to data?

 

thanks and greetings