Posted to common-issues@hadoop.apache.org by "Dinesh S. Atreya (JIRA)" <ji...@apache.org> on 2015/12/07 16:19:11 UTC

[jira] [Updated] (HADOOP-12620) Advanced Hadoop Architecture (AHA) - Common

     [ https://issues.apache.org/jira/browse/HADOOP-12620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dinesh S. Atreya updated HADOOP-12620:
--------------------------------------
    Description: 
h1. Advanced Hadoop Architecture (AHA) / Advanced Hadoop Adaptabilities (AHA)

A main motivation for this JIRA is to address a comprehensive set of use cases with minimal enhancements to Hadoop, moving Hadoop from a Modern Data Architecture to an Advanced/Cloud Data Architecture.

HDFS traditionally had a write-once-read-many access model for files, until the “Append to files in HDFS” capability was introduced. The next minimal enhancement to core Hadoop is the capability to do “updates-in-place” in HDFS:
•	Support seeks for writes (in addition to reads).
•	After a seek, if the new byte length is the same as the old byte length, an in-place update is allowed.
•	A delete is an update with an appropriate delete marker.
•	If the byte length is different, the old entry is marked as deleted and the new one is appended as before.
•	It is at the client’s discretion to perform an update, an append, or both, and the API changes in the different Hadoop components should provide these capabilities.
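
The update rules above can be sketched against any seekable byte stream. The following is only an illustration, not an HDFS API: it uses an in-memory io.BytesIO in place of an HDFS file, and the function names and the DELETE_MARKER byte are hypothetical, since the proposal does not specify a marker format.

```python
import io

# Hypothetical tombstone byte; the proposal does not define the marker format.
DELETE_MARKER = b"\x00"


def update_record(stream, offset, old_len, new_bytes):
    """Apply the proposed update rules to a seekable binary stream.

    Returns the offset at which the current version of the record lives.
    """
    if len(new_bytes) == old_len:
        # Same byte length: seek to the record and overwrite it in place.
        stream.seek(offset)
        stream.write(new_bytes)
        return offset
    # Different byte length: mark the old entry as deleted, then append.
    stream.seek(offset)
    stream.write(DELETE_MARKER * old_len)
    stream.seek(0, io.SEEK_END)
    new_offset = stream.tell()
    stream.write(new_bytes)
    return new_offset


def delete_record(stream, offset, old_len):
    """A delete is an update that writes the delete marker over the entry."""
    return update_record(stream, offset, old_len, DELETE_MARKER * old_len)


# Two 5-byte records stored back to back.
buf = io.BytesIO(b"alphabravo")
same = update_record(buf, 0, 5, b"ALPHA")     # same length: updated in place
moved = update_record(buf, 5, 5, b"charlie")  # longer: old marked, new appended
```

In this sketch the first call leaves the record at offset 0, while the second returns the offset of the appended copy, so a client would have to track the new location itself.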

These minimal changes lay the basis for transforming core Hadoop into an interactive and real-time platform and for introducing significant native capabilities to Hadoop. They lay a foundation for all of the following processing styles to be supported natively and dynamically:
•	Real time
•	Mini-batch
•	Stream-based data processing
•	Batch, which is the default now.
Hadoop engines can dynamically choose which processing style to use based on the type and volume of the data sets, and enhance or replace prevailing approaches.

With this, Hadoop engines can evolve to use modern CPU, memory and I/O resources with increasing efficiency. The Hadoop task engines can use vectorized/pipelined processing and make greater use of memory throughout the Hadoop platform.

These changes will enable enhanced performance optimizations to be implemented in HDFS and made available to all the Hadoop components. This will enable fast processing of big data and improve all of its characteristics: volume, velocity and variety.

There are many influences for this umbrella JIRA:

•	Preserve and Accelerate Hadoop
•	Efficient data management of a variety of data formats natively in Hadoop
•	Enterprise Expansion 
•	Internet and Media 
•	Databases offer native support for a variety of Data Formats such as JSON, XML Indexes, and Temporal etc. – Hadoop should do the same.

It is quite probable that many sub-JIRAs will be created to address portions of this. This JIRA captures a variety of use cases in one place. Some initial data management/platform use cases are given below:

Key-Value Store
With the proposed enhancements, it becomes convenient to implement a key-value store natively in Hadoop.

MVCC 

A modified example of how MVCC can be implemented with the proposed enhancements, adapted from PostgreSQL MVCC, is given below. https://wiki.postgresql.org/wiki/MVCC
http://momjian.us/main/writings/pgsql/mvcc.pdf 

Data ID	Activity	Data Create Counter	Data Expiry Counter	Comments
1	Insert	40	MAX_VAL	Conventionally MAX_VAL is null. To keep the update size constant, MAX_VAL is pre-seeded for our purposes.
1	Delete	40	47	Marked as deleted when the current counter was 47.
2	Update (old delete)	64	78	Mark the old data as DELETE.
2	Update (new insert)	78	MAX_VAL	Insert the new data.
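
The reason for pre-seeding MAX_VAL can be shown with a small sketch. It assumes, purely for illustration, that both counters are stored as fixed-width 8-byte integers, so that setting the expiry counter never changes the row length and therefore qualifies as an in-place update. The names below (HEADER, encode_row, expire_row, is_visible) are hypothetical, and the visibility rule is a simplification of what PostgreSQL actually does.

```python
import struct

# Assumed layout: two unsigned 8-byte big-endian counters (create, expiry).
HEADER = struct.Struct(">QQ")
# Pre-seeded expiry value for a live row, used instead of null.
MAX_VAL = 2**64 - 1


def encode_row(create, expiry=MAX_VAL):
    """Serialize the counters of a row version; payload bytes are omitted."""
    return HEADER.pack(create, expiry)


def expire_row(row, counter):
    """Return the row with its expiry counter set; the length is unchanged."""
    create, _ = HEADER.unpack(row)
    return HEADER.pack(create, counter)


def is_visible(row, snapshot):
    """Simplified rule: visible if created at or before the snapshot
    and not yet expired at the snapshot."""
    create, expiry = HEADER.unpack(row)
    return create <= snapshot < expiry


live = encode_row(40)           # Data ID 1, inserted at counter 40
dead = expire_row(live, 47)     # deleted when the counter was 47
```

Because the live and the expired row have the same byte length, the delete in the table above can be written over the original row rather than appended.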


Graph Stores
Enable native storage and processing for a variety of graph stores. 

Graph Store 1 (Spark GraphX)
1. EdgeTable(pid, src, dst, data): stores the adjacency structure and edge data. Each edge is represented as a tuple consisting of the source vertex id, destination vertex id, and user-defined data, as well as a virtual partition identifier (pid). Note that the edge table contains only the vertex ids and not the vertex data. The edge table is partitioned by the pid.
2. VertexDataTable(id, data): stores the vertex data in the form of vertex (id, data) pairs. The vertex data table is indexed and partitioned by the vertex id.
3. VertexMap(id, pid): provides a mapping from the id of a vertex to the ids of the virtual partitions that contain adjacent edges.
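
As a rough illustration of how the three tables relate, they can be modelled with plain in-memory dictionaries. This is not GraphX code; the helper names add_edge and edges_touching are hypothetical.

```python
from collections import defaultdict

# EdgeTable(pid, src, dst, data): edges grouped (partitioned) by pid.
edge_table = defaultdict(list)
# VertexDataTable(id, data): vertex data keyed by vertex id.
vertex_data = {}
# VertexMap(id, pid): vertex id -> ids of partitions holding adjacent edges.
vertex_map = defaultdict(set)


def add_edge(pid, src, dst, data):
    """Store an edge in its partition and record the partition for both ends."""
    edge_table[pid].append((src, dst, data))
    vertex_map[src].add(pid)
    vertex_map[dst].add(pid)


def edges_touching(vid):
    """Use the vertex map to scan only the partitions that can contain vid."""
    return [
        edge
        for pid in sorted(vertex_map.get(vid, ()))
        for edge in edge_table[pid]
        if vid in (edge[0], edge[1])
    ]


vertex_data.update({1: "a", 2: "b", 3: "c"})
add_edge(0, 1, 2, "e12")
add_edge(1, 2, 3, "e23")
add_edge(1, 3, 1, "e31")
```

Note that the edge tuples carry only vertex ids; the vertex data has to be looked up separately in vertex_data.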

Graph Store 2 (Facebook Social Graph - TAO)

Object:  (id) → (otype,(key → value)∗ )
Assoc.: (id1,atype,id2) → (time,(key → value) ∗ )
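
The two mappings above translate directly into keyed structures. The sketch below uses plain dictionaries and made-up sample data; it does not reproduce TAO's actual API, and the helper newest_assocs is a hypothetical name.

```python
# Object: (id) -> (otype, (key -> value)*)
objects = {
    1: ("user", {"name": "alice"}),
    2: ("user", {"name": "bob"}),
    3: ("post", {"title": "hello"}),
    4: ("post", {"title": "world"}),
}

# Assoc.: (id1, atype, id2) -> (time, (key -> value)*)
assocs = {
    (1, "likes", 3): (100, {}),
    (1, "likes", 4): (250, {"via": "mobile"}),
    (2, "likes", 3): (180, {}),
}


def newest_assocs(id1, atype):
    """Return (id2, time, data) for one source object and association type,
    ordered from newest to oldest."""
    hits = [
        (id2, time, data)
        for (src, kind, id2), (time, data) in assocs.items()
        if src == id1 and kind == atype
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```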

WEB
With the AHA enhancements, a variety of Web standards can be natively supported, such as updateable JSON (http://json.org/), XML, RDF and other documents.


RDF
RDF Schema 1.1: http://www.w3.org/TR/2014/REC-rdf-schema-20140225/ 
RDF Triple: http://www.w3.org/TR/2014/REC-n-triples-20140225/ 
The simplest triple statement is a sequence of (subject, predicate, object) terms, separated by whitespace and terminated by '.' after each triple.
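
A minimal sketch of reading one such statement follows. It covers only the simple form described above (three whitespace-separated terms ending in '.') and deliberately ignores comments, escapes and validation of the individual terms, so it is not a full N-Triples parser; parse_triple is a hypothetical name.

```python
def parse_triple(line):
    """Split one simple triple statement into (subject, predicate, object)."""
    text = line.strip()
    if not text.endswith("."):
        raise ValueError("a triple must be terminated by '.'")
    body = text[:-1].rstrip()
    # Split on the first two runs of whitespace only, so that an object
    # literal containing spaces stays in one piece.
    parts = body.split(None, 2)
    if len(parts) != 3:
        raise ValueError("expected subject, predicate and object")
    return tuple(parts)


triple = parse_triple(
    '<http://example.org/john> <http://example.org/name> "John Doe" .'
)
```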

Mobile Apps Data and Resources

With the proposed enhancements, app data and resources can be managed using Hadoop, in addition to the Web. Examples of such usage include app data and resources for Apple and other app stores.

About Apps Resources: https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/Introduction/Introduction.html 
On-Demand Resources Essentials: https://developer.apple.com/library/prerelease/ios/documentation/FileManagement/Conceptual/On_Demand_Resources_Guide/ 
Resource Programming Guide: https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/LoadingResources.pdf 



Temporal Data 
https://en.wikipedia.org/wiki/Temporal_database 
https://en.wikipedia.org/wiki/Valid_time 
In temporal data, records may be updated to reflect changes over time.
For example, the data may change from
Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
Person(John Doe, Bigtown, 26-Aug-1994, 1-Apr-2001)
to
Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
Person(John Doe, Bigtown, 26-Aug-1994, 1-Jun-1995)
Person(John Doe, Beachy, 1-Jun-1995, 3-Sep-2000)
Person(John Doe, Bigtown, 3-Sep-2000, 1-Apr-2001)
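
The change shown above amounts to splitting an existing valid-time period around a newly recorded one. A small sketch of that step, using in-memory tuples of (name, town, valid_from, valid_to) and a hypothetical helper insert_period, could look like this:

```python
from datetime import date


def insert_period(history, new):
    """Insert a record and trim or split any record of the same person
    whose valid-time period overlaps the new one."""
    name, _, start, end = new
    out = []
    for rec in history:
        n, town, s, e = rec
        if n != name or e <= start or end <= s:
            out.append(rec)  # different person or no overlap: keep as is
            continue
        if s < start:
            out.append((n, town, s, start))  # part before the new period
        if end < e:
            out.append((n, town, end, e))    # part after the new period
    out.append(new)
    return sorted(out, key=lambda r: (r[0], r[2]))


history = [
    ("John Doe", "Smallville", date(1975, 4, 3), date(1994, 8, 26)),
    ("John Doe", "Bigtown", date(1994, 8, 26), date(2001, 4, 1)),
]
updated = insert_period(
    history, ("John Doe", "Beachy", date(1995, 6, 1), date(2000, 9, 3))
)
```

In terms of the proposed enhancements, shortening the end date of the first Bigtown record could be an in-place update if dates are stored at a fixed width (an assumption), while the Beachy record and the second Bigtown record would be appended.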

Media
Media production typically involves a lot of changes and updates prior to release. The enhancements will lay a basis for the full lifecycle to be managed in the Hadoop ecosystem.

Indexes
With the changes, a variety of updatable indexes can be supported natively in Hadoop. Search software such as Solr and Elasticsearch can then in turn leverage Hadoop’s enhanced native capabilities.

Natural Support for ETL and Analytics
With native support for updates and deletes in addition to appends/inserts, Hadoop will have proper and natural support for ETL and Analytics.

Google References

Google’s research in this area is of interest (some references are listed below), as is the evolution of Hadoop. The proposed enhancements to support in-place updates in core Hadoop will enable, and make easier, a variety of enhancements in each of the Hadoop components.

We propose a basis for a system that incrementally processes updates to large data sets and reduces the overhead of always having to run large batches. Hadoop engines can dynamically choose which processing style to use based on the type and volume of the data sets, and enhance or replace prevailing approaches.


Year	Title	Links
2015	Announcing Google Cloud Bigtable: The same database that powers Google Search, Gmail and Analytics is now available on Google Cloud Platform	http://googlecloudplatform.blogspot.co.uk/2015/05/introducing-Google-Cloud-Bigtable.html https://cloud.google.com/bigtable/
2014	Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing	http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/42851.pdf 
2013	F1: A Distributed SQL Database That Scales	http://research.google.com/pubs/pub41344.html 
2013	Online, Asynchronous Schema Change in F1	http://research.google.com/pubs/pub41376.html 
2013	Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams	http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41318.pdf 
2012	F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business	http://research.google.com/pubs/pub38125.html 
2012	Spanner: Google's Globally-Distributed Database	http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/39966.pdf 
2012	Clydesdale: structured data processing on MapReduce	http://dl.acm.org/citation.cfm?doid=2247596.2247600 
2011	Megastore: Providing Scalable, Highly Available Storage for Interactive Services	http://research.google.com/pubs/pub36971.html 
2011	Tenzing: A SQL Implementation On The MapReduce Framework	http://research.google.com/pubs/pub37200.html
2010	Dremel: Interactive Analysis of Web-Scale Datasets	http://research.google.com/pubs/pub36632.html 
2010	FlumeJava: Easy, Efficient Data-Parallel Pipelines	http://research.google.com/pubs/pub35650.html 
2010	Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications	http://research.google.com/pubs/pub36726.html https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Peng.pdf

Application Domains

The enhancements will lay a path for comprehensive support of all application domains in Hadoop. A small collection is given hereunder.

Data Warehousing and Enhanced ETL processing  
Supply Chain Planning
Web Sites 
Mobile App Stores
Financials 
Media 
Machine Learning
Social Media
Enterprise Applications such as ERP, CRM 


Corresponding umbrella JIRAs can be found for each of the following Hadoop platform components. 


> Advanced Hadoop Architecture (AHA) - Common
> -------------------------------------------
>
>                 Key: HADOOP-12620
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12620
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Dinesh S. Atreya
>


