Posted to common-issues@hadoop.apache.org by "Dinesh S. Atreya (JIRA)" <ji...@apache.org> on 2015/12/07 16:19:11 UTC
[jira] [Updated] (HADOOP-12620) Advanced Hadoop Architecture (AHA)
- Common
[ https://issues.apache.org/jira/browse/HADOOP-12620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dinesh S. Atreya updated HADOOP-12620:
--------------------------------------
Description:
h1. Advanced Hadoop Architecture (AHA) / Advanced Hadoop Adaptabilities (AHA)
The main motivation for this JIRA is to address a comprehensive set of use cases with minimal enhancements to Hadoop, moving Hadoop from a Modern Data Architecture to an Advanced/Cloud Data Architecture.
HDFS traditionally had a write-once-read-many access model for files, until the “Append to files in HDFS” capability was introduced. The next minimal enhancement to core Hadoop is the capability to do “updates-in-place” in HDFS:
• Support seeks for writes (in addition to reads).
• After a seek, if the new byte length is the same as the old byte length, an in-place update is allowed.
• A delete is an update that writes an appropriate delete marker.
• If the byte length is different, the old entry is marked as deleted and the new one is appended, as before.
• It is at the client’s discretion to perform an update, an append, or both, and the API changes in the different Hadoop components should provide these capabilities.
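The rules above can be illustrated with a minimal sketch. It uses an in-memory `io.BytesIO` buffer as a stand-in for an HDFS file; the record layout (one marker byte followed by the payload), the marker values, and all function names are assumptions made for illustration and are not part of any Hadoop API.

```python
import io

# Hypothetical one-byte record markers (not defined by Hadoop).
LIVE, DELETED = b"L", b"D"


def append_record(f, payload: bytes) -> int:
    """Append a record at the end of the file and return its start offset."""
    offset = f.seek(0, io.SEEK_END)
    f.write(LIVE + payload)
    return offset


def delete_record(f, offset: int) -> None:
    """A delete is an in-place update that overwrites the marker byte."""
    f.seek(offset)
    f.write(DELETED)


def update_record(f, offset: int, old_len: int, payload: bytes) -> int:
    """Update in place when the lengths match; otherwise mark-delete and append."""
    if len(payload) == old_len:
        f.seek(offset + 1)      # seek for write, skipping the marker byte
        f.write(payload)        # same byte length: overwrite in place
        return offset
    delete_record(f, offset)    # different length: mark the old entry deleted
    return append_record(f, payload)


f = io.BytesIO()
a = append_record(f, b"alpha")
b = update_record(f, a, 5, b"bravo")     # same length, stays at the same offset
c = update_record(f, b, 5, b"charlie")   # longer, so it is appended at the end
contents = f.getvalue()
```

In this sketch the client decides between update and append simply by comparing lengths; a real client could equally choose to always append.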
These minimal changes lay the basis for transforming core Hadoop into an interactive, real-time platform and for introducing significant native capabilities to Hadoop. They provide a foundation for supporting all of the following processing styles natively and dynamically:
• Real time
• Mini-batch
• Stream based data processing
• Batch – which is the default now.
Hadoop engines can dynamically choose which processing style to use based on the type and volume of the data sets, and can enhance or replace prevailing approaches.
With this, Hadoop engines can evolve to use modern CPU, memory, and I/O resources with increasing efficiency. The Hadoop task engines can use vectorized/pipelined processing and make greater use of memory throughout the Hadoop platform.
These changes will allow enhanced performance optimizations to be implemented in HDFS and made available to all Hadoop components. This will enable fast processing of big data and improve handling of all three of its characteristics: volume, velocity, and variety.
There are many influences for this umbrella JIRA:
• Preserve and Accelerate Hadoop
• Efficient data management of a variety of data formats natively in Hadoop
• Enterprise Expansion
• Internet and Media
• Databases offer native support for a variety of data formats such as JSON, XML, indexes, and temporal data; Hadoop should do the same.
Many sub-JIRAs will probably be created to address portions of this work. This JIRA captures a variety of use cases in one place. Some initial data management and platform use cases are given hereunder:
Key-Value Store
With the proposed enhancements, it becomes convenient to implement a key-value store natively in Hadoop.
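As a rough illustration of why in-place updates help here, the sketch below keeps every value in a fixed-width slot so that overwriting an existing key never changes the byte length. The slot width, the `KVStore` class, and the in-memory `io.BytesIO` buffer standing in for an HDFS file are all assumptions for illustration only.

```python
import io

SLOT = 16  # hypothetical fixed slot width in bytes


class KVStore:
    """Toy key-value store backed by a seekable, updatable byte buffer."""

    def __init__(self):
        self.data = io.BytesIO()   # stand-in for an updatable HDFS file
        self.index = {}            # key -> byte offset of the key's slot

    def put(self, key: str, value: bytes) -> None:
        if len(value) > SLOT:
            raise ValueError("value exceeds slot width")
        padded = value.ljust(SLOT, b"\x00")
        if key in self.index:
            self.data.seek(self.index[key])                    # in-place update
        else:
            self.index[key] = self.data.seek(0, io.SEEK_END)   # append new slot
        self.data.write(padded)

    def get(self, key: str) -> bytes:
        self.data.seek(self.index[key])
        # Padding is stripped, so values ending in NUL bytes are not supported.
        return self.data.read(SLOT).rstrip(b"\x00")


store = KVStore()
store.put("a", b"1")
store.put("b", b"2")
store.put("a", b"3")   # overwrites the slot for "a" without growing the file
```

Fixed-width slots trade wasted space for the guarantee that every overwrite qualifies as a same-length update.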
MVCC
A modified example, adapted from PostgreSQL MVCC, of how MVCC can be implemented with the proposed enhancements is given hereunder. https://wiki.postgresql.org/wiki/MVCC
http://momjian.us/main/writings/pgsql/mvcc.pdf
Data
||ID||Activity||Data Create Counter||Data Expiry Counter||Comments||
|1|Insert|40|MAX_VAL|Conventionally MAX_VAL is null. In order to maintain update size, MAX_VAL is pre-seeded for our purposes.|
|1|Delete|40|47|Marked as deleted when the current counter was 47.|
|2|Update (old delete)|64|78|Mark old data as DELETE.|
|2|Update (new insert)|78|MAX_VAL|Insert new data.|
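The counters in the table above can be replayed with a small sketch. The concrete `MAX_VAL` value, the list-based row layout, the function names, and the visibility rule `create <= snapshot < expiry` are simplifying assumptions for illustration; PostgreSQL's actual visibility rules are more involved.

```python
# Pre-seeded "no expiry yet" sentinel. Its value is an assumption; the point is
# that it has the same width as any real counter, so setting an expiry later
# is a same-length in-place update.
MAX_VAL = 2**63 - 1


def insert(rows, row_id, counter):
    """Append a new row version: [id, create_counter, expiry_counter]."""
    rows.append([row_id, counter, MAX_VAL])


def delete(rows, row_id, counter):
    """Expire the live version of a row by overwriting its expiry counter."""
    for row in rows:
        if row[0] == row_id and row[2] == MAX_VAL:
            row[2] = counter


def update(rows, row_id, counter):
    """An update expires the old version and inserts a new one."""
    delete(rows, row_id, counter)
    insert(rows, row_id, counter)


def visible_ids(rows, snapshot):
    """Return ids of row versions visible to a reader at the given counter."""
    return [r[0] for r in rows if r[1] <= snapshot < r[2]]


rows = []
insert(rows, 1, 40)
delete(rows, 1, 47)
insert(rows, 2, 64)
update(rows, 2, 78)
```

After the replay, `rows` holds the final create and expiry counters from the table: row 1 expired at 47, and row 2 has an old version expired at 78 plus a live version created at 78.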
Graph Stores
Enable native storage and processing for a variety of graph stores.
Graph Store 1 (Spark GraphX)
1. EdgeTable(pid, src, dst, data): stores the adjacency structure and edge data. Each edge is represented as a tuple consisting of the source vertex id, destination vertex id, and user-defined data, as well as a virtual partition identifier (pid). Note that the edge table contains only the vertex ids and not the vertex data. The edge table is partitioned by the pid.
2. VertexDataTable(id, data): stores the vertex data, in the form of vertex (id, data) pairs. The vertex data table is indexed and partitioned by the vertex id.
3. VertexMap(id, pid): provides a mapping from the id of a vertex to the ids of the virtual partitions that contain adjacent edges.
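To make the relationship between the three tables concrete, here is a minimal sketch using plain Python collections. It does not use the Spark or GraphX APIs; the sample vertices, edges, and the `build_vertex_map` helper are hypothetical.

```python
from collections import defaultdict

# EdgeTable rows: (pid, src, dst, data). Only vertex ids are stored here.
edge_table = [
    (0, 1, 2, "follows"),
    (0, 2, 3, "follows"),
    (1, 3, 1, "likes"),
]

# VertexDataTable: vertex id -> vertex data.
vertex_data_table = {1: "alice", 2: "bob", 3: "carol"}


def build_vertex_map(edges):
    """Derive VertexMap: vertex id -> set of partition ids with adjacent edges."""
    vertex_map = defaultdict(set)
    for pid, src, dst, _data in edges:
        vertex_map[src].add(pid)
        vertex_map[dst].add(pid)
    return dict(vertex_map)


vertex_map = build_vertex_map(edge_table)
```

With such a map, an update to a vertex only needs to be sent to the partitions listed for that vertex id.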
Graph Store 2 (Facebook Social Graph - TAO)
Object: (id) → (otype,(key → value)∗ )
Assoc.: (id1,atype,id2) → (time,(key → value) ∗ )
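The two mappings above can be sketched with dictionaries. The function names and sample data below are hypothetical and are not the TAO API; the sketch only mirrors the shape of the object and association records.

```python
objects = {}   # id -> (otype, {key: value})
assocs = {}    # (id1, atype, id2) -> (time, {key: value})


def add_object(oid, otype, **fields):
    """Store an object record keyed by its id."""
    objects[oid] = (otype, dict(fields))


def add_assoc(id1, atype, id2, time, **fields):
    """Store a typed, timestamped association from id1 to id2."""
    assocs[(id1, atype, id2)] = (time, dict(fields))


def assocs_from(id1, atype):
    """List (id2, time) for all associations of one type, newest first."""
    found = [
        (key[2], value[0])
        for key, value in assocs.items()
        if key[0] == id1 and key[1] == atype
    ]
    return sorted(found, key=lambda pair: pair[1], reverse=True)


add_object(1, "user", name="alice")
add_object(2, "post", text="hi")
add_object(3, "post", text="bye")
add_assoc(1, "authored", 2, time=100)
add_assoc(1, "authored", 3, time=200)
```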
WEB
With the AHA enhancements, a variety of Web standards can be natively supported, such as updatable JSON (http://json.org/), XML, RDF, and other document formats.
RDF
RDF Schema 1.1: http://www.w3.org/TR/2014/REC-rdf-schema-20140225/
RDF Triple: http://www.w3.org/TR/2014/REC-n-triples-20140225/
The simplest triple statement is a sequence of (subject, predicate, object) terms, separated by whitespace and terminated by '.' after each triple.
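A minimal sketch of reading such a statement follows. It is deliberately simplified and handles only IRIs in angle brackets and plain double-quoted literals without escapes, language tags, datatypes, or blank nodes, so it is not a conforming N-Triples parser. The `parse_triple` name and the example IRIs are assumptions.

```python
import re

# subject and predicate are <IRI>; object is <IRI> or a plain "literal".
TRIPLE = re.compile(
    r'^\s*(<[^>]*>)\s+(<[^>]*>)\s+(<[^>]*>|"[^"]*")\s*\.\s*$'
)


def parse_triple(line: str):
    """Return (subject, predicate, object) for one simple triple line."""
    match = TRIPLE.match(line)
    if match is None:
        raise ValueError(f"not a simple triple: {line!r}")
    return match.groups()


triple = parse_triple(
    '<http://example.org/s> <http://example.org/p> "hello" .'
)
```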
Mobile Apps Data and Resources
With the proposed enhancements, app data and resources can be managed using Hadoop, in addition to the Web. Examples of such usage include app data and resources for Apple and other app stores.
About Apps Resources: https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/Introduction/Introduction.html
On-Demand Resources Essentials: https://developer.apple.com/library/prerelease/ios/documentation/FileManagement/Conceptual/On_Demand_Resources_Guide/
Resource Programming Guide: https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/LoadingResources.pdf
Temporal Data
https://en.wikipedia.org/wiki/Temporal_database
https://en.wikipedia.org/wiki/Valid_time
In temporal data, records may be updated to reflect changes in the facts they describe.
For example, the data may change from
Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
Person(John Doe, Bigtown, 26-Aug-1994, 1-Apr-2001)
to
Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
Person(John Doe, Bigtown, 26-Aug-1994, 1-Jun-1995)
Person(John Doe, Beachy, 1-Jun-1995, 3-Sep-2000)
Person(John Doe, Bigtown, 3-Sep-2000, 1-Apr-2001)
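The change shown above amounts to inserting a new valid-time period and trimming or splitting any period it overlaps. The sketch below reproduces it with in-memory tuples; the `apply_change` function and the tuple layout are assumptions for illustration.

```python
from datetime import date


def apply_change(history, name, city, start, end):
    """Insert a valid-time period, trimming or splitting overlapping periods."""
    result = []
    for n, c, s, e in history:
        if n != name or e <= start or s >= end:
            result.append((n, c, s, e))        # no overlap: keep unchanged
            continue
        if s < start:
            result.append((n, c, s, start))    # keep the part before the change
        if e > end:
            result.append((n, c, end, e))      # keep the part after the change
    result.append((name, city, start, end))
    return sorted(result, key=lambda row: (row[0], row[2]))


before = [
    ("John Doe", "Smallville", date(1975, 4, 3), date(1994, 8, 26)),
    ("John Doe", "Bigtown", date(1994, 8, 26), date(2001, 4, 1)),
]
after = apply_change(
    before, "John Doe", "Beachy", date(1995, 6, 1), date(2000, 9, 3)
)
```

Under the proposed enhancements, shortening the first Bigtown period would be a same-length in-place update of its end date, while the Beachy row and the second Bigtown row would be appends.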
Media
Media production typically involves many changes and updates prior to release. The enhancements will lay a basis for managing the full lifecycle in the Hadoop ecosystem.
Indexes
With these changes, a variety of updatable indexes can be supported natively in Hadoop. Search software such as Solr and Elasticsearch can then in turn leverage Hadoop’s enhanced native capabilities.
Natural Support for ETL and Analytics
With native support for updates and deletes in addition to appends/inserts, Hadoop will have proper and natural support for ETL and Analytics.
Google References
While Google’s research in this area is interesting (some references are listed hereunder), the evolution of Hadoop is also quite interesting. The proposed enhancements to support in-place updates in core Hadoop will enable, and make easier, a variety of enhancements in each of the Hadoop components.
We propose a basis for a system that incrementally processes updates to large data sets and reduces the overhead of always having to run large batches. Hadoop engines can dynamically choose which processing style to use based on the type and volume of the data sets, and can enhance or replace prevailing approaches.
||Year||Title||Links||
|2015|Announcing Google Cloud Bigtable: The same database that powers Google Search, Gmail and Analytics is now available on Google Cloud Platform|http://googlecloudplatform.blogspot.co.uk/2015/05/introducing-Google-Cloud-Bigtable.html https://cloud.google.com/bigtable/|
|2014|Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing|http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/42851.pdf|
|2013|F1: A Distributed SQL Database That Scales|http://research.google.com/pubs/pub41344.html|
|2013|Online, Asynchronous Schema Change in F1|http://research.google.com/pubs/pub41376.html|
|2013|Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams|http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41318.pdf|
|2012|F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business|http://research.google.com/pubs/pub38125.html|
|2012|Spanner: Google's Globally-Distributed Database|http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/39966.pdf|
|2012|Clydesdale: structured data processing on MapReduce|http://dl.acm.org/citation.cfm?doid=2247596.2247600|
|2011|Megastore: Providing Scalable, Highly Available Storage for Interactive Services|http://research.google.com/pubs/pub36971.html|
|2011|Tenzing A SQL Implementation On The MapReduce Framework|http://research.google.com/pubs/pub37200.html|
|2010|Dremel: Interactive Analysis of Web-Scale Datasets|http://research.google.com/pubs/pub36632.html|
|2010|FlumeJava: Easy, Efficient Data-Parallel Pipelines|http://research.google.com/pubs/pub35650.html|
|2010|Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications|http://research.google.com/pubs/pub36726.html https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Peng.pdf|
Application Domains
The enhancements will lay a path for comprehensive support of all application domains in Hadoop. A small collection is given hereunder.
• Data Warehousing and Enhanced ETL processing
• Supply Chain Planning
• Web Sites
• Mobile App Stores
• Financials
• Media
• Machine Learning
• Social Media
• Enterprise Applications such as ERP, CRM
Corresponding umbrella JIRAs can be found for each of the following Hadoop platform components.
> Advanced Hadoop Architecture (AHA) - Common
> -------------------------------------------
>
> Key: HADOOP-12620
> URL: https://issues.apache.org/jira/browse/HADOOP-12620
> Project: Hadoop Common
> Issue Type: New Feature
> Reporter: Dinesh S. Atreya
>