You are viewing a plain text version of this content. The canonical link for it is here.
Posted to general@incubator.apache.org by Chris Nauroth <cn...@hortonworks.com> on 2016/03/01 00:02:50 UTC
Re: [VOTE] Accept Mnemonic into the Apache Incubator

+1 (binding)

--Chris Nauroth




On 2/29/16, 9:37 AM, "Patrick Hunt" <ph...@apache.org> wrote:

>Hi folks,
>
>OK the discussion is now completed. Please VOTE to accept Mnemonic
>into the Apache Incubator. I¹ll leave the VOTE open for at least
>the next 72 hours, with hopes to close it Thursday the 3rd of
>March, 2016 at 10am PT.
>https://wiki.apache.org/incubator/MnemonicProposal
>
>[ ] +1 Accept Mnemonic as an Apache Incubator podling.
>[ ] +0 Abstain.
>[ ] -1 Don¹t accept Mnemonic as an Apache Incubator podling because..
>
>Of course, I am +1 on this. Please note VOTEs from Incubator PMC
>members are binding but all are welcome to VOTE!
>
>Regards,
>
>Patrick
>
>--------------------
>= Mnemonic Proposal =
>=== Abstract ===
>Mnemonic is a Java based non-volatile memory library for in-place
>structured data processing and computing. It is a solution for generic
>object and block persistence on heterogeneous block and
>byte-addressable devices, such as DRAM, persistent memory, NVMe, SSD,
>and cloud network storage.
>
>=== Proposal ===
>Mnemonic is a structured data persistence in-memory in-place library
>for Java-based applications and frameworks. It provides unified
>interfaces for data manipulation on heterogeneous
>block/byte-addressable devices, such as DRAM, persistent memory, NVMe,
>SSD, and cloud network devices.
>
>The design motivation for this project is to create a non-volatile
>programming paradigm for in-memory data object persistence, in-memory
>data objects caching, and JNI-less IPC.
>Mnemonic simplifies the usage of data object caching, persistence, and
>JNI-less IPC for massive object oriented structural datasets.
>
>Mnemonic defines Non-Volatile Java objects that store data fields in
>persistent memory and storage. During the program runtime, only
>methods and volatile fields are instantiated in Java heap,
>Non-Volatile data fields are directly accessed via GET/SET operation
>to and from persistent memory and storage. Mnemonic avoids SerDes and
>significantly reduces amount of garbage in Java heap.
>
>Major features of Mnemonic:
>* Provides an abstract level of viewpoint to utilize heterogeneous
>block/byte-addressable device as a whole (e.g., DRAM, persistent
>memory, NVMe, SSD, HD, cloud network Storage).
>
>* Provides seamless support object oriented design and programming
>without adding burden to transfer object data to different form.
>
>* Avoids the object data serialization/de-serialization for data
>retrieval, caching and storage.
>
>* Reduces the consumption of on-heap memory and in turn to reduce and
>stabilize Java Garbage Collection (GC) pauses for latency sensitive
>applications.
>
>* Overcomes current limitations of Java GC to manage much larger
>memory resources for massive dataset processing and computing.
>
>* Supports the migration data usage model from traditional NVMe/SSD/HD
>to non-volatile memory with ease.
>
>* Uses lazy loading mechanism to avoid unnecessary memory consumption
>if some data does not need to use for computing immediately.
>
>* Bypasses JNI call for the interaction between Java runtime
>application and its native code.
>
>* Provides an allocation aware auto-reclaim mechanism to prevent
>external memory resource leaking.
>
>
>=== Background ===
>Big Data and Cloud applications increasingly require both high
>throughput and low latency processing. Java-based applications
>targeting the Big Data and Cloud space should be tuned for better
>throughput, lower latency, and more predictable response time.
>Typically, there are some issues that impact BigData applications'
>performance and scalability:
>
>1) The Complexity of Data Transformation/Organization: In most cases,
>during data processing, applications use their own complicated data
>caching mechanism for SerDes data objects, spilling to different
>storage and eviction large amount of data. Some data objects contains
>complex values and structure that will make it much more difficulty
>for data organization. To load and then parse/decode its datasets from
>storage consumes high system resource and computation power.
>
>2) Lack of Caching, Burst Temporary Object Creation/Destruction Causes
>Frequent Long GC Pauses: Big Data computing/syntax generates large
>amount of temporary objects during processing, e.g. lambda, SerDes,
>copying and etc. This will trigger frequent long Java GC pause to scan
>references, to update references lists, and to copy live objects from
>one memory location to another blindly.
>
>3) The Unpredictable GC Pause: For latency sensitive applications,
>such as database, search engine, web query, real-time/streaming
>computing, require latency/request-response under control. But current
>Java GC does not provide predictable GC activities with large on-heap
>memory management.
>
>4) High JNI Invocation Cost: JNI calls are expensive, but high
>performance applications usually try to leverage native code to
>improve performance, however, JNI calls need to convert Java objects
>into something that C/C++ can understand. In addition, some
>comprehensive native code needs to communicate with Java based
>application that will cause frequently JNI call along with stack
>marshalling.
>
>Mnemonic project provides a solution to address above issues and
>performance bottlenecks for structured data processing and computing.
>It also simplifies the massive data handling with much reduced GC
>activity.
>
>=== Rationale ===
>There are strong needs for a cohesive, easy-to-use non-volatile
>programing model for unified heterogeneous memory resources management
>and allocation. Mnemonic project provides a reusable and flexible
>framework to accommodate other special type of memory/block devices
>for better performance without changing client code.
>
>Most of the BigData frameworks (e.g., Apache Spark, Apache Hadoop®,
>Apache HBase, Apache Flink, Apache Kafka, etc.) have their own
>complicated memory management modules for caching and checkpoint. Many
>approaches increase the complexity and are error-prone to maintain
>code.
>
>We have observed heavy overheads during the operations of data parse,
>SerDes, pack/unpack, code/decode for data loading, storage,
>checkpoint, caching, marshal and transferring. Mnemonic provides a
>generic in-memory persistence object model to address those overheads
>for better performance. In addition, it manages its in-memory
>persistence objects and blocks in the way that GC does, which means
>their underlying memory resource is able to be reclaimed without
>explicitly releasing it.
>
>Some existing Big Data applications suffer from poor Java GC behaviors
>when they process their massive unstructured datasets.  Those
>behaviors either cause very long stop-the-world GC pauses or take
>significant system resources during computing which impact throughput
>and incur significant perceivable pauses for interactive analytics.
>
>There are more and more computing intensive Big Data applications
>moving down to rely on JNI to offload their computing tasks to native
>code which dramatically increases the cost of JNI invocation and IPC.
>Mnemonic provides a mechanism to communicate with native code directly
>through in-place object data update to avoid complex object data type
>conversion and stack marshaling. In addition, this project can be
>extended to support various lockers for threads between Java code and
>native code.
>
>=== Initial Goals ===
>Our initial goal is to bring Mnemonic into the ASF and transit the
>engineering and governance processes to the "Apache Way."  We would
>like to enrich a collaborative development model that closely aligns
>with current and future industry memory and storage technologies.
>
>Another important goal is to encourage efforts to integrate
>non-volatile programming model into data centric processing/analytics
>frameworks/applications, (e.g., Apache Spark, Apache HBase, Apache
>Flink, Apache Hadoop®, Apache Cassandra,  etc.).
>
>We expect Mnemonic project to be continuously developing new
>functionalities in an open, community-driven way. We envision
>accelerating innovation under ASF governance in order to meet the
>requirements of a wide variety of use cases for in-memory non-volatile
>and volatile data caching programming.
>
>=== Current Status ===
>Mnemonic project is available at Intel¹s internal repository and
>managed by its designers and developers. It is also temporary hosted
>at Github for general view
>https://github.com/NonVolatileComputing/Mnemonic.git
>
>We have integrated this project for Apache Spark 1.5.0 and get 2X
>performance improvement ratio for Spark MLlib k-means workload and
>observed expected benefits of removing SerDes, reducing total GC pause
>time by 40% from our experiments.
>
>==== Meritocracy ====
>Mnemonic was originally created by Gang (Gary) Wang and Yanping Wang
>in early 2015. The initial committers are the current Mnemonic R&D
>team members from US, China, and India Big Data Technologies Group at
>Intel. This group will form a base for much broader community to
>collaborate on this code base.
>
>We intend to radically expand the initial developer and user community
>by running the project in accordance with the "Apache Way." Users and
>new contributors will be treated with respect and welcomed. By
>participating in the community and providing quality patches/support
>that move the project forward, they will earn merit. They also will be
>encouraged to provide non-code contributions (documentation, events,
>community management, etc.) and will gain merit for doing so. Those
>with a proven support and quality track record will be encouraged to
>become committers.
>
>==== Community ====
>If Mnemonic is accepted for incubation, the primary initial goal is to
>transit the core community towards embracing the Apache Way of project
>governance. We would solicit major existing contributors to become
>committers on the project from the start.
>
>==== Core Developers ====
>Mnemonic core developers are all skilled software developers and
>system performance engineers at Intel Corp with years of experiences
>in their fields. They have contributed many code to Apache projects.
>There are PMCs and experienced committers have been working with us
>from Apache Spark, Apache HBase, Apache Phoenix, Apache Hadoop®
>for this project's open source efforts.
>
>=== Alignment ===
>The initial code base is targeted to data centric processing and
>analyzing in general. Mnemonic has been building the connection and
>integration for Apache projects and other projects.
>
>We believe Mnemonic will be evolved to become a promising project for
>real-time processing, in-memory streaming analytics and more, along
>with current and future new server platforms with persistent memory as
>base storage devices.
>
>=== Known Risks ===
>==== Orphaned products ====
>Intel¹s Big Data Technologies Group is actively working with community
>on integrating this project to Big Data frameworks and applications.
>We are continuously adding new concepts and codes to this project and
>support new usage cases and features for Apache Big Data ecosystem.
>
>The project contributors are leading contributors of Hadoop-based
>technologies and have a long standing in the Hadoop community. As we
>are addressing major Big Data processing performance issues, there is
>minimal risk of this work becoming non-strategic and unsupported.
>
>Our contributors are confident that a larger community will be formed
>within the project in a relatively short period of time.
>
>==== Inexperience with Open Source ====
>This project has long standing experienced mentors and interested
>contributors from Apache Spark, Apache HBase, Apache Phoenix,
>Apache Hadoop® to help us moving through open source process. We are
>actively working with experienced Apache community PMCs and committers
>to improve our project and further testing.
>
>==== Homogeneous Developers ====
>All initial committers and interested contributors are employed at
>Intel. As an infrastructure memory project, there are wide range of
>Apache projects are interested in innovative memory project to fit
>large sized persistent memory and storage devices. Various Apache
>projects such as Apache Spark, Apache HBase, Apache Phoenix, Apache
>Flink, Apache Cassandra etc. can take good advantage of this project
>to overcome serialization/de-serialization, Java GC, and caching
>issues. We expect a wide range of interest will be generated after we
>open source this project to Apache.
>
>==== Reliance on Salaried Developers ====
>All developers are paid by their employers to contribute to this
>project. We welcome all others to contribute to this project after it
>is open sourced.
>
>==== Relationships with Other Apache Product ====
>Relationship with Apache Arrow:
>Arrow's columnar data layout allows great use of CPU caches & SIMD. It
>places all data that relevant to a column operation in a compact
>format in memory.
>
>Mnemonic directly puts the whole business object graphs on external
>heterogeneous storage media, e.g. off-heap, SSD. It is not necessary
>to normalize the structures of object graphs for caching, checkpoint
>or storing. It doesn¹t require developers to normalize their data
>object graphs. Mnemonic applications can avoid indexing & join
>datasets compared to traditional approaches.
>
>Mnemonic can leverage Arrow to transparently re-layout qualified data
>objects or create special containers that is able to efficiently hold
>those data records in columnar form as one of major performance
>optimization constructs.
>
>Mnemonic can be integrated into various Big Data and Cloud frameworks
>and applications.
>We are currently working on several Apache projects with Mnemonic:
>For Apache Spark we are integrating Mnemonic to improve:
>a) Local checkpoints
>b) Memory management for caching
>c) Persistent memory datasets input
>d) Non-Volatile RDD operations
>The best use case for Apache Spark computing is that the input data
>is stored in form of Mnemonic native storage to avoid caching its row
>data for iterative processing. Moreover, Spark applications can
>leverage Mnemonic to perform data transforming in persistent or
>non-persistent memory without SerDes.
>
>For Apache Hadoop®, we are integrating HDFS Caching with Mnemonic
>instead of mmap. This will take advantage of persistent memory related
>features. We also plan to evaluate to integrate in Namenode Editlog,
>FSImage persistent data into Mnemonic persistent memory area.
>
>For Apache HBase, we are using Mnemonic for BucketCache and
>evaluating performance improvements.
>
>We expect Mnemonic will be further developed and integrated into many
>Apache BigData projects and so on, to enhance memory management
>solutions for much improved performance and reliability.
>
>==== An Excessive Fascination with the Apache Brand ====
>While we expect Apache brand helps to attract more contributors, our
>interests in starting this project is based on the factors mentioned
>in the Rationale section.
>
>We would like Mnemonic to become an Apache project to further foster a
>healthy community of contributors and consumers in BigData technology
>R&D areas. Since Mnemonic can directly benefit many Apache projects
>and solves major performance problems, we expect the Apache Software
>Foundation to increase interaction with the larger community as well.
>
>=== Documentation ===
>The documentation is currently available at Intel and will be posted
>under: https://mnemonic.incubator.apache.org/docs
>
>=== Initial Source ===
>Initial source code is temporary hosted Github for general viewing:
>https://github.com/NonVolatileComputing/Mnemonic.git
>It will be moved to Apache http://git.apache.org/ after podling.
>
>The initial Source is written in Java code (88%) and mixed with JNI C
>code (11%) and shell script (1%) for underlying native allocation
>libraries.
>
>=== Source and Intellectual Property Submission Plan ===
>As soon as Mnemonic is approved to join the Incubator, the source code
>will be transitioned via the Software Grant Agreement onto ASF
>infrastructure and in turn made available under the Apache License,
>version 2.0.
>
>=== External Dependencies ===
>The required external dependencies are all Apache licenses or other
>compatible Licenses
>Note: The runtime dependent licenses of Mnemonic are all declared as
>Apache 2.0, the GNU licensed components are used for Mnemonic build
>and deployment. The Mnemonic JNI libraries are built using the GNU
>tools.
>
>maven and its plugins (http://maven.apache.org/ ) [Apache 2.0]
>JDK8 or OpenJDK 8 (http://java.com/) [Oracle or Openjdk JDK License]
>Nvml (http://pmem.io ) [optional] [Open Source]
>PMalloc (https://github.com/bigdata-memory/pmalloc ) [optional] [Apache
>2.0]
>
>Build and test dependencies:
>org.testng.testng v6.8.17  (http://testng.org) [Apache 2.0]
>org.flowcomputing.commons.commons-resgc v0.8.7 [Apache 2.0]
>org.flowcomputing.commons.commons-primitives v.0.6.0 [Apache 2.0]
>com.squareup.javapoet v1.3.1-SNAPSHOT [Apache 2.0]
>JDK8 or OpenJDK 8 (http://java.com/) [Oracle or Openjdk JDK License]
>
>=== Cryptography ===
>Project Mnemonic does not use cryptography itself, however, Hadoop
>projects use standard APIs and tools for SSH and SSL communication
>where necessary.
>
>=== Required Resources ===
>We request that following resources be created for the project to use
>
>==== Mailing lists ====
>private@mnemonic.incubator.apache.org (moderated subscriptions)
>commits@mnemonic.incubator.apache.org
>dev@mnemonic.incubator.apache.org
>
>==== Git repository ====
>https://github.com/apache/incubator-mnemonic
>
>==== Documentation ====
>https://mnemonic.incubator.apache.org/docs/
>
>==== JIRA instance ====
>https://issues.apache.org/jira/browse/mnemonic
>
>=== Initial Committers ===
>* Gang (Gary) Wang (gang1 dot wang at intel dot com)
>
>* Yanping Wang (yanping dot wang at intel dot com)
>
>* Uma Maheswara Rao G (umamahesh at apache dot org)
>
>* Kai Zheng (drankye at apache dot org)
>
>* Rakesh Radhakrishnan Potty  (rakeshr at apache dot org)
>
>* Sean Zhong  (seanzhong at apache dot org)
>
>* Henry Saputra  (hsaputra at apache dot org)
>
>* Hao Cheng (hao dot cheng at intel dot com)
>
>=== Additional Interested Contributors ===
>* Debo Dutta (dedutta at cisco dot com)
>
>* Liang Chen (chenliang613 at Huawei dot com)
>
>=== Affiliations ===
>* Gang (Gary) Wang, Intel
>
>* Yanping Wang, Intel
>
>* Uma Maheswara Rao G, Intel
>
>* Kai Zheng, Intel
>
>* Rakesh Radhakrishnan Potty, Intel
>
>* Sean Zhong, Intel
>
>* Henry Saputra, Independent
>
>* Hao Cheng, Intel
>
>=== Sponsors ===
>==== Champion ====
>Patrick Hunt
>
>==== Nominated Mentors ====
>* Patrick Hunt <phunt at apache dot org> - Apache IPMC member
>
>* Andrew Purtell <apurtell at apache dot org > - Apache IPMC member
>
>* James Taylor <jamestaylor at apache dot org> - Apache IPMC member
>
>* Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>
>==== Sponsoring Entity ====
>Apache Incubator PMC
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>For additional commands, e-mail: general-help@incubator.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org