Posted to dev@tinkerpop.apache.org by okram <gi...@git.apache.org> on 2016/01/07 00:14:18 UTC

[GitHub] incubator-tinkerpop pull request: TINKERPOP-1033: Store sideEffect...

GitHub user okram opened a pull request:

    https://github.com/apache/incubator-tinkerpop/pull/192

    TINKERPOP-1033: Store sideEffects as a persisted RDD

    https://issues.apache.org/jira/browse/TINKERPOP-1033
    
    This is a massive amount of work. Just making sideEffects be stored as persisted RDDs led to a swath of other updates. Here is the list of things:
    
    * It is now possible for Spark users to completely avoid using HDFS -- they simply use `PersistedInputRDD` and `PersistedOutputRDD` for everything.
    * Added a significant amount of testing to ensure that persisted RDDs work as expected in all situations.
    * `InputRDD`s now have a `readMemoryRDD()` method which handles reading sideEffects (i.e. memory).
    * `OutputRDD`s now have a `writeMemoryRDD()` method which handles writing sideEffects (i.e. memory).
    * There is a `Storage` interface in gremlin-core which providers can implement to have "file-system semantics" for their data source. HDFS and Spark both implement it. No more Groovy meta-programming for HDFS! Sweeeeet.
    * With `Storage` all the file management in both Spark and Giraph is much simpler as the methods in `Storage` allowed me to gut a lot of (error-prone) code.
    * There is a general test suite which makes sure both HDFS and Spark storage behave "the same."
    * Updated documentation, upgrade docs, and added JavaDoc to `Storage`.
    * The docs for `BulkLoaderVertexProgram` and Spark/Giraph used a `data/` directory. It wasn't consistent with our other examples so I cleaned it up.
    * Fixed a minor bug in `ClusterCountMapReduce`.
    * Cleaned up how HDFS data is streamed -- it's pure now, based solely on `InputFormat` behavior (I learned something new in Hadoop).
    * There are a few minor "breaking changes" around `hdfs.methods()`. They are "ok" as HDFS interaction prior to this moment has always been manual via the Gremlin Console.
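
To make the "file-system semantics" idea above concrete, here is a minimal sketch. It is not the real `Storage` interface from gremlin-core: `MiniStorage`, `InMemoryStorage`, and the `put()` helper are hypothetical names invented for illustration, and only `ls()`, `rm()`, and `rmr()` echo method names mentioned in this thread. The point is simply that callers talk to one interface and the backing store (HDFS, persisted RDDs, or here a plain in-memory set) is swappable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical demo entry point; not part of TinkerPop.
class StorageSketch {
    public static void main(String[] args) {
        MiniStorage storage = new InMemoryStorage();
        storage.put("output/~g");
        storage.put("output/clusterCount");
        storage.put("other/~g");
        System.out.println(storage.ls("output"));
        storage.rmr("output");
        System.out.println(storage.exists("output/~g"));
    }
}

// Hypothetical, trimmed-down stand-in for a storage abstraction.
interface MiniStorage {
    void put(String location);          // register a location (illustration only)
    List<String> ls(String directory);  // list locations under a directory
    boolean exists(String location);    // true if the exact location is present
    boolean rm(String location);        // remove one exact location
    boolean rmr(String directory);      // remove a directory and everything under it
}

// Assumed in-memory backing store; a real provider would wrap HDFS or a SparkContext.
class InMemoryStorage implements MiniStorage {
    private final TreeSet<String> locations = new TreeSet<>();

    private static String asPrefix(String directory) {
        return directory.endsWith("/") ? directory : directory + "/";
    }

    public void put(String location) {
        locations.add(location);
    }

    public List<String> ls(String directory) {
        String prefix = asPrefix(directory);
        List<String> result = new ArrayList<>();
        for (String location : locations) {
            if (location.startsWith(prefix)) result.add(location);
        }
        return result;
    }

    public boolean exists(String location) {
        return locations.contains(location);
    }

    public boolean rm(String location) {
        return locations.remove(location);
    }

    public boolean rmr(String directory) {
        String prefix = asPrefix(directory);
        boolean removedSelf = locations.remove(directory);
        boolean removedChildren = locations.removeIf(l -> l.startsWith(prefix));
        return removedSelf || removedChildren;
    }
}
```

Code written against `MiniStorage` never needs to know which implementation it was handed, which is the property that makes the old Groovy meta-programming unnecessary.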
    
    I updated the "upgrade" docs:
    
    http://tinkerpop.apache.org/docs/3.1.1-SNAPSHOT/upgrade/#_storage_i_o
    
    I updated the "reference" docs:
    
    http://tinkerpop.apache.org/docs/3.1.1-SNAPSHOT/reference/#_storage_systems
    
    You can see the JavaDoc for the new `Storage` interface:
    
    http://tinkerpop.apache.org/javadocs/3.1.1-SNAPSHOT/core/org/apache/tinkerpop/gremlin/structure/io/Storage.html
    
    I ran integration tests and built and deployed docs successfully. 
      
    VOTE +1.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP-1033

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-tinkerpop/pull/192.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #192
    
----
commit f3ebed0bde6ac889640cb136b50b362c5cd2d2ea
Author: Marko A. Rodriguez <ok...@gmail.com>
Date:   2015-12-09T17:41:09Z

    InputRDD now has readMemoryRDD(). OutputRDD now has writeMemoryRDD(). InputFormatRDD and OutputFormatRDD took the code from SparkExecutor that uses SequenceFiles for output. As such, memory reading/writing has been generalized. Graph system providers that ONLY want to provide Spark support are not required to have HDFS as SparkServer can maintain all persisted data via graphRDD and memoryRDD. There is still more work to do. More test cases are next.

commit 58d9240764cd6e1f3779097966c53058264e00e6
Author: Marko A. Rodriguez <ok...@gmail.com>
Date:   2015-12-09T20:46:43Z

    added Storage to gremlin-core. Storage is an interface that OLAP systems can implement. It provides ls(), rmr(), rm(), etc. type methods that make it easy for users to interact (via a common interface) with the underlying persistence system. Now both HDFS and Spark provide their own Storage implementations and TADA. Really pretty.

commit 2c0d327c04219de9fdf20444a100d3cb3dd1d221
Author: Marko A. Rodriguez <ok...@gmail.com>
Date:   2015-12-09T20:48:49Z

    merged master and merged conflicts from @spmallette's changes to SparkGremlinPlugin and HadoopGremlinPlugin.

commit b4d8e9608d4eca3ae177b28fe588518a9d77506c
Author: Marko A. Rodriguez <ok...@gmail.com>
Date:   2015-12-09T22:58:50Z

    Greatly greatly simplified Hadoop OLTP and interactions with HDFS and SparkContext. The trend -- dir/~g for graphs and dir/x for memory. A consistent persistence schema makes everything so much simpler. I always assumed this would be all generalized/blah/blah. Never actually did it so, hell, stick with a consistent schema and watch the code just fall away.
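
A rough sketch of the "consistent persistence schema" described in this commit message, assuming that `dir/~g` holds the graph and that `x` in `dir/x` stands for a memory (sideEffect) key. The class and method names below are hypothetical and are not taken from the TinkerPop code base; only the `~g` suffix comes from the message itself.

```java
// Hypothetical helper illustrating the dir/~g and dir/x layout.
class PersistenceSchemaSketch {
    // Suffix for the graph location, as named in the commit message.
    static final String HIDDEN_G = "~g";

    // Strip a single trailing slash so "output" and "output/" map to the same place.
    static String normalize(String dir) {
        return dir.endsWith("/") ? dir.substring(0, dir.length() - 1) : dir;
    }

    // Where the graph for a given output directory would live.
    static String graphLocation(String dir) {
        return normalize(dir) + "/" + HIDDEN_G;
    }

    // Where one memory entry would live; memoryKey is assumed to be the sideEffect key.
    static String memoryLocation(String dir, String memoryKey) {
        return normalize(dir) + "/" + memoryKey;
    }

    public static void main(String[] args) {
        System.out.println(graphLocation("output"));
        System.out.println(memoryLocation("output", "clusterCount"));
    }
}
```

With every location derived from the directory and a key, readers and writers never have to agree on anything beyond the directory name.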

commit 3fff8f546501d10a4c1d34762a626a2493e758be
Author: Marko A. Rodriguez <ok...@gmail.com>
Date:   2015-12-09T23:57:28Z

    lots more clean up, tests, and organization. She is a real beauty.

commit 74b9c8ecfe787ead99d79c127fd85a4fccd926ec
Author: Marko A. Rodriguez <ok...@gmail.com>
Date:   2015-12-10T01:27:29Z

    migrated GiraphGraphComputer over to the new Storage model via FileSystemStorage for HDFS.

commit 55165a572f5d07e1ca20be13b064843da18fc8e6
Author: Marko A. Rodriguez <ok...@gmail.com>
Date:   2015-12-10T02:11:33Z

    cleanup HDFS if Persist.NOTHING.

commit dbd4a5360a75d562df64eecd91cc8c12550adb10
Author: Marko A. Rodriguez <ok...@gmail.com>
Date:   2016-01-05T22:54:14Z

    merged master into branch. Minor tweaks given @spmallette's new work on TestDirectory stuffs.

commit 53e57a73aa5316b44d5ef4917347a6ba8934a102
Author: Marko A. Rodriguez <ok...@gmail.com>
Date:   2016-01-06T15:02:33Z

    breaking commit. ignore.

commit b0f3e4a96ced7f45f5e823b9060eac9dd0be1f7e
Author: Marko A. Rodriguez <ok...@gmail.com>
Date:   2016-01-06T17:26:46Z

    Storage is complete and has a really cool TestSuite. There are two types of Storage: FileSystemStorage (HDFS) and SparkContextStorage (persisted RDDs). You can ls(), cp(), rm(), rmr(), head(), etc. There is a single abstract test suite called AbstractStorageCheck that confirms that both Spark and HDFS behave the same. Moved around and organized Hadoop test cases given the new developments.
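
The "one abstract suite, two implementations" pattern mentioned here can be sketched as follows. This is not `AbstractStorageCheck` itself: `TinyStorage`, `HashBackedStorage`, `SortedBackedStorage`, and `checkRemoveSemantics()` are hypothetical names, and the two toy backends merely stand in for HDFS and persisted RDDs. The check is written once against the interface and then run against each backend.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class SharedCheckSketch {
    // One behavioral check, written once against the interface.
    static void checkRemoveSemantics(TinyStorage storage) {
        storage.put("a/x");
        if (!storage.exists("a/x")) throw new AssertionError("put should make the location visible");
        if (!storage.rm("a/x")) throw new AssertionError("rm of an existing location should return true");
        if (storage.exists("a/x")) throw new AssertionError("location should be gone after rm");
        if (storage.rm("a/x")) throw new AssertionError("rm of a missing location should return false");
    }

    public static void main(String[] args) {
        // The same check runs against every implementation.
        checkRemoveSemantics(new HashBackedStorage());
        checkRemoveSemantics(new SortedBackedStorage());
        System.out.println("both implementations behave the same");
    }
}

// Hypothetical minimal contract shared by both backends.
interface TinyStorage {
    void put(String location);
    boolean exists(String location);
    boolean rm(String location);
}

// Toy backend one: unordered set.
class HashBackedStorage implements TinyStorage {
    private final Set<String> locations = new HashSet<>();
    public void put(String location) { locations.add(location); }
    public boolean exists(String location) { return locations.contains(location); }
    public boolean rm(String location) { return locations.remove(location); }
}

// Toy backend two: sorted set.
class SortedBackedStorage implements TinyStorage {
    private final Set<String> locations = new TreeSet<>();
    public void put(String location) { locations.add(location); }
    public boolean exists(String location) { return locations.contains(location); }
    public boolean rm(String location) { return locations.remove(location); }
}
```

Any new backend only has to pass the existing checks to be considered equivalent, so behavioral drift between providers shows up as a test failure rather than as a user-facing surprise.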

commit 5c9e81b0cebd8c3841e2442a8ef13b3d23d44295
Author: Marko A. Rodriguez <ok...@gmail.com>
Date:   2016-01-06T22:58:18Z

    added documentation, upgrade docs, JavaDoc, more test cases, and fixed up some random inconsistencies in BulkLoaderVertexProgram documentation examples.

commit a7db52bda732810fc8d5d3a8279a4f7095285d3d
Author: Marko A. Rodriguez <ok...@gmail.com>
Date:   2016-01-06T23:03:59Z

    Merge branch 'master' into TINKERPOP-1033

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

Posted by spmallette <gi...@git.apache.org>.
Github user spmallette commented on the pull request:

    https://github.com/apache/incubator-tinkerpop/pull/192#issuecomment-169827767
  
    This PR drops a few "internal" classes - folks shouldn't have been using those directly, but would it have been better to deprecate those as opposed to just removing completely?  
    
    Seems like deprecation would have worked for:
    
    * [HadoopLoader](https://github.com/apache/incubator-tinkerpop/pull/192/files#diff-55e3610726b342e666b34223b8270526)
    * [HDFSTools](https://github.com/apache/incubator-tinkerpop/pull/192/files#diff-88ec5bbe9a2817117d62799b9e91a20a)
    * [SparkLoader](https://github.com/apache/incubator-tinkerpop/pull/192/files#diff-0c5aa03b09c929fa6440edb9bacc7e76)
    
    Even though these are pretty low-level classes, it would be nice to stick to the "not breaking (even if they are doing stuff they are not supposed to)" plan imo.
    




Posted by spmallette <gi...@git.apache.org>.
Github user spmallette commented on the pull request:

    https://github.com/apache/incubator-tinkerpop/pull/192#issuecomment-169840450
  
    Yeah - they shouldn't be using them, but we know how that goes....make a class public and someone is gonna use it to their detriment or otherwise.  anyway, i'd say just do one of the following then:
    
    1. bring them back and deprecate
    2. add something to upgrade docs to explain their removal - you already have the section - perhaps just a WARNING that explicitly mentions the classes.
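
For option 1, a deprecation shim usually amounts to keeping the old class around as a thin delegate. A minimal sketch follows, assuming made-up names (`OldPathTools`, `NewPathOps`) rather than the actual `HDFSTools` or `FileSystemStorage` signatures, which are not shown in this thread.

```java
import java.util.HashSet;
import java.util.Set;

class DeprecationSketch {
    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        Set<String> paths = new HashSet<>();
        paths.add("output/~g");
        // Old call sites keep compiling (with a deprecation warning) and keep working.
        System.out.println(OldPathTools.exists(paths, "output/~g"));
    }
}

// Hypothetical replacement API.
class NewPathOps {
    static boolean exists(Set<String> paths, String location) {
        return paths.contains(location);
    }
}

/**
 * Hypothetical legacy utility kept only for compatibility.
 *
 * @deprecated use {@link NewPathOps} instead; slated for removal in a later release.
 */
@Deprecated
class OldPathTools {
    // Delegates to the new API so there is a single implementation to maintain.
    static boolean exists(Set<String> paths, String location) {
        return NewPathOps.exists(paths, location);
    }
}
```

The trade-off is that the shim must not interfere with the new code path, which is the concern raised about keeping `HadoopLoader` loaded.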



Posted by dkuppitz <gi...@git.apache.org>.
Github user dkuppitz commented on the pull request:

    https://github.com/apache/incubator-tinkerpop/pull/192#issuecomment-169694891
  
    * `mvn clean install`: passed
    * integration tests: passed
    * `bin/process-docs.sh`: passed
    
    VOTE: +1



Posted by okram <gi...@git.apache.org>.
Github user okram commented on the pull request:

    https://github.com/apache/incubator-tinkerpop/pull/192#issuecomment-169528513
  
    Integration test `BUILD SUCCESSFUL`.
    
    ```
    [INFO] Apache TinkerPop .................................. SUCCESS [4.800s]
    [INFO] Apache TinkerPop :: Gremlin Shaded ................ SUCCESS [2.300s]
    [INFO] Apache TinkerPop :: Gremlin Core .................. SUCCESS [34.224s]
    [INFO] Apache TinkerPop :: Gremlin Test .................. SUCCESS [11.772s]
    [INFO] Apache TinkerPop :: Gremlin Groovy ................ SUCCESS [32.672s]
    [INFO] Apache TinkerPop :: Gremlin Groovy Test ........... SUCCESS [6.828s]
    [INFO] Apache TinkerPop :: TinkerGraph Gremlin ........... SUCCESS [3:22.133s]
    [INFO] Apache TinkerPop :: Hadoop Gremlin ................ SUCCESS [5:02.151s]
    [INFO] Apache TinkerPop :: Spark Gremlin ................. SUCCESS [4:03.723s]
    [INFO] Apache TinkerPop :: Giraph Gremlin ................ SUCCESS [2:01:35.469s]
    [INFO] Apache TinkerPop :: Neo4j Gremlin ................. SUCCESS [18:06.653s]
    [INFO] Apache TinkerPop :: Gremlin Driver ................ SUCCESS [10.940s]
    [INFO] Apache TinkerPop :: Gremlin Server ................ SUCCESS [11:13.160s]
    [INFO] Apache TinkerPop :: Gremlin Console ............... SUCCESS [1:10.880s]
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 2:46:18.192s
    [INFO] Finished at: Wed Jan 06 19:01:26 MST 2016
    [INFO] Final Memory: 103M/708M
    ```



Posted by okram <gi...@git.apache.org>.
Github user okram commented on the pull request:

    https://github.com/apache/incubator-tinkerpop/pull/192#issuecomment-169832850
  
    Here is the thing. `SparkLoader` was introduced in 3.1.1-SNAPSHOT :D so it is okay to drop. `HadoopLoader` is all meta-programming Groovy stuff to get ls(), rm(), etc. to work in Gremlin Console. We can keep the class, but we can't have it loaded else it will interfere with the new `FileSystemStorage`. However, I say we just drop it. It's so low level and all meta-programmy that nobody should be relying on it.
    
    Finally, `HDFSTools`. Again, low level.... I can bring that class back, but people really shouldn't be using it. This is like an internal utility and so specific to TinkerPop filesystem stuff. ??



Posted by spmallette <gi...@git.apache.org>.
Github user spmallette commented on the pull request:

    https://github.com/apache/incubator-tinkerpop/pull/192#issuecomment-170012024
  
    VOTE: +1



Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-tinkerpop/pull/192

