Posted to dev@giraph.apache.org by "Avery Ching (Created) (JIRA)" <ji...@apache.org> on 2011/11/28 19:59:40 UTC

[jira] [Created] (GIRAPH-100) Data input sampling and testing improvements

Data input sampling and testing improvements
--------------------------------------------

                 Key: GIRAPH-100
                 URL: https://issues.apache.org/jira/browse/GIRAPH-100
             Project: Giraph
          Issue Type: New Feature
          Components: graph
            Reporter: Avery Ching


It would be really nice to help debug an application by limiting the input data (% of input splits, max vertices per input split).  Also, it would be nice for the workers to provide a little more debugging info on how far along they are with processing the input data.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-100) Data input sampling and testing improvements

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161219#comment-13161219 ] 

Avery Ching commented on GIRAPH-100:
------------------------------------

Sorry Jakob, I'll try to stop doing formatting changes.  Habit, I suppose.  In the future, I'll file separate issues for formatting cleanup.

    What's the point of the changes in TextVertexInputFormat method visibility? Are they related to this patch?

No, I can remove it.  Just a bit safer I guess since they should be protected.

    We're throwing a lot of Stringly typed exceptions. For more robust error handling and recovery, it may be good to strongly type these instead.

Which exceptions are you referring to?

    re: SuperstepHashPartitionerFactory. Moving it out of test and into the example directory seems a bit counterproductive to me. It's a pathological implementation; wouldn't it be better to provide a more useful example, rather than one that's explicitly not meant to be used?

Until we start jarring things up separately, the Hadoop unit test breaks when SuperstepHashPartitionerFactory is not found.  The right solution might be to create another jar that has the unit test classes and can be run as part of the Hadoop instance unit test.  Can we do that in another issue?  I agree that it isn't a good example, but it's still a good test since it guarantees partition movement between workers.

                

        

[jira] [Commented] (GIRAPH-100) Data input sampling and testing improvements

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161243#comment-13161243 ] 

Avery Ching commented on GIRAPH-100:
------------------------------------

Ah, I see.  We should file another JIRA to create a GiraphException and the various types I suppose.  Or do you want me to do it in this JIRA?

I can put the SuperstepHashPartitionerFactory into another directory 
src/main/java/org/apache/giraph/integration/SuperstepHashPartitionerFactory.java

I like the idea of mocking in general, but not sure how mocking can verify the behavior in this case.  Probably leave it as an integration test only.  IMO, we should file a separate JIRA for separating the tests into unittests (mocking, individual class tests) and integration tests, but integration tests can still be run locally and/or remote.

Let me know what you think and I'll make the requested changes.
                

        

[jira] [Commented] (GIRAPH-100) Data input sampling and testing improvements

Posted by "Claudio Martella (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158682#comment-13158682 ] 

Claudio Martella commented on GIRAPH-100:
-----------------------------------------

biggest news is we reached 100! :)
                

        

[jira] [Commented] (GIRAPH-100) Data input sampling and testing improvements

Posted by "Jakob Homan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161264#comment-13161264 ] 

Jakob Homan commented on GIRAPH-100:
------------------------------------

bq. In the future, I'll file separate issues for formatting cleanup.
Great. This also gives us a steady supply of newbie JIRAs, since the latest batch is almost used up.

bq.We should file another JIRA to create a GiraphException and the various types I suppose. Or do you want me to do it in this JIRA?
Either in this JIRA, or put the current ones in with FIXME/TODO annotations so we can go back and fix them easily, and then immediately open a new JIRA.

bq. but not sure how mocking can verify the behavior in this case.
It may end up being a challenge, but it's a strong guard against building up a huge number of integration tests, calling them unit tests and then having tests that run for four, six or nine hours (see: every other Hadoop ecosystem project).  Being able to swap out the backing dependency from a mock to a real Hadoop cluster is a great way to test quickly (i.e., often) as well as test reality (i.e., against a real cluster).  I'll take a look at making sure we have infrastructure that is amenable to this.
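
One way to picture the swappable backing dependency described above is a small interface that the test logic is written against once.  This is only a sketch: GraphJobRunner, InMemoryJobRunner and OutEdgeCheck are hypothetical names invented for illustration, not Giraph classes, and the cluster-backed implementation is deliberately left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical seam between the test logic and whatever executes the job. */
interface GraphJobRunner {
    /** Counts out-edges per source vertex; each edge is a {source, target} pair. */
    Map<String, Integer> countOutEdges(String[][] edges);
}

/** Fast in-process stand-in, suitable for unit tests that run on every build. */
final class InMemoryJobRunner implements GraphJobRunner {
    @Override
    public Map<String, Integer> countOutEdges(String[][] edges) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] edge : edges) {
            counts.merge(edge[0], 1, Integer::sum);
        }
        return counts;
    }
}

/** Test logic written once against the interface, not against a concrete backend. */
final class OutEdgeCheck {
    private OutEdgeCheck() { }

    static boolean verify(GraphJobRunner runner) {
        String[][] edges = {{"a", "b"}, {"a", "c"}, {"b", "c"}};
        Map<String, Integer> result = runner.countOutEdges(edges);
        // "c" has no out-edges, so it should not appear in the result at all.
        return Integer.valueOf(2).equals(result.get("a"))
            && Integer.valueOf(1).equals(result.get("b"))
            && !result.containsKey("c");
    }
}
```

A second implementation of the same interface that submits the job to a real cluster could then be passed to OutEdgeCheck.verify unchanged, which is the point of the seam.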

bq. we should file a separate JIRA for separating the tests into unittests (mocking, individual class tests) and integration tests, but integration tests can still be run locally and/or remote.
Can we go ahead and create test/integration as part of this JIRA and put SuperstepHashPartitionerFactory there? That way it doesn't go into the inappropriate examples directory but can still be bundled as part of the jar.  The remaining partitioning can probably be done as part of GIRAPH-22.

                

        

[jira] [Commented] (GIRAPH-100) Data input sampling and testing improvements

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161795#comment-13161795 ] 

Hudson commented on GIRAPH-100:
-------------------------------

Integrated in Giraph-trunk-Commit #45 (See [https://builds.apache.org/job/Giraph-trunk-Commit/45/])
    GIRAPH-100: GIRAPH-100 - Data input sampling and testing
improvements. (aching)

aching : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209640
Files : 
* /incubator/giraph/trunk/CHANGELOG
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceMaster.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/BasicRPCCommunications.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspService.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashMasterPartitioner.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/integration
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/integration/SuperstepHashPartitionerFactory.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/IdWithValueTextOutputFormat.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/TextVertexInputFormat.java
* /incubator/giraph/trunk/src/test/java/org/apache/giraph/TestGraphPartitioner.java

                

        

[jira] [Commented] (GIRAPH-100) Data input sampling and testing improvements

Posted by "Jakob Homan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161213#comment-13161213 ] 

Jakob Homan commented on GIRAPH-100:
------------------------------------

Please avoid formatting changes as part of code change patches.  They blow up the size of the patch and introduce a lot of "what's the difference between these lines? Did anything change that needs to be reviewed?"  For instance, most of the changes in SimpleCheckpointVertex appear to be spurious.

* What's the point of the changes in TextVertexInputFormat method visibility? Are they related to this patch?
* We're throwing a lot of Stringly typed exceptions. For more robust error handling and recovery, it may be good to strongly type these instead.
* re: SuperstepHashPartitionerFactory. Moving it out of test and into the example directory seems a bit counterproductive to me.  It's a pathological implementation; wouldn't it be better to provide a more useful example, rather than one that's explicitly not meant to be used?




                

        

[jira] [Commented] (GIRAPH-100) Data input sampling and testing improvements

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159175#comment-13159175 ] 

jiraposter@reviews.apache.org commented on GIRAPH-100:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2959/
-----------------------------------------------------------

Review request for giraph.


Summary
-------

Got rid of ZooKeeper message for node created on the input split reservation.

Adding some features for debugging:
- Taking only a % of the input splits
- Taking a maximum number of vertices in an input split
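
A minimal sketch of the two debugging limits listed above.  The class and method names (InputSampler, splitsToRead, readVertices) and the "keep at least one split" rounding rule are assumptions made for illustration; they are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper; names and rounding policy are assumptions. */
final class InputSampler {
    private InputSampler() { }

    /** Number of splits to read when only percent (0-100) of them should be used. */
    static int splitsToRead(int totalSplits, double percent) {
        if (percent < 0.0 || percent > 100.0) {
            throw new IllegalArgumentException("percent must be in [0, 100]: " + percent);
        }
        if (totalSplits <= 0) {
            return 0;
        }
        int count = (int) Math.floor(totalSplits * percent / 100.0);
        // Assumption: keep at least one split so a tiny percentage still yields data.
        return Math.max(1, count);
    }

    /** Reads at most maxVertices items from one split; a negative limit means no limit. */
    static <V> List<V> readVertices(Iterator<V> splitReader, long maxVertices) {
        List<V> vertices = new ArrayList<>();
        while (splitReader.hasNext()
            && (maxVertices < 0 || vertices.size() < maxVertices)) {
            vertices.add(splitReader.next());
        }
        return vertices;
    }
}
```

With ten input splits and a 50% setting, splitsToRead(10, 50.0) selects five splits, and each of those can be capped further by passing a vertex limit to readVertices.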

Added a master status update for the number of workers that have responded.

Workers will output some information about the % of input splits that have been completed.

Fixed a bug where a forced flush of cached vertices in the input split was happening per input split rather than at the end of processing all input splits.  This requires an additional barrier after processing all the input splits to allow for the final flush of the cached vertices.
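
The flush fix can be pictured roughly as follows.  This is a single-process sketch with hypothetical names (VertexSendCache, SplitLoader); it does not model the additional barrier between workers, only the difference between forcing a flush after every split and forcing one after the last split.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical buffer of vertices waiting to be sent to their partition owners. */
final class VertexSendCache {
    private final int maxCached;
    private final List<String> cached = new ArrayList<>();
    private int flushCount = 0;

    VertexSendCache(int maxCached) {
        this.maxCached = maxCached;
    }

    /** Buffers a vertex and flushes only when the buffer is full. */
    void add(String vertex) {
        cached.add(vertex);
        if (cached.size() >= maxCached) {
            flush();
        }
    }

    /** Sends whatever is buffered; a real worker would issue the RPCs here. */
    void flush() {
        if (cached.isEmpty()) {
            return;
        }
        cached.clear();
        flushCount++;
    }

    int getFlushCount() {
        return flushCount;
    }

    int size() {
        return cached.size();
    }
}

final class SplitLoader {
    private SplitLoader() { }

    /** Loads every split, forcing a flush once at the very end instead of per split. */
    static int loadAll(List<List<String>> splits, VertexSendCache cache) {
        for (List<String> split : splits) {
            for (String vertex : split) {
                cache.add(vertex);
            }
            // No forced flush here: partially filled buffers carry over to the next split.
        }
        cache.flush();
        return cache.getFlushCount();
    }
}
```

With three splits of two vertices each and a buffer of four, this performs two flushes (one when the buffer fills, one at the end), where a forced flush per split would have performed three.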

Factored out barrierOnWorkerList to reuse the master's barrier coordination code.

Factored out markInputSplitPathFinished to make the code a bit cleaner.

Clearing out the transientInMessages and inMessages maps to reduce processing time.

Changed the default partition count multiplier to produce n^2 partitions rather than 0.5xn^2 for better balancing when the maximum limit is not exceeded.
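
The arithmetic behind this change, as a sketch.  It assumes that n is the number of workers and that the result is clamped to a configured maximum; neither the helper name nor the exact clamping is taken from the patch.

```java
/** Hypothetical helper; n is assumed to be the number of workers. */
final class PartitionCountSketch {
    private PartitionCountSketch() { }

    /** Returns multiplier * workers^2, clamped to the range [1, maxPartitions]. */
    static int partitionCount(int workers, double multiplier, int maxPartitions) {
        if (workers < 1 || maxPartitions < 1) {
            throw new IllegalArgumentException("workers and maxPartitions must be positive");
        }
        long wanted = (long) (multiplier * workers * workers);
        return (int) Math.max(1L, Math.min(wanted, (long) maxPartitions));
    }
}
```

For 10 workers, a multiplier of 0.5 gives 50 partitions (5 per worker) and a multiplier of 1.0 gives 100 (10 per worker).  Once the configured maximum is lower than n^2, both settings are clamped to the same value and the change has no effect.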

Changed SimpleCheckpointVertex to throw an Exception instead of System.exit(-1) for a faster failure (seconds instead of minutes).
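
As a rough illustration of that change (not the actual SimpleCheckpointVertex code), a simulated fault can be raised as an exception so that it propagates to the caller instead of terminating the JVM.  FaultInjector and its parameters are hypothetical.

```java
/** Hypothetical fault injection used by a checkpoint/restart test. */
final class FaultInjector {
    private FaultInjector() { }

    /** Throws at the chosen superstep when fault injection is enabled. */
    static void maybeFail(long superstep, long faultSuperstep, boolean enableFault) {
        if (enableFault && superstep == faultSuperstep) {
            // Previously this kind of fault was simulated with System.exit(-1).
            throw new IllegalStateException(
                "maybeFail: simulated failure at superstep " + superstep);
        }
    }
}
```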

Moved SuperstepHashPartitionerFactory to the examples directory.  If it is not there, the test against a real Hadoop instance will fail from ClassNotFoundException.


This addresses bug GIRAPH-100.
    https://issues.apache.org/jira/browse/GIRAPH-100


Diffs
-----

  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/TextVertexInputFormat.java 1207804 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashMasterPartitioner.java 1207804 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/IdWithValueTextOutputFormat.java 1207804 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java 1207804 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java 1207804 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspService.java 1207804 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java 1207804 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SuperstepHashPartitionerFactory.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceMaster.java 1207804 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/BasicRPCCommunications.java 1207804 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleCheckpointVertex.java 1207804 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/test/java/org/apache/giraph/TestGraphPartitioner.java 1207804 

Diff: https://reviews.apache.org/r/2959/diff


Testing
-------

Passed local and Hadoop instance unittests.  Ran PageRankBenchmark on a real Hadoop cluster.


Thanks,

Avery


                

        

[jira] [Updated] (GIRAPH-100) Data input sampling and testing improvements

Posted by "Avery Ching (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching updated GIRAPH-100:
-------------------------------

    Attachment: GIRAPH-100.2.patch

Moved examples/SuperstepHashPartitionerFactory.java to integration/SuperstepHashPartitionerFactory.java.

Added a few context.progress() to the communication cycle to avoid task timeouts.
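
The idea behind the extra context.progress() calls, sketched without a Hadoop dependency.  ProgressReporter is a local stand-in for the Hadoop task context (only the progress() call is modeled), and MessageFlusher is a hypothetical name; the real calls live in the communication code touched by the patch.

```java
import java.util.List;

/** Local stand-in for the Hadoop task context; only progress() is modeled. */
interface ProgressReporter {
    void progress();
}

final class MessageFlusher {
    private MessageFlusher() { }

    /** Sends each batch and reports progress afterwards so a long cycle is not mistaken for a hung task. */
    static int flushAll(List<List<String>> batches, ProgressReporter context) {
        int sent = 0;
        for (List<String> batch : batches) {
            sent += batch.size(); // placeholder for the actual send
            context.progress();
        }
        return sent;
    }
}
```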
                

        

[jira] [Commented] (GIRAPH-100) Data input sampling and testing improvements

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161379#comment-13161379 ] 

jiraposter@reviews.apache.org commented on GIRAPH-100:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2959/
-----------------------------------------------------------

(Updated 2011-12-02 02:55:14.025295)


Review request for giraph.


Changes
-------

Moved examples/SuperstepHashPartitionerFactory.java to integration/SuperstepHashPartitionerFactory.java.

Added a few context.progress() to the communication cycle to avoid task timeouts.


Summary
-------

Got rid of ZooKeeper message for node created on the input split reservation.

Adding some features for debugging:
- Taking only a % of the input splits
- Taking a maximum number of vertices in an input split

Added a master status update for the number of workers that have responded.

Workers will output some information about the % of input splits that have been completed.

Fixed a bug where a forced flush of cached vertices in the input split was happening per input split rather than at the end of processing all input splits.  This requires an additional barrier after processing all the input splits to allow for the final flush of the cached vertices.

Factored out barrierOnWorkerList to reuse the master's barrier coordination code.

Factored out markInputSplitPathFinished to make the code a bit cleaner.

Clearing out the transientInMessages and inMessages maps to reduce processing time.

Changed the default partition count multiplier to produce n^2 partitions rather than 0.5xn^2 for better balancing when the maximum limit is not exceeded.

Changed SimpleCheckpointVertex to throw an Exception instead of System.exit(-1) for a faster failure (seconds instead of minutes).

Moved SuperstepHashPartitionerFactory to the examples directory.  If it is not there, the test against a real Hadoop instance will fail from ClassNotFoundException.


This addresses bug GIRAPH-100.
    https://issues.apache.org/jira/browse/GIRAPH-100


Diffs (updated)
-----

  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceMaster.java 1209336 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/BasicRPCCommunications.java 1209336 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspService.java 1209336 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java 1209336 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java 1209336 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java 1209336 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashMasterPartitioner.java 1209336 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/integration/SuperstepHashPartitionerFactory.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/IdWithValueTextOutputFormat.java 1209336 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/TextVertexInputFormat.java 1209336 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/test/java/org/apache/giraph/TestGraphPartitioner.java 1209336 

Diff: https://reviews.apache.org/r/2959/diff


Testing
-------

Passed local and Hadoop instance unittests.  Ran PageRankBenchmark on a real Hadoop cluster.


Thanks,

Avery


                

        

[jira] [Commented] (GIRAPH-100) Data input sampling and testing improvements

Posted by "Jakob Homan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161786#comment-13161786 ] 

Jakob Homan commented on GIRAPH-100:
------------------------------------

+1
                

        

[jira] [Assigned] (GIRAPH-100) Data input sampling and testing improvements

Posted by "Avery Ching (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching reassigned GIRAPH-100:
----------------------------------

    Assignee: Avery Ching
    

        

[jira] [Resolved] (GIRAPH-100) Data input sampling and testing improvements

Posted by "Avery Ching (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching resolved GIRAPH-100.
--------------------------------

    Resolution: Fixed

Thanks for the suggestions and review Jakob, much appreciated.
                

        

[jira] [Updated] (GIRAPH-100) Data input sampling and testing improvements

Posted by "Avery Ching (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching updated GIRAPH-100:
-------------------------------

    Attachment: GIRAPH-100.patch

Duplicate of reviewboard patch (https://reviews.apache.org/r/2959/diff/raw/).
                

        

[jira] [Commented] (GIRAPH-100) Data input sampling and testing improvements

Posted by "Jakob Homan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161236#comment-13161236 ] 

Jakob Homan commented on GIRAPH-100:
------------------------------------

bq. Which exceptions are you referring to?
{noformat}+                    throw new IllegalStateException(
+                        "prepareSuperstep: Impossible that this worker " +
+                        service.getWorkerInfo() + " was sent " +
+                        entry.getValue().size() + " message(s) with " +
+                        "vertex id " + entry.getKey() +
+                        " when it does not own this partition.  It should " +
+                        "have gone to partition owner " +
+                        service.getVertexPartitionOwner(entry.getKey()) +
+                        ".  The partition owners are " +
+                        service.getPartitionOwners());{noformat}
{noformat}+                            throw new IllegalStateException(
+                                "prepareSuperstep: Impossible to not remove " +
+                                vertex);{noformat}
{noformat}+                throw new IllegalStateException(
+                    "coordinateSuperstep: Worker failed during input split " +
+                    "(currently not supported)");{noformat}
{noformat}+                throw new IllegalStateException(
+                    "barrierOnWorkerList: KeeperException - " +
+                    "Couldn't get " + workerInfoHealthyPath, e);{noformat}
{noformat}+            throw new IllegalStateException(
+                "loadVertices: KeeperException on " +
+                inputSplitFinishedPath, e);{noformat}
etc. These are all specific types of exceptions being wrapped in IllegalStateException.  We'll likely want to catch and handle them later in an effort to be more robust. It'd be better if these existed as their own types, so we don't have to try to tease out the details later.
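
To illustrate the point about typed exceptions, here is a minimal sketch. All class names below (GraphJobException, MisroutedMessageException, CoordinationException) are hypothetical and not taken from the Giraph codebase or the patch; the only idea shown is that a caller can branch on the exception type and its fields instead of parsing a message string.

```java
/**
 * Sketch only: every type name here is hypothetical, chosen for
 * illustration, and not part of Giraph.
 */
public class TypedExceptionsSketch {
    /** Common base type so callers can still catch everything at once. */
    static class GraphJobException extends RuntimeException {
        GraphJobException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Messages arrived at a worker that does not own the partition. */
    static class MisroutedMessageException extends GraphJobException {
        private final String vertexId;
        private final String expectedOwner;

        MisroutedMessageException(String vertexId, String expectedOwner) {
            super("Messages for vertex " + vertexId +
                  " should have gone to " + expectedOwner, null);
            this.vertexId = vertexId;
            this.expectedOwner = expectedOwner;
        }

        String getVertexId() { return vertexId; }
        String getExpectedOwner() { return expectedOwner; }
    }

    /** A coordination-service call failed for a given path. */
    static class CoordinationException extends GraphJobException {
        private final String path;

        CoordinationException(String path, Throwable cause) {
            super("Coordination call failed on " + path, cause);
            this.path = path;
        }

        String getPath() { return path; }
    }

    /** Chooses a recovery action from the type, not from the message text. */
    static String describe(GraphJobException e) {
        if (e instanceof MisroutedMessageException) {
            return "reroute to " +
                ((MisroutedMessageException) e).getExpectedOwner();
        }
        if (e instanceof CoordinationException) {
            return "retry " + ((CoordinationException) e).getPath();
        }
        return "fail";
    }

    public static void main(String[] args) {
        try {
            throw new CoordinationException("/some/path",
                new RuntimeException("simulated failure"));
        } catch (GraphJobException e) {
            System.out.println(describe(e));
        }
    }
}
```

With types like these, a later handler could retry coordination failures while treating misrouted messages differently, without inspecting the message text.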
bq. Can we do that in another issue? I agree that it isn't a good example, but it's still a good test since it guarantees partition movement between workers.
I have trouble putting something that we agree is a bad example into the example directory. The issue is that it's not actually a unit test, since it involves Hadoop, which makes it an integration test.  The best answer is to have integration tests in their own directory (either bundled with the main jar or in a separate integration test directory).  Since this verifies important behavior, the basic test itself should run without Hadoop, via mocking, while the ability to run it as an integration test under a real Hadoop is maintained.
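
A rough sketch of what "run without Hadoop" could look like follows. It assumes the partitioning logic can be separated from the runtime behind a small interface. SuperstepSource and SuperstepShiftPartitioner are hypothetical names invented for this illustration and do not describe how SuperstepHashPartitioner actually works; a hand-written stub stands in for a mocking library.

```java
/**
 * Sketch only: the interface and partitioner below are hypothetical
 * stand-ins, not Giraph classes.
 */
public class PartitionMovementSketch {
    /** The only thing the partitioner needs from the running job. */
    interface SuperstepSource {
        long currentSuperstep();
    }

    /** Assigns a vertex to a different worker on every superstep. */
    static class SuperstepShiftPartitioner {
        private final SuperstepSource source;
        private final int workers;

        SuperstepShiftPartitioner(SuperstepSource source, int workers) {
            this.source = source;
            this.workers = workers;
        }

        int ownerOf(long vertexId) {
            // floorMod keeps the result in [0, workers) for negative ids too.
            return (int) Math.floorMod(
                vertexId + source.currentSuperstep(), (long) workers);
        }
    }

    public static void main(String[] args) {
        // Stub the superstep counter; no Hadoop job is started.
        final long[] superstep = {0L};
        SuperstepShiftPartitioner partitioner =
            new SuperstepShiftPartitioner(() -> superstep[0], 3);

        int before = partitioner.ownerOf(5L);
        superstep[0] = 1L;
        int after = partitioner.ownerOf(5L);

        System.out.println("owner moved: " + (before != after));
    }
}
```

The same check could then be repeated as an integration test against a real Hadoop, with the stub replaced by the real runtime.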
                


        

[jira] [Commented] (GIRAPH-100) Data input sampling and testing improvements

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161031#comment-13161031 ] 

Avery Ching commented on GIRAPH-100:
------------------------------------

Anyone? =)
                


        

[jira] [Commented] (GIRAPH-100) Data input sampling and testing improvements

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161357#comment-13161357 ] 

Avery Ching commented on GIRAPH-100:
------------------------------------

I added another JIRA, https://issues.apache.org/jira/browse/GIRAPH-102, to deal with getting better exceptions.  To be honest, I think we need to go through all our exceptions and fix them; a TODO isn't going to cut it.  I'll file a review to change the location of SuperstepHashPartitioner.
                
