Posted to dev@giraph.apache.org by Maja Kabiljo <ma...@fb.com> on 2013/10/31 19:43:11 UTC

Review Request 15142: GIRAPH-789: Upgrade hive-io to 0.20 - less metastore accesses

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15142/
-----------------------------------------------------------

Review request for giraph.


Bugs: GIRAPH-789
    https://issues.apache.org/jira/browse/GIRAPH-789


Repository: giraph-git


Description
-------

Currently each worker sends multiple requests to the metastore to get info about IO formats, which is unnecessary and can cause issues when the metastore is having problems.

Hive-io has changed so that it doesn't access the metastore when schema/table info is already present in the Configuration, and HiveGiraphRunner now initializes all the formats to fill in the Configuration. If HiveGiraphRunner is not used, everything will still work, but workers will still access the metastore.
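
The caching idea described above can be illustrated with a minimal, self-contained sketch. This is not the actual hive-io or Giraph code: a plain Map stands in for the Hadoop Configuration, and FakeMetastore, getSchema, and the "sketch.hive.schema." key prefix are hypothetical names invented for this illustration. It only shows the pattern: the runner resolves the schema once and stores it in the configuration, so workers that receive a copy of that configuration never need to contact the metastore.

```java
import java.util.HashMap;
import java.util.Map;

class SchemaCacheSketch {
  /** Hypothetical stand-in for a metastore client; it counts lookups. */
  static class FakeMetastore {
    int calls = 0;

    String fetchSchema(String table) {
      calls++;
      return "schema-of-" + table;
    }
  }

  /** Hypothetical configuration key prefix, not a real hive-io key. */
  static final String KEY_PREFIX = "sketch.hive.schema.";

  /**
   * Returns the schema for a table, reading it from the configuration
   * when present and falling back to the metastore only on a miss.
   * On a miss the result is stored so later readers can skip the call.
   */
  static String getSchema(Map<String, String> conf, String table,
      FakeMetastore metastore) {
    String key = KEY_PREFIX + table;
    String cached = conf.get(key);
    if (cached != null) {
      return cached;
    }
    String schema = metastore.fetchSchema(table);
    conf.put(key, schema);
    return schema;
  }

  public static void main(String[] args) {
    FakeMetastore metastore = new FakeMetastore();
    Map<String, String> conf = new HashMap<>();

    // The runner initializes every format up front, filling the configuration.
    getSchema(conf, "vertices", metastore);
    getSchema(conf, "edges", metastore);
    int runnerCalls = metastore.calls;

    // Each worker gets a copy of the configuration and finds the schema there.
    for (int worker = 0; worker < 3; worker++) {
      Map<String, String> workerConf = new HashMap<>(conf);
      getSchema(workerConf, "vertices", metastore);
      getSchema(workerConf, "edges", metastore);
    }

    System.out.println("metastore calls from runner: " + runnerCalls);
    System.out.println("metastore calls in total: " + metastore.calls);
  }
}
```

If the runner step is skipped, each worker's first lookup falls through to the metastore, which mirrors the "everything will still work" fallback described above.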


Diffs
-----

  giraph-hive/src/main/java/org/apache/giraph/hive/HiveGiraphRunner.java 6b8a8e9 
  giraph-hive/src/main/java/org/apache/giraph/hive/common/HiveUtils.java b809413 
  giraph-hive/src/main/java/org/apache/giraph/hive/input/edge/HiveEdgeInputFormat.java 534a773 
  giraph-hive/src/main/java/org/apache/giraph/hive/input/vertex/HiveVertexInputFormat.java d5c1279 
  giraph-hive/src/main/java/org/apache/giraph/hive/output/HiveVertexOutputFormat.java c4813fb 
  pom.xml f2981ff 

Diff: https://reviews.apache.org/r/15142/diff/


Testing
-------

mvn clean verify

Ran jobs with single and multiple input formats, with added logging for each metastore call in hive-io. For example, with a single vertex input, edge input, and output, each worker now makes no metastore calls instead of 8. The number of calls from the master is also reduced: we only get input partition descriptions at the beginning of the job and make no calls at the end (for output). The only call left at the end is from the cleanup task, to register the new partition. The cleanup task used to make two additional calls, which have also been removed.


Thanks,

Maja Kabiljo


Re: Review Request 15142: GIRAPH-789: Upgrade hive-io to 0.20 - less metastore accesses

Posted by Maja Kabiljo <ma...@fb.com>.

> On Oct. 31, 2013, 8:07 p.m., Avery Ching wrote:
> > +1, this is awesome work, Maja. Jobs will now fail faster when the metastore has issues, and this also cuts back on metastore accesses.  Yay!

Thanks for the quick review. I added the comments and am committing!


- Maja


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15142/#review27948
-----------------------------------------------------------




Re: Review Request 15142: GIRAPH-789: Upgrade hive-io to 0.20 - less metastore accesses

Posted by Avery Ching <av...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15142/#review27948
-----------------------------------------------------------

Ship it!


+1, this is awesome work, Maja. Jobs will now fail faster when the metastore has issues, and this also cuts back on metastore accesses.  Yay!


giraph-hive/src/main/java/org/apache/giraph/hive/HiveGiraphRunner.java
<https://reviews.apache.org/r/15142/#comment54396>

    Maybe worth adding a top-level comment for this method that says something like:
    For all Hive vertex inputs, add the user settings to the configuration.  Additionally, this checks the input specs for every input, which caches the metadata in the configuration to eliminate worker access to the metastore and to fail earlier if the metadata doesn't exist.  In the case of multiple vertex input descriptions, metadata is cached in each vertex input format description and then saved into a single Configuration via JSON.



giraph-hive/src/main/java/org/apache/giraph/hive/HiveGiraphRunner.java
<https://reviews.apache.org/r/15142/#comment54399>

    Maybe worth adding a top-level comment for this method that says something like:
    For all Hive edge inputs, add the user settings to the configuration.  Additionally, this checks the input specs for every input, which caches the metadata in the configuration to eliminate worker access to the metastore and to fail earlier if the metadata doesn't exist.  In the case of multiple edge input descriptions, metadata is cached in each edge input format description and then saved into a single Configuration via JSON.


- Avery Ching

