Posted to dev@phoenix.apache.org by "Josh Mahonin (JIRA)" <ji...@apache.org> on 2017/01/18 15:53:26 UTC

[jira] [Commented] (PHOENIX-3600) Core MapReduce classes don't provide location info

    [ https://issues.apache.org/jira/browse/PHOENIX-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828307#comment-15828307 ] 

Josh Mahonin commented on PHOENIX-3600:
---------------------------------------

The patch mostly just ports the Phoenix/MR-specific code from:

https://github.com/apache/phoenix/blob/master/phoenix-hive/src/main/java/org/apache/phoenix/hive/mapreduce/PhoenixInputFormat.java#L151-L216

https://github.com/apache/phoenix/blob/master/phoenix-hive/src/main/java/org/apache/phoenix/hive/mapreduce/PhoenixInputSplit.java

I also included a new Configuration property, "phoenix.mapreduce.split.by.stats", which does effectively the same thing as the Hive-specific "split.by.stats". In short, the MR code was generating InputSplits based only on region splits, and wasn't taking into account the additional scans that statistics collection can generate.
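
To illustrate the difference between region-only and statistics-based splitting, here is a toy sketch in plain Java. It is not the code from the patch: SplitSketch, Region, Split and plan are made-up names, keys are simplified to non-empty strings, and the guideposts (the row keys recorded by statistics collection) are assumed to be sorted in ascending order.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; every name here is hypothetical, not a Phoenix class. */
final class SplitSketch {

    /** A region covering the key range [startKey, endKey) on one host. */
    static final class Region {
        final String startKey, endKey, host;
        Region(String startKey, String endKey, String host) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.host = host;
        }
    }

    /** One unit of work: a scan range plus the host that serves it. */
    static final class Split {
        final String startKey, endKey, host;
        Split(String startKey, String endKey, String host) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.host = host;
        }
    }

    /**
     * Produces one split per region, or, when splitByStats is true,
     * one split per guidepost-delimited chunk inside each region.
     * Assumes guideposts are sorted ascending.
     */
    static List<Split> plan(List<Region> regions, List<String> guideposts,
                            boolean splitByStats) {
        List<Split> splits = new ArrayList<>();
        for (Region r : regions) {
            String start = r.startKey;
            if (splitByStats) {
                for (String gp : guideposts) {
                    // Cut only at guideposts strictly inside the remaining range.
                    if (gp.compareTo(start) > 0 && gp.compareTo(r.endKey) < 0) {
                        splits.add(new Split(start, gp, r.host));
                        start = gp;
                    }
                }
            }
            // The tail of the region (or the whole region if no cuts were made).
            splits.add(new Split(start, r.endKey, r.host));
        }
        return splits;
    }
}
```

In this model, two regions [a, m) and [m, z) with guideposts f, m and s yield two splits when splitByStats is false and four when it is true, and every split keeps the host of the region it came from, which is the information an execution engine needs to schedule work near the data.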

I'll follow up on PHOENIX-3601 with the performance results I gathered in some Spark testing, but it would be awesome if other folks using the Phoenix MR integration in some capacity could test this out. It's a bit of a double-whammy, since this patch both gets us node-awareness for the splits and increases the potential parallelism by including the statistics-generated scans. cc [~maghamravikiran@gmail.com] [~ndimiduk] [~sergey.soldatov] [~elserj] [~jamestaylor], perhaps others :)

> Core MapReduce classes don't provide location info
> --------------------------------------------------
>
>                 Key: PHOENIX-3600
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3600
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 4.8.0
>            Reporter: Josh Mahonin
>            Assignee: Josh Mahonin
>         Attachments: PHOENIX-3600.patch
>
>
> The core MapReduce classes {{org.apache.phoenix.mapreduce.PhoenixInputSplit}} and {{org.apache.phoenix.mapreduce.PhoenixInputFormat}} don't provide region size or location information, leaving the execution engine (MR, Spark, etc.) to randomly assign splits to nodes.
> Interestingly, the phoenix-hive module has reimplemented these classes, including the node-aware functionality. We should port a subset of those changes back to the core code so that other engines can make use of them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)