Posted to dev@hive.apache.org by Remus Rusanu <re...@microsoft.com> on 2013/12/04 14:06:00 UTC

A question about SMB join and the MapWork pathToPartitionInfo/pathToAliases: considered 'local' plan for the 'small' SMB aliases

Hi all,

I'm working on HIVE-5595 to add vectorization support for SMB join operators. The problem I'm facing is that the vectorized record readers (e.g. VectorizedOrcRecordReader) depend on MapWork.pathToPartitionInfo (see VectorizedRowBatchCtx.init).

What I discovered, though, is that for SMB join plans this map (along with the related pathToAliases map) is incomplete. When the plan is populated in GenMapRedUtils.setTaskPlan, aliasToPartnInfo is always filled in:

plan.getAliasToPartnInfo().put(alias_id, aliasPartnDesc);

but the pathToAliases and pathToPartitionInfo maps are skipped in the local case:

    if (!local) {
      while (iterPath.hasNext()) {
...
        plan.getPathToAliases().get(path).add(alias_id);
        plan.getPathToPartitionInfo().put(path, prtDesc);
...

And local is true in this case for the 'small' alias. It is set further up the call stack by MapJoinFactory$TableScanMapJoinProcessor.process:

      boolean local = pos != mapJoin.getConf().getPosBigTable();
      if (oldTask == null) {
        assert currPlan.getReduceWork() == null;
        initMapJoinPlan(mapJoin, currTask, ctx, local);
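
To make the effect of the two snippets concrete, here is a toy model of the behaviour described above. It is only a sketch and not Hive code: the class SmbPlanSketch, the helper planFor, the "/warehouse/..." paths and the "desc-..." strings are all made up for illustration, and plain strings stand in for Hive's partition descriptors.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these maps borrow the MapWork field names, nothing else.
public class SmbPlanSketch {
    public final Map<String, String> aliasToPartnInfo = new HashMap<>();
    public final Map<String, List<String>> pathToAliases = new HashMap<>();
    public final Map<String, String> pathToPartitionInfo = new HashMap<>();

    // Same shape as the quoted logic: the alias map is always filled,
    // the two path-keyed maps only when the alias is not 'local'.
    public void setTaskPlan(String aliasId, String path, String partDesc, boolean local) {
        aliasToPartnInfo.put(aliasId, partDesc);
        if (!local) {
            pathToAliases.computeIfAbsent(path, p -> new ArrayList<>()).add(aliasId);
            pathToPartitionInfo.put(path, partDesc);
        }
    }

    // Hypothetical helper: every alias except the big table is marked local,
    // following the 'pos != posBigTable' test quoted above.
    public static SmbPlanSketch planFor(String[] aliases, int posBigTable) {
        SmbPlanSketch plan = new SmbPlanSketch();
        for (int pos = 0; pos < aliases.length; pos++) {
            boolean local = pos != posBigTable;
            plan.setTaskPlan(aliases[pos], "/warehouse/" + aliases[pos],
                    "desc-" + aliases[pos], local);
        }
        return plan;
    }

    public static void main(String[] args) {
        SmbPlanSketch plan = planFor(new String[] {"big", "small"}, 0);
        System.out.println("aliasToPartnInfo keys:    " + plan.aliasToPartnInfo.keySet());
        System.out.println("pathToPartitionInfo keys: " + plan.pathToPartitionInfo.keySet());
    }
}
```

In this model both aliases end up in aliasToPartnInfo, but only the big table's path appears in pathToAliases and pathToPartitionInfo, so a reader that looks up the 'small' alias by path finds nothing.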


My question is for the SMB/MapJoin experts, to clarify this anomaly. SMB join is not local, but it is treated as local, and as a result the aforementioned maps in the plan are incomplete. Is local=true intentional in the SMB case, or is it just a leftover from the original MapJoin implementation? Should SMB join set it to false, or will the sky collapse? I can think of several 'workarounds', but there is too much context here that I don't have a strong grasp of.

Relevant stack:

GenMapRedUtils.setTaskPlan(String, Operator<OperatorDesc>, Task<?>, boolean, GenMRProcContext, PrunedPartitionList) line: 658
GenMapRedUtils.setTaskPlan(String, Operator<OperatorDesc>, Task<?>, boolean, GenMRProcContext) line: 400
MapJoinFactory$TableScanMapJoinProcessor.initMapJoinPlan(AbstractMapJoinOperator<MapJoinDesc>, Task<Serializable>, GenMRProcContext, boolean) line: 157
MapJoinFactory$TableScanMapJoinProcessor.process(Node, Stack<Node>, NodeProcessorCtx, Object...) line: 219
DefaultRuleDispatcher.dispatch(Node, Stack<Node>, Object...) line: 90
GenMapRedWalker(DefaultGraphWalker).dispatchAndReturn(Node, Stack<Node>) line: 94
GenMapRedWalker.walk(Node) line: 54
GenMapRedWalker.walk(Node) line: 65
GenMapRedWalker.walk(Node) line: 65
GenMapRedWalker(DefaultGraphWalker).startWalking(Collection<Node>, HashMap<Node,Object>) line: 109
MapReduceCompiler.compile(ParseContext, List<Task<Serializable>>, HashSet<ReadEntity>, HashSet<WriteEntity>) line: 267
SemanticAnalyzer.analyzeInternal(ASTNode) line: 8927


Thanks,
~Remus