Posted to dev@pig.apache.org by Dmitriy Ryaboy <dv...@gmail.com> on 2011/04/22 19:59:31 UTC

Notes on differences between Local and MR mode in Pig

I jotted down some notes on the internal differences I could find
between LOCAL and MAPREDUCE mode in the Pig code base.

As discussed in the contributor meeting, we should audit our tests and
switch all the ones that don't touch on one of these conditions to start
using Local mode, as it will significantly speed up the test suite. See
PIG-2011 for example.

Differences:

- no distributed cache support.
  See JobControlCompiler's private static String addSingleFileToDistributedCache.
  This also means no FRJoin, no MergeJoin, no MergeCoGroup, and no UDFs that
  rely on DistCache.

- outputCommitter's cleanupJob does not get called in local mode.
  TestStore.testSetStoreSchema() tests for a workaround, so don't mess with
  TestStoreSchema stuff.

- anything that checks "MRCompiler.hasTooManyInputFiles", which only matters
  in MR mode.
  This func gets called by FRJoin and by aggregateScalarFiles, which in turn
  gets called by MapReduceLauncher.compile. From aggregateScalarFiles it gets
  called for every Store in the plan in a map-only job.
  In local, it always returns false.
  In MR, it returns true if:
  - the operator is a nativeMR operator, and optimisticFileConcatenation is on
  - the input is an hdfs file, and the number of splits (after potential
    combination), or otherwise the number of mappers, is > threshold.
  If there's a test of this behavior, it has to stay in MR mode.

- parallelism of the final Order by is set to 1 in Local, but can be
  dynamically determined in MR mode.
  (perhaps we should not do this, and instead run each requested "parallel"
  task serially in local?)

- OpLimitOptimizer does not apply in LOCAL

- the PARSER (QueryParser.jjt) always sets parallelism to 1 in local mode.
  So anything that tests parallelism has to test it in MR mode.

- same in LogicalPlanBuilder

- PigServer.capacity() is supposed to return available space, but does not
  work in local mode.
  This is only called in TestMapReduce (ever!). I think we can just toss the
  method and the test.

- Map tests appear to insist on MR. Perhaps not necessary? (see PIG-2011)

- any sort of classpath / register machinations should be tested in MR
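
To make the audit a bit more mechanical, the list above could be turned into
a small checklist helper along these lines. This is only a rough sketch and
not anything that exists in Pig: the class LocalModeAudit, the enum
MrOnlyCondition, and the method canRunInLocalMode are all hypothetical names
made up for illustration. The idea is to tag each test with the conditions it
touches and only move the ones with no tags to Local mode.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper, not part of the Pig code base.
class LocalModeAudit {

    // Conditions from the notes above that require a test to stay in MR mode.
    enum MrOnlyCondition {
        DISTRIBUTED_CACHE,        // FRJoin, MergeJoin, MergeCoGroup, DistCache-based UDFs
        OUTPUT_COMMITTER_CLEANUP, // cleanupJob does not get called in local mode
        TOO_MANY_INPUT_FILES,     // MRCompiler.hasTooManyInputFiles always returns false in local
        PARALLELISM,              // parallelism is forced to 1 in local mode
        LIMIT_OPTIMIZER,          // OpLimitOptimizer does not apply in local mode
        CLASSPATH_OR_REGISTER     // classpath / register machinations
    }

    // A test is a candidate for Local mode only if it touches none of the conditions.
    static boolean canRunInLocalMode(Set<MrOnlyCondition> touched) {
        return touched.isEmpty();
    }

    public static void main(String[] args) {
        // Example tags; which real tests fall into which bucket is for the audit to decide.
        Set<MrOnlyCondition> plainFilterTest = EnumSet.noneOf(MrOnlyCondition.class);
        Set<MrOnlyCondition> frJoinTest = EnumSet.of(MrOnlyCondition.DISTRIBUTED_CACHE);

        System.out.println("plain filter test -> local? " + canRunInLocalMode(plainFilterTest));
        System.out.println("FRJoin test -> local? " + canRunInLocalMode(frJoinTest));
    }
}
```

The tagging itself would still have to be done by hand while reading each
test; the sketch only records the rule that a single MR-only condition is
enough to keep a test in MR mode.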

I also have a note of "- SimplePigStats??" but don't recall what that refers
to. Perhaps PigStats counters are messed up in local mode?

Cheers

D