Posted to mapreduce-dev@hadoop.apache.org by "Paul Burkhardt (JIRA)" <ji...@apache.org> on 2010/09/16 01:04:35 UTC

[jira] Created: (MAPREDUCE-2070) Cartesian product file split

Cartesian product file split
----------------------------

                 Key: MAPREDUCE-2070
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2070
             Project: Hadoop Map/Reduce
          Issue Type: New Feature
    Affects Versions: 0.22.0
            Reporter: Paul Burkhardt
            Priority: Minor


Generates the Cartesian product of the files in two input directories as file pairs, and enables a RecordReader to read each split in tuple order, eliminating redundant read operations.

The new InputFormat generates splits composed of file combinations as tuples. The size of a split is configurable. A RecordReader uses the convenience class CartesianProductFileSplitReader to generate file pairs in tuple order. The actual read operations are delegated to the RecordReader, which must implement the CartesianProductTupleReader interface. An implementor of a RecordReader can therefore perform file manipulations without restriction and still benefit from the tuple-ordering optimization.
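
As an illustration of tuple ordering only, here is a minimal standalone sketch in plain Java. It does not use the Hadoop API or the classes from this issue: TupleOrderSketch, FilePair, splits and leftOpens are hypothetical names, and the split size is assumed to be a number of tuples. The left file varies slowest, so a reader walking a split in order only has to reopen the left file when it changes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from the MAPREDUCE-2070 patch.
class TupleOrderSketch {

    /** One (left, right) file pair of the Cartesian product. */
    static final class FilePair {
        final String left;
        final String right;

        FilePair(String left, String right) {
            this.left = left;
            this.right = right;
        }
    }

    /**
     * Enumerates all pairs with the left file varying slowest, then cuts
     * the sequence into splits of at most splitSize pairs each.
     */
    static List<List<FilePair>> splits(List<String> leftFiles,
                                       List<String> rightFiles,
                                       int splitSize) {
        if (splitSize < 1) {
            throw new IllegalArgumentException("splitSize must be >= 1");
        }
        List<List<FilePair>> result = new ArrayList<>();
        List<FilePair> current = new ArrayList<>();
        for (String l : leftFiles) {
            for (String r : rightFiles) {
                current.add(new FilePair(l, r));
                if (current.size() == splitSize) {
                    result.add(current);
                    current = new ArrayList<>();
                }
            }
        }
        if (!current.isEmpty()) {
            result.add(current); // last, possibly shorter, split
        }
        return result;
    }

    /** Counts how often a reader walking one split in order must open a left file. */
    static int leftOpens(List<FilePair> split) {
        int opens = 0;
        String open = null;
        for (FilePair p : split) {
            if (!p.left.equals(open)) {
                open = p.left; // left file changed: it would be (re)read here
                opens++;
            }
            // p.right would be read here, against the already loaded left file
        }
        return opens;
    }
}
```

For left files {a, b}, right files {1, 2, 3, 4} and a split size of 2, this yields four splits, each of which opens its left file once: 4 left-file reads in total instead of 8.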

In the Cartesian product of two sets with cardinalities X and Y, each element x of the first set (the one with cardinality X) need only be referenced once rather than once per tuple, saving X(Y-1) of the XY references. If the Cartesian product is split into subsets of size N, there are X(Y/N) references instead of XY (taking N to divide Y), a difference of XY(N-1)/N. If every x has the same size s, this saves reading sXY(N-1)/N bytes.
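
The counts above can be checked with a small calculation. The following is a sketch in plain Java; the class and method names are hypothetical, only the formulas come from the description, and it assumes N divides Y so that Y/N is an integer.

```java
// Hypothetical helper names; only the formulas come from the issue description.
class ReferenceSavings {

    /** References to left elements when every tuple re-reads its x: X*Y. */
    static long naiveReferences(long x, long y) {
        return x * y;
    }

    /** References when each split of n tuples reads its x once: X*(Y/N). */
    static long splitReferences(long x, long y, long n) {
        if (n < 1 || y % n != 0) {
            throw new IllegalArgumentException("sketch assumes N >= 1 and N divides Y");
        }
        return x * (y / n);
    }

    /** Bytes not read when every x has size s: s*X*Y*(N-1)/N. */
    static long bytesSaved(long s, long x, long y, long n) {
        return s * (naiveReferences(x, y) - splitReferences(x, y, n));
    }
}
```

With X = 4, Y = 6 and N = 3, naiveReferences gives 24 and splitReferences gives 8, a difference of 16 = XY(N-1)/N. With s = 100 bytes, bytesSaved gives 1600. Setting N = Y reproduces the unsplit case: 4 references, saving X(Y-1) = 20.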

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.