Posted to dev@pig.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2016/04/18 06:01:25 UTC

[jira] [Updated] (PIG-4874) Remove schema tuple reference overhead for replicate join hashmap

     [ https://issues.apache.org/jira/browse/PIG-4874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-4874:
------------------------------------
    Attachment: PIG-4874-1.patch

We use a lot of memory to store the replicate table, as the following test shows.

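{code}
// Java (needs java.io.BufferedOutputStream and java.io.FileOutputStream):
// write 100 tab-separated rows "i<TAB>i" for i = 0..99 as the test input.
BufferedOutputStream bout = new BufferedOutputStream(new FileOutputStream("/tmp/data"));
for (int i = 0; i < 100; i++) {
    bout.write((i + "\t" + i + "\n").getBytes());
}
bout.close();

-- Pig: join the file with itself on both columns using a replicated join.
A = LOAD '/tmp/data' as (x:int, y:int);
B = LOAD '/tmp/data' as (x:int, y:int);
C = JOIN A by (x,y), B by (x,y) using 'replicated';
STORE C into '/tmp/tezout';
{code}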

The 100 entries from 0 to 99 in the above test take up 580 bytes on disk. Retained sizes reported by YourKit for the replicates map are:

pig.schematuple=false
   Before patch : 46096
   After patch  : 37232
pig.schematuple=true
   17808
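
To put the figures above in proportion, here is a small arithmetic sketch. It only restates the three retained sizes and the row count from this message; the class and method names (RetainedSizeSavings, percentSaved, perEntry) are made up for illustration and are not part of Pig or the patch.

```java
public class RetainedSizeSavings {
    // Retained sizes of the replicates map, copied from the figures above.
    static final long SCHEMATUPLE_OFF_BEFORE_PATCH = 46096;
    static final long SCHEMATUPLE_OFF_AFTER_PATCH = 37232;
    static final long SCHEMATUPLE_ON = 17808;
    // Number of rows written by the test.
    static final int ENTRIES = 100;

    /** Percentage by which 'reduced' is smaller than 'baseline'. */
    static double percentSaved(long baseline, long reduced) {
        return 100.0 * (baseline - reduced) / baseline;
    }

    /** Average retained size per stored row. */
    static double perEntry(long retained) {
        return (double) retained / ENTRIES;
    }

    public static void main(String[] args) {
        System.out.printf("patch alone, schematuple=false: %.1f%% smaller%n",
                percentSaved(SCHEMATUPLE_OFF_BEFORE_PATCH, SCHEMATUPLE_OFF_AFTER_PATCH));
        System.out.printf("schematuple=true vs. before patch: %.1f%% smaller%n",
                percentSaved(SCHEMATUPLE_OFF_BEFORE_PATCH, SCHEMATUPLE_ON));
        System.out.printf("per row: %.2f, %.2f, %.2f%n",
                perEntry(SCHEMATUPLE_OFF_BEFORE_PATCH),
                perEntry(SCHEMATUPLE_OFF_AFTER_PATCH),
                perEntry(SCHEMATUPLE_ON));
    }
}
```

With these inputs the patch alone trims roughly 19% of the retained size, and the pig.schematuple=true figure is roughly 61% below the pre-patch one.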

This patch also optimizes SchemaTuple for the case of primitive keys. Currently TestSchemaTuple runs only with ExecType.LOCAL. I found that SchemaTuple does not work with Tez because it requires some resources to be shipped through DistributedCache, and that is not done. I will create a separate JIRA to fix SchemaTuple for Tez and make it more robust, so that we can try to make it the default for replicate join, given such large memory savings.



> Remove schema tuple reference overhead for replicate join hashmap
> -----------------------------------------------------------------
>
>                 Key: PIG-4874
>                 URL: https://issues.apache.org/jira/browse/PIG-4874
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.16.0
>
>         Attachments: PIG-4874-1.patch
>
>
> Currently, even when pig.schematuple is set to false (the default), using TupleToMapKey and TuplesToSchemaTupleList instead of a plain HashMap<Object, ArrayList<Tuple>> costs a lot of memory. Also, the key is currently converted to a tuple, which is unnecessary.
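
A minimal sketch of the idea in the description, not the code in the attached patch: rows are kept in a plain HashMap keyed by the join key, and a single-column key is stored as the bare value rather than wrapped in a tuple. To stay self-contained, a List<Object> stands in for Pig's Tuple; the class ReplicateMapSketch and its methods add, lookup and keyOf are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReplicateMapSketch {
    // Assumption: List<Object> stands in for Pig's Tuple (one row of field values).
    // Plain map from join key to all rows that share that key.
    private final Map<Object, ArrayList<List<Object>>> replicates = new HashMap<>();

    /** Builds the map key: the bare field for one key column, a list of fields otherwise. */
    static Object keyOf(List<Object> row, int... keyColumns) {
        if (keyColumns.length == 1) {
            return row.get(keyColumns[0]); // no wrapper object for a single key
        }
        List<Object> key = new ArrayList<>(keyColumns.length);
        for (int column : keyColumns) {
            key.add(row.get(column));
        }
        return key; // List.equals/hashCode make this usable as a map key
    }

    /** Stores a row of the replicated input under its join key. */
    void add(List<Object> row, int... keyColumns) {
        replicates.computeIfAbsent(keyOf(row, keyColumns), k -> new ArrayList<>()).add(row);
    }

    /** Returns the rows stored under the given key, or an empty list if none. */
    List<List<Object>> lookup(Object key) {
        ArrayList<List<Object>> rows = replicates.get(key);
        if (rows == null) {
            return new ArrayList<>();
        }
        return rows;
    }

    public static void main(String[] args) {
        ReplicateMapSketch map = new ReplicateMapSketch();
        for (int i = 0; i < 100; i++) {
            map.add(Arrays.<Object>asList(i, i), 0, 1); // key on (x, y) as in the test
        }
        Object probe = keyOf(Arrays.<Object>asList(42, 42), 0, 1);
        System.out.println(map.lookup(probe));
    }
}
```

The wrapper is skipped only in the single-column case here; a multi-column key such as (x, y) from the test still needs some composite object with value-based equals and hashCode.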



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)