Posted to dev@pig.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2016/04/18 06:01:25 UTC
[jira] [Updated] (PIG-4874) Remove schema tuple reference overhead for replicate join hashmap
[ https://issues.apache.org/jira/browse/PIG-4874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rohini Palaniswamy updated PIG-4874:
------------------------------------
Attachment: PIG-4874-1.patch
We use a lot of memory to store the replicate table.
{code}
// Java: write 100 rows of "i<TAB>i" to /tmp/data
BufferedOutputStream bout = new BufferedOutputStream(new FileOutputStream("/tmp/data"));
for (int i = 0; i < 100; i++) {
    bout.write((i + "\t" + i + "\n").getBytes());
}
bout.close();

-- Pig: join the file with itself on both columns using a replicated join
A = LOAD '/tmp/data' as (x:int, y:int);
B = LOAD '/tmp/data' as (x:int, y:int);
C = JOIN A by (x,y), B by (x,y) using 'replicated';
STORE C into '/tmp/tezout';
{code}
The 100 entries from 0 to 99 in the above test take up 508 bytes on disk. Retained sizes in YourKit for the replicate map are:
pig.schematuple=false
    Before patch : 46096
    After patch : 37232
pig.schematuple=true
    17808
This patch also optimizes schema tuple for the case of primitive keys. Currently TestSchemaTuple runs only with ExecType.LOCAL. I found that SchemaTuple does not work with Tez, because it requires some resources to be shipped through the DistributedCache and that is not done. I will create a separate JIRA to fix SchemaTuple for Tez and to make it more robust, so that we can try to make it the default for replicate join given such large memory savings.
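For illustration only, here is a minimal sketch of the kind of structure the issue argues for: a replicate table held in a plain HashMap keyed directly by the join key. It is not the code in the patch. A row is modeled as a List<Object> as a hypothetical stand-in for a Pig Tuple, and the class and method names (ReplicatedJoinSketch, build, keyOf, probe) are made up for this example. A single-column key is stored as the raw value with no tuple wrapper; a multi-column key falls back to a list of the key fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReplicatedJoinSketch {

    // Build the in-memory table for the replicated side.
    // Rows are List<Object> here, a hypothetical stand-in for a Pig Tuple.
    static Map<Object, ArrayList<List<Object>>> build(List<List<Object>> rows, int... keyCols) {
        Map<Object, ArrayList<List<Object>>> table = new HashMap<>();
        for (List<Object> row : rows) {
            table.computeIfAbsent(keyOf(row, keyCols), k -> new ArrayList<>()).add(row);
        }
        return table;
    }

    // Single-column keys are used as-is, so no wrapper object is allocated.
    // Multi-column keys are copied into a list, whose equals/hashCode are value based.
    static Object keyOf(List<Object> row, int... keyCols) {
        if (keyCols.length == 1) {
            return row.get(keyCols[0]);
        }
        List<Object> key = new ArrayList<>(keyCols.length);
        for (int c : keyCols) {
            key.add(row.get(c));
        }
        return key;
    }

    // Look up the rows of the replicated side that match one row of the other side.
    static List<List<Object>> probe(Map<Object, ArrayList<List<Object>>> table,
                                    List<Object> row, int... keyCols) {
        return table.getOrDefault(keyOf(row, keyCols), new ArrayList<>());
    }

    public static void main(String[] args) {
        // Same shape as the test data above: 100 rows of (i, i).
        List<List<Object>> rows = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            rows.add(Arrays.<Object>asList(i, i));
        }
        Map<Object, ArrayList<List<Object>>> table = build(rows, 0, 1);
        int matches = 0;
        for (List<Object> row : rows) {
            matches += probe(table, row, 0, 1).size();
        }
        System.out.println("matches = " + matches);
    }
}
```

The sketch only shows the lookup structure. It says nothing about the retained sizes reported above, which depend on how Pig represents tuples and keys internally.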
> Remove schema tuple reference overhead for replicate join hashmap
> -----------------------------------------------------------------
>
> Key: PIG-4874
> URL: https://issues.apache.org/jira/browse/PIG-4874
> Project: Pig
> Issue Type: Improvement
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: 0.16.0
>
> Attachments: PIG-4874-1.patch
>
>
> Currently, even if pig.schematuple is set to false, which is the default, the use of TupleToMapKey and TuplesToSchemaTupleList instead of a plain HashMap<Object, ArrayList<Tuple>> costs a lot of memory. Also, the key is currently converted to a tuple, which is unnecessary.