You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@pig.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2011/04/07 05:07:06 UTC

[jira] [Created] (PIG-1972) Cache split information details for data with large number of small part files

Cache split information details for data with large number of small part files
------------------------------------------------------------------------------

Key: PIG-1972
URL: https://issues.apache.org/jira/browse/PIG-1972
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.8.0
Environment: Pig 0.8 version with PigMix http://wiki.apache.org/pig/PigMix
Reporter: Rajesh Balamohan

While running scalability benchmarks with Pig 0.8 & PigMix, L14 query listed in http://wiki.apache.org/pig/PigMix showed no scalability characteristics (i.e, for the same problem size response time should decrease as we increase the number of nodes)

Investigating further revealed that L14 query merge-joins small dataset and another large dataset. If the small dataset has many part files with very little amount of data, it causes a huge pressure on NameNode. This is because it is read as a side file in all map slows.

In the environment where I ran the experiment, small dataset was spread across 1900+ part files in HDFS.

Following codepath has the perf issue.
DefaultIndexableLoader--> seekNear() --> initRightLoader() is causing the huge delay. Since
"users_sorted" data is spread across 1900+ small files, FileInputFormat.getSplits() hits the namenode too
frequently.

i.e, (number of machines * number of map slots * 1900+ times). This is the reason why L14 is not scaling up.

Suggestion would be to cache the splitInformation of the small dataset instead of hitting the namenode too frequently.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-1972) Cache split information details for data with large number of small part files

Posted by "Rajesh Balamohan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026786#comment-13026786 ] 

Rajesh Balamohan commented on PIG-1972:
---------------------------------------

If the side data being read is small (ex:<= a block size), it would be replicated only in 3 nodes by default. So when every map is trying to read the side data, it would be choking to read the required details only from the 3 nodes. Suggestion would be to increase the replication factor of the side data being read. Alternatively we can load the side data in the distributedcache as mentioned in this JIRA to reduce the performance impact.

> Cache split information details for data with large number of small part files
> ------------------------------------------------------------------------------
>
>                 Key: PIG-1972
>                 URL: https://issues.apache.org/jira/browse/PIG-1972
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.8.0
>         Environment: Pig 0.8 version with PigMix http://wiki.apache.org/pig/PigMix
>            Reporter: Rajesh Balamohan
>
> While running scalability benchmarks with Pig 0.8 & PigMix, L14 query listed in http://wiki.apache.org/pig/PigMix showed no scalability characteristics (i.e, for the same problem size response time should decrease as we increase the number of nodes)
> Investigating further revealed that L14 query merge-joins small dataset and another large dataset. If the small dataset has many part files with very little amount of data, it causes a huge pressure on NameNode. This is because it is read as a side file in all map slows.
> In the environment where I ran the experiment, small dataset was spread across 1900+ part files in HDFS.
> Following codepath has the perf issue.
> DefaultIndexableLoader--> seekNear() --> initRightLoader() is causing the huge delay. Since
> "users_sorted" data is spread across 1900+ small files, FileInputFormat.getSplits() hits the namenode too
> frequently. 
> i.e, (number of machines * number of map slots * 1900+ times). This is the reason why L14 is not scaling up.
> Suggestion would be to cache the splitInformation of the small dataset instead of hitting the namenode too frequently.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira