Posted to dev@pig.apache.org by "Vivek Padmanabhan (JIRA)" <ji...@apache.org> on 2011/02/08 09:05:58 UTC

[jira] Commented: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

    [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991832#comment-12991832 ] 

Vivek Padmanabhan commented on PIG-1842:
----------------------------------------

Below are some of the issues addressed in the patch:
a) Marking the loader as splittable, except for gz formats
b) Changing XMLLoader to read only its own split rather than the entire file
c) Handling scenarios regarding split/record boundaries
d) Using CBZip2InputStream to handle bzip2 files
e) An improvement to the logic of collectTag (i.e., skipping unnecessary reads for an end tag if no start tag is found)
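
To illustrate the split/record boundary handling and the collectTag shortcut, here is a minimal sketch in Python. It is not the code from the patch (the loader itself is Java); the names read_split and read_all_splits are hypothetical. It assumes the input is an in-memory byte string, that elements of the target tag are not nested, and that a record belongs to the split in which its start tag begins, so a reader may run past the end of its split to finish the last record.

```python
from typing import List

# Bytes that may legally follow the tag name in a start tag such as <page> or <page id="1">.
TAG_END = (b">", b" ", b"\t", b"\r", b"\n")


def read_split(data: bytes, start: int, end: int, tag: str) -> List[bytes]:
    """Return every <tag>...</tag> element whose start tag begins in [start, end).

    Hypothetical helper for illustration only; assumes non-nested elements.
    """
    open_prefix = b"<" + tag.encode("utf-8")
    close_tag = b"</" + tag.encode("utf-8") + b">"
    records: List[bytes] = []
    pos = start
    while True:
        begin = data.find(open_prefix, pos)
        # No start tag owned by this split: stop without searching for an end tag.
        if begin == -1 or begin >= end:
            break
        after = begin + len(open_prefix)
        # Skip look-alikes such as <pages> when the tag is "page".
        if data[after:after + 1] not in TAG_END:
            pos = begin + 1
            continue
        finish = data.find(close_tag, after)
        if finish == -1:
            break  # truncated element, nothing complete to emit
        finish += len(close_tag)  # may lie beyond `end`; the record is still ours
        records.append(data[begin:finish])
        pos = finish
    return records


def read_all_splits(data: bytes, split_size: int, tag: str) -> List[bytes]:
    """Simulate one reader per fixed-size split and concatenate their output."""
    records: List[bytes] = []
    for start in range(0, len(data), split_size):
        records.extend(read_split(data, start, start + split_size, tag))
    return records
```

With this ownership rule, every record is emitted exactly once regardless of where the split boundaries fall, and a split that contains no start tag returns immediately.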

Manual tests for scalability and functional verification were done for the patch.
Using the latest Wikipedia dump in bz2 format (contains 10861606 pages; 6.5 GB as bz2), the new loader completed within 3 minutes, while the older version took more than 35 minutes for a simple script that loads, filters nulls, and stores.



> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>         Attachments: PIG-1842_1.patch
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira