Posted to dev@pig.apache.org by "Vivek Padmanabhan (JIRA)" <ji...@apache.org> on 2011/03/01 10:26:36 UTC

[jira] Commented: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

    [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000794#comment-13000794 ] 

Vivek Padmanabhan commented on PIG-1842:
----------------------------------------

The errors occur because PIG-1839 (XMLLoader will always add an extra empty tuple even if no tags are matched), which corrects these test cases, was not applied to the 0.8 branch.

> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0, 0.8.0, 0.9.0
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>             Fix For: 0.7.0, 0.8.0, 0.9.0
>
>         Attachments: PIG-1842_1.patch, PIG-1842_2.patch, TEST-org.apache.pig.piggybank.test.storage.TestXMLLoader.txt
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira