Posted to dev@pig.apache.org by "Viraj Bhat (JIRA)" <ji...@apache.org> on 2011/02/04 03:21:22 UTC

[jira] Created: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

Improve Scalability of the XMLLoader for large datasets such as wikipedia
-------------------------------------------------------------------------

                 Key: PIG-1842
                 URL: https://issues.apache.org/jira/browse/PIG-1842
             Project: Pig
          Issue Type: Improvement
            Reporter: Viraj Bhat
            Assignee: Vivek Padmanabhan


The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.

Viraj

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

Posted by "Vivek Padmanabhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vivek Padmanabhan updated PIG-1842:
-----------------------------------

    Attachment: PIG-1842_2.patch

Attaching the patch again

> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0, 0.8.0, 0.9.0
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>             Fix For: 0.7.0, 0.8.0, 0.9.0
>
>         Attachments: PIG-1842_1.patch, PIG-1842_2.patch
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj


[jira] Updated: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

Posted by "Viraj Bhat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat updated PIG-1842:
----------------------------

    Affects Version/s: 0.9.0
                       0.7.0
                       0.8.0

> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.7.0, 0.8.0, 0.9.0
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>         Attachments: PIG-1842_1.patch
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj


[jira] Updated: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

Posted by "Vivek Padmanabhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vivek Padmanabhan updated PIG-1842:
-----------------------------------

    Status: Patch Available  (was: In Progress)

> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.8.0, 0.7.0, 0.9.0
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>             Fix For: 0.9.0, 0.8.0, 0.7.0
>
>         Attachments: PIG-1842_1.patch, PIG-1842_2.patch
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj


[jira] Updated: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1842:
----------------------------

    Attachment: TEST-org.apache.pig.piggybank.test.storage.TestXMLLoader.txt

> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0, 0.8.0, 0.9.0
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>             Fix For: 0.7.0, 0.8.0, 0.9.0
>
>         Attachments: PIG-1842_1.patch, PIG-1842_2.patch, TEST-org.apache.pig.piggybank.test.storage.TestXMLLoader.txt
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj


[jira] Commented: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992088#comment-12992088 ] 

Alan Gates commented on PIG-1842:
---------------------------------

The patch does not apply cleanly against the trunk.  Can you regenerate the patch against the latest trunk?

> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>         Attachments: PIG-1842_1.patch
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj


[jira] Updated: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

Posted by "Vivek Padmanabhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vivek Padmanabhan updated PIG-1842:
-----------------------------------

    Patch Info: [Patch Available]

> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>         Attachments: PIG-1842_1.patch
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj


[jira] [Updated] (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1842:
----------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0, 0.8.0, 0.9.0
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>             Fix For: 0.9.0, 0.8.0, 0.7.0
>
>         Attachments: PIG-1842_1.patch, PIG-1842_2.patch, TEST-org.apache.pig.piggybank.test.storage.TestXMLLoader.txt
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj


[jira] Commented: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999053#comment-12999053 ] 

Alan Gates commented on PIG-1842:
---------------------------------

From reviewing the code it is not clear to me how this splits the XML file.  Let's say we have an XML file that looks like:

{code}
<a>
    <b>
        <c>
        </c>
        <c1>
        </c1>
    </b>
</a>
<a1>
    <b1>
    </b1>
    <b2>
    </b2>
</a1>
{code}

and the split falls on line "</c1>".  

How far will split 1 read?  It seems like it has to read to "</a>" or else the map processing split one will not be able to process this as a coherent document.  Yet from the setting of maxBytesReadable on line 132 it looks to me like it won't read past the end point.  

How does split 2 know where to start?  I don't see any code that is telling split 2 to fast forward to the point where split 1 ends.

All the tests pass just fine.


> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0, 0.8.0, 0.9.0
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>             Fix For: 0.7.0, 0.8.0, 0.9.0
>
>         Attachments: PIG-1842_1.patch, PIG-1842_2.patch
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj


[jira] Commented: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

Posted by "Vivek Padmanabhan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000794#comment-13000794 ] 

Vivek Padmanabhan commented on PIG-1842:
----------------------------------------

The errors occur because PIG-1839 (XMLLoader will always add an extra empty tuple even if no tags are matched), which corrects these test cases, was not applied to the 0.8 branch.

> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0, 0.8.0, 0.9.0
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>             Fix For: 0.7.0, 0.8.0, 0.9.0
>
>         Attachments: PIG-1842_1.patch, PIG-1842_2.patch, TEST-org.apache.pig.piggybank.test.storage.TestXMLLoader.txt
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj


[jira] Updated: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

Posted by "Vivek Padmanabhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vivek Padmanabhan updated PIG-1842:
-----------------------------------

    Attachment: PIG-1842_1.patch

Attaching an initial patch.

> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>         Attachments: PIG-1842_1.patch
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj


[jira] Commented: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

Posted by "Vivek Padmanabhan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991832#comment-12991832 ] 

Vivek Padmanabhan commented on PIG-1842:
----------------------------------------

The following are some of the issues addressed in the patch:
a) Marking the loader as splittable, except for gz formats
b) Changing XMLLoader to read its split rather than the entire file
c) Handling scenarios regarding split/record boundaries
d) Using CBZip2InputStream to handle bzip2 files
e) An improvement to the logic of collectTag (i.e., skip unnecessary reads to find the end tag if no start tags are found)

Manual tests for scalability and functional verification were done for the patch.
Using the latest Wikipedia dump in bz2 format (10861606 pages; 6.5 GB bz2), the new loader completed within 3 minutes, while the older version took more than 35 minutes for a simple load-filter null-store script.



> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>         Attachments: PIG-1842_1.patch
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj


[jira] Updated: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

Posted by "Viraj Bhat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat updated PIG-1842:
----------------------------

      Component/s: impl
    Fix Version/s: 0.7.0
                   0.8.0
                   0.9.0

> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0, 0.8.0, 0.9.0
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>             Fix For: 0.7.0, 0.8.0, 0.9.0
>
>         Attachments: PIG-1842_1.patch
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj


[jira] Commented: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

Posted by "Vivek Padmanabhan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999315#comment-12999315 ] 

Vivek Padmanabhan commented on PIG-1842:
----------------------------------------

Hi Alan,
Here is how I have handled these cases.

Note:
The XMLLoader treats everything from a beginning tag to its end tag as one record, just as a line record reader searches for the newline character.
Split start and end locations are provided by the default FileInputFormat.

The steps, described simply:

* The loader collects the start and end tags and creates a record out of them (XMLLoaderBufferedPositionedInputStream.collectTag).
	* For the begin tag
		* Read until the tag is found in this block.
			* If the tag is not found and the split end has been reached, there is no record in this split (return an empty array).
			* If a partial tag is found in the current split, then even though the split end has been reached,
			  continue reading the rest of the file beyond the split end location (handled by the condition in the while loop).
	* For the end tag
		* Read until the end tag is found, even if the split end location is reached.
	
		
>>How far will split 1 read? It seems like it has to read to "</a>" or else the map processing split one will not be able to process this as a coherent document. 
>>Yet from the setting of maxBytesReadable on line 132 it looks to me like it won't read past the end point.

The other condition, (matchBuf.size() > 0), keeps the reading going.

In this case, let's say my tag identifier is <a>. The loader reads up to the split end to search for the beginning tag.
For the end tag, it reads the rest of the file starting from the last read position. If the split end is reached along the way,
it checks whether it has found a match or a partial match. If not, it proceeds with the reading until it finds an end tag.
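
The boundary rule described above can be sketched outside of Pig. The following is a minimal illustration in Python, not the code from the patch: the function name records_in_split and its parameters are hypothetical, the whole file is assumed to fit in memory as bytes, and the record tag is assumed not to nest inside itself. A split emits every record whose start tag begins inside the split, and it reads past the split end as far as needed to reach the end tag.

```python
import re
from typing import List


def records_in_split(data: bytes, tag: str, start: int, end: int) -> List[bytes]:
    """Return each <tag>...</tag> record whose start tag begins in [start, end).

    Hypothetical helper for illustration only; it is not part of XMLLoader.
    """
    # Match "<tag>" or "<tag " (attributes), but not a longer name such as "<tag1>".
    open_re = re.compile(rb"<" + re.escape(tag.encode()) + rb"[\s>]")
    close = b"</" + tag.encode() + b">"
    records = []
    pos = start
    while True:
        match = open_re.search(data, pos)
        # A start tag that begins at or after the split end belongs to a later split.
        if match is None or match.start() >= end:
            break
        # The end tag is searched for without regard to the split end.
        stop = data.find(close, match.end())
        if stop == -1:
            break  # unterminated record: nothing to emit
        stop += len(close)
        records.append(data[match.start():stop])
        pos = stop
    return records


doc = b"<r><a>one</a><a>two</a><a x='1'>three</a></r>"
whole = records_in_split(doc, "a", 0, len(doc))

# Cutting the file at any byte offset must yield the same records as one full read.
for cut in range(len(doc) + 1):
    left = records_in_split(doc, "a", 0, cut)
    right = records_in_split(doc, "a", cut, len(doc))
    assert left + right == whole
```

Because each record is owned by the split in which its start tag begins, no record is lost or emitted twice, whichever byte the split boundary falls on. The loop at the end checks exactly that for every possible cut point.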

> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0, 0.8.0, 0.9.0
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>             Fix For: 0.7.0, 0.8.0, 0.9.0
>
>         Attachments: PIG-1842_1.patch, PIG-1842_2.patch
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj


[jira] Commented: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

Posted by "Vivek Padmanabhan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999316#comment-12999316 ] 

Vivek Padmanabhan commented on PIG-1842:
----------------------------------------

I have done manual tests for split boundary conditions. Please suggest whether, and how, I can do the same with unit tests.

> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0, 0.8.0, 0.9.0
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>             Fix For: 0.7.0, 0.8.0, 0.9.0
>
>         Attachments: PIG-1842_1.patch, PIG-1842_2.patch
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj


[jira] Commented: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002190#comment-13002190 ] 

Alan Gates commented on PIG-1842:
---------------------------------

Patch 2 checked into 0.8 branch.

> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0, 0.8.0, 0.9.0
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>             Fix For: 0.7.0, 0.8.0, 0.9.0
>
>         Attachments: PIG-1842_1.patch, PIG-1842_2.patch, TEST-org.apache.pig.piggybank.test.storage.TestXMLLoader.txt
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj


[jira] [Updated] (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1842:
----------------------------

    Fix Version/s:     (was: 0.7.0)

> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0, 0.8.0, 0.9.0
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>             Fix For: 0.8.0, 0.9.0
>
>         Attachments: PIG-1842_1.patch, PIG-1842_2.patch, TEST-org.apache.pig.piggybank.test.storage.TestXMLLoader.txt
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj


[jira] Work started: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

Posted by "Vivek Padmanabhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on PIG-1842 started by Vivek Padmanabhan.

> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0, 0.8.0, 0.9.0
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>             Fix For: 0.7.0, 0.8.0, 0.9.0
>
>         Attachments: PIG-1842_1.patch, PIG-1842_2.patch
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj


[jira] Commented: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999506#comment-12999506 ] 

Alan Gates commented on PIG-1842:
---------------------------------

I have checked the patch into trunk.  I applied it to the 0.8 branch, but got errors in the unit tests.  I will attach the results of the 0.8 test run.

> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0, 0.8.0, 0.9.0
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>             Fix For: 0.7.0, 0.8.0, 0.9.0
>
>         Attachments: PIG-1842_1.patch, PIG-1842_2.patch
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj
