Posted to dev@pig.apache.org by Herbert Mühlburger <he...@gmail.com> on 2012/03/28 15:11:59 UTC

XMLLoader does not work with BIG wikipedia dump

Hi,

I would like to use Pig to work with Wikipedia dump files. It works 
successfully with an input file of around 8 GB in size, as long as the 
content of a single XML element is not too big.

In my current case I would like to use the file 
"enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2" (around 
2GB of compressed size) which can be found here:

http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2

Is it possible that XMLLoader from Piggybank has problems loading 
records split on <page> because the content of a single <page></page> 
XML element can potentially become very large (several GB, for instance)?

I hope somebody can help me with this.

I've tried to run the following Pig Latin script:

=========
register piggybank.jar;

pages = load '/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2' using org.apache.pig.piggybank.storage.XMLLoader('page') as (page:chararray);
pages = limit pages 1;
dump pages;
=========
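
Since the error below is a Java heap space failure in the map tasks rather than in the Pig client, one thing worth trying (a sketch with an assumed -Xmx value, not a verified fix) is raising the per-task JVM heap directly from the script via Hadoop 1.x's mapred.child.java.opts property:

```pig
register piggybank.jar;

-- Raise the heap of the child JVMs that run the map tasks.
-- The 4096m value is an assumption; size it to the largest <page> element.
set mapred.child.java.opts '-Xmx4096m';

pages = load '/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2'
    using org.apache.pig.piggybank.storage.XMLLoader('page') as (page:chararray);
pages = limit pages 1;
dump pages;
```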

and always get the following error (the generated logfile is attached):

=========

2012-03-28 14:49:54,695 [main] INFO  org.apache.pig.Main - Apache Pig 
version 0.11.0-SNAPSHOT (rexported) compiled Mrz 28 2012, 08:21:45
2012-03-28 14:49:54,696 [main] INFO  org.apache.pig.Main - Logging error 
messages to: 
/Users/herbert/Documents/workspace/pig-wikipedia/pig_1332938994693.log
2012-03-28 14:49:54,936 [main] INFO  org.apache.pig.impl.util.Utils - 
Default bootup file /Users/herbert/.pigbootup not found
2012-03-28 14:49:55,189 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - 
Connecting to hadoop file system at: hdfs://localhost:9000
2012-03-28 14:49:55,403 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - 
Connecting to map-reduce job tracker at: localhost:9001
2012-03-28 14:49:55,845 [main] INFO 
org.apache.pig.tools.pigstats.ScriptState - Pig features used in the 
script: LIMIT
2012-03-28 14:49:56,021 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler 
- File concatenation threshold: 100 optimistic? false
2012-03-28 14:49:56,067 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer 
- MR plan size before optimization: 1
2012-03-28 14:49:56,067 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer 
- MR plan size after optimization: 1
2012-03-28 14:49:56,171 [main] INFO 
org.apache.pig.tools.pigstats.ScriptState - Pig script settings are 
added to the job
2012-03-28 14:49:56,187 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler 
- mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-03-28 14:49:56,274 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler 
- creating jar file Job5733074907123320640.jar
2012-03-28 14:49:59,720 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler 
- jar file Job5733074907123320640.jar created
2012-03-28 14:49:59,736 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler 
- Setting up single store job
2012-03-28 14:49:59,795 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 1 map-reduce job(s) waiting for submission.
****hdfs://localhost:9000/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2
2012-03-28 14:50:00,152 [Thread-11] INFO 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input 
paths to process : 1
2012-03-28 14:50:00,169 [Thread-11] INFO 
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total 
input paths (combined) to process : 35
2012-03-28 14:50:00,299 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 0% complete
2012-03-28 14:50:01,277 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- HadoopJobId: job_201203281105_0009
2012-03-28 14:50:01,278 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- More information at: 
http://localhost:50030/jobdetails.jsp?jobid=job_201203281105_0009
2012-03-28 14:50:23,145 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 1% complete
2012-03-28 14:50:29,206 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 2% complete
2012-03-28 14:50:38,288 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 4% complete
2012-03-28 14:53:17,686 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 7% complete
2012-03-28 14:53:41,529 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 9% complete
2012-03-28 14:55:05,775 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 10% complete
2012-03-28 14:55:32,685 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 12% complete
2012-03-28 14:56:21,754 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 13% complete
2012-03-28 14:58:36,797 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- job job_201203281105_0009 has failed! Stop running all dependent jobs
2012-03-28 14:58:36,799 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 100% complete
2012-03-28 14:58:36,850 [main] ERROR 
org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to 
recreate exception from backed error: Error: Java heap space
2012-03-28 14:58:36,850 [main] ERROR 
org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2012-03-28 14:58:36,854 [main] INFO 
org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
1.0.1	0.11.0-SNAPSHOT	herbert	2012-03-28 14:49:56	2012-03-28 14:58:36	LIMIT

Failed!

Failed Jobs:
JobId	Alias	Feature	Message	Outputs
job_201203281105_0009	pages		Message: Job failed! Error - # of failed 
Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: 
task_201203281105_0009_m_000003 
hdfs://localhost:9000/tmp/temp1813558187/tmp250990633,

Input(s):
Failed to read data from 
"/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2"

Output(s):
Failed to produce result in 
"hdfs://localhost:9000/tmp/temp1813558187/tmp250990633"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201203281105_0009


2012-03-28 14:58:36,855 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Failed!
2012-03-28 14:58:36,891 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
ERROR 2997: Unable to recreate exception from backed error: Error: Java 
heap space
Details at logfile: 
/Users/herbert/Documents/workspace/pig-wikipedia/pig_1332938994693.log
pig wiki.pig  8,48s user 2,72s system 2% cpu 8:46,07 total

=========

Thank you very much and kind regards,
Herbert

Re: XMLLoader does not work with BIG wikipedia dump

Posted by Herbert Mühlburger <he...@gmail.com>.
Hi,

On 28.03.12 18:28, Jonathan Coveney wrote:
> - dev@pig
> + user@pig

You are right, it fits better on user@pig.

> What command are you using to run this? Are you upping the max heap?

I created a pig script wiki.pig with the following content:

====
register piggybank.jar;

pages = load 
'/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2' 
using org.apache.pig.piggybank.storage.XMLLoader('page') as 
(page:chararray);
pages = limit pages 1;
dump pages;
====

and used the command:

   % pig wiki.pig

to run the pig script.

I use the current Hadoop 1.0.1. My version of Pig is checked out from 
trunk and built by myself.

Everything that I customized was setting HADOOP_HEAPSIZE=2000 in 
hadoop-env.sh (the default heap size was 1000MB).
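
For what it's worth, as far as I know HADOOP_HEAPSIZE in hadoop-env.sh sizes the Hadoop daemons (NameNode, JobTracker, TaskTracker) and client JVMs, not the child JVMs that actually execute the map tasks. On Hadoop 1.x those are sized by mapred.child.java.opts, e.g. in mapred-site.xml (the 2048m value here is only an example, not a recommendation):

```xml
<!-- mapred-site.xml: heap for the JVMs that actually run map/reduce
     tasks; HADOOP_HEAPSIZE does not affect these. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
```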

Kind regards,
Herbert


-- 
=================================================================
Herbert Muehlburger  Software Development and Business Management
                                     Graz University of Technology
www.muehlburger.at                   www.twitter.com/hmuehlburger
=================================================================

Re: XMLLoader does not work with BIG wikipedia dump

Posted by Herbert Mühlburger <he...@gmail.com>.
Sorry, typing error: the heap size was set to 2000, not to 00.

On 28.03.12 21:14, Prashant Kommireddi wrote:
> Did you set heap size to 0?
>
> Sent from my iPhone
>
> On Mar 28, 2012, at 12:12 PM, "Herbert Mühlburger"
> <he...@gmail.com>  wrote:
>
>> Hi,
>>
>> On 28.03.12 18:28, Jonathan Coveney wrote:
>>> - dev@pig
>>> + user@pig
>>
>> You are right, fits better to user@pig.
>>
>>> What command are you using to run this? Are you upping the max heap?
>>
>> I created a pig script wiki.pig with the following content:
>>
>> ====
>> register piggybank.jar;
>>
>> pages = load
>> '/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2'
>> using org.apache.pig.piggybank.storage.XMLLoader('page') as
>> (page:chararray);
>> pages = limit pages 1;
>> dump pages;
>> ====
>> and used the command:
>>
>>   % pig wiki.pig
>>
>> to run the pig script.
>>
>> I use current Hadoop 1.0.1. My version of PIG is checked out from trunk
>> and build by myself.
>>
>> Everything that I customized was setting HADOOP_HEAPSIZE 00 in
>> hadoop-env.sh (default heap size was was 1000MB).
>>
>> Kind regards,
>> Herbert
>>

-- 
=================================================================
Herbert Muehlburger  Software Development and Business Management
                                     Graz University of Technology
www.muehlburger.at                   www.twitter.com/hmuehlburger
=================================================================

Re: XMLLoader does not work with BIG wikipedia dump

Posted by Herbert Mühlburger <he...@gmail.com>.
Hi,

The sentence below should be:

"Everything that I customized was setting HADOOP_HEAPSIZE 2000 in 
hadoop-env.sh (default heap size was was 1000MB)."

Sorry for the typo in the message. I still get the same error.

Kind regards,
Herbert

On 28.03.12 21:14, Prashant Kommireddi wrote:
> Did you set heap size to 0?
>
> Sent from my iPhone
>
> On Mar 28, 2012, at 12:12 PM, "Herbert Mühlburger"
> <he...@gmail.com>  wrote:
>
>> Hi,
>>
>> On 28.03.12 18:28, Jonathan Coveney wrote:
>>> - dev@pig
>>> + user@pig
>>
>> You are right, fits better to user@pig.
>>
>>> What command are you using to run this? Are you upping the max heap?
>>
>> I created a pig script wiki.pig with the following content:
>>
>> ====
>> register piggybank.jar;
>>
>> pages = load
>> '/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2'
>> using org.apache.pig.piggybank.storage.XMLLoader('page') as
>> (page:chararray);
>> pages = limit pages 1;
>> dump pages;
>> ====
>> and used the command:
>>
>>   % pig wiki.pig
>>
>> to run the pig script.
>>
>> I use current Hadoop 1.0.1. My version of PIG is checked out from trunk
>> and build by myself.
>>
>> Everything that I customized was setting HADOOP_HEAPSIZE 00 in
>> hadoop-env.sh (default heap size was was 1000MB).
>>
>> Kind regards,
>> Herbert
>>
>>>> 2012-03-28 14:50:23,145 [main] INFO org.apache.pig.backend.hadoop.**
>>>> executionengine.**mapReduceLayer.**MapReduceLauncher - 1% complete
>>>> 2012-03-28 14:50:29,206 [main] INFO org.apache.pig.backend.hadoop.**
>>>> executionengine.**mapReduceLayer.**MapReduceLauncher - 2% complete
>>>> 2012-03-28 14:50:38,288 [main] INFO org.apache.pig.backend.hadoop.**
>>>> executionengine.**mapReduceLayer.**MapReduceLauncher - 4% complete
>>>> 2012-03-28 14:53:17,686 [main] INFO org.apache.pig.backend.hadoop.**
>>>> executionengine.**mapReduceLayer.**MapReduceLauncher - 7% complete
>>>> 2012-03-28 14:53:41,529 [main] INFO org.apache.pig.backend.hadoop.**
>>>> executionengine.**mapReduceLayer.**MapReduceLauncher - 9% complete
>>>> 2012-03-28 14:55:05,775 [main] INFO org.apache.pig.backend.hadoop.**
>>>> executionengine.**mapReduceLayer.**MapReduceLauncher - 10% complete
>>>> 2012-03-28 14:55:32,685 [main] INFO org.apache.pig.backend.hadoop.**
>>>> executionengine.**mapReduceLayer.**MapReduceLauncher - 12% complete
>>>> 2012-03-28 14:56:21,754 [main] INFO org.apache.pig.backend.hadoop.**
>>>> executionengine.**mapReduceLayer.**MapReduceLauncher - 13% complete
>>>> 2012-03-28 14:58:36,797 [main] INFO org.apache.pig.backend.hadoop.**
>>>> executionengine.**mapReduceLayer.**MapReduceLauncher - job
>>>> job_201203281105_0009 has failed! Stop running all dependent jobs
>>>> 2012-03-28 14:58:36,799 [main] INFO org.apache.pig.backend.hadoop.**
>>>> executionengine.**mapReduceLayer.**MapReduceLauncher - 100% complete
>>>> 2012-03-28 14:58:36,850 [main] ERROR org.apache.pig.tools.pigstats.**SimplePigStats
>>>> - ERROR 2997: Unable to recreate exception from backed error: Error: Java
>>>> heap space
>>>> 2012-03-28 14:58:36,850 [main] ERROR org.apache.pig.tools.pigstats.**PigStatsUtil
>>>> - 1 map reduce job(s) failed!
>>>> 2012-03-28 14:58:36,854 [main] INFO org.apache.pig.tools.pigstats.**SimplePigStats
>>>> - Script Statistics:
>>>>
>>>> HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt
>>>>   Features
>>>> 1.0.1   0.11.0-SNAPSHOT herbert 2012-03-28 14:49:56     2012-03-28
>>>> 14:58:36     LIMIT
>>>>
>>>> Failed!
>>>>
>>>> Failed Jobs:
>>>> JobId   Alias   Feature Message Outputs
>>>> job_201203281105_0009   pages           Message: Job failed! Error - # of
>>>> failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask:
>>>> task_201203281105_0009_m_**000003 hdfs://localhost:9000/tmp/**
>>>> temp1813558187/tmp250990633,
>>>>
>>>> Input(s):
>>>> Failed to read data from "/user/herbert/enwiki-latest-**
>>>> pages-meta-history1.xml-**p000000010p000002162.bz2"
>>>>
>>>> Output(s):
>>>> Failed to produce result in "hdfs://localhost:9000/tmp/**
>>>> temp1813558187/tmp250990633"
>>>>
>>>> Counters:
>>>> Total records written : 0
>>>> Total bytes written : 0
>>>> Spillable Memory Manager spill count : 0
>>>> Total bags proactively spilled: 0
>>>> Total records proactively spilled: 0
>>>>
>>>> Job DAG:
>>>> job_201203281105_0009
>>>>
>>>>
>>>> 2012-03-28 14:58:36,855 [main] INFO org.apache.pig.backend.hadoop.**
>>>> executionengine.**mapReduceLayer.**MapReduceLauncher - Failed!
>>>> 2012-03-28 14:58:36,891 [main] ERROR org.apache.pig.tools.grunt.**Grunt -
>>>> ERROR 2997: Unable to recreate exception from backed error: Error: Java
>>>> heap space
>>>> Details at logfile: /Users/herbert/Documents/**
>>>> workspace/pig-wikipedia/pig_**1332938994693.log
>>>> pig wiki.pig  8,48s user 2,72s system 2% cpu 8:46,07 total
>>>>
>>>> ========>>
>>>> Thank you very much and kind reagards,
>>>> Herbert
>>>>
>>>
>>
>> --
>> ================================================================
>> Herbert Muehlburger  Software Development and Business Management
>>                                     Graz University of Technology
>> www.muehlburger.at                   www.twitter.com/hmuehlburger
>> ================================================================
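
The failure mode described above, a single <page> element too large for one task's heap, is easy to reproduce outside Pig: any loader that must buffer a whole matched element in memory will hit "Java heap space" once one element outgrows the JVM. A streaming parser that discards each element after handling it keeps memory roughly flat instead. The sketch below is not code from this thread; it assumes a bzip2-compressed dump and uses a hypothetical helper name:

```python
# Minimal sketch: stream <page> elements from a (bzip2-compressed) XML dump
# without holding more than one element's subtree in memory at a time.
import bz2
import xml.etree.ElementTree as ET

def iter_pages(path):
    """Yield each <page> element as an XML string, one at a time."""
    with bz2.open(path, "rb") as f:
        for event, elem in ET.iterparse(f, events=("end",)):
            # Real MediaWiki dumps namespace-qualify tags, e.g. "{...}page",
            # so match on the suffix rather than the exact tag name.
            if elem.tag.endswith("page"):
                yield ET.tostring(elem, encoding="unicode")
                elem.clear()  # free the element's subtree immediately
```

This only sidesteps the per-element buffering; a single multi-GB <page> would still have to fit in memory when serialized, which is the core problem the question raises.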

Re: XMLLoader does not work with BIG wikipedia dump

Posted by Prashant Kommireddi <pr...@gmail.com>.
Did you set heap size to 0?

Sent from my iPhone

On Mar 28, 2012, at 12:12 PM, "Herbert Mühlburger"
<he...@gmail.com> wrote:

> Hi,
>
> On 28.03.12 18:28, Jonathan Coveney wrote:
>> - dev@pig
>> + user@pig
>
> You are right, it fits better on user@pig.
>
>> What command are you using to run this? Are you upping the max heap?
>
> I created a pig script wiki.pig with the following content:
>
> ===
> register piggybank.jar;
>
> pages = load
> '/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2'
> using org.apache.pig.piggybank.storage.XMLLoader('page') as
> (page:chararray);
> pages = limit pages 1;
> dump pages;
> ===
> and used the command:
>
>  % pig wiki.pig
>
> to run the pig script.
>
> I use the current Hadoop 1.0.1. My version of Pig is checked out from trunk
> and built by myself.
>
> The only thing I customized was setting HADOOP_HEAPSIZE 00 in
> hadoop-env.sh (the default heap size was 1000MB).
>
> Kind regards,
> Herbert
>
>> 2012/3/28 Herbert Mühlburger<he...@gmail.com>
>>
>>> [...]
>
> --
> ================================================================
> Herbert Muehlburger  Software Development and Business Management
>                                    Graz University of Technology
> www.muehlburger.at                   www.twitter.com/hmuehlburger
> ================================================================
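
One detail worth separating in the exchange above: in Hadoop 1.x, HADOOP_HEAPSIZE in hadoop-env.sh sizes the Hadoop daemons and the local client JVM, while the map tasks that actually run XMLLoader take their heap from the job property mapred.child.java.opts. A sketch of the two knobs (the 2048 MB values are illustrative assumptions, not settings taken from this thread):

```shell
# hadoop-env.sh -- heap (in MB) for Hadoop daemons and the local client JVM
export HADOOP_HEAPSIZE=2048

# Heap for each spawned map/reduce task JVM; can be set in mapred-site.xml
# or passed per job on the command line, e.g.:
pig -Dmapred.child.java.opts=-Xmx2048m wiki.pig
```

Raising only HADOOP_HEAPSIZE would not help a map task that dies with "Java heap space".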

Re: XMLLoader does not work with BIG wikipedia dump

Posted by Jonathan Coveney <jc...@gmail.com>.
- dev@pig
+ user@pig

What command are you using to run this? Are you upping the max heap?

2012/3/28 Herbert Mühlburger <he...@gmail.com>

> Hi,
>
> I would like to use pig to work with wikipedia dump files. It works
> successfully with an input file of around 8GB of size, but not with too
> big xml element content.
>
> In my current case I would like to use the file
> "enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2" (around
> 2GB of compressed size) which can be found here:
>
> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2
>
> Is it possible that, due to the fact that the content of the <page></page>
> xml element could potentially become very large (several GB for instance),
> XMLLoader of Piggybank has problems loading elements split by <page>?
>
> Hopefully anybody could help me with this.
>
> [...]
>