Posted to user@pig.apache.org by Mohit Anchlia <mo...@gmail.com> on 2012/02/21 18:32:16 UTC

Help with XMLLoader

I am trying to use XMLLoader to process the files, but it doesn't seem to be
working. For the first pass I am just trying to dump all the contents,
but it reports that 0 records were found:

bash-3.2$ hadoop fs -cat /examples/testfile.txt

<abc><def></def><abc>

<abc><def></def><abc>

register 'pig-0.8.1-cdh3u3/contrib/piggybank/java/piggybank.jar'

raw = LOAD '/examples/testfile.txt' using
org.apache.pig.piggybank.storage.XMLLoader('<abc>') as (document:chararray);

dump raw;

2012-02-21 09:22:18,947 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 50% complete

2012-02-21 09:22:24,998 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 100% complete

2012-02-21 09:22:24,999 [main] INFO org.apache.pig.tools.pigstats.PigStats
- Script Statistics:

HadoopVersion PigVersion UserId StartedAt FinishedAt Features

0.20.2-cdh3u3 0.8.1-cdh3u3 hadoop 2012-02-21 09:22:12 2012-02-21 09:22:24
UNKNOWN

Success!

Job Stats (time in seconds):

JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime
MinReduceTime AvgReduceTime Alias Feature Outputs

job_201202201638_0012 1 0 2 2 2 0 0 0 raw MAP_ONLY
hdfs://dsdb1:54310/tmp/temp1968655187/tmp-358114646,

Input(s):

Successfully read 0 records (402 bytes) from: "/examples/testfile.txt"

Output(s):

Successfully stored 0 records in:
"hdfs://dsdb1:54310/tmp/temp1968655187/tmp-358114646"

Counters:

Total records written : 0

Total bytes written : 0

Spillable Memory Manager spill count : 0

Total bags proactively spilled: 0

Total records proactively spilled: 0

Job DAG:

job_201202201638_0012



2012-02-21 09:22:25,004 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Success!

2012-02-21 09:22:25,011 [main] INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths
to process : 1

2012-02-21 09:22:25,011 [main] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
paths to process : 1

grunt> quit

Re: Help with XMLLoader

Posted by Mohit Anchlia <mo...@gmail.com>.
It looks like when I have a big file it doesn't read the records. Is it
failing because of how the split occurs?


Re: Help with XMLLoader

Posted by Vivek Padmanabhan <pv...@yahoo-inc.com>.
Hi Mohit,
  We use XMLLoader for wiki data, which is around a 52 GB (uncompressed) file,
so I am not sure what is causing the problem here. Can you give it a try with Pig 0.9?
Thanks
Vivek



Re: Help with XMLLoader

Posted by Mohit Anchlia <mo...@gmail.com>.
Thanks! I got past this. But I am facing a different problem: when I have a
big file that is split across multiple nodes, Pig is not able to read the
records. It returns 0 records found.

I created a big file (2 GB) with lots of XML roots like the above, then did
hadoop fs -copyFromLocal bigfile /examples

But when I run the Pig script it returns 0 records. If I reduce the file to a
few MB it works fine. How can I resolve this?
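One workaround sometimes suggested for loaders that mishandle split input is
to force the whole file into a single input split, so one record reader sees
the entire content (at the cost of map-side parallelism). This is an untested
sketch; treat the property value and the file path as assumptions, not a
verified fix for this Pig/piggybank version:

```
-- Untested workaround sketch: raise the minimum split size above the
-- file size (value in bytes; ~4 GB here, larger than the 2 GB input)
-- so the file is not split across mappers.
set mapred.min.split.size 4294967296;

raw = LOAD '/examples/bigfile' using
      org.apache.pig.piggybank.storage.XMLLoader('abc') as (document:chararray);
```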


Re: Help with XMLLoader

Posted by Vivek Padmanabhan <pv...@yahoo-inc.com>.
Hi Mohit,
 XMLLoader looks for the start and end tags for the given string argument. In
the given input there are no end tags, so it reads 0 records.

Example: 
raw = LOAD 'sample_xml' using
org.apache.pig.piggybank.storage.XMLLoader('abc') as (document:chararray);
dump raw;

cat sample_xml
<abc><def></def></abc>
<abc><def></def></abc>
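To illustrate why the original input yields zero records, here is a minimal
Python sketch (not the actual XMLLoader code; `extract_records` is a
hypothetical helper) that approximates start/end-tag record matching:

```python
import re

def extract_records(text, tag):
    # Grab every <tag>...</tag> span, mimicking how a start/end-tag
    # matcher delimits records. Non-greedy so each record stops at the
    # first closing tag; DOTALL lets a record span line breaks.
    pattern = re.compile(r"<{0}>.*?</{0}>".format(re.escape(tag)), re.DOTALL)
    return pattern.findall(text)

# Original input: <abc> is never closed, so no complete record exists.
bad_input = "<abc><def></def><abc>\n<abc><def></def><abc>\n"
# Corrected input: each line is a complete <abc>...</abc> record.
good_input = "<abc><def></def></abc>\n<abc><def></def></abc>\n"

print(len(extract_records(bad_input, "abc")))   # 0
print(len(extract_records(good_input, "abc")))  # 2
```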

Thanks
Vivek