Posted to common-user@hadoop.apache.org by "Zak, Richard [USA]" <za...@bah.com> on 2009/01/02 21:50:36 UTC

Concatenating PDF files

All, I have a project that I am working on involving PDF files in HDFS.
There are X number of directories and each directory contains Y number
of PDFs, and per directory all the PDFs are to be concatenated.  At the
moment I am running a test with 5 directories and 15 PDFs in each
directory.  I am also using iText to handle the PDFs, and I wrote a
wrapper class to take PDFs and add them to an internal PDF that grows. I
am running this on Amazon's EC2 using Extra Large instances, which have
a total of 15 GB RAM.  Each Java process, two per Instance, has 7GB
maximum (-Xmx7000m).  There is one Master Instance and 4 Slave
instances.  I am able to confirm that the Slave processes are connected
to the Master and have been working.  I am using Hadoop 0.19.0.
 
The problem is that I run out of memory when the concatenation class
reads in a PDF.  I have tried both the iText library version 2.1.4 and
the Faceless PDF library, and both have the error in the middle of
concatenating the documents.  I looked into Multivalent, but that one
just uses Strings to determine paths and it opens the files directly,
while I am using a wrapper class to interact with items in HDFS, so
Multivalent is out.
 
Since the PDFs aren't enormous (17 MB or less) and each Instance has
tons of memory, why am I running out of memory?
 
The mapper works like this.  It gets a text file with a list of
directories, and per directory it reads in the contents and adds them to
the concatenation class.  The reducer pretty much does nothing.  Is this
the best way to do this, or is there a better way?
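
In outline, the mapper is something like the sketch below (trimmed down, not
the exact class; the names are made up, and in the real code the iText calls
live inside the wrapper class I mentioned):

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

import com.lowagie.text.Document;
import com.lowagie.text.DocumentException;
import com.lowagie.text.pdf.PdfCopy;
import com.lowagie.text.pdf.PdfReader;

// Each input record is one line of the driving text file, i.e. one HDFS
// directory; the mapper merges every PDF in that directory into one PDF.
public class PdfConcatMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private FileSystem fs;

  public void configure(JobConf conf) {
    try {
      fs = FileSystem.get(conf);   // shared, cached FileSystem instance
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    Path dir = new Path(value.toString());                 // one directory per line
    Path merged = new Path("/" + dir.getName() + ".pdf");  // e.g. /51.pdf

    Document document = new Document();
    try {
      PdfCopy copy = new PdfCopy(document, fs.create(merged, true));
      document.open();
      for (FileStatus stat : fs.listStatus(dir)) {
        PdfReader reader = new PdfReader(fs.open(stat.getPath()));
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
          copy.addPage(copy.getImportedPage(reader, i));
        }
        reader.close();        // release this document's pages
        reporter.progress();   // keep the task from being killed as hung
      }
      document.close();        // flushes and closes the HDFS output stream
    } catch (DocumentException e) {
      throw new IOException("Failed merging " + dir + ": " + e.getMessage());
    }
    output.collect(new Text(dir.getName()), new Text(merged.toString()));
  }
}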
 
Thank you!
 
Richard J. Zak
 

RE: Concatenating PDF files

Posted by "Zak, Richard [USA]" <za...@bah.com>.
I was able to process 100 PDFs in 4 directories.  Now I have moved up to
500 PDFs (started with 700 and I'm working backwards) in 6 directories,
and I am getting this error in the console:

09/01/07 14:04:41 INFO mapred.JobClient: Task Id :
attempt_200812311556_0034_m_000000_0, Status : FAILED
java.io.IOException: Task process exit with nonzero status of 255.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:424)



2 of the 6 directories are able to get their PDFs concatenated, while
the rest aren't, and there isn't any other error message except this in
the logs:

Exception closing file /user/root/output/_temporary/_attempt_200812311556_0032_m_000000_2/part-00000.deflate
java.io.IOException: Filesystem closed
	at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198)
	at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3084)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3053)
	at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.close(DFSClient.java:942)
	at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:210)
	at org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:243)
	at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1413)
	at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:236)
	at org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:221)
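
For what it's worth, the HDFS side of the wrapper boils down to roughly this
(a simplified sketch, not the real class; the names are made up):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPdfSource {

  private final FileSystem fs;

  public HdfsPdfSource(Configuration conf) throws IOException {
    // FileSystem.get(conf) hands back a shared, cached instance -- the same
    // one the task framework uses.  Closing it explicitly while the task's
    // output stream is still open is one way to end up with the
    // "java.io.IOException: Filesystem closed" error above.
    fs = FileSystem.get(conf);
  }

  public byte[] readPdf(Path path) throws IOException {
    FSDataInputStream in = fs.open(path);
    try {
      byte[] data = new byte[(int) fs.getFileStatus(path).getLen()];
      in.readFully(data);   // the PDFs are at most ~17 MB each
      return data;
    } finally {
      in.close();           // close the stream, never fs itself
    }
  }
}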

Is there a way to fix this, or at least make Hadoop not care about this
error?  Also, I don't care about the output directory, as my output is
written to the root of the HDFS.  How can I find out why 4 of the
directories aren't working?

It's also kind of weird that the map progress goes backwards:

attempt_200812311556_0034_m_000000_1: Output will be /51.pdf
attempt_200812311556_0034_m_000000_1: Successfully concatenated 22143
pages from 501 PDFs for 51
attempt_200812311556_0034_m_000000_1: Output will be /52.pdf
attempt_200812311556_0034_m_000000_1: Reached loop limit 501
09/01/07 14:06:58 INFO mapred.JobClient:  map 16% reduce 0%
09/01/07 14:07:48 INFO mapred.JobClient:  map 0% reduce 0%
09/01/07 14:07:48 INFO mapred.JobClient: Task Id :
attempt_200812311556_0034_m_000001_1, Status : FAILED
java.io.IOException: Task process exit with nonzero status of 255.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:424)

attempt_200812311556_0034_m_000001_1: Output will be /54.pdf
attempt_200812311556_0034_m_000001_1: Successfully concatenated 20442
pages from 501 PDFs for 54
attempt_200812311556_0034_m_000001_1: Output will be /55.pdf
attempt_200812311556_0034_m_000001_1: Reached loop limit 501
09/01/07 14:08:12 INFO mapred.JobClient:  map 16% reduce 0%
09/01/07 14:08:50 INFO mapred.JobClient:  map 0% reduce 0%
09/01/07 14:08:50 INFO mapred.JobClient: Task Id :
attempt_200812311556_0034_m_000000_2, Status : FAILED
java.io.IOException: Task process exit with nonzero status of 255.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:424)

attempt_200812311556_0034_m_000000_2: Output will be /51.pdf
attempt_200812311556_0034_m_000000_2: Successfully concatenated 22143
pages from 501 PDFs for 51
attempt_200812311556_0034_m_000000_2: Output will be /52.pdf
attempt_200812311556_0034_m_000000_2: Reached loop limit 501
09/01/07 14:09:51 INFO mapred.JobClient:  map 16% reduce 0%
09/01/07 14:10:34 INFO mapred.JobClient:  map 33% reduce 0%
09/01/07 14:10:41 INFO mapred.JobClient:  map 16% reduce 0%
09/01/07 14:10:41 INFO mapred.JobClient: Task Id :
attempt_200812311556_0034_m_000001_2, Status : FAILED
java.io.IOException: Task process exit with nonzero status of 255.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:424)

attempt_200812311556_0034_m_000001_2: Output will be /54.pdf
attempt_200812311556_0034_m_000001_2: Successfully concatenated 20442
pages from 501 PDFs for 54
attempt_200812311556_0034_m_000001_2: Output will be /55.pdf
attempt_200812311556_0034_m_000001_2: Reached loop limit 501
09/01/07 14:11:06 INFO mapred.JobClient:  map 0% reduce 0%

Thank you!

Richard J. Zak

-----Original Message-----
From: Zak, Richard [USA] 
Sent: Tuesday, January 06, 2009 17:31
To: 'core-user@hadoop.apache.org'
Subject: RE: Concatenating PDF files

Thank you very much Tom, that seems to have done the trick!

conf.set("mapred.child.java.opts","-Xmx7000m");
conf.setNumReduceTasks(0);

And I was able to churn through 4 directories each with 100 PDFs.  And
yes, from ps I could see that the processes were using the "-Xmx7000m"
option.


Richard J. Zak

-----Original Message-----
From: Tom White [mailto:tom@cloudera.com]
Sent: Monday, January 05, 2009 06:47
To: core-user@hadoop.apache.org
Subject: Re: Concatenating PDF files

Hi Richard,

Are you running out of memory after many PDFs have been processed by one
mapper, or during the first? The former would suggest that memory isn't
being released; the latter that the task VM doesn't have enough memory
to start with.

Are you setting the memory available to map tasks by setting
mapred.child.java.opts? You can try to see how much memory the processes
are using by logging into a machine when the job is running and running
'top' or 'ps'.

It won't help the memory problems, but it sounds like you could run with
zero reducers for this job (conf.setNumReduceTasks(0)). Also, EC2 XL
instances can run more than two tasks per node (they have 4 virtual
cores, see http://aws.amazon.com/ec2/instance-types/). And you should
configure them to take advantage of multiple disks -
https://issues.apache.org/jira/browse/HADOOP-4745.

Tom

On Fri, Jan 2, 2009 at 8:50 PM, Zak, Richard [USA] <za...@bah.com>
wrote:
> All, I have a project that I am working on involving PDF files in HDFS.
> There are X number of directories and each directory contains Y number
> of PDFs, and per directory all the PDFs are to be concatenated.  At 
> the moment I am running a test with 5 directories and 15 PDFs in each 
> directory.  I am also using iText to handle the PDFs, and I wrote a 
> wrapper class to take PDFs and add them to an internal PDF that grows.
> I am running this on Amazon's EC2 using Extra Large instances, which 
> have a total of 15 GB RAM.  Each Java process, two per Instance, has 
> 7GB maximum (-Xmx7000m).  There is one Master Instance and 4 Slave 
> instances.  I am able to confirm that the Slave processes are 
> connected to the Master and have been working.  I am using Hadoop 0.19.0.
>
> The problem is that I run out of memory when the concatenation class 
> reads in a PDF.  I have tried both the iText library version 2.1.4 and
> the Faceless PDF library, and both have the error in the middle of 
> concatenating the documents.  I looked into Multivalent, but that one 
> just uses Strings to determine paths and it opens the files directly, 
> while I am using a wrapper class to interact with items in HDFS, so 
> Multivalent is out.
>
> Since the PDFs aren't enormous (17 MB or less) and each Instance has
> tons of memory, why am I running out of memory?
>
> The mapper works like this.  It gets a text file with a list of 
> directories, and per directory it reads in the contents and adds them 
> to the concatenation class.  The reducer pretty much does nothing.  Is
> this the best way to do this, or is there a better way?
>
> Thank you!
>
> Richard J. Zak
>
>

RE: Concatenating PDF files

Posted by "Zak, Richard [USA]" <za...@bah.com>.
Thank you very much Tom, that seems to have done the trick!

conf.set("mapred.child.java.opts","-Xmx7000m"); 
conf.setNumReduceTasks(0);

And I was able to churn through 4 directories each with 100 PDFs.  And yes,
from ps I could see that the processes were using the "-Xmx7000m" option.
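
For the archives, the relevant part of the driver now looks roughly like this
(simplified; class names are made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class PdfConcatJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(PdfConcatJob.class);
    conf.setJobName("pdf-concat");

    conf.set("mapred.child.java.opts", "-Xmx7000m");  // big heap for the task JVMs
    conf.setNumReduceTasks(0);                        // map-only, no reduce phase

    conf.setMapperClass(PdfConcatMapper.class);       // the concatenation mapper
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // Input: the text file listing one HDFS directory per line.
    // Output: a throwaway directory; the merged PDFs are written to / directly.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}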


Richard J. Zak

-----Original Message-----
From: Tom White [mailto:tom@cloudera.com] 
Sent: Monday, January 05, 2009 06:47
To: core-user@hadoop.apache.org
Subject: Re: Concatenating PDF files

Hi Richard,

Are you running out of memory after many PDFs have been processed by one
mapper, or during the first? The former would suggest that memory isn't
being released; the latter that the task VM doesn't have enough memory to
start with.

Are you setting the memory available to map tasks by setting
mapred.child.java.opts? You can try to see how much memory the processes are
using by logging into a machine when the job is running and running 'top' or
'ps'.

It won't help the memory problems, but it sounds like you could run with
zero reducers for this job (conf.setNumReduceTasks(0)). Also, EC2 XL
instances can run more than two tasks per node (they have 4 virtual cores,
see http://aws.amazon.com/ec2/instance-types/). And you should configure
them to take advantage of multiple disks -
https://issues.apache.org/jira/browse/HADOOP-4745.

Tom

On Fri, Jan 2, 2009 at 8:50 PM, Zak, Richard [USA] <za...@bah.com>
wrote:
> All, I have a project that I am working on involving PDF files in HDFS.
> There are X number of directories and each directory contains Y number 
> of PDFs, and per directory all the PDFs are to be concatenated.  At 
> the moment I am running a test with 5 directories and 15 PDFs in each 
> directory.  I am also using iText to handle the PDFs, and I wrote a 
> wrapper class to take PDFs and add them to an internal PDF that grows. 
> I am running this on Amazon's EC2 using Extra Large instances, which 
> have a total of 15 GB RAM.  Each Java process, two per Instance, has 
> 7GB maximum (-Xmx7000m).  There is one Master Instance and 4 Slave 
> instances.  I am able to confirm that the Slave processes are 
> connected to the Master and have been working.  I am using Hadoop 0.19.0.
>
> The problem is that I run out of memory when the concatenation class 
> reads in a PDF.  I have tried both the iText library version 2.1.4 and 
> the Faceless PDF library, and both have the error in the middle of 
> concatenating the documents.  I looked into Multivalent, but that one 
> just uses Strings to determine paths and it opens the files directly, 
> while I am using a wrapper class to interact with items in HDFS, so 
> Multivalent is out.
>
> Since the PDFs aren't enormous (17 MB or less) and each Instance has 
> tons of memory, why am I running out of memory?
>
> The mapper works like this.  It gets a text file with a list of 
> directories, and per directory it reads in the contents and adds them 
> to the concatenation class.  The reducer pretty much does nothing.  Is 
> this the best way to do this, or is there a better way?
>
> Thank you!
>
> Richard J. Zak
>
>

Re: Concatenating PDF files

Posted by Tom White <to...@cloudera.com>.
Hi Richard,

Are you running out of memory after many PDFs have been processed by
one mapper, or during the first? The former would suggest that memory
isn't being released; the latter that the task VM doesn't have enough
memory to start with.

Are you setting the memory available to map tasks by setting
mapred.child.java.opts? You can try to see how much memory the
processes are using by logging into a machine when the job is running
and running 'top' or 'ps'.
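
If logging in to the nodes is awkward, another option is to have each task
print its own heap ceiling to its stdout log (viewable from the web UI); a
quick sketch, nothing Hadoop-specific beyond reading the property:

import org.apache.hadoop.mapred.JobConf;

public class HeapCheck {
  // Call this from the mapper's configure() method.
  public static void logHeap(JobConf conf) {
    long maxHeapMb = Runtime.getRuntime().maxMemory() >> 20;
    System.out.println("mapred.child.java.opts = "
        + conf.get("mapred.child.java.opts")
        + ", JVM max heap = " + maxHeapMb + " MB");
  }
}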

It won't help the memory problems, but it sounds like you could run
with zero reducers for this job (conf.setNumReduceTasks(0)). Also, EC2
XL instances can run more than two tasks per node (they have 4 virtual
cores, see http://aws.amazon.com/ec2/instance-types/). And you should
configure them to take advantage of multiple disks -
https://issues.apache.org/jira/browse/HADOOP-4745.

Tom

On Fri, Jan 2, 2009 at 8:50 PM, Zak, Richard [USA] <za...@bah.com> wrote:
> All, I have a project that I am working on involving PDF files in HDFS.
> There are X number of directories and each directory contains Y number
> of PDFs, and per directory all the PDFs are to be concatenated.  At the
> moment I am running a test with 5 directories and 15 PDFs in each
> directory.  I am also using iText to handle the PDFs, and I wrote a
> wrapper class to take PDFs and add them to an internal PDF that grows. I
> am running this on Amazon's EC2 using Extra Large instances, which have
> a total of 15 GB RAM.  Each Java process, two per Instance, has 7GB
> maximum (-Xmx7000m).  There is one Master Instance and 4 Slave
> instances.  I am able to confirm that the Slave processes are connected
> to the Master and have been working.  I am using Hadoop 0.19.0.
>
> The problem is that I run out of memory when the concatenation class
> reads in a PDF.  I have tried both the iText library version 2.1.4 and
> the Faceless PDF library, and both have the error in the middle of
> concatenating the documents.  I looked into Multivalent, but that one
> just uses Strings to determine paths and it opens the files directly,
> while I am using a wrapper class to interact with items in HDFS, so
> Multivalent is out.
>
> Since the PDFs aren't enormous (17 MB or less) and each Instance has
> tons of memory, why am I running out of memory?
>
> The mapper works like this.  It gets a text file with a list of
> directories, and per directory it reads in the contents and adds them to
> the concatenation class.  The reducer pretty much does nothing.  Is this
> the best way to do this, or is there a better way?
>
> Thank you!
>
> Richard J. Zak
>
>