Posted to dev@nutch.apache.org by "Stefan Groschupf (JIRA)" <ji...@apache.org> on 2005/08/25 11:13:10 UTC

[jira] Created: (NUTCH-85) pdf parser caused fetcher hangs.

pdf parser caused fetcher hangs.
--------------------------------

         Key: NUTCH-85
         URL: http://issues.apache.org/jira/browse/NUTCH-85
     Project: Nutch
        Type: Bug
  Components: fetcher  
    Versions: 0.7, 0.8-dev    
    Reporter: Stefan Groschupf
     Fix For: 0.8-dev


We noticed fetcher hangs caused by PDFBox.
A thread that handles PDF parsing may hang and is then never available again.
This happens as many times as there are active threads, and then the complete fetch process hangs.
 


Full thread dump Java HotSpot(TM) Client VM (1.4.2_08-b03 mixed mode):

"fetcher160" prio=1 tid=0x083c9720 nid=0x16de runnable [b1669000..b166a238]
	at org.pdfbox.cmaptypes.CMap.addMapping(CMap.java:119)
	at org.pdfbox.cmapparser.CMapParser.parse(CMapParser.java:183)
	at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:532)
	at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:358)
	at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:261)
	at org.pdfbox.util.operator.ShowText.process(ShowText.java:63)
	at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:405)
	at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:385)
	at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:168)
	at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:232)
	at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:205)
	at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:180)
	at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:108)
	at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:123)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:239)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)



"fetcher82" prio=1 tid=0xb4637d78 nid=0x59aa runnable [b4379000..b437a238]
	at java.nio.charset.CoderResult$1.create(CoderResult.java:207)
	at java.nio.charset.CoderResult$Cache.get(CoderResult.java:196)
	- locked <0xb94fa908> (a java.nio.charset.CoderResult$1)
	at java.nio.charset.CoderResult$Cache.access$200(CoderResult.java:178)
	at java.nio.charset.CoderResult.malformedForLength(CoderResult.java:217)
	at sun.nio.cs.UnicodeDecoder.decodeLoop(UnicodeDecoder.java:71)
	at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:538)
	at java.lang.StringCoding$CharsetSD.decode(StringCoding.java:192)
	at java.lang.StringCoding.decode(StringCoding.java:230)
	at java.lang.String.<init>(String.java:320)
	at java.lang.String.<init>(String.java:346)
	at org.pdfbox.cmapparser.CMapParser.createStringFromBytes(CMapParser.java:230)
	at org.pdfbox.cmapparser.CMapParser.parse(CMapParser.java:182)
	at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:532)
	at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:358)
	at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:261)
	at org.pdfbox.util.operator.ShowText.process(ShowText.java:63)
	at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:405)
	at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:385)
	at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:168)
	at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:232)
	at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:205)
	at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:180)
	at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:108)
	at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:123)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:239)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
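
One way to keep a single hung parse from permanently occupying a fetcher thread is to run the parse on a helper thread and wait for it with a timeout. The following is only a minimal sketch under stated assumptions, not what Nutch or PDFBox actually does: the class ParseGuard and the method callWithTimeout are hypothetical names, and the code uses java.util.concurrent, which is not available on the 1.4.2 VM shown in the dump above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration; not part of Nutch or PDFBox. */
class ParseGuard {
    // Daemon threads, so a parse that never returns cannot keep the JVM alive.
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "parse-guard");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs the task on a helper thread and waits at most timeoutMillis.
     * Returns fallback if the task has not finished by then.
     */
    <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback)
            throws ExecutionException, InterruptedException {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only requests interruption; code that never
            // checks the interrupt flag will keep running on the helper thread.
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) throws Exception {
        ParseGuard guard = new ParseGuard();
        // Stand-in for a parser call that takes far too long.
        Callable<String> stuckParse = () -> {
            Thread.sleep(60_000);
            return "text";
        };
        String text = guard.callWithTimeout(stuckParse, 200, null);
        System.out.println(text == null ? "parse timed out" : text);
    }
}
```

The calling thread is released after the timeout, but this does not stop the parser itself. Both threads in the dump are in the runnable state inside CMapParser.parse, so if the parser is spinning in a loop that ignores interrupts, the helper thread stays busy until the JVM exits. The guard limits the damage; the underlying bug still has to be fixed in the parser.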



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Closed: (NUTCH-85) pdf parser caused fetcher hangs.

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-85?page=all ]
     
Andrzej Bialecki  closed NUTCH-85:
----------------------------------

    Resolution: Fixed

The parser has been updated to use PDFBox-0.7.2, which should solve this issue. Please re-open if that's not the case.


Re: [jira] Commented: (NUTCH-85) pdf parser caused fetcher hangs.

Posted by Andrzej Bialecki <ab...@getopt.org>.

EM wrote:
> So, just replace PDFBox-0.7.2-dev.jar from the plugin directory with the
> PDFBox-0.7.2-dev-20050825.jar (Renaming the file of course.) ? 

Yes.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

RE: [jira] Commented: (NUTCH-85) pdf parser caused fetcher hangs.

Posted by EM <em...@cpuedge.com>.

So, just replace PDFBox-0.7.2-dev.jar from the plugin directory with the
PDFBox-0.7.2-dev-20050825.jar (Renaming the file of course.) ? 

Regards,
EM



-----Original Message-----
From: Andrzej Bialecki (JIRA) [mailto:jira@apache.org] 
Sent: Thursday, August 25, 2005 6:28 AM
To: nutch-dev@incubator.apache.org
Subject: [jira] Commented: (NUTCH-85) pdf parser caused fetcher hangs.

    [ http://issues.apache.org/jira/browse/NUTCH-85?page=comments#action_12319987 ]

Andrzej Bialecki  commented on NUTCH-85:
----------------------------------------

This has been reported and fixed in the newer versions of PDFBox. These
versions haven't been released yet as an official release, so I decided not
to bring this into our repository, until it's released. In the meantime, you
can avoid this issue by replacing the PDFBox library with the nightly build
downloaded from http://www.pdfbox.org/dist/ .


[jira] Commented: (NUTCH-85) pdf parser caused fetcher hangs.

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-85?page=comments#action_12319987 ] 

Andrzej Bialecki  commented on NUTCH-85:
----------------------------------------

This has been reported and fixed in the newer versions of PDFBox. These versions haven't been released yet as an official release, so I decided not to bring this into our repository, until it's released. In the meantime, you can avoid this issue by replacing the PDFBox library with the nightly build downloaded from http://www.pdfbox.org/dist/ .
