Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2005/09/20 09:10:29 UTC
[jira] Closed: (NUTCH-85) pdf parser caused fetcher hangs.
[ http://issues.apache.org/jira/browse/NUTCH-85?page=all ]
Andrzej Bialecki closed NUTCH-85:
----------------------------------
Resolution: Fixed
The parser has been updated to use PDFBox-0.7.2, which should solve this issue. Please re-open if that's not the case.
> pdf parser caused fetcher hangs.
> --------------------------------
>
> Key: NUTCH-85
> URL: http://issues.apache.org/jira/browse/NUTCH-85
> Project: Nutch
> Type: Bug
> Components: fetcher
> Versions: 0.7, 0.8-dev
> Reporter: Stefan Groschupf
> Fix For: 0.8-dev
>
> We have noticed fetcher hangs caused by PDFBox.
> A thread that is parsing a PDF may hang and never become available again.
> This repeats until every active thread is stuck, and then the complete fetch process hangs.
>
> Full thread dump Java HotSpot(TM) Client VM (1.4.2_08-b03 mixed mode):
> "fetcher160" prio=1 tid=0x083c9720 nid=0x16de runnable [b1669000..b166a238]
> at org.pdfbox.cmaptypes.CMap.addMapping(CMap.java:119)
> at org.pdfbox.cmapparser.CMapParser.parse(CMapParser.java:183)
> at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:532)
> at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:358)
> at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:261)
> at org.pdfbox.util.operator.ShowText.process(ShowText.java:63)
> at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:405)
> at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:385)
> at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:168)
> at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:232)
> at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:205)
> at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:180)
> at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:108)
> at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:123)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:239)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
> "fetcher82" prio=1 tid=0xb4637d78 nid=0x59aa runnable [b4379000..b437a238]
> at java.nio.charset.CoderResult$1.create(CoderResult.java:207)
> at java.nio.charset.CoderResult$Cache.get(CoderResult.java:196)
> - locked <0xb94fa908> (a java.nio.charset.CoderResult$1)
> at java.nio.charset.CoderResult$Cache.access$200(CoderResult.java:178)
> at java.nio.charset.CoderResult.malformedForLength(CoderResult.java:217)
> at sun.nio.cs.UnicodeDecoder.decodeLoop(UnicodeDecoder.java:71)
> at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:538)
> at java.lang.StringCoding$CharsetSD.decode(StringCoding.java:192)
> at java.lang.StringCoding.decode(StringCoding.java:230)
> at java.lang.String.<init>(String.java:320)
> at java.lang.String.<init>(String.java:346)
> at org.pdfbox.cmapparser.CMapParser.createStringFromBytes(CMapParser.java:230)
> at org.pdfbox.cmapparser.CMapParser.parse(CMapParser.java:182)
> at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:532)
> at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:358)
> at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:261)
> at org.pdfbox.util.operator.ShowText.process(ShowText.java:63)
> at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:405)
> at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:385)
> at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:168)
> at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:232)
> at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:205)
> at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:180)
> at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:108)
> at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:123)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:239)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
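The dumps above show fetcher threads stuck inside the parse call itself, so each hung PDF permanently costs one fetcher thread. Independent of the PDFBox upgrade, one way to contain this is to run the parse on a separate worker and stop waiting after a deadline. The following is only a minimal sketch under stated assumptions: it is not Nutch or PDFBox code, the names BoundedParse and runWithTimeout are hypothetical, and it needs java.util.concurrent and lambdas (Java 8 or later), which the 1.4.2 VM in the report does not have.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Nutch: bounds how long a caller waits for a task.
public class BoundedParse {
    // Daemon workers, so a parse that never returns cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "parse-worker");
        t.setDaemon(true);
        return t;
    });

    /** Returns the task's result, or fallback if it fails or exceeds timeoutMillis. */
    public static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        Future<T> future = POOL.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker; a CPU-bound loop that
            // never checks the interrupt flag will keep running in the background.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The task itself threw an exception.
            return fallback;
        } catch (InterruptedException e) {
            // The calling thread was interrupted while waiting; preserve the flag.
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) {
        // A task that finishes in time returns its own result.
        String ok = runWithTimeout(() -> "parsed text", 1000, "");
        // A stand-in for a hung parse: the caller gets the fallback after 100 ms.
        String hung = runWithTimeout(() -> { Thread.sleep(60_000); return "never"; }, 100, "");
        System.out.println("ok=" + ok + " hung=" + hung);
    }
}
```

This frees the calling thread but does not stop the stuck work: a parser spinning in a tight loop, as in the dumps above, would ignore the interrupt and keep using CPU, and the cached pool would keep adding workers. Fixing the parser, as done here by moving to PDFBox-0.7.2, is still the real remedy.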