Posted to user@nutch.apache.org by "Nemani, Raj" <Ra...@turner.com> on 2010/09/15 21:44:52 UTC

Unknown encoding for 'WinAnsiEncoding' when parsing PDF files using Tika

All,

Has anybody encountered the following error when parsing PDF files with
the Tika parser?  An online search seems to indicate that PDFBox should
support this encoding.  Am I doing something wrong?  Any help is
appreciated.

Thanks so much for your help in advance

Raj

 

----------------------------------------------------------------------

 

2010-09-14 23:19:19,630 WARN  util.PDFStreamEngine - java.io.IOException: Unknown encoding for 'WinAnsEncoding'
java.io.IOException: Unknown encoding for 'WinAnsEncoding'
        at org.apache.pdfbox.encoding.EncodingManager.getEncoding(EncodingManager.java:69)
        at org.apache.pdfbox.pdmodel.font.PDFont.getEncoding(PDFont.java:626)
        at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:438)
        at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:372)
        at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:552)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:248)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:207)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
        at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:79)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
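
Since the name in the exception is easy to misread, one way to check which
encoding names were actually rejected is to pull them out of the log text.
The following is only a minimal sketch, assuming the log lines contain the
message exactly as shown above; the class name UnknownEncodingScan and the
method unknownEncodings are hypothetical and not part of Nutch, Tika or
PDFBox.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnknownEncodingScan {
    // Matches the exception message seen in the log and captures the quoted name.
    private static final Pattern UNKNOWN_ENCODING =
            Pattern.compile("Unknown encoding for '([^']*)'");

    /** Returns the distinct names reported as unknown, in first-seen order. */
    static Set<String> unknownEncodings(List<String> logLines) {
        Set<String> names = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = UNKNOWN_ENCODING.matcher(line);
            while (m.find()) {
                names.add(m.group(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Two sample lines copied from the log in this message.
        List<String> lines = Arrays.asList(
                "2010-09-14 23:19:19,630 WARN  util.PDFStreamEngine - "
                        + "java.io.IOException: Unknown encoding for 'WinAnsEncoding'",
                "java.io.IOException: Unknown encoding for 'WinAnsEncoding'");
        System.out.println(unknownEncodings(lines));
    }
}
```

Run against the two log lines above, this reports the single name
WinAnsEncoding (without the "i"), which is not the same string as the
WinAnsiEncoding in the subject line.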


nutch crawling page question

Posted by Andy Cranfill <An...@careerbuilder.com>.
Hi All,

I am using Nutch for a new crawling project and have run into a quandary.  When I parse the HTML of a page, I need a datum from the page that linked to it.  That referring page contains a list of links, and I need to capture some data alongside each link so that the data is available when I parse the page the link points to.

Any ideas on how to pass the data from the referring page to the linked-to page?

Thanks!
Andy Cranfill

RE: Unknown encoding for 'WinAnsiEncoding' when parsing PDF files using Tika

Posted by "Nemani, Raj" <Ra...@turner.com>.
Ken,
Thank you so much.  I was reading the error completely differently!  I
never expected that the user could have typed a wrong encoding name.
That makes a lot of sense.

Thanks again
Raj


-----Original Message-----
From: Ken Krugler [mailto:kkrugler_lists@transpac.com] 
Sent: Thursday, September 16, 2010 12:39 PM
To: user@nutch.apache.org
Subject: Re: Unknown encoding for 'WinAnsiEncoding' when parsing PDF
files using Tika

Hi Raj,

On Sep 15, 2010, at 12:44pm, Nemani, Raj wrote:

> Did anybody encounter the following error with parsing PDF files using
> Tika parser?  Online search seems to indicate PDFBox should support this
> encoding.  Am I doing something wrong?  Any help is appreciated.

My guess is that somebody created this PDF using an invalid encoding
name.

If you want PDFBox to be more lenient about handling unknown
encodings, and/or allow this as a typo for "WinAnsiEncoding", then
please file an issue in the PDFBox Jira system [1].

Regards,

-- Ken

[1] https://issues.apache.org/jira/browse/PDFBOX


Re: Unknown encoding for 'WinAnsiEncoding' when parsing PDF files using Tika

Posted by Ken Krugler <kk...@transpac.com>.
Hi Raj,

On Sep 15, 2010, at 12:44pm, Nemani, Raj wrote:

> Did anybody encounter the following error with parsing PDF files using
> Tika parser?  Online search seems to indicate PDFBox should support this
> encoding.  Am I doing something wrong?  Any help is appreciated.

My guess is that somebody created this PDF using an invalid encoding
name.

If you want PDFBox to be more lenient about handling unknown
encodings, and/or allow this as a typo for "WinAnsiEncoding", then
please file an issue in the PDFBox Jira system [1].

Regards,

-- Ken

[1] https://issues.apache.org/jira/browse/PDFBOX
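
To illustrate what "more lenient" handling could look like, here is a rough
sketch that maps an unrecognized name to the closest predefined PDF encoding
name when it is within a small edit distance.  This is not how PDFBox
behaves and it uses no PDFBox API; the class LenientEncodingName, its
methods, and the maxEdits threshold are all hypothetical, and the list of
known names is assumed from the predefined encodings in the PDF
specification.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class LenientEncodingName {
    // Predefined encoding names, assumed from the PDF specification.
    static final List<String> KNOWN = Arrays.asList(
            "StandardEncoding", "MacRomanEncoding", "WinAnsiEncoding",
            "MacExpertEncoding", "PDFDocEncoding");

    /** Levenshtein edit distance, computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Returns the name itself if it is known; otherwise the single closest
     * known name if it is within maxEdits; otherwise empty.
     */
    static Optional<String> resolve(String name, int maxEdits) {
        if (KNOWN.contains(name)) {
            return Optional.of(name);
        }
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        boolean tie = false;
        for (String known : KNOWN) {
            int d = editDistance(name, known);
            if (d < bestDist) {
                best = known;
                bestDist = d;
                tie = false;
            } else if (d == bestDist) {
                tie = true; // ambiguous match, do not guess
            }
        }
        return (bestDist <= maxEdits && !tie) ? Optional.of(best) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(resolve("WinAnsEncoding", 1)); // the name from the log
        System.out.println(resolve("Bogus", 1));          // nothing close enough
    }
}
```

With a threshold of one edit, the name from the log resolves to
"WinAnsiEncoding", while a name that is not close to any known encoding
stays unresolved.  Whether silently correcting names like this is desirable
is exactly the kind of question a Jira issue would need to settle.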


--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






RE: Unknown encoding for 'WinAnsiEncoding' when parsing PDF files using Tika

Posted by "Nemani, Raj" <Ra...@turner.com>.
Also, the term "WinAnsEncoding" in the error does not seem correct.  It
should be "WinAnsiEncoding".  Is there a bug in Tika/PDFBox somewhere?

Can anybody (Julien, Markus, Chris, Andrzej) please shed more light on
this error?

Thanks
Raj
