Posted to user@tika.apache.org by Morkus <mo...@protonmail.com> on 2018/08/06 17:27:34 UTC

PDF Extraction Failed for scientific document

Hello all,

For the first time ever, a PDF I tried to extract with Tika failed.

A scientific article with lots of symbols and such, by these authors:

Beyond the Words: Predicting User Personality from

Heterogeneous Information

Honghao Weiy;, Fuzheng Zhangy, Nicholas Jing Yuanz,

Chuan Caoz, Hao Fuz, Xing Xiey, Yong Ruiy, Wei-Ying May

yMicrosoft ResearchzMicrosoft

Department of Computer Science and Technology, Tsinghua University

weihh12@mails.tsinghua.edu.cn,

{fuzzhang, nicholas.yuan, chcao, fuha, xingx, yongrui, wyma}@microsoft.com

------------

I have tika-core 1.18 and tika-parsers 1.18.

Is it unusual to have a failed PDF translation?

Suggestions?

I can include the PDF in an email, but wanted to ask first.

Thanks!

Sent from [ProtonMail](https://protonmail.com), Swiss-based encrypted email.

Re: PDF Extraction Failed for scientific document

Posted by Chris Mattmann <ma...@apache.org>.

Try this as well: http://wiki.apache.org/tika/GrobidJournalParser

From: Tim Allison <ta...@apache.org>
Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
Date: Monday, August 6, 2018 at 10:52 AM
To: "user@tika.apache.org" <us...@tika.apache.org>, "morkus@protonmail.com" <mo...@protonmail.com>
Subject: Re: PDF Extraction Failed for scientific document

Well...um...it isn't common, but it does happen, and PDFs are
notoriously bad transport containers for text.

Some things are fixable, and some things aren't.

I downloaded this pdf:
https://www.microsoft.com/en-us/research/wp-content/uploads/2017/01/WSDM_personality.pdf
I opened it in AdobeDC and "saved as text".  There are some definite,
um, areas for improvement.

Typically, if Adobe didn't do a good job, then we can assume that
there are some underlying, er, features that we can't expect Tika or
PDFBox to fix.  Adobe has problems with spacing: "isa
psychologicallexicon,hasbeenusedtoevaluate user personality".  This
does happen with PDFs because sometimes spaces aren't stored, but
rather are calculated based on font widths etc.

When I compared the output with Tika, it looks like we (and PDFBox!)
are actually doing better in this case and several others.
Tika-eval reports 7215 tokens extracted by Tika and 6473 tokens
extracted by Adobe, with a drop of 374 "common tokens" in Adobe.  In
short, our extract has more common words in it than Adobe does.

And "where Ti;m;Ei;n;Ai;o;Si;p
representsan instanceofatweet"  suggests that there are no Unicode
equivalents stored in the PDF for some fonts.

PDFBox notes: "WARN  No Unicode mapping for summationdisplay (88) in
font RBRLOC+CMEX9"

Re: PDF Extraction Failed for scientific document

Posted by Robert Neal Clayton <ro...@gmail.com>.

Speaking of scientific papers, you’re right, you are doing a better job than most. A couple of German comp-sci professors have done a study and published it ;)

http://ad-publications.informatik.uni-freiburg.de/benchmark.pdf



Re: PDF Extraction Failed for scientific document

Posted by Tim Allison <ta...@apache.org>.

Well...um...it isn't common, but it does happen, and PDFs are
notoriously bad transport containers for text.

Some things are fixable, and some things aren't.

I downloaded this pdf:
https://www.microsoft.com/en-us/research/wp-content/uploads/2017/01/WSDM_personality.pdf
I opened it in AdobeDC and "saved as text".  There are some definite,
um, areas for improvement.

Typically, if Adobe didn't do a good job, then we can assume that
there are some underlying, er, features that we can't expect Tika or
PDFBox to fix.  Adobe has problems with spacing: "isa
psychologicallexicon,hasbeenusedtoevaluate user personality".  This
does happen with PDFs because sometimes spaces aren't stored, but
rather are calculated based on font widths etc.
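
As a rough illustration of the spacing issue, the Python sketch below rebuilds a line from positioned glyphs and inserts a space only when the horizontal gap between two neighbours exceeds a fraction of their average width. The `Glyph` class, the `rebuild_line` function, and the `gap_ratio` default of 0.3 are assumptions made for this example; this is not the algorithm used by PDFBox, Tika, or Adobe.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the page
    width: float  # advance width of the glyph


def rebuild_line(glyphs, gap_ratio=0.3):
    """Join glyphs into text, adding a space where the gap looks word-sized."""
    if not glyphs:
        return ""
    out = [glyphs[0].char]
    for prev, cur in zip(glyphs, glyphs[1:]):
        # Distance between the end of the previous glyph and the start of this one
        gap = cur.x - (prev.x + prev.width)
        # Hypothetical threshold: a fraction of the neighbours' average width
        threshold = gap_ratio * (prev.width + cur.width) / 2
        if gap > threshold:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


loose = [Glyph("i", 0, 3), Glyph("s", 3, 5), Glyph("a", 10, 5)]
tight = [Glyph("i", 0, 3), Glyph("s", 3, 5), Glyph("a", 9, 5)]
print(rebuild_line(loose))
print(rebuild_line(tight))
```

When the text is set tightly, the gap falls under the threshold and the words run together, which is the kind of output quoted above ("isa"). Any such threshold is a trade-off: lowering it recovers more spaces but can also split words.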

When I compared the output with Tika, it looks like we (and PDFBox!)
are actually doing better in this case and several others.
Tika-eval reports 7215 tokens extracted by Tika and 6473 tokens
extracted by Adobe, with a drop of 374 "common tokens" in Adobe.  In
short, our extract has more common words in it than Adobe does.
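
The idea behind the "common tokens" comparison can be sketched in a few lines of Python: tokenize each extract and count how many tokens appear in a list of frequent words for the language. The regex tokenizer and the tiny `COMMON_WORDS` set are assumptions for illustration only, and the correctly spaced sentence is a reconstruction of the garbled one quoted above; tika-eval's own tokenization and word lists are different and much larger.

```python
import re

# Hypothetical stand-in for a per-language list of frequent words
COMMON_WORDS = {"is", "a", "has", "been", "used", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def common_token_count(text, common_words=COMMON_WORDS):
    """Count tokens that are found in the common-word list."""
    return sum(1 for tok in tokenize(text) if tok in common_words)


spaced = "is a psychological lexicon, has been used to evaluate user personality"
fused = "isa psychologicallexicon,hasbeenusedtoevaluate user personality"

for name, text in (("spaced", spaced), ("fused", fused)):
    print(name, len(tokenize(text)), common_token_count(text))
```

Missing spaces lower both numbers at once: fused words produce fewer tokens overall, and the fused tokens no longer match anything in the word list.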

And "where Ti;m;Ei;n;Ai;o;Si;p
representsan instanceofatweet"  suggests that there are no Unicode
equivalents stored in the PDF for some fonts.

PDFBox notes: "WARN  No Unicode mapping for summationdisplay (88) in
font RBRLOC+CMEX9"
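
A warning like this means a glyph could not be translated into a character. The Python sketch below shows the general shape of such a lookup under simplifying assumptions: `GLYPH_LIST` is a three-entry stand-in for a real glyph-name table, `EXTRA` is a hypothetical user-supplied override that maps the TeX glyph name `summationdisplay` to the summation sign, and `decode` is an illustrative helper rather than anything from PDFBox or Tika.

```python
# Tiny stand-in for a standard glyph-name table (name -> Unicode character)
GLYPH_LIST = {"a": "a", "summation": "\u2211", "dagger": "\u2020"}

# Hypothetical override for a font-specific glyph name
EXTRA = {"summationdisplay": "\u2211"}


def decode(glyph_names, extra=None):
    """Map glyph names to text; collect names that have no mapping."""
    table = {**GLYPH_LIST, **(extra or {})}
    chars, unmapped = [], []
    for name in glyph_names:
        ch = table.get(name)
        if ch is None:
            unmapped.append(name)
            ch = "\ufffd"  # replacement character marks the gap
        chars.append(ch)
    return "".join(chars), unmapped


print(decode(["summationdisplay", "a"]))
print(decode(["summationdisplay", "a"], extra=EXTRA))
```

Without the override the glyph name has nowhere to go, so the extractor has to drop it or substitute something else. That is why math-heavy papers set in TeX fonts tend to come out worse than plain prose.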


Re: PDF Extraction Failed for scientific document

Posted by Robert Neal Clayton <ro...@gmail.com>.

If Tesseract is installed and was triggered: 

There is an 'equ' language file (equ.traineddata) on their GitHub page which should be installed alongside the other language files. It does what its name suggests: it detects equation symbols. The 'equ' and 'osd' languages are universal for Tesseract versions.  I'm looking at FreeBSD's port, which includes them by default, but I'm not sure about Linux distros.  It seems that Debian splits the languages into individual packages, so you could potentially be missing those unless they're in the base package.
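
One way to check whether those language files are present is to ask Tesseract itself with `tesseract --list-langs`. The Python sketch below wraps that call; the helper names are made up for this example, and it assumes the output is a header line ("List of available languages ...") followed by one language code per line. Both output streams are read because some versions are reported to print the list to stderr, so any warning text printed there would need to be filtered out as well.

```python
import subprocess


def parse_langs(output):
    """Turn `tesseract --list-langs` output into a set of language codes."""
    lines = [ln.strip() for ln in output.splitlines()]
    # Skip blank lines and the header line
    return {ln for ln in lines if ln and not ln.lower().startswith("list of")}


def installed_langs(binary="tesseract"):
    """Run the Tesseract binary and return the languages it reports."""
    result = subprocess.run(
        [binary, "--list-langs"], capture_output=True, text=True, check=True
    )
    return parse_langs(result.stdout + "\n" + result.stderr)


def missing_langs(installed, wanted=("equ", "osd")):
    """Return the wanted language codes that are not installed."""
    return [lang for lang in wanted if lang not in installed]
```

Calling `missing_langs(installed_langs())` on the machine that runs Tika returns an empty list when both files are available.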
