Posted to user@tika.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2014/10/12 02:30:15 UTC

Problematic PDF

Hi Folks,
I have a problematic PDF which keeps on crashing my Nutch crawl.
I am trying to get all data from the PDF, so content is not truncated at
all.
http://www.who.int/about/who_reform/who-internal-control-framework.pdf
Can someone please try to see if they have any issues parsing this document
with Tika 1.6?
I have tried it locally, and it seems OK. If I can confirm this with some
other folks then I can isolate this to my Nutch crawl.
Thank you
Lewis

-- 
*Lewis*

Re: Problematic PDF

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hi Lewis,

This parsed for me using 1.7-SNAPSHOT:

[chipotle:~/tmp/tika] mattmann% tika -t
"http://www.who.int/about/who_reform/who-internal-control-framework.pdf"
WARN - Count in xref table is 0 at offset 651997

 
 

Internal Control Framework

November 2013

2

ANNEX

ANNEXES

   

 

Table of Contents 

Table of Contents ................................................... 2

1. INTRODUCTION ..................................................... 3

2. SCOPE AND DEFINITION OF INTERNAL CONTROL ......................... 4

3. THE FIVE COMPONENTS AND EIGHTEEN PRINCIPLES OF INTERNAL CONTROL: . 5

I/   Internal Environment ........................................... 5
II/  Risk Assessment ................................................ 6

III/ Control Activities ............................................. 6
IV/ Information and Communication ................................... 7

..more snipped


Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





