Posted to user@tika.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2014/10/12 02:30:15 UTC
Problematic PDF
Hi Folks,
I have a problematic PDF which keeps crashing my Nutch crawl.
I am trying to get all data from the PDF, so content is not truncated at
all.
http://www.who.int/about/who_reform/who-internal-control-framework.pdf
Can someone please try to see if they have any issues parsing this document
with Tika 1.6?
I have tried it locally, and it seems OK. If I can confirm this with some
other folks then I can isolate this to my Nutch crawl.
Thank you
Lewis
--
*Lewis*
Re: Problematic PDF
Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hi Lewis,
This parsed for me using 1.7-SNAPSHOT:
[chipotle:~/tmp/tika] mattmann% tika -t "http://www.who.int/about/who_reform/who-internal-control-framework.pdf"
WARN - Count in xref table is 0 at offset 651997
Internal Control Framework
November 2013
2
ANNEX
ANNEXES
Table of Contents
Table of Contents
...........................................................................
....................................... 2
1. INTRODUCTION
...........................................................................
................................................................ 3
2. SCOPE AND DEFINITION OF INTERNAL CONTROL
.........................................................................
4
3. THE FIVE COMPONENTS AND EIGHTEEN PRINCIPLES OF INTERNAL CONTROL:
............... 5
I/ Internal Environment
...........................................................................
............................ 5
II/ Risk Assessment
...........................................................................
................................... 6
III/ Control Activities
...........................................................................
................................. 6
IV/ Information and Communication
...........................................................................
......... 7
..more snipped
Cheers,
Chris
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
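
The check discussed in this thread (does Tika alone parse the document, outside of Nutch?) can be scripted so that a crash or hang in the parser does not take the calling process down with it. The following is only a minimal sketch under stated assumptions, not how Nutch or Tika handle this internally: the helper names run_parse and tika_text_command are hypothetical, and the jar path passed to tika_text_command is whatever tika-app jar is available locally. The -t option is the same plain-text option used in the command above.

```python
import subprocess


def tika_text_command(jar_path, target):
    """Build a tika-app command line that extracts plain text (-t).

    jar_path is an assumption: point it at a local tika-app jar.
    target is a file path or URL, as in the command shown in this thread.
    """
    return ["java", "-jar", jar_path, "-t", target]


def run_parse(cmd, timeout_s=120):
    """Run a parser command in a child process and report how it ended.

    Returns (status, stdout, stderr), where status is one of
    "ok" (exit code 0), "failed" (non-zero exit code) or "timeout".
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before the exception is raised.
        return "timeout", "", ""
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout, proc.stderr
```

A possible use, assuming a jar named tika-app-1.6.jar in the current directory: call run_parse(tika_text_command("tika-app-1.6.jar", url)) with the PDF URL from this thread and compare the length of the returned text with what the crawl stores. A status of "ok" with full text would point away from Tika itself and toward the crawl configuration.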