Posted to user@tika.apache.org by "A.M. Sabuncu" <am...@gmail.com> on 2014/12/24 21:30:53 UTC

Parsing PDF files

I am following the examples at http://wiki.apache.org/tika/TikaJAXRS and
using the following curl command to test text extraction from PDF files:

curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"

On trivial PDF files (e.g. one created using Word 2010's convert-to-PDF
functionality, containing only the text "Testing" and about 81 KB in size),
the curl command returns nothing, and on the tika-server end I see the
following errors:

<lots of garbage characters displayed on screen, followed by>

WARNING: Did not found XRef object at specified startxref position 0

Being new to Tika, I would like to know whether I am doing something wrong,
or if PDF parsing is not yet an exact science.

Many thanks in advance.

Sabuncu

Re: Parsing PDF files

Posted by "A.M. Sabuncu" <am...@gmail.com>.

Dave, thank you so much.

Here's the information you requested:

*Java version:*

java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)

*JDK version:*

javac 1.7.0_67

*EC2 Instance type:*

t1.micro

*Memory available:*

MemTotal:        1016244 kB
MemFree:          214948 kB
Buffers:          154664 kB
Cached:           485152 kB
SwapCached:        12116 kB
Active:           328832 kB
Inactive:         380976 kB
Active(anon):       3756 kB
Inactive(anon):    69924 kB
Active(file):     325076 kB
Inactive(file):   311052 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:        649996 kB
SwapFree:         506040 kB
Dirty:                36 kB
Writeback:             0 kB
AnonPages:         60096 kB
Mapped:            17140 kB
Shmem:              3688 kB
Slab:              72996 kB
SReclaimable:      61700 kB
SUnreclaim:        11296 kB
KernelStack:        1296 kB
PageTables:         4912 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     1158116 kB
Committed_AS:     411964 kB
VmallocTotal: 4359738367 kB
VmallocUsed:        2596 kB
VmallocChunk:34359729907 kB
HardwareCorrupted:     0 kB
AnonHugePages:     47104 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       28672 kB
DirectMap2M:     1150976 kB


On Sun, Dec 28, 2014 at 6:48 PM, David Meikle <lo...@gmail.com> wrote:

> Hello,
>
> On 25 Dec 2014, at 09:59, A.M. Sabuncu <am...@gmail.com> wrote:
>
> Have a feeling maybe I am missing something rudimentary.
>
> I am running tika-server on an AWS Ubuntu instance, and issuing the curl
> commands from a Windows 7 system.  I downloaded and built Tika 1.6 from
> apache.org/dist/tika, with timestamp 2014-09-05 05:42.
>
> Thanks so much, happy holidays.
>
>
> I have performed a couple of tests on OS X, Ubuntu and Windows boxes,
> building the Tika 1.6 source[1] with either Sun Java or OpenJDK 7.  Each
> time this worked as expected.
>
> Also, I am able to process the PDF using an instance of the Apache Tika
> OpenShift cartridge[2] based on the tika-1.6-server.jar[3]:
> curl -T GeoSPARQL.pdf http://tikaserver-logicalspark.rhcloud.com/tika
>
> Given the above, I am wondering if there is something environmental within
> your EC2 instance.
>
> Are you able to share the following:
>
>    - Java version - i.e. java -version
>    - JDK version - i.e. javac -version
>    - EC2 Instance type - e.g. t1.micro, t2.small, etc
>    - Memory Available - i.e. output of less /proc/meminfo
>
>
> Thanks,
> Dave
>
> [1] http://www.apache.org/dist/tika/tika-1.6-src.zip
> [2] https://github.com/LogicalSpark/openshift-tika-cartridge
> [3] http://www.apache.org/dist/tika/tika-server-1.6.jar
>

Re: Parsing PDF files

Posted by David Meikle <lo...@gmail.com>.

Hello,

> On 25 Dec 2014, at 09:59, A.M. Sabuncu <am...@gmail.com> wrote:
> 
> Have a feeling maybe I am missing something rudimentary.
> 
> I am running tika-server on an AWS Ubuntu instance, and issuing the curl commands from a Windows 7 system.  I downloaded and built Tika 1.6 from apache.org/dist/tika, with timestamp 2014-09-05 05:42.
> 
> Thanks so much, happy holidays.
> 

I have performed a couple of tests on OS X, Ubuntu and Windows boxes, building the Tika 1.6 source[1] with either Sun Java or OpenJDK 7.  Each time this worked as expected.

Also, I am able to process the PDF using an instance of the Apache Tika OpenShift cartridge[2] based on the tika-1.6-server.jar[3]:
curl -T GeoSPARQL.pdf http://tikaserver-logicalspark.rhcloud.com/tika

Given the above, I am wondering if there is something environmental within your EC2 instance.

Are you able to share the following:

   - Java version - i.e. java -version
   - JDK version - i.e. javac -version
   - EC2 Instance type - e.g. t1.micro, t2.small, etc
   - Memory Available - i.e. output of less /proc/meminfo

Thanks,
Dave

[1] http://www.apache.org/dist/tika/tika-1.6-src.zip
[2] https://github.com/LogicalSpark/openshift-tika-cartridge
[3] http://www.apache.org/dist/tika/tika-server-1.6.jar
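
For anyone gathering the details requested above, here is a minimal sketch that writes them to a single file. It assumes a Linux host with a POSIX shell. The report file name is made up for illustration, and the instance-type lookup assumes the EC2 instance metadata endpoint is reachable without a session token, which may not hold on every instance.

```shell
#!/bin/sh
# Collect the requested environment details into one report file.
# "tika-env-report.txt" is an arbitrary name chosen for this sketch.
report="tika-env-report.txt"
{
  echo "== java -version =="
  # java and javac print their version to stderr on older JDKs, hence 2>&1.
  if command -v java >/dev/null 2>&1; then java -version 2>&1; else echo "java not found"; fi

  echo "== javac -version =="
  if command -v javac >/dev/null 2>&1; then javac -version 2>&1; else echo "javac not found"; fi

  echo "== EC2 instance type =="
  # Assumption: metadata endpoint answers plain GETs; otherwise report it as unavailable.
  if command -v curl >/dev/null 2>&1; then
    curl -s -f --connect-timeout 2 --max-time 3 \
      http://169.254.169.254/latest/meta-data/instance-type \
      || echo "instance type unavailable"
    echo
  else
    echo "curl not found"
  fi

  echo "== /proc/meminfo =="
  if [ -r /proc/meminfo ]; then cat /proc/meminfo; else echo "/proc/meminfo not readable"; fi
} > "$report"
echo "wrote $report"
```

The resulting file can be pasted into a reply as-is.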

Re: Parsing PDF files

Posted by "A.M. Sabuncu" <am...@gmail.com>.

OK, I obtained GeoSPARQL.pdf file from here:
http://www.w3.org/2011/02/GeoSPARQL.pdf

I first tried the following command line:

curl -T GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"

I got nothing back from the above curl command, and the server dumped the
following on screen, part of a longer trace:

Caused by: java.io.IOException: Push back buffer is full

Did research, and tried starting tika-server as follows to increase the
property in question to 1 GB:

java -Dorg.apache.pdfbox.baseParser.pushBackSize=1073741824 -jar tika-server-1.6.jar

I still got nothing back from the curl command, but the server did not
produce a stack trace, instead just the following output:

Dec 25, 2014 9:40:33 AM org.apache.tika.server.TikaResource logRequest
INFO: tika (application/pdf)

Have a feeling maybe I am missing something rudimentary.

I am running tika-server on an AWS Ubuntu instance, and issuing the curl
commands from a Windows 7 system.  I downloaded and built Tika 1.6 from
apache.org/dist/tika, with timestamp 2014-09-05 05:42.

Thanks so much, happy holidays.

On Thu, Dec 25, 2014 at 8:02 AM, Nick Burch <ap...@gagravarr.org> wrote:

> On Wed, 24 Dec 2014, A.M. Sabuncu wrote:
>
>> I am following the examples at http://wiki.apache.org/tika/TikaJAXRS and
>> using the following curl command to test text extraction from PDF files:
>>
>> curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header
>> "Content-type: application/pdf"
>>
>
> What happens if you try
>
> curl -T GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
>
> ? That works fine for me with a test PDF
>
> Nick
>

Re: Parsing PDF files

Posted by Nick Burch <ap...@gagravarr.org>.

On Wed, 24 Dec 2014, A.M. Sabuncu wrote:
> I am following the examples at http://wiki.apache.org/tika/TikaJAXRS and
> using the following curl command to test text extraction from PDF files:
>
> curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header
> "Content-type: application/pdf"

What happens if you try

curl -T GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"

? That works fine for me with a test PDF

Nick

Re: Parsing PDF files

Posted by Chris Mattmann <ch...@gmail.com>.

Thanks Dave!
------------------------
Chris Mattmann
chris.mattmann@gmail.com




-----Original Message-----
From: David Meikle <lo...@gmail.com>
Reply-To: <us...@tika.apache.org>
Date: Monday, December 29, 2014 at 2:50 PM
To: <us...@tika.apache.org>
Subject: Re: Parsing PDF files

>Hello,
>
>On 24 Dec 2014, at 20:30, A.M. Sabuncu <am...@gmail.com> wrote:
>
>I am following the examples at http://wiki.apache.org/tika/TikaJAXRS and
>using the following curl command to test text extraction from PDF files:
>
>curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header
>"Content-type: application/pdf"
>
>On trivial PDF files (e.g. created using Word 2010's convert-to-pdf
>functionality and containing only the text "Testing", about 81 KB in
>size), I get errors in that there's nothing returned from the curl
>command, and on the tika-server end, I see the following errors:
>
><lots of garbage characters displayed on screen, followed by>
>
>WARNING: Did not found XRef object at specified startxref position 0
>
>Being new to Tika, I would like to know whether I am doing something
>wrong, or if PDF parsing is not yet an exact science.
>
>Many thanks in advance.
>
>Sabuncu
>
>Working through this we have discovered we were using different commands,
>which then uncovered an error in the example on the TikaJAXRS wiki page
>where all examples, regardless of the nature of the content, use the -d
>flag (effectively --data-ascii) in the curl commands.  This means that
>binary files are being processed as ASCII content.
>
>Based on the above, all that was required was to change the command from:
>
>curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header
>"Content-type: application/pdf"
>
>To:
>
>curl -X PUT --data-binary @GeoSPARQL.pdf http://localhost:9998/tika
>--header "Content-type: application/pdf"
>
>I have updated the TikaJAXRS wiki page accordingly but felt it was worth
>posting back to the list for future reference.
>
>Cheers,
>Dave

Re: Parsing PDF files

Posted by "A.M. Sabuncu" <am...@gmail.com>.

I'd like to thank David Meikle for his persistent assistance in resolving
this problem.  Much appreciated.

Todd

On Tue, Dec 30, 2014 at 12:50 AM, David Meikle <lo...@gmail.com> wrote:

> Hello,
>
> On 24 Dec 2014, at 20:30, A.M. Sabuncu <am...@gmail.com> wrote:
>
> I am following the examples at http://wiki.apache.org/tika/TikaJAXRS and
> using the following curl command to test text extraction from PDF files:
>
> curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
>
> On trivial PDF files (e.g. created using Word 2010's convert-to-pdf
> functionality and containing only the text "Testing", about 81 KB in size),
> I get errors in that there's nothing returned from the curl command, and on
> the tika-server end, I see the following errors:
>
> <lots of garbage characters displayed on screen, followed by>
>
> WARNING: Did not found XRef object at specified startxref position 0
>
> Being new to Tika, I would like to know whether I am doing something
> wrong, or if PDF parsing is not yet an exact science.
>
> Many thanks in advance.
>
> Sabuncu
>
>
> Working through this we have discovered we were using different commands,
> which then uncovered an error in the example on the TikaJAXRS wiki page
> where all examples, regardless of the nature of the content, use the -d
> flag (effectively --data-ascii) in the curl commands.  This means that
> binary files are being processed as ASCII content.
>
> Based on the above, all that was required was to change the command from:
>
> curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header
> "Content-type: application/pdf"
>
> To:
>
> curl -X PUT --data-binary @GeoSPARQL.pdf http://localhost:9998/tika --header
> "Content-type: application/pdf"
>
> I have updated the TikaJAXRS wiki page accordingly but felt it was worth
> posting back to the list for future reference.
>
> Cheers,
> Dave
>
>

Re: Parsing PDF files

Posted by David Meikle <lo...@gmail.com>.

Hello,

> On 24 Dec 2014, at 20:30, A.M. Sabuncu <am...@gmail.com> wrote:
> 
> I am following the examples at http://wiki.apache.org/tika/TikaJAXRS and using the following curl command to test text extraction from PDF files:
> 
> curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
> 
> On trivial PDF files (e.g. created using Word 2010's convert-to-pdf functionality and containing only the text "Testing", about 81 KB in size), I get errors in that there's nothing returned from the curl command, and on the tika-server end, I see the following errors:
> 
> <lots of garbage characters displayed on screen, followed by>
> 
> WARNING: Did not found XRef object at specified startxref position 0
> 
> Being new to Tika, I would like to know whether I am doing something wrong, or if PDF parsing is not yet an exact science.
> 
> Many thanks in advance.
> 
> Sabuncu

Working through this, we discovered that we were using different commands, which in turn uncovered an error in the examples on the TikaJAXRS wiki page: all of them, regardless of the nature of the content, use the -d flag (effectively --data-ascii) in the curl commands.  This means that binary files are being processed as ASCII content.

Based on the above, all that was required was to change the command from:

curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"

To:

curl -X PUT --data-binary @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"

I have updated the TikaJAXRS wiki page accordingly but felt it was worth posting back to the list for future reference.

Cheers,
Dave
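
To illustrate why -d damages a binary upload: the curl manual states that when -d reads its data from a file, carriage returns and newlines are stripped. The sketch below imitates that stripping with tr instead of calling curl, so it needs no running server. The file sample.bin is a made-up stand-in with a few PDF-like markers, not a valid PDF.

```shell
#!/bin/sh
# Imitate what `curl -d @file` does to a payload before sending it:
# carriage returns and newlines are removed (per the curl manual).
# Assumption: sample.bin is a hypothetical stand-in, not a real PDF.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.bin"
printf '%%PDF-1.4\nline one\r\nline two\nxref\n%%%%EOF\n' > "$sample"

# Size as --data-binary would send it: the file is left unchanged.
binary_size=$(wc -c < "$sample" | tr -d ' ')

# Size as -d would send it: every CR and LF byte is dropped.
tr -d '\r\n' < "$sample" > "$tmpdir/stripped.bin"
stripped_size=$(wc -c < "$tmpdir/stripped.bin" | tr -d ' ')

echo "unchanged: $binary_size bytes, after CR/LF stripping: $stripped_size bytes"
rm -rf "$tmpdir"
```

Removing bytes shifts every byte offset recorded inside a PDF, which would be consistent with the startxref warning reported earlier in the thread, although that link is an inference and was not confirmed here. Both --data-binary @file and curl -T file send the file contents unmodified.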