Posted to users@pdfbox.apache.org by Scott Lipcon <sl...@gmail.com> on 2008/12/08 17:36:32 UTC

ExtractText problem

I've installed PDFBox 0.7.3 on Gentoo Linux via Portage. Gentoo installs a
'pdfextracttext' wrapper which calls Java appropriately, with the right
classpath, etc. I'm using Java 1.5, but have tried with Java 1.6 as
well. I've written a script to download and convert local police reports
from PDF to TXT, and it works fine on about 200 of the 225 reports available
this year. On the other 25, I'm getting NullPointerExceptions:

$ pdfextracttext 011608.pdf
Exception in thread "main" java.lang.NullPointerException
        at org.pdfbox.ExtractText.main(ExtractText.java:208)


http://www.co.ho.md.us/Police/DOCS/011608.pdf is an example of a PDF that
causes the NPE.

http://www.co.ho.md.us/Police/DOCS/011008.pdf is one (from the same week)
that works fine.

I don't see anything obvious in the PDFs (special characters, graphics, etc.)
- both view fine in Acrobat Reader.

Any ideas?

Thanks,
Scott
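
A batch run like the one described above can be sketched as follows, so that the failing reports are collected in one list instead of stopping the script. This is only a sketch under assumptions: `convert_all` and its `runner` parameter are hypothetical names, and it assumes the Gentoo `pdfextracttext` wrapper passes the JVM's non-zero exit status through when ExtractText dies with an exception.

```python
import subprocess
from pathlib import Path


def convert_all(pdf_paths, command="pdfextracttext", runner=subprocess.run):
    """Run the converter on each PDF and return (converted, failed) path lists.

    `runner` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without the real command installed.
    """
    converted, failed = [], []
    for pdf in map(Path, pdf_paths):
        # Assumption: an uncaught exception in ExtractText.main ends the
        # process with a non-zero exit status that the wrapper propagates.
        result = runner([command, str(pdf)], capture_output=True, text=True)
        if result.returncode == 0:
            converted.append(pdf)
        else:
            failed.append(pdf)
    return converted, failed
```

Calling `convert_all(sorted(Path(".").glob("*.pdf")))` in the download directory would then give the list of reports that need a second look.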

Re: ExtractText problem

Posted by Scott Lipcon <sl...@gmail.com>.
I'm a user of the data, not the producer.  Apparently arrests are public
record here in Maryland.   I've just written a small script to pull the PDF
files, convert them to TXT and alert me when there is crime in my
neighborhood.
Scott




Re: ExtractText problem

Posted by Patrick Simon <pa...@heypatty.com>.
My 2 cents (and slightly off topic): are you sure you want to publish
police data on the web? That document mentions people's names and dates.

I work with credit cards, so I'm coming at this from a security point of view.


Re: ExtractText problem

Posted by Daniel Manzke <da...@googlemail.com>.
That is a question for the developers. :) So far I only know a little bit
of the code, and nothing about the model. I hope I will find some time to
look deeper. ;)

Bye,
Daniel




-- 
With kind regards

Daniel Manzke

Re: ExtractText problem

Posted by Scott Lipcon <sl...@gmail.com>.
I just checked out svn trunk, and that is able to extract the text from all
of the PDFs - thanks! Any plans for a formal release? I'd rather use the
packaged version from Gentoo, but will use my local copy for now.

Thanks,
Scott



Re: ExtractText problem

Posted by Daniel Manzke <da...@googlemail.com>.
Have you thought about using a developer version? In Subversion you will
find a 0.8 developer version. 0.7.3 was built in 2006. ;)
Maybe give it a try...



Daniel



-- 
With kind regards

Daniel Manzke

Re: ExtractText problem

Posted by ke...@quarter-flash.com.
Does PDFBox 0.7.3 handle cross-reference streams (xref streams) at this
point? That's one obvious difference that has caused me issues (with other
toolkits) in the past.

On Mon, 8 Dec 2008 11:36:32 -0500, "Scott Lipcon" <sl...@gmail.com>
wrote:

> I don't see anything obvious in the PDF (special characters, graphics, etc)
> - both view fine in acrobat reader.
> 
> any ideas?


Ken
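
One rough way to check the xref stream question for a given file is to scan its raw bytes, as in the sketch below. It relies on two facts about the PDF format: a cross-reference stream is a stream object whose dictionary contains `/Type /XRef`, and a classic cross-reference table starts with the keyword `xref` on its own line. The function name `xref_kinds` is made up for illustration, the scan is a heuristic rather than a real parser, and nothing in this thread confirms that xref streams are what triggers the NPE.

```python
import re

# Dictionary entry that marks a cross-reference stream (PDF 1.5 and later).
XREF_STREAM = re.compile(rb"/Type\s*/XRef\b")
# Keyword "xref" at the start of a line marks a classic table;
# the "startxref" keyword does not match because of the line-start anchor.
XREF_TABLE = re.compile(rb"(?:^|[\r\n])xref\s*[\r\n]")


def xref_kinds(data: bytes) -> set:
    """Return which cross-reference styles appear in the raw PDF bytes."""
    kinds = set()
    if XREF_STREAM.search(data):
        kinds.add("stream")
    if XREF_TABLE.search(data):
        kinds.add("table")
    return kinds
```

Running `xref_kinds(pathlib.Path("011608.pdf").read_bytes())` on a failing file and on a working one such as 011008.pdf would show whether the two groups differ in this respect.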