Posted to users@pdfbox.apache.org by kameron cole <kc...@gmail.com> on 2014/05/20 18:43:05 UTC

some PDF documents will not parse

I get parsing errors on certain PDFs, and this causes my other processes
to halt.  I would like to find some kind of PDF testing utility in this
group, so that I can either
1) test the document before sending it to the parser, then skip it and log
it for later
or
2) find a "fix-it" PDF utility that would correct the doc and put it back
in the queue to be parsed.
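
A cheap pre-test along the lines of option 1 could look like the sketch below. It only checks that the file starts with a "%PDF-" header and ends with an "%%EOF" marker, each searched within a 1024-byte window. The class and method names (PdfPreCheck, looksLikePdf) are made up for illustration and are not part of PDFBox. Passing this check does not mean a parser will accept the file; it only filters out files that are obviously truncated or not PDFs at all.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a PDFBox class: a structural sanity check only.
final class PdfPreCheck {
    // Assumption: look for the markers within 1024 bytes of each end of the file.
    private static final int WINDOW = 1024;

    /** Returns true if the bytes carry both a PDF header and an end-of-file marker. */
    static boolean looksLikePdf(byte[] data) {
        int headEnd = Math.min(data.length, WINDOW);
        int tailStart = Math.max(0, data.length - WINDOW);
        return indexOf(data, "%PDF-", 0, headEnd) >= 0
            && indexOf(data, "%%EOF", tailStart, data.length) >= 0;
    }

    /** Finds an ASCII marker fully contained in data[from, to), or returns -1. */
    static int indexOf(byte[] data, String marker, int from, int to) {
        byte[] m = marker.getBytes(StandardCharsets.US_ASCII);
        for (int i = from; i + m.length <= to; i++) {
            int j = 0;
            while (j < m.length && data[i + j] == m[j]) {
                j++;
            }
            if (j == m.length) {
                return i;
            }
        }
        return -1;
    }
}
```

A caller could read the file with java.nio.file.Files.readAllBytes and set aside anything for which looksLikePdf returns false. Files with a damaged cross-reference table or broken streams would still pass, so this is a filter in front of the parser, not a replacement for parsing.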

Re: some PDF documents will not parse

Posted by Tilman Hausherr <TH...@t-online.de>.
On 20.05.2014 at 18:43, kameron cole wrote:
> I get parsing errors on certain PDFs

What version are you using, and what are the "parsing errors"?

Re: some PDF documents will not parse

Posted by Qingchao Kong <kq...@gmail.com>.
kameron cole,
Hi, I also suggest you post the specific parse errors so that we can find
out exactly what type of error it is. PS: You say "parse pdfs"; I presume
you mean extracting text from PDFs, am I right?

Regards,


On Thu, May 22, 2014 at 12:28 AM, kameron cole <kc...@gmail.com> wrote:

> We are using PDFBox to parse the documents.  We also used Stellent
> OutsideIn (oracle), which parsed the same documents that PDFBox failed to
> parse.  Unfortunately, we can not share the documents because they are
> confidential.
>
> I agree that parsing is the best test for parsing.  I am looking for a
> shortcut, a kind of pre-test.  or, even better, a PDFBox utility that fixes
> bad docs.

-- 
Qingchao Kong

Ph.D. Candidate
State Key Laboratory of Management and Control for Complex Systems
Institute of Automation, Chinese Academy of Sciences

No. 95 Zhongguancun East Road
Haidian District, Beijing 100190 China

Re: some PDF documents will not parse

Posted by kameron cole <kc...@gmail.com>.
We are using PDFBox to parse the documents.  We also used Stellent
OutsideIn (Oracle), which parsed the same documents that PDFBox failed to
parse.  Unfortunately, we cannot share the documents because they are
confidential.

I agree that parsing is the best test for parsing.  I am looking for a
shortcut, a kind of pre-test.  Or, even better, a PDFBox utility that fixes
bad docs.


On Tue, May 20, 2014 at 3:10 PM, Maruan Sahyoun <sa...@fileaffairs.de> wrote:

> Hi,
>
> the parsing errors are occurring within PDFBox or is it a different
> application you are using for parsing? What kind of parsing errors do you
> get? Would you have a sample pdf?
>
> For testing a PDF document to make sure that a parser can parse it
> typically you need to parse it - so …
>
> BR
> Maruan Sahyoun


-- 
** -- **
yours truly,
kameron

PMA® Certified Pilates Teacher
RYT, Yoga Alliance <http://www.yogaalliance.org/>
Kontrology Pilates and Yoga <http://www.kontrology.com> ཀ
SoBe Violoncello <http://www.sobevc.com/sobevc/Welcome.html>
<http://www.sobevc.com/sobevc/Welcome.html>♮

-- ** --

Re: some PDF documents will not parse

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi,

Are the parsing errors occurring within PDFBox, or are you using a different application for parsing? What kind of parsing errors do you get? Do you have a sample PDF?

To test whether a parser can parse a PDF document, you typically need to parse it - so …

BR 
Maruan Sahyoun

On 20.05.2014 at 18:43, kameron cole <kc...@gmail.com> wrote:

> I get parsing errors on certain PDFs - and this causes my other processes
> to halt.  I would like to find some kind of PDF testing utility in this
> group, so that I can either
> 1) test the document before sending it to the parser, and skip it, log it,
> for later
> or
> 2) Find a "fix-it" PDF utility, that would correct the doc, and put it back
> in the queue to be parsed.
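
If parsing is the only reliable test, one way to stop a bad file from halting everything else is to wrap each parse in a try/catch and set failures aside together with their error. The sketch below shows that pattern under stated assumptions: ParseQueue, DocumentParser and Result are hypothetical names, and DocumentParser is only a stand-in for whatever call actually does the parsing (for example, loading the file with PDFBox). No PDFBox API is used here.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: parse every file, skip and record the ones that fail.
final class ParseQueue {
    /** Stand-in for the real parser call; an assumption, not a PDFBox type. */
    interface DocumentParser {
        void parse(Path file) throws Exception;
    }

    /** Outcome of one run: files that parsed, and files set aside with their error. */
    static final class Result {
        final List<Path> parsed = new ArrayList<>();
        final Map<Path, String> skipped = new LinkedHashMap<>();
    }

    static Result run(List<Path> queue, DocumentParser parser) {
        Result result = new Result();
        for (Path file : queue) {
            try {
                parser.parse(file);
                result.parsed.add(file);
            } catch (Exception e) {
                // Keep the exception type and message so the failure can be
                // logged or reported later; the loop continues with the next file.
                result.skipped.put(file, e.getClass().getSimpleName() + ": " + e.getMessage());
            }
        }
        return result;
    }
}
```

The skipped map doubles as the "log it for later" list from the original question, and the recorded messages are the kind of detail the replies in this thread ask for. Only Exception is caught; errors such as OutOfMemoryError would still propagate.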