Posted to dev@pdfbox.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2016/03/23 14:20:46 UTC

shading/relocating 1.8.x?

All,
  We've upgraded to 2.0.0 on Tika.  Many thanks again!
  One of our users is interested in continuing to use the classic/SequentialParser, or at least having it available as a back-off parser for corrupt pdfs [0].
  Would you be willing to distribute a shaded/relocated 1.8.x app so that we could load both 1.8.x and 2.0.0 in the same jvm without collisions?  Or, is there a better solution?

  Thank you!

              Cheers,

                         Tim

[0] https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208360#comment-15208360

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: shading/relocating 1.8.x?

Posted by John Hewson <jo...@jahewson.com>.
> On 29 Mar 2016, at 04:11, Andreas Lehmkühler <an...@lehmi.de> wrote:
> 
>> "Allison, Timothy B." <tallison@mitre.org> wrote on 28 March 2016 at 21:02:
>> 
>> 
>> Oh, wow, so it really might be possible without too much work?  I'm more than
>> happy to supply examples. :) 
> Oops, it isn't as simple as it sounds. If we simply swallow the exception, PDFBox
> will most likely run into an NPE. IMHO we have to implement some sort of on-demand
> parser which is able to handle null values for specific parts of a PDF without
> throwing any exception.

One thought: instead of null it might be possible to return an empty string, empty
dictionary, empty array, empty stream, etc. That way we don’t have to look for null
everywhere.

— John
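
A minimal sketch of the "return an empty value instead of null" idea, for illustration only. LenientDict and LenientDictDemo are hypothetical names invented for this example and are not part of PDFBox's API; the point is just that chained lookups stay safe when a part of the document is missing.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a parsed PDF dictionary (not PDFBox's real API).
final class LenientDict {
    // Shared empty instance that is returned wherever null would otherwise appear.
    static final LenientDict EMPTY =
            new LenientDict(Collections.<String, Object>emptyMap());

    private final Map<String, ?> entries;

    LenientDict(Map<String, ?> entries) {
        this.entries = entries;
    }

    // Returns the nested dictionary, or EMPTY if the entry is missing or has another type.
    LenientDict getDict(String key) {
        Object value = entries.get(key);
        return value instanceof LenientDict ? (LenientDict) value : EMPTY;
    }

    // Returns the string value, or "" if the entry is missing or has another type.
    String getString(String key) {
        Object value = entries.get(key);
        return value instanceof String ? (String) value : "";
    }

    // Returns the array value, or an empty list if the entry is missing or has another type.
    List<?> getArray(String key) {
        Object value = entries.get(key);
        return value instanceof List ? (List<?>) value : Collections.emptyList();
    }
}

final class LenientDictDemo {
    // No null checks needed: every missing level yields an empty value.
    static String titleOf(LenientDict trailer) {
        return trailer.getDict("Info").getString("Title");
    }

    public static void main(String[] args) {
        LenientDict info = new LenientDict(Collections.singletonMap("Title", "Report"));
        LenientDict trailer = new LenientDict(Collections.singletonMap("Info", info));
        System.out.println("[" + titleOf(trailer) + "]");
        System.out.println("[" + titleOf(LenientDict.EMPTY) + "]");
    }
}
```

The trade-off is that callers can no longer tell "missing" from "present but empty" unless they compare against the shared EMPTY instance.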

> 
>> Should I open an issue?
> Thanks, but I'm going to do that soon, as some other things should be done as
> well.
> 
> BR
> Andreas
>> 
>> 
>> -----Original Message-----
>> From: Andreas Lehmkuehler [mailto:andreas@lehmi.de] 
>> Sent: Monday, March 28, 2016 10:58 AM
>> To: dev@pdfbox.apache.org
>> Subject: Re: shading/relocating 1.8.x?
>> 
>> On 25.03.2016 at 17:39, John Hewson wrote:
>>> 
>>>> On 23 Mar 2016, at 06:20, Allison, Timothy B. <ta...@mitre.org> wrote:
>>>> 
>>>> All,
>>>>  We've upgraded to 2.0.0 on Tika.  Many thanks again!
>>>>  One of our users is interested in continuing to use the
>>>> classic/SequentialParser, or at least having it available as a back-off
>>>> parser for corrupt pdfs [0].
>>> 
>>> Using the old parser really isn’t a good idea, it’s known to be pretty
>>> broken. I think that we would be much better off making sure the new parser
>>> can handle truncated files. We already do a lot of repair in the new parser,
>>> so this doesn’t seem like too much work? Maybe Andreas can comment further?
>> The biggest issue here is the truncated stream or dictionary. The current
>> version simply throws an exception when running into such constellations. We
>> have to implement some algorithm to ignore such incomplete parts of a pdf if
>> possible.
>> 
>> BR
>> Andreas
>> 
>>> 
>>> Do we have some JIRA issues which identify some of these cases?
>>> 
>>> — John
>>> 
>>>>  Would you be willing to distribute a shaded/relocated 1.8.x app so that
>>>> we could load both 1.8.x and 2.0.0 in the same jvm without collisions?  Or,
>>>> is there a better solution?
>>> 
>>> I wouldn’t recommend doing that, because you’re going to be stuck with using
>>> 1.8 for everything, not just parsing, at least as far as corrupt/truncated
>>> files are concerned.
>>> 
>>> — John
>>> 
>>>>  Thank you!
>>>> 
>>>>              Cheers,
>>>> 
>>>>                         Tim
>>>> 
>>>> [0]
>>>> https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208360#comment-15208360
>>>> 

RE: shading/relocating 1.8.x?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Got it.  That's what I had assumed.

I'll hold off on opening truncated file issue(s) on PDFBox's JIRA...  I opened TIKA-1912 to track this on our side.

Thank you, again!

Best,

          Tim

-----Original Message-----
From: Andreas Lehmkühler [mailto:andreas@lehmi.de] 
Sent: Tuesday, March 29, 2016 7:12 AM
To: dev@pdfbox.apache.org
Subject: RE: shading/relocating 1.8.x?

> "Allison, Timothy B." <ta...@mitre.org> wrote on 28 March 2016 at 21:02:
> 
> 
> Oh, wow, so it really might be possible without too much work?  I'm 
> more than happy to supply examples. :)
Oops, it isn't as simple as it sounds. If we simply swallow the exception, PDFBox will most likely run into an NPE. IMHO we have to implement some sort of on-demand parser which is able to handle null values for specific parts of a PDF without throwing any exception.

> Should I open an issue?
Thanks, but I'm going to do that soon, as some other things should be done as well.

BR
Andreas


RE: shading/relocating 1.8.x?

Posted by Andreas Lehmkühler <an...@lehmi.de>.
> "Allison, Timothy B." <ta...@mitre.org> wrote on 28 March 2016 at 21:02:
> 
> 
> Oh, wow, so it really might be possible without too much work?  I'm more than
> happy to supply examples. :) 
Oops, it isn't as simple as it sounds. If we simply swallow the exception, PDFBox
will most likely run into an NPE. IMHO we have to implement some sort of on-demand
parser which is able to handle null values for specific parts of a PDF without
throwing any exception.

> Should I open an issue?
Thanks, but I'm going to do that soon, as some other things should be done as
well.

BR
Andreas
> 
> 
> -----Original Message-----
> From: Andreas Lehmkuehler [mailto:andreas@lehmi.de] 
> Sent: Monday, March 28, 2016 10:58 AM
> To: dev@pdfbox.apache.org
> Subject: Re: shading/relocating 1.8.x?
> 
> On 25.03.2016 at 17:39, John Hewson wrote:
> >
> >> On 23 Mar 2016, at 06:20, Allison, Timothy B. <ta...@mitre.org> wrote:
> >>
> >> All,
> >>   We've upgraded to 2.0.0 on Tika.  Many thanks again!
> >>   One of our users is interested in continuing to use the
> >> classic/SequentialParser, or at least having it available as a back-off
> >> parser for corrupt pdfs [0].
> >
> > Using the old parser really isn’t a good idea, it’s known to be pretty
> > broken. I think that we would be much better off making sure the new parser
> > can handle truncated files. We already do a lot of repair in the new parser,
> > so this doesn’t seem like too much work? Maybe Andreas can comment further?
> The biggest issue here is the truncated stream or dictionary. The current
> version simply throws an exception when running into such constellations. We
> have to implement some algorithm to ignore such incomplete parts of a pdf if
> possible.
> 
> BR
> Andreas
> 
> >
> > Do we have some JIRA issues which identify some of these cases?
> >
> > — John
> >
> >>   Would you be willing to distribute a shaded/relocated 1.8.x app so that
> >> we could load both 1.8.x and 2.0.0 in the same jvm without collisions?  Or,
> >> is there a better solution?
> >
> > I wouldn’t recommend doing that, because you’re going to be stuck with using
> > 1.8 for everything, not just parsing, at least as far as corrupt/truncated
> > files are concerned.
> >
> > — John
> >
> >>   Thank you!
> >>
> >>               Cheers,
> >>
> >>                          Tim
> >>
> >> [0]
> >> https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208360#comment-15208360
> >>


RE: shading/relocating 1.8.x?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Oh, wow, so it really might be possible without too much work?  I'm more than happy to supply examples. :) 

Should I open an issue?


-----Original Message-----
From: Andreas Lehmkuehler [mailto:andreas@lehmi.de] 
Sent: Monday, March 28, 2016 10:58 AM
To: dev@pdfbox.apache.org
Subject: Re: shading/relocating 1.8.x?

On 25.03.2016 at 17:39, John Hewson wrote:
>
>> On 23 Mar 2016, at 06:20, Allison, Timothy B. <ta...@mitre.org> wrote:
>>
>> All,
>>   We've upgraded to 2.0.0 on Tika.  Many thanks again!
>>   One of our users is interested in continuing to use the classic/SequentialParser, or at least having it available as a back-off parser for corrupt pdfs [0].
>
> Using the old parser really isn’t a good idea, it’s known to be pretty broken. I think that we would be much better off making sure the new parser can handle truncated files. We already do a lot of repair in the new parser, so this doesn’t seem like too much work? Maybe Andreas can comment further?
The biggest issue here is the truncated stream or dictionary. The current version simply throws an exception when running into such constellations. We have to implement some algorithm to ignore such incomplete parts of a pdf if possible.

BR
Andreas

>
> Do we have some JIRA issues which identify some of these cases?
>
> — John
>
>>   Would you be willing to distribute a shaded/relocated 1.8.x app so that we could load both 1.8.x and 2.0.0 in the same jvm without collisions?  Or, is there a better solution?
>
> I wouldn’t recommend doing that, because you’re going to be stuck with using 1.8 for everything, not just parsing, at least as far as corrupt/truncated files are concerned.
>
> — John
>
>>   Thank you!
>>
>>               Cheers,
>>
>>                          Tim
>>
>> [0] https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208360#comment-15208360
>>


Re: shading/relocating 1.8.x?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 25.03.2016 at 17:39, John Hewson wrote:
>
>> On 23 Mar 2016, at 06:20, Allison, Timothy B. <ta...@mitre.org> wrote:
>>
>> All,
>>   We've upgraded to 2.0.0 on Tika.  Many thanks again!
>>   One of our users is interested in continuing to use the classic/SequentialParser, or at least having it available as a back-off parser for corrupt pdfs [0].
>
> Using the old parser really isn’t a good idea, it’s known to be pretty broken. I think that we would be much better off making sure the new parser can handle truncated files. We already do a lot of repair in the new parser, so this doesn’t seem like too much work? Maybe Andreas can comment further?
The biggest issue here is the truncated stream or dictionary. The current 
version simply throws an exception when running into such constellations. We 
have to implement some algorithm to ignore such incomplete parts of a pdf if 
possible.

BR
Andreas
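
As a toy illustration of ignoring incomplete parts rather than throwing, the sketch below scans text for "<<" ... ">>" pairs and keeps only the ones that are closed. It is deliberately simplistic (no nesting, no streams, no cross-reference handling) and LenientScanner is a made-up name, not anything in PDFBox; it only shows the "keep what is complete, drop the truncated tail" behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Toy scanner, for illustration only: real PDF dictionaries nest and need a proper parser.
final class LenientScanner {
    // Returns the contents of every complete "<< ... >>" pair, silently
    // dropping a trailing dictionary that was cut off before its ">>".
    static List<String> completeDictionaries(String input) {
        List<String> complete = new ArrayList<>();
        int pos = 0;
        while (true) {
            int open = input.indexOf("<<", pos);
            if (open < 0) {
                break; // no further dictionaries
            }
            int close = input.indexOf(">>", open + 2);
            if (close < 0) {
                break; // truncated dictionary: ignore it instead of throwing
            }
            complete.add(input.substring(open + 2, close).trim());
            pos = close + 2;
        }
        return complete;
    }

    public static void main(String[] args) {
        String truncated = "<< /Type /Catalog >> << /Type /Page >> << /Type /Pa";
        System.out.println(completeDictionaries(truncated));
    }
}
```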

>
> Do we have some JIRA issues which identify some of these cases?
>
> — John
>
>>   Would you be willing to distribute a shaded/relocated 1.8.x app so that we could load both 1.8.x and 2.0.0 in the same jvm without collisions?  Or, is there a better solution?
>
> I wouldn’t recommend doing that, because you’re going to be stuck with using 1.8 for everything, not just parsing, at least as far as corrupt/truncated files are concerned.
>
> — John
>
>>   Thank you!
>>
>>               Cheers,
>>
>>                          Tim
>>
>> [0] https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208360#comment-15208360
>>


RE: shading/relocating 1.8.x?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
See:

https://issues.apache.org/jira/browse/TIKA-1285?focusedCommentId=15214111&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15214111 

-----Original Message-----
From: John Hewson [mailto:john@jahewson.com] 
Sent: Friday, March 25, 2016 1:03 PM
To: dev@pdfbox.apache.org
Subject: Re: shading/relocating 1.8.x?


> On 25 Mar 2016, at 09:44, Tilman Hausherr <TH...@t-online.de> wrote:
> 
> On 25.03.2016 at 17:39, John Hewson wrote:
>> Do we have some JIRA issues which identify some of these cases?
> 
> https://issues.apache.org/jira/browse/PDFBOX-3265
> 

Great! Does anyone else have some others?

— John



RE: shading/relocating 1.8.x?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Hi John,

  Normally, I'd agree.  And, yes, I've been extremely grateful for the effort put into dealing with noisy PDFs in 2.0.0.  However, I think that the Tika user requesting this is interested in getting what he can from truncated and truly broken files -- e.g. Common Crawl data which (I think) truncates files at 1MB or may have had an interrupt during download.  My basic rule for opening an issue is if AR or another pdf parser can't parse it, I'm not going to ask for help.
 
   I wouldn't want to direct your all's efforts to dealing with the edge cases of truncated files.  If the old PDFParser is able to get something out because it parsed sequentially, then it would be neat to be able to have that available with very little effort.  In Tika, we envision allowing users to configure combinations of parsers for a given file; this would be the perfect case for the back-off-on-exception strategy -- if there's an exception with 2.0.0, try again with 1.8.x.
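
   A rough sketch of what a back-off-on-exception chain could look like, assuming a hypothetical TextExtractor interface. Neither TextExtractor nor BackOffExtractor is an existing Tika or PDFBox type; in practice the first delegate would wrap the 2.0.0 parser and the second a relocated 1.8.x parser.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical parser abstraction (not Tika's or PDFBox's real API).
interface TextExtractor {
    String extract(byte[] pdfBytes) throws IOException;
}

// Tries each delegate in order and returns the first successful result.
final class BackOffExtractor implements TextExtractor {
    private final List<TextExtractor> delegates;

    BackOffExtractor(TextExtractor... delegates) {
        this.delegates = Arrays.asList(delegates);
    }

    @Override
    public String extract(byte[] pdfBytes) throws IOException {
        IOException failure = new IOException("all extractors failed");
        for (TextExtractor delegate : delegates) {
            try {
                return delegate.extract(pdfBytes);
            } catch (IOException | RuntimeException e) {
                // Remember why this delegate failed, then back off to the next one.
                failure.addSuppressed(e);
            }
        }
        throw failure;
    }
}

final class BackOffDemo {
    public static void main(String[] args) throws IOException {
        // Stand-ins for the strict (newer) and lenient (older) parsers.
        TextExtractor strict = bytes -> { throw new IOException("truncated file"); };
        TextExtractor lenient = bytes -> "partial text";
        System.out.println(new BackOffExtractor(strict, lenient).extract(new byte[0]));
    }
}
```

   Keeping the earlier failures as suppressed exceptions means nothing is lost if every parser in the chain gives up.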

  I'll try shading/relocating next week, and see whether that works as expected.

  Thank you, all, again!

              Cheers,

                        Tim


-----Original Message-----
From: John Hewson [mailto:john@jahewson.com] 
Sent: Friday, March 25, 2016 1:03 PM
To: dev@pdfbox.apache.org
Subject: Re: shading/relocating 1.8.x?


> On 25 Mar 2016, at 09:44, Tilman Hausherr <TH...@t-online.de> wrote:
> 
> On 25.03.2016 at 17:39, John Hewson wrote:
>> Do we have some JIRA issues which identify some of these cases?
> 
> https://issues.apache.org/jira/browse/PDFBOX-3265
> 

Great! Does anyone else have some others?

— John



Re: shading/relocating 1.8.x?

Posted by John Hewson <jo...@jahewson.com>.
> On 25 Mar 2016, at 09:44, Tilman Hausherr <TH...@t-online.de> wrote:
> 
> On 25.03.2016 at 17:39, John Hewson wrote:
>> Do we have some JIRA issues which identify some of these cases?
> 
> https://issues.apache.org/jira/browse/PDFBOX-3265
> 

Great! Does anyone else have some others?

— John



Re: shading/relocating 1.8.x?

Posted by Tilman Hausherr <TH...@t-online.de>.
On 25.03.2016 at 17:39, John Hewson wrote:
> Do we have some JIRA issues which identify some of these cases?

https://issues.apache.org/jira/browse/PDFBOX-3265



Re: shading/relocating 1.8.x?

Posted by John Hewson <jo...@jahewson.com>.
> On 23 Mar 2016, at 06:20, Allison, Timothy B. <ta...@mitre.org> wrote:
> 
> All,
>  We've upgraded to 2.0.0 on Tika.  Many thanks again!
>  One of our users is interested in continuing to use the classic/SequentialParser, or at least having it available as a back-off parser for corrupt pdfs [0].

Using the old parser really isn’t a good idea, it’s known to be pretty broken. I think that we would be much better off making sure the new parser can handle truncated files. We already do a lot of repair in the new parser, so this doesn’t seem like too much work? Maybe Andreas can comment further?

Do we have some JIRA issues which identify some of these cases?

— John

>  Would you be willing to distribute a shaded/relocated 1.8.x app so that we could load both 1.8.x and 2.0.0 in the same jvm without collisions?  Or, is there a better solution?

I wouldn’t recommend doing that, because you’re going to be stuck with using 1.8 for everything, not just parsing, at least as far as corrupt/truncated files are concerned.

— John

>  Thank you!
> 
>              Cheers,
> 
>                         Tim
> 
> [0] https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208360#comment-15208360
> 