
Posted to dev@pdfbox.apache.org by Simon Steiner <si...@gmail.com> on 2014/10/10 17:08:33 UTC

2.0

Hi,

 

Could you set a target date for the 2.0 release? What's missing to make a
release?

 

Thanks

Re: 2.0

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Yes, because the interactive forms API changed in 2.0 :-) And the improvement is a hack for a specific case.

Maruan

On 11.10.2014 at 13:21, Tilman Hausherr <TH...@t-online.de> wrote:

> Sure, go ahead. That is one thing that "must" be in 2.0, because you improved it for 1.8 only :-)
> 
> Tilman
> 

Re: 2.0

Posted by Tilman Hausherr <TH...@t-online.de>.

Sure, go ahead. That is one thing that "must" be in 2.0, because you 
improved it for 1.8 only :-)

Tilman


Re: 2.0

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

If no one else wants to work on the interactive forms part, it’ll take me at least another month to implement that correctly, i.e. resolve the shortcomings of the 1.x approach. That’s mainly because appearance generation is documented only in general terms; the details (the ’styles’ used for the appearance, like margins, padding …) are part of the individual implementation of the various tools and are not documented. And we don’t have a good set of test files for the various interactive form aspects.

There is also a dependency on handling characters consistently when generating new content. IMHO we are still limited here when it comes to characters outside the ISO-8859-1 range.

Maruan
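
To illustrate what "outside the ISO-8859-1 range" means in practice, here is a minimal sketch in Python. The helper names `fits_latin1` and `outside_latin1` are hypothetical and only demonstrate the character-range check; they say nothing about how PDFBox itself encodes text when it generates content.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character of `text` can be encoded as ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


def outside_latin1(text: str) -> list[str]:
    """Return the characters of `text` that ISO-8859-1 cannot represent.

    ISO-8859-1 covers exactly the code points U+0000 to U+00FF.
    """
    return [ch for ch in text if ord(ch) > 0xFF]


if __name__ == "__main__":
    for sample in ("Müller", "Łódź", "10 €"):
        print(sample, fits_latin1(sample), outside_latin1(sample))
```

A form value such as "Müller" fits, while "Łódź" or a value containing the euro sign does not, so content of that kind is where a limitation like the one described above would be expected to show up.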




Re: 2.0

Posted by Tilman Hausherr <TH...@t-online.de>.

On 14.10.2014 at 07:59, John Hewson wrote:
> Hi Tilman
>
>> You labeled many a "fix version" which would mean they "must" be fixed for 2.0. One example: PDFBOX-2402 <https://issues.apache.org/jira/browse/PDFBOX-2402> is about a parser improvement related to some bad PDFs that is relevant to one user only (I will fix that one soon, I just need to create a test file, but we could as well live without it). Another example was a color problem I had opened (PDFBOX-2142) which is probably only relevant to rendering advertising flyers.
> Yep, this is the first time we’ve tried release management with JIRA, so the starting point is that all of the issues affecting 2.0 are now scheduled to be fixed in 2.0. Obviously that’s silly, but it forces us to now examine the issues we have and actively defer them to later versions, rather than forgetting about them, or losing them in the hundreds of old 1.8 and earlier issues which don’t apply to 2.0.
>
> Andreas - can we get a 2.1 and 3.0 version in JIRA (for breaking / non-breaking), so that the deferring can begin? These would just be estimates of course, we can always re-defer something to 2.2, etc. The idea being that issues are now actively assigned to releases, so we’re doing release management with JIRA, as well as just bug tracking.

That is a good idea!

Tilman


Re: 2.0

Posted by John Hewson <jo...@jahewson.com>.

Hi Tilman

> You labeled many a "fix version" which would mean they "must" be fixed for 2.0. One example: PDFBOX-2402 <https://issues.apache.org/jira/browse/PDFBOX-2402> is about a parser improvement related to some bad PDFs that is relevant to one user only (I will fix that one soon, I just need to create a test file, but we could as well live without it). Another example was a color problem I had opened (PDFBOX-2142) which is probably only relevant to rendering advertising flyers.

Yep, this is the first time we’ve tried release management with JIRA, so the starting point is that all of the issues affecting 2.0 are now scheduled to be fixed in 2.0. Obviously that’s silly, but it forces us to now examine the issues we have and actively defer them to later versions, rather than forgetting about them, or losing them in the hundreds of old 1.8 and earlier issues which don’t apply to 2.0.

Andreas - can we get a 2.1 and 3.0 version in JIRA (for breaking / non-breaking), so that the deferring can begin? These would just be estimates of course, we can always re-defer something to 2.2, etc. The idea being that issues are now actively assigned to releases, so we’re doing release management with JIRA, as well as just bug tracking.

> We should name maybe 10 issues that "must" be solved before 2.0. I'm thinking about regressions, issues where we are close to success (patterns), and issues where somebody attached his name (with the meaning "I can fix that and I know what has to be done"). And short documentation of what has changed.

My list of “must do’s” is fairly short: resource caching, pattern rendering, and page trees are pretty much it. Breaking API changes are really the only blockers; it’s better to wait a bit longer for 2.0 than to have, say, the next 5 releases make breaking changes to important APIs. (Minor or niche APIs are more flexible.)

Cheers

-- John


Re: 2.0

Posted by Tilman Hausherr <TH...@t-online.de>.

I disagree with this. We fixed or closed about 80 issues this month, but 
most are new issues. The older an issue is, the less likely it is to be 
fixed.

You labeled many a "fix version" which would mean they "must" be fixed 
for 2.0. One example: PDFBOX-2402 
<https://issues.apache.org/jira/browse/PDFBOX-2402> is about a parser 
improvement related to some bad PDFs that is relevant to one user only 
(I will fix that one soon, I just need to create a test file, but we 
could as well live without it). Another example was a color problem I 
had opened (PDFBOX-2142) which is probably only relevant to rendering 
advertising flyers.

2.0 is more and more becoming the "Duke Nukem Forever" of open source. 
I'm also thinking about the new Berlin airport. Although there is one 
difference: the people of "Duke Nukem Forever" and the new Berlin 
airport made the mistake of announcing release dates.

I agree with Simon. 2.0 is already a massive improvement.

We should name maybe 10 issues that "must" be solved before 2.0. I'm 
thinking about regressions, issues where we are close to success 
(patterns), and issues where somebody attached his name (with the 
meaning "I can fix that and I know what has to be done"). And short 
documentation of what has changed.

Tilman

On 11.10.2014 at 04:37, John Hewson wrote:
> Hi All,
>
> I really want to give a better answer to this question, but the JIRA issues were not
> labelled with enough version-related information to allow me to simply view a list
> of issues which are due to be fixed in 2.0.
>
> As you’re probably aware, I went through pretty much all the issues and made sure
> that issues which definitely affect 2.0 had that in their "Affects Version/s" field. I also
> set the "Fix Version/s" for issues which are due to be fixed in 2.0, so for the first time
> we have a way to see which issues are due to be fixed. The end result is here:
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20PDFBOX%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%202.0.0%20ORDER%20BY%20priority%20DESC
>
> So I can now say that we have 166 issues due to be fixed in 2.0. We might want to
> choose to defer some of these (we’ll need to add a “Later” version to JIRA to do that)
> and to maybe take a look at issues which overlap with current development such as
> xrefs, rendering, and parsing.
>
> Cheers
>
> -- John
>
> On 10 Oct 2014, at 11:05, John Hewson <jo...@jahewson.com> wrote:
>
>> Simon,
>>
>> Andreas has the best handle on this, but off the top of my head what we need is to finish
>> making breaking API changes and for the code to have been stable for a while before
>> making a 2.0 release.
>>
>> Improvements and fixes which still need breaking API changes include:
>> 	- Pattern rendering
>> 	- Pages resource caching (significant memory usage issues)
>> 	- Font embedding (particularly TTF)
>> 	- Parsing (Andreas?)
>> 	- Page Tree (needs a complete rewrite)
>> 	- Text extraction on Java 8 (this might end up being a breaking change to the sort)
>>
>> There’s probably more, such as work on AcroForms, and we need to have much better
>> example code for 2.0 due to all the changes.
>>
>> This seems like a good time to explicitly try to make sure that we have JIRA issues open
>> for all outstanding tasks, so that we can track how close 2.0 is to being ready. The stability
>> of the code is a pretty good indicator - we’re not there yet.
>>
>> I’m going to open some JIRA issues. Andreas, Tilman - please open issues for any
>> 2.0 features which you think we need.
>>
>> Thanks
>>
>> -- John
>>

Re: 2.0

Posted by John Hewson <jo...@jahewson.com>.

Hi Simon

> How many of those are actually blockers? Isn't 2.0 already a massive improvement
> over 1.8? It seems most of those could be moved to 2.1 or later. Making
> a beta release would allow for feedback on the current code to fix major issues.

Hopefully not too many. The most important thing is breaking API changes: we’re
trying to make them in 2.0 so that we can stay stable afterwards. We’ve never
tried to use JIRA to schedule issues like this before. Hopefully we can now defer
the less important issues to 2.1 or later in JIRA as soon as we get a “2.1”
version set up there, and for the first time we’ll have a roadmap of what needs to
be done and in which version it’s (estimated) to be released.

-- John

RE: 2.0

Posted by Simon Steiner <si...@gmail.com>.

Hi,

How many of those are actually blockers? Isn't 2.0 already a massive improvement
over 1.8? It seems most of those could be moved to 2.1 or later. Making
a beta release would allow for feedback on the current code to fix major issues.

Thanks

-----Original Message-----
From: John Hewson [mailto:john@jahewson.com] 
Sent: 11 October 2014 03:44
To: dev@pdfbox.apache.org
Subject: Re: 2.0

It's worth point out that there are still 131 issues without a Fix Version,
many of which could apply to 2.0

https://issues.apache.org/jira/issues/?jql=project%20%3D%20PDFBOX%20AND%20re
solution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%20EMPTY%20ORDER%20BY%20
priority%20DESC

We could perhaps try to add Affected + Fix versions for these issues too,
until then the picture is still incomplete.

-- John

On 10 Oct 2014, at 19:37, John Hewson <jo...@jahewson.com> wrote:

> Hi All,
> 
> I really want to give a better answer to this question, but the JIRA 
> issues were not labelled with enough version-related information to 
> allow me to simply view a list of issues which are due to be fixed in 2.0.
> 
> As you're probably aware, I went through pretty much all the issues 
> and made sure that issues which definitely affect 2.0 had that in 
> their "Affects Version/s" field. I also set the "Fix Version/s" for 
> issues which are due to be fixed in 2.0, so for the first time we have a
way to see which issues are due to be fixed. The end result is here:
> 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20PDFBOX%20AN
> D%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%202.0.0%20O
> RDER%20BY%20priority%20DESC
> 
> So I can now say that we have 166 issues due to be fixed in 2.0. We 
> might want to choose to defer some of these (we'll need to add a 
> "Later" version to JIRA to do that) and to maybe take a look at issues 
> which overlap with current development such as xrefs, rendering, and
parsing.
> 
> Cheers
> 
> -- John
> 
> On 10 Oct 2014, at 11:05, John Hewson <jo...@jahewson.com> wrote:
> 
>> Simon,
>> 
>> Andreas has the best handle on this, but off the top of my head what 
>> we need is to finish making breaking API changes and for the code to 
>> have been stable for a while before making a 2.0 release.
>> 
>> Improvements and fixes which still need breaking API changes include:
>> 	- Pattern rendering
>> 	- Pages resource caching (significant memory usage issues)
>> 	- Font embedding (particularly TTF)
>> 	- Parsing (Andreas?)
>> 	- Page Tree (needs completely re-writing)
>> 	- Text extraction on Java 8 (this might end up being a breaking 
>> change to the sort)
>> 
>> There's probably more, such as work on Acroforms, and we need to have 
>> much better example code for 2.0 due to all the changes.
>> 
>> This seems like a good time to explicitly try to make sure that we 
>> have JIRA issues open for all outstanding tasks, so that we can track 
>> how close 2.0 is to being ready. The stability of the code is a pretty
>> good indicator - we're not there yet.
>> 
>> I'm going to open some JIRA issues. Andreas, Tilman - please open 
>> issues for any
>> 2.0 features which you think we need.
>> 
>> Thanks
>> 
>> -- John
>> 
>> On 10 Oct 2014, at 08:08, Simon Steiner <si...@gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> 
>>> 
>>> Could you set a target date for 2.0 release. What's missing to make 
>>> a release?
>>> 
>>> 
>>> 
>>> Thanks
>>> 
>> 
>

Re: 2.0

Posted by John Hewson <jo...@jahewson.com>.

It’s worth pointing out that there are still 131 issues without a Fix Version, many
of which could apply to 2.0:

https://issues.apache.org/jira/issues/?jql=project%20%3D%20PDFBOX%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%20EMPTY%20ORDER%20BY%20priority%20DESC

We could perhaps try to add Affected + Fix versions for these issues too; until
then, the picture is still incomplete.
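
For readability, the percent-encoded query in the link above can be decoded back into plain JQL. The snippet below is only a small sketch using the Java standard library (Java 10 or later); the class name JqlDecodeSketch is made up for illustration and is not part of PDFBox or JIRA.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox or JIRA: decodes the "jql" query
// parameter value so the filter can be read as plain text.
class JqlDecodeSketch {
    static String decode(String encodedJql) {
        // URLDecoder turns %20 into a space and %3D into '='.
        return URLDecoder.decode(encodedJql, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = "project%20%3D%20PDFBOX%20AND%20resolution%20%3D%20Unresolved"
                + "%20AND%20fixVersion%20%3D%20EMPTY%20ORDER%20BY%20priority%20DESC";
        System.out.println(decode(encoded));
    }
}
```

Replacing EMPTY with 2.0.0 in the decoded filter gives the query behind the other link in this thread.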

-- John

On 10 Oct 2014, at 19:37, John Hewson <jo...@jahewson.com> wrote:

> Hi All,
> 
> I really want to give a better answer to this question, but the JIRA issues were not
> labelled with enough version-related information to allow me to simply view a list
> of issues which are due to be fixed in 2.0.
> 
> As you’re probably aware, I went through pretty much all the issues and made sure
> that issues which definitely affect 2.0 had that in their “Affects Version/s” field. I also
> set the “Fix Version/s” for issues which are due to be fixed in 2.0, so for the first time
> we have a way to see which issues are due to be fixed. The end result is here:
> 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20PDFBOX%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%202.0.0%20ORDER%20BY%20priority%20DESC
> 
> So I can now say that we have 166 issues due to be fixed in 2.0. We might want to
> choose to defer some of these (we’ll need to add a “Later” version to JIRA to do that)
> and to maybe take a look at issues which overlap with current development such as
> xrefs, rendering, and parsing.
> 
> Cheers
> 
> -- John
> 
> On 10 Oct 2014, at 11:05, John Hewson <jo...@jahewson.com> wrote:
> 
>> Simon,
>> 
>> Andreas has the best handle on this, but off the top of my head what we need is to finish
>> making breaking API changes and for the code to have been stable for a while before
>> making a 2.0 release.
>> 
>> Improvements and fixes which still need breaking API changes include:
>> 	- Pattern rendering
>> 	- Pages resource caching (significant memory usage issues)
>> 	- Font embedding (particularly TTF)
>> 	- Parsing (Andreas?)
>> 	- Page Tree (needs completely re-writing)
>> 	- Text extraction on Java 8 (this might end up being a breaking change to the sort)
>> 
>> There’s probably more, such as work on Acroforms, and we need to have much better
>> example code for 2.0 due to all the changes.
>> 
>> This seems like a good time to explicitly try to make sure that we have JIRA issues open
>> for all outstanding tasks, so that we can track how close 2.0 is to being ready. The stability
>> of the code is a pretty good indicator - we’re not there yet.
>> 
>> I’m going to open some JIRA issues. Andreas, Tilman - please open issues for any
>> 2.0 features which you think we need.
>> 
>> Thanks
>> 
>> -- John
>> 
>> On 10 Oct 2014, at 08:08, Simon Steiner <si...@gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> 
>>> 
>>> Could you set a target date for 2.0 release. What's missing to make a
>>> release?
>>> 
>>> 
>>> 
>>> Thanks
>>> 
>> 
>

Re: 2.0

Posted by John Hewson <jo...@jahewson.com>.

Hi All,

I really want to give a better answer to this question, but the JIRA issues were not
labelled with enough version-related information to allow me to simply view a list
of issues which are due to be fixed in 2.0.

As you’re probably aware, I went through pretty much all the issues and made sure
that issues which definitely affect 2.0 had that in their “Affects Version/s” field. I also
set the “Fix Version/s” for issues which are due to be fixed in 2.0, so for the first time
we have a way to see which issues are due to be fixed. The end result is here:

https://issues.apache.org/jira/issues/?jql=project%20%3D%20PDFBOX%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%202.0.0%20ORDER%20BY%20priority%20DESC

So I can now say that we have 166 issues due to be fixed in 2.0. We might want to
choose to defer some of these (we’ll need to add a “Later” version to JIRA to do that)
and to maybe take a look at issues which overlap with current development such as
xrefs, rendering, and parsing.

Cheers

-- John

On 10 Oct 2014, at 11:05, John Hewson <jo...@jahewson.com> wrote:

> Simon,
> 
> Andreas has the best handle on this, but off the top of my head what we need is to finish
> making breaking API changes and for the code to have been stable for a while before
> making a 2.0 release.
> 
> Improvements and fixes which still need breaking API changes include:
> 	- Pattern rendering
> 	- Pages resource caching (significant memory usage issues)
> 	- Font embedding (particularly TTF)
> 	- Parsing (Andreas?)
> 	- Page Tree (needs completely re-writing)
> 	- Text extraction on Java 8 (this might end up being a breaking change to the sort)
> 
> There’s probably more, such as work on Acroforms, and we need to have much better
> example code for 2.0 due to all the changes.
> 
> This seems like a good time to explicitly try to make sure that we have JIRA issues open
> for all outstanding tasks, so that we can track how close 2.0 is to being ready. The stability
> of the code is a pretty good indicator - we’re not there yet.
> 
> I’m going to open some JIRA issues. Andreas, Tilman - please open issues for any
> 2.0 features which you think we need.
> 
> Thanks
> 
> -- John
> 
> On 10 Oct 2014, at 08:08, Simon Steiner <si...@gmail.com> wrote:
> 
>> Hi,
>> 
>> 
>> 
>> Could you set a target date for 2.0 release. What's missing to make a
>> release?
>> 
>> 
>> 
>> Thanks
>> 
>

RE: 2.0

Posted by Andreas Lehmkühler <an...@lehmi.de>.

Hi Tim,

First of all, thanks for the offer; this is highly appreciated!

I already have a first fix for PDFBOX-2441, but there is another issue. I hope
to fix it soon.

I'm just curious: do you run those comparisons manually, or do you plan to
implement a more or less automated test that can be started without much
effort?

BR
Andreas Lehmkühler

> "Allison, Timothy B." <ta...@mitre.org> wrote on 21 October 2014 at 22:19:
>
>
> Hi Tilman,
>   Sounds good.  Should I wait for PDFBOX-2441?
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Tuesday, October 21, 2014 1:42 PM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0
>
> Hi Tim,
>
> 2.0 doesn't seem likely to be released soon... What might be useful again is a
> comparison between seq v non-seq. Andreas recently resolved an issue
> (PDFBOX-2250) that improves the nonSeq parser a lot. Although this isn't
> fully done, a follow-up issue PDFBOX-2441
> <https://issues.apache.org/jira/browse/PDFBOX-2441> has been opened
> which will improve a few more complex files.
>
> Tilman
>
>
>
> On 21.10.2014 at 13:00, Allison, Timothy B. wrote:
> > Been too busy over in Tika-land...just noticing this now.
> >
> > Let me know which comparisons you'd like to run (2.0 v 1.8.x or seq v
> > non-seq).  I won't have time to integrate 2.0 into our Tika PDFParser any
> > time soon (Jeremy Anderson on TIKA-1285 has already started this), but I
> > could easily write a lightweight wrapper around PDFBox's TextStripper +
> > metadata inside of the tika-batch/tika-eval framework.
> >
> > Cheers,
> >
> >        Tim
> > ________________________________________
> > From: Andreas Lehmkühler [andreas@lehmi.de]
> > Sent: Wednesday, October 15, 2014 6:20 AM
> > To: dev@pdfbox.apache.org
> > Subject: Re: 2.0
> >
> > Hi,
> >
> >
> >> Maruan Sahyoun <sa...@fileaffairs.de> wrote on 15 October 2014 at 09:32:
> >>
> >>
> >> What about keeping both for the 2.0 release and phase the old one out for 3
> >> but making the NonSequential the default parser.
> >> Would also give us some time to work with Tim (TIKA) on the test suite.
> > I agree, that's the only thing we can manage in a timely manner.
> >
> >
> >> Maybe we could simplify the variations of PDDocument.load to something like
> >>
> >> PDDocument.load(input, raf, enforce, useLegacyParser) or
> >> PDDocument.load(input, raf, enforce, withSignatureSupport) .
> >>
> >> and introduce PDDocument.load(input) to use the NonSequential
> >>
> >>
> >> WDYT?
> > Good idea, I've already created PDFBOX-2430 for this.
> >
> >> Maruan
> >
> > BR
> > Andreas Lehmkühler
> >> On 15.10.2014 at 09:18, Timo Boehme <ti...@ontochem.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> the difference between the parsers stems from the fact that the old parser
> >>> can cope with a completely broken xref table because it uses the objects as
> >>> it finds them on its sequential way. What we need (as I proposed before) is
> >>> a repair mechanism scanning the file for object start/end to be used for
> >>> re-creating the xref table.
> >>> I will see if I can find some time to do this.
> >>>
> >>> The only other stopper is, as Andreas has pointed out, the signing. I'm not
> >>> familiar with this and don't know what needs to be done here.
> >>>
> >>>
> >>> Best,
> >>> Timo
> >>>
> >>>
> >>> On 14.10.2014 at 21:18, Tilman Hausherr wrote:
> >>>> Here are some:
> >>>>
> >>>> 055/055794.pdf
> >>>> 082/082463.pdf
> >>>> 108/108362.pdf
> >>>> 113/113223.pdf
> >>>> 115/115458.pdf
> >>>> 115/115463.pdf
> >>>> 122/122393.pdf
> >>>> 129/129416.pdf
> >>>> 133/133423.pdf
> >>>> 148/148020.pdf
> >>>> 152/152012.pdf
> >>>> 161/161466.pdf
> >>>>
> >>>> to be found here:
> >>>> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
> >>>>
> >>>> Tilman
> >>>>
> >>>> On 14.10.2014 at 21:06, John Hewson wrote:
> >>>>> Unless somebody provides us with a list of those files, then I think
> >>>>> this is an unreasonable request. As long as we continue to leave the
> >>>>> old parser in PDFBox, we won't get the bug reports which we need to
> >>>>> fix the new parser, and the situation will never resolve itself.
> >>>>> Falling back to the old parser is just as bad - we won't get bug
> >>>>> reports.
> >>>>>
> >>>>> -- John
> >>>>>
> >>>>> On 14 Oct 2014, at 07:39, Tilman Hausherr <TH...@t-online.de> wrote:
> >>>>>
> >>>>>> I prefer that the "old" parser not be removed, because there are many
> >>>>>> files that can only be parsed by the old parser. This came out in a
> >>>>>> large scale test with TIKA.
> >>>>>>
> >>>>>> The best idea (in my current opinion) is to use the nonSeq parser
> >>>>>> first, and the old parser if there is an exception.
> >>>>>>
> >>>>>> Tilman
> >>>>>>
> >>>>>> On 14.10.2014 at 09:45, Timo Boehme wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> On 14.10.2014 at 07:22, John Hewson wrote:
> >>>>>>>> Hi,
> >>>>>>>>>> John Hewson <jo...@jahewson.com> wrote on 10 October 2014 at 20:05:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>          - Parsing (Andreas?)
> >>>>>>>>> I guess we won't get a complete new parser in 2.0, but I try to
> >>>>>>>>> improve the XRef
> >>>>>>>>> and the COSStream stuff
> >>>>>>>> It would be great if we could get rid of the old parser and switch
> >>>>>>>> to the non-sequential
> >>>>>>>> parser, WDYT?
> >>>>>>> I would also propose to completely remove the old parser. That way
> >>>>>>> we are more flexible in parsing streams etc. since parts of the
> >>>>>>> non-sequential parser are a compromise to work side-by-side with the
> >>>>>>> old parser.
> >>>>>>> Possibly there are a small number of functions for which the old
> >>>>>>> parser is still needed - e.g. signing?
> >>>>>>>
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Timo
> >>>>>>>
> >>>>>>>
> >>>
> >>> --
> >>>
> >>> Timo Boehme
> >>> OntoChem GmbH
> >>> H.-Damerow-Str. 4
> >>> 06120 Halle/Saale
> >>> T: +49 345 4780474
> >>> F: +49 345 4780471
> >>> timo.boehme@ontochem.com
> >>>
> >>> _____________________________________________________________________
> >>>
> >>> OntoChem GmbH
> >>> Geschäftsführer: Dr. Lutz Weber
> >>> Sitz: Halle / Saale
> >>> Registergericht: Stendal
> >>> Registernummer: HRB 215461
> >>> _____________________________________________________________________
> >>>
>
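
The repair mechanism Timo describes in the quoted text above (scanning the file for object starts in order to re-create the xref table) can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not PDFBox code: the class and method names are hypothetical, it ignores objects stored inside object streams, and it may report false positives when the byte sequence "N G obj" happens to occur inside stream data.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: brute-force scan for "<objNum> <genNum> obj" markers.
class XrefRebuildSketch {
    // Digits, whitespace, digits, whitespace, then the keyword "obj".
    // The lookbehind keeps the match from starting in the middle of a token.
    private static final Pattern OBJ_START =
            Pattern.compile("(?<![0-9A-Za-z])(\\d+)\\s+(\\d+)\\s+obj\\b");

    /** Maps "objNum genNum" to the byte offset at which that object starts. */
    static Map<String, Long> scan(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char,
        // so char indices in the string equal byte offsets in the file.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, Long> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_START.matcher(text);
        while (m.find()) {
            // A later definition of the same object replaces an earlier one,
            // mirroring how incremental updates override older objects.
            offsets.put(m.group(1) + " " + m.group(2), (long) m.start(1));
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] pdf = ("%PDF-1.4\n"
                + "1 0 obj\n<< /Type /Catalog >>\nendobj\n"
                + "2 0 obj\n<< /Type /Pages >>\nendobj\n")
                .getBytes(StandardCharsets.ISO_8859_1);
        scan(pdf).forEach((key, offset) -> System.out.println(key + " -> " + offset));
    }
}
```

A real repair pass would still have to locate the trailer and the document catalog; the sketch only covers the offset table itself.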

RE: 2.0

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Hi Tilman,
  Sounds good.  Should I wait for PDFBOX-2441?

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Tuesday, October 21, 2014 1:42 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0

Hi Tim,

2.0 doesn't seem likely to be released soon... What might be useful again is a
comparison between seq v non-seq. Andreas recently resolved an issue
(PDFBOX-2250) that improves the nonSeq parser a lot. Although this isn't 
fully done, a follow-up issue PDFBOX-2441 
<https://issues.apache.org/jira/browse/PDFBOX-2441> has been opened 
which will improve a few more complex files.

Tilman



On 21.10.2014 at 13:00, Allison, Timothy B. wrote:
> Been too busy over in Tika-land...just noticing this now.
>
> Let me know which comparisons you'd like to run (2.0 v 1.8.x or seq v non-seq).  I won't have time to integrate 2.0 into our Tika PDFParser any time soon (Jeremy Anderson on TIKA-1285 has already started this), but I could easily write a lightweight wrapper around PDFBox's TextStripper + metadata inside of the tika-batch/tika-eval framework.
>
> Cheers,
>
>        Tim
> ________________________________________
> From: Andreas Lehmkühler [andreas@lehmi.de]
> Sent: Wednesday, October 15, 2014 6:20 AM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0
>
> Hi,
>
>
>> Maruan Sahyoun <sa...@fileaffairs.de> wrote on 15 October 2014 at 09:32:
>>
>>
>> What about keeping both for the 2.0 release and phase the old one out for 3
>> but making the NonSequential the default parser.
>> Would also give us some time to work with Tim (TIKA) on the test suite.
> I agree, that's the only thing we can manage in a timely manner.
>
>
>> Maybe we could simplify the variations of PDDocument.load to something like
>>
>> PDDocument.load(input, raf, enforce, useLegacyParser) or
>> PDDocument.load(input, raf, enforce, withSignatureSupport) .
>>
>> and introduce PDDocument.load(input) to use the NonSequential
>>
>>
>> WDYT?
> Good idea, I've already created PDFBOX-2430 for this.
>
>> Maruan
>
> BR
> Andreas Lehmkühler
>> On 15.10.2014 at 09:18, Timo Boehme <ti...@ontochem.com> wrote:
>>
>>> Hi,
>>>
>>> the difference between the parsers stems from the fact that the old parser
>>> can cope with a completely broken xref table because it uses the objects as
>>> it finds them on its sequential way. What we need (as I proposed before) is
>>> a repair mechanism scanning the file for object start/end to be used for
>>> re-creating the xref table.
>>> I will see if I can find some time to do this.
>>>
>>> The only other stopper is, as Andreas has pointed out, the signing. I'm not
>>> familiar with this and don't know what needs to be done here.
>>>
>>>
>>> Best,
>>> Timo
>>>
>>>
>>> On 14.10.2014 at 21:18, Tilman Hausherr wrote:
>>>> Here are some:
>>>>
>>>> 055/055794.pdf
>>>> 082/082463.pdf
>>>> 108/108362.pdf
>>>> 113/113223.pdf
>>>> 115/115458.pdf
>>>> 115/115463.pdf
>>>> 122/122393.pdf
>>>> 129/129416.pdf
>>>> 133/133423.pdf
>>>> 148/148020.pdf
>>>> 152/152012.pdf
>>>> 161/161466.pdf
>>>>
>>>> to be found here:
>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
>>>>
>>>> Tilman
>>>>
>>>> On 14.10.2014 at 21:06, John Hewson wrote:
>>>>> Unless somebody provides us with a list of those files, then I think
>>>>> this is an unreasonable request. As long as we continue to leave the
>>>>> old parser in PDFBox, we won't get the bug reports which we need to
>>>>> fix the new parser, and the situation will never resolve itself.
>>>>> Falling back to the old parser is just as bad - we won't get bug reports.
>>>>>
>>>>> -- John
>>>>>
>>>>> On 14 Oct 2014, at 07:39, Tilman Hausherr <TH...@t-online.de> wrote:
>>>>>
>>>>>> I prefer that the "old" parser not be removed, because there are many
>>>>>> files that can only be parsed by the old parser. This came out in a
>>>>>> large scale test with TIKA.
>>>>>>
>>>>>> The best idea (in my current opinion) is to use the nonSeq parser
>>>>>> first, and the old parser if there is an exception.
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>> On 14.10.2014 at 09:45, Timo Boehme wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 14.10.2014 at 07:22, John Hewson wrote:
>>>>>>>> Hi,
>>>>>>>>>> John Hewson <jo...@jahewson.com> wrote on 10 October 2014 at 20:05:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>          - Parsing (Andreas?)
>>>>>>>>> I guess we won't get a complete new parser in 2.0, but I try to
>>>>>>>>> improve the XRef
>>>>>>>>> and the COSStream stuff
>>>>>>>> It would be great if we could get rid of the old parser and switch
>>>>>>>> to the non-sequential
>>>>>>>> parser, WDYT?
>>>>>>> I would also propose to completely remove the old parser. That way
>>>>>>> we are more flexible in parsing streams etc. since parts of the
>>>>>>> non-sequential parser are a compromise to work side-by-side with the
>>>>>>> old parser.
>>>>>>> Possibly there are a small number of functions for which the old
>>>>>>> parser is still needed - e.g. signing?
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>> Timo
>>>>>>>
>>>>>>>
>>>
>>> --
>>>
>>> Timo Boehme
>>> OntoChem GmbH
>>> H.-Damerow-Str. 4
>>> 06120 Halle/Saale
>>> T: +49 345 4780474
>>> F: +49 345 4780471
>>> timo.boehme@ontochem.com
>>>
>>> _____________________________________________________________________
>>>
>>> OntoChem GmbH
>>> Geschäftsführer: Dr. Lutz Weber
>>> Sitz: Halle / Saale
>>> Registergericht: Stendal
>>> Registernummer: HRB 215461
>>> _____________________________________________________________________
>>>

RE: 2.0

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Maruan,
  Sounds good.  I'll add it to my todo list to write the wrapper...probably be good for me to start moving to 2.0 anyways. :)

-----Original Message-----
From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
Sent: Tuesday, October 21, 2014 1:50 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0

Tim, 

First, many thanks for the offer. I'd add that a comparison between 1.8 and 2.0 would be useful too, to detect differences, whether they are due to enhancements or regressions.

BR
Maruan


On 21.10.2014 at 19:42, Tilman Hausherr <TH...@t-online.de> wrote:

> Hi Tim,
> 
> 2.0 doesn't seem likely to be released soon... What might be useful again is a comparison between seq v non-seq. Andreas recently resolved an issue (PDFBOX-2250) that improves the nonSeq parser a lot. Although this isn't fully done, a follow-up issue PDFBOX-2441 <https://issues.apache.org/jira/browse/PDFBOX-2441> has been opened which will improve a few more complex files.
> 
> Tilman
> 
> 
> 
> On 21.10.2014 at 13:00, Allison, Timothy B. wrote:
>> Been too busy over in Tika-land...just noticing this now.
>> 
>> Let me know which comparisons you'd like to run (2.0 v 1.8.x or seq v non-seq).  I won't have time to integrate 2.0 into our Tika PDFParser any time soon (Jeremy Anderson on TIKA-1285 has already started this), but I could easily write a lightweight wrapper around PDFBox's TextStripper + metadata inside of the tika-batch/tika-eval framework.
>> 
>> Cheers,
>> 
>>       Tim
>> ________________________________________
>> From: Andreas Lehmkühler [andreas@lehmi.de]
>> Sent: Wednesday, October 15, 2014 6:20 AM
>> To: dev@pdfbox.apache.org
>> Subject: Re: 2.0
>> 
>> Hi,
>> 
>> 
>>> Maruan Sahyoun <sa...@fileaffairs.de> wrote on 15 October 2014 at 09:32:
>>> 
>>> 
>>> What about keeping both for the 2.0 release and phase the old one out for 3
>>> but making the NonSequential the default parser.
>>> Would also give us some time to work with Tim (TIKA) on the test suite.
>> I agree, that's the only thing we can manage in a timely manner.
>> 
>> 
>>> Maybe we could simplify the variations of PDDocument.load to something like
>>> 
>>> PDDocument.load(input, raf, enforce, useLegacyParser) or
>>> PDDocument.load(input, raf, enforce, withSignatureSupport) .
>>> 
>>> and introduce PDDocument.load(input) to use the NonSequential
>>> 
>>> 
>>> WDYT?
>> Good idea, I've already created PDFBOX-2430 for this.
>> 
>>> Maruan
>> 
>> BR
>> Andreas Lehmkühler
>>> On 15.10.2014 at 09:18, Timo Boehme <ti...@ontochem.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> the difference between the parsers stems from the fact that the old parser
>>>> can cope with a completely broken xref table because it uses the objects as
>>>> it finds them on its sequential way. What we need (as I proposed before) is
>>>> a repair mechanism scanning the file for object start/end to be used for
>>>> re-creating the xref table.
>>>> I will see if I can find some time to do this.
>>>> 
>>>> The only other stopper is, as Andreas has pointed out, the signing. I'm not
>>>> familiar with this and don't know what needs to be done here.
>>>> 
>>>> 
>>>> Best,
>>>> Timo
>>>> 
>>>> 
>>>> On 14.10.2014 at 21:18, Tilman Hausherr wrote:
>>>>> Here are some:
>>>>> 
>>>>> 055/055794.pdf
>>>>> 082/082463.pdf
>>>>> 108/108362.pdf
>>>>> 113/113223.pdf
>>>>> 115/115458.pdf
>>>>> 115/115463.pdf
>>>>> 122/122393.pdf
>>>>> 129/129416.pdf
>>>>> 133/133423.pdf
>>>>> 148/148020.pdf
>>>>> 152/152012.pdf
>>>>> 161/161466.pdf
>>>>> 
>>>>> to be found here:
>>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
>>>>> 
>>>>> Tilman
>>>>> 
>>>>> On 14.10.2014 at 21:06, John Hewson wrote:
>>>>>> Unless somebody provides us with a list of those files, then I think
>>>>>> this is an unreasonable request. As long as we continue to leave the
>>>>>> old parser in PDFBox, we won't get the bug reports which we need to
>>>>>> fix the new parser, and the situation will never resolve itself.
>>>>>> Falling back to the old parser is just as bad - we won't get bug reports.
>>>>>> 
>>>>>> -- John
>>>>>> 
>>>>>> On 14 Oct 2014, at 07:39, Tilman Hausherr <TH...@t-online.de> wrote:
>>>>>> 
>>>>>>> I prefer that the "old" parser not be removed, because there are many
>>>>>>> files that can only be parsed by the old parser. This came out in a
>>>>>>> large scale test with TIKA.
>>>>>>> 
>>>>>>> The best idea (in my current opinion) is to use the nonSeq parser
>>>>>>> first, and the old parser if there is an exception.
>>>>>>> 
>>>>>>> Tilman
>>>>>>> 
>>>>>>> On 14.10.2014 at 09:45, Timo Boehme wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> On 14.10.2014 at 07:22, John Hewson wrote:
>>>>>>>>> Hi,
>>>>>>>>>>> John Hewson <jo...@jahewson.com> wrote on 10 October 2014 at 20:05:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>         - Parsing (Andreas?)
>>>>>>>>>> I guess we won't get a complete new parser in 2.0, but I try to
>>>>>>>>>> improve the XRef
>>>>>>>>>> and the COSStream stuff
>>>>>>>>> It would be great if we could get rid of the old parser and switch
>>>>>>>>> to the non-sequential
>>>>>>>>> parser, WDYT?
>>>>>>>> I would also propose to completely remove the old parser. That way
>>>>>>>> we are more flexible in parsing streams etc. since parts of the
>>>>>>>> non-sequential parser are a compromise to work side-by-side with the
>>>>>>>> old parser.
>>>>>>>> Possibly there are a small number of functions for which the old
>>>>>>>> parser is still needed - e.g. signing?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Timo
>>>>>>>> 
>>>>>>>> 
>>>> 
>>>> --
>>>> 
>>>> Timo Boehme
>>>> OntoChem GmbH
>>>> H.-Damerow-Str. 4
>>>> 06120 Halle/Saale
>>>> T: +49 345 4780474
>>>> F: +49 345 4780471
>>>> timo.boehme@ontochem.com
>>>> 
>>>> _____________________________________________________________________
>>>> 
>>>> OntoChem GmbH
>>>> Geschäftsführer: Dr. Lutz Weber
>>>> Sitz: Halle / Saale
>>>> Registergericht: Stendal
>>>> Registernummer: HRB 215461
>>>> _____________________________________________________________________
>>>> 
>

Re: 2.0

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Tim, 

First, many thanks for the offer. I’d add that a comparison between 1.8 and 2.0 would be useful too, to detect differences, whether they are due to enhancements or regressions.

BR
Maruan


On 21.10.2014 at 19:42, Tilman Hausherr <TH...@t-online.de> wrote:

> Hi Tim,
> 
> 2.0 doesn't seem likely to be released soon... What might be useful again is a comparison between seq v non-seq. Andreas recently resolved an issue (PDFBOX-2250) that improves the nonSeq parser a lot. Although this isn't fully done, a follow-up issue PDFBOX-2441 <https://issues.apache.org/jira/browse/PDFBOX-2441> has been opened which will improve a few more complex files.
> 
> Tilman
> 
> 
> 
> On 21.10.2014 at 13:00, Allison, Timothy B. wrote:
>> Been too busy over in Tika-land...just noticing this now.
>> 
>> Let me know which comparisons you'd like to run (2.0 v 1.8.x or seq v non-seq).  I won't have time to integrate 2.0 into our Tika PDFParser any time soon (Jeremy Anderson on TIKA-1285 has already started this), but I could easily write a lightweight wrapper around PDFBox's TextStripper + metadata inside of the tika-batch/tika-eval framework.
>> 
>> Cheers,
>> 
>>       Tim
>> ________________________________________
>> From: Andreas Lehmkühler [andreas@lehmi.de]
>> Sent: Wednesday, October 15, 2014 6:20 AM
>> To: dev@pdfbox.apache.org
>> Subject: Re: 2.0
>> 
>> Hi,
>> 
>> 
>>> Maruan Sahyoun <sa...@fileaffairs.de> wrote on 15 October 2014 at 09:32:
>>> 
>>> 
>>> What about keeping both for the 2.0 release and phase the old one out for 3
>>> but making the NonSequential the default parser.
>>> Would also give us some time to work with Tim (TIKA) on the test suite.
>> I agree, that's the only thing we can manage in a timely manner.
>> 
>> 
>>> Maybe we could simplify the variations of PDDocument.load to something like
>>> 
>>> PDDocument.load(input, raf, enforce, useLegacyParser) or
>>> PDDocument.load(input, raf, enforce, withSignatureSupport) …
>>> 
>>> and introduce PDDocument.load(input) to use the NonSequential
>>> 
>>> 
>>> WDYT?
>> Good idea, I've already created PDFBOX-2430 for this.
>> 
>>> Maruan
>> 
>> BR
>> Andreas Lehmkühler
>>> On 15.10.2014 at 09:18, Timo Boehme <ti...@ontochem.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> the difference between the parsers stems from the fact that the old parser
>>>> can cope with a completely broken xref table because it uses the objects as
>>>> it finds them on its sequential way. What we need (as I proposed before) is
>>>> a repair mechanism scanning the file for object start/end to be used for
>>>> re-creating the xref table.
>>>> I will see if I can find some time to do this.
>>>> 
>>>> The only other stopper is, as Andreas has pointed out, the signing. I'm not
>>>> familiar with this and don't know what needs to be done here.
>>>> 
>>>> 
>>>> Best,
>>>> Timo
>>>> 
>>>> 
>>>> On 14.10.2014 at 21:18, Tilman Hausherr wrote:
>>>>> Here are some:
>>>>> 
>>>>> 055/055794.pdf
>>>>> 082/082463.pdf
>>>>> 108/108362.pdf
>>>>> 113/113223.pdf
>>>>> 115/115458.pdf
>>>>> 115/115463.pdf
>>>>> 122/122393.pdf
>>>>> 129/129416.pdf
>>>>> 133/133423.pdf
>>>>> 148/148020.pdf
>>>>> 152/152012.pdf
>>>>> 161/161466.pdf
>>>>> 
>>>>> to be found here:
>>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
>>>>> 
>>>>> Tilman
>>>>> 
>>>>> On 14.10.2014 at 21:06, John Hewson wrote:
>>>>>> Unless somebody provides us with a list of those files, then I think
>>>>>> this is an unreasonable request. As long as we continue to leave the
>>>>>> old parser in PDFBox, we won’t get the bug reports which we need to
>>>>>> fix the new parser, and the situation will never resolve itself.
>>>>>> Falling back to the old parser is just as bad - we won’t get bug reports.
>>>>>> 
>>>>>> -- John
>>>>>> 
>>>>>> On 14 Oct 2014, at 07:39, Tilman Hausherr <TH...@t-online.de> wrote:
>>>>>> 
>>>>>>> I prefer that the "old" parser not be removed, because there are many
>>>>>>> files that can only be parsed by the old parser. This came out in a
>>>>>>> large scale test with TIKA.
>>>>>>> 
>>>>>>> The best idea (in my current opinion) is to use the nonSeq parser
>>>>>>> first, and the old parser if there is an exception.
>>>>>>> 
>>>>>>> Tilman
>>>>>>> 
>>>>>>> On 14.10.2014 at 09:45, Timo Boehme wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> On 14.10.2014 at 07:22, John Hewson wrote:
>>>>>>>>> Hi,
>>>>>>>>>>> John Hewson <jo...@jahewson.com> wrote on 10 October 2014 at 20:05:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>         - Parsing (Andreas?)
>>>>>>>>>> I guess we won't get a complete new parser in 2.0, but I try to
>>>>>>>>>> improve the XRef
>>>>>>>>>> and the COSStream stuff
>>>>>>>>> It would be great if we could get rid of the old parser and switch
>>>>>>>>> to the non-sequential
>>>>>>>>> parser, WDYT?
>>>>>>>> I would also propose to completely remove the old parser. That way
>>>>>>>> we are more flexible in parsing streams etc. since parts of the
>>>>>>>> non-sequential parser are a compromise to work side-by-side with the
>>>>>>>> old parser.
>>>>>>>> Possibly there are a small number of functions for which the old
>>>>>>>> parser is still needed - e.g. signing?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Timo
>>>>>>>> 
>>>>>>>> 
>>>> 
>>>> --
>>>> 
>>>> Timo Boehme
>>>> OntoChem GmbH
>>>> H.-Damerow-Str. 4
>>>> 06120 Halle/Saale
>>>> T: +49 345 4780474
>>>> F: +49 345 4780471
>>>> timo.boehme@ontochem.com
>>>> 
>>>> _____________________________________________________________________
>>>> 
>>>> OntoChem GmbH
>>>> Geschäftsführer: Dr. Lutz Weber
>>>> Sitz: Halle / Saale
>>>> Registergericht: Stendal
>>>> Registernummer: HRB 215461
>>>> _____________________________________________________________________
>>>> 
>

Re: 2.0

Posted by Tilman Hausherr <TH...@t-online.de>.

Hi Tim,

2.0 doesn't seem likely to be released soon... What might be useful again is a
comparison between seq v non-seq. Andreas recently resolved an issue
(PDFBOX-2250) that improves the nonSeq parser a lot. Although this isn't
fully done, a follow-up issue PDFBOX-2441
<https://issues.apache.org/jira/browse/PDFBOX-2441> has been opened
which will improve a few more complex files.

Tilman



On 21.10.2014 at 13:00, Allison, Timothy B. wrote:
> Been too busy over in Tika-land...just noticing this now.
>
> Let me know which comparisons you'd like to run (2.0 v 1.8.x or seq v non-seq).  I won't have time to integrate 2.0 into our Tika PDFParser any time soon (Jeremy Anderson on TIKA-1285 has already started this), but I could easily write a lightweight wrapper around PDFBox's TextStripper + metadata inside of the tika-batch/tika-eval framework.
>
> Cheers,
>
>        Tim
> ________________________________________
> From: Andreas Lehmkühler [andreas@lehmi.de]
> Sent: Wednesday, October 15, 2014 6:20 AM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0
>
> Hi,
>
>
>> Maruan Sahyoun <sa...@fileaffairs.de> wrote on 15 October 2014 at 09:32:
>>
>>
>> What about keeping both for the 2.0 release and phase the old one out for 3
>> but making the NonSequential the default parser.
>> Would also give us some time to work with Tim (TIKA) on the test suite.
> I agree, that's the only thing we can manage in a timely manner.
>
>
>> Maybe we could simplify the variations of PDDocument.load to something like
>>
>> PDDocument.load(input, raf, enforce, useLegacyParser) or
>> PDDocument.load(input, raf, enforce, withSignatureSupport) …
>>
>> and introduce PDDocument.load(input) to use the NonSequential
>>
>>
>> WDYT?
> Good idea, I've already created PDFBOX-2430 for this.
>
>> Maruan
>
> BR
> Andreas Lehmkühler
>> On 15.10.2014 at 09:18, Timo Boehme <ti...@ontochem.com> wrote:
>>
>>> Hi,
>>>
>>> the difference between the parsers stems from the fact that the old parser
>>> can cope with a completely broken xref table because it uses the objects as
>>> it finds them on its sequential way. What we need (as I proposed before) is
>>> a repair mechanism scanning the file for object start/end to be used for
>>> re-creating the xref table.
>>> I will see if I can find some time to do this.
>>>
>>> The only other stopper is as Andreas has pointed out the signing. I'm not
>>> familiar with this and don't know what needs to be done here.
>>>
>>>
>>> Best,
>>> Timo
>>>
>>>
>>> Am 14.10.2014 um 21:18 schrieb Tilman Hausherr:
>>>> Here are some:
>>>>
>>>> 055/055794.pdf
>>>> 082/082463.pdf
>>>> 108/108362.pdf
>>>> 113/113223.pdf
>>>> 115/115458.pdf
>>>> 115/115463.pdf
>>>> 122/122393.pdf
>>>> 129/129416.pdf
>>>> 133/133423.pdf
>>>> 148/148020.pdf
>>>> 152/152012.pdf
>>>> 161/161466.pdf
>>>>
>>>> to be found here:
>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
>>>>
>>>> Tilman
>>>>
>>>> Am 14.10.2014 um 21:06 schrieb John Hewson:
>>>>> Unless somebody provides us with a list of those files, then I think
>>>>> this is an unreasonable request. As long as we continue to leave the
>>>>> old parser in PDFBox, we won’t get the bug reports which we need to
>>>>> fix the new parser, and the situation will never resolve itself.
>>>>> Falling back to the old parser is just as bad - we won’t get bug reports.
>>>>>
>>>>> -- John
>>>>>
>>>>> On 14 Oct 2014, at 07:39, Tilman Hausherr <TH...@t-online.de> wrote:
>>>>>
>>>>>> I prefer that the "old" parser not be removed, because there are many
>>>>>> files that can only be parsed by the old parser. This came out in a
>>>>>> large scale test with TIKA.
>>>>>>
>>>>>> The best idea (in my current opinion) is to use the nonSeq parser
>>>>>> first, and the old parser if there is an exception.
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>> Am 14.10.2014 um 09:45 schrieb Timo Boehme:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Am 14.10.2014 um 07:22 schrieb John Hewson:
>>>>>>>> Hi,
>>>>>>>>>> John Hewson <jo...@jahewson.com> hat am 10. Oktober 2014 um 20:05
>>>>>>>>>> geschrieben:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>          - Parsing (Andreas?)
>>>>>>>>> I guess we won't get a complete new parser in 2.0, but I try to
>>>>>>>>> improve the XRef
>>>>>>>>> and the COSStream stuff
>>>>>>>> It would be great if we could get rid of the old parser and switch
>>>>>>>> to the non-sequential
>>>>>>>> parser, WDYT?
>>>>>>> I would also propose to completely remove the old parser. That way
>>>>>>> we are more flexible in parsing streams etc. since parts of the
>>>>>>> non-sequential parser are a compromise to work side-by-side with the
>>>>>>> old parser.
>>>>>>> Possibly there are a small number of functions for which the old
>>>>>>> parser is still needed - e.g. signing?
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>> Timo
>>>>>>>
>>>>>>>
>>>
>>> --
>>>
>>> Timo Boehme
>>> OntoChem GmbH
>>> H.-Damerow-Str. 4
>>> 06120 Halle/Saale
>>> T: +49 345 4780474
>>> F: +49 345 4780471
>>> timo.boehme@ontochem.com
>>>
>>> _____________________________________________________________________
>>>
>>> OntoChem GmbH
>>> Geschäftsführer: Dr. Lutz Weber
>>> Sitz: Halle / Saale
>>> Registergericht: Stendal
>>> Registernummer: HRB 215461
>>> _____________________________________________________________________
>>>

RE: 2.0

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Been too busy over in Tika-land...just noticing this now.

Let me know which comparisons you'd like to run (2.0 v 1.8.x or seq v non-seq).  I won't have time to integrate 2.0 into our Tika PDFParser any time soon (Jeremy Anderson on TIKA-1285 has already started this), but I could easily write a lightweight wrapper around PDFBox's TextStripper + metadata inside of the tika-batch/tika-eval framework.

Cheers,

      Tim
________________________________________
From: Andreas Lehmkühler [andreas@lehmi.de]
Sent: Wednesday, October 15, 2014 6:20 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0

Hi,


> Maruan Sahyoun <sa...@fileaffairs.de> hat am 15. Oktober 2014 um 09:32
> geschrieben:
>
>
> What about keeping both for the 2.0 release and phase the old one out for 3
> but making the NonSequential the default parser.
> Would also give us some time to work with Tim (TIKA) on the test suite.
I agree, that's the only thing we can manage in a timely manner.


> Maybe we could simplify the variations of PDDocument.load to something like
>
> PDDocument.load(input, raf, enforce, useLegacyParser) or
> PDDocument.load(input, raf, enforce, withSignatureSupport) …
>
> and introduce PDDocument.load(input) to use the NonSequential
>
>
> WDYT?
Good idea, I've already created PDFBOX-2430 for this.

>
> Maruan


BR
Andreas Lehmkühler
>
> Am 15.10.2014 um 09:18 schrieb Timo Boehme <ti...@ontochem.com>:
>
> > Hi,
> >
> > the difference between the parsers stems from the fact that the old parser
> > can cope with a completely broken xref table because it uses the objects as
> > it finds them on its sequential way. What we need (as I proposed before) is
> > a repair mechanism scanning the file for object start/end to be used for
> > re-creating the xref table.
> > I will see if I can find some time to do this.
> >
> > The only other stopper is as Andreas has pointed out the signing. I'm not
> > familiar with this and don't know what needs to be done here.
> >
> >
> > Best,
> > Timo
> >
> >
> > Am 14.10.2014 um 21:18 schrieb Tilman Hausherr:
> >> Here are some:
> >>
> >> 055/055794.pdf
> >> 082/082463.pdf
> >> 108/108362.pdf
> >> 113/113223.pdf
> >> 115/115458.pdf
> >> 115/115463.pdf
> >> 122/122393.pdf
> >> 129/129416.pdf
> >> 133/133423.pdf
> >> 148/148020.pdf
> >> 152/152012.pdf
> >> 161/161466.pdf
> >>
> >> to be found here:
> >> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
> >>
> >> Tilman
> >>
> >> Am 14.10.2014 um 21:06 schrieb John Hewson:
> >>> Unless somebody provides us with a list of those files, then I think
> >>> this is an unreasonable request. As long as we continue to leave the
> >>> old parser in PDFBox, we won’t get the bug reports which we need to
> >>> fix the new parser, and the situation will never resolve itself.
> >>> Falling back to the old parser is just as bad - we won’t get bug reports.
> >>>
> >>> -- John
> >>>
> >>> On 14 Oct 2014, at 07:39, Tilman Hausherr <TH...@t-online.de> wrote:
> >>>
> >>>> I prefer that the "old" parser not be removed, because there are many
> >>>> files that can only be parsed by the old parser. This came out in a
> >>>> large scale test with TIKA.
> >>>>
> >>>> The best idea (in my current opinion) is to use the nonSeq parser
> >>>> first, and the old parser if there is an exception.
> >>>>
> >>>> Tilman
> >>>>
> >>>> Am 14.10.2014 um 09:45 schrieb Timo Boehme:
> >>>>> Hi,
> >>>>>
> >>>>> Am 14.10.2014 um 07:22 schrieb John Hewson:
> >>>>>> Hi,
> >>>>>>>> John Hewson <jo...@jahewson.com> hat am 10. Oktober 2014 um 20:05
> >>>>>>>> geschrieben:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>        - Parsing (Andreas?)
> >>>>>>> I guess we won't get a complete new parser in 2.0, but I try to
> >>>>>>> improve the XRef
> >>>>>>> and the COSStream stuff
> >>>>>> It would be great if we could get rid of the old parser and switch
> >>>>>> to the non-sequential
> >>>>>> parser, WDYT?
> >>>>> I would also propose to completely remove the old parser. That way
> >>>>> we are more flexible in parsing streams etc. since parts of the
> >>>>> non-sequential parser are a compromise to work side-by-side with the
> >>>>> old parser.
> >>>>> Possibly there are a small number of functions for which the old
> >>>>> parser is still needed - e.g. signing?
> >>>>>
> >>>>>
> >>>>> Best,
> >>>>> Timo
> >>>>>
> >>>>>
> >>>
> >>
> >
> >
> > --
> >
> > Timo Boehme
> > OntoChem GmbH
> > H.-Damerow-Str. 4
> > 06120 Halle/Saale
> > T: +49 345 4780474
> > F: +49 345 4780471
> > timo.boehme@ontochem.com
> >
> > _____________________________________________________________________
> >
> > OntoChem GmbH
> > Geschäftsführer: Dr. Lutz Weber
> > Sitz: Halle / Saale
> > Registergericht: Stendal
> > Registernummer: HRB 215461
> > _____________________________________________________________________
> >
>

Re: 2.0

Posted by Andreas Lehmkühler <an...@lehmi.de>.

Hi,


> Maruan Sahyoun <sa...@fileaffairs.de> wrote on 15 October 2014 at 09:32:
>
>
> What about keeping both for the 2.0 release and phase the old one out for 3
> but making the NonSequential the default parser.
> Would also give us some time to work with Tim (TIKA) on the test suite.
I agree, that's the only thing we can manage in a timely manner.

 
> Maybe we could simplify the variations of PDDocument.load to something like
>
> PDDocument.load(input, raf, enforce, useLegacyParser) or
> PDDocument.load(input, raf, enforce, withSignatureSupport) …
>
> and introduce PDDocument.load(input) to use the NonSequential
>
>
> WDYT?
Good idea, I've already created PDFBOX-2430 for this.

>
> Maruan


BR
Andreas Lehmkühler
>
> Am 15.10.2014 um 09:18 schrieb Timo Boehme <ti...@ontochem.com>:
>
> > Hi,
> >
> > the difference between the parsers stems from the fact that the old parser
> > can cope with a completely broken xref table because it uses the objects as
> > it finds them on its sequential way. What we need (as I proposed before) is
> > a repair mechanism scanning the file for object start/end to be used for
> > re-creating the xref table.
> > I will see if I can find some time to do this.
> >
> > The only other stopper is as Andreas has pointed out the signing. I'm not
> > familiar with this and don't know what needs to be done here.
> >
> >
> > Best,
> > Timo
> >
> >
> > Am 14.10.2014 um 21:18 schrieb Tilman Hausherr:
> >> Here are some:
> >>
> >> 055/055794.pdf
> >> 082/082463.pdf
> >> 108/108362.pdf
> >> 113/113223.pdf
> >> 115/115458.pdf
> >> 115/115463.pdf
> >> 122/122393.pdf
> >> 129/129416.pdf
> >> 133/133423.pdf
> >> 148/148020.pdf
> >> 152/152012.pdf
> >> 161/161466.pdf
> >>
> >> to be found here:
> >> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
> >>
> >> Tilman
> >>
> >> Am 14.10.2014 um 21:06 schrieb John Hewson:
> >>> Unless somebody provides us with a list of those files, then I think
> >>> this is an unreasonable request. As long as we continue to leave the
> >>> old parser in PDFBox, we won’t get the bug reports which we need to
> >>> fix the new parser, and the situation will never resolve itself.
> >>> Falling back to the old parser is just as bad - we won’t get bug reports.
> >>>
> >>> -- John
> >>>
> >>> On 14 Oct 2014, at 07:39, Tilman Hausherr <TH...@t-online.de> wrote:
> >>>
> >>>> I prefer that the "old" parser not be removed, because there are many
> >>>> files that can only be parsed by the old parser. This came out in a
> >>>> large scale test with TIKA.
> >>>>
> >>>> The best idea (in my current opinion) is to use the nonSeq parser
> >>>> first, and the old parser if there is an exception.
> >>>>
> >>>> Tilman
> >>>>
> >>>> Am 14.10.2014 um 09:45 schrieb Timo Boehme:
> >>>>> Hi,
> >>>>>
> >>>>> Am 14.10.2014 um 07:22 schrieb John Hewson:
> >>>>>> Hi,
> >>>>>>>> John Hewson <jo...@jahewson.com> hat am 10. Oktober 2014 um 20:05
> >>>>>>>> geschrieben:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>        - Parsing (Andreas?)
> >>>>>>> I guess we won't get a complete new parser in 2.0, but I try to
> >>>>>>> improve the XRef
> >>>>>>> and the COSStream stuff
> >>>>>> It would be great if we could get rid of the old parser and switch
> >>>>>> to the non-sequential
> >>>>>> parser, WDYT?
> >>>>> I would also propose to completely remove the old parser. That way
> >>>>> we are more flexible in parsing streams etc. since parts of the
> >>>>> non-sequential parser are a compromise to work side-by-side with the
> >>>>> old parser.
> >>>>> Possibly there are a small number of functions for which the old
> >>>>> parser is still needed - e.g. signing?
> >>>>>
> >>>>>
> >>>>> Best,
> >>>>> Timo
> >>>>>
> >>>>>
> >>>
> >>
> >
> >
> > --
> >
> > Timo Boehme
> > OntoChem GmbH
> > H.-Damerow-Str. 4
> > 06120 Halle/Saale
> > T: +49 345 4780474
> > F: +49 345 4780471
> > timo.boehme@ontochem.com
> >
> > _____________________________________________________________________
> >
> > OntoChem GmbH
> > Geschäftsführer: Dr. Lutz Weber
> > Sitz: Halle / Saale
> > Registergericht: Stendal
> > Registernummer: HRB 215461
> > _____________________________________________________________________
> >
>

Re: 2.0

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

What about keeping both for the 2.0 release and phasing the old one out for 3, but making the NonSequential parser the default?
Would also give us some time to work with Tim (TIKA) on the test suite.

Maybe we could simplify the variations of PDDocument.load to something like 

PDDocument.load(input, raf, enforce, useLegacyParser) or
PDDocument.load(input, raf, enforce, withSignatureSupport) …

and introduce PDDocument.load(input) to use the NonSequential parser.
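
The overload layout proposed above might look something like the following. This is a minimal sketch under stated assumptions, not the real `PDDocument`: `DocumentLoaderSketch`, `ParserKind` and `LoadPlan` are hypothetical names, `raf` is typed as a plain `Object` because its actual type is not given here, and the methods only record which parser would be chosen instead of parsing anything.

```java
import java.io.InputStream;

/** Hypothetical sketch of the proposed overloads; not the real PDDocument. */
final class DocumentLoaderSketch {

    enum ParserKind { NON_SEQUENTIAL, LEGACY }

    /** Records what a call would do; a real loader would parse here. */
    static final class LoadPlan {
        final ParserKind parser;
        final boolean enforce;
        final boolean hasScratchBuffer;

        LoadPlan(ParserKind parser, boolean enforce, boolean hasScratchBuffer) {
            this.parser = parser;
            this.enforce = enforce;
            this.hasScratchBuffer = hasScratchBuffer;
        }
    }

    /** Short form: defaults to the non-sequential parser. */
    static LoadPlan load(InputStream input) {
        return load(input, null, false, false);
    }

    /** Full form mirroring load(input, raf, enforce, useLegacyParser). */
    static LoadPlan load(InputStream input, Object raf, boolean enforce, boolean useLegacyParser) {
        if (input == null) {
            throw new IllegalArgumentException("input must not be null");
        }
        // The legacy parser is opt-in only; everything else goes to the new one.
        ParserKind kind = useLegacyParser ? ParserKind.LEGACY : ParserKind.NON_SEQUENTIAL;
        return new LoadPlan(kind, enforce, raf != null);
    }
}
```

The point of the single delegating overload is that the default lives in one place, so switching the default parser later means changing one argument.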


WDYT?

Maruan

On 15.10.2014 at 09:18, Timo Boehme <ti...@ontochem.com> wrote:

> Hi,
> 
> the difference between the parsers stems from the fact that the old parser can cope with a completely broken xref table because it uses the objects as it finds them on its sequential way. What we need (as I proposed before) is a repair mechanism scanning the file for object start/end to be used for re-creating the xref table.
> I will see if I can find some time to do this.
> 
> The only other stopper is as Andreas has pointed out the signing. I'm not familiar with this and don't know what needs to be done here.
> 
> 
> Best,
> Timo
> 
> 
> Am 14.10.2014 um 21:18 schrieb Tilman Hausherr:
>> Here are some:
>> 
>> 055/055794.pdf
>> 082/082463.pdf
>> 108/108362.pdf
>> 113/113223.pdf
>> 115/115458.pdf
>> 115/115463.pdf
>> 122/122393.pdf
>> 129/129416.pdf
>> 133/133423.pdf
>> 148/148020.pdf
>> 152/152012.pdf
>> 161/161466.pdf
>> 
>> to be found here:
>> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
>> 
>> Tilman
>> 
>> Am 14.10.2014 um 21:06 schrieb John Hewson:
>>> Unless somebody provides us with a list of those files, then I think
>>> this is an unreasonable request. As long as we continue to leave the
>>> old parser in PDFBox, we won’t get the bug reports which we need to
>>> fix the new parser, and the situation will never resolve itself.
>>> Falling back to the old parser is just as bad - we won’t get bug reports.
>>> 
>>> -- John
>>> 
>>> On 14 Oct 2014, at 07:39, Tilman Hausherr <TH...@t-online.de> wrote:
>>> 
>>>> I prefer that the "old" parser not be removed, because there are many
>>>> files that can only be parsed by the old parser. This came out in a
>>>> large scale test with TIKA.
>>>> 
>>>> The best idea (in my current opinion) is to use the nonSeq parser
>>>> first, and the old parser if there is an exception.
>>>> 
>>>> Tilman
>>>> 
>>>> Am 14.10.2014 um 09:45 schrieb Timo Boehme:
>>>>> Hi,
>>>>> 
>>>>> Am 14.10.2014 um 07:22 schrieb John Hewson:
>>>>>> Hi,
>>>>>>>> John Hewson <jo...@jahewson.com> hat am 10. Oktober 2014 um 20:05
>>>>>>>> geschrieben:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>        - Parsing (Andreas?)
>>>>>>> I guess we won't get a complete new parser in 2.0, but I try to
>>>>>>> improve the XRef
>>>>>>> and the COSStream stuff
>>>>>> It would be great if we could get rid of the old parser and switch
>>>>>> to the non-sequential
>>>>>> parser, WDYT?
>>>>> I would also propose to completely remove the old parser. That way
>>>>> we are more flexible in parsing streams etc. since parts of the
>>>>> non-sequential parser are a compromise to work side-by-side with the
>>>>> old parser.
>>>>> Possibly there are a small number of functions for which the old
>>>>> parser is still needed - e.g. signing?
>>>>> 
>>>>> 
>>>>> Best,
>>>>> Timo
>>>>> 
>>>>> 
>>> 
>> 
> 
> 
> -- 
> 
> Timo Boehme
> OntoChem GmbH
> H.-Damerow-Str. 4
> 06120 Halle/Saale
> T: +49 345 4780474
> F: +49 345 4780471
> timo.boehme@ontochem.com
> 
> _____________________________________________________________________
> 
> OntoChem GmbH
> Geschäftsführer: Dr. Lutz Weber
> Sitz: Halle / Saale
> Registergericht: Stendal
> Registernummer: HRB 215461
> _____________________________________________________________________
>

Re: 2.0

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

On 15.10.2014 at 12:12, Andreas Lehmkühler <an...@lehmi.de> wrote:

> 
> 
>> Timo Boehme <ti...@ontochem.com> hat am 15. Oktober 2014 um 09:18
>> geschrieben:
>> 
>> 
>> Hi,
>> 
>> the difference between the parsers stems from the fact that the old
>> parser can cope with a completely broken xref table because it uses the
>> objects as it finds them on its sequential way. What we need (as I
>> proposed before) is a repair mechanism scanning the file for object
>> start/end to be used for re-creating the xref table.
>> I will see if I can find some time to do this.
> I already have a working prototype but I'm not yet happy with the
> implementation.
> 
>> The only other stopper is as Andreas has pointed out the signing. I'm
>> not familiar with this and don't know what needs to be done here.
> Me neither.
> 

If we keep the old parser side by side with the new one, we can look at implementing incremental updates correctly at a later stage, thus supporting not only signing but other important use cases too. That is something we can do behind the scenes.



>> Best,
>> Timo
> 
> BR
> Andreas Lehmkühler
> 
>> Am 14.10.2014 um 21:18 schrieb Tilman Hausherr:
>>> Here are some:
>>> 
>>> 055/055794.pdf
>>> 082/082463.pdf
>>> 108/108362.pdf
>>> 113/113223.pdf
>>> 115/115458.pdf
>>> 115/115463.pdf
>>> 122/122393.pdf
>>> 129/129416.pdf
>>> 133/133423.pdf
>>> 148/148020.pdf
>>> 152/152012.pdf
>>> 161/161466.pdf
>>> 
>>> to be found here:
>>> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
>>> 
>>> Tilman
>>> 
>>> Am 14.10.2014 um 21:06 schrieb John Hewson:
>>>> Unless somebody provides us with a list of those files, then I think
>>>> this is an unreasonable request. As long as we continue to leave the
>>>> old parser in PDFBox, we won’t get the bug reports which we need to
>>>> fix the new parser, and the situation will never resolve itself.
>>>> Falling back to the old parser is just as bad - we won’t get bug reports.
>>>> 
>>>> -- John
>>>> 
>>>> On 14 Oct 2014, at 07:39, Tilman Hausherr <TH...@t-online.de> wrote:
>>>> 
>>>>> I prefer that the "old" parser not be removed, because there are many
>>>>> files that can only be parsed by the old parser. This came out in a
>>>>> large scale test with TIKA.
>>>>> 
>>>>> The best idea (in my current opinion) is to use the nonSeq parser
>>>>> first, and the old parser if there is an exception.
>>>>> 
>>>>> Tilman
>>>>> 
>>>>> Am 14.10.2014 um 09:45 schrieb Timo Boehme:
>>>>>> Hi,
>>>>>> 
>>>>>> Am 14.10.2014 um 07:22 schrieb John Hewson:
>>>>>>> Hi,
>>>>>>>>> John Hewson <jo...@jahewson.com> hat am 10. Oktober 2014 um 20:05
>>>>>>>>> geschrieben:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>          - Parsing (Andreas?)
>>>>>>>> I guess we won't get a complete new parser in 2.0, but I try to
>>>>>>>> improve the XRef
>>>>>>>> and the COSStream stuff
>>>>>>> It would be great if we could get rid of the old parser and switch
>>>>>>> to the non-sequential
>>>>>>> parser, WDYT?
>>>>>> I would also propose to completely remove the old parser. That way
>>>>>> we are more flexible in parsing streams etc. since parts of the
>>>>>> non-sequential parser are a compromise to work side-by-side with the
>>>>>> old parser.
>>>>>> Possibly there are a small number of functions for which the old
>>>>>> parser is still needed - e.g. signing?
>>>>>> 
>>>>>> 
>>>>>> Best,
>>>>>> Timo
>>>>>> 
>>>>>> 
>>>> 
>>> 
>> 
>> 
>> --
>> 
>>    Timo Boehme
>>    OntoChem GmbH
>>    H.-Damerow-Str. 4
>>    06120 Halle/Saale
>>    T: +49 345 4780474
>>    F: +49 345 4780471
>>    timo.boehme@ontochem.com
>> 
>> _____________________________________________________________________
>> 
>>    OntoChem GmbH
>>    Geschäftsführer: Dr. Lutz Weber
>>    Sitz: Halle / Saale
>>    Registergericht: Stendal
>>    Registernummer: HRB 215461
>> _____________________________________________________________________
>>

Re: 2.0

Posted by Andreas Lehmkühler <an...@lehmi.de>.


> Timo Boehme <ti...@ontochem.com> wrote on 15 October 2014 at 09:18:
>
>
> Hi,
>
> the difference between the parsers stems from the fact that the old
> parser can cope with a completely broken xref table because it uses the
> objects as it finds them on its sequential way. What we need (as I
> proposed before) is a repair mechanism scanning the file for object
> start/end to be used for re-creating the xref table.
> I will see if I can find some time to do this.
I already have a working prototype but I'm not yet happy with the
implementation.

> The only other stopper is as Andreas has pointed out the signing. I'm
> not familiar with this and don't know what needs to be done here.
Me neither.

> Best,
> Timo

BR
Andreas Lehmkühler

> Am 14.10.2014 um 21:18 schrieb Tilman Hausherr:
> > Here are some:
> >
> > 055/055794.pdf
> > 082/082463.pdf
> > 108/108362.pdf
> > 113/113223.pdf
> > 115/115458.pdf
> > 115/115463.pdf
> > 122/122393.pdf
> > 129/129416.pdf
> > 133/133423.pdf
> > 148/148020.pdf
> > 152/152012.pdf
> > 161/161466.pdf
> >
> > to be found here:
> > http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
> >
> > Tilman
> >
> > Am 14.10.2014 um 21:06 schrieb John Hewson:
> >> Unless somebody provides us with a list of those files, then I think
> >> this is an unreasonable request. As long as we continue to leave the
> >> old parser in PDFBox, we won’t get the bug reports which we need to
> >> fix the new parser, and the situation will never resolve itself.
> >> Falling back to the old parser is just as bad - we won’t get bug reports.
> >>
> >> -- John
> >>
> >> On 14 Oct 2014, at 07:39, Tilman Hausherr <TH...@t-online.de> wrote:
> >>
> >>> I prefer that the "old" parser not be removed, because there are many
> >>> files that can only be parsed by the old parser. This came out in a
> >>> large scale test with TIKA.
> >>>
> >>> The best idea (in my current opinion) is to use the nonSeq parser
> >>> first, and the old parser if there is an exception.
> >>>
> >>> Tilman
> >>>
> >>> Am 14.10.2014 um 09:45 schrieb Timo Boehme:
> >>>> Hi,
> >>>>
> >>>> Am 14.10.2014 um 07:22 schrieb John Hewson:
> >>>>> Hi,
> >>>>>>> John Hewson <jo...@jahewson.com> hat am 10. Oktober 2014 um 20:05
> >>>>>>> geschrieben:
> >>>>>>>
> >>>>>>>
> >>>>>>>         - Parsing (Andreas?)
> >>>>>> I guess we won't get a complete new parser in 2.0, but I try to
> >>>>>> improve the XRef
> >>>>>> and the COSStream stuff
> >>>>> It would be great if we could get rid of the old parser and switch
> >>>>> to the non-sequential
> >>>>> parser, WDYT?
> >>>> I would also propose to completely remove the old parser. That way
> >>>> we are more flexible in parsing streams etc. since parts of the
> >>>> non-sequential parser are a compromise to work side-by-side with the
> >>>> old parser.
> >>>> Possibly there are a small number of functions for which the old
> >>>> parser is still needed - e.g. signing?
> >>>>
> >>>>
> >>>> Best,
> >>>> Timo
> >>>>
> >>>>
> >>
> >
>
>
> --
>
>   Timo Boehme
>   OntoChem GmbH
>   H.-Damerow-Str. 4
>   06120 Halle/Saale
>   T: +49 345 4780474
>   F: +49 345 4780471
>   timo.boehme@ontochem.com
>
> _____________________________________________________________________
>
>   OntoChem GmbH
>   Geschäftsführer: Dr. Lutz Weber
>   Sitz: Halle / Saale
>   Registergericht: Stendal
>   Registernummer: HRB 215461
> _____________________________________________________________________
>

Re: 2.0

Posted by Timo Boehme <ti...@ontochem.com>.

Hi,

the difference between the parsers stems from the fact that the old 
parser can cope with a completely broken xref table, because it uses the 
objects as it finds them while reading the file sequentially. What we need 
(as I proposed before) is a repair mechanism that scans the file for object 
start/end markers and uses them to re-create the xref table.
I will see if I can find some time to do this.
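
To illustrate the idea only, such a scan could be sketched as below. This is a minimal sketch, not the PDFBox implementation: `XrefRebuildSketch` and `scanObjectOffsets` are hypothetical names, and the code assumes the whole file fits in memory. It looks for `<number> <generation> obj` headers and records the byte offset of each one, letting later occurrences override earlier ones. It ignores object streams, cross-reference streams, and the possibility that binary stream data happens to contain something that looks like an object header.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of rebuilding object offsets by brute-force scanning. */
final class XrefRebuildSketch {

    // "<object number> <generation> obj"; digit counts are capped so parsing cannot overflow.
    private static final Pattern OBJ_HEADER =
            Pattern.compile("\\b(\\d{1,10})\\s+(\\d{1,5})\\s+obj\\b");

    /** Maps object number to the byte offset of its last "n g obj" header. */
    static Map<Long, Long> scanObjectOffsets(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so char index == byte offset.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<Long, Long> offsets = new TreeMap<>();
        Matcher m = OBJ_HEADER.matcher(text);
        while (m.find()) {
            long objectNumber = Long.parseLong(m.group(1));
            // A later definition replaces an earlier one for the same object number.
            offsets.put(objectNumber, (long) m.start(1));
        }
        return offsets;
    }
}
```

The resulting map holds the same information an xref table would provide for uncompressed objects, which is what the non-sequential parser needs in order to jump directly to an object.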

The only other stopper is, as Andreas has pointed out, the signing. I'm 
not familiar with this and don't know what needs to be done here.


Best,
Timo


On 14.10.2014 at 21:18, Tilman Hausherr wrote:
> Here are some:
>
> 055/055794.pdf
> 082/082463.pdf
> 108/108362.pdf
> 113/113223.pdf
> 115/115458.pdf
> 115/115463.pdf
> 122/122393.pdf
> 129/129416.pdf
> 133/133423.pdf
> 148/148020.pdf
> 152/152012.pdf
> 161/161466.pdf
>
> to be found here:
> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
>
> Tilman
>
> Am 14.10.2014 um 21:06 schrieb John Hewson:
>> Unless somebody provides us with a list of those files, then I think
>> this is an unreasonable request. As long as we continue to leave the
>> old parser in PDFBox, we won’t get the bug reports which we need to
>> fix the new parser, and the situation will never resolve itself.
>> Falling back to the old parser is just as bad - we won’t get bug reports.
>>
>> -- John
>>
>> On 14 Oct 2014, at 07:39, Tilman Hausherr <TH...@t-online.de> wrote:
>>
>>> I prefer that the "old" parser not be removed, because there are many
>>> files that can only be parsed by the old parser. This came out in a
>>> large scale test with TIKA.
>>>
>>> The best idea (in my current opinion) is to use the nonSeq parser
>>> first, and the old parser if there is an exception.
>>>
>>> Tilman
>>>
>>> Am 14.10.2014 um 09:45 schrieb Timo Boehme:
>>>> Hi,
>>>>
>>>> Am 14.10.2014 um 07:22 schrieb John Hewson:
>>>>> Hi,
>>>>>>> John Hewson <jo...@jahewson.com> hat am 10. Oktober 2014 um 20:05
>>>>>>> geschrieben:
>>>>>>>
>>>>>>>
>>>>>>>         - Parsing (Andreas?)
>>>>>> I guess we won't get a complete new parser in 2.0, but I try to
>>>>>> improve the XRef
>>>>>> and the COSStream stuff
>>>>> It would be great if we could get rid of the old parser and switch
>>>>> to the non-sequential
>>>>> parser, WDYT?
>>>> I would also propose to completely remove the old parser. That way
>>>> we are more flexible in parsing streams etc. since parts of the
>>>> non-sequential parser are a compromise to work side-by-side with the
>>>> old parser.
>>>> Possibly there are a small number of functions for which the old
>>>> parser is still needed - e.g. signing?
>>>>
>>>>
>>>> Best,
>>>> Timo
>>>>
>>>>
>>
>


-- 

  Timo Boehme
  OntoChem GmbH
  H.-Damerow-Str. 4
  06120 Halle/Saale
  T: +49 345 4780474
  F: +49 345 4780471
  timo.boehme@ontochem.com

_____________________________________________________________________

  OntoChem GmbH
  Geschäftsführer: Dr. Lutz Weber
  Sitz: Halle / Saale
  Registergericht: Stendal
  Registernummer: HRB 215461
_____________________________________________________________________

Re: 2.0

Posted by John Hewson <jo...@jahewson.com>.

That’s very good news!

-- John

> On 23 Oct 2014, at 11:40, Tilman Hausherr <TH...@t-online.de> wrote:
> 
> This is now obsolete, thanks to Andreas having resolved PDFBOX-2250.
> 
> Tilman
> 
> Am 23.10.2014 um 09:33 schrieb John Hewson:
>> Do we have a JIRA issue for these, or shall I create one?
>> 
>> -- John
>> 
>> On 14 Oct 2014, at 09:18, Tilman Hausherr <THausherr@t-online.de <ma...@t-online.de>> wrote:
>> 
>>> Here are some:
>>> 
>>> 055/055794.pdf
>>> 082/082463.pdf
>>> 108/108362.pdf
>>> 113/113223.pdf
>>> 115/115458.pdf
>>> 115/115463.pdf
>>> 122/122393.pdf
>>> 129/129416.pdf
>>> 133/133423.pdf
>>> 148/148020.pdf
>>> 152/152012.pdf
>>> 161/161466.pdf
>>> 
>>> to be found here:
>>> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/ <http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/>
>>> 
>>> Tilman
>>> 
>>> Am 14.10.2014 um 21:06 schrieb John Hewson:
>>>> Unless somebody provides us with a list of those files, then I think this is an unreasonable request. As long as we continue to leave the old parser in PDFBox, we won’t get the bug reports which we need to fix the new parser, and the situation will never resolve itself. Falling back to the old parser is just as bad - we won’t get bug reports.
>>>> 
>>>> -- John
>>>> 
>>>> On 14 Oct 2014, at 07:39, Tilman Hausherr <THausherr@t-online.de <ma...@t-online.de>> wrote:
>>>> 
>>>>> I prefer that the "old" parser not be removed, because there are many files that can only be parsed by the old parser. This came out in a  large scale test with TIKA.
>>>>> 
>>>>> The best idea (in my current opinion) is to use the nonSeq parser first, and the old parser if there is an exception.
>>>>> 
>>>>> Tilman
>>>>> 
>>>>> Am 14.10.2014 um 09:45 schrieb Timo Boehme:
>>>>>> Hi,
>>>>>> 
>>>>>> Am 14.10.2014 um 07:22 schrieb John Hewson:
>>>>>>> Hi,
>>>>>>>>> John Hewson <john@jahewson.com <ma...@jahewson.com>> hat am 10. Oktober 2014 um 20:05 geschrieben:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>        - Parsing (Andreas?)
>>>>>>>> I guess we won't get a complete new parser in 2.0, but I try to improve the XRef
>>>>>>>> and the COSStream stuff
>>>>>>> It would be great if we could get rid of the old parser and switch to the non-sequential
>>>>>>> parser, WDYT?
>>>>>> I would also propose to completely remove the old parser. That way we are more flexible in parsing streams etc. since parts of the non-sequential parser are a compromise to work side-by-side with the old parser.
>>>>>> Possibly there are a small number of functions for which the old parser is still needed - e.g. signing?
>>>>>> 
>>>>>> 
>>>>>> Best,
>>>>>> Timo
>>>>>> 
>>>>>> 
>> 
>

Re: 2.0

Posted by Tilman Hausherr <TH...@t-online.de>.

This is now obsolete, thanks to Andreas having resolved PDFBOX-2250.

Tilman

On 23.10.2014 at 09:33, John Hewson wrote:
> Do we have a JIRA issue for these, or shall I create one?
>
> -- John
>
> On 14 Oct 2014, at 09:18, Tilman Hausherr <THausherr@t-online.de <ma...@t-online.de>> wrote:
>
>> Here are some:
>>
>> 055/055794.pdf
>> 082/082463.pdf
>> 108/108362.pdf
>> 113/113223.pdf
>> 115/115458.pdf
>> 115/115463.pdf
>> 122/122393.pdf
>> 129/129416.pdf
>> 133/133423.pdf
>> 148/148020.pdf
>> 152/152012.pdf
>> 161/161466.pdf
>>
>> to be found here:
>> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/ <http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/>
>>
>> Tilman
>>
>> Am 14.10.2014 um 21:06 schrieb John Hewson:
>>> Unless somebody provides us with a list of those files, then I think this is an unreasonable request. As long as we continue to leave the old parser in PDFBox, we won’t get the bug reports which we need to fix the new parser, and the situation will never resolve itself. Falling back to the old parser is just as bad - we won’t get bug reports.
>>>
>>> -- John
>>>
>>> On 14 Oct 2014, at 07:39, Tilman Hausherr <THausherr@t-online.de <ma...@t-online.de>> wrote:
>>>
>>>> I prefer that the "old" parser not be removed, because there are many files that can only be parsed by the old parser. This came out in a  large scale test with TIKA.
>>>>
>>>> The best idea (in my current opinion) is to use the nonSeq parser first, and the old parser if there is an exception.
>>>>
>>>> Tilman
>>>>
>>>> Am 14.10.2014 um 09:45 schrieb Timo Boehme:
>>>>> Hi,
>>>>>
>>>>> Am 14.10.2014 um 07:22 schrieb John Hewson:
>>>>>> Hi,
>>>>>>>> John Hewson <john@jahewson.com <ma...@jahewson.com>> hat am 10. Oktober 2014 um 20:05 geschrieben:
>>>>>>>>
>>>>>>>>
>>>>>>>>         - Parsing (Andreas?)
>>>>>>> I guess we won't get a complete new parser in 2.0, but I try to improve the XRef
>>>>>>> and the COSStream stuff
>>>>>> It would be great if we could get rid of the old parser and switch to the non-sequential
>>>>>> parser, WDYT?
>>>>> I would also propose to completely remove the old parser. That way we are more flexible in parsing streams etc. since parts of the non-sequential parser are a compromise to work side-by-side with the old parser.
>>>>> Possibly there are a small number of functions for which the old parser is still needed - e.g. signing?
>>>>>
>>>>>
>>>>> Best,
>>>>> Timo
>>>>>
>>>>>
>

Re: 2.0

Posted by John Hewson <jo...@jahewson.com>.

Do we have a JIRA issue for these, or shall I create one?

-- John

On 14 Oct 2014, at 09:18, Tilman Hausherr <THausherr@t-online.de <ma...@t-online.de>> wrote:

> Here are some:
> 
> 055/055794.pdf
> 082/082463.pdf
> 108/108362.pdf
> 113/113223.pdf
> 115/115458.pdf
> 115/115463.pdf
> 122/122393.pdf
> 129/129416.pdf
> 133/133423.pdf
> 148/148020.pdf
> 152/152012.pdf
> 161/161466.pdf
> 
> to be found here:
> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/ <http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/>
> 
> Tilman
> 
> Am 14.10.2014 um 21:06 schrieb John Hewson:
>> Unless somebody provides us with a list of those files, then I think this is an unreasonable request. As long as we continue to leave the old parser in PDFBox, we won’t get the bug reports which we need to fix the new parser, and the situation will never resolve itself. Falling back to the old parser is just as bad - we won’t get bug reports.
>> 
>> -- John
>> 
>> On 14 Oct 2014, at 07:39, Tilman Hausherr <THausherr@t-online.de <ma...@t-online.de>> wrote:
>> 
>>> I prefer that the "old" parser not be removed, because there are many files that can only be parsed by the old parser. This came out in a  large scale test with TIKA.
>>> 
>>> The best idea (in my current opinion) is to use the nonSeq parser first, and the old parser if there is an exception.
>>> 
>>> Tilman
>>> 
>>> Am 14.10.2014 um 09:45 schrieb Timo Boehme:
>>>> Hi,
>>>> 
>>>> Am 14.10.2014 um 07:22 schrieb John Hewson:
>>>>> Hi,
>>>>>>> John Hewson <john@jahewson.com <ma...@jahewson.com>> hat am 10. Oktober 2014 um 20:05 geschrieben:
>>>>>>> 
>>>>>>> 
>>>>>>>        - Parsing (Andreas?)
>>>>>> I guess we won't get a complete new parser in 2.0, but I try to improve the XRef
>>>>>> and the COSStream stuff
>>>>> It would be great if we could get rid of the old parser and switch to the non-sequential
>>>>> parser, WDYT?
>>>> I would also propose to completely remove the old parser. That way we are more flexible in parsing streams etc. since parts of the non-sequential parser are a compromise to work side-by-side with the old parser.
>>>> Possibly there are a small number of functions for which the old parser is still needed - e.g. signing?
>>>> 
>>>> 
>>>> Best,
>>>> Timo
>>>> 
>>>> 
>> 
>

Re: 2.0

Posted by Tilman Hausherr <TH...@t-online.de>.

Here are some:

055/055794.pdf
082/082463.pdf
108/108362.pdf
113/113223.pdf
115/115458.pdf
115/115463.pdf
122/122393.pdf
129/129416.pdf
133/133423.pdf
148/148020.pdf
152/152012.pdf
161/161466.pdf

to be found here:
http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/

Tilman

On 14.10.2014 at 21:06, John Hewson wrote:
> Unless somebody provides us with a list of those files, then I think this is an unreasonable request. As long as we continue to leave the old parser in PDFBox, we won’t get the bug reports which we need to fix the new parser, and the situation will never resolve itself. Falling back to the old parser is just as bad - we won’t get bug reports.
>
> -- John
>
> On 14 Oct 2014, at 07:39, Tilman Hausherr <TH...@t-online.de> wrote:
>
>> I prefer that the "old" parser not be removed, because there are many files that can only be parsed by the old parser. This came out in a  large scale test with TIKA.
>>
>> The best idea (in my current opinion) is to use the nonSeq parser first, and the old parser if there is an exception.
>>
>> Tilman
>>
>> Am 14.10.2014 um 09:45 schrieb Timo Boehme:
>>> Hi,
>>>
>>> Am 14.10.2014 um 07:22 schrieb John Hewson:
>>>> Hi,
>>>>>> John Hewson <jo...@jahewson.com> hat am 10. Oktober 2014 um 20:05 geschrieben:
>>>>>>
>>>>>>
>>>>>>         - Parsing (Andreas?)
>>>>> I guess we won't get a complete new parser in 2.0, but I try to improve the XRef
>>>>> and the COSStream stuff
>>>> It would be great if we could get rid of the old parser and switch to the non-sequential
>>>> parser, WDYT?
>>> I would also propose to completely remove the old parser. That way we are more flexible in parsing streams etc. since parts of the non-sequential parser are a compromise to work side-by-side with the old parser.
>>> Possibly there are a small number of functions for which the old parser is still needed - e.g. signing?
>>>
>>>
>>> Best,
>>> Timo
>>>
>>>
>

Re: 2.0

Posted by John Hewson <jo...@jahewson.com>.

Unless somebody provides us with a list of those files, then I think this is an unreasonable request. As long as we continue to leave the old parser in PDFBox, we won’t get the bug reports which we need to fix the new parser, and the situation will never resolve itself. Falling back to the old parser is just as bad - we won’t get bug reports.

-- John

On 14 Oct 2014, at 07:39, Tilman Hausherr <TH...@t-online.de> wrote:

> I prefer that the "old" parser not be removed, because there are many files that can only be parsed by the old parser. This came out in a  large scale test with TIKA.
> 
> The best idea (in my current opinion) is to use the nonSeq parser first, and the old parser if there is an exception.
> 
> Tilman
> 
> Am 14.10.2014 um 09:45 schrieb Timo Boehme:
>> Hi,
>> 
>> Am 14.10.2014 um 07:22 schrieb John Hewson:
>>> Hi,
>>>> 
>>>>> John Hewson <jo...@jahewson.com> hat am 10. Oktober 2014 um 20:05 geschrieben:
>>>>> 
>>>>> 
>>>>>        - Parsing (Andreas?)
>>>> I guess we won't get a complete new parser in 2.0, but I try to improve the XRef
>>>> and the COSStream stuff
>>> 
>>> It would be great if we could get rid of the old parser and switch to the non-sequential
>>> parser, WDYT?
>> 
>> I would also propose to completely remove the old parser. That way we are more flexible in parsing streams etc. since parts of the non-sequential parser are a compromise to work side-by-side with the old parser.
>> Possibly there are a small number of functions for which the old parser is still needed - e.g. signing?
>> 
>> 
>> Best,
>> Timo
>> 
>> 
>

Re: 2.0

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

On 14.10.2014 at 19:39, Tilman Hausherr wrote:
> I prefer that the "old" parser not be removed, because there are many files that
> can only be parsed by the old parser. This came out in a large scale test with
> TIKA.
There is one additional reason to keep the old one: the signing stuff doesn't
work with the non-sequential parser.

BR
Andreas
> The best idea (in my current opinion) is to use the nonSeq parser first, and the
> old parser if there is an exception.
>
> Tilman
>
> Am 14.10.2014 um 09:45 schrieb Timo Boehme:
>> Hi,
>>
>> Am 14.10.2014 um 07:22 schrieb John Hewson:
>>> Hi,
>>>>
>>>>> John Hewson <jo...@jahewson.com> hat am 10. Oktober 2014 um 20:05 geschrieben:
>>>>>
>>>>>
>>>>>         - Parsing (Andreas?)
>>>> I guess we won't get a complete new parser in 2.0, but I try to improve the
>>>> XRef
>>>> and the COSStream stuff
>>>
>>> It would be great if we could get rid of the old parser and switch to the
>>> non-sequential
>>> parser, WDYT?
>>
>> I would also propose to completely remove the old parser. That way we are more
>> flexible in parsing streams etc. since parts of the non-sequential parser are
>> a compromise to work side-by-side with the old parser.
>> Possibly there are a small number of functions for which the old parser is
>> still needed - e.g. signing?
>>
>>
>> Best,
>> Timo
>>
>>
>

Re: 2.0

Posted by Tilman Hausherr <TH...@t-online.de>.

I prefer that the "old" parser not be removed, because there are many
files that can only be parsed by the old parser. This came out in a
large-scale test with Tika.

The best idea (in my current opinion) is to use the nonSeq parser first, 
and the old parser if there is an exception.
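
As a rough illustration of that fallback idea (a minimal sketch only: `Parser`, `parseWithFallback` and `fallbackLog` are hypothetical names made up for this example and are not PDFBox API), the primary parser is tried first and the second one is used only when the first throws. The sketch also records each failure, since a silent fallback would hide the files the new parser cannot handle:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class FallbackParsing {

    /** Hypothetical stand-in for a parser; this is not a PDFBox type. */
    interface Parser {
        String parse(String fileName) throws IOException;
    }

    /** Records which files needed the fallback, so they can still be reported. */
    static final List<String> fallbackLog = new ArrayList<>();

    /** Tries the primary parser first; uses the fallback only if the primary throws. */
    static String parseWithFallback(String fileName, Parser primary, Parser fallback)
            throws IOException {
        try {
            return primary.parse(fileName);
        } catch (IOException e) {
            // Keep a record, otherwise the primary parser's failure stays invisible.
            fallbackLog.add(fileName + ": " + e.getMessage());
            return fallback.parse(fileName);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-ins for the non-sequential and the old parser (assumed behaviour).
        Parser nonSeq = name -> {
            if (name.startsWith("bad")) {
                throw new IOException("xref table not found");
            }
            return "nonSeq:" + name;
        };
        Parser old = name -> "old:" + name;

        System.out.println(parseWithFallback("good.pdf", nonSeq, old));
        System.out.println(parseWithFallback("bad.pdf", nonSeq, old));
        System.out.println(fallbackLog);
    }
}
```

If the fallback also fails, its exception simply propagates to the caller, so nothing is swallowed.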

Tilman

On 14.10.2014 at 09:45, Timo Boehme wrote:
> Hi,
>
> Am 14.10.2014 um 07:22 schrieb John Hewson:
>> Hi,
>>>
>>>> John Hewson <jo...@jahewson.com> hat am 10. Oktober 2014 um 20:05 
>>>> geschrieben:
>>>>
>>>>
>>>>         - Parsing (Andreas?)
>>> I guess we won't get a complete new parser in 2.0, but I try to 
>>> improve the XRef
>>> and the COSStream stuff
>>
>> It would be great if we could get rid of the old parser and switch to 
>> the non-sequential
>> parser, WDYT?
>
> I would also propose to completely remove the old parser. That way we 
> are more flexible in parsing streams etc. since parts of the 
> non-sequential parser are a compromise to work side-by-side with the 
> old parser.
> Possibly there are a small number of functions for which the old 
> parser is still needed - e.g. signing?
>
>
> Best,
> Timo
>
>

Re: 2.0

Posted by Timo Boehme <ti...@ontochem.com>.

Hi,

On 14.10.2014 at 07:22, John Hewson wrote:
> Hi,
>>
>>> John Hewson <jo...@jahewson.com> hat am 10. Oktober 2014 um 20:05 geschrieben:
>>>
>>>
>>>         - Parsing (Andreas?)
>> I guess we won't get a complete new parser in 2.0, but I try to improve the XRef
>> and the COSStream stuff
>
> It would be great if we could get rid of the old parser and switch to the non-sequential
> parser, WDYT?

I would also propose to completely remove the old parser. That way we 
are more flexible in parsing streams etc. since parts of the 
non-sequential parser are a compromise to work side-by-side with the old 
parser.
Possibly there are a small number of functions for which the old parser 
is still needed - e.g. signing?


Best,
Timo


-- 

  Timo Boehme
  OntoChem GmbH
  H.-Damerow-Str. 4
  06120 Halle/Saale
  T: +49 345 4780474
  F: +49 345 4780471
  timo.boehme@ontochem.com

_____________________________________________________________________

  OntoChem GmbH
  Geschäftsführer: Dr. Lutz Weber
  Sitz: Halle / Saale
  Registergericht: Stendal
  Registernummer: HRB 215461
_____________________________________________________________________

Re: 2.0

Posted by John Hewson <jo...@jahewson.com>.

Hi,
> 
>> John Hewson <jo...@jahewson.com> hat am 10. Oktober 2014 um 20:05 geschrieben:
>> 
>> 
>> Simon,
>> 
>> Andreas has the best handle on this, but off the top of my head what we need
>> is to finish
>> making breaking API changes and for the code to have been stable for a while
>> before
>> making a 2.0 release.
>> 
>> Improvements and fixes which still need breaking API changes include:
>>        - Pattern rendering
> That's almost done, isn't it? Should be part of 2.0

Oh yes, it’s nearly there; when I get a decent chunk of spare time it will get done.

> 
>>        - Pages resource caching (significant memory usage issues)
> IMHO could be postponed

The catch is that it will be a breaking API change, so it’d be better to get it into
2.0 while we can. Otherwise I’d agree.

> 
>>        - Font embedding (particularly TTF)
> Is a real show stopper, should be part of 2.0

Yep. We’re getting there slowly; there are lots of pieces which need to work together.

>>        - Parsing (Andreas?)
> I guess we won't get a complete new parser in 2.0, but I try to improve the XRef
> and the COSStream stuff

It would be great if we could get rid of the old parser and switch to the non-sequential
parser, WDYT?

>>        - Page Tree (needs completely re-writing)
> IMHO could be postponed

Once again the catch is that this will be a breaking API change in an important
place, so it’d be smart to do it in 2.0.

>>        - Text extraction on Java 8 (this might end up being a breaking change
>> to the sort)
> I've already committed PDFBOX-1512, so that it works on Java 7. What is the issue
> with Java 8?

Same thing, so the Java 7 fix will fix this too.
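
For context, a minimal sketch of the kind of sort problem that can surface here, under the assumption (not spelled out in this thread) that the comparator treats positions within a tolerance as equal. Since Java 7, `Collections.sort` and `Arrays.sort` on objects use TimSort, which may throw an `IllegalArgumentException` when a comparator breaks its contract. `FUZZY`, `BUCKETED` and `TOLERANCE` are hypothetical names for illustration, not PDFBox code:

```java
import java.util.Comparator;

public class ToleranceComparatorDemo {

    static final float TOLERANCE = 1.0f;

    /** Treats values within TOLERANCE as equal; this relation is not transitive. */
    static final Comparator<Float> FUZZY = (a, b) ->
            Math.abs(a - b) <= TOLERANCE ? 0 : Float.compare(a, b);

    /** Snaps values to a grid first, then compares exactly; this is a consistent order. */
    static final Comparator<Float> BUCKETED = (a, b) ->
            Integer.compare(bucket(a), bucket(b));

    static int bucket(float v) {
        return (int) Math.floor(v / TOLERANCE);
    }

    public static void main(String[] args) {
        float a = 0.0f, b = 0.75f, c = 1.5f;

        // FUZZY says a == b and b == c, yet a < c: the comparator contract is violated.
        System.out.println(FUZZY.compare(a, b));
        System.out.println(FUZZY.compare(b, c));
        System.out.println(FUZZY.compare(a, c));

        // BUCKETED gives answers that agree with each other.
        System.out.println(BUCKETED.compare(a, b));
        System.out.println(BUCKETED.compare(b, c));
        System.out.println(BUCKETED.compare(a, c));
    }
}
```

The bucketed variant changes which items count as "equal", which is why a fix of this kind could alter the resulting order.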

>> There’s probably more, such as work on Acroforms, and we need to have much
>> better
>> example code for 2.0 due to all the changes.
> Yes, once PDFBOX-922 is implemented (embedding TTFs) we should be able to
> improve the creation of Appearance streams.

I feel like we could definitely postpone this one, it seems like we’d just be adding
APIs and not breaking anything, so it can wait?

>> This seems like a good time to explicitly try to make sure that we have JIRA
>> issues open
>> for all outstanding tasks, so that we can track how close 2.0 is to being
>> ready. The stability
>> of the code is a pretty good indicator - we’re not there yet.
>> 
>> I’m going to open some JIRA issues. Andreas, Tilman - please open issues for
>> any
>> 2.0 features which you think we need.
>> 
>> Thanks
>> 
>> -- John
> 
> BR
> Andreas Lehmkühler
> 
>> 
>> On 10 Oct 2014, at 08:08, Simon Steiner <si...@gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> 
>>> 
>>> Could you set a target date for the 2.0 release? What's missing to make a
>>> release?
>>> 
>>> 
>>> 
>>> Thanks
>>> 
>>

Re: 2.0

Posted by Andreas Lehmkühler <an...@lehmi.de>.

Hi,

> On 10 October 2014 at 20:05, John Hewson <jo...@jahewson.com> wrote:
>
>
> Simon,
>
> Andreas has the best handle on this, but off the top of my head what we need
> is to finish
> making breaking API changes and for the code to have been stable for a while
> before
> making a 2.0 release.
>
> Improvements and fixes which still need breaking API changes include:
>       - Pattern rendering
That's almost done, isn't it? Should be part of 2.0

>       - Pages resource caching (significant memory usage issues)
IMHO could be postponed

>       - Font embedding (particularly TTF)
Is a real show stopper, should be part of 2.0

>       - Parsing (Andreas?)
I guess we won't get a complete new parser in 2.0, but I try to improve the XRef
and the COSStream stuff

>       - Page Tree (needs completely re-writing)
IMHO could be postponed

>       - Text extraction on Java 8 (this might end up being a breaking change
> to the sort)
I've already committed PDFBOX-1512, so that it works on Java 7. What is the issue
with Java 8?

> There’s probably more, such as work on Acroforms, and we need to have much
> better
> example code for 2.0 due to all the changes.
Yes, once PDFBOX-922 is implemented (embedding TTFs) we should be able to
improve the creation of Appearance streams.

> This seems like a good time to explicitly try to make sure that we have JIRA
> issues open
> for all outstanding tasks, so that we can track how close 2.0 is to being
> ready. The stability
> of the code is a pretty good indicator - we’re not there yet.
>
> I’m going to open some JIRA issues. Andreas, Tilman - please open issues for
> any
> 2.0 features which you think we need.
>
> Thanks
>
> -- John

BR
Andreas Lehmkühler

>
> On 10 Oct 2014, at 08:08, Simon Steiner <si...@gmail.com> wrote:
>
> > Hi,
> >
> >
> >
> > Could you set a target date for the 2.0 release? What's missing to make a
> > release?
> >
> >
> >
> > Thanks
> >
>

Re: 2.0

Posted by John Hewson <jo...@jahewson.com>.

On 14 Oct 2014, at 06:41, Andreas Lehmkuehler <an...@lehmi.de> wrote:

> Hi,
> 
> Am 14.10.2014 um 08:16 schrieb John Hewson:
>> Andreas,
>> 
>>> Hi,
>>> 
>>> Am 10.10.2014 um 20:10 schrieb John Hewson:
>>>> Andreas - can we create a new “Later” version in JIRA so that we can assign
>>>> issues that we’ve decided to defer until after 2.0? That way we can have a
>>>> definitive list of what does and doesn’t need attention.
>>> What exactly would be the difference between "Later" and "Unscheduled"?
>> 
>> The use of a blank Fix Version to mean “unscheduled” is problematic, as there’s
>> no way to tell a deliberately deferred issue from one which we forgot to schedule,
>> or from the hundreds of historic 1.8 and older issues we have, or from version-less
>> issues such as those related to the web site, so I was thinking that an explicit “Later”
>> version would let us proactively and unambiguously defer issues.
>> 
>> Thinking about it, “2.1” and “3.0” labels might be better, for non-breaking and breaking
>> changes, respectively. This way we’re using JIRA for release management as well as
>> bug tracking.
> Sounds more reasonable than the "Later" idea. I've added both versions.

Thanks!

> BR
> Andreas Lehmkühler
> 
>>>> -- John
>>> 
>>> BR
>>> Andreas Lehmkühler
>>> 
>> 
>> -- John
>> 
>>>> 
>>>> On 10 Oct 2014, at 11:05, John Hewson <jo...@jahewson.com> wrote:
>>>> 
>>>>> Simon,
>>>>> 
>>>>> Andreas has the best handle on this, but off the top of my head what we need is to finish
>>>>> making breaking API changes and for the code to have been stable for a while before
>>>>> making a 2.0 release.
>>>>> 
>>>>> Improvements and fixes which still need breaking API changes include:
>>>>> 	- Pattern rendering
>>>>> 	- Pages resource caching (significant memory usage issues)
>>>>> 	- Font embedding (particularly TTF)
>>>>> 	- Parsing (Andreas?)
>>>>> 	- Page Tree (needs completely re-writing)
>>>>> 	- Text extraction on Java 8 (this might end up being a breaking change to the sort)
>>>>> 
>>>>> There’s probably more, such as work on Acroforms, and we need to have much better
>>>>> example code for 2.0 due to all the changes.
>>>>> 
>>>>> This seems like a good time to explicitly try to make sure that we have JIRA issues open
>>>>> for all outstanding tasks, so that we can track how close 2.0 is to being ready. The stability
>>>>> of the code is a pretty good indicator - we’re not there yet.
>>>>> 
>>>>> I’m going to open some JIRA issues. Andreas, Tilman - please open issues for any
>>>>> 2.0 features which you think we need.
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> -- John
>>>>> 
>>>>> On 10 Oct 2014, at 08:08, Simon Steiner <si...@gmail.com> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Could you set a target date for the 2.0 release? What's missing to make a
>>>>>> release?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Thanks

Re: 2.0

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

Hi,

On 14.10.2014 at 08:16, John Hewson wrote:
> Andreas,
>
>> Hi,
>>
>> Am 10.10.2014 um 20:10 schrieb John Hewson:
>>> Andreas - can we create a new “Later” version in JIRA so that we can assign
>>> issues that we’ve decided to defer until after 2.0? That way we can have a
>>> definitive list of what does and doesn’t need attention.
>> What exactly would be the difference between "Later" and "Unscheduled"?
>
> The use of a blank Fix Version to mean “unscheduled” is problematic, as there’s
> no way to tell a deliberately deferred issue from one which we forgot to schedule,
> or from the hundreds of historic 1.8 and older issues we have, or from version-less
> issues such as those related to the web site, so I was thinking that an explicit “Later”
> version would let us proactively and unambiguously defer issues.
>
> Thinking about it, “2.1” and “3.0” labels might be better, for non-breaking and breaking
> changes, respectively. This way we’re using JIRA for release management as well as
> bug tracking.
Sounds more reasonable than the "Later" idea. I've added both versions.

BR
Andreas Lehmkühler

>>> -- John
>>
>> BR
>> Andreas Lehmkühler
>>
>
> -- John
>
>>>
>>> On 10 Oct 2014, at 11:05, John Hewson <jo...@jahewson.com> wrote:
>>>
>>>> Simon,
>>>>
>>>> Andreas has the best handle on this, but off the top of my head what we need is to finish
>>>> making breaking API changes and for the code to have been stable for a while before
>>>> making a 2.0 release.
>>>>
>>>> Improvements and fixes which still need breaking API changes include:
>>>> 	- Pattern rendering
>>>> 	- Pages resource caching (significant memory usage issues)
>>>> 	- Font embedding (particularly TTF)
>>>> 	- Parsing (Andreas?)
>>>> 	- Page Tree (needs completely re-writing)
>>>> 	- Text extraction on Java 8 (this might end up being a breaking change to the sort)
>>>>
>>>> There’s probably more, such as work on Acroforms, and we need to have much better
>>>> example code for 2.0 due to all the changes.
>>>>
>>>> This seems like a good time to explicitly try to make sure that we have JIRA issues open
>>>> for all outstanding tasks, so that we can track how close 2.0 is to being ready. The stability
>>>> of the code is a pretty good indicator - we’re not there yet.
>>>>
>>>> I’m going to open some JIRA issues. Andreas, Tilman - please open issues for any
>>>> 2.0 features which you think we need.
>>>>
>>>> Thanks
>>>>
>>>> -- John
>>>>
>>>> On 10 Oct 2014, at 08:08, Simon Steiner <si...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> Could you set a target date for the 2.0 release? What's missing to make a
>>>>> release?
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>
>>>
>>
>

Re: 2.0

Posted by John Hewson <jo...@jahewson.com>.

Andreas,

> Hi,
> 
> Am 10.10.2014 um 20:10 schrieb John Hewson:
>> Andreas - can we create a new “Later” version in JIRA so that we can assign
>> issues that we’ve decided to defer until after 2.0? That way we can have a
>> definitive list of what does and doesn’t need attention.
> What exactly would be the difference between "Later" and "Unscheduled"?

The use of a blank Fix Version to mean “unscheduled” is problematic, as there’s
no way to tell a deliberately deferred issue from one which we forgot to schedule,
or from the hundreds of historic 1.8 and older issues we have, or from version-less
issues such as those related to the web site, so I was thinking that an explicit “Later”
version would let us proactively and unambiguously defer issues.

Thinking about it, “2.1” and “3.0” labels might be better, for non-breaking and breaking
changes, respectively. This way we’re using JIRA for release management as well as
bug tracking.

>> -- John
> 
> BR
> Andreas Lehmkühler
> 

-- John

>> 
>> On 10 Oct 2014, at 11:05, John Hewson <jo...@jahewson.com> wrote:
>> 
>>> Simon,
>>> 
>>> Andreas has the best handle on this, but off the top of my head what we need is to finish
>>> making breaking API changes and for the code to have been stable for a while before
>>> making a 2.0 release.
>>> 
>>> Improvements and fixes which still need breaking API changes include:
>>> 	- Pattern rendering
>>> 	- Pages resource caching (significant memory usage issues)
>>> 	- Font embedding (particularly TTF)
>>> 	- Parsing (Andreas?)
>>> 	- Page Tree (needs completely re-writing)
>>> 	- Text extraction on Java 8 (this might end up being a breaking change to the sort)
>>> 
>>> There’s probably more, such as work on Acroforms, and we need to have much better
>>> example code for 2.0 due to all the changes.
>>> 
>>> This seems like a good time to explicitly try to make sure that we have JIRA issues open
>>> for all outstanding tasks, so that we can track how close 2.0 is to being ready. The stability
>>> of the code is a pretty good indicator - we’re not there yet.
>>> 
>>> I’m going to open some JIRA issues. Andreas, Tilman - please open issues for any
>>> 2.0 features which you think we need.
>>> 
>>> Thanks
>>> 
>>> -- John
>>> 
>>> On 10 Oct 2014, at 08:08, Simon Steiner <si...@gmail.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> 
>>>> 
>>>> Could you set a target date for the 2.0 release? What's missing to make a
>>>> release?
>>>> 
>>>> 
>>>> 
>>>> Thanks
>>>> 
>>> 
>> 
>> 
>

Re: 2.0

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

Hi,

On 10.10.2014 at 20:10, John Hewson wrote:
> Andreas - can we create a new “Later” version in JIRA so that we can assign
> issues that we’ve decided to defer until after 2.0? That way we can have a
> definitive list of what does and doesn’t need attention.
What exactly would be the difference between "Later" and "Unscheduled"?

> -- John

BR
Andreas Lehmkühler

>
> On 10 Oct 2014, at 11:05, John Hewson <jo...@jahewson.com> wrote:
>
>> Simon,
>>
>> Andreas has the best handle on this, but off the top of my head what we need is to finish
>> making breaking API changes and for the code to have been stable for a while before
>> making a 2.0 release.
>>
>> Improvements and fixes which still need breaking API changes include:
>> 	- Pattern rendering
>> 	- Pages resource caching (significant memory usage issues)
>> 	- Font embedding (particularly TTF)
>> 	- Parsing (Andreas?)
>> 	- Page Tree (needs completely re-writing)
>> 	- Text extraction on Java 8 (this might end up being a breaking change to the sort)
>>
>> There’s probably more, such as work on Acroforms, and we need to have much better
>> example code for 2.0 due to all the changes.
>>
>> This seems like a good time to explicitly try to make sure that we have JIRA issues open
>> for all outstanding tasks, so that we can track how close 2.0 is to being ready. The stability
>> of the code is a pretty good indicator - we’re not there yet.
>>
>> I’m going to open some JIRA issues. Andreas, Tilman - please open issues for any
>> 2.0 features which you think we need.
>>
>> Thanks
>>
>> -- John
>>
>> On 10 Oct 2014, at 08:08, Simon Steiner <si...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>>
>>>
>>> Could you set a target date for the 2.0 release? What's missing to make a
>>> release?
>>>
>>>
>>>
>>> Thanks
>>>
>>
>
>

Re: 2.0

Posted by John Hewson <jo...@jahewson.com>.

Andreas - can we create a new “Later” version in JIRA so that we can assign
issues that we’ve decided to defer until after 2.0? That way we can have a
definitive list of what does and doesn’t need attention.

-- John

On 10 Oct 2014, at 11:05, John Hewson <jo...@jahewson.com> wrote:

> Simon,
> 
> Andreas has the best handle on this, but off the top of my head what we need is to finish
> making breaking API changes and for the code to have been stable for a while before
> making a 2.0 release.
> 
> Improvements and fixes which still need breaking API changes include:
> 	- Pattern rendering
> 	- Pages resource caching (significant memory usage issues)
> 	- Font embedding (particularly TTF)
> 	- Parsing (Andreas?)
> 	- Page Tree (needs completely re-writing)
> 	- Text extraction on Java 8 (this might end up being a breaking change to the sort)
> 
> There’s probably more, such as work on Acroforms, and we need to have much better
> example code for 2.0 due to all the changes.
> 
> This seems like a good time to explicitly try to make sure that we have JIRA issues open
> for all outstanding tasks, so that we can track how close 2.0 is to being ready. The stability
> of the code is a pretty good indicator - we’re not there yet.
> 
> I’m going to open some JIRA issues. Andreas, Tilman - please open issues for any
> 2.0 features which you think we need.
> 
> Thanks
> 
> -- John
> 
> On 10 Oct 2014, at 08:08, Simon Steiner <si...@gmail.com> wrote:
> 
>> Hi,
>> 
>> 
>> 
>> Could you set a target date for the 2.0 release? What's missing to make a
>> release?
>> 
>> 
>> 
>> Thanks
>> 
>

Re: 2.0

Posted by John Hewson <jo...@jahewson.com>.

Simon,

Andreas has the best handle on this, but off the top of my head what we need is to finish
making breaking API changes and for the code to have been stable for a while before
making a 2.0 release.

Improvements and fixes which still need breaking API changes include:
	- Pattern rendering
	- Pages resource caching (significant memory usage issues)
	- Font embedding (particularly TTF)
	- Parsing (Andreas?)
	- Page Tree (needs completely re-writing)
	- Text extraction on Java 8 (this might end up being a breaking change to the sort)

There’s probably more, such as work on Acroforms, and we need to have much better
example code for 2.0 due to all the changes.

This seems like a good time to explicitly try to make sure that we have JIRA issues open
for all outstanding tasks, so that we can track how close 2.0 is to being ready. The stability
of the code is a pretty good indicator - we’re not there yet.

I’m going to open some JIRA issues. Andreas, Tilman - please open issues for any
2.0 features which you think we need.

Thanks

-- John

On 10 Oct 2014, at 08:08, Simon Steiner <si...@gmail.com> wrote:

> Hi,
> 
> 
> 
> Could you set a target date for the 2.0 release? What's missing to make a
> release?
> 
> 
> 
> Thanks
>