Posted to dev@pdfbox.apache.org by Jukka Zitting <ju...@gmail.com> on 2011/12/19 16:11:57 UTC

Tika parser in PDFBox

Hi,

As you may have noticed in PDFBOX-1132 [1], I wanted to try pushing
the PDF parser in Tika to PDFBox for easier and faster deployment of
latest fixes and improvements. It seems to work pretty well, so I'm
thinking of making this move permanent. See below for the message I
sent to dev@tika about this approach.

The discussion on dev@tika brought up some concerns about how to best
maintain consistency across Tika parsers if they're located in
upstream parser libraries. The solution I had in mind was granting
Tika committers write access to the relevant parts of PDFBox.

My idea is that Tika committers working on the PDFBox-based Tika
parser for PDF could commit those changes directly into PDFBox, from
where they'd be released as a part of the normal PDFBox releases under
the oversight of the PDFBox PMC. Active committers like Michael
McCandless who focus more on PDF parsing could even be invited as
normal PDFBox committers.

I think such a solution would make it easier to improve the PDF
parsing code directly in PDFBox instead of introducing workarounds and
other extra code, as has been happening in Tika. For example, in
TIKA-738 Michael extended the PDFTextStripper class with support for
annotation handling. Such improvements should ideally have gone into
the PDFTextStripper class itself instead of just into the downstream
code in Tika.
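
To illustrate the kind of downstream workaround meant here, a minimal
sketch follows. It uses hypothetical stand-in classes (Page,
BaseStripper, AnnotationStripper) and a hypothetical endPage hook, not
the real PDFBox or Tika API. The annotation logic lives in a subclass,
so only users of the subclass benefit from it.

```java
import java.util.List;

// Hypothetical stand-in for a parsed page: plain text plus annotation strings.
final class Page {
    final String text;
    final List<String> annotations;

    Page(String text, List<String> annotations) {
        this.text = text;
        this.annotations = annotations;
    }
}

// Hypothetical stand-in for an upstream text extractor with an overridable hook.
class BaseStripper {
    final String getText(List<Page> pages) {
        StringBuilder out = new StringBuilder();
        for (Page page : pages) {
            out.append(page.text).append('\n');
            endPage(page, out); // extension point called after each page
        }
        return out.toString();
    }

    // Does nothing by default; subclasses may append extra output.
    protected void endPage(Page page, StringBuilder out) {
    }
}

// Downstream workaround: annotation handling added in a subclass
// instead of in the upstream class itself.
class AnnotationStripper extends BaseStripper {
    @Override
    protected void endPage(Page page, StringBuilder out) {
        for (String annotation : page.annotations) {
            out.append("[annotation: ").append(annotation).append("]\n");
        }
    }
}
```

If the same logic were moved into the base class, every caller of
BaseStripper would get annotation output without needing the subclass.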

WDYT? I'm planning to call a vote on extending PDFBox commit access
to Tika committers for this purpose. Please share any concerns or
questions so we can discuss and hopefully address them before the
vote.


[1] https://issues.apache.org/jira/browse/PDFBOX-1132

BR,

Jukka Zitting


---------- Forwarded message ----------
From: Jukka Zitting <ju...@gmail.com>
Date: Tue, Dec 13, 2011 at 10:42 AM
Subject: Pushing parsers upstream
To: Tika Development <de...@tika.apache.org>


Hi,

As you know, we see a lot of questions about version mismatches (which
POI or PDFBox version should go with this Tika version) and there's a
long queue of patches that are waiting for new official releases of
our upstream dependencies to become available.

To avoid this issue I propose that we start moving some of our parser
implementations to upstream projects. Now with Tika 1.0 out we have
stable Parser and Detector interfaces and related APIs that upstream
libraries could implement directly without us having to worry about
changing Tika code whenever a new version of a parser library becomes
available.

This would allow our users to, for example, upgrade directly to a new
POI version without waiting for a related Tika release first.
Similarly, a new PDF parsing option or improvement could be
implemented directly in PDFBox and be usable without any code changes
in Tika.

The classloading and OSGi service mechanisms we've added should make
such upstream Parser implementations trivially easy to use, and we
could still keep the dependencies in tika-parsers as a way to pull in
the libraries even if the relevant implementation classes would no
longer reside in org.apache.tika.parser.*.
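
As a rough illustration of the classloading side of this, the sketch
below uses the standard java.util.ServiceLoader mechanism. The
DocumentParser interface and ParserRegistry class are hypothetical
stand-ins, not Tika's actual Parser API. The assumption is that an
upstream library ships an implementation together with a
META-INF/services entry, so the consuming code never names the
implementation class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical stand-in for a stable parser interface that upstream
// libraries could implement directly.
interface DocumentParser {
    boolean supports(String mediaType);

    String extractText(byte[] data);
}

// Hypothetical registry that depends only on the interface above.
final class ParserRegistry {
    private final List<DocumentParser> parsers = new ArrayList<>();

    // Collects every implementation advertised on the classpath through a
    // META-INF/services/<interface name> file. Yields an empty registry
    // when no library provides one.
    static ParserRegistry fromClasspath() {
        ParserRegistry registry = new ParserRegistry();
        for (DocumentParser parser : ServiceLoader.load(DocumentParser.class)) {
            registry.register(parser);
        }
        return registry;
    }

    void register(DocumentParser parser) {
        parsers.add(parser);
    }

    int size() {
        return parsers.size();
    }

    // Returns the first registered parser that accepts the media type.
    Optional<DocumentParser> find(String mediaType) {
        for (DocumentParser parser : parsers) {
            if (parser.supports(mediaType)) {
                return Optional.of(parser);
            }
        }
        return Optional.empty();
    }
}
```

With this arrangement, dropping a newer library jar on the classpath
would be enough to change which implementation find() returns, and the
registry code itself would stay untouched.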

In addition to some of the GPL libraries for which we've already done
this, I recently took the liberty of trying this out also with PDFBox.
See PDFBOX-1132 [1] for the issue where I copied the
org.apache.tika.parser.pdf implementation to org.apache.pdfbox.tika. It works
without problems, so now I'd like to propose that we copy any more
recent PDF parser changes to PDFBox and prepare to drop the parser
implementation in tika-parsers. Any further PDF parser work should
then be done directly in PDFBox. I haven't yet talked about this with
the PDFBox PMC (of which I'm a member), but I suppose we should be
able to come up with an arrangement where Tika committers can commit
directly to the Tika parser implementation in PDFBox.

It would be cool if we could do the same thing also with POI.

WDYT?

[1] https://issues.apache.org/jira/browse/PDFBOX-1132

BR,

Jukka Zitting

RE: Tika parser in PDFBox

Posted by "Martinez, Mel - 1004 - MITLL" <m....@ll.mit.edu>.

+1

I think this sounds like a great idea. It would speed up getting those
sorts of changes into PDFBox proper.

Re: Tika parser in PDFBox

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

Hi,

Sounds really good to me, go ahead!

BR
Andreas Lehmkühler
