Posted to user@tika.apache.org by "P. Hill" <pa...@gmail.com> on 2012/01/13 19:32:18 UTC
Parsing PDF Portfolio Files
Anyone know about the (future?) ability of Tika to parse PDF Portfolio
Files?
http://help.adobe.com/en_US/Acrobat/9.0/Standard/WSA2872EA8-9756-4a8c-9F20-8E93D59D91CE.html
I'm using Tika 1.0 and it doesn't understand them except in a most
rudimentary sense.
-Paul
Re: Parsing PDF Portfolio Files
Posted by "P. Hill" <pa...@gmail.com>.
On 1/24/2012 7:19 AM, Nick Burch wrote:
> On Mon, 23 Jan 2012, P. Hill wrote:
>> I can't show you the client's document, but here are some interesting
>> "portfolio documents". http://acrobatusers.com/gallery/pdf_portfolio.
>> I'm not sure if all are in Portfolio PDF, but they are all complex
>> documents.
>
> We're ideally after a small document that's a Portfolio, but we
> require one under a suitable license, so I don't think we can use any
> of those. We really need someone with a suitable tool to generate some
> sample files for us...
>
> Nick
You could build one using PDFBox (recall the example code from the
PDFBox website) out of other examples (with appropriate licenses) that
you already have around in your test system. This runs the risk of not
being _exactly_ what Acrobat produces.
Another solution is for someone to go on AcrobatUsers.com and ask some
users to produce a few examples that they can give away.
Toward that end, if some folks did that, what would be the minimal
content of each included file? Adobe allows all kinds of files to be
embedded in a portfolio, so we have to ask which file types we want the
test portfolios to include.
-Paul
Re: Parsing PDF Portfolio Files
Posted by Nick Burch <ni...@alfresco.com>.
On Mon, 23 Jan 2012, P. Hill wrote:
> I can't show you the client's document, but here are some interesting
> "portfolio documents". http://acrobatusers.com/gallery/pdf_portfolio.
> I'm not sure if all are in Portfolio PDF, but they are all complex
> documents.
We're ideally after a small document that's a Portfolio, but we require
one under a suitable license, so I don't think we can use any of those. We
really need someone with a suitable tool to generate some sample files for
us...
Nick
Re: Parsing PDF Portfolio Files
Posted by "P. Hill" <pa...@gmail.com>.
On 1/23/2012 1:19 PM, Nick Burch wrote:
>> I can report my company has seen at least one end user using Portfolio
>> files, but they don't seem very common.
>
> We would ideally want a test document, for both sanity checking and
> unit testing. Don't suppose you can ask your end user to do us a
> sample one?
>
> Nick
I can't show you the client's document, but here are some interesting
"portfolio documents":
http://acrobatusers.com/gallery/pdf_portfolio
I'm not sure if all of them are PDF Portfolios, but they are all complex
documents.
For example, one of the documents listed on that page, "Training Courses
Portfolio", is a PDF of PDFs (which, curiously, is a set of courses about
using Acrobat to create PDFs, including how to create PDF Portfolio
documents; yikes, the self-reference could give one a headache).
This one is:
http://acrobatusers.com/assets/uploads/gallery/Ted-Osuch_tf_acro_train_1.pdf
If you download this and view it in Acrobat Reader, you can not only
click through from the primary page, which brings up each document, but
you can also use the toolbar "Files" button to see that the document is
made from 6 different documents. Clicking on "Adobe Acrobat advanced.pdf"
shows a nice sample page (which includes the entry "Creating a PDF
Portfolio", hahaha).
Another one I can understand is:
http://acrobatusers.com/assets/uploads/gallery/aron%20katz_30-Jun-2009_122215_Portfolio1.pdf
When you click on Files in that one, you see several directories, each
containing various files.
I'm thinking both of these (and others) are examples of Portfolio
documents, but there does seem to be a difference between PDF documents
with PDF attachments and PDF Portfolio files. The bottom of
http://help.adobe.com/en_US/Acrobat/9.0/Standard/WSA2872EA8-9756-4a8c-9F20-8E93D59D91CE.html
says "PDFs attached to other PDFs do not offer the same benefits as PDF
Portfolios", so somehow there is a difference.
I hope this helps,
-Paul
Re: Parsing PDF Portfolio Files
Posted by Nick Burch <ni...@alfresco.com>.
On Mon, 23 Jan 2012, P. Hill wrote:
> I finally got a moment to ask about PDF Portfolio files and the folks
> over at PDFBox directed me to:
> http://pdfbox.apache.org/userguide/file_references.html
That's interesting, just a shame the examples only cover writing! If you're
able to get some information on how to read them too, we can certainly
have a look.
> I pass that along for Tika developers, but it seems there might be some
> issues about combining all the content in a portfolio not unlike e-mails
> with attachments or other compound documents
I think we've now largely got that model sorted, so we'd support them in
the same way that we currently support emails with attachments, Word
documents with embedded images, etc.
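
To make that model concrete, here is a minimal sketch in plain Java of
treating a portfolio like an email with attachments: the container is
visited first, then each embedded file is visited as a document in its
own right, recursively. The Part class and the walk method are
hypothetical names invented for this illustration. They are not Tika or
PDFBox APIs, and the text field only stands in for whatever a real
parser would extract.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: Part and walk are hypothetical, not Tika or PDFBox APIs.
class PortfolioWalk {

    /** One document plus the documents embedded inside it. */
    static final class Part {
        final String name;
        final String mediaType;
        final String text; // stands in for text a real parser would extract
        final List<Part> embedded = new ArrayList<>();

        Part(String name, String mediaType, String text) {
            this.name = name;
            this.mediaType = mediaType;
            this.text = text;
        }

        /** Adds an embedded document and returns this part for chaining. */
        Part embed(Part child) {
            embedded.add(child);
            return this;
        }
    }

    /** Depth-first walk: one output line per document, container before children. */
    static List<String> walk(Part root) {
        List<String> out = new ArrayList<>();
        walk(root, "", out);
        return out;
    }

    private static void walk(Part part, String parentPath, List<String> out) {
        String path = parentPath.isEmpty() ? part.name : parentPath + "/" + part.name;
        out.add(path + " [" + part.mediaType + "] " + part.text);
        for (Part child : part.embedded) {
            walk(child, path, out); // embedded files are handled like any other document
        }
    }

    public static void main(String[] args) {
        Part portfolio = new Part("portfolio.pdf", "application/pdf", "cover page")
                .embed(new Part("course1.pdf", "application/pdf", "course text"))
                .embed(new Part("notes.txt", "text/plain", "plain text notes"));
        for (String line : walk(portfolio)) {
            System.out.println(line);
        }
    }
}
```

Because the walk recurses, a portfolio whose "Files" view shows
directories, or an embedded PDF that itself carries attachments, needs no
special casing; each level just extends the path.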
> I can report my company has seen at least one end user using Portfolio
> files, but they don't seem very common.
We would ideally want a test document, for both sanity checking and unit
testing. Don't suppose you can ask your end user to do us a sample one?
Nick
Re: Parsing PDF Portfolio Files
Posted by "P. Hill" <pa...@gmail.com>.
On 1/16/2012 4:24 AM, Nick Burch wrote:
> On Fri, 13 Jan 2012, P. Hill wrote:
>> Anyone know about the (future?) ability of Tika to parse PDF
>> Portfolio Files?
>> http://help.adobe.com/en_US/Acrobat/9.0/Standard/WSA2872EA8-9756-4a8c-9F20-8E93D59D91CE.html
>>
>
> My hunch is that this'll need some PDFBox support too, to let us at
> the original files, and to let us know what parts are a portfolio.
>
> As a first step, I'd suggest you ask on the PDFBox list about their
> support for Portfolio files
>
> Nick
Nick,
I finally got a moment to ask about PDF Portfolio files and the folks
over at PDFBox directed me to:
http://pdfbox.apache.org/userguide/file_references.html
I pass that along for Tika developers, but it seems there might be some
issues about combining all the content in a portfolio not unlike e-mails
with attachments or other compound documents
(http://wiki.apache.org/tika/MetadataDiscussion).
I can report my company has seen at least one end user using Portfolio
files, but they don't seem very common.
-Paul
Re: Parsing PDF Portfolio Files
Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 13 Jan 2012, P. Hill wrote:
> Anyone know about the (future?) ability of Tika to parse PDF Portfolio Files?
> http://help.adobe.com/en_US/Acrobat/9.0/Standard/WSA2872EA8-9756-4a8c-9F20-8E93D59D91CE.html
My hunch is that this'll need some PDFBox support too, to let us at the
original files, and to let us know what parts are a portfolio.
As a first step, I'd suggest you ask on the PDFBox list about their
support for Portfolio files.
Nick