Posted to user@tika.apache.org by "P. Hill" <pa...@gmail.com> on 2012/01/13 19:32:18 UTC

Parsing PDF Portfolio Files

Anyone know about the (future?) ability of Tika to parse PDF Portfolio 
Files?
http://help.adobe.com/en_US/Acrobat/9.0/Standard/WSA2872EA8-9756-4a8c-9F20-8E93D59D91CE.html

I'm using Tika 1.0 and it doesn't understand them except in a most 
rudimentary sense.


-Paul


Re: Parsing PDF Portfolio Files

Posted by "P. Hill" <pa...@gmail.com>.
On 1/24/2012 7:19 AM, Nick Burch wrote:
> On Mon, 23 Jan 2012, P. Hill wrote:
>> I can't show you the client's document, but here are some interesting 
>> "portfolio documents". http://acrobatusers.com/gallery/pdf_portfolio. 
>> I'm not sure if all are in Portfolio PDF, but they are all complex 
>> documents.
>
> We're ideally after a small document that's a Portfolio, but we 
> require one under a suitable license, so I don't think we can use any 
> of those. We really need someone with a suitable tool to generate some 
> sample files for us...
>
> Nick

You could build one using PDFBox (recall the example code from the 
PDFBox website) out of other examples (with appropriate licenses) that 
you already have around in your test system.  This runs the risk of not 
being _exactly_ what Acrobat produces.

Another solution is for someone to go on AcrobatUsers.com and ask some 
users to produce a few examples that they can give away.
Toward that end, if some folks did that, what would be the minimal 
content of each included file?  Adobe allows all kinds of files to be 
embedded in a portfolio, so we have to decide which types we want to 
test inside the portfolio.

-Paul


Re: Parsing PDF Portfolio Files

Posted by Nick Burch <ni...@alfresco.com>.
On Mon, 23 Jan 2012, P. Hill wrote:
> I can't show you the client's document, but here are some interesting 
> "portfolio documents". http://acrobatusers.com/gallery/pdf_portfolio. 
> I'm not sure if all are in Portfolio PDF, but they are all complex 
> documents.

We're ideally after a small document that's a Portfolio, but we require 
one under a suitable license, so I don't think we can use any of those. We 
really need someone with a suitable tool to generate some sample files for 
us...

Nick

Re: Parsing PDF Portfolio Files

Posted by "P. Hill" <pa...@gmail.com>.
On 1/23/2012 1:19 PM, Nick Burch wrote:

> > I can report my company has seen at least one end user using Portfolio 
> > files, but they don't seem very common.
>
> We would ideally want a test document, for both sanity checking and 
> unit testing. Don't suppose you can ask your end user to do us a 
> sample one?
>
> Nick

I can't show you the client's document, but here are some interesting 
"portfolio documents":
http://acrobatusers.com/gallery/pdf_portfolio
I'm not sure whether all of them are actual PDF Portfolios, but they are 
all complex documents.

For example, one of the documents listed on that page, "Training Courses 
Portfolio", is a PDF of PDFs (curiously, it is a set of courses about 
using Acrobat to create PDFs, including how to create PDF Portfolio 
documents; yikes, the self-reference could give one a headache).
This one is:
http://acrobatusers.com/assets/uploads/gallery/Ted-Osuch_tf_acro_train_1.pdf

If you download this and view it in Acrobat Reader, you can not only 
click through from the primary page, which brings up each document, but 
also use the toolbar "Files" button to see that the portfolio is made up 
of 6 separate documents.  Clicking on "Adobe Acrobat advanced.pdf" shows 
a nice sample page (which includes the entry "Creating a PDF Portfolio", 
hahaha).

Another one I can understand is:
http://acrobatusers.com/assets/uploads/gallery/aron%20katz_30-Jun-2009_122215_Portfolio1.pdf
When you click on "Files" there, you see several directories, each 
containing various files.
I think both of these (and others) are examples of Portfolio 
documents, but there does seem to be a difference between PDF documents 
with PDF attachments and PDF Portfolio files, as it says at the bottom of
http://help.adobe.com/en_US/Acrobat/9.0/Standard/WSA2872EA8-9756-4a8c-9F20-8E93D59D91CE.html
"PDFs attached to other PDFs do not offer the same benefits as PDF 
Portfolios", so somehow there is a difference.
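
One way to probe that difference, as a rough sketch only: in the PDF 1.7 
format a portfolio (a "portable collection") is marked by a /Collection 
entry in the document catalog, whereas ordinary attachments are listed 
under an /EmbeddedFiles name tree with no /Collection entry.  The class 
below just scans the raw bytes for those two names.  The names 
PortfolioSniffer, Kind and classify are made up for illustration and are 
not part of Tika or PDFBox.  The scan cannot see names stored inside 
compressed object streams or encrypted files, and it can be fooled by the 
same text appearing elsewhere in the file, so a real check would go 
through a PDF parser.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Tika or PDFBox class: a crude byte-level heuristic.
final class PortfolioSniffer {

    /** Rough classification based on which PDF name tokens appear in the file. */
    public enum Kind { PLAIN, HAS_ATTACHMENTS, PORTFOLIO }

    public static Kind classify(byte[] pdfBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so binary content is preserved.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        if (containsName(raw, "/Collection")) {
            return Kind.PORTFOLIO;          // catalog-level marker of a portable collection
        }
        if (containsName(raw, "/EmbeddedFiles")) {
            return Kind.HAS_ATTACHMENTS;    // attachments present, but no collection marker
        }
        return Kind.PLAIN;
    }

    // True if the name occurs as a whole token, so "/CollectionItem" does not count as "/Collection".
    private static boolean containsName(String raw, String name) {
        int at = raw.indexOf(name);
        while (at >= 0) {
            int end = at + name.length();
            if (end == raw.length() || isDelimiter(raw.charAt(end))) {
                return true;
            }
            at = raw.indexOf(name, end);
        }
        return false;
    }

    // PDF names end at whitespace or at one of the delimiter characters.
    private static boolean isDelimiter(char c) {
        return c == 0 || " \t\r\n\f()<>[]{}/%".indexOf(c) >= 0;
    }
}
```

Reading a file with java.nio.file.Files.readAllBytes and passing the 
result to PortfolioSniffer.classify would be enough to try this against 
the gallery samples; treat the answer as a hint, not a verdict.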

I hope this helps,

-Paul




Re: Parsing PDF Portfolio Files

Posted by Nick Burch <ni...@alfresco.com>.
On Mon, 23 Jan 2012, P. Hill wrote:
> I finally got a moment to ask about PDF Portfolio files and the folks 
> over at PDFBox directed me to: 
> http://pdfbox.apache.org/userguide/file_references.html

That's interesting, it's just a shame the examples only cover writing! If 
you're able to get some information on how to read them too, we can 
certainly have a look.

> I pass that along for Tika developers, but it seems there might be some 
> issues about combining all the content in a portfolio not unlike e-mails 
> with attachments or other compound documents

I think we've now largely got that model sorted, so we'd support them in 
the same way that we currently support emails with attachments, Word 
documents with embedded images, etc.
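
To make the "container plus embedded documents" idea concrete, here is a 
minimal sketch of such a model.  It deliberately does not use Tika's own 
classes: CompoundDocSketch, Doc and flatten are hypothetical names chosen 
for illustration, and the only assumption is that a portfolio, like an 
email, can be seen as a parent document with nested children that are 
visited recursively.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; these are not Tika types.
final class CompoundDocSketch {

    /** A document with its own text plus any documents embedded in it. */
    public static final class Doc {
        final String name;
        final String text;
        final List<Doc> embedded = new ArrayList<>();

        public Doc(String name, String text) {
            this.name = name;
            this.text = text;
        }

        /** Adds an embedded child and returns this document for chaining. */
        public Doc add(Doc child) {
            embedded.add(child);
            return this;
        }
    }

    /** Depth-first walk: one "path: text" line per document, container before its children. */
    public static List<String> flatten(Doc root) {
        List<String> out = new ArrayList<>();
        walk(root, "", out);
        return out;
    }

    private static void walk(Doc doc, String parentPath, List<String> out) {
        String path = parentPath.isEmpty() ? doc.name : parentPath + "/" + doc.name;
        out.add(path + ": " + doc.text);
        for (Doc child : doc.embedded) {
            walk(child, path, out);   // embedded documents may themselves be containers
        }
    }
}
```

With this shape a portfolio holding a PDF and an email with an image 
attachment comes out as four entries, each carrying the path of the 
container it was found in, which is the same treatment an email with 
attachments would get.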

> I can report my company has seen at least one end user using Portfolio 
> files, but they don't seem very common.

We would ideally want a test document, for both sanity checking and unit 
testing. Don't suppose you can ask your end user to do us a sample one?

Nick

Re: Parsing PDF Portfolio Files

Posted by "P. Hill" <pa...@gmail.com>.
On 1/16/2012 4:24 AM, Nick Burch wrote:
> On Fri, 13 Jan 2012, P. Hill wrote:
>> Anyone know about the (future?) ability of Tika to parse PDF 
>> Portfolio Files?
>> http://help.adobe.com/en_US/Acrobat/9.0/Standard/WSA2872EA8-9756-4a8c-9F20-8E93D59D91CE.html 
>>
>
> My hunch is that this'll need some PDFBox support too, to let us at 
> the original files, and to let us know what parts are a portfolio.
>
> As a first step, I'd suggest you ask on the PDFBox list about their 
> support for Portfolio files
>
> Nick

Nick,

I finally got a moment to ask about PDF Portfolio files and the folks 
over at PDFBox directed me to:
http://pdfbox.apache.org/userguide/file_references.html

I pass that along for Tika developers, but it seems there might be some 
issues with combining all the content in a portfolio, not unlike those 
for e-mails with attachments or other compound documents 
(http://wiki.apache.org/tika/MetadataDiscussion).

I can report my company has seen at least one end user using Portfolio 
files, but they don't seem very common.

-Paul

Re: Parsing PDF Portfolio Files

Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 13 Jan 2012, P. Hill wrote:
> Anyone know about the (future?) ability of Tika to parse PDF Portfolio Files?
> http://help.adobe.com/en_US/Acrobat/9.0/Standard/WSA2872EA8-9756-4a8c-9F20-8E93D59D91CE.html

My hunch is that this'll need some PDFBox support too, to let us at the 
original files, and to let us know what parts are a portfolio.

As a first step, I'd suggest you ask on the PDFBox list about their 
support for Portfolio files

Nick