Posted to users@pdfbox.apache.org by Yoni Amir <Yo...@niceactimize.com> on 2013/06/11 13:45:08 UTC

xfa extraction not working properly?

Hello,
I have a PDF document (link here: https://www.dropbox.com/s/vr2xi5cf0uzur69/TEST_POC_DS_01.pdf).

I think it is an XFA document, although I am not sure how to verify this, so I apologize in advance if this question is misdirected.
When I run the sample ExtractText class on this file, I do not get the actual text of the PDF. Instead, I get generic placeholder text hidden in the PDF, similar to this:

"Please wait... If this message is not eventually replaced by the proper contents of the document, your PDF
viewer may not be able to display this type of document."

The returned text also contains some French text (link here: https://www.dropbox.com/s/oor4gj7wbhue8yc/TEST_POC_DS_01.txt)
that I have not been able to identify. It is not text that is visible in the PDF file.

Thanks,
Yoni

Confidentiality: This communication and any attachments are intended for the above-named persons only and may be confidential and/or legally privileged. Any opinions expressed in this communication are not necessarily those of NICE Actimize. If this communication has come to you in error you must take no action based on it, nor must you copy or show it to anyone; please delete/destroy and inform the sender by e-mail immediately.  
Monitoring: NICE Actimize may monitor incoming and outgoing e-mails.
Viruses: Although we have taken steps toward ensuring that this e-mail and attachments are free from any virus, we advise that in keeping with good computing practice the recipient should ensure they are actually virus free.

Re: xfa extraction not working properly?

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Yoni,

This is an XFA document. The French text is part of a JavaScript message.

ExtractText works on PDF text objects, which is why you don't get the XFA form content.
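
Since the visible form content lives in the XFA XML packet rather than in page text objects, one possible approach is to read that packet and pull the text out of the XML directly. The following is only a minimal sketch under stated assumptions: it uses nothing but the JDK's XML parser, the class name XfaTextSketch and the method extractText are hypothetical, and it assumes you have already obtained the XFA XML as bytes (PDFBox's PDAcroForm has a getXFA() accessor, but how you get the bytes from it depends on the PDFBox version you use). It also assumes that scripts sit in elements named "script", which it skips so that JavaScript messages do not end up in the output.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of PDFBox.
public class XfaTextSketch {

    /** Returns the non-blank text nodes of the XML packet in document order. */
    public static List<String> extractText(byte[] xfaXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations, since the XML comes from an untrusted file.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xfaXml));
            List<String> out = new ArrayList<>();
            collect(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XFA XML", e);
        }
    }

    private static void collect(Node node, List<String> out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
            return;
        }
        // Assumption: script code is held in <script> elements; skip their content.
        if (type == Node.ELEMENT_NODE && "script".equals(node.getLocalName())) {
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    public static void main(String[] args) {
        // Made-up sample packet, only loosely shaped like an XFA template.
        String sample =
            "<xdp:xdp xmlns:xdp=\"http://ns.adobe.com/xdp/\">"
          + "<template><subform>"
          + "<draw><value><text>Customer name</text></value></draw>"
          + "<field><event><script>app.alert(\"Veuillez patienter\");</script></event></field>"
          + "</subform></template>"
          + "</xdp:xdp>";
        for (String line : extractText(sample.getBytes(StandardCharsets.UTF_8))) {
            System.out.println(line);
        }
    }
}
```

This collects every text node, so labels, default values and captions all come out mixed together. A real form would likely need filtering by element name to separate them.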


BR
Maruan Sahyoun

On 11.06.2013 at 13:45, Yoni Amir <Yo...@niceactimize.com> wrote:

> Hello,
> I have a pdf document (link here: https://www.dropbox.com/s/vr2xi5cf0uzur69/TEST_POC_DS_01.pdf).
> 
> I think it is an XFA document, although I am not 100% sure how to verify this. So I apologize in advance if this question is misdirected.
> When I run the sample ExtractText class on this file, I am not receiving the actual text in the pdf. Rather, I receive the generic text hidden in the pdf similar to this:
> 
> "Please wait... If this message is not eventually replaced by the proper contents of the document, your PDF
> viewer may not be able to display this type of document."
> 
> The text returned also contains stuff in French (link here: https://www.dropbox.com/s/oor4gj7wbhue8yc/TEST_POC_DS_01.txt)
> but I haven't figured out what it is. It is not text that is visible in the PDF file.
> 
> Thanks,
> Yoni
> 