Posted to dev@tika.apache.org by "Fausto Cruzeiro de Moraes (Created) (JIRA)" <ji...@apache.org> on 2012/03/14 22:24:35 UTC
[jira] [Created] (TIKA-876) Signed pdf parsing
Signed pdf parsing
------------------
Key: TIKA-876
URL: https://issues.apache.org/jira/browse/TIKA-876
Project: Tika
Issue Type: New Feature
Components: parser
Affects Versions: 1.0
Environment: Java 6.0, Ubuntu
Reporter: Fausto Cruzeiro de Moraes
Fix For: 1.0
Is there an estimated date for implementing default parsing of signed documents, such as signed PDF files (p7s format)?
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-876) Signed pdf parsing
Posted by "Fausto Cruzeiro de Moraes (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Fausto Cruzeiro de Moraes updated TIKA-876:
-------------------------------------------
Attachment: PDFsigned.pdf.p7s, PDF para teste indexação conteúdo.pdf
> Signed pdf parsing
> ------------------
>
> Key: TIKA-876
> URL: https://issues.apache.org/jira/browse/TIKA-876
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.0
> Environment: Java 6.0, Ubuntu
> Reporter: Fausto Cruzeiro de Moraes
> Labels: features
> Fix For: 1.0
>
> Attachments: PDF para teste indexação conteúdo.pdf, PDFsigned.pdf.p7s
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> Is there an estimated date for implementing default parsing of signed documents, such as signed PDF files (p7s format)?
[jira] [Commented] (TIKA-876) Signed pdf parsing
Posted by "Fausto Cruzeiro de Moraes (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270628#comment-13270628 ]
Fausto Cruzeiro de Moraes commented on TIKA-876:
------------------------------------------------
Hi Nick!
I have just attached two sample files, as you requested.
Thank you very much!
[jira] [Resolved] (TIKA-876) Signed pdf parsing
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-876.
--------------------------------
Resolution: Fixed
Fix Version/s: 1.2 (was: 1.0)
Assignee: Jukka Zitting
In revision 1355724 I added a simple o.a.t.parser.crypto.Pkcs7Parser class that is able to parse the attached p7s file using Bouncy Castle.
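To illustrate what unwrapping a signed p7s file involves, here is a minimal sketch that pulls the embedded document bytes out of a PKCS#7 SignedData structure. It is not the Pkcs7Parser from revision 1355724, which according to the comment above relies on Bouncy Castle; this version uses only the JDK and swaps in a hand-written DER walk. The class and method names are hypothetical. It assumes definite-length DER encoding, an attached (not detached) signature, and content stored as a single primitive OCTET STRING. Real files may use BER indefinite lengths, which this sketch rejects.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Tika: walks a DER-encoded PKCS#7
// SignedData structure and returns the encapsulated content bytes.
final class Pkcs7ContentExtractor {
    private final byte[] buf;
    private int pos;

    private Pkcs7ContentExtractor(byte[] buf) {
        this.buf = buf;
    }

    // Consumes one tag byte and a definite length; returns the length and
    // leaves pos at the first content byte.
    private int enter(int tag) {
        if (pos >= buf.length || (buf[pos] & 0xFF) != tag) {
            throw new IllegalArgumentException("unexpected tag at offset " + pos);
        }
        pos++;
        int first = buf[pos++] & 0xFF;
        if (first < 0x80) {
            return first; // short form
        }
        int n = first & 0x7F;
        if (n == 0 || n > 3) {
            throw new IllegalArgumentException("unsupported length form at offset " + pos);
        }
        int len = 0;
        for (int i = 0; i < n; i++) {
            len = (len << 8) | (buf[pos++] & 0xFF);
        }
        return len;
    }

    // Skips a whole element with the given tag.
    private void skip(int tag) {
        int len = enter(tag);
        pos += len;
    }

    // Malformed or truncated input surfaces as an unchecked exception.
    static byte[] extract(byte[] der) {
        Pkcs7ContentExtractor r = new Pkcs7ContentExtractor(der);
        r.enter(0x30);               // outer ContentInfo SEQUENCE
        r.skip(0x06);                // contentType OID (assumed to be signedData)
        r.enter(0xA0);               // [0] EXPLICIT wrapper
        r.enter(0x30);               // SignedData SEQUENCE
        r.skip(0x02);                // version INTEGER
        r.skip(0x31);                // digestAlgorithms SET
        int innerLen = r.enter(0x30); // encapsulated ContentInfo SEQUENCE
        int innerEnd = r.pos + innerLen;
        r.skip(0x06);                // inner contentType OID (normally data)
        if (r.pos >= innerEnd) {
            throw new IllegalArgumentException("detached signature: no embedded content");
        }
        r.enter(0xA0);               // [0] EXPLICIT wrapper around the content
        int len = r.enter(0x04);     // primitive OCTET STRING holding the document
        if (r.pos + len > der.length) {
            throw new IllegalArgumentException("truncated content");
        }
        return Arrays.copyOfRange(der, r.pos, r.pos + len);
    }
}
```

The returned bytes could then be handed to a regular parser (for example a PDF parser), which is the recursion step discussed elsewhere in this thread.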
[jira] [Commented] (TIKA-876) Signed pdf parsing
Posted by "Fausto Cruzeiro de Moraes (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231616#comment-13231616 ]
Fausto Cruzeiro de Moraes commented on TIKA-876:
------------------------------------------------
Hi Nick
I am running Tika on two files: PDFnotsigned.pdf (the original PDF document, application/pdf) and PDFsigned.pdf.p7s (the digitally signed document, application/pkcs7-signature).
1 - Running java -jar tika-app-1.0.jar -t PDFnotsigned.pdf > PDFnotsigned.pdf.txt produces an output file with the expected content.
2 - Running java -jar tika-app-1.0.jar -t PDFsigned.pdf > PDFsigned.pdf.txt produces an output file with no content at all (0 KB).
As far as I can tell, Tika has no default parser for the application/pkcs7-signature MIME type.
Thanks
[jira] [Commented] (TIKA-876) Signed pdf parsing
Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264112#comment-13264112 ]
Nick Burch commented on TIKA-876:
---------------------------------
We still can't help you very much without a (small) sample file. Any chance you could upload one?
If your PDFs really are wrapped in PKCS7, then we'll need something that unpacks the PKCS7 wrapper and, for signed files, triggers the recursing parser on the contents. (Signed files only at first: there is no way yet to supply the private key for encrypted ones.) I think BouncyCastle might help with this; it's worth a look to start with.
In r1331634 I've added some mime magic for pkcs7 files. I'm not sure whether it's quite right, but it seems OK for the few files I've tried. It'll need someone who knows the PKCS format (or maybe just DER encoding?) to be sure, though. Ideally, we should distinguish between signed, encrypted and signed+encrypted, but I'm not sure how we do that...
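One possible way to tell the variants apart, sketched here under assumptions: a PKCS#7 file starts with a SEQUENCE whose first element is a content-type OID, and PKCS #7 (RFC 2315) assigns 1.2.840.113549.1.7.2 to signedData, 1.7.3 to envelopedData (encrypted) and 1.7.4 to signedAndEnvelopedData. The sketch below inspects only those leading bytes. It is not the mime magic added in r1331634; the class, enum and method names are hypothetical, and PEM or base64-wrapped input is not handled.

```java
// Hypothetical sniffer, not part of Tika: classifies binary PKCS#7 data
// by the content-type OID at the start of the outer SEQUENCE.
final class Pkcs7ContentTypeSniffer {
    // DER bytes of the OID arcs 1.2.840.113549.1.7 (the final arc is checked separately).
    private static final byte[] PKCS7_OID_PREFIX = {
        0x2A, (byte) 0x86, 0x48, (byte) 0x86, (byte) 0xF7, 0x0D, 0x01, 0x07
    };

    enum Kind { DATA, SIGNED, ENVELOPED, SIGNED_AND_ENVELOPED, DIGESTED, ENCRYPTED, UNKNOWN }

    static Kind sniff(byte[] head) {
        if (head.length < 2 || head[0] != 0x30) {
            return Kind.UNKNOWN; // not an ASN.1 SEQUENCE
        }
        int pos = 1;
        int first = head[pos++] & 0xFF;
        if (first > 0x80) {
            pos += first & 0x7F; // long form: skip the extra length bytes
        }
        // first < 0x80 is the short form, first == 0x80 the BER indefinite
        // form; neither has further length bytes to skip.
        if (pos + 11 > head.length) {
            return Kind.UNKNOWN; // too short to hold the 9-byte OID
        }
        if (head[pos] != 0x06 || head[pos + 1] != 0x09) {
            return Kind.UNKNOWN; // expected an OID of length 9
        }
        for (int i = 0; i < PKCS7_OID_PREFIX.length; i++) {
            if (head[pos + 2 + i] != PKCS7_OID_PREFIX[i]) {
                return Kind.UNKNOWN;
            }
        }
        switch (head[pos + 10]) { // last arc selects the content type
            case 1: return Kind.DATA;
            case 2: return Kind.SIGNED;
            case 3: return Kind.ENVELOPED;
            case 4: return Kind.SIGNED_AND_ENVELOPED;
            case 5: return Kind.DIGESTED;
            case 6: return Kind.ENCRYPTED;
            default: return Kind.UNKNOWN;
        }
    }
}
```

Only the first few bytes of the file are needed, so a check like this fits the way magic-based detection reads a short prefix of the stream.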
[jira] [Commented] (TIKA-876) Signed pdf parsing
Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278218#comment-13278218 ]
Nick Burch commented on TIKA-876:
---------------------------------
I can't seem to find any information on how the pkcs7 wrapping takes place, nor on how to unwrap it. Without knowing that, we can't write anything that uses BouncyCastle (or similar) to unpack it.
Are you able to track down any information on how it's done?
[jira] [Commented] (TIKA-876) Signed pdf parsing
Posted by "Fausto Cruzeiro de Moraes (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245303#comment-13245303 ]
Fausto Cruzeiro de Moraes commented on TIKA-876:
------------------------------------------------
Hi Nick
Do you have any tips or advice that could help me with this?
Thanks a lot
[jira] [Commented] (TIKA-876) Signed pdf parsing
Posted by "Fausto Cruzeiro de Moraes (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230298#comment-13230298 ]
Fausto Cruzeiro de Moraes commented on TIKA-876:
------------------------------------------------
Hi Nick
What I mean is that I need to parse digitally signed (PKCS7, for example) PDF files, so that Jackrabbit 2.4.0 can extract and index their content.
Thanks
[jira] [Commented] (TIKA-876) Signed pdf parsing
Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230299#comment-13230299 ]
Nick Burch commented on TIKA-876:
---------------------------------
Can you upload a small example file?
When you try to detect it with Tika, what do you get? When you parse it, what do you get? And how do those two things differ from what you'd expect?
[jira] [Commented] (TIKA-876) Signed pdf parsing
Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230291#comment-13230291 ]
Nick Burch commented on TIKA-876:
---------------------------------
Shortly after someone submits a patch for it! Unfortunately / fortunately (depending on your perspective), we're all volunteers here.
In the meantime, it may help if you explain what doesn't work and/or what you'd expect to see.
For example, I know we do support password-protected Microsoft Office and PDF files.