Posted to dev@tika.apache.org by "Fausto Cruzeiro de Moraes (Created) (JIRA)" <ji...@apache.org> on 2012/03/14 22:24:35 UTC

[jira] [Created] (TIKA-876) Signed pdf parsing

Signed pdf parsing
------------------

                 Key: TIKA-876
                 URL: https://issues.apache.org/jira/browse/TIKA-876
             Project: Tika
          Issue Type: New Feature
          Components: parser
    Affects Versions: 1.0
         Environment: Java 6.0, Ubuntu
            Reporter: Fausto Cruzeiro de Moraes
             Fix For: 1.0


Is there an estimated date for implementing default parsing of signed documents, for example signed PDF files (.p7s format)?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (TIKA-876) Signed pdf parsing

Posted by "Fausto Cruzeiro de Moraes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fausto Cruzeiro de Moraes updated TIKA-876:
-------------------------------------------

    Attachment: PDFsigned.pdf.p7s
                PDF para teste indexação conteúdo.pdf
    
> Signed pdf parsing
> ------------------
>
>                 Key: TIKA-876
>                 URL: https://issues.apache.org/jira/browse/TIKA-876
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Java 6.0, Ubuntu
>            Reporter: Fausto Cruzeiro de Moraes
>              Labels: features
>             Fix For: 1.0
>
>         Attachments: PDF para teste indexação conteúdo.pdf, PDFsigned.pdf.p7s
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Is there an estimated date for implementing default parsing of signed documents, for example signed PDF files (.p7s format)?


[jira] [Commented] (TIKA-876) Signed pdf parsing

Posted by "Fausto Cruzeiro de Moraes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270628#comment-13270628 ] 

Fausto Cruzeiro de Moraes commented on TIKA-876:
------------------------------------------------

Hi Nick!

I have just attached two sample files, as you requested.

Thank you very much!
                

[jira] [Resolved] (TIKA-876) Signed pdf parsing

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-876.
--------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 1.0)
                   1.2
         Assignee: Jukka Zitting

In revision 1355724 I added a simple o.a.t.parser.crypto.Pkcs7Parser class that is able to parse the attached p7s file using Bouncy Castle.
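
To illustrate what "unwrapping" a p7s file means at the byte level, here is a minimal sketch that uses only the JDK. It is not the Pkcs7Parser code and does not use Bouncy Castle; the class name Pkcs7Unwrapper is hypothetical. It assumes a DER-encoded PKCS #7 SignedData structure with definite lengths and a primitive OCTET STRING payload, does not verify any signature, and returns null for detached signatures (where the signed content is not embedded). Real files may use BER indefinite lengths, which this sketch rejects.

```java
import java.util.Arrays;

/** Hypothetical helper: walks a DER-encoded PKCS #7 SignedData and returns the embedded content. */
final class Pkcs7Unwrapper {
    private final byte[] buf;
    private int pos;

    private Pkcs7Unwrapper(byte[] buf) { this.buf = buf; }

    /** Checks that the next byte is the expected tag and consumes it. */
    private void expectTag(int tag) {
        if (pos >= buf.length || (buf[pos] & 0xFF) != tag) {
            throw new IllegalArgumentException(
                "expected tag 0x" + Integer.toHexString(tag) + " at offset " + pos);
        }
        pos++;
    }

    /** Reads a definite-length DER length field (short form or 1-4 byte long form). */
    private int readLength() {
        int first = buf[pos++] & 0xFF;
        if (first < 0x80) return first;
        int n = first & 0x7F;
        if (n == 0 || n > 4) throw new IllegalArgumentException("unsupported length form");
        int len = 0;
        for (int i = 0; i < n; i++) len = (len << 8) | (buf[pos++] & 0xFF);
        return len;
    }

    /** Consumes the tag and length of a constructed element; returns the offset where it ends. */
    private int enter(int tag) {
        expectTag(tag);
        int len = readLength();
        return pos + len;
    }

    /** Skips one complete element, whatever its tag. */
    private void skip() {
        pos++;
        int len = readLength();
        pos += len;
    }

    /** Returns the embedded content, or null if the signature is detached. */
    static byte[] unwrap(byte[] der) {
        Pkcs7Unwrapper r = new Pkcs7Unwrapper(der);
        r.enter(0x30);              // outer ContentInfo SEQUENCE
        r.skip();                   // contentType OID (assumed to be signedData, not checked)
        r.enter(0xA0);              // [0] EXPLICIT wrapper
        r.enter(0x30);              // SignedData SEQUENCE
        r.skip();                   // version INTEGER
        r.skip();                   // digestAlgorithms SET
        int end = r.enter(0x30);    // inner ContentInfo SEQUENCE
        r.skip();                   // inner contentType OID (usually "data")
        if (r.pos >= end) return null;  // no content element: detached signature
        r.enter(0xA0);              // [0] EXPLICIT wrapper around the content
        r.expectTag(0x04);          // primitive OCTET STRING holding the original file
        int len = r.readLength();
        if (r.pos + len > der.length) throw new IllegalArgumentException("truncated content");
        return Arrays.copyOfRange(der, r.pos, r.pos + len);
    }
}
```

The returned bytes are the original document (for the attached sample, presumably the PDF), which could then be handed to whatever parser handles that type.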
                

[jira] [Commented] (TIKA-876) Signed pdf parsing

Posted by "Fausto Cruzeiro de Moraes (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231616#comment-13231616 ] 

Fausto Cruzeiro de Moraes commented on TIKA-876:
------------------------------------------------

Hi Nick

I am running Tika over two files: PDFnotsigned.pdf (original pdf document, application/pdf) and PDFsigned.pdf.p7s (digitally signed document, application/pkcs7-signature).

1 - When I run java -jar tika-app-1.0.jar -t PDFnotsigned.pdf > PDFnotsigned.pdf.txt, I get an output file with the expected content.

2 - When I run java -jar tika-app-1.0.jar -t PDFsigned.pdf > PDFsigned.pdf.txt, I get an empty output file (0 KB).

As far as I can tell, Tika has no default parser for the application/pkcs7-signature MIME type.

Thanks





                

[jira] [Commented] (TIKA-876) Signed pdf parsing

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264112#comment-13264112 ] 

Nick Burch commented on TIKA-876:
---------------------------------

We still can't help you very much without a (small) sample file. Any chance you could upload one?

If your PDFs really are wrapped in PKCS7, then we'll need something that unpacks the PKCS7 wrapper and, for signed files, triggers the recursing parser on the contents. (Signed files only to begin with, since there is no way to supply the private key for encrypted ones yet.) I think BouncyCastle might help with this; it's worth a look to start with.

In r1331634 I've added some mime magic for pkcs7 files. I'm not sure whether it's quite right, but it seems OK for the few files I've tried. It'll need someone who knows the PKCS format (or maybe just DER encoding?) to be sure, though. Ideally we should distinguish between signed, encrypted and signed+encrypted, but I'm not sure how we'd do that...
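
One possible way to tell the variants apart, sketched here with only the JDK, is to read the content-type OID at the start of the structure. A PKCS #7 ContentInfo is a SEQUENCE whose first element is an OID under 1.2.840.113549.1.7, and the last arc identifies the kind (2 is signedData, 3 is envelopedData, 4 is signedAndEnvelopedData, 6 is encryptedData). This is not the magic added in r1331634; the class name Pkcs7Sniffer and the returned labels are hypothetical, and the sketch assumes binary DER/BER input rather than PEM text.

```java
import java.util.Arrays;

/** Hypothetical helper: classifies a PKCS #7 blob by its leading content-type OID. */
final class Pkcs7Sniffer {
    // DER bytes of the OID 1.2.840.113549.1.7 (PKCS #7), without the final arc.
    private static final byte[] PKCS7_OID_PREFIX = {
        0x2A, (byte) 0x86, 0x48, (byte) 0x86, (byte) 0xF7, 0x0D, 0x01, 0x07
    };

    /** Returns a label for the PKCS #7 content type, or "unknown" if the bytes do not match. */
    static String sniff(byte[] data) {
        int pos = 0;
        // The outer ContentInfo must be a SEQUENCE (tag 0x30).
        if (data.length < 2 || data[pos++] != 0x30) return "unknown";
        // Skip the length field: one byte in short form, or 0x80|n followed by n bytes
        // in long form. A bare 0x80 is the BER indefinite form and needs no extra skip.
        int first = data[pos++] & 0xFF;
        if (first > 0x80) pos += first & 0x7F;
        // Next must be an OBJECT IDENTIFIER (tag 0x06) that is 9 bytes long.
        if (pos + 11 > data.length || data[pos] != 0x06 || data[pos + 1] != 0x09) return "unknown";
        byte[] prefix = Arrays.copyOfRange(data, pos + 2, pos + 10);
        if (!Arrays.equals(prefix, PKCS7_OID_PREFIX)) return "unknown";
        // The final arc of the OID selects the content type.
        switch (data[pos + 10]) {
            case 1: return "data";
            case 2: return "signed";
            case 3: return "enveloped";
            case 4: return "signed+enveloped";
            case 5: return "digested";
            case 6: return "encrypted";
            default: return "unknown";
        }
    }
}
```

Only the first 15 or so bytes are inspected, so a check like this could in principle be expressed as magic-byte patterns as well. Whether that is robust enough for real files would need testing against more samples.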
                

[jira] [Commented] (TIKA-876) Signed pdf parsing

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278218#comment-13278218 ] 

Nick Burch commented on TIKA-876:
---------------------------------

I can't seem to find any information on how the PKCS7 wrapping takes place, nor on how to unwrap it. Without knowing that, we can't write anything that uses BouncyCastle (or similar) to unpack it.

Are you able to track down any information on how it's done?
                

[jira] [Commented] (TIKA-876) Signed pdf parsing

Posted by "Fausto Cruzeiro de Moraes (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245303#comment-13245303 ] 

Fausto Cruzeiro de Moraes commented on TIKA-876:
------------------------------------------------

Hi Nick

Do you have any tips or advice to help me with this?

Thanks a lot
                

[jira] [Commented] (TIKA-876) Signed pdf parsing

Posted by "Fausto Cruzeiro de Moraes (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230298#comment-13230298 ] 

Fausto Cruzeiro de Moraes commented on TIKA-876:
------------------------------------------------

Hi Nick

What I actually need is to parse digitally signed (PKCS7, for example) PDF files, so that Jackrabbit 2.4.0 can extract and index their content.

Thanks
                

[jira] [Commented] (TIKA-876) Signed pdf parsing

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230299#comment-13230299 ] 

Nick Burch commented on TIKA-876:
---------------------------------

Can you upload a small example file?

When you try to detect it with Tika, what do you get? When you parse it, what do you get? And how do those two things differ from what you'd expect?
                

[jira] [Commented] (TIKA-876) Signed pdf parsing

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230291#comment-13230291 ] 

Nick Burch commented on TIKA-876:
---------------------------------

Shortly after someone submits a patch for it! Unfortunately / fortunately (depending on your perspective), we're all volunteers here.

In the meantime, it may help if you explain what doesn't work and/or what you'd expect to see.

For example, I know we do support password-protected Microsoft Office and PDF files.
                