Posted to dev@tika.apache.org by "Hudson (JIRA)" <ji...@apache.org> on 2019/07/31 16:43:00 UTC

[jira] [Commented] (TIKA-2917) Extract metadata from inline images in PDFs

    [ https://issues.apache.org/jira/browse/TIKA-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16897315#comment-16897315 ] 

Hudson commented on TIKA-2917:
------------------------------

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #446 (See [https://builds.apache.org/job/tika-2.x-windows/446/])
TIKA-2917 -- extract metadata that accompanies inline images (tallison: rev 86325105ab206dca88d076dc865fcb17404c4531)
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/image/xmp/JempboxExtractor.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDMetadataExtractor.java


> Extract metadata from inline images in PDFs
> -------------------------------------------
>
>                 Key: TIKA-2917
>                 URL: https://issues.apache.org/jira/browse/TIKA-2917
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>
> Inline images may have XMP associated with them.  We are not currently extracting this metadata.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

Re: [jira] [Commented] (TIKA-2917) Extract metadata from inline images in PDFs

Posted by Ken Krugler <kk...@transpac.com>.

Hi Tim,

> On Nov 20, 2020, at 11:21 AM, Tim Allison <ta...@apache.org> wrote:
> 
> Yes, that should do it.  I don't think we currently document this.  It
> looks like POI and PDFBox also require the JCE unlimited policy files to build.
> 
> Hmmm... Should we assumeTrue that JCE is installed and skip that unit
> test if not, or do we want to require it to build Tika?

I think we should document that building Tika requires installing the JCE Unlimited Strength Jurisdiction Policy Files.

For Java 8 on my Mac, this worked:

1. Go to https://www.oracle.com/java/technologies/javase-jce8-downloads.html, and click the download link.
2. Sign in with your Oracle account, accept the license, and wait for the (small) file to download.
3. Expand the downloaded zip file.

From a terminal:

    sudo cp ~/Downloads/UnlimitedJCEPolicyJDK8/US_export_policy.jar "$JAVA_HOME/jre/lib/security/"
    sudo cp ~/Downloads/UnlimitedJCEPolicyJDK8/local_policy.jar "$JAVA_HOME/jre/lib/security/"
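
One way to confirm that the copied policy files took effect is to ask the JVM for its maximum allowed AES key length. The sketch below is only an illustration and not something from the Tika code base; the class name JcePolicyReport is made up. It relies on the standard javax.crypto.Cipher.getMaxAllowedKeyLength call, which should report Integer.MAX_VALUE when the unlimited policy is active and 128 for AES under the limited policy. It also prints the crypto.policy security property, which, as far as I know, only exists from Java 8u151 onward and may simply print null on other builds.

```java
import java.security.NoSuchAlgorithmException;
import java.security.Security;
import javax.crypto.Cipher;

// Hypothetical helper for checking the installed JCE policy; not part of Tika.
final class JcePolicyReport {

    // Returns the maximum AES key length (in bits) the current policy allows,
    // or 0 if the AES algorithm cannot be looked up at all.
    static int maxAesKeyLength() {
        try {
            return Cipher.getMaxAllowedKeyLength("AES");
        } catch (NoSuchAlgorithmException e) {
            return 0;
        }
    }

    // Builds a one-line summary of the effective policy.
    static String describe() {
        int max = maxAesKeyLength();
        String limit = (max == Integer.MAX_VALUE) ? "unlimited" : max + " bits";
        // crypto.policy may be null when the property is not set in java.security.
        String policy = Security.getProperty("crypto.policy");
        return "Max AES key length: " + limit + "; crypto.policy=" + policy;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Run it with the same JAVA_HOME that Maven uses for the build; otherwise the result may describe a different JDK than the one running the tests.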

— Ken



--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

Re: [jira] [Commented] (TIKA-2917) Extract metadata from inline images in PDFs

Posted by Tim Allison <ta...@apache.org>.

Yes, that should do it.  I don't think we currently document this.  It
looks like POI and PDFBox also require the JCE unlimited policy files to build.

Hmmm... Should we assumeTrue that JCE is installed and skip that unit
test if not, or do we want to require it to build Tika?
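
For the skip-the-test option, a minimal sketch of the check might look like the following. JceAssumptions is a hypothetical helper, not an existing Tika class, and treating "AES keys of 256 bits or more are allowed" as the signal for unlimited-strength policy is an assumption on my part. The JUnit call in the trailing comment is the JUnit 4 Assume API.

```java
import java.security.NoSuchAlgorithmException;
import javax.crypto.Cipher;

// Hypothetical helper; the name and the 256-bit threshold are assumptions.
final class JceAssumptions {

    // Maximum AES key length (in bits) permitted by the installed policy,
    // or 0 if AES cannot be looked up.
    static int maxAesKeyLength() {
        try {
            return Cipher.getMaxAllowedKeyLength("AES");
        } catch (NoSuchAlgorithmException e) {
            return 0;
        }
    }

    // True when the JVM allows AES keys of at least 256 bits, which is
    // what the unlimited-strength policy files enable.
    static boolean isUnlimitedStrengthAvailable() {
        return maxAesKeyLength() >= 256;
    }
}

// In a JUnit 4 test such as testEncrypted, the guard could be the first line:
//   org.junit.Assume.assumeTrue(JceAssumptions.isUnlimitedStrengthAvailable());
// JUnit then reports the test as skipped instead of as an error.
```

The trade-off is that a skipped test no longer proves encrypted documents parse on that machine, so requiring the policy files on the CI builds may still be worthwhile.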

On Fri, Nov 20, 2020 at 1:43 PM Ken Krugler <kk...@transpac.com>
wrote:

> Hi all,
>
> I was trying to build the 1.25-rc1 branch, and ran into this same issue
> while building the Tika parsers:
>
> > Tests run: 87, Failures: 0, Errors: 1, Skipped: 3, Time elapsed: 6.816 s <<< FAILURE! - in org.apache.tika.parser.microsoft.ooxml.OOXMLParserTest
> > org.apache.tika.parser.microsoft.ooxml.OOXMLParserTest.testEncrypted  Time elapsed: 0.286 s  <<< ERROR!
> > org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@c0de6c9
> >       at org.apache.tika.parser.microsoft.ooxml.OOXMLParserTest.testEncrypted(OOXMLParserTest.java:1120)
> > Caused by: org.apache.poi.EncryptedDocumentException: Export Restrictions in place - please install JCE Unlimited Strength Jurisdiction Policy files
> >       at org.apache.tika.parser.microsoft.ooxml.OOXMLParserTest.testEncrypted(OOXMLParserTest.java:1120)
>
> I assume I need to follow instructions at say
> https://dzone.com/articles/install-java-cryptography-extension-jce-unlimited
> to get the appropriate files installed, yes?
>
> And is this documented for Tika somewhere?
>
> Thanks,
>
> — Ken

Re: [jira] [Commented] (TIKA-2917) Extract metadata from inline images in PDFs

Posted by Ken Krugler <kk...@transpac.com>.

Hi all,

I was trying to build the 1.25-rc1 branch, and ran into this same issue while building the Tika parsers:

> Tests run: 87, Failures: 0, Errors: 1, Skipped: 3, Time elapsed: 6.816 s <<< FAILURE! - in org.apache.tika.parser.microsoft.ooxml.OOXMLParserTest
> org.apache.tika.parser.microsoft.ooxml.OOXMLParserTest.testEncrypted  Time elapsed: 0.286 s  <<< ERROR!
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@c0de6c9
> 	at org.apache.tika.parser.microsoft.ooxml.OOXMLParserTest.testEncrypted(OOXMLParserTest.java:1120)
> Caused by: org.apache.poi.EncryptedDocumentException: Export Restrictions in place - please install JCE Unlimited Strength Jurisdiction Policy files
> 	at org.apache.tika.parser.microsoft.ooxml.OOXMLParserTest.testEncrypted(OOXMLParserTest.java:1120)

I assume I need to follow instructions such as those at https://dzone.com/articles/install-java-cryptography-extension-jce-unlimited to get the appropriate files installed, yes?

And is this documented for Tika somewhere?

Thanks,

— Ken


> On Jul 31, 2019, at 9:45 AM, Tim Allison <ta...@apache.org> wrote:
> 
> Dave,
>  So that I can fix stuff in the future...can you share with me how to
> fix this issue on Hudson?
> 
> org.apache.tika.parser.microsoft.OfficeParser@6f1fd7c1
> at org.apache.tika.parser.microsoft.ooxml.OOXMLParserTest.testEncrypted(OOXMLParserTest.java:1234)
> Caused by: org.apache.poi.EncryptedDocumentException: Export
> Restrictions in place - please install JCE Unlimited Strength
> Jurisdiction Policy files
> at org.apache.tika.parser.microsoft.ooxml.OOXMLParserTest.testEncrypted(OOXMLParserTest.java:1234)
> 
> Many thanks!
> 
>        Cheers,
> 
>              Tim
> 


Re: [jira] [Commented] (TIKA-2917) Extract metadata from inline images in PDFs

Posted by Tim Allison <ta...@apache.org>.

Dave,
  So that I can fix stuff in the future...can you share with me how to
fix this issue on Hudson?

org.apache.tika.parser.microsoft.OfficeParser@6f1fd7c1
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParserTest.testEncrypted(OOXMLParserTest.java:1234)
Caused by: org.apache.poi.EncryptedDocumentException: Export Restrictions in place - please install JCE Unlimited Strength Jurisdiction Policy files
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParserTest.testEncrypted(OOXMLParserTest.java:1234)

Many thanks!

        Cheers,

              Tim

On Wed, Jul 31, 2019 at 12:43 PM Hudson (JIRA) <ji...@apache.org> wrote:
>
>
>     [ https://issues.apache.org/jira/browse/TIKA-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16897315#comment-16897315 ]
>
> Hudson commented on TIKA-2917:
> ------------------------------
>
> UNSTABLE: Integrated in Jenkins build tika-2.x-windows #446 (See [https://builds.apache.org/job/tika-2.x-windows/446/])
> TIKA-2917 -- extract metadata that accompanies inline images (tallison: rev 86325105ab206dca88d076dc865fcb17404c4531)
> * (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
> * (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
> * (edit) tika-parsers/src/main/java/org/apache/tika/parser/image/xmp/JempboxExtractor.java
> * (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
> * (add) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDMetadataExtractor.java
>
>
> > Extract metadata from inline images in PDFs
> > -------------------------------------------
> >
> >                 Key: TIKA-2917
> >                 URL: https://issues.apache.org/jira/browse/TIKA-2917
> >             Project: Tika
> >          Issue Type: Improvement
> >            Reporter: Tim Allison
> >            Assignee: Tim Allison
> >            Priority: Minor
> >
> > Inline images may have XMP associated with them.  We are not currently extracting this metadata.