Posted to dev@tika.apache.org by "Hudson (Jira)" <ji...@apache.org> on 2022/08/31 19:23:00 UTC

[jira] [Commented] (TIKA-3844) Improve extraction of PDF subset info

    [ https://issues.apache.org/jira/browse/TIKA-3844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598590#comment-17598590 ] 

Hudson commented on TIKA-3844:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #768 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/768/])
TIKA-3844: improve extraction of PDF subset information. (tallison: [https://github.com/apache/tika/commit/ff9873378e17453b1f3ac08d43846da6f16e145a])
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/resources/test-documents/xmp/testPDFVT.xmp
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/resources/test-documents/xmp/testPDFUA.xmp
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/resources/test-documents/xmp/testPDFA.xmp
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/xmpschemas/XMPSchemaPDFXId.java
* (edit) CHANGES.txt
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDMetadataExtractor.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/xmpschemas/XMPSchemaPDFX.java
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/resources/test-documents/xmp/testPDFX.xmp
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/xmpschemas/XMPSchemaPDFVT.java
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/CustomTikaXMPTest.java
* (edit) tika-core/src/main/java/org/apache/tika/metadata/PDF.java
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/xmpschemas/XMPSchemaPDFUA.java
TIKA-3844: rm debug -- shakes head in shame. (tallison: [https://github.com/apache/tika/commit/4a5326ce776182e24b809fd02e4f5998a58b94d7])
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/CustomTikaXMPTest.java


> Improve extraction of PDF subset info
> -------------------------------------
>
>                 Key: TIKA-3844
>                 URL: https://issues.apache.org/jira/browse/TIKA-3844
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>             Fix For: 2.4.2
>
>
> We currently extract the PDF/A part and conformance. We should add extraction for PDF/VT, PDF/UA, and PDF/X.
> We should also finally get rid of the bad hack from 1.x that appended the PDF/A conformance to the file type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)