Posted to dev@tika.apache.org by "Carina (Jira)" <ji...@apache.org> on 2020/03/12 11:34:00 UTC
[jira] [Created] (TIKA-3070) Null bytes in extracted metadata
Carina created TIKA-3070:
----------------------------
Summary: Null bytes in extracted metadata
Key: TIKA-3070
URL: https://issues.apache.org/jira/browse/TIKA-3070
Project: Tika
Issue Type: Bug
Components: server
Affects Versions: 1.23
Environment: Docker image: apache/tika:1.23
Reporter: Carina
Attachments: Technical_manual.pdf
Both /rmeta/text and /unpack/all return null bytes (\u0000) in metadata values.
Note the trailing null byte in *"pdf:docinfo:producer": "Adobe PSL 1.2e for Canon\u0000"*
{code:java}
$ curl -T Technical_manual.pdf http://localhost:9998/rmeta/text
[{
"Content-Type": "application/pdf",
"Creation-Date": "2018-08-21T09:40:33Z",
"X-Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.pdf.PDFParser"
],
"X-TIKA:embedded_depth": "0",
"X-TIKA:parse_time_millis": "42",
"access_permission:assemble_document": "true",
"access_permission:can_modify": "true",
"access_permission:can_print": "true",
"access_permission:can_print_degraded": "true",
"access_permission:extract_content": "true",
"access_permission:extract_for_accessibility": "true",
"access_permission:fill_in_form": "true",
"access_permission:modify_annotations": "true",
"dc:format": "application/pdf; version\u003d1.4",
"dcterms:created": "2018-08-21T09:40:33Z",
"meta:creation-date": "2018-08-21T09:40:33Z",
"pdf:PDFVersion": "1.4",
"pdf:charsPerPage": [
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0"
],
"pdf:docinfo:created": "2018-08-21T09:40:33Z",
"pdf:docinfo:creator_tool": "Canon iR-ADV C5235 PDF",
"pdf:docinfo:producer": "Adobe PSL 1.2e for Canon\u0000",
"pdf:encrypted": "false",
"pdf:hasXFA": "false",
"pdf:hasXMP": "true",
"pdf:unmappedUnicodeCharsPerPage": [
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0"
],
"xmp:CreatorTool": "Canon iR-ADV C5235 PDF",
"xmpMM:DocumentID": "uuid:03e07b5b-0000-f481-39c4-e94700000000",
"xmpTPg:NPages": "31"
}]
{code}
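To make the affected fields easier to spot in a long response, a small client-side check along the following lines could be used. This is only a sketch, not part of Tika: the function name `fields_with_nul` is hypothetical, and the sample string is a shortened copy of the output above.

```python
import json


def fields_with_nul(rmeta_json: str) -> list:
    """Return (document index, key) pairs whose value contains U+0000.

    Hypothetical helper; assumes the input is the JSON array returned by /rmeta.
    """
    hits = []
    for index, metadata in enumerate(json.loads(rmeta_json)):
        for key, value in metadata.items():
            # Metadata values are either a single string or a list of strings.
            values = value if isinstance(value, list) else [value]
            if any(isinstance(v, str) and "\x00" in v for v in values):
                hits.append((index, key))
    return hits


# Shortened sample taken from the response above.
sample = (
    '[{"pdf:docinfo:producer": "Adobe PSL 1.2e for Canon\\u0000", '
    '"pdf:PDFVersion": "1.4"}]'
)
print(fields_with_nul(sample))
```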
Another example.
Note the fields "pdf:docinfo:creator_tool": "DigiPath\u0000", "pdf:docinfo:producer": "DigiPath\u0000" and "xmp:CreatorTool": "DigiPath\u0000"
{code:java}
[{
"Content-Type": "application/pdf",
"Last-Modified": "2006-03-02T08:53:15Z",
"Last-Save-Date": "2006-03-02T08:53:15Z",
"X-Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.pdf.PDFParser"
],
"X-TIKA:embedded_depth": "0",
"X-TIKA:parse_time_millis": "96",
"access_permission:assemble_document": "true",
"access_permission:can_modify": "true",
"access_permission:can_print": "true",
"access_permission:can_print_degraded": "true",
"access_permission:extract_content": "true",
"access_permission:extract_for_accessibility": "true",
"access_permission:fill_in_form": "true",
"access_permission:modify_annotations": "true",
"date": "2006-03-02T08:53:15Z",
"dc:format": "application/pdf; version\u003d1.3",
"dcterms:modified": "2006-03-02T08:53:15Z",
"meta:save-date": "2006-03-02T08:53:15Z",
"modified": "2006-03-02T08:53:15Z",
"pdf:PDFVersion": "1.3",
"pdf:charsPerPage": [
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0"
],
"pdf:docinfo:creator_tool": "DigiPath\u0000",
"pdf:docinfo:modified": "2006-03-02T08:53:15Z",
"pdf:docinfo:producer": "DigiPath\u0000",
"pdf:encrypted": "false",
"pdf:hasXFA": "false",
"pdf:hasXMP": "false",
"pdf:unmappedUnicodeCharsPerPage": [
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0",
"0"
],
"xmp:CreatorTool": "DigiPath\u0000",
"xmpTPg:NPages": "14"
}]
{code}
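Until this is addressed in Tika itself, a client could strip the null bytes after parsing the response. The following is a minimal workaround sketch under that assumption, not a proposed fix; `strip_nul` is a hypothetical helper and the input is a shortened copy of the second example.

```python
import json


def strip_nul(value):
    """Recursively remove U+0000 from strings in parsed JSON (hypothetical helper)."""
    if isinstance(value, str):
        return value.replace("\x00", "")
    if isinstance(value, list):
        return [strip_nul(v) for v in value]
    if isinstance(value, dict):
        return {k: strip_nul(v) for k, v in value.items()}
    return value


# Shortened sample taken from the second response above.
raw = '[{"pdf:docinfo:creator_tool": "DigiPath\\u0000", "xmpTPg:NPages": "14"}]'
cleaned = strip_nul(json.loads(raw))
print(cleaned[0]["pdf:docinfo:creator_tool"])
```

This removes every null byte, not only trailing ones; whether that is appropriate for all fields is an assumption of the sketch.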