Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2020/03/12 12:10:00 UTC

[jira] [Commented] (TIKA-3070) Null bytes in extracted metadata

    [ https://issues.apache.org/jira/browse/TIKA-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057863#comment-17057863 ] 

Tim Allison commented on TIKA-3070:
-----------------------------------

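The trailing null character is already present in the string as stored in the PDF: the /Producer value is UTF-16BE (note the leading feff byte-order mark) and ends with the code unit 0000.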

{noformat}
/Producer <feff00410064006f00620065002000500053004c00200031002e0032006500200066006f0072002000430061006e006f006e0000>
{noformat}

If there's something in the PDF spec that says trailing null bytes should be ignored, we should open a ticket with PDFBox.
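
For anyone who wants to check this independently, here is a minimal sketch that decodes the /Producer hex string quoted above. It assumes the usual PDF convention that a text string beginning with the bytes FE FF is UTF-16BE. The variable names are illustrative only, and this is not how Tika or PDFBox decode the value internally.

```python
import codecs

# Hex string copied from the /Producer entry quoted above.
hex_producer = (
    "feff00410064006f00620065002000500053004c00200031002e"
    "0032006500200066006f0072002000430061006e006f006e0000"
)

raw = bytes.fromhex(hex_producer)

# Assumption: a leading FE FF marks the text string as UTF-16BE.
has_bom = raw.startswith(codecs.BOM_UTF16_BE)

# Drop the two BOM bytes, then decode the remainder as big-endian UTF-16.
decoded = raw[2:].decode("utf-16-be")

print(has_bom)
print(repr(decoded))            # the final character is U+0000
print(decoded.endswith("\x00"))
```

The decoded value is "Adobe PSL 1.2e for Canon" followed by a single U+0000, which matches the "\u0000" in the JSON output reported in the issue.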

> Null bytes in extracted metadata
> --------------------------------
>
>                 Key: TIKA-3070
>                 URL: https://issues.apache.org/jira/browse/TIKA-3070
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.23
>         Environment: Docker image: apache/tika:1.23
>            Reporter: Carina
>            Priority: Major
>         Attachments: Technical_manual.pdf
>
>
> Both /rmeta/text and /unpack/all return null bytes in metadata.
>  
> Note *"pdf:docinfo:producer": "Adobe PSL 1.2e for Canon\u0000"*
>  
> {code:java}
> $ curl -T Technical_manual.pdf http://localhost:9998/rmeta/text 
> [{
>   "Content-Type": "application/pdf",
>   "Creation-Date": "2018-08-21T09:40:33Z",
>   "X-Parsed-By": [
>     "org.apache.tika.parser.DefaultParser",
>     "org.apache.tika.parser.pdf.PDFParser"
>   ],
>   "X-TIKA:embedded_depth": "0",
>   "X-TIKA:parse_time_millis": "42",
>   "access_permission:assemble_document": "true",
>   "access_permission:can_modify": "true",
>   "access_permission:can_print": "true",
>   "access_permission:can_print_degraded": "true",
>   "access_permission:extract_content": "true",
>   "access_permission:extract_for_accessibility": "true",
>   "access_permission:fill_in_form": "true",
>   "access_permission:modify_annotations": "true",
>   "dc:format": "application/pdf; version\u003d1.4",
>   "dcterms:created": "2018-08-21T09:40:33Z",
>   "meta:creation-date": "2018-08-21T09:40:33Z",
>   "pdf:PDFVersion": "1.4",
>   "pdf:charsPerPage": [
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0"
>   ],
>   "pdf:docinfo:created": "2018-08-21T09:40:33Z",
>   "pdf:docinfo:creator_tool": "Canon iR-ADV C5235  PDF",
>   "pdf:docinfo:producer": "Adobe PSL 1.2e for Canon\u0000",
>   "pdf:encrypted": "false",
>   "pdf:hasXFA": "false",
>   "pdf:hasXMP": "true",
>   "pdf:unmappedUnicodeCharsPerPage": [
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0"
>   ],
>   "xmp:CreatorTool": "Canon iR-ADV C5235  PDF",
>   "xmpMM:DocumentID": "uuid:03e07b5b-0000-f481-39c4-e94700000000",
>   "xmpTPg:NPages": "31"
> }]
> {code}
>  
>  
> Another example.
> Note the fields "pdf:docinfo:creator_tool": "DigiPath\u0000", "pdf:docinfo:producer": "DigiPath\u0000" and "xmp:CreatorTool": "DigiPath\u0000"
>  
> {code:java}
> [{
>   "Content-Type": "application/pdf",
>   "Last-Modified": "2006-03-02T08:53:15Z",
>   "Last-Save-Date": "2006-03-02T08:53:15Z",
>   "X-Parsed-By": [
>     "org.apache.tika.parser.DefaultParser",
>     "org.apache.tika.parser.pdf.PDFParser"
>   ],
>   "X-TIKA:embedded_depth": "0",
>   "X-TIKA:parse_time_millis": "96",
>   "access_permission:assemble_document": "true",
>   "access_permission:can_modify": "true",
>   "access_permission:can_print": "true",
>   "access_permission:can_print_degraded": "true",
>   "access_permission:extract_content": "true",
>   "access_permission:extract_for_accessibility": "true",
>   "access_permission:fill_in_form": "true",
>   "access_permission:modify_annotations": "true",
>   "date": "2006-03-02T08:53:15Z",
>   "dc:format": "application/pdf; version\u003d1.3",
>   "dcterms:modified": "2006-03-02T08:53:15Z",
>   "meta:save-date": "2006-03-02T08:53:15Z",
>   "modified": "2006-03-02T08:53:15Z",
>   "pdf:PDFVersion": "1.3",
>   "pdf:charsPerPage": [
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0"
>   ],
>   "pdf:docinfo:creator_tool": "DigiPath\u0000",
>   "pdf:docinfo:modified": "2006-03-02T08:53:15Z",
>   "pdf:docinfo:producer": "DigiPath\u0000",
>   "pdf:encrypted": "false",
>   "pdf:hasXFA": "false",
>   "pdf:hasXMP": "false",
>   "pdf:unmappedUnicodeCharsPerPage": [
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0",
>     "0"
>   ],
>   "xmp:CreatorTool": "DigiPath\u0000",
>   "xmpTPg:NPages": "14"
> }]
> {code}
>  
>  
>  


