Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/12/20 15:34:00 UTC

[jira] [Resolved] (TIKA-2532) Output for PDF file contains X-TIKA:content that is a PDF fragment

     [ https://issues.apache.org/jira/browse/TIKA-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2532.
-------------------------------
    Resolution: Not A Problem

Thank you for opening this issue. 

If you open the file in Acrobat Reader and click the paperclip on the left to view the attachments within the PDF, you will see an attachment called {{IEEE.joboptions}}. If you select "Save Attachment", you get exactly the text that you noted here.

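In short, Tika is working as it should: the second element in the recursive output is the extracted text of that embedded attachment, not a fragment of the PDF itself.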
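
For anyone post-processing the recursive json output, here is a minimal sketch (not part of Tika) of how one might separate the container document from its embedded attachments. It assumes the output is a list with the container first and each embedded document after it. It also assumes embedded entries carry an {{X-TIKA:embedded_resource_path}} key; that key is not visible in the truncated output quoted below, so please check it against your own output. The sample data and the helper names are hypothetical.

```python
import json

# Hypothetical, abbreviated stand-in for Tika's recursive json output:
# one metadata object per document, container first, attachments after.
sample = json.loads("""
[
  {"Content-Type": "application/pdf",
   "X-TIKA:content": "(extracted article text)"},
  {"Content-Type": "text/plain; charset=ISO-8859-1",
   "X-TIKA:embedded_resource_path": "/IEEE.joboptions",
   "X-TIKA:content": "<<\\n  /ASCII85EncodePages false\\n"}
]
""")


def split_container_and_embedded(docs):
    """Return (container, embedded_docs); the first entry is the container."""
    if not docs:
        return None, []
    return docs[0], docs[1:]


def describe(doc):
    """Return (path, content type) for one metadata object."""
    # Assumed key name; entries without it are reported as the container.
    path = doc.get("X-TIKA:embedded_resource_path", "(container)")
    return path, doc.get("Content-Type", "unknown")


container, embedded = split_container_and_embedded(sample)
for doc in embedded:
    # Each embedded entry is an attachment's own text, not the PDF body.
    print(describe(doc))
```

If only the main document's text is wanted, taking the first element of the list and ignoring the rest is one simple option.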

> Output for PDF file contains X-TIKA:content that is a PDF fragment
> ------------------------------------------------------------------
>
>                 Key: TIKA-2532
>                 URL: https://issues.apache.org/jira/browse/TIKA-2532
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.15, 1.16, 1.17
>         Environment: Ubuntu 64 bit
> JDK 1.8
>            Reporter: Trevor Yann
>            Priority: Minor
>         Attachments: A_latent_topic_model_for_complete_entity.pdf
>
>
> I have a PDF file that returns two elements in the recursive json output. The first element is text, as expected. The second element seems to be a fragment of a PDF file, rather than extracted text.
> The start of the second element in the json output is:
>   {
>     "Content-Encoding": "ISO-8859-1",
>     "Content-Length": "-1",
>     "Content-Type": "text/plain; charset\u003dISO-8859-1",
>     "X-Parsed-By": [
>       "org.apache.tika.parser.DefaultParser",
>       "org.apache.tika.parser.txt.TXTParser"
>     ],
>     "X-TIKA:content": "\u003c\u003c\n  /ASCII85EncodePages false\n  /AllowTransparency false\n  /AutoPositionEPSFiles true\n  /AutoRotatePages /None\n  /Binding /Left\n  /CalGrayProfile (Gray Gamma 2.2)\n  /CalRGBProfile (sRGB IEC61966-2.1)\n  /CalCMYKProfile (U.S. Web Coated \\050SWOP\\051 v2)\n  /sRGBProfile (sRGB IEC61966-2.1)\n  /CannotEmbedFontPolicy /Warning\n  /CompatibilityLevel 1.4\n  /CompressObjects /Off\n  /CompressPages true\n  /ConvertImagesToIndexed true\n  /PassThroughJPEGImages true\n  /CreateJobTicket false\n  /DefaultRenderingIntent /Default\n  /DetectBlends true\n  /DetectCurves 0.0000\n  /ColorConversionStrategy /LeaveColorUnchanged\n  /DoThumbnails true\n  /EmbedAllFonts true\n  /EmbedOpenType false\n  /ParseICCProfilesInComments true\n  /EmbedJobOptions true\n  /DSCReportingLevel 0\n  /EmitDSCWarnings false\n  /EndPage -1\n  /ImageMemory 1048576\n  /LockDistillerParams true\n  /MaxSubsetPct 100\n  /Optimize true\n  /OPM 0\n  /ParseDSCComments false\n  /ParseDSCCommentsForDocInfo false\n  /PreserveCopyPage true\n  /PreserveDICMYKValues true\n  /PreserveEPSInfo false\n  /PreserveFlatness true\n  /PreserveHalftoneInfo true\n  /PreserveOPIComments false\n  /PreserveOverprintSettings true\n  /StartPage 1\n  /SubsetFonts true\n  /TransferFunctionInfo /Remove\n  /UCRandBGInfo /Preserve\n  /UsePrologue false\n  /ColorSettingsFile ()\n  /AlwaysEmbed [ true\n    /AbadiMT-CondensedLight\n    /ACaslon-Italic\n    /ACaslon-


