You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/05/28 02:02:03 UTC

[jira] [Assigned] (TIKA-1291) Invalid JSON output on CLI

     [ https://issues.apache.org/jira/browse/TIKA-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison reassigned TIKA-1291:
---------------------------------

    Assignee: Tim Allison

> Invalid JSON output on CLI
> --------------------------
>
>                 Key: TIKA-1291
>                 URL: https://issues.apache.org/jira/browse/TIKA-1291
>             Project: Tika
>          Issue Type: Bug
>          Components: cli, metadata
>    Affects Versions: 1.4, 1.5
>            Reporter: Steffen
>            Assignee: Tim Allison
>
> Getting the metadata via CLI from tika with output format set to JSON gives sometimes invalid JSON. I only found float/array errors here in jira and thus created this ticket with a new case.
> In my case the file that lead to invalid JSON output was a PNG file (that I unfortunately can't provide for testing):
> {noformat}
> { "Application Record Version":4, 
> "Component 1":"Y component: Quantization table 0, Sampling factors 2 horiz/2 vert", 
> "Component 2":"Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert", 
> "Component 3":"Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert", 
> "Compression Type":"Baseline", 
> "Content-Length":113081, 
> "Content-Type":"image/jpeg", 
> "Data Precision":"8 bits", 
> "IPTC-NAA record":"24 bytes binary data", 
> "Image Height":"479 pixels", 
> "Image Width":"671 pixels", 
> "Number of Components":3, 
> "Resolution Units":"inch", 
> "Unknown tag (0x02f0)":35,0,556,479, 
> "X Resolution":"220 dots", 
> "Y Resolution":"220 dots", 
> "resourceName":18, 
> "tiff:BitsPerSample":8, 
> "tiff:ImageLength":479, 
> "tiff:ImageWidth":671 }
> {noformat}
> The {noformat}"Unknown tag (0x02f0)":35,0,556,479, {noformat} is invalid JSON.
> It would be nice if there's always valid json output from tika. For other cases that might not be catched via fixes by this ticket it would be nice to have a CLI argument/option that disables the output of certain (unknown?) fields or allows giving a whitelist of fieldnames to output. That way users can bridge the time until new releases of tika by being more specific on the shell. If that feature already exists I apology for not having found it directly and a hint to the CLI option would be nice.



--
This message was sent by Atlassian JIRA
(v6.2#6252)