Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2023/01/16 10:58:00 UTC

[jira] [Updated] (TIKA-3703) Consider adding a frictionless data package output format

     [ https://issues.apache.org/jira/browse/TIKA-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-3703:
------------------------------
    Description: 
For those who want more than just text and metadata, e.g. bytes for thumbnails, embedded images, embedded files, or rendered pages, it would be great to return that data in a standard format. Our current /unpack endpoint returns a zip file, but one laid out according to our own "standard".

I was thinking about going the pure JSON route by including these byte streams as base64-encoded metadata values in our current metadata object. Not sure which is the better way to go.
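
A minimal sketch of the pure JSON idea, written in Python only for brevity. It assumes the metadata object can be treated as a flat string-to-string map; the key name "X-Sketch:thumbnail-base64" and both helper functions are hypothetical and are not existing Tika metadata keys or APIs.

```python
import base64
import json


def embed_bytes(metadata: dict, key: str, data: bytes) -> dict:
    """Return a copy of metadata with data stored under key as a base64 string."""
    out = dict(metadata)
    out[key] = base64.b64encode(data).decode("ascii")
    return out


def extract_bytes(metadata: dict, key: str) -> bytes:
    """Decode the base64 string stored under key back into raw bytes."""
    return base64.b64decode(metadata[key])


# Hypothetical key name, used only for illustration.
THUMBNAIL_KEY = "X-Sketch:thumbnail-base64"

meta = {"Content-Type": "application/pdf"}
meta = embed_bytes(meta, THUMBNAIL_KEY, b"\x89PNG\r\n\x1a\n")

# The whole object, bytes included, now serializes as ordinary JSON.
payload = json.dumps(meta)
```

One trade-off to weigh: base64 inflates each byte stream by roughly a third, and a client has to parse the whole JSON document before it can get at any single embedded file.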

I'm opening this issue to discuss options.

 

Reference: [https://frictionlessdata.io/standards/#standards-toolkit]

We'd want to make this available as an endpoint on tika-server ({{/v2/unpack}} or something else?) and as a command-line option in tika-app.
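
For comparison, a rough sketch of what the data package option could look like: a zip that carries the extracted files plus a datapackage.json descriptor listing them. It is written in Python only for brevity. The descriptor fields used here (name, resources, path, mediatype, bytes, hash) come from the Frictionless Data Package specification; the function name, the resource naming scheme, and the file layout are assumptions made for illustration, not a proposed Tika design.

```python
import hashlib
import io
import json
import zipfile


def build_data_package(name: str, files: list) -> bytes:
    """Zip the given (path, mediatype, data) tuples together with a datapackage.json."""
    resources = []
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for i, (path, mediatype, data) in enumerate(files):
            zf.writestr(path, data)
            resources.append({
                "name": f"embedded-{i}",  # hypothetical naming scheme
                "path": path,
                "mediatype": mediatype,
                "bytes": len(data),
                "hash": "sha256:" + hashlib.sha256(data).hexdigest(),
            })
        descriptor = {"name": name, "resources": resources}
        zf.writestr("datapackage.json", json.dumps(descriptor, indent=2))
    return buf.getvalue()


# Hypothetical package name and embedded file.
pkg = build_data_package(
    "tika-unpack-sketch",
    [("embedded/image0.png", "image/png", b"\x89PNG\r\n\x1a\n")],
)
```

Unlike the base64 approach, the bytes stay in their raw form here, and a client can read the descriptor first and then pull out only the entries it needs.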

  was:
For those who want more than just text and metadata, e.g. bytes for thumbnails, or embedded images or embedded files or rendered pages, it would be great to return that data in a standard format.  Our current /unpack endpoint uses a zip file but with our own "standard".

I was thinking about heading down the pure json option by including these byte streams as base64 encoded metadata values in our current metadata object.  Not sure which is the better way to go.

I'm opening this issue to discuss options.

https://frictionlessdata.io/standards/#standards-toolkit


> Consider adding a frictionless data package output format
> ---------------------------------------------------------
>
>                 Key: TIKA-3703
>                 URL: https://issues.apache.org/jira/browse/TIKA-3703
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> For those who want more than just text and metadata, e.g. bytes for thumbnails, embedded images, embedded files, or rendered pages, it would be great to return that data in a standard format. Our current /unpack endpoint returns a zip file, but one laid out according to our own "standard".
> I was thinking about going the pure JSON route by including these byte streams as base64-encoded metadata values in our current metadata object. Not sure which is the better way to go.
> I'm opening this issue to discuss options.
>  
> Reference: [https://frictionlessdata.io/standards/#standards-toolkit]
> We'd want to make this available as an endpoint on tika-server ({{/v2/unpack}} or something else?) and as a command-line option in tika-app.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)