Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2023/08/15 16:34:00 UTC

[jira] [Commented] (TIKA-4048) Gzipped WARC not identifying all assets

    [ https://issues.apache.org/jira/browse/TIKA-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754711#comment-17754711 ] 

Tim Allison commented on TIKA-4048:
-----------------------------------

I'm back at the keyboard and ready to work on this.  I'm still inclined to switch the default for "uncompress multiple compressor streams" back to "false".  We can add a gzipped WARC detector and then use jwarc to process the gzipped stream.  Going forward, we can extend the gzip+combo detector to detect tgz and svgz, but that can happen on other tickets down the road.
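
As a rough illustration of the detector idea in the comment above, the sketch below checks for the gzip magic bytes and then inflates only the first few bytes of the stream to see whether the payload starts with "WARC/". The class and method names (GzippedWarcSniffer, looksLikeGzippedWarc, gzip) are hypothetical and use only the JDK; this is not Tika's detector API and not necessarily how the fix was implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only; not part of Tika or jwarc.
class GzippedWarcSniffer {

    // Every WARC record starts with a version line such as "WARC/1.0".
    private static final byte[] WARC_MAGIC = "WARC/".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if data starts with a gzip member whose payload begins with "WARC/". */
    static boolean looksLikeGzippedWarc(byte[] data) {
        // gzip streams begin with the two magic bytes 0x1f 0x8b.
        if (data.length < 2 || (data[0] & 0xff) != 0x1f || (data[1] & 0xff) != 0x8b) {
            return false;
        }
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            // Inflate only as many bytes as are needed for the comparison.
            byte[] head = in.readNBytes(WARC_MAGIC.length);
            return Arrays.equals(head, WARC_MAGIC);
        } catch (IOException e) {
            // Corrupt or truncated gzip data: treat it as "not a gzipped WARC".
            return false;
        }
    }

    /** Test helper (also hypothetical): gzip-compresses raw bytes in memory. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }
}
```

A check like this would let a gzipped WARC be routed as a whole to a WARC-aware reader, rather than being unwrapped by a generic gzip step first.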

> Gzipped WARC not identifying all assets
> ---------------------------------------
>
>                 Key: TIKA-4048
>                 URL: https://issues.apache.org/jira/browse/TIKA-4048
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gregory Lepore
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 2.8.1
>
>         Attachments: Screenshot 2023-05-30 at 3.49.19 PM.png, Screenshot 2023-05-30 at 3.50.41 PM.png, rec-20230518121844489398-5335604b8b23.warc, rec-20230518121844489398-5335604b8b23.warc.gz, rec-20230518121844489398-5335604b8b23.warc.gz.json, rec-20230518121844489398-5335604b8b23.warc.json
>
>
> The WARC parser works for non-GZipped WARC files, but for GZipped WARC files it appears that not all embedded files are being identified.
>  
> Processing a WARC.GZ file should return the same JSON output as the plain WARC file, with the addition of the GZ file metadata. However, in the attached JSON outputs, the JPEG present in the plain WARC file is not represented in the WARC.GZ.json file.
>  
> Additionally, the warc: metadata is not being returned for all files, although this may be by design. 
>  
> Attached are two JSON files, one for the GZipped WARC file and one for the plain WARC file, along with the two original files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)