Posted to user@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2017/07/10 18:19:41 UTC

Adding a WARC parser to Tika

Nick,
  Sorry, I can't tell if this is tongue-in-cheek...

Should we look into this?  Perhaps for the -z option?

-----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org] 
Sent: Friday, July 7, 2017 6:55 AM
To: user@tika.apache.org
Subject: RE: Tika content detection and crawled "remote" content

On Fri, 7 Jul 2017, Allison, Timothy B. wrote:
> Should we add a WARC parser? ☺

I think we should!

And also add support into Tika Batch for processing from them :)

Nick

Re: Adding a WARC parser to Tika

Posted by "Jackson, Andy" <An...@bl.uk>.

Nice.

Well, in case it's useful, I cleaned up my code somewhat, used Sebastian's
code to parse the HTTP headers for WARC files, and added (BSD licensed)
test files from DROID and some reasonably meaningful tests.

It's on this branch:

https://github.com/ukwa/tika/tree/experimental-warc-parsing

And the parser tests give some idea of the current behaviour:

https://github.com/ukwa/tika/blob/experimental-warc-parsing/tika-parsers/src/test/java/org/apache/tika/parser/warc/WARCParserTest.java

HTH,
Andy


On 11/07/2017, 19:11, "Sebastian Nagel" <wa...@googlemail.com> wrote:

>FYI, for a similar task - testing the crawler-commons sitemaps.org parser -
>I've started a small test
>tool which reads the sitemaps from WARC files:
>
>https://groups.google.com/forum/?fromgroups#!topic/crawler-commons/pOLsCVwRsxY
>   https://github.com/sebastian-nagel/sitemap-performance-test/
>
>As it only takes what is necessary for testing, it's lean and "no
>overkill".
>
>Sebastian



******************************************************************************************************************
Experience the British Library online at www.bl.uk<http://www.bl.uk/>
The British Library’s latest Annual Report and Accounts : www.bl.uk/aboutus/annrep/index.html<http://www.bl.uk/aboutus/annrep/index.html>
Help the British Library conserve the world's knowledge. Adopt a Book. www.bl.uk/adoptabook<http://www.bl.uk/adoptabook>
The Library's St Pancras site is WiFi - enabled
*****************************************************************************************************************
The information contained in this e-mail is confidential and may be legally privileged. It is intended for the addressee(s) only. If you are not the intended recipient, please delete this e-mail and notify the postmaster@bl.uk<ma...@bl.uk> : The contents of this e-mail must not be disclosed or copied without the sender's consent.
The statements and opinions expressed in this message are those of the author and do not necessarily reflect those of the British Library. The British Library does not take any responsibility for the views of the author.
*****************************************************************************************************************
Think before you print

Re: Adding a WARC parser to Tika

Posted by Sebastian Nagel <wa...@googlemail.com>.

FYI, for a similar task - testing the crawler-commons sitemaps.org parser - I've started a small test
tool which reads the sitemaps from WARC files:
   https://groups.google.com/forum/?fromgroups#!topic/crawler-commons/pOLsCVwRsxY
   https://github.com/sebastian-nagel/sitemap-performance-test/

As it only takes what is necessary for testing, it's lean and "no overkill".

Sebastian


RE: Adding a WARC parser to Tika

Posted by "Jackson, Andy" <An...@bl.uk>.

In case it helps, I'll try to summarise what we've done in this area.

Currently our webarchive-discovery indexing tool parses the WARC and then passes the payload to Tika:

https://github.com/ukwa/webarchive-discovery
https://github.com/ukwa/webarchive-discovery/blob/master/warc-indexer/src/main/java/uk/bl/wa/solr/TikaExtractor.java

This works fine, but along the way we've also experimented with adding WARC parsing to Tika directly. The code is an extremely messy proof-of-concept but I've pushed it here so you can see how it works:

https://github.com/ukwa/tika/tree/experimental-warc-parsing

The parser itself is fairly straightforward:

https://github.com/ukwa/tika/blob/5d89169151257a2696ceac2a4897527ea1b227a7/tika-parsers/src/main/java/org/apache/tika/parser/warc/WARCParser.java#L94

but it did require a few changes elsewhere...

1. Needed to teach Tika to spot ARC/WARC:
https://github.com/apache/tika/compare/master...ukwa:experimental-warc-parsing#diff-a7a8080db8d7c69d9a66b875b4c5b9e7

2. Added webarchive-commons as a dependency:
https://github.com/apache/tika/compare/master...ukwa:experimental-warc-parsing#diff-2426935affac837a5f8f7a84a15939f7

3. Enabled concatenated block gunzip in order to parse WARC.GZ:
https://github.com/apache/tika/compare/master...ukwa:experimental-warc-parsing#diff-5ae41a78b18e2ca8481960cd5e02b860
(given this was explicitly disabled before, this may be contentious?)
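
Items 1 and 3 can be illustrated outside Tika with a minimal Python sketch. This is not the Java code on the branch: the function names are hypothetical, and the only format facts assumed are that a WARC record starts with a "WARC/" version line, that an ARC file starts with a "filedesc://" line, and that a .warc.gz is a series of gzip members written back to back.

```python
import gzip
import zlib


def sniff_web_archive(prefix):
    """Guess the container type from the first bytes of an uncompressed stream."""
    if prefix.startswith(b"WARC/"):
        return "warc"
    if prefix.startswith(b"filedesc://"):
        return "arc"
    return None


def iter_gzip_members(data):
    """Yield the decompressed payload of each gzip member in `data`, in order."""
    while data:
        # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        member = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
        payload = member.decompress(data) + member.flush()
        yield payload
        # Whatever follows the end of this member is the start of the next one.
        data = member.unused_data


# Demo: two tiny records, each gzipped on its own and then concatenated.
record_a = b"WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Length: 0\r\n\r\n\r\n\r\n"
record_b = b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 2\r\n\r\nhi\r\n\r\n"
blob = gzip.compress(record_a) + gzip.compress(record_b)

members = list(iter_gzip_members(blob))
kinds = [sniff_web_archive(m) for m in members]
```

A reader that stops after the first gzip member would only ever see record_a here, which is why the concatenated-member setting matters for WARC.GZ.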

There are also a couple of bigger issues that would need resolving.

Firstly, the WARC format is not a file archive, but primarily an HTTP request/response archive. There are 8 different record types (see https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-record-types for details) that may or may not be of interest. The HTTP request and the response get separate records, and of course the response might be 303 or 404, not just 200. One strategy that is fairly widely used is to simply ignore anything that is not a 200 response, but that does discard quite a lot of information.
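
As a rough illustration of the "only keep 200 responses" strategy, here is a minimal Python sketch. The helper names are hypothetical, it assumes a single uncompressed record held in memory, and it ignores folded header lines; the eight type names are the ones listed in the WARC 1.1 specification linked above.

```python
WARC_RECORD_TYPES = (
    "warcinfo", "response", "resource", "request",
    "metadata", "revisit", "conversion", "continuation",
)


def parse_warc_record(raw):
    """Split one uncompressed WARC record into (version, headers, block)."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8", errors="replace").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    # Content-Length says how much of what follows belongs to this record.
    length = int(headers.get("content-length", len(rest)))
    return lines[0], headers, rest[:length]


def http_status(block):
    """Return the status code if the block starts with an HTTP status line."""
    first = block.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    parts = first.split(" ", 2)
    if len(parts) >= 2 and parts[0].startswith("HTTP/") and parts[1].isdigit():
        return int(parts[1])
    return None


def is_200_response(raw):
    """True only for 'response' records whose HTTP status is 200."""
    _, headers, block = parse_warc_record(raw)
    return headers.get("warc-type") == "response" and http_status(block) == 200


def make_record(warc_type, block):
    """Build a tiny test record (demo helper, not a full WARC writer)."""
    head = "WARC/1.0\r\nWARC-Type: %s\r\nContent-Length: %d\r\n\r\n" % (
        warc_type, len(block))
    return head.encode("utf-8") + block + b"\r\n\r\n"


ok = make_record("response", b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello")
missing = make_record("response", b"HTTP/1.1 404 Not Found\r\n\r\n")
request = make_record("request", b"GET / HTTP/1.1\r\nHost: example.org\r\n\r\n")

kept = [is_200_response(r) for r in (ok, missing, request)]
types_known = all(
    parse_warc_record(r)[1]["warc-type"] in WARC_RECORD_TYPES
    for r in (ok, missing, request)
)
```

In the demo, the 404 response and the request record are both dropped, which shows how much of the archive this filter throws away.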

Secondly, I'm not sure how many layers of embedded documents are appropriate. According to the spec, I would argue that these are the layers:

- archive.warc.gz (a series of block-concatenated gzip records)
- archive.warc.gz/record.warc (an individual WARC record)
- archive.warc.gz/record.warc/http.response (the message/http in its entirety)
- archive.warc.gz/record.warc/http.response/entity.body (the actual resource)

This is probably overkill (and gets worse if it's a gzipped HTTP response!). We could just use:

- archive.warc.gz (a series of block-concatenated gzip records)
- archive.warc.gz/record.warc (the parsed entity.body, with all relevant info from WARC and HTTP headers attached as metadata)

Collapsing the layers down does make it less clear where some of the metadata is coming from, but it’s probably worth it.
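
One way to picture the collapsed form is the Python sketch below. It is only an illustration: the "warc:" and "http:" key prefixes are an assumption made here to keep the origin of each value visible, not something the branch does, and the function name is hypothetical. Chunked transfer encoding is not handled.

```python
import gzip


def flatten_response(warc_headers, http_block):
    """Collapse record -> HTTP response -> body into (metadata, body).

    Keys are prefixed with 'warc:' or 'http:' so that headers with the same
    name at both levels (Content-Type, for example) do not overwrite each other.
    """
    head, _, body = http_block.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    metadata = {"warc:" + k.lower(): v for k, v in warc_headers.items()}
    metadata["http:status-line"] = lines[0]
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            metadata["http:" + name.strip().lower()] = value.strip()

    # A gzipped HTTP payload is one more layer; unwrap it so the body is the resource.
    if metadata.get("http:content-encoding", "").lower() == "gzip":
        body = gzip.decompress(body)
    return metadata, body


# Demo: a response record whose entity body was sent gzip-encoded.
warc_headers = {
    "WARC-Type": "response",
    "WARC-Target-URI": "http://example.org/",
    "Content-Type": "application/http; msgtype=response",
}
http_block = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Encoding: gzip\r\n"
    b"\r\n" + gzip.compress(b"<html>hi</html>")
)

metadata, body = flatten_response(warc_headers, http_block)
```

Here the two Content-Type values survive side by side as "warc:content-type" and "http:content-type", which is the kind of ambiguity the collapsed form would otherwise hide.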

One final note - I've not put the test WARC files in that repo yet as I need to create some new ones from an Apache 2 source.

I hope this is useful.

Best,
Andy


=-=-=-=-=-=-=-=
Dr Andrew N. Jackson
Web Archiving Technical Lead
01937 546602
@UKWebArchive
@anjacks0n
Blog: http://britishlibrary.typepad.co.uk/webarchive/





-----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org]
Sent: 10 July 2017 19:45
To: user@tika.apache.org
Subject: Re: Adding a WARC parser to Tika

On Mon, 10 Jul 2017, Allison, Timothy B. wrote:
> Sorry, I can't tell if this is tongue-in-cheek...

No, I do think we should add a WARC parser to Tika Parsers.

Once done, I'd suggest we figure out a way for Tika Batch to run over a collection of WARC files just as it does for directories, to make it easier to run over crawl collections without having to unpack them first!

Nick



Re: Adding a WARC parser to Tika

Posted by Chris Mattmann <ma...@apache.org>.

+1 from me, makes sense.

Giuseppe is interested in this too FWIW

On 7/10/17, 2:59 PM, "Allison, Timothy B." <ta...@mitre.org> wrote:

    Oh, ok...
    
    As for "does for directories"... yes, I've been thinking about a modification of -z for tar/zip files, pst and, I guess, now WARC: files that can be so enormous that you'd want to unpack them before indexing.  No one would really want to index the Enron pst (if it actually existed) as a single file; rather, they'd want to be able to unpack it and index the individual files.  And, while you can attach a bunch of files inside a PDF or MSOffice file, in practice there seems to be a fundamental difference between how users might want to deal with embedded files in, say, a PDF and in a PST.
    
    Depending on interest, might make sense to add disk images to the list of zip/pst/etc..., e.g. AFF?

RE: Adding a WARC parser to Tika

Posted by Nick Burch <ap...@gagravarr.org>.

On Mon, 10 Jul 2017, Allison, Timothy B. wrote:
> Depending on interest, might make sense to add disk images to the list 
> of zip/pst/etc..., e.g. AFF?

We had a quick look at Apple DMG files in Miami, because another ASF 
project was asking after them. As detailed in TIKA-2372, that'd need a GPL 
plugin. Pretty sure EXT2/3/4 would be the same. Given the similarities 
between OLE2 and FAT, we might be able to bodge something read-only for 
FAT fairly easily with Apache POI, if no other library worked.

Using zip or tar or WARC does seem a lot easier for collections of files to
me though!

Nick

RE: Adding a WARC parser to Tika

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Oh, ok...

As for "does for directories"... yes, I've been thinking about a modification of -z for tar/zip files, pst and, I guess, now WARC: files that can be so enormous that you'd want to unpack them before indexing.  No one would really want to index the Enron pst (if it actually existed) as a single file; rather, they'd want to be able to unpack it and index the individual files.  And, while you can attach a bunch of files inside a PDF or MSOffice file, in practice there seems to be a fundamental difference between how users might want to deal with embedded files in, say, a PDF and in a PST.

Depending on interest, might make sense to add disk images to the list of zip/pst/etc..., e.g. AFF? 



-----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org] 
Sent: Monday, July 10, 2017 2:45 PM
To: user@tika.apache.org
Subject: Re: Adding a WARC parser to Tika

On Mon, 10 Jul 2017, Allison, Timothy B. wrote:
> Sorry, I can't tell if this is tongue-in-cheek...

No, I do think we should add a WARC parser to Tika Parsers.

Once done, I'd suggest we figure out a way for Tika Batch to run over a collection of WARC files just as it does for directories, to make it easier to run over crawl collections without having to unpack them first!

Nick

Re: Adding a WARC parser to Tika

Posted by Nick Burch <ap...@gagravarr.org>.

On Mon, 10 Jul 2017, Allison, Timothy B. wrote:
> Sorry, I can't tell if this is tongue-in-cheek...

No, I do think we should add a WARC parser to Tika Parsers.

Once done, I'd suggest we figure out a way for Tika Batch to run over a 
collection of WARC files just as it does for directories, to make it 
easier to run over crawl collections without having to unpack them first!
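
To sketch what "running over a collection of WARC files like a directory" could look like, here is a small Python illustration. It says nothing about how Tika Batch itself works: the function names are hypothetical, the file-extension check is an assumption, and the record reader is deliberately minimal. It relies on Python's gzip module reading concatenated members as one stream.

```python
import gzip
import tempfile
from pathlib import Path


def iter_records(stream):
    """Yield (headers, block) for each WARC record in an uncompressed binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line[:20])
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        yield headers, stream.read(int(headers["content-length"]))


def iter_collection(directory):
    """Walk a directory and yield (file name, headers, block) without unpacking to disk."""
    for path in sorted(Path(directory).iterdir()):
        if path.name.endswith(".warc.gz"):
            opener = gzip.open
        elif path.name.endswith(".warc"):
            opener = open
        else:
            continue
        with opener(path, "rb") as stream:
            for headers, block in iter_records(stream):
                yield path.name, headers, block


def _record(uri, payload):
    """Build a tiny 'resource' record for the demo below."""
    head = ("WARC/1.0\r\nWARC-Type: resource\r\nWARC-Target-URI: %s\r\n"
            "Content-Length: %d\r\n\r\n" % (uri, len(payload)))
    return head.encode("utf-8") + payload + b"\r\n\r\n"


# Demo: one gzipped WARC with two records, one plain WARC, one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.warc.gz").write_bytes(
        gzip.compress(_record("http://example.org/1", b"one"))
        + gzip.compress(_record("http://example.org/2", b"two")))
    (root / "b.warc").write_bytes(_record("http://example.org/3", b"three"))
    (root / "notes.txt").write_text("not an archive")
    found = [(name, h["warc-target-uri"], block)
             for name, h, block in iter_collection(root)]
```

Each yielded block could then be handed to whatever does the per-document parsing, much as a file from a directory would be.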

Nick