Posted to user@nutch.apache.org by Davíð Steinn Geirsson <da...@dsg.is> on 2016/04/14 17:51:04 UTC

Nutch WARC export problems

Hi all,

I'm trying to use Nutch v1.11 for an archival crawl and export
the results to WARC files.

It seems there are at least two separate WARC exporters in Nutch,
but both have some problems.

The first one is org.apache.nutch.tools.CommonCrawlDataDumper
(invoked with 'nutch commoncrawldump'), which can export a WARC
file with the appropriate option. The resulting WARC file looks
good, except that the HTTP response body seems to have been
mangled by removing the CR-LF between the HTTP response headers
and the HTTP response body. The result is that it's not really
possible to tell where the headers end and the body begins.
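To illustrate why the missing separator matters, here is a minimal Python
sketch (not Nutch code) of how a WARC consumer typically splits an HTTP
response payload. The function name and the sample payloads are made up for
illustration, and the mangled sample assumes that exactly the blank line
after the last header was dropped.

```python
# CRLF ending the last header line, followed by the empty line
# that separates the HTTP headers from the body.
HEADER_END = b"\r\n\r\n"


def split_http_response(payload: bytes):
    """Return (headers, body), or None when the separator is missing."""
    head, sep, body = payload.partition(HEADER_END)
    if not sep:
        return None
    return head, body


# Hypothetical payloads: a well-formed response, and the same response
# with the blank line between headers and body removed.
good = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
mangled = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n<html></html>"

print(split_http_response(good))
print(split_http_response(mangled))
```

A check like this only finds the first empty line, so a mangled payload
whose body happens to contain an empty line would be split in the wrong
place instead of being rejected.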

The second one is org.apache.nutch.tools.warc.WARCExporter
(invoked with 'nutch warc'). That one writes WARC response
records properly, with the header separator. Unfortunately,
that's *all* it writes - the resulting file contains no matching
request records, or even a warcinfo record for that matter.
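One way to check which record types an exported file contains is to walk
the records and tally their WARC-Type headers. The following is a rough
Python sketch under a few assumptions: the file is uncompressed, every
record carries a Content-Length header, and the helper names
(count_record_types, record) are invented for this example. The sample
record is deliberately minimal and omits headers a real WARC record
requires, such as WARC-Record-ID and WARC-Date.

```python
import io
from collections import Counter


def count_record_types(stream):
    """Count WARC-Type values in an uncompressed WARC byte stream."""
    counts = Counter()
    while True:
        line = stream.readline()
        if not line:
            break  # end of file
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that follow each record
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the WARC header block
            name, _, value = line.partition(b":")
            headers[name.strip().lower()] = value.strip()
        warc_type = headers.get(b"warc-type", b"?").decode("ascii", "replace")
        counts[warc_type] += 1
        # Skip over the record's content block.
        stream.read(int(headers.get(b"content-length", b"0")))
    return counts


def record(warc_type, block):
    """Build a minimal sample record, just enough for the counter above."""
    head = "WARC/1.0\r\nWARC-Type: %s\r\nContent-Length: %d\r\n\r\n" % (
        warc_type, len(block))
    return head.encode("ascii") + block + b"\r\n\r\n"


# A file that holds only a response record, like the output described above.
sample = record("response", b"HTTP/1.1 200 OK\r\n\r\nhello")
counts = count_record_types(io.BytesIO(sample))
missing = [t for t in ("warcinfo", "request", "response") if counts[t] == 0]
print(dict(counts), missing)
```

For a real export, the stream would be the file opened in binary mode
instead of the in-memory sample.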

So my question is, is it possible to use Nutch in its present
state to export working WARC files containing both request and
response records? I'm willing to move to nutch v2.x if it makes a
difference.

Best regards,
Davíð

Re: Nutch WARC export problems

Posted by Julien Nioche <li...@gmail.com>.
Hi David

I've created NUTCH-2255 <https://issues.apache.org/jira/browse/NUTCH-2255> to
track this (as well as https://github.com/DigitalPebble/sc-warc/issues/1 for
StormCrawler). Not sure if/when I'll find the time to work on this but at
least it is now in JIRA.

Best

Julien

On 18 April 2016 at 23:25, Davíð Steinn Geirsson <da...@dsg.is> wrote:

> Hi Julien,
>
> Julien Nioche <li...@gmail.com> wrote:
> > Hi David
> >
> > > the resulting file contains no matching request records, or even a
> > > warcinfo record for that matter.
> >
> > It wouldn't be too difficult to add at least the request records to
> > WARCExporter - please open a JIRA; contributions are welcome as always.
>
> Thanks for the info, I'll open a ticket. I'm not familiar enough
> with Java to take a crack at that, unfortunately.
>
> I did manage to fix the response record output of the
> CommonCrawlDataDumper, since it was only a tiny change. But given
> this bug, I'm leery of trusting its WARC output and I think I'll
> need to find some good WARC test suite to run it through. If I
> do, I'll submit a patch.
>
> > > I'm willing to move to nutch v2.x if it makes a difference.
> >
> > 2.x has neither of these tools, so you're better off staying on 1.x.
>
> Good to know, thanks.
>
> Best regards,
> Davíð



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>

Re: Nutch WARC export problems

Posted by Davíð Steinn Geirsson <da...@dsg.is>.
Hi Julien,

Julien Nioche <li...@gmail.com> wrote:
> Hi David
>
> > the resulting file contains no matching request records, or even a
> > warcinfo record for that matter.
>
> It wouldn't be too difficult to add at least the request records to
> WARCExporter - please open a JIRA; contributions are welcome as always.

Thanks for the info, I'll open a ticket. I'm not familiar enough
with Java to take a crack at that, unfortunately.

I did manage to fix the response record output of the
CommonCrawlDataDumper, since it was only a tiny change. But given
this bug, I'm leery of trusting its WARC output and I think I'll
need to find some good WARC test suite to run it through. If I
do, I'll submit a patch.

> 
> > I'm willing to move to nutch v2.x if it makes a difference.
>
> 2.x has neither of these tools, so you're better off staying on 1.x.

Good to know, thanks.

Best regards,
Davíð




Re: Nutch WARC export problems

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Julien, hi David,

we could also try to merge the WARC generation code
of both tools, so that we do not have to apply fixes
twice (now and in the future). The substantial
difference, namely that
 commoncrawldump - runs only locally, while
 warc - is scalable via Hadoop,
is not as easy to reconcile. But sharing the
representation / generation of a WARC document
(or request - response pair) should be doable.
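As a rough illustration of what such a shared piece could look like, here
is a Python sketch (not Nutch code, and not a proposal for the actual API)
of a single serializer that both a local dumper and a distributed job could
call. All function names are hypothetical; the header names follow the
WARC 1.0 format, and the request record is linked to its response through
WARC-Concurrent-To.

```python
import uuid


def new_record_id():
    """Return a fresh WARC-Record-ID value."""
    return "<urn:uuid:%s>" % uuid.uuid4()


def build_record(warc_type, target_uri, date, block, record_id,
                 concurrent_to=None):
    """Serialize one WARC/1.0 record; `block` is the raw HTTP message."""
    headers = [
        ("WARC-Type", warc_type),
        ("WARC-Record-ID", record_id),
        ("WARC-Date", date),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=%s" % warc_type),
        ("Content-Length", str(len(block))),
    ]
    if concurrent_to:
        headers.append(("WARC-Concurrent-To", concurrent_to))
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % h for h in headers) + "\r\n"
    # Every record ends with two CRLFs after its content block.
    return head.encode("utf-8") + block + b"\r\n\r\n"


def build_pair(target_uri, date, request, response):
    """Return a response record followed by its linked request record."""
    response_id = new_record_id()
    return (
        build_record("response", target_uri, date, response, response_id)
        + build_record("request", target_uri, date, request,
                       new_record_id(), concurrent_to=response_id)
    )


# Hypothetical fetch data, for illustration only.
pair = build_pair(
    "http://example.com/",
    "2016-04-14T16:51:04Z",
    b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n",
    b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok",
)
print(pair.decode("utf-8"))
```

With the serialization in one place, a fix such as the header separator
issue would only have to be made once, whichever tool drives the export.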

Cheers,
Sebastian

On 04/14/2016 09:50 PM, Julien Nioche wrote:
> Hi David
>
>> the resulting file contains no matching request records, or even a
>> warcinfo record for that matter.
>
> It wouldn't be too difficult to add at least the request records to
> WARCExporter - please open a JIRA; contributions are welcome as always.
>
>> I'm willing to move to nutch v2.x if it makes a difference.
>
> 2.x has neither of these tools, so you're better off staying on 1.x.
>
> Julien


Re: Nutch WARC export problems

Posted by Julien Nioche <li...@gmail.com>.
Hi David

> the resulting file contains no matching request records, or even a
> warcinfo record for that matter.

It wouldn't be too difficult to add at least the request records to
WARCExporter - please open a JIRA; contributions are welcome as always.

> I'm willing to move to nutch v2.x if it makes a difference.

2.x has neither of these tools, so you're better off staying on 1.x.

Julien





