Posted to dev@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2014/06/18 18:07:49 UTC

Fixing Nutch 2.x Build on Jenkins

Hi Folks,
At some point a while ago we broke the 2.x build!
I've described this in NUTCH-1792
<https://issues.apache.org/jira/browse/NUTCH-1792>.
Below is the commit log, which somewhere includes the commit that broke the
build.
Does anyone have a clue why the TestImageMetadata test for parse-tika is
failing?
Thanks
Lewis

------------------------------------------------------------------------
r1601937 | jnioche | 2014-06-11 11:56:20 -0400 (Wed, 11 Jun 2014) | 1 line

NUTCH-1736 <https://issues.apache.org/jira/browse/NUTCH-1736> Can't fetch
page if http response header contains Transfer-Encoding:chunked
------------------------------------------------------------------------
r1600837 | markus | 2014-06-06 06:01:51 -0400 (Fri, 06 Jun 2014) | 2 lines

NUTCH-1782 <https://issues.apache.org/jira/browse/NUTCH-1782> NodeWalker to
return current node

------------------------------------------------------------------------
r1600599 | jnioche | 2014-06-05 07:09:42 -0400 (Thu, 05 Jun 2014) | 1 line

Fixing blunder in Nutch-1781
------------------------------------------------------------------------
r1600561 | lewismc | 2014-06-04 23:00:10 -0400 (Wed, 04 Jun 2014) | 1 line

NUTCH-1788 <https://issues.apache.org/jira/browse/NUTCH-1788> Tika may
return multiple values for Title on PDF's
------------------------------------------------------------------------
r1600559 | lewismc | 2014-06-04 22:17:14 -0400 (Wed, 04 Jun 2014) | 1 line

Temporary disable TestGoraStore due to GORA-326
<https://issues.apache.org/jira/browse/GORA-326> Removal of _g_dirty field
from _ALL_FIELDS array and Field Enum in Persistent classes
------------------------------------------------------------------------
r1600546 | lewismc | 2014-06-04 20:18:02 -0400 (Wed, 04 Jun 2014) | 1 line

NUTCH-1781 <https://issues.apache.org/jira/browse/NUTCH-1781> Update
gora-*-mapping.xml and gora.proeprties to reflect Gora 0.4
------------------------------------------------------------------------
r1598622 | jnioche | 2014-05-30 10:55:51 -0400 (Fri, 30 May 2014) | 1 line

NUTCH-1768 <https://issues.apache.org/jira/browse/NUTCH-1768> Upgrade to
ElasticSearch 1.1.0
------------------------------------------------------------------------
r1598619 | jnioche | 2014-05-30 10:50:45 -0400 (Fri, 30 May 2014) | 1 line

NUTCH-1634 <https://issues.apache.org/jira/browse/NUTCH-1634> : readdb
-stats shows the result twice
------------------------------------------------------------------------
r1595398 | lewismc | 2014-05-16 20:38:18 -0400 (Fri, 16 May 2014) | 1 line

NUTCH-1780 <https://issues.apache.org/jira/browse/NUTCH-1780> ttl and
gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file
------------------------------------------------------------------------
r1595196 | jnioche | 2014-05-16 09:40:21 -0400 (Fri, 16 May 2014) | 1 line

NUTCH-1676 <https://issues.apache.org/jira/browse/NUTCH-1676> Add
rudimentary SSL support to protocol-http
------------------------------------------------------------------------
r1594813 | jnioche | 2014-05-15 04:14:38 -0400 (Thu, 15 May 2014) | 1 line

NUTCH-1674 <https://issues.apache.org/jira/browse/NUTCH-1674> Use batchId
filter to enable scan (GORA-119
<https://issues.apache.org/jira/browse/GORA-119>) for
Fetch,Parse,Update,Index (Tien Nguyen Manh and Alparslan Avcı via jnioche)
------------------------------------------------------------------------
r1594812 | jnioche | 2014-05-15 04:10:07 -0400 (Thu, 15 May 2014) | 1 line

NUTCH-1714 <https://issues.apache.org/jira/browse/NUTCH-1714> Nutch 2.x
upgrade to Gora 0.4
------------------------------------------------------------------------
r1594071 | snagel | 2014-05-12 15:39:43 -0400 (Mon, 12 May 2014) | 1 line

NUTCH-1752 <https://issues.apache.org/jira/browse/NUTCH-1752> Cache
robots.txt rules per protocol:host:port
------------------------------------------------------------------------
r1593954 | jnioche | 2014-05-12 08:58:41 -0400 (Mon, 12 May 2014) | 1 line

NUTCH-1613 <https://issues.apache.org/jira/browse/NUTCH-1613> Timeouts in
protocol-httpclient when crawling same host with >2 threads
------------------------------------------------------------------------
r1592414 | snagel | 2014-05-04 16:18:50 -0400 (Sun, 04 May 2014) | 1 line

NUTCH-1182 <https://issues.apache.org/jira/browse/NUTCH-1182> fetcher to
log hung threads
------------------------------------------------------------------------


-- 
*Lewis*

Re: Fixing Nutch 2.x Build on Jenkins

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Lewis,

It seems to be related to NUTCH-1714:
WebPage-owned maps (metadata, headers, etc.) are no longer
initialized in the constructor.
This also causes other tests to fail.

The solution would be to replace
  WebPage page = new WebPage();
with
  WebPage page = WebPage.newBuilder().build();
in every test where a WebPage object is needed.
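
To illustrate the difference, here is a minimal, self-contained sketch. It
does not use the real WebPage class: Page and its Builder are hypothetical
stand-ins, written on the assumption that the generated builder fills in
default (empty) maps while the plain constructor leaves them null, as
described above.

```java
import java.util.HashMap;
import java.util.Map;

public class WebPageInitSketch {

    /** Hypothetical stand-in for a generated persistent bean; NOT the real WebPage. */
    static class Page {
        // Left null by the plain constructor (the behaviour assumed in this sketch).
        private Map<String, String> metadata;

        Page() { }

        Map<String, String> getMetadata() {
            return metadata;
        }

        static Builder newBuilder() {
            return new Builder();
        }

        /** Hypothetical builder that applies defaults before handing out the object. */
        static class Builder {
            Page build() {
                Page page = new Page();
                page.metadata = new HashMap<>(); // default: empty map instead of null
                return page;
            }
        }
    }

    public static void main(String[] args) {
        Page viaConstructor = new Page();
        Page viaBuilder = Page.newBuilder().build();

        // The constructor-created object has no map, so any put() on it would fail.
        System.out.println("constructor: metadata is null? "
                + (viaConstructor.getMetadata() == null));

        // The builder-created object can be used right away.
        viaBuilder.getMetadata().put("key", "value");
        System.out.println("builder: metadata size = "
                + viaBuilder.getMetadata().size());
    }
}
```

If the real class behaves like this stand-in, a test that calls something
like page.getMetadata().put(...) on a constructor-created object would fail
with a NullPointerException, which would be consistent with the failures
described above.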

Right?

I'll open a Jira and try to provide a patch.

Cheers,
Sebastian


