Posted to user@nutch.apache.org by Jigal van Hemert | alterNET internet BV <ji...@alternet.nl> on 2016/09/13 07:24:30 UTC

404 removal not working and title mysteriously appearing in content

Hi,

The daily indexing seems to be working so far (the field "indexed" is updated),
but pages that return a 404 are not removed from the Solr index. The
content they return is not included in the index either. They just seem to be
ignored.
At first db.update.purge.404 was set to true, but after reading a bit
more about that setting it seemed to me that it would remove the pages
from the Nutch db, essentially leaving them alone without updating the Solr
index. So I changed it to false, hoping that they would now be removed from
the index. Alas, nothing changed.

Another issue is that the contents of the title tag appear at the beginning of
the "content" field, before the actual page contents. This looks a bit
silly, so I searched for a place where it might be configured. I found nothing
in schema.xml, schema-solr4.xml or solrindex-mapping.xml.
Maybe I've overlooked something, but I couldn't find any setting that might
explain this.
Is there a way to remove the title tag contents from the "content" field?

-- 

Kind regards,

Jigal van Hemert | Developer

Langesteijn 124
3342LG Hendrik-Ido-Ambacht

T. +31 (0)78 635 1200
F. +31 (0)848 34 9697
KvK. 23 09 28 65

jigal@alternet.nl
www.alternet.nl

Disclaimer:
This message (including any attachments) may contain confidential
information. If you are not the intended recipient of this message, please
contact the sender immediately by e-mail or telephone and delete this
message from your system. It is not permitted to share the contents of this
message with third parties in any way, or to otherwise make them public,
without written permission from alterNET Internet BV. You are advised to
always scan attachments for viruses. AlterNET cannot be held responsible in
any way for damage suffered as a result of viruses.

All prices mentioned are subject to errors and omissions, excl. 21% VAT,
excl. travel expenses. All our quotations, offers, agreements and services
are governed, to the exclusion of all other terms, by the General Terms and
Conditions of alterNET Internet B.V. Our supplementary hosting terms also
apply to all our domain registrations and hosting activities. This message
is intended solely for the addressee. No rights can be derived from this
message.

! Before printing this email, consider whether it is really necessary !

Re: 404 removal not working and title mysteriously appearing in content

Posted by Jigal van Hemert | alterNET internet BV <ji...@alternet.nl>.

Hi,

2016-09-14 16:27 GMT+02:00 Jigal van Hemert | alterNET internet BV <jigal@alternet.nl>:

> 2016-09-13 04:41:36,541 INFO  indexer.CleaningJob - CleaningJob: deleted a total of 2 documents
> 2016-09-13 04:41:36,545 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
> 2016-09-13 04:41:37,313 INFO  indexer.CleaningJob - CleaningJob: finished at 2016-09-13 04:41:37, elapsed: 00:00:06
>
> It claims to have deleted 2 documents, but there are plenty of 404 pages
> still in the index.
>
> I think it's quite an old version of Nutch. There is a
> lib/apache-nutch-1.8.jar file :-)
>
As a workaround I now simply remove all documents that were indexed before
today (all pages are crawled and updated daily) by calling the update
handler with a delete query. This is not how it is supposed to work, though,
or is it?
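
For illustration, the workaround described above could be sketched roughly as
follows in Python. This is only a sketch under assumptions: it assumes that
"indexed" is a Solr date field holding the time of the last (re)index, that
the core accepts XML update messages at /update, and that the Solr URL is the
placeholder used elsewhere in this thread. The function names are
hypothetical.

```python
import urllib.request
from xml.sax.saxutils import escape


def build_delete_request(solr_core_url, date_field="indexed"):
    """Build a delete-by-query request for documents indexed before today.

    Assumption: `date_field` is a Solr date field that is refreshed every
    time a page is (re)indexed.
    """
    # NOW/DAY is Solr date math for the start of the current day (UTC).
    # "[* TO NOW/DAY}" matches every value strictly before that instant.
    query = f"{date_field}:[* TO NOW/DAY}}"
    body = f"<delete><query>{escape(query)}</query></delete>".encode("utf-8")
    return urllib.request.Request(
        url=solr_core_url.rstrip("/") + "/update?commit=true",
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def delete_stale_documents(solr_core_url):
    """Send the delete request and return the HTTP status code."""
    request = build_delete_request(solr_core_url)
    with urllib.request.urlopen(request) as response:
        return response.status
```

A call such as delete_stale_documents("http://solr_server.tld/solr/projectname")
would only make sense after the daily crawl has finished. Note that this
removes every document that was not refreshed today, so a page that merely
failed to fetch once would disappear from the index as well.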


Re: 404 removal not working and title mysteriously appearing in content

Posted by Jigal van Hemert | alterNET internet BV <ji...@alternet.nl>.

Hi Sebastian,

On 14 September 2016 at 15:20, Sebastian Nagel <wa...@googlemail.com>
wrote:

> This should have the same effect as indexing with -deleteGone.
> If you are using Nutch 1.12, also have a look at this bug, which
> could be the reason for your problem:
>   https://issues.apache.org/jira/browse/NUTCH-2269
> Do you see similar errors in the logs?
>
>
2016-09-13 04:38:04,391 INFO  solr.SolrIndexWriter - Indexing 177 documents
2016-09-13 04:38:41,017 INFO  solr.SolrMappingReader - source: appKey dest: appKey
2016-09-13 04:38:41,030 INFO  solr.SolrMappingReader - source: access dest: access
2016-09-13 04:38:41,030 INFO  solr.SolrMappingReader - source: content dest: content
2016-09-13 04:38:41,030 INFO  solr.SolrMappingReader - source: endtime dest: endtime
2016-09-13 04:38:41,030 INFO  solr.SolrMappingReader - source: keywords dest: keywords
2016-09-13 04:38:41,030 INFO  solr.SolrMappingReader - source: site dest: site
2016-09-13 04:38:41,030 INFO  solr.SolrMappingReader - source: title dest: title
2016-09-13 04:38:41,031 INFO  solr.SolrMappingReader - source: tstamp dest: changed
2016-09-13 04:38:41,031 INFO  solr.SolrMappingReader - source: tstamp dest: created
2016-09-13 04:38:41,031 INFO  solr.SolrMappingReader - source: siteHash dest: siteHash
2016-09-13 04:38:41,031 INFO  solr.SolrMappingReader - source: uid dest: uid
2016-09-13 04:38:41,031 INFO  solr.SolrMappingReader - source: type dest: type
2016-09-13 04:38:41,031 INFO  solr.SolrMappingReader - source: site dest: nutchSite_stringS
2016-09-13 04:38:41,031 INFO  solr.SolrMappingReader - source: host dest: nutchHost_stringS
2016-09-13 04:41:22,120 INFO  indexer.IndexingJob - Indexer: finished at 2016-09-13 04:41:22, elapsed: 00:03:34
2016-09-13 04:41:30,489 INFO  indexer.CleaningJob - CleaningJob: starting at 2016-09-13 04:41:30
2016-09-13 04:41:32,047 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-09-13 04:41:35,680 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-09-13 04:41:35,759 INFO  solr.SolrMappingReader - source: appKey dest: appKey
2016-09-13 04:41:35,759 INFO  solr.SolrMappingReader - source: access dest: access
2016-09-13 04:41:35,759 INFO  solr.SolrMappingReader - source: content dest: content
2016-09-13 04:41:35,759 INFO  solr.SolrMappingReader - source: endtime dest: endtime
2016-09-13 04:41:35,759 INFO  solr.SolrMappingReader - source: keywords dest: keywords
2016-09-13 04:41:35,759 INFO  solr.SolrMappingReader - source: site dest: site
2016-09-13 04:41:35,759 INFO  solr.SolrMappingReader - source: title dest: title
2016-09-13 04:41:35,760 INFO  solr.SolrMappingReader - source: tstamp dest: changed
2016-09-13 04:41:35,760 INFO  solr.SolrMappingReader - source: tstamp dest: created
2016-09-13 04:41:35,760 INFO  solr.SolrMappingReader - source: siteHash dest: siteHash
2016-09-13 04:41:35,760 INFO  solr.SolrMappingReader - source: uid dest: uid
2016-09-13 04:41:35,760 INFO  solr.SolrMappingReader - source: type dest: type
2016-09-13 04:41:35,760 INFO  solr.SolrMappingReader - source: site dest: nutchSite_stringS
2016-09-13 04:41:35,760 INFO  solr.SolrMappingReader - source: host dest: nutchHost_stringS
2016-09-13 04:41:36,541 INFO  indexer.CleaningJob - CleaningJob: deleted a total of 2 documents
2016-09-13 04:41:36,545 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2016-09-13 04:41:37,313 INFO  indexer.CleaningJob - CleaningJob: finished at 2016-09-13 04:41:37, elapsed: 00:00:06
2016-09-13 04:41:38,857 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

It claims to have deleted 2 documents, but there are plenty of 404 pages
still in the index.

I think it's quite an old version of Nutch. There is a
lib/apache-nutch-1.8.jar file :-)


Re: 404 removal not working and title mysteriously appearing in content

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Jigal,

>> are you indexing with
>>   bin/nutch index ... -deleteGone
>>
>
> No, I'm using:
>
> bin/crawl urls/[projectname] crawls/[projectname]
> http://solr_server.tld/solr/[projectname] 2

OK, understood. In bin/crawl, deletion of 404s is done
by first calling
  bin/nutch index ...
and then
  bin/nutch clean ...

This should have the same effect as indexing with -deleteGone.
If you are using Nutch 1.12, also have a look at this bug, which
could be the reason for your problem:
  https://issues.apache.org/jira/browse/NUTCH-2269
Do you see similar errors in the logs?
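
To make the two variants easier to compare, here is a rough Python sketch
that only assembles the command lines and does not run anything. The exact
argument layout (the -D solr.server.url=... property, the crawldb, linkdb and
segments paths, and the segment name) is an assumption and not taken from
this thread; please check it against the usage text that bin/nutch index and
bin/nutch clean print on your installation. The function name is
hypothetical.

```python
def index_and_clean_commands(crawl_dir, segment, solr_url, delete_gone=False):
    """Return the argument lists for the index step and the clean step.

    Assumption: the paths follow a <crawl_dir>/crawldb, <crawl_dir>/linkdb,
    <crawl_dir>/segments/<segment> layout and the Solr URL is passed as a
    -D property.
    """
    solr_opt = f"solr.server.url={solr_url}"
    crawldb = f"{crawl_dir}/crawldb"
    index_cmd = [
        "bin/nutch", "index", "-D", solr_opt,
        crawldb, "-linkdb", f"{crawl_dir}/linkdb",
        f"{crawl_dir}/segments/{segment}",
    ]
    if delete_gone:
        # Variant 1: let the index step itself send deletes for gone pages.
        index_cmd.append("-deleteGone")
    # Variant 2: a separate clean step that works from the CrawlDb.
    clean_cmd = ["bin/nutch", "clean", "-D", solr_opt, crawldb]
    return index_cmd, clean_cmd


if __name__ == "__main__":
    index_cmd, clean_cmd = index_and_clean_commands(
        "crawls/projectname", "20160913043000",  # hypothetical segment name
        "http://solr_server.tld/solr/projectname",
    )
    print(" ".join(index_cmd))
    print(" ".join(clean_cmd))
```

Printing the commands first and running them by hand makes it easier to see
which of the two steps reports deletions in the log.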

>>> Another issue is that the title tag contents appear at the beginning of
>>> the "content" field before the actual page contents.
>>
> Good to know that I didn't miss a setting :-)
> Unfortunately I have zero knowledge about Java coding (I'm a PHP guy who
> spends a lot of free time on the FOSS project TYPO3).
>
> For the time being I can report back that it's hardcoded and that it can't
> be configured. Thanks for that information (really; no sarcasm)!
>

OK, I hope to get it addressed soon.

Cheers,
Sebastian

On 09/14/2016 09:51 AM, Jigal van Hemert | alterNET internet BV wrote:
> Hi Sebastian,
> 
> Thanks for the reply.
> 
> On 13 September 2016 at 17:14, Sebastian Nagel <wa...@googlemail.com>
> wrote:
> 
>> are you indexing with
>>   bin/nutch index ... -deleteGone
>>
> 
> No, I'm using:
> 
> bin/crawl urls/[projectname] crawls/[projectname]
> http://solr_server.tld/solr/[projectname] 2
> 
> 
>> Purging 404s from CrawlDb should be done only from time to time
>> to keep the CrawlDb small. Normally, 404s are kept on record so
>> that they are not refetched frequently.
>>
> 
> I'm not too concerned about 404s in CrawlDb, but about the fact that they
> are not removed from the solr index.
> It's only a few hundred URLs that need to be indexed and even if it were
> thousands of 404 items it would not be a problem for a looooong time :-)
> 
> 
>>
>>> Another issue is that the title tag contents appear at the beginning of
>>> the "content" field before the actual page contents.
>>
>> Yes, this is the case. In general, it's not wrong if "content" is a pure
>> search field and not used as a display field. It's a known feature request
>> [1],
>> so let's implement it now as a configurable option. If you have time
>> to work on it, that's fine. If not, I could get it done within the next
>> few days.
>>
> 
> Good to know that I didn't miss a setting :-)
> Unfortunately I have zero knowledge about Java coding (I'm a PHP guy who
> spends a lot of free time on the FOSS project TYPO3).
> 
> For the time being I can report back that it's hardcoded and that it can't
> be configured. Thanks for that information (really; no sarcasm)!
> 
>

Re: 404 removal not working and title mysteriously appearing in content

Posted by Jigal van Hemert | alterNET internet BV <ji...@alternet.nl>.

Hi Sebastian,

Thanks for the reply.

On 13 September 2016 at 17:14, Sebastian Nagel <wa...@googlemail.com>
wrote:

> are you indexing with
>   bin/nutch index ... -deleteGone
>

No, I'm using:

bin/crawl urls/[projectname] crawls/[projectname]
http://solr_server.tld/solr/[projectname] 2

> Purging 404s from CrawlDb should be done only from time to time
> to keep the CrawlDb small. Normally, 404s are kept on record so
> that they are not refetched frequently.
>

I'm not too concerned about 404s in CrawlDb, but about the fact that they
are not removed from the solr index.
It's only a few hundred URLs that need to be indexed and even if it were
thousands of 404 items it would not be a problem for a looooong time :-)

>
> > Another issue is that the title tag contents appear at the beginning of
> > the "content" field before the actual page contents.
>
> Yes, this is the case. In general, it's not wrong if "content" is a pure
> search field and not used as a display field. It's a known feature request
> [1],
> so let's implement it now as a configurable option. If you have time
> to work on it, that's fine. If not, I could get it done within the next
> few days.
>

Good to know that I didn't miss a setting :-)
Unfortunately I have zero knowledge about Java coding (I'm a PHP guy who
spends a lot of free time on the FOSS project TYPO3).

For the time being I can report back that it's hardcoded and that it can't
be configured. Thanks for that information (really; no sarcasm)!


Re: 404 removal not working and title mysteriously appearing in content

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Jigal,

Are you indexing with
  bin/nutch index ... -deleteGone
?

Purging 404s from CrawlDb should be done only from time to time
to keep the CrawlDb small. Normally, 404s are kept on record so
that they are not refetched frequently.

> Another issue is that the title tag contents appear at the beginning of
> the "content" field before the actual page contents.

Yes, this is the case. In general, it's not wrong if "content" is a pure
search field and not used as a display field. It's a known feature request [1],
so let's implement it now as a configurable option. If you have time
to work on it, that's fine. If not, I could get it done within the next few days.

Best,
Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-1749
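
Until such an option exists, and if "content" is also used for display, one
possible stopgap is to strip the leading title on the display side. The
Python sketch below is only an illustration: the helper name is hypothetical,
and it assumes that the title is followed by whitespace (or nothing) at the
start of "content", which has not been verified against what Nutch actually
writes.

```python
def strip_leading_title(title, content):
    """Return `content` without a leading copy of `title`.

    Assumption: the indexed content starts with the title text followed by
    whitespace. If that is not the case, the content is returned unchanged.
    """
    title = title.strip()
    stripped = content.lstrip()
    if title and stripped.startswith(title):
        rest = stripped[len(title):]
        # Only strip when the title is a whole leading token, so that a
        # title "Home" does not cut into a content starting with "Homepage".
        if not rest or rest[0].isspace():
            return rest.lstrip()
    return content
```

This leaves the index itself untouched, so searching on "content" still
matches words from the title.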

