Posted to user@nutch.apache.org by "Hayles, Steven" <sh...@leicester.ac.uk> on 2015/07/03 11:37:16 UTC

Gone content not reported to Solr

I'm using bin/crawl on Nutch 1.9 (with Solr 4.10.3)

What I see is that "bin/nutch update" sets db_gone status correctly, but "bin/nutch dedup" removes the records entirely before "bin/nutch index" can tell Solr to remove them from its index.

Is dedup doing more than it should, is the ordering of dedup and index wrong, or is there some configuration that I have wrong?
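
For reference, each bin/crawl iteration runs the relevant steps roughly in this order (a trimmed sketch, not the full script; variable names as used in bin/crawl):

   bin/nutch updatedb "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments/$SEGMENT
   bin/nutch invertlinks "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
   bin/nutch dedup "$CRAWL_PATH"/crawldb
   bin/nutch index -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
   bin/nutch clean -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb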

Thanks

Steven Hayles
Systems Analyst

IT Services, University of Leicester,
Prospect House, 94 Regent Rd, Leicester, LE1 7DA, UK

T: +44 (0)116 229 7950
E: sh23@le.ac.uk<ma...@le.ac.uk>

The Queen's Anniversary Prizes 1994, 2002 & 2013
THE Awards Winners 2007-2013

Elite without being elitist

Follow us on Twitter http://twitter.com/uniofleicester or
visit our Facebook page https://facebook.com/UniofLeicester


Re: Gone content not reported to Solr

Posted by Sebastian Nagel <wa...@googlemail.com>.
> If db.update.purge.404 is not set, would records with status DB_GONE stay forever, and Solr be
> repeatedly told to remove them?

Yes. But since only the URL is sent as the document ID for a deletion,
the impact is small unless there are many gone documents.

There is no way to avoid this because Nutch does not store the
status in the index. Maybe it would be better to purge the crawlDb
only from time to time via

 bin/nutch updatedb -Ddb.update.purge.404=true ...
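
For illustration, an occasional maintenance run could look like this (the crawldb path and segment selection below are placeholders, not taken from bin/crawl):

   # hypothetical layout: crawl/crawldb and crawl/segments/*
   CRAWL_PATH=crawl
   SEGMENT=$(ls -d "$CRAWL_PATH"/segments/* | sort | tail -1)
   # re-run updatedb over the latest segment with purging switched on
   bin/nutch updatedb -Ddb.update.purge.404=true "$CRAWL_PATH"/crawldb "$SEGMENT"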

Keeping gone documents also avoids dead links being re-fetched
every time they are seen again.

But, no question, there should be at least a statement in the documentation
about the problem regarding db.update.purge.404.

Thanks for clarification!

On 07/22/2015 01:15 PM, Steven Hayles wrote:
> 
> Hi Sebastian
> 
> Thanks for the explanation.
> 
> If db.update.purge.404 is not set, would records with status DB_GONE stay forever, and Solr be
> repeatedly told to remove them?
> 
> Steven Hayles
> Systems Analyst
> 
> IT Services, University of Leicester,
> Prospect House, 94 Regent Rd, Leicester, LE1 7DA, UK
> 
> T: +44 (0)116 229 7950
> E: sh23@le.ac.uk
> 
> The Queen's Anniversary Prizes 1994, 2002 & 2013
> THE Awards Winners 2007-2013
> 
> Elite without being elitist
> 
> Follow us on Twitter http://twitter.com/uniofleicester or
> visit our Facebook page https://facebook.com/UniofLeicester
> 
> 
> On Tue, 14 Jul 2015, Sebastian Nagel wrote:
> 
>> Hi Steven,
>>
>> thanks for reporting the issue.
>>
>> I tried to reproduce the problem without success.
>> While looking back to the conversation I found that this property could be
>> the reason:
>>
>> <property>
>>  <name>db.update.purge.404</name>
>>  <value>true</value>
>>  <description>If true, updatedb will add purge records with status DB_GONE
>>  from the CrawlDB.
>>  </description>
>> </property>
>>
>> The dedup job shares some code with the update job, namely the
>> CrawlDbFilter as mapper
>> which will filter away all db_gone records if db.update.purge.404 is true.
>> That's not really wrong (the next update job would remove the gone pages
>> anyway) but
>> should be clearly documented.
>>
>> Thanks again,
>> Sebastian
>>
>> 2015-07-07 10:30 GMT+02:00 Steven Hayles <sh...@leicester.ac.uk>:
>>
>>>
>>> Created https://issues.apache.org/jira/browse/NUTCH-2060
>>>
>>> In fact, "bin/crawl" uses "bin/nutch clean" rather than the -deleteGone
>>> option on "bin/nutch index".
>>>
>>> As a workaround, I've added "bin/nutch clean" before "bin/nutch dedup"
>>>
>>> Steven Hayles
>>> Systems Analyst
>>>
>>> IT Services, University of Leicester,
>>> Prospect House, 94 Regent Rd, Leicester, LE1 7DA, UK
>>>
>>> T: +44 (0)116 229 7950
>>> E: sh23@le.ac.uk
>>>
>>> The Queen's Anniversary Prizes 1994, 2002 & 2013
>>> THE Awards Winners 2007-2013
>>>
>>> Elite without being elitist
>>>
>>> Follow us on Twitter http://twitter.com/uniofleicester or
>>> visit our Facebook page https://facebook.com/UniofLeicester
>>>
>>>
>>> On Mon, 6 Jul 2015, Sebastian Nagel wrote:
>>>
>>>  Hi Steven,
>>>>
>>>>  After dedup, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is no
>>>>>
>>>> longer present
>>>> That's a bug. It should be there, no question.  Could you, please, open a
>>>> Jira issue [1]
>>>>
>>>> The index command needs the option
>>>>  -deleteGone
>>>> to send deletions to Solr. But if the db_gone pages disappeared that has
>>>> no
>>>> effect,
>>>> of course :)
>>>>
>>>> Thanks,
>>>> Sebastian
>>>>
>>>>
>>>> 2015-07-06 10:07 GMT+02:00 Steven Hayles <sh...@leicester.ac.uk>:
>>>>
>>>>
>>>>> Hi Sebastian
>>>>>
>>>>> Thanks for your reply.
>>>>>
>>>>> I was using readdb to see what was happening. It looked like this
>>>>>
>>>>> Two pages indexed:
>>>>>
>>>>>   https://www2.test.le.ac.uk/sh23 Version: 7
>>>>>   Status: 2 (db_fetched)
>>>>>   --
>>>>>   https://www2.test.le.ac.uk/sh23/sleepy-zebra-page       Version: 7
>>>>>   Status: 2 (db_fetched)
>>>>>
>>>>> Deleted https://www2.test.le.ac.uk/sh23/sleepy-zebra-page
>>>>>
>>>>> After update, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is
>>>>> marked
>>>>> as db_gone, as expected:
>>>>>
>>>>>   https://www2.test.le.ac.uk/sh23 Version: 7
>>>>>   Status: 2 (db_fetched)
>>>>>   --
>>>>>   https://www2.test.le.ac.uk/sh23/sleepy-zebra-page       Version: 7
>>>>>   Status: 3 (db_gone)
>>>>>
>>>>> (After invert links, there was no change)
>>>>>
>>>>> After dedup, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is no
>>>>> longer present
>>>>>
>>>>>   https://www2.test.le.ac.uk/sh23 Version: 7
>>>>>   Status: 2 (db_fetched)
>>>>>
>>>>> Neither index nor clean clear 404s from Solr.
>>>>>
>>>>>
>>>>> I'm just using the commands as given in bin/crawl from Nutch 1.9:
>>>>>
>>>>>   $bin/nutch dedup $CRAWL_PATH/crawldb
>>>>>
>>>>>   "$bin/nutch" index -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb
>>>>> -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
>>>>>
>>>>>
>>>>> When I added an extra clean before dedup, Solr got the instruction to
>>>>> remove the deleted document.
>>>>>
>>>>> There's nothing much in nutch-site.xml. It's mostly limits to make
>>>>> testing
>>>>> easier, static field added, metadata processing removed,
>>>>> db.update.purge.404 enabled.
>>>>>
>>>>> <?xml version="1.0"?>
>>>>> <configuration>
>>>>>  <property>
>>>>>   <name>http.agent.name</name>
>>>>>   <value>nutch-solr-integration</value>
>>>>>  </property>
>>>>>  <property>
>>>>>   <name>generate.max.per.host</name>
>>>>>   <value>100</value>
>>>>>  </property>
>>>>>  <property>
>>>>>   <name>plugin.includes</name>
>>>>>
>>>>>
>>>>> <value>protocol-httpclient|urlfilter-regex|index-(basic|more|static)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|static)</value>
>>>>>
>>>>>  </property>
>>>>>  <property>
>>>>>    <name>index.static</name>
>>>>>    <value>_indexname:sitecore_web_index,_created_by_nutch:true</value>
>>>>>    <description>
>>>>>     Used by plugin index-static to adds fields with static data at
>>>>> indexing time.
>>>>>    You can specify a comma-separated list of fieldname:fieldcontent per
>>>>> Nutch job.
>>>>>   Each fieldcontent can have multiple values separated by space, e.g.,
>>>>>    field1:value1.1 value1.2 value1.3,field2:value2.1 value2.2 ...
>>>>>    It can be useful when collections can't be created by URL patterns,
>>>>>   like in subcollection, but on a job-basis.
>>>>>   </description>
>>>>>  </property>
>>>>>  <property>
>>>>>   <name>http.timeout</name>
>>>>>   <value>5000</value>
>>>>>   <description>The default network timeout, in
>>>>> milliseconds.</description>
>>>>>  </property>
>>>>>  <property>
>>>>>   <name>fetcher.server.delay</name>
>>>>>   <value>0.1</value>
>>>>>   <description>The number of seconds the fetcher will delay between
>>>>>    successive requests to the same server. Note that this might get
>>>>>    overriden by a Crawl-Delay from a robots.txt and is used ONLY if
>>>>>    fetcher.threads.per.queue is set to 1.
>>>>>    </description>
>>>>>  </property>
>>>>>  <property>
>>>>>   <name>db.fetch.interval.default</name>
>>>>>   <value>60</value>
>>>>>   <description>The default number of seconds between re-fetches of a page
>>>>> (30 days).
>>>>>   </description>
>>>>>  </property>
>>>>>  <property>
>>>>>   <name>db.update.purge.404</name>
>>>>>   <value>true</value>
>>>>>   <description>If true, updatedb will add purge records with status
>>>>> DB_GONE
>>>>>   from the CrawlDB.
>>>>>   </description>
>>>>>  </property>
>>>>> </configuration>
>>>>>
>>>>> Steven
>>>>>
>>>>>
>>>>> On Sat, 4 Jul 2015, Sebastian Nagel wrote:
>>>>>
>>>>>  Hi Steven,
>>>>>
>>>>>>
>>>>>>  is the ordering of dedup and index wrong
>>>>>>
>>>>>>>
>>>>>>>  No, that's correct: it would be not really efficient to first index
>>>>>> duplicates
>>>>>> and then remove them afterwards.
>>>>>>
>>>>>> If I understand right the db_gone pages have previously been indexed
>>>>>> (and were successfully fetched), right?
>>>>>>
>>>>>>  but "bin/nutch dedup" removes the records entirely
>>>>>>
>>>>>>>
>>>>>>>  A dedup job should neither remove records entirely,
>>>>>> they are only set to status db_duplicate, nor should
>>>>>> it touch anything except db_fetched and db_notmodified.
>>>>>> If it does that's a bug.
>>>>>>
>>>>>> Can you send the exact commands of "nutch dedup" and "nutch index"?
>>>>>> Have you checked the crawldb before and after using "bin/nutch readdb"
>>>>>> to get some hints what's special with these urls or documents?
>>>>>>
>>>>>> Thanks,
>>>>>> Sebastian
>>>>>>
>>>>>>
>>>>>> On 07/03/2015 11:37 AM, Hayles, Steven wrote:
>>>>>>
>>>>>>  I'm using bin/crawl on Nutch 1.9 (with Solr 4.10.3)
>>>>>>>
>>>>>>> What I see is that "bin/nutch update" sets db_gone status correctly,
>>>>>>> but
>>>>>>> "bin/nutch dedup" removes the records entirely before "bin/nutch
>>>>>>> index" can
>>>>>>> tell Solr to remove them from its index.
>>>>>>>
>>>>>>> Is dedup doing more than it should, is the ordering of dedup and index
>>>>>>> wrong, or is there some configuration that I have wrong?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Steven Hayles
>>>>>>> Systems Analyst
>>>>>>>
>>>>>>> IT Services, University of Leicester,
>>>>>>> Prospect House, 94 Regent Rd, Leicester, LE1 7DA, UK
>>>>>>>
>>>>>>> T: +44 (0)116 229 7950
>>>>>>> E: sh23@le.ac.uk<ma...@le.ac.uk>
>>>>>>>
>>>>>>> The Queen's Anniversary Prizes 1994, 2002 & 2013
>>>>>>> THE Awards Winners 2007-2013
>>>>>>>
>>>>>>> Elite without being elitist
>>>>>>>
>>>>>>> Follow us on Twitter http://twitter.com/uniofleicester or
>>>>>>> visit our Facebook page https://facebook.com/UniofLeicester
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>


Re: Gone content not reported to Solr

Posted by Steven Hayles <sh...@leicester.ac.uk>.
Hi Sebastian

Thanks for the explanation.

If db.update.purge.404 is not set, would records with status DB_GONE stay 
forever, and Solr be repeatedly told to remove them?

Steven Hayles
Systems Analyst

IT Services, University of Leicester,
Prospect House, 94 Regent Rd, Leicester, LE1 7DA, UK

T: +44 (0)116 229 7950
E: sh23@le.ac.uk

The Queen's Anniversary Prizes 1994, 2002 & 2013
THE Awards Winners 2007-2013

Elite without being elitist

Follow us on Twitter http://twitter.com/uniofleicester or
visit our Facebook page https://facebook.com/UniofLeicester


On Tue, 14 Jul 2015, Sebastian Nagel wrote:

> Hi Steven,
>
> thanks for reporting the issue.
>
> I tried to reproduce the problem without success.
> While looking back to the conversation I found that this property could be
> the reason:
>
> <property>
>  <name>db.update.purge.404</name>
>  <value>true</value>
>  <description>If true, updatedb will add purge records with status DB_GONE
>  from the CrawlDB.
>  </description>
> </property>
>
> The dedup job shares some code with the update job, namely the
> CrawlDbFilter as mapper
> which will filter away all db_gone records if db.update.purge.404 is true.
> That's not really wrong (the next update job would remove the gone pages
> anyway) but
> should be clearly documented.
>
> Thanks again,
> Sebastian
>
> 2015-07-07 10:30 GMT+02:00 Steven Hayles <sh...@leicester.ac.uk>:
>
>>
>> Created https://issues.apache.org/jira/browse/NUTCH-2060
>>
>> In fact, "bin/crawl" uses "bin/nutch clean" rather than the -deleteGone
>> option on "bin/nutch index".
>>
>> As a workaround, I've added "bin/nutch clean" before "bin/nutch dedup"
>>
>> Steven Hayles
>> Systems Analyst
>>
>> IT Services, University of Leicester,
>> Prospect House, 94 Regent Rd, Leicester, LE1 7DA, UK
>>
>> T: +44 (0)116 229 7950
>> E: sh23@le.ac.uk
>>
>> The Queen's Anniversary Prizes 1994, 2002 & 2013
>> THE Awards Winners 2007-2013
>>
>> Elite without being elitist
>>
>> Follow us on Twitter http://twitter.com/uniofleicester or
>> visit our Facebook page https://facebook.com/UniofLeicester
>>
>>
>> On Mon, 6 Jul 2015, Sebastian Nagel wrote:
>>
>>  Hi Steven,
>>>
>>>  After dedup, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is no
>>>>
>>> longer present
>>> That's a bug. It should be there, no question.  Could you, please, open a
>>> Jira issue [1]
>>>
>>> The index command needs the option
>>>  -deleteGone
>>> to send deletions to Solr. But if the db_gone pages disappeared that has
>>> no
>>> effect,
>>> of course :)
>>>
>>> Thanks,
>>> Sebastian
>>>
>>>
>>> 2015-07-06 10:07 GMT+02:00 Steven Hayles <sh...@leicester.ac.uk>:
>>>
>>>
>>>> Hi Sebastian
>>>>
>>>> Thanks for your reply.
>>>>
>>>> I was using readdb to see what was happening. It looked like this
>>>>
>>>> Two pages indexed:
>>>>
>>>>   https://www2.test.le.ac.uk/sh23 Version: 7
>>>>   Status: 2 (db_fetched)
>>>>   --
>>>>   https://www2.test.le.ac.uk/sh23/sleepy-zebra-page       Version: 7
>>>>   Status: 2 (db_fetched)
>>>>
>>>> Deleted https://www2.test.le.ac.uk/sh23/sleepy-zebra-page
>>>>
>>>> After update, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is
>>>> marked
>>>> as db_gone, as expected:
>>>>
>>>>   https://www2.test.le.ac.uk/sh23 Version: 7
>>>>   Status: 2 (db_fetched)
>>>>   --
>>>>   https://www2.test.le.ac.uk/sh23/sleepy-zebra-page       Version: 7
>>>>   Status: 3 (db_gone)
>>>>
>>>> (After invert links, there was no change)
>>>>
>>>> After dedup, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is no
>>>> longer present
>>>>
>>>>   https://www2.test.le.ac.uk/sh23 Version: 7
>>>>   Status: 2 (db_fetched)
>>>>
>>>> Neither index nor clean clear 404s from Solr.
>>>>
>>>>
>>>> I'm just using the commands as given in bin/crawl from Nutch 1.9:
>>>>
>>>>   $bin/nutch dedup $CRAWL_PATH/crawldb
>>>>
>>>>   "$bin/nutch" index -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb
>>>> -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
>>>>
>>>>
>>>> When I added an extra clean before dedup, Solr got the instruction to
>>>> remove the deleted document.
>>>>
>>>> There's nothing much in nutch-site.xml. It's mostly limits to make
>>>> testing
>>>> easier, static field added, metadata processing removed,
>>>> db.update.purge.404 enabled.
>>>>
>>>> <?xml version="1.0"?>
>>>> <configuration>
>>>>  <property>
>>>>   <name>http.agent.name</name>
>>>>   <value>nutch-solr-integration</value>
>>>>  </property>
>>>>  <property>
>>>>   <name>generate.max.per.host</name>
>>>>   <value>100</value>
>>>>  </property>
>>>>  <property>
>>>>   <name>plugin.includes</name>
>>>>
>>>>
>>>> <value>protocol-httpclient|urlfilter-regex|index-(basic|more|static)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|static)</value>
>>>>  </property>
>>>>  <property>
>>>>    <name>index.static</name>
>>>>    <value>_indexname:sitecore_web_index,_created_by_nutch:true</value>
>>>>    <description>
>>>>     Used by plugin index-static to adds fields with static data at
>>>> indexing time.
>>>>    You can specify a comma-separated list of fieldname:fieldcontent per
>>>> Nutch job.
>>>>   Each fieldcontent can have multiple values separated by space, e.g.,
>>>>    field1:value1.1 value1.2 value1.3,field2:value2.1 value2.2 ...
>>>>    It can be useful when collections can't be created by URL patterns,
>>>>   like in subcollection, but on a job-basis.
>>>>   </description>
>>>>  </property>
>>>>  <property>
>>>>   <name>http.timeout</name>
>>>>   <value>5000</value>
>>>>   <description>The default network timeout, in
>>>> milliseconds.</description>
>>>>  </property>
>>>>  <property>
>>>>   <name>fetcher.server.delay</name>
>>>>   <value>0.1</value>
>>>>   <description>The number of seconds the fetcher will delay between
>>>>    successive requests to the same server. Note that this might get
>>>>    overriden by a Crawl-Delay from a robots.txt and is used ONLY if
>>>>    fetcher.threads.per.queue is set to 1.
>>>>    </description>
>>>>  </property>
>>>>  <property>
>>>>   <name>db.fetch.interval.default</name>
>>>>   <value>60</value>
>>>>   <description>The default number of seconds between re-fetches of a page
>>>> (30 days).
>>>>   </description>
>>>>  </property>
>>>>  <property>
>>>>   <name>db.update.purge.404</name>
>>>>   <value>true</value>
>>>>   <description>If true, updatedb will add purge records with status
>>>> DB_GONE
>>>>   from the CrawlDB.
>>>>   </description>
>>>>  </property>
>>>> </configuration>
>>>>
>>>> Steven
>>>>
>>>>
>>>> On Sat, 4 Jul 2015, Sebastian Nagel wrote:
>>>>
>>>>  Hi Steven,
>>>>
>>>>>
>>>>>  is the ordering of dedup and index wrong
>>>>>
>>>>>>
>>>>>>  No, that's correct: it would be not really efficient to first index
>>>>> duplicates
>>>>> and then remove them afterwards.
>>>>>
>>>>> If I understand right the db_gone pages have previously been indexed
>>>>> (and were successfully fetched), right?
>>>>>
>>>>>  but "bin/nutch dedup" removes the records entirely
>>>>>
>>>>>>
>>>>>>  A dedup job should neither remove records entirely,
>>>>> they are only set to status db_duplicate, nor should
>>>>> it touch anything except db_fetched and db_notmodified.
>>>>> If it does that's a bug.
>>>>>
>>>>> Can you send the exact commands of "nutch dedup" and "nutch index"?
>>>>> Have you checked the crawldb before and after using "bin/nutch readdb"
>>>>> to get some hints what's special with these urls or documents?
>>>>>
>>>>> Thanks,
>>>>> Sebastian
>>>>>
>>>>>
>>>>> On 07/03/2015 11:37 AM, Hayles, Steven wrote:
>>>>>
>>>>>  I'm using bin/crawl on Nutch 1.9 (with Solr 4.10.3)
>>>>>>
>>>>>> What I see is that "bin/nutch update" sets db_gone status correctly,
>>>>>> but
>>>>>> "bin/nutch dedup" removes the records entirely before "bin/nutch
>>>>>> index" can
>>>>>> tell Solr to remove them from its index.
>>>>>>
>>>>>> Is dedup doing more than it should, is the ordering of dedup and index
>>>>>> wrong, or is there some configuration that I have wrong?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Steven Hayles
>>>>>> Systems Analyst
>>>>>>
>>>>>> IT Services, University of Leicester,
>>>>>> Prospect House, 94 Regent Rd, Leicester, LE1 7DA, UK
>>>>>>
>>>>>> T: +44 (0)116 229 7950
>>>>>> E: sh23@le.ac.uk<ma...@le.ac.uk>
>>>>>>
>>>>>> The Queen's Anniversary Prizes 1994, 2002 & 2013
>>>>>> THE Awards Winners 2007-2013
>>>>>>
>>>>>> Elite without being elitist
>>>>>>
>>>>>> Follow us on Twitter http://twitter.com/uniofleicester or
>>>>>> visit our Facebook page https://facebook.com/UniofLeicester
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>

Re: Gone content not reported to Solr

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Steven,

thanks for reporting the issue.

I tried to reproduce the problem, without success.
Looking back over the conversation, I found that this property could be
the reason:

 <property>
  <name>db.update.purge.404</name>
  <value>true</value>
  <description>If true, updatedb will add purge records with status DB_GONE
  from the CrawlDB.
  </description>
 </property>

The dedup job shares some code with the update job, namely the
CrawlDbFilter mapper, which will filter away all db_gone records if
db.update.purge.404 is true. That's not really wrong (the next update
job would remove the gone pages anyway), but it should be clearly
documented.
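
A quick way to see the effect (a sketch, assuming a local crawl directory named crawl):

   # counts per CrawlDb status, including db_gone
   bin/nutch readdb crawl/crawldb -stats
   bin/nutch dedup crawl/crawldb
   # with db.update.purge.404=true the db_gone entries are gone after dedup
   bin/nutch readdb crawl/crawldb -stats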

Thanks again,
Sebastian

2015-07-07 10:30 GMT+02:00 Steven Hayles <sh...@leicester.ac.uk>:

>
> Created https://issues.apache.org/jira/browse/NUTCH-2060
>
> In fact, "bin/crawl" uses "bin/nutch clean" rather than the -deleteGone
> option on "bin/nutch index".
>
> As a workaround, I've added "bin/nutch clean" before "bin/nutch dedup"
>
> Steven Hayles
> Systems Analyst
>
> IT Services, University of Leicester,
> Prospect House, 94 Regent Rd, Leicester, LE1 7DA, UK
>
> T: +44 (0)116 229 7950
> E: sh23@le.ac.uk
>
> The Queen's Anniversary Prizes 1994, 2002 & 2013
> THE Awards Winners 2007-2013
>
> Elite without being elitist
>
> Follow us on Twitter http://twitter.com/uniofleicester or
> visit our Facebook page https://facebook.com/UniofLeicester
>
>
> On Mon, 6 Jul 2015, Sebastian Nagel wrote:
>
>  Hi Steven,
>>
>>  After dedup, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is no
>>>
>> longer present
>> That's a bug. It should be there, no question.  Could you, please, open a
>> Jira issue [1]
>>
>> The index command needs the option
>>  -deleteGone
>> to send deletions to Solr. But if the db_gone pages disappeared that has
>> no
>> effect,
>> of course :)
>>
>> Thanks,
>> Sebastian
>>
>>
>> 2015-07-06 10:07 GMT+02:00 Steven Hayles <sh...@leicester.ac.uk>:
>>
>>
>>> Hi Sebastian
>>>
>>> Thanks for your reply.
>>>
>>> I was using readdb to see what was happening. It looked like this
>>>
>>> Two pages indexed:
>>>
>>>   https://www2.test.le.ac.uk/sh23 Version: 7
>>>   Status: 2 (db_fetched)
>>>   --
>>>   https://www2.test.le.ac.uk/sh23/sleepy-zebra-page       Version: 7
>>>   Status: 2 (db_fetched)
>>>
>>> Deleted https://www2.test.le.ac.uk/sh23/sleepy-zebra-page
>>>
>>> After update, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is
>>> marked
>>> as db_gone, as expected:
>>>
>>>   https://www2.test.le.ac.uk/sh23 Version: 7
>>>   Status: 2 (db_fetched)
>>>   --
>>>   https://www2.test.le.ac.uk/sh23/sleepy-zebra-page       Version: 7
>>>   Status: 3 (db_gone)
>>>
>>> (After invert links, there was no change)
>>>
>>> After dedup, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is no
>>> longer present
>>>
>>>   https://www2.test.le.ac.uk/sh23 Version: 7
>>>   Status: 2 (db_fetched)
>>>
>>> Neither index nor clean clear 404s from Solr.
>>>
>>>
>>> I'm just using the commands as given in bin/crawl from Nutch 1.9:
>>>
>>>   $bin/nutch dedup $CRAWL_PATH/crawldb
>>>
>>>   "$bin/nutch" index -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb
>>> -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
>>>
>>>
>>> When I added an extra clean before dedup, Solr got the instruction to
>>> remove the deleted document.
>>>
>>> There's nothing much in nutch-site.xml. It's mostly limits to make
>>> testing
>>> easier, static field added, metadata processing removed,
>>> db.update.purge.404 enabled.
>>>
>>> <?xml version="1.0"?>
>>> <configuration>
>>>  <property>
>>>   <name>http.agent.name</name>
>>>   <value>nutch-solr-integration</value>
>>>  </property>
>>>  <property>
>>>   <name>generate.max.per.host</name>
>>>   <value>100</value>
>>>  </property>
>>>  <property>
>>>   <name>plugin.includes</name>
>>>
>>>
>>> <value>protocol-httpclient|urlfilter-regex|index-(basic|more|static)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|static)</value>
>>>  </property>
>>>  <property>
>>>    <name>index.static</name>
>>>    <value>_indexname:sitecore_web_index,_created_by_nutch:true</value>
>>>    <description>
>>>     Used by plugin index-static to adds fields with static data at
>>> indexing time.
>>>    You can specify a comma-separated list of fieldname:fieldcontent per
>>> Nutch job.
>>>   Each fieldcontent can have multiple values separated by space, e.g.,
>>>    field1:value1.1 value1.2 value1.3,field2:value2.1 value2.2 ...
>>>    It can be useful when collections can't be created by URL patterns,
>>>   like in subcollection, but on a job-basis.
>>>   </description>
>>>  </property>
>>>  <property>
>>>   <name>http.timeout</name>
>>>   <value>5000</value>
>>>   <description>The default network timeout, in
>>> milliseconds.</description>
>>>  </property>
>>>  <property>
>>>   <name>fetcher.server.delay</name>
>>>   <value>0.1</value>
>>>   <description>The number of seconds the fetcher will delay between
>>>    successive requests to the same server. Note that this might get
>>>    overriden by a Crawl-Delay from a robots.txt and is used ONLY if
>>>    fetcher.threads.per.queue is set to 1.
>>>    </description>
>>>  </property>
>>>  <property>
>>>   <name>db.fetch.interval.default</name>
>>>   <value>60</value>
>>>   <description>The default number of seconds between re-fetches of a page
>>> (30 days).
>>>   </description>
>>>  </property>
>>>  <property>
>>>   <name>db.update.purge.404</name>
>>>   <value>true</value>
>>>   <description>If true, updatedb will add purge records with status
>>> DB_GONE
>>>   from the CrawlDB.
>>>   </description>
>>>  </property>
>>> </configuration>
>>>
>>> Steven
>>>
>>>
>>> On Sat, 4 Jul 2015, Sebastian Nagel wrote:
>>>
>>>  Hi Steven,
>>>
>>>>
>>>>  is the ordering of dedup and index wrong
>>>>
>>>>>
>>>>>  No, that's correct: it would be not really efficient to first index
>>>> duplicates
>>>> and then remove them afterwards.
>>>>
>>>> If I understand right the db_gone pages have previously been indexed
>>>> (and were successfully fetched), right?
>>>>
>>>>  but "bin/nutch dedup" removes the records entirely
>>>>
>>>>>
>>>>>  A dedup job should neither remove records entirely,
>>>> they are only set to status db_duplicate, nor should
>>>> it touch anything except db_fetched and db_notmodified.
>>>> If it does that's a bug.
>>>>
>>>> Can you send the exact commands of "nutch dedup" and "nutch index"?
>>>> Have you checked the crawldb before and after using "bin/nutch readdb"
>>>> to get some hints what's special with these urls or documents?
>>>>
>>>> Thanks,
>>>> Sebastian
>>>>
>>>>
>>>> On 07/03/2015 11:37 AM, Hayles, Steven wrote:
>>>>
>>>>  I'm using bin/crawl on Nutch 1.9 (with Solr 4.10.3)
>>>>>
>>>>> What I see is that "bin/nutch update" sets db_gone status correctly,
>>>>> but
>>>>> "bin/nutch dedup" removes the records entirely before "bin/nutch
>>>>> index" can
>>>>> tell Solr to remove them from its index.
>>>>>
>>>>> Is dedup doing more than it should, is the ordering of dedup and index
>>>>> wrong, or is there some configuration that I have wrong?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Steven Hayles
>>>>> Systems Analyst
>>>>>
>>>>> IT Services, University of Leicester,
>>>>> Prospect House, 94 Regent Rd, Leicester, LE1 7DA, UK
>>>>>
>>>>> T: +44 (0)116 229 7950
>>>>> E: sh23@le.ac.uk<ma...@le.ac.uk>
>>>>>
>>>>> The Queen's Anniversary Prizes 1994, 2002 & 2013
>>>>> THE Awards Winners 2007-2013
>>>>>
>>>>> Elite without being elitist
>>>>>
>>>>> Follow us on Twitter http://twitter.com/uniofleicester or
>>>>> visit our Facebook page https://facebook.com/UniofLeicester
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>

Re: Gone content not reported to Solr

Posted by Steven Hayles <sh...@leicester.ac.uk>.
Created https://issues.apache.org/jira/browse/NUTCH-2060

In fact, "bin/crawl" uses "bin/nutch clean" rather than the -deleteGone 
option on "bin/nutch index".

As a workaround, I've added "bin/nutch clean" before "bin/nutch dedup".
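
In bin/crawl the change looks roughly like this (the clean invocation simply copies the form already used for the cleaning step at the end of the script; exact options may differ):

   # extra clean so db_gone pages are deleted from Solr before dedup purges them
   "$bin/nutch" clean -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb

   # existing steps
   "$bin/nutch" dedup "$CRAWL_PATH"/crawldb
   "$bin/nutch" index -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT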

Steven Hayles
Systems Analyst

IT Services, University of Leicester,
Prospect House, 94 Regent Rd, Leicester, LE1 7DA, UK

T: +44 (0)116 229 7950
E: sh23@le.ac.uk

The Queen's Anniversary Prizes 1994, 2002 & 2013
THE Awards Winners 2007-2013

Elite without being elitist

Follow us on Twitter http://twitter.com/uniofleicester or
visit our Facebook page https://facebook.com/UniofLeicester


On Mon, 6 Jul 2015, Sebastian Nagel wrote:

> Hi Steven,
>
>> After dedup, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is no
> longer present
> That's a bug. It should be there, no question.  Could you, please, open a
> Jira issue [1]
>
> The index command needs the option
>  -deleteGone
> to send deletions to Solr. But if the db_gone pages disappeared that has no
> effect,
> of course :)
>
> Thanks,
> Sebastian
>
>
> 2015-07-06 10:07 GMT+02:00 Steven Hayles <sh...@leicester.ac.uk>:
>
>>
>> Hi Sebastian
>>
>> Thanks for your reply.
>>
>> I was using readdb to see what was happening. It looked like this
>>
>> Two pages indexed:
>>
>>   https://www2.test.le.ac.uk/sh23 Version: 7
>>   Status: 2 (db_fetched)
>>   --
>>   https://www2.test.le.ac.uk/sh23/sleepy-zebra-page       Version: 7
>>   Status: 2 (db_fetched)
>>
>> Deleted https://www2.test.le.ac.uk/sh23/sleepy-zebra-page
>>
>> After update, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is marked
>> as db_gone, as expected:
>>
>>   https://www2.test.le.ac.uk/sh23 Version: 7
>>   Status: 2 (db_fetched)
>>   --
>>   https://www2.test.le.ac.uk/sh23/sleepy-zebra-page       Version: 7
>>   Status: 3 (db_gone)
>>
>> (After invert links, there was no change)
>>
>> After dedup, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is no
>> longer present
>>
>>   https://www2.test.le.ac.uk/sh23 Version: 7
>>   Status: 2 (db_fetched)
>>
>> Neither index nor clean clear 404s from Solr.
>>
>>
>> I'm just using the commands as given in bin/crawl from Nutch 1.9:
>>
>>   $bin/nutch dedup $CRAWL_PATH/crawldb
>>
>>   "$bin/nutch" index -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb
>> -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
>>
>>
>> When I added an extra clean before dedup, Solr got the instruction to
>> remove the deleted document.
>>
>> There's nothing much in nutch-site.xml. It's mostly limits to make testing
>> easier, static field added, metadata processing removed,
>> db.update.purge.404 enabled.
>>
>> <?xml version="1.0"?>
>> <configuration>
>>  <property>
>>   <name>http.agent.name</name>
>>   <value>nutch-solr-integration</value>
>>  </property>
>>  <property>
>>   <name>generate.max.per.host</name>
>>   <value>100</value>
>>  </property>
>>  <property>
>>   <name>plugin.includes</name>
>>
>> <value>protocol-httpclient|urlfilter-regex|index-(basic|more|static)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|static)</value>
>>  </property>
>>  <property>
>>    <name>index.static</name>
>>    <value>_indexname:sitecore_web_index,_created_by_nutch:true</value>
>>    <description>
>>     Used by plugin index-static to adds fields with static data at
>> indexing time.
>>    You can specify a comma-separated list of fieldname:fieldcontent per
>> Nutch job.
>>   Each fieldcontent can have multiple values separated by space, e.g.,
>>    field1:value1.1 value1.2 value1.3,field2:value2.1 value2.2 ...
>>    It can be useful when collections can't be created by URL patterns,
>>   like in subcollection, but on a job-basis.
>>   </description>
>>  </property>
>>  <property>
>>   <name>http.timeout</name>
>>   <value>5000</value>
>>   <description>The default network timeout, in milliseconds.</description>
>>  </property>
>>  <property>
>>   <name>fetcher.server.delay</name>
>>   <value>0.1</value>
>>   <description>The number of seconds the fetcher will delay between
>>    successive requests to the same server. Note that this might get
>>    overriden by a Crawl-Delay from a robots.txt and is used ONLY if
>>    fetcher.threads.per.queue is set to 1.
>>    </description>
>>  </property>
>>  <property>
>>   <name>db.fetch.interval.default</name>
>>   <value>60</value>
>>   <description>The default number of seconds between re-fetches of a page
>> (30 days).
>>   </description>
>>  </property>
>>  <property>
>>   <name>db.update.purge.404</name>
>>   <value>true</value>
>>   <description>If true, updatedb will add purge records with status DB_GONE
>>   from the CrawlDB.
>>   </description>
>>  </property>
>> </configuration>
>>
>> Steven
>>
>>
>> On Sat, 4 Jul 2015, Sebastian Nagel wrote:
>>
>>  Hi Steven,
>>>
>>>  is the ordering of dedup and index wrong
>>>>
>>> No, that's correct: it would be not really efficient to first index
>>> duplicates
>>> and then remove them afterwards.
>>>
>>> If I understand right the db_gone pages have previously been indexed
>>> (and were successfully fetched), right?
>>>
>>>  but "bin/nutch dedup" removes the records entirely
>>>>
>>> A dedup job should neither remove records entirely,
>>> they are only set to status db_duplicate, nor should
>>> it touch anything except db_fetched and db_notmodified.
>>> If it does that's a bug.
>>>
>>> Can you send the exact commands of "nutch dedup" and "nutch index"?
>>> Have you checked the crawldb before and after using "bin/nutch readdb"
>>> to get some hints what's special with these urls or documents?
>>>
>>> Thanks,
>>> Sebastian
>>>
>>>
>>> On 07/03/2015 11:37 AM, Hayles, Steven wrote:
>>>
>>>> I'm using bin/crawl on Nutch 1.9 (with Solr 4.10.3)
>>>>
>>>> What I see is that "bin/nutch update" sets db_gone status correctly, but
>>>> "bin/nutch dedup" removes the records entirely before "bin/nutch index" can
>>>> tell Solr to remove them from its index.
>>>>
>>>> Is dedup doing more than it should, is the ordering of dedup and index
>>>> wrong, or is there some configuration that I have wrong?
>>>>
>>>> Thanks
>>>>
>>>> Steven Hayles
>>>> Systems Analyst
>>>>
>>>> IT Services, University of Leicester,
>>>> Prospect House, 94 Regent Rd, Leicester, LE1 7DA, UK
>>>>
>>>> T: +44 (0)116 229 7950
>>>> E: sh23@le.ac.uk<ma...@le.ac.uk>
>>>>
>>>> The Queen's Anniversary Prizes 1994, 2002 & 2013
>>>> THE Awards Winners 2007-2013
>>>>
>>>> Elite without being elitist
>>>>
>>>> Follow us on Twitter http://twitter.com/uniofleicester or
>>>> visit our Facebook page https://facebook.com/UniofLeicester
>>>>
>>>>
>>>>
>>>
>>>
>

Re: Gone content not reported to Solr

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Steven,

> After dedup, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is no longer present
That's a bug. It should be there, no question. Could you please open a Jira issue? [1]

The index command needs the option
  -deleteGone
to send deletions to Solr. But if the db_gone pages have already
disappeared, that has no effect, of course :)
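
For example, the index step from bin/crawl would then look something like this (same arguments as in the script, with the flag appended):

   "$bin/nutch" index -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT -deleteGone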

Thanks,
Sebastian


2015-07-06 10:07 GMT+02:00 Steven Hayles <sh...@leicester.ac.uk>:

>
> Hi Sebastian
>
> Thanks for your reply.
>
> I was using readdb to see what was happening. It looked like this
>
> Two pages indexed:
>
>   https://www2.test.le.ac.uk/sh23 Version: 7
>   Status: 2 (db_fetched)
>   --
>   https://www2.test.le.ac.uk/sh23/sleepy-zebra-page       Version: 7
>   Status: 2 (db_fetched)
>
> Deleted https://www2.test.le.ac.uk/sh23/sleepy-zebra-page
>
> After update, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is marked
> as db_gone, as expected:
>
>   https://www2.test.le.ac.uk/sh23 Version: 7
>   Status: 2 (db_fetched)
>   --
>   https://www2.test.le.ac.uk/sh23/sleepy-zebra-page       Version: 7
>   Status: 3 (db_gone)
>
> (After invert links, there was no change)
>
> After dedup, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is no
> longer present
>
>   https://www2.test.le.ac.uk/sh23 Version: 7
>   Status: 2 (db_fetched)
>
> Neither index nor clean clear 404s from Solr.
>
>
> I'm just using the commands as given in bin/crawl from Nutch 1.9:
>
>   $bin/nutch dedup $CRAWL_PATH/crawldb
>
>   "$bin/nutch" index -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb
> -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
>
>
> When I added an extra clean before dedup, Solr got the instruction to
> remove the deleted document.
>
> There's nothing much in nutch-site.xml. It's mostly limits to make testing
> easier, static field added, metadata processing removed,
> db.update.purge.404 enabled.
>
> <?xml version="1.0"?>
> <configuration>
>  <property>
>   <name>http.agent.name</name>
>   <value>nutch-solr-integration</value>
>  </property>
>  <property>
>   <name>generate.max.per.host</name>
>   <value>100</value>
>  </property>
>  <property>
>   <name>plugin.includes</name>
>
> <value>protocol-httpclient|urlfilter-regex|index-(basic|more|static)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|static)</value>
>  </property>
>  <property>
>    <name>index.static</name>
>    <value>_indexname:sitecore_web_index,_created_by_nutch:true</value>
>    <description>
>     Used by plugin index-static to adds fields with static data at
> indexing time.
>    You can specify a comma-separated list of fieldname:fieldcontent per
> Nutch job.
>   Each fieldcontent can have multiple values separated by space, e.g.,
>    field1:value1.1 value1.2 value1.3,field2:value2.1 value2.2 ...
>    It can be useful when collections can't be created by URL patterns,
>   like in subcollection, but on a job-basis.
>   </description>
>  </property>
>  <property>
>   <name>http.timeout</name>
>   <value>5000</value>
>   <description>The default network timeout, in milliseconds.</description>
>  </property>
>  <property>
>   <name>fetcher.server.delay</name>
>   <value>0.1</value>
>   <description>The number of seconds the fetcher will delay between
>    successive requests to the same server. Note that this might get
>    overriden by a Crawl-Delay from a robots.txt and is used ONLY if
>    fetcher.threads.per.queue is set to 1.
>    </description>
>  </property>
>  <property>
>   <name>db.fetch.interval.default</name>
>   <value>60</value>
>   <description>The default number of seconds between re-fetches of a page
> (30 days).
>   </description>
>  </property>
>  <property>
>   <name>db.update.purge.404</name>
>   <value>true</value>
>   <description>If true, updatedb will add purge records with status DB_GONE
>   from the CrawlDB.
>   </description>
>  </property>
> </configuration>
>
> Steven
>
>
> On Sat, 4 Jul 2015, Sebastian Nagel wrote:
>
>  Hi Steven,
>>
>>  is the ordering of dedup and index wrong
>>>
>> No, that's correct: it would be not really efficient to first index
>> duplicates
>> and then remove them afterwards.
>>
>> If I understand right the db_gone pages have previously been indexed
>> (and were successfully fetched), right?
>>
>>  but "bin/nutch dedup" removes the records entirely
>>>
>> A dedup job should neither remove records entirely,
>> they are only set to status db_duplicate, nor should
>> it touch anything except db_fetched and db_notmodified.
>> If it does that's a bug.
>>
>> Can you send the exact commands of "nutch dedup" and "nutch index"?
>> Have you checked the crawldb before and after using "bin/nutch readdb"
>> to get some hints what's special with these urls or documents?
>>
>> Thanks,
>> Sebastian
>>
>>
>> On 07/03/2015 11:37 AM, Hayles, Steven wrote:
>>
>>> I'm using bin/crawl on Nutch 1.9 (with Solr 4.10.3)
>>>
>>> What I see is that "bin/nutch update" sets db_gone status correctly, but
>>> "bin/nutch dedup" removes the records entirely before "bin/nutch index" can
>>> tell Solr to remove them from its index.
>>>
>>> Is dedup doing more than it should, is the ordering of dedup and index
>>> wrong, or is there some configuration that I have wrong?
>>>
>>> Thanks
>>>
>>> Steven Hayles
>>> Systems Analyst
>>>
>>> IT Services, University of Leicester,
>>> Prospect House, 94 Regent Rd, Leicester, LE1 7DA, UK
>>>
>>> T: +44 (0)116 229 7950
>>> E: sh23@le.ac.uk<ma...@le.ac.uk>
>>>
>>> The Queen's Anniversary Prizes 1994, 2002 & 2013
>>> THE Awards Winners 2007-2013
>>>
>>> Elite without being elitist
>>>
>>> Follow us on Twitter http://twitter.com/uniofleicester or
>>> visit our Facebook page https://facebook.com/UniofLeicester
>>>
>>>
>>>
>>
>>

Re: Gone content not reported to Solr

Posted by Steven Hayles <sh...@leicester.ac.uk>.
Hi Sebastian

Thanks for your reply.

I was using readdb to see what was happening. It looked like this:

Two pages indexed:

   https://www2.test.le.ac.uk/sh23 Version: 7
   Status: 2 (db_fetched)
   --
   https://www2.test.le.ac.uk/sh23/sleepy-zebra-page       Version: 7
   Status: 2 (db_fetched)

Deleted https://www2.test.le.ac.uk/sh23/sleepy-zebra-page

After update, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is marked 
as db_gone, as expected:

   https://www2.test.le.ac.uk/sh23 Version: 7
   Status: 2 (db_fetched)
   --
   https://www2.test.le.ac.uk/sh23/sleepy-zebra-page       Version: 7
   Status: 3 (db_gone)

(After invert links, there was no change)

After dedup, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is no 
longer present

   https://www2.test.le.ac.uk/sh23 Version: 7
   Status: 2 (db_fetched)

Neither index nor clean clears the 404s from Solr.


I'm just using the commands as given in bin/crawl from Nutch 1.9:

   $bin/nutch dedup $CRAWL_PATH/crawldb

   "$bin/nutch" index -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb 
-linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT


When I added an extra clean before dedup, Solr got the instruction to 
remove the deleted document.

There's not much in nutch-site.xml. It's mostly limits to make testing
easier, a static field added, metadata processing removed, and
db.update.purge.404 enabled.

<?xml version="1.0"?>
<configuration>
  <property>
   <name>http.agent.name</name>
   <value>nutch-solr-integration</value>
  </property>
  <property>
   <name>generate.max.per.host</name>
   <value>100</value>
  </property>
  <property>
   <name>plugin.includes</name>
   <value>protocol-httpclient|urlfilter-regex|index-(basic|more|static)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|static)</value>
  </property>
  <property>
    <name>index.static</name>
    <value>_indexname:sitecore_web_index,_created_by_nutch:true</value>
    <description>
     Used by plugin index-static to adds fields with static data at 
indexing 
time.
    You can specify a comma-separated list of fieldname:fieldcontent per 
Nutch job.
   Each fieldcontent can have multiple values separated by space, e.g.,
    field1:value1.1 value1.2 value1.3,field2:value2.1 value2.2 ...
    It can be useful when collections can't be created by URL patterns,
   like in subcollection, but on a job-basis.
   </description>
  </property>
  <property>
   <name>http.timeout</name>
   <value>5000</value>
   <description>The default network timeout, in milliseconds.</description>
  </property>
  <property>
   <name>fetcher.server.delay</name>
   <value>0.1</value>
   <description>The number of seconds the fetcher will delay between
    successive requests to the same server. Note that this might get
    overriden by a Crawl-Delay from a robots.txt and is used ONLY if
    fetcher.threads.per.queue is set to 1.
    </description>
  </property>
  <property>
   <name>db.fetch.interval.default</name>
   <value>60</value>
   <description>The default number of seconds between re-fetches of a page 
(30 days).
   </description>
  </property>
  <property>
   <name>db.update.purge.404</name>
   <value>true</value>
   <description>If true, updatedb will add purge records with status DB_GONE
   from the CrawlDB.
   </description>
  </property>
</configuration>

Steven

On Sat, 4 Jul 2015, Sebastian Nagel wrote:

> Hi Steven,
>
>> is the ordering of dedup and index wrong
> No, that's correct: it would be not really efficient to first index duplicates
> and then remove them afterwards.
>
> If I understand right the db_gone pages have previously been indexed
> (and were successfully fetched), right?
>
>> but "bin/nutch dedup" removes the records entirely
> A dedup job should neither remove records entirely,
> they are only set to status db_duplicate, nor should
> it touch anything except db_fetched and db_notmodified.
> If it does that's a bug.
>
> Can you send the exact commands of "nutch dedup" and "nutch index"?
> Have you checked the crawldb before and after using "bin/nutch readdb"
> to get some hints what's special with these urls or documents?
>
> Thanks,
> Sebastian
>
>
> On 07/03/2015 11:37 AM, Hayles, Steven wrote:
>> I'm using bin/crawl on Nutch 1.9 (with Solr 4.10.3)
>>
>> What I see is that "bin/nutch update" sets db_gone status correctly, but "bin/nutch dedup" removes the records entirely before "bin/nutch index" can tell Solr to remove them from its index.
>>
>> Is dedup doing more than it should, is the ordering of dedup and index wrong, or is there some configuration that I have wrong?
>>
>> Thanks
>>
>> Steven Hayles
>> Systems Analyst
>>
>> IT Services, University of Leicester,
>> Prospect House, 94 Regent Rd, Leicester, LE1 7DA, UK
>>
>> T: +44 (0)116 229 7950
>> E: sh23@le.ac.uk<ma...@le.ac.uk>
>>
>> The Queen's Anniversary Prizes 1994, 2002 & 2013
>> THE Awards Winners 2007-2013
>>
>> Elite without being elitist
>>
>> Follow us on Twitter http://twitter.com/uniofleicester or
>> visit our Facebook page https://facebook.com/UniofLeicester
>>
>>
>
>

Re: Gone content not reported to Solr

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Steven,

> is the ordering of dedup and index wrong
No, that's correct: it would not be very efficient to first index duplicates
and then remove them afterwards.

If I understand correctly, the db_gone pages have previously been indexed
(and were successfully fetched), right?

> but "bin/nutch dedup" removes the records entirely
A dedup job should not remove records entirely (duplicates are only
set to status db_duplicate), nor should it touch anything except
db_fetched and db_notmodified. If it does, that's a bug.

Can you send the exact commands used for "nutch dedup" and "nutch index"?
Have you checked the crawldb before and after with "bin/nutch readdb"
to get some hints about what's special with these URLs or documents?
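
For a single URL, something like this prints the stored record (path and URL below are only placeholders):

   bin/nutch readdb crawl/crawldb -url http://example.com/some-gone-page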

Thanks,
Sebastian


On 07/03/2015 11:37 AM, Hayles, Steven wrote:
> I'm using bin/crawl on Nutch 1.9 (with Solr 4.10.3)
> 
> What I see is that "bin/nutch update" sets db_gone status correctly, but "bin/nutch dedup" removes the records entirely before "bin/nutch index" can tell Solr to remove them from its index.
> 
> Is dedup doing more than it should, is the ordering of dedup and index wrong, or is there some configuration that I have wrong?
> 
> Thanks
> 
> Steven Hayles
> Systems Analyst
> 
> IT Services, University of Leicester,
> Prospect House, 94 Regent Rd, Leicester, LE1 7DA, UK
> 
> T: +44 (0)116 229 7950
> E: sh23@le.ac.uk<ma...@le.ac.uk>
> 
> The Queen's Anniversary Prizes 1994, 2002 & 2013
> THE Awards Winners 2007-2013
> 
> Elite without being elitist
> 
> Follow us on Twitter http://twitter.com/uniofleicester or
> visit our Facebook page https://facebook.com/UniofLeicester
> 
>