Posted to user@nutch.apache.org by Sebastian Nagel <wa...@googlemail.com> on 2017/08/01 14:57:40 UTC

Re: AW: Crawling with nutch, check Links

> as the ticket is more than 2 years old, I assume it won't be fixed.. :-(

Not necessarily. Other features got in after more than two years ;)
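
Until then, the workaround quoted below (recrawl from scratch, then delete documents from the index which haven't been updated) could be sketched roughly as follows. This is only a minimal sketch under assumptions: it assumes the Solr documents carry a date field named "tstamp" that is refreshed whenever a page is re-indexed, and the core URL in the usage note is a made-up example. Please check the field name against your own schema before deleting anything.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_delete_query(crawl_start: datetime, field: str = "tstamp") -> dict:
    """Build a Solr delete-by-query body for documents not updated since crawl_start.

    `field` is an assumption: it must be a date field that is rewritten
    every time a document is re-indexed.
    """
    # Solr expects UTC timestamps in ISO 8601 form with a trailing 'Z'.
    # crawl_start must be timezone-aware, otherwise local time is assumed.
    cutoff = crawl_start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # '[* TO cutoff}' matches everything strictly older than the cutoff.
    return {"delete": {"query": field + ":[* TO " + cutoff + "}"}}


def send_delete(solr_update_url: str, body: dict) -> int:
    """POST the delete request to a Solr update handler and return the HTTP status."""
    request = urllib.request.Request(
        solr_update_url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Time at which the fresh crawl was started (example value).
    start = datetime(2017, 7, 28, 10, 9, tzinfo=timezone.utc)
    # Only print the request body here; nothing is sent to Solr.
    print(json.dumps(build_delete_query(start)))
```

To actually run the cleanup one would call something like send_delete("http://localhost:8983/solr/nutch/update?commit=true", body), where the host and core name are hypothetical. It is probably wise to run the same query as a normal search first and look at what it matches.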

On 07/31/2017 07:32 AM, d.kumar@technisat.de wrote:
> Hey Sebastian,
> 
> 
> thanks. What I have done so far is delete the database and start a whole new crawl.
> I had seen that Jira issue about orphaned pages before. That is exactly what I'm looking for. As the ticket is more than 2 years old, I assume it won't be fixed.. :-(
> 
> Thanks
> 
> David
> 
> 
> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
> Sent: Friday, 28 July 2017 12:09
> To: user@nutch.apache.org
> Subject: Re: Crawling with nutch, check Links
> 
> Hi David,
> 
> the easiest way is to delete the CrawlDb and to start the crawl from scratch.
> Since it's a site crawl this should be possible, at least, from time to time.
> Then delete documents from the index which haven't been updated.
> 
> A more sophisticated solution is not yet ready, see
>   https://issues.apache.org/jira/browse/NUTCH-1932
> 
> Best,
> Sebastian
> 
> On 07/27/2017 10:11 AM, d.kumar@technisat.de wrote:
>> Hey,
>>
>> currently I'm working on nutch with solr for our company pages.
>>
>> Assuming the following situation:
>> We have a website:
>>
>> www.mysite.lol
>>
>> On this site there is a link:
>> www.mysite.lol/tespage/3512-1564/
>>
>> As you can see, there is a typo; it should be /testpage/:
>>
>> www.mysite.lol/testpage/3512-1564/
>>
>> As our framework doesn't care about the text before the ID, we could type anything we want and the page would still be displayed because of the ID. That is why both links work and there is no 404.
>> If I change the link on the main page to the correct one, let Nutch crawl the site again, and send it to Solr, the old one is still found.
>>
>> So the link
>> www.mysite.lol/tespage/3512-1564/ is still in the Nutch db, because the link is valid --> no 404.
>> But the main page no longer points to this page. How do I tell Nutch to ignore pages that have no links pointing to them?
>> Basically --> revalidating links and removing pages without links to them?
>>
>>
>>
>> Kind regards
>> David Kumar
>>
>> Senior Software Engineer Java, B. Sc.
>> Project Manager PIM
>> Infotech Department
>> TechniSat Digital GmbH
>> Julius-Saxler-Straße 3
>> TechniPark
>> D-54550 Daun / Germany
>>
>> Tel.: + 49 (0) 6592 / 712 -2826
>> Fax: + 49 (0) 6592 / 712 -2829
>>
>> www.technisat.com/de_DE/
>> www.facebook.com/technisat
>>
>>
>