Posted to user@nutch.apache.org by Semyon Semyonov <se...@mail.com> on 2017/10/19 11:51:37 UTC

Parsing and URL filter plugins that depend on URL pattern.

Dear all,

I want to adapt Nutch to crawl a single large text-based website, and therefore to develop plugins and tune settings for the best crawling performance.

More precisely, there is a website with 3 categories: A, B, C. The URLs are therefore website/A/itemN, website/B/articleN, website/C/descriptionN.
For example, category A contains web-shop-like pages with price, ratings, etc. B has article pages including header, text, author, and so on.

1) How do I write an HTML parser that produces different key-value pairs for different URL patterns (different HTML patterns), e.g. NameOfItem and Price for website/A/ children, Header and Text for website/B children?
Should I implement an HTML parser from scratch, or can I add parsing afterwards? What is the best place to do it, and how should I distinguish between the different URL categories?

2) Assume I have turned off external links and crawl only internally. I would like to crawl each category to a different depth. For example, I want to crawl 50,000 pages in category A, 10,000 in B, and only 100 in C. What is the best way to do this?
There is a URL filter plugin, but I don't know how to use it based on the URL pattern or the parent URL's metadata.

Thank you.
Semyon.

Re: Parsing and URL filter plugins that depend on URL pattern.

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Semyon,

(bringing the conversation back to the user list, sorry)

> I have some questions about the scoring-depth plugin.

> Is it recursive?
> I saw that the plugin divides the overall score by the number of outgoing
> links (score /= validCount); in other words, if the initial score is 1 and
> we have 100 links, each link will get 1/100 = 0.01.
> But does that mean that the next round of crawling will work with 0.01 as
> the score and divide it by the new validCount?

Yes, of course, unless the score is also affected by another scoring plugin.
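
For illustration, the recursive decay then looks like this (a sketch of the
behaviour described above, not the plugin's actual source):

// round 1: a seed page with score 1.0 and 100 valid outlinks
float score = 1.0f;
int validCount = 100;
float outlinkScore = score / validCount;     // 0.01 per outlink

// round 2: one of those pages is fetched with score 0.01 and has
// 50 valid outlinks of its own; the division repeats
float nextScore = outlinkScore / 50;         // 0.0002 per outlink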

> The second question.
> Can I access somehow the overall number of links for a host from this plugin?

No.

> The third question.
> How can I use the result of the plugin, CrawlDatum.score, to prevent
> crawling at a specific moment? In other words, how can I use it in
> generate or updatedb to stop crawling after a threshold?

That's the method ScoringFilter.generatorSortValue(...) - it's implemented by scoring-depth
in combination with

<property>
  <name>generate.min.score</name>
  <value>0</value>
  <description>Select only entries with a score larger than
  generate.min.score.</description>
</property>
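
For illustration, a minimal custom scoring filter could pass the stored
score through as the generator sort value (a sketch assuming Nutch's
AbstractScoringFilter base class for no-op defaults; it is not the actual
scoring-depth source):

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.AbstractScoringFilter;
import org.apache.nutch.scoring.ScoringFilterException;

public class PassThroughScoringFilter extends AbstractScoringFilter {

  @Override
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    // The Generator compares this value against generate.min.score:
    // once the recursively divided score has decayed below the
    // threshold, the page is no longer selected for fetching.
    return datum.getScore() * initSort;
  }
}

With generate.min.score set to e.g. 0.001 instead of 0, entries whose score
has dropped below that value are skipped by the generate step, which
effectively stops crawling down a branch after the threshold is reached.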

Best,
Sebastian


On 10/19/2017 05:33 PM, Semyon Semyonov wrote:
> Hi Sebastian,
> 
> Thank you for your answer.
> 
> I have some questions about the scoring-depth plugin.
> 
> The first question.
> 
> Is it recursive?
> I saw that the plugin divides the overall score by the number of outgoing
> links (score /= validCount); in other words, if the initial score is 1 and
> we have 100 links, each link will get 1/100 = 0.01.
> But does that mean that the next round of crawling will work with 0.01 as
> the score and divide it by the new validCount?
>  
> The second question. 
> Can I access somehow the overall number of links for a host from this plugin?
> 
> The third question. 
> How can I use the result of the plugin, CrawlDatum.score, to prevent
> crawling at a specific moment? In other words, how can I use it in
> generate or updatedb to stop crawling after a threshold?
> 
> Semyon.
> *Sent:* Thursday, October 19, 2017 at 3:34 PM
> *From:* "Sebastian Nagel" <wa...@googlemail.com>
> *To:* "Semyon Semyonov" <se...@mail.com>
> *Subject:* Re: Parsing and URL filter plugins that depend on URL pattern.
> Hi Semyon,
> 
>> Should I implement an HTML parser from scratch, or can I add parsing
>> afterwards? What is the best place to do it, and how should I distinguish
>> between the different URL categories?
> 
> Have a look at the parse-filter plugin interface:
> http://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/HtmlParseFilter.html
> You'll get the DOM tree and, if needed, the URL via content.getUrl().
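> 
> For illustration, a minimal sketch of such a parse filter (a hypothetical
> plugin: the tag names, metadata keys and helper methods are assumptions,
> only the HtmlParseFilter interface and content.getUrl() come from the
> docs above):
> 
> import org.apache.hadoop.conf.Configuration;
> import org.apache.nutch.parse.HTMLMetaTags;
> import org.apache.nutch.parse.HtmlParseFilter;
> import org.apache.nutch.parse.Parse;
> import org.apache.nutch.parse.ParseResult;
> import org.apache.nutch.protocol.Content;
> import org.w3c.dom.DocumentFragment;
> import org.w3c.dom.Node;
> 
> public class CategoryParseFilter implements HtmlParseFilter {
> 
>   private Configuration conf;
> 
>   @Override
>   public ParseResult filter(Content content, ParseResult parseResult,
>       HTMLMetaTags metaTags, DocumentFragment doc) {
>     String url = content.getUrl();
>     Parse parse = parseResult.get(url);
>     if (url.contains("/A/")) {
>       // web-shop pages: store item name and price in the parse metadata
>       putIfFound(parse, "NameOfItem", firstText(doc, "h1"));
>       putIfFound(parse, "Price", firstText(doc, "span"));
>     } else if (url.contains("/B/")) {
>       // article pages: store header and author
>       putIfFound(parse, "Header", firstText(doc, "h1"));
>       putIfFound(parse, "Author", firstText(doc, "em"));
>     }
>     return parseResult;
>   }
> 
>   private void putIfFound(Parse parse, String key, String value) {
>     if (value != null) {
>       parse.getData().getParseMeta().set(key, value);
>     }
>   }
> 
>   // Hypothetical helper: text of the first element with the given tag.
>   private String firstText(Node node, String tag) {
>     if (node.getNodeType() == Node.ELEMENT_NODE
>         && tag.equalsIgnoreCase(node.getNodeName())) {
>       return node.getTextContent();
>     }
>     for (int i = 0; i < node.getChildNodes().getLength(); i++) {
>       String text = firstText(node.getChildNodes().item(i), tag);
>       if (text != null) {
>         return text;
>       }
>     }
>     return null;
>   }
> 
>   @Override
>   public void setConf(Configuration conf) {
>     this.conf = conf;
>   }
> 
>   @Override
>   public Configuration getConf() {
>     return conf;
>   }
> }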
> 
>> I would like to crawl each category to a different depth.
> 
> There is the scoring-depth plugin; you could extend it.
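> 
> For example, its maximum depth is configurable (assuming the plugin's
> scoring.depth.max property from nutch-default.xml; seeds can also carry
> per-seed depth metadata):
> 
> <property>
>   <name>scoring.depth.max</name>
>   <value>3</value>
>   <description>Max depth value from seed allowed by default.</description>
> </property>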
> 
> But a pragmatic solution could be to set up 3 crawls with similar
> configuration, only slightly different URL filters to accept only one of
> website/A/, B or C, and with different depths (see the filter sketch below).
> 
>> There is a URL filter plugin,
> 
> There are multiple URL filters; all operate only on the URL string, matched
> by suffix, prefix, regular expression, ...
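> 
> For illustration, the regex-urlfilter.txt for the category-A crawl in the
> pragmatic setup above could look like this ("website" stands for the real
> host name):
> 
> # accept the site root (the seed) and everything under /A/
> +^https?://website/$
> +^https?://website/A/
> # reject everything else
> -.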
> 
> 
> Best,
> Sebastian
> 
> On 10/19/2017 01:51 PM, Semyon Semyonov wrote:
>> Dear all,
>>
>> I want to adapt Nutch to crawl a single large text-based website, and
>> therefore to develop plugins and tune settings for the best crawling
>> performance.
>>
>> More precisely, there is a website with 3 categories: A, B, C. The URLs
>> are therefore website/A/itemN, website/B/articleN, website/C/descriptionN.
>> For example, category A contains web-shop-like pages with price, ratings,
>> etc. B has article pages including header, text, author, and so on.
>>
>> 1) How do I write an HTML parser that produces different key-value pairs
>> for different URL patterns (different HTML patterns), e.g. NameOfItem and
>> Price for website/A/ children, Header and Text for website/B children?
>> Should I implement an HTML parser from scratch, or can I add parsing
>> afterwards? What is the best place to do it, and how should I distinguish
>> between the different URL categories?
>>
>> 2) Assume I have turned off external links and crawl only internally. I
>> would like to crawl each category to a different depth. For example, I
>> want to crawl 50,000 pages in category A, 10,000 in B, and only 100 in C.
>> What is the best way to do this?
>> There is a URL filter plugin, but I don't know how to use it based on the
>> URL pattern or the parent URL's metadata.
>>
>> Thank you.
>> Semyon.
>>
>