Posted to user@nutch.apache.org by Semyon Semyonov <se...@mail.com> on 2018/03/12 10:47:25 UTC

UrlRegexFilter is getting destroyed for unrealistically long links

Dear all,

There is an issue with UrlRegexFilter and parsing. On average, parsing takes about 1 millisecond, but sometimes websites have crazy links that destroy the parsing (it takes 3+ hours and destroys the next steps of the crawl).
For example, below you can see a shortened logged version of a URL with an encoded image; the real length of the link is 532572 characters.

Any idea what should I do with such behavior? Should I modify the plugin to reject links with length > MAX, or use more complex logic/check extra configuration?
2018-03-10 23:39:52,082 INFO [main] org.apache.nutch.parse.ParseOutputFormat: ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and normalization 
2018-03-10 23:39:52,178 INFO [main] org.apache.nutch.urlfilter.api.RegexURLFilterBase: ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter for url :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANSUhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNudbnu50253lju... [532572 characters]
2018-03-11 03:56:26,118 INFO [main] org.apache.nutch.parse.ParseOutputFormat: ParseOutputFormat.Write.filterNormalize 4.4. After filteing and normalization 

Semyon.

Re: UrlRegexFilter is getting destroyed for unrealistically long links

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Yossi,

it's used in FetcherThread and ParseOutputFormat:
  git grep -F db.max.outlinks.per.page
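That grep, per Markus' results quoted further down the thread, turns up:

  scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205: maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
  scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118: int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);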

However, it does not limit the length of a single outlink in characters
but the number of outlinks followed (added to CrawlDb).

There was NUTCH-1106 to add a property to limit the outlink length.

Sebastian

On 03/12/2018 12:56 PM, Yossi Tamari wrote:
> Nutch.default contains a property db.max.outlinks.per.page, which I think is supposed to prevent these cases. However, I just searched the code and couldn't find where it is used. Bug? 
> 
>> -----Original Message-----
>> From: Semyon Semyonov <se...@mail.com>
>> Sent: 12 March 2018 12:47
>> To: usernutch.apache.org <us...@nutch.apache.org>
>> Subject: UrlRegexFilter is getting destroyed for unrealistically long links
>>
>> Dear all,
>>
>> There is an issue with UrlRegexFilter and parsing. On average, parsing takes
>> about 1 millisecond, but sometimes websites have crazy links that destroy the
>> parsing (takes 3+ hours and destroys the next steps of the crawl).
>> For example, below you can see a shortened logged version of a URL with an
>> encoded image; the real length of the link is 532572 characters.
>>
>> Any idea what should I do with such behavior?  Should I modify the plugin to
>> reject links with length > MAX or use more complex logic/check extra
>> configuration?
>> 2018-03-10 23:39:52,082 INFO [main]
>> org.apache.nutch.parse.ParseOutputFormat:
>> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and normalization
>> 2018-03-10 23:39:52,178 INFO [main]
>> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter for url
>> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS
>> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
>> Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
>> X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
>> efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
>> dbnu50253lju... [532572 characters]
>> 2018-03-11 03:56:26,118 INFO [main]
>> org.apache.nutch.parse.ParseOutputFormat:
>> ParseOutputFormat.Write.filterNormalize 4.4. After filteing and normalization
>>
>> Semyon.
> 


Re: UrlRegexFilter is getting destroyed for unrealistically long links

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Yossi,

ok, I see, you need administrator privileges to reopen old issues.
Done: reopened NUTCH-1106.

Opened a new issue NUTCH-2530 instead of reopening NUTCH-2220
to avoid accidentally modifying the release notes, e.g.
   https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12340218
when updating the affects/fix versions of resolved issues.

Thanks,
Sebastian


On 03/12/2018 04:50 PM, Yossi Tamari wrote:
> I think the first one should also be handled by reopening NUTCH-2220, which specifically mentions renaming db.max.anchor.length. The problem is that it seems like I am not able to reopen a closed/resolved issue. Sorry...
> 
>> -----Original Message-----
>> From: Sebastian Nagel <wa...@googlemail.com>
>> Sent: 12 March 2018 17:39
>> To: user@nutch.apache.org
>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
>>
>>>> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length
>>> OK, agreed, but it should also be moved to the LinkDB section in nutch-
>> default.xml.
>>
>> Yes, of course, plus make the description more explicit.
>> Could you open a Jira issue for this?
>>
>>> It should apply to outlinks received from the parser, not to injected URLs, for
>> example.
>>
>> Maybe it's ok not to apply it to seed URLs, but what about URLs from sitemaps
>> and possibly redirects?
>> But agreed, you could always add a rule to regex-urlfilter.txt if required. But
>> it should be made clear that only outlinks are checked for length.
>> Could you reopen NUTCH-1106 to address this?
>>
>>
>> Thanks!
>>
>>
>> On 03/12/2018 03:27 PM, Yossi Tamari wrote:
>>>> Which property, db.max.outlinks.per.page or db.max.anchor.length?
>>> db.max.anchor.length, I already said that when I wrote
>> "db.max.outlinks.per.page" it was a copy/paste error.
>>>
>>>> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length
>>> OK, agreed, but it should also be moved to the LinkDB section in nutch-
>> default.xml.
>>>
>>>> Regarding a property to limit the URL length as discussed in NUTCH-1106:
>>>> - it should be applied before URL normalizers
>>> Agreed, but it seems to me the most natural place to add it is where
>> db.max.outlinks.per.page is applied, around line 257 in ParseOutputFormat. It
>> should apply to outlinks received from the parser, not to injected URLs, for
>> example. The only other place I can think of where this may be needed is after
>> redirect.
>>> This is pretty much the same as what Semyon suggests, whether we push it
>> down into the filterNormalize method or do it before calling it.
>>>
>>> 	Yossi.
>>>
>>>> -----Original Message-----
>>>> From: Sebastian Nagel <wa...@googlemail.com>
>>>> Sent: 12 March 2018 15:57
>>>> To: user@nutch.apache.org
>>>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically
>>>> long links
>>>>
>>>> Hi Semyon, Yossi, Markus,
>>>>
>>>>> what db.max.anchor.length was supposed to do
>>>>
>>>> it's applied to "anchor texts" (https://en.wikipedia.org/wiki/Anchor_text)
>>>>   <a href="url">anchor text</a>
>>>> Can we agree to use the term "anchor" in this meaning?
>>>> At least, that's how it is used in the class Outlink and hopefully
>>>> throughout Nutch.
>>>>
>>>>> Personally, I still think the property should be used to limit
>>>>> outlink length in parsing,
>>>>
>>>> Which property, db.max.outlinks.per.page or db.max.anchor.length?
>>>>
>>>> I was about renaming
>>>>   db.max.anchor.length -> linkdb.max.anchor.length
>>>> This property was forgotten when making the naming more consistent in
>>>>   [NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.*
>>>>
>>>> Regarding a property to limit the URL length as discussed in NUTCH-1106:
>>>> - it should be applied before URL normalizers
>>>>   (that would be the main advantage over adding a regex filter rule)
>>>> - but probably for all tools / places where URLs are filtered
>>>>   (ugly because there are many of them)
>>>> - one option would be to rethink the pipeline of URL normalizers and filters
>>>>   as Julien did it for Storm-crawler [1].
>>>> - a pragmatic solution to keep the code changes limited:
>>>>   do the length check twice at the beginning of
>>>>    URLNormalizers.normalize(...)
>>>>   and
>>>>    URLFilters.filter(...)
>>>>   (it's not guaranteed that normalizers are always called)
>>>> - the minimal solution: add a default rule to regex-urlfilter.txt.template
>>>>   to limit the length to 512 (or 1024/2048) characters
>>>>
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>> [1]
>>>> https://github.com/DigitalPebble/storm-
>>>> crawler/blob/master/archetype/src/main/resources/archetype-
>>>> resources/src/main/resources/urlfilters.json
>>>>
>>>>
>>>>
>>>> On 03/12/2018 02:02 PM, Yossi Tamari wrote:
>>>>> The other properties in this section actually affect parsing (e.g.
>>>> db.max.outlinks.per.page). I was under the impression that this is
>>>> what db.max.anchor.length was supposed to do, and actually increased its
>> value.
>>>> Turns out this is one of the many things in Nutch that are not
>>>> intuitive (or in this case, does nothing at all).
>>>>> One of the reasons I thought so is that very long links can be used
>>>>> as an attack
>>>> on crawlers.
>>>>> Personally, I still think the property should be used to limit
>>>>> outlink length in
>>>> parsing, but if that is not what it's supposed to do, I guess it
>>>> needs to be renamed (to match the code), moved to a different section
>>>> of the properties file, and perhaps better documented. In that case, you'll
>> need to use Markus'
>>>> solution, and basically everybody should use Markus' first rule...
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Semyon Semyonov <se...@mail.com>
>>>>>> Sent: 12 March 2018 14:51
>>>>>> To: user@nutch.apache.org
>>>>>> Subject: Re: UrlRegexFilter is getting destroyed for
>>>>>> unrealistically long links
>>>>>>
>>>>>> So, which is the conclusion?
>>>>>>
>>>>>> Should it be solved in regex file or through this property?
>>>>>>
>>>>>> Though, how is the property of crawldb/linkdb supposed to prevent this
>>>>>> problem in Parse?
>>>>>>
>>>>>> Sent: Monday, March 12, 2018 at 1:42 PM
>>>>>> From: "Edward Capriolo" <ed...@gmail.com>
>>>>>> To: "user@nutch.apache.org" <us...@nutch.apache.org>
>>>>>> Subject: Re: UrlRegexFilter is getting destroyed for
>>>>>> unrealistically long links
>>>>>>
>>>>>> Some regular expressions (those with backtracking) can be very
>>>>>> expensive for long strings
>>>>>>
>>>>>> https://regular-expressions.mobi/catastrophic.html?wlr=1
>>>>>>
>>>>>> Maybe that is your issue.
>>>>>>
>>>>>> On Monday, March 12, 2018, Sebastian Nagel
>>>>>> <wa...@googlemail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Good catch. It should be renamed to be consistent with other
>>>>>>> properties, right?
>>>>>>>
>>>>>>> On 03/12/2018 01:10 PM, Yossi Tamari wrote:
>>>>>>>> Perhaps, however it starts with db, not linkdb (like the other
>>>>>>>> linkdb
>>>>>>> properties), it is in the CrawlDB part of nutch-default.xml, and
>>>>>>> LinkDB code uses the property name linkdb.max.anchor.length.
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Markus Jelsma <ma...@openindex.io>
>>>>>>>>> Sent: 12 March 2018 14:05
>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>>>> unrealistically
>>>>>>> long links
>>>>>>>>>
>>>>>>>>> That is for the LinkDB.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----Original message-----
>>>>>>>>>> From:Yossi Tamari <yo...@pipl.com>
>>>>>>>>>> Sent: Monday 12th March 2018 13:02
>>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>>>>> unrealistically long links
>>>>>>>>>>
>>>>>>>>>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy
>>>>>>>>>> paste
>>>>>>>>> error...
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Markus Jelsma <ma...@openindex.io>
>>>>>>>>>>> Sent: 12 March 2018 14:01
>>>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>>>>>> unrealistically long links
>>>>>>>>>>>
>>>>>>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205: maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
>>>>>>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118: int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -----Original message-----
>>>>>>>>>>>> From:Yossi Tamari <yo...@pipl.com>
>>>>>>>>>>>> Sent: Monday 12th March 2018 12:56
>>>>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>>>>>>> unrealistically long links
>>>>>>>>>>>>
>>>>>>>>>>>> Nutch.default contains a property db.max.outlinks.per.page,
>>>>>>>>>>>> which I think is
>>>>>>>>>>> supposed to prevent these cases. However, I just searched the
>>>>>>>>>>> code and couldn't find where it is used. Bug?
>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Semyon Semyonov <se...@mail.com>
>>>>>>>>>>>>> Sent: 12 March 2018 12:47
>>>>>>>>>>>>> To: usernutch.apache.org <us...@nutch.apache.org>
>>>>>>>>>>>>> Subject: UrlRegexFilter is getting destroyed for
>>>>>>>>>>>>> unrealistically long links
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dear all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> There is an issue with UrlRegexFilter and parsing. On
>>>>>>>>>>>>> average, parsing takes about 1 millisecond, but sometimes
>>>>>>>>>>>>> websites have crazy links that destroy the
>>>>>>>>>>>>> parsing (takes 3+ hours and destroys the next
>>>>>>>>>>>>> steps of the crawl).
>>>>>>>>>>>>> For example, below you can see a shortened logged version of
>>>>>>>>>>>>> a URL with an encoded image; the real length of the link is
>>>>>>>>>>>>> 532572 characters.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any idea what should I do with such behavior? Should I
>>>>>>>>>>>>> modify the plugin to reject links with length > MAX or use
>>>>>>>>>>>>> more complex logic/check extra configuration?
>>>>>>>>>>>>> 2018-03-10 23:39:52,082 INFO [main]
>>>>>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
>>>>>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing
>>>>>>>>>>>>> and normalization
>>>>>>>>>>>>> 2018-03-10 23:39:52,178 INFO [main]
>>>>>>>>>>>>> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
>>>>>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
>>>>>>>>>>>>> filter for url
>>>>>>>>>>>>> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANSUhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNudbnu50253lju... [532572 characters]
>>>>>>>>>>>>> 2018-03-11 03:56:26,118 INFO [main]
>>>>>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
>>>>>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.4. After filteing
>>>>>>>>>>>>> and normalization
>>>>>>>>>>>>>
>>>>>>>>>>>>> Semyon.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sorry this was sent from mobile. Will do less grammar and spell
>>>>>> check than usual.
>>>>>
>>>
>>>
> 
> 


RE: UrlRegexFilter is getting destroyed for unrealistically long links

Posted by Yossi Tamari <yo...@pipl.com>.
I think the first one should also be handled by reopening NUTCH-2220, which specifically mentions renaming db.max.anchor.length. The problem is that it seems like I am not able to reopen a closed/resolved issue. Sorry...

> -----Original Message-----
> From: Sebastian Nagel <wa...@googlemail.com>
> Sent: 12 March 2018 17:39
> To: user@nutch.apache.org
> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
> 
> >> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length
> > OK, agreed, but it should also be moved to the LinkDB section in nutch-
> default.xml.
> 
> Yes, of course, plus make the description more explicit.
> Could you open a Jira issue for this?
> 
> > It should apply to outlinks received from the parser, not to injected URLs, for
> example.
> 
> Maybe it's ok not to apply it to seed URLs, but what about URLs from sitemaps
> and possibly redirects?
> But agreed, you could always add a rule to regex-urlfilter.txt if required. But
> it should be made clear that only outlinks are checked for length.
> Could you reopen NUTCH-1106 to address this?
> 
> 
> Thanks!
> 
> 
> On 03/12/2018 03:27 PM, Yossi Tamari wrote:
> >> Which property, db.max.outlinks.per.page or db.max.anchor.length?
> > db.max.anchor.length, I already said that when I wrote
> "db.max.outlinks.per.page" it was a copy/paste error.
> >
> >> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length
> > OK, agreed, but it should also be moved to the LinkDB section in nutch-
> default.xml.
> >
> >> Regarding a property to limit the URL length as discussed in NUTCH-1106:
> >> - it should be applied before URL normalizers
> > Agreed, but it seems to me the most natural place to add it is where
> db.max.outlinks.per.page is applied, around line 257 in ParseOutputFormat. It
> should apply to outlinks received from the parser, not to injected URLs, for
> example. The only other place I can think of where this may be needed is after
> redirect.
> > This is pretty much the same as what Semyon suggests, whether we push it
> down into the filterNormalize method or do it before calling it.
> >
> > 	Yossi.
> >
> >> -----Original Message-----
> >> From: Sebastian Nagel <wa...@googlemail.com>
> >> Sent: 12 March 2018 15:57
> >> To: user@nutch.apache.org
> >> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically
> >> long links
> >>
> >> Hi Semyon, Yossi, Markus,
> >>
> >>> what db.max.anchor.length was supposed to do
> >>
> >> it's applied to "anchor texts" (https://en.wikipedia.org/wiki/Anchor_text)
> >>   <a href="url">anchor text</a>
> >> Can we agree to use the term "anchor" in this meaning?
> >> At least, that's how it is used in the class Outlink and hopefully
> >> throughout Nutch.
> >>
> >>> Personally, I still think the property should be used to limit
> >>> outlink length in parsing,
> >>
> >> Which property, db.max.outlinks.per.page or db.max.anchor.length?
> >>
> >> I was about renaming
> >>   db.max.anchor.length -> linkdb.max.anchor.length
> >> This property was forgotten when making the naming more consistent in
> >>   [NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.*
> >>
> >> Regarding a property to limit the URL length as discussed in NUTCH-1106:
> >> - it should be applied before URL normalizers
> >>   (that would be the main advantage over adding a regex filter rule)
> >> - but probably for all tools / places where URLs are filtered
> >>   (ugly because there are many of them)
> >> - one option would be to rethink the pipeline of URL normalizers and filters
> >>   as Julien did it for Storm-crawler [1].
> >> - a pragmatic solution to keep the code changes limited:
> >>   do the length check twice at the beginning of
> >>    URLNormalizers.normalize(...)
> >>   and
> >>    URLFilters.filter(...)
> >>   (it's not guaranteed that normalizers are always called)
> >> - the minimal solution: add a default rule to regex-urlfilter.txt.template
> >>   to limit the length to 512 (or 1024/2048) characters
> >>
> >>
> >> Best,
> >> Sebastian
> >>
> >> [1]
> >> https://github.com/DigitalPebble/storm-
> >> crawler/blob/master/archetype/src/main/resources/archetype-
> >> resources/src/main/resources/urlfilters.json
> >>
> >>
> >>
> >> On 03/12/2018 02:02 PM, Yossi Tamari wrote:
> >>> The other properties in this section actually affect parsing (e.g.
> >> db.max.outlinks.per.page). I was under the impression that this is
> >> what db.max.anchor.length was supposed to do, and actually increased its
> value.
> >> Turns out this is one of the many things in Nutch that are not
> >> intuitive (or in this case, does nothing at all).
> >>> One of the reasons I thought so is that very long links can be used
> >>> as an attack
> >> on crawlers.
> >>> Personally, I still think the property should be used to limit
> >>> outlink length in
> >> parsing, but if that is not what it's supposed to do, I guess it
> >> needs to be renamed (to match the code), moved to a different section
> >> of the properties file, and perhaps better documented. In that case, you'll
> need to use Markus'
> >> solution, and basically everybody should use Markus' first rule...
> >>>
> >>>> -----Original Message-----
> >>>> From: Semyon Semyonov <se...@mail.com>
> >>>> Sent: 12 March 2018 14:51
> >>>> To: user@nutch.apache.org
> >>>> Subject: Re: UrlRegexFilter is getting destroyed for
> >>>> unrealistically long links
> >>>>
> >>>> So, which is the conclusion?
> >>>>
> >>>> Should it be solved in regex file or through this property?
> >>>>
> >>>> Though, how is the property of crawldb/linkdb supposed to prevent this
> >>>> problem in Parse?
> >>>>
> >>>> Sent: Monday, March 12, 2018 at 1:42 PM
> >>>> From: "Edward Capriolo" <ed...@gmail.com>
> >>>> To: "user@nutch.apache.org" <us...@nutch.apache.org>
> >>>> Subject: Re: UrlRegexFilter is getting destroyed for
> >>>> unrealistically long links
> >>>>
> >>>> Some regular expressions (those with backtracking) can be very
> >>>> expensive for long strings
> >>>>
> >>>> https://regular-expressions.mobi/catastrophic.html?wlr=1
> >>>>
> >>>> Maybe that is your issue.
> >>>>
> >>>> On Monday, March 12, 2018, Sebastian Nagel
> >>>> <wa...@googlemail.com>
> >>>> wrote:
> >>>>
> >>>>> Good catch. It should be renamed to be consistent with other
> >>>>> properties, right?
> >>>>>
> >>>>> On 03/12/2018 01:10 PM, Yossi Tamari wrote:
> >>>>>> Perhaps, however it starts with db, not linkdb (like the other
> >>>>>> linkdb
> >>>>> properties), it is in the CrawlDB part of nutch-default.xml, and
> >>>>> LinkDB code uses the property name linkdb.max.anchor.length.
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Markus Jelsma <ma...@openindex.io>
> >>>>>>> Sent: 12 March 2018 14:05
> >>>>>>> To: user@nutch.apache.org
> >>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
> >>>>>>> unrealistically
> >>>>> long links
> >>>>>>>
> >>>>>>> That is for the LinkDB.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> -----Original message-----
> >>>>>>>> From:Yossi Tamari <yo...@pipl.com>
> >>>>>>>> Sent: Monday 12th March 2018 13:02
> >>>>>>>> To: user@nutch.apache.org
> >>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
> >>>>>>>> unrealistically long links
> >>>>>>>>
> >>>>>>>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy
> >>>>>>>> paste
> >>>>>>> error...
> >>>>>>>>
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Markus Jelsma <ma...@openindex.io>
> >>>>>>>>> Sent: 12 March 2018 14:01
> >>>>>>>>> To: user@nutch.apache.org
> >>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
> >>>>>>>>> unrealistically long links
> >>>>>>>>>
> >>>>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205: maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
> >>>>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118: int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> -----Original message-----
> >>>>>>>>>> From:Yossi Tamari <yo...@pipl.com>
> >>>>>>>>>> Sent: Monday 12th March 2018 12:56
> >>>>>>>>>> To: user@nutch.apache.org
> >>>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
> >>>>>>>>>> unrealistically long links
> >>>>>>>>>>
> >>>>>>>>>> Nutch.default contains a property db.max.outlinks.per.page,
> >>>>>>>>>> which I think is
> >>>>>>>>> supposed to prevent these cases. However, I just searched the
> >>>>>>>>> code and couldn't find where it is used. Bug?
> >>>>>>>>>>
> >>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>> From: Semyon Semyonov <se...@mail.com>
> >>>>>>>>>>> Sent: 12 March 2018 12:47
> >>>>>>>>>>> To: usernutch.apache.org <us...@nutch.apache.org>
> >>>>>>>>>>> Subject: UrlRegexFilter is getting destroyed for
> >>>>>>>>>>> unrealistically long links
> >>>>>>>>>>>
> >>>>>>>>>>> Dear all,
> >>>>>>>>>>>
> >>>>>>>>>>> There is an issue with UrlRegexFilter and parsing. On
> >>>>>>>>>>> average, parsing takes about 1 millisecond, but sometimes
> >>>>>>>>>>> websites have crazy links that destroy the
> >>>>>>>>>>> parsing (takes 3+ hours and destroys the next
> >>>>>>>>>>> steps of the crawl).
> >>>>>>>>>>> For example, below you can see a shortened logged version of
> >>>>>>>>>>> a URL with an encoded image; the real length of the link is
> >>>>>>>>>>> 532572 characters.
> >>>>>>>>>>>
> >>>>>>>>>>> Any idea what should I do with such behavior? Should I
> >>>>>>>>>>> modify the plugin to reject links with length > MAX or use
> >>>>>>>>>>> more complex logic/check extra configuration?
> >>>>>>>>>>> 2018-03-10 23:39:52,082 INFO [main]
> >>>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
> >>>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing
> >>>>>>>>>>> and normalization
> >>>>>>>>>>> 2018-03-10 23:39:52,178 INFO [main]
> >>>>>>>>>>> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> >>>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
> >>>>>>>>>>> filter for url
> >>>>>>>>>>> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANSUhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNudbnu50253lju... [532572 characters]
> >>>>>>>>>>> 2018-03-11 03:56:26,118 INFO [main]
> >>>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
> >>>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.4. After filteing
> >>>>>>>>>>> and normalization
> >>>>>>>>>>>
> >>>>>>>>>>> Semyon.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>> --
> >>>> Sorry this was sent from mobile. Will do less grammar and spell
> >>>> check than usual.
> >>>
> >
> >



Re: UrlRegexFilter is getting destroyed for unrealistically long links

Posted by Sebastian Nagel <wa...@googlemail.com>.
>> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length
> OK, agreed, but it should also be moved to the LinkDB section in nutch-default.xml.

Yes, of course, plus make the description more explicit.
Could you open a Jira issue for this?

> It should apply to outlinks received from the parser, not to injected URLs, for example.

Maybe it's ok not to apply it to seed URLs, but what about URLs from sitemaps and possibly redirects?
But agreed, you could always add a rule to regex-urlfilter.txt if required. But it should be
made clear that only outlinks are checked for length.
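For illustration, such a rule could look like this (a sketch assuming the usual
regex-urlfilter.txt conventions: one Java regex per line, a leading "-" rejects
matching URLs, and the first matching rule wins, so it belongs near the top of
the file):

  # reject any URL longer than 2048 characters (threshold is illustrative)
  -^.{2049,}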
Could you reopen NUTCH-1106 to address this?


Thanks!


On 03/12/2018 03:27 PM, Yossi Tamari wrote:
>> Which property, db.max.outlinks.per.page or db.max.anchor.length?
> db.max.anchor.length, I already said that when I wrote "db.max.outlinks.per.page" it was a copy/paste error.
> 
>> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length
> OK, agreed, but it should also be moved to the LinkDB section in nutch-default.xml.
> 
>> Regarding a property to limit the URL length as discussed in NUTCH-1106:
>> - it should be applied before URL normalizers
> Agreed, but it seems to me the most natural place to add it is where db.max.outlinks.per.page is applied, around line 257 in ParseOutputFormat. It should apply to outlinks received from the parser, not to injected URLs, for example. The only other place I can think of where this may be needed is after redirect.
> This is pretty much the same as what Semyon suggests, whether we push it down into the filterNormalize method or do it before calling it.
> 
> 	Yossi.
> 
>> -----Original Message-----
>> From: Sebastian Nagel <wa...@googlemail.com>
>> Sent: 12 March 2018 15:57
>> To: user@nutch.apache.org
>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
>>
>> Hi Semyon, Yossi, Markus,
>>
>>> what db.max.anchor.length was supposed to do
>>
>> it's applied to "anchor texts" (https://en.wikipedia.org/wiki/Anchor_text)
>>   <a href="url">anchor text</a>
>> Can we agree to use the term "anchor" in this meaning?
>> At least, that's how it is used in the class Outlink and hopefully throughout
>> Nutch.
>>
>>> Personally, I still think the property should be used to limit outlink
>>> length in parsing,
>>
>> Which property, db.max.outlinks.per.page or db.max.anchor.length?
>>
>> I was about renaming
>>   db.max.anchor.length -> linkdb.max.anchor.length
>> This property was forgotten when making the naming more consistent in
>>   [NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.*
>>
>> Regarding a property to limit the URL length as discussed in NUTCH-1106:
>> - it should be applied before URL normalizers
>>   (that would be the main advantage over adding a regex filter rule)
>> - but probably for all tools / places where URLs are filtered
>>   (ugly because there are many of them)
>> - one option would be to rethink the pipeline of URL normalizers and filters
>>   as Julien did it for Storm-crawler [1].
>> - a pragmatic solution to keep the code changes limited:
>>   do the length check twice at the beginning of
>>    URLNormalizers.normalize(...)
>>   and
>>    URLFilters.filter(...)
>>   (it's not guaranteed that normalizers are always called)
>> - the minimal solution: add a default rule to regex-urlfilter.txt.template
>>   to limit the length to 512 (or 1024/2048) characters
>>
>>
>> Best,
>> Sebastian
>>
>> [1]
>> https://github.com/DigitalPebble/storm-
>> crawler/blob/master/archetype/src/main/resources/archetype-
>> resources/src/main/resources/urlfilters.json
>>
>>
>>
>> On 03/12/2018 02:02 PM, Yossi Tamari wrote:
>>> The other properties in this section actually affect parsing (e.g.
>> db.max.outlinks.per.page). I was under the impression that this is what
>> db.max.anchor.length was supposed to do, and actually increased its value.
>> Turns out this is one of the many things in Nutch that are not intuitive (or in this
>> case, does nothing at all).
>>> One of the reasons I thought so is that very long links can be used as an attack
>> on crawlers.
>>> Personally, I still think the property should be used to limit outlink length in
>> parsing, but if that is not what it's supposed to do, I guess it needs to be
>> renamed (to match the code), moved to a different section of the properties
>> file, and perhaps better documented. In that case, you'll need to use Markus'
>> solution, and basically everybody should use Markus' first rule...
>>>
>>>> -----Original Message-----
>>>> From: Semyon Semyonov <se...@mail.com>
>>>> Sent: 12 March 2018 14:51
>>>> To: user@nutch.apache.org
>>>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically
>>>> long links
>>>>
>>>> So, which is the conclusion?
>>>>
>>>> Should it be solved in regex file or through this property?
>>>>
>>>> Though, how is the property of crawldb/linkdb supposed to prevent this
>>>> problem in Parse?
>>>>
>>>> Sent: Monday, March 12, 2018 at 1:42 PM
>>>> From: "Edward Capriolo" <ed...@gmail.com>
>>>> To: "user@nutch.apache.org" <us...@nutch.apache.org>
>>>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically
>>>> long links
>>>>
>>>> Some regular expressions (those with backtracking) can be very
>>>> expensive for long strings
>>>>
>>>> https://regular-expressions.mobi/catastrophic.html?wlr=1
>>>>
>>>> Maybe that is your issue.
>>>>
>>>> On Monday, March 12, 2018, Sebastian Nagel
>>>> <wa...@googlemail.com>
>>>> wrote:
>>>>
>>>>> Good catch. It should be renamed to be consistent with other
>>>>> properties, right?
>>>>>
>>>>> On 03/12/2018 01:10 PM, Yossi Tamari wrote:
>>>>>> Perhaps, however it starts with db, not linkdb (like the other
>>>>>> linkdb
>>>>> properties), it is in the CrawlDB part of nutch-default.xml, and
>>>>> LinkDB code uses the property name linkdb.max.anchor.length.
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Markus Jelsma <ma...@openindex.io>
>>>>>>> Sent: 12 March 2018 14:05
>>>>>>> To: user@nutch.apache.org
>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>> unrealistically
>>>>> long links
>>>>>>>
>>>>>>> That is for the LinkDB.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original message-----
>>>>>>>> From:Yossi Tamari <yo...@pipl.com>
>>>>>>>> Sent: Monday 12th March 2018 13:02
>>>>>>>> To: user@nutch.apache.org
>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>>> unrealistically long links
>>>>>>>>
>>>>>>>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy
>>>>>>>> paste
>>>>>>> error...
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Markus Jelsma <ma...@openindex.io>
>>>>>>>>> Sent: 12 March 2018 14:01
>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>>>> unrealistically long links
>>>>>>>>>
>>>>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205: maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
>>>>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118: int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----Original message-----
>>>>>>>>>> From:Yossi Tamari <yo...@pipl.com>
>>>>>>>>>> Sent: Monday 12th March 2018 12:56
>>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>>>>> unrealistically long links
>>>>>>>>>>
>>>>>>>>>> Nutch.default contains a property db.max.outlinks.per.page,
>>>>>>>>>> which I think is
>>>>>>>>> supposed to prevent these cases. However, I just searched the
>>>>>>>>> code and couldn't find where it is used. Bug?
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Semyon Semyonov <se...@mail.com>
>>>>>>>>>>> Sent: 12 March 2018 12:47
>>>>>>>>>>> To: usernutch.apache.org <us...@nutch.apache.org>
>>>>>>>>>>> Subject: UrlRegexFilter is getting destroyed for
>>>>>>>>>>> unrealistically long links
>>>>>>>>>>>
>>>>>>>>>>> Dear all,
>>>>>>>>>>>
>>>>>>>>>>> There is an issue with UrlRegexFilter and parsing. On average,
>>>>>>>>>>> parsing takes about 1 millisecond, but sometimes websites
>>>>>>>>>>> have crazy links that destroy the parsing (takes 3+ hours
>>>>>>>>>>> and destroys the next
>>>>>>>>>>> steps of the crawl).
>>>>>>>>>>> For example, below you can see a shortened logged version of a URL
>>>>>>>>>>> with an encoded image; the real length of the link is 532572
>>>>>>>>>>> characters.
>>>>>>>>>>>
>>>>>>>>>>> Any idea what should I do with such behavior? Should I modify
>>>>>>>>>>> the plugin to reject links with length > MAX or use more
>>>>>>>>>>> complex logic/check extra configuration?
>>>>>>>>>>> 2018-03-10 23:39:52,082 INFO [main]
>>>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
>>>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing
>>>>>>>>>>> and normalization
>>>>>>>>>>> 2018-03-10 23:39:52,178 INFO [main]
>>>>>>>>>>> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
>>>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
>>>>>>>>>>> filter for url
>>>>>>>>>>> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANSUhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNudbnu50253lju... [532572 characters]
>>>>>>>>>>> 2018-03-11 03:56:26,118 INFO [main]
>>>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
>>>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.4. After filteing
>>>>>>>>>>> and normalization
>>>>>>>>>>>
>>>>>>>>>>> Semyon.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Sorry this was sent from mobile. Will do less grammar and spell check
>>>> than usual.
>>>
> 
> 


RE: UrlRegexFilter is getting destroyed for unrealistically long links

Posted by Yossi Tamari <yo...@pipl.com>.
> Which property, db.max.outlinks.per.page or db.max.anchor.length?
db.max.anchor.length, I already said that when I wrote "db.max.outlinks.per.page" it was a copy/paste error.

> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length
OK, agreed, but it should also be moved to the LinkDB section in nutch-default.xml.

> Regarding a property to limit the URL length as discussed in NUTCH-1106:
> - it should be applied before URL normalizers
Agreed, but it seems to me the most natural place to add it is where db.max.outlinks.per.page is applied, around line 257 in ParseOutputFormat. It should apply to outlinks received from the parser, not to injected URLs, for example. The only other place I can think of where this may be needed is after redirect.
This is pretty much the same as what Semyon suggests, whether we push it down into the filterNormalize method or do it before calling it.
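For illustration, a minimal sketch of such a check in the outlink loop (the
names here are illustrative, not the actual ParseOutputFormat code):

    import java.util.ArrayList;
    import java.util.List;

    public class OutlinkLengthCheck {
      /** Drops over-long outlinks before normalizers/filters see them,
       *  alongside the existing db.max.outlinks.per.page cap. */
      public static List<String> keepOutlinks(List<String> toUrls,
          int maxUrlLength, int maxOutlinksPerPage) {
        List<String> kept = new ArrayList<>();
        for (String toUrl : toUrls) {
          if (toUrl.length() > maxUrlLength) {
            continue; // reject before any regex filter can run on it
          }
          kept.add(toUrl);
          if (kept.size() >= maxOutlinksPerPage) {
            break; // what db.max.outlinks.per.page already enforces
          }
        }
        return kept;
      }
    }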

	Yossi.

> -----Original Message-----
> From: Sebastian Nagel <wa...@googlemail.com>
> Sent: 12 March 2018 15:57
> To: user@nutch.apache.org
> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
> 
> Hi Semyon, Yossi, Markus,
> 
> > what db.max.anchor.length was supposed to do
> 
> it's applied to "anchor texts" (https://en.wikipedia.org/wiki/Anchor_text)
>   <a href="url">anchor text</a>
> Can we agree to use the term "anchor" in this meaning?
> At least, that's how it is used in the class Outlink and hopefully throughout
> Nutch.
> 
> > Personally, I still think the property should be used to limit outlink
> > length in parsing,
> 
> Which property, db.max.outlinks.per.page or db.max.anchor.length?
> 
> I was about renaming
>   db.max.anchor.length -> linkdb.max.anchor.length
> This property was forgotten when making the naming more consistent in
>   [NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.*
> 
> Regarding a property to limit the URL length as discussed in NUTCH-1106:
> - it should be applied before URL normalizers
>   (that would be the main advantage over adding a regex filter rule)
> - but probably for all tools / places where URLs are filtered
>   (ugly because there are many of them)
> - one option would be to rethink the pipeline of URL normalizers and filters
>   as Julien did it for Storm-crawler [1].
> - a pragmatic solution to keep the code changes limited:
>   do the length check twice at the beginning of
>    URLNormalizers.normalize(...)
>   and
>    URLFilters.filter(...)
>   (it's not guaranteed that normalizers are always called)
> - the minimal solution: add a default rule to regex-urlfilter.txt.template
>   to limit the length to 512 (or 1024/2048) characters
> 
> 
> Best,
> Sebastian
> 
> [1]
> https://github.com/DigitalPebble/storm-
> crawler/blob/master/archetype/src/main/resources/archetype-
> resources/src/main/resources/urlfilters.json
> 
> 
> 
> On 03/12/2018 02:02 PM, Yossi Tamari wrote:
> > The other properties in this section actually affect parsing (e.g.
> db.max.outlinks.per.page). I was under the impression that this is what
> db.max.anchor.length was supposed to do, and actually increased its value.
> Turns out this is one of the many things in Nutch that are not intuitive (or in this
> case, does nothing at all).
> > One of the reasons I thought so is that very long links can be used as an attack
> on crawlers.
> > Personally, I still think the property should be used to limit outlink length in
> parsing, but if that is not what it's supposed to do, I guess it needs to be
> renamed (to match the code), moved to a different section of the properties
> file, and perhaps better documented. In that case, you'll need to use Markus'
> solution, and basically everybody should use Markus' first rule...
> >
> >> -----Original Message-----
> >> From: Semyon Semyonov <se...@mail.com>
> >> Sent: 12 March 2018 14:51
> >> To: user@nutch.apache.org
> >> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically
> >> long links
> >>
> >> So, which is the conclusion?
> >>
> >> Should it be solved in regex file or through this property?
> >>
> >> Though, how is the property of crawldb/linkdb supposed to prevent this
> >> problem in Parse?
> >>
> >> Sent: Monday, March 12, 2018 at 1:42 PM
> >> From: "Edward Capriolo" <ed...@gmail.com>
> >> To: "user@nutch.apache.org" <us...@nutch.apache.org>
> >> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically
> >> long links
> >>
> >> Some regular expressions (those with backtracking) can be very
> >> expensive for long strings
> >>
> >> https://regular-expressions.mobi/catastrophic.html?wlr=1
> >>
> >> Maybe that is your issue.
> >>
> >> On Monday, March 12, 2018, Sebastian Nagel
> >> <wa...@googlemail.com>
> >> wrote:
> >>
> >>> Good catch. It should be renamed to be consistent with other
> >>> properties, right?
> >>>
> >>> On 03/12/2018 01:10 PM, Yossi Tamari wrote:
> >>>> Perhaps, however it starts with db, not linkdb (like the other
> >>>> linkdb
> >>> properties), it is in the CrawlDB part of nutch-default.xml, and
> >>> LinkDB code uses the property name linkdb.max.anchor.length.
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Markus Jelsma <ma...@openindex.io>
> >>>>> Sent: 12 March 2018 14:05
> >>>>> To: user@nutch.apache.org
> >>>>> Subject: RE: UrlRegexFilter is getting destroyed for
> >>>>> unrealistically
> >>> long links
> >>>>>
> >>>>> That is for the LinkDB.
> >>>>>
> >>>>>
> >>>>>
> >>>>> -----Original message-----
> >>>>>> From:Yossi Tamari <yo...@pipl.com>
> >>>>>> Sent: Monday 12th March 2018 13:02
> >>>>>> To: user@nutch.apache.org
> >>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
> >>>>>> unrealistically long links
> >>>>>>
> >>>>>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy
> >>>>>> paste
> >>>>> error...
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Markus Jelsma <ma...@openindex.io>
> >>>>>>> Sent: 12 March 2018 14:01
> >>>>>>> To: user@nutch.apache.org
> >>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
> >>>>>>> unrealistically long links
> >>>>>>>
> >>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205: maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
> >>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118: int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> -----Original message-----
> >>>>>>>> From:Yossi Tamari <yo...@pipl.com>
> >>>>>>>> Sent: Monday 12th March 2018 12:56
> >>>>>>>> To: user@nutch.apache.org
> >>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
> >>>>>>>> unrealistically long links
> >>>>>>>>
> >>>>>>>> Nutch.default contains a property db.max.outlinks.per.page,
> >>>>>>>> which I think is
> >>>>>>> supposed to prevent these cases. However, I just searched the
> >>>>>>> code and couldn't find where it is used. Bug?
> >>>>>>>>
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Semyon Semyonov <se...@mail.com>
> >>>>>>>>> Sent: 12 March 2018 12:47
> >>>>>>>>> To: usernutch.apache.org <us...@nutch.apache.org>
> >>>>>>>>> Subject: UrlRegexFilter is getting destroyed for
> >>>>>>>>> unrealistically long links
> >>>>>>>>>
> >>>>>>>>> Dear all,
> >>>>>>>>>
> >>>>>>>>> There is an issue with UrlRegexFilter and parsing. On average,
> >>>>>>>>> parsing takes about 1 millisecond, but sometimes websites
> >>>>>>>>> have crazy links that destroy the parsing (takes 3+ hours
> >>>>>>>>> and destroys the next
> >>>>>>>>> steps of the crawl).
> >>>>>>>>> For example, below you can see a shortened logged version of a URL
> >>>>>>>>> with an encoded image; the real length of the link is 532572
> >>>>>>>>> characters.
> >>>>>>>>>
> >>>>>>>>> Any idea what should I do with such behavior? Should I modify
> >>>>>>>>> the plugin to reject links with length > MAX or use more
> >>>>>>>>> complex logic/check extra configuration?
> >>>>>>>>> 2018-03-10 23:39:52,082 INFO [main]
> >>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
> >>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing
> >>>>>>>>> and normalization
> >>>>>>>>> 2018-03-10 23:39:52,178 INFO [main]
> >>>>>>>>> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> >>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
> >>>>>>>>> filter for url
> >>>>>>>>> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANSUhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNudbnu50253lju... [532572 characters]
> >>>>>>>>> 2018-03-11 03:56:26,118 INFO [main]
> >>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
> >>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.4. After filteing
> >>>>>>>>> and normalization
> >>>>>>>>>
> >>>>>>>>> Semyon.
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>
> >>>
> >>
> >> --
> >> Sorry this was sent from mobile. Will do less grammar and spell check
> >> than usual.
> >



Re: UrlRegexFilter is getting destroyed for unrealistically long links

Posted by Semyon Semyonov <se...@mail.com>.
Hi Sebastian,

I think the simplest (and more solid than the regex modification) way would be a modification of ParseOutputFormat.filterNormalize.

As far as I can see, all the URL modifications/filtering occur there. Therefore, if at the beginning, next to

    if (fromUrl.equals(toUrl)) {
      return null;
    }

we add the condition

    if (fromUrl.length() > MAX || toUrl.length() > MAX) {
      return null;
    }

that should be it.

Do I miss something?

Semyon

Sent: Monday, March 12, 2018 at 2:57 PM
From: "Sebastian Nagel" <wa...@googlemail.com>
To: user@nutch.apache.org
Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
Hi Semyon, Yossi, Markus,

> what db.max.anchor.length was supposed to do

it's applied to "anchor texts" (https://en.wikipedia.org/wiki/Anchor_text)
<a href="url">anchor text</a>
Can we agree to use the term "anchor" in this meaning?
At least, that's how it is used in the class Outlink and hopefully throughout Nutch.

> Personally, I still think the property should be used to limit outlink length in parsing,

Which property, db.max.outlinks.per.page or db.max.anchor.length?

I was about renaming
db.max.anchor.length -> linkdb.max.anchor.length
This property was forgotten when making the naming more consistent in
[NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.*

Regarding a property to limit the URL length as discussed in NUTCH-1106:
- it should be applied before URL normalizers
  (that would be the main advantage over adding a regex filter rule)
- but probably for all tools / places where URLs are filtered
  (ugly because there are many of them)
- one option would be to rethink the pipeline of URL normalizers and filters
  as Julien did it for Storm-crawler [1].
- a pragmatic solution to keep the code changes limited:
  do the length check twice at the beginning of
   URLNormalizers.normalize(...)
  and
   URLFilters.filter(...)
  (it's not guaranteed that normalizers are always called)
- the minimal solution: add a default rule to regex-urlfilter.txt.template
  to limit the length to 512 (or 1024/2048) characters


Best,
Sebastian

[1]
https://github.com/DigitalPebble/storm-crawler/blob/master/archetype/src/main/resources/archetype-resources/src/main/resources/urlfilters.json



On 03/12/2018 02:02 PM, Yossi Tamari wrote:
> The other properties in this section actually affect parsing (e.g. db.max.outlinks.per.page). I was under the impression that this is what db.max.anchor.length was supposed to do, and actually increased its value. Turns out this is one of the many things in Nutch that are not intuitive (or in this case, does nothing at all).
> One of the reasons I thought so is that very long links can be used as an attack on crawlers.
> Personally, I still think the property should be used to limit outlink length in parsing, but if that is not what it's supposed to do, I guess it needs to be renamed (to match the code), moved to a different section of the properties file, and perhaps better documented. In that case, you'll need to use Markus' solution, and basically everybody should use Markus' first rule...
>
>> -----Original Message-----
>> From: Semyon Semyonov <se...@mail.com>
>> Sent: 12 March 2018 14:51
>> To: user@nutch.apache.org
>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
>>
>> So, which is the conclusion?
>>
>> Should it be solved in regex file or through this property?
>>
>> Though, how is the property of crawldb/linkdb supposed to prevent this problem in
>> Parse?
>>
>> Sent: Monday, March 12, 2018 at 1:42 PM
>> From: "Edward Capriolo" <ed...@gmail.com>
>> To: "user@nutch.apache.org" <us...@nutch.apache.org>
>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
>> Some regular expressions (those with backtracking) can be very expensive for
>> long strings
>>
>> https://regular-expressions.mobi/catastrophic.html?wlr=1
>>
>> Maybe that is your issue.
>>
>> On Monday, March 12, 2018, Sebastian Nagel <wa...@googlemail.com>
>> wrote:
>>
>>> Good catch. It should be renamed to be consistent with other
>>> properties, right?
>>>
>>> On 03/12/2018 01:10 PM, Yossi Tamari wrote:
>>>> Perhaps, however it starts with db, not linkdb (like the other
>>>> linkdb
>>> properties), it is in the CrawlDB part of nutch-default.xml, and
>>> LinkDB code uses the property name linkdb.max.anchor.length.
>>>>
>>>>> -----Original Message-----
>>>>> From: Markus Jelsma <ma...@openindex.io>
>>>>> Sent: 12 March 2018 14:05
>>>>> To: user@nutch.apache.org
>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>> unrealistically
>>> long links
>>>>>
>>>>> That is for the LinkDB.
>>>>>
>>>>>
>>>>>
>>>>> -----Original message-----
>>>>>> From:Yossi Tamari <yo...@pipl.com>
>>>>>> Sent: Monday 12th March 2018 13:02
>>>>>> To: user@nutch.apache.org
>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>> unrealistically long links
>>>>>>
>>>>>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy
>>>>>> paste
>>>>> error...
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Markus Jelsma <ma...@openindex.io>
>>>>>>> Sent: 12 March 2018 14:01
>>>>>>> To: user@nutch.apache.org
>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>> unrealistically long links
>>>>>>>
>>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205: maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
>>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118: int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original message-----
>>>>>>>> From:Yossi Tamari <yo...@pipl.com>
>>>>>>>> Sent: Monday 12th March 2018 12:56
>>>>>>>> To: user@nutch.apache.org
>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>>> unrealistically long links
>>>>>>>>
>>>>>>>> Nutch.default contains a property db.max.outlinks.per.page,
>>>>>>>> which I think is
>>>>>>> supposed to prevent these cases. However, I just searched the
>>>>>>> code and couldn't find where it is used. Bug?
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Semyon Semyonov <se...@mail.com>
>>>>>>>>> Sent: 12 March 2018 12:47
>>>>>>>>> To: usernutch.apache.org <us...@nutch.apache.org>
>>>>>>>>> Subject: UrlRegexFilter is getting destroyed for
>>>>>>>>> unrealistically long links
>>>>>>>>>
>>>>>>>>> Dear all,
>>>>>>>>>
>>>>>>>>> There is an issue with UrlRegexFilter and parsing. On average,
>>>>>>>>> parsing takes about 1 millisecond, but sometimes websites
>>>>>>>>> have crazy links that destroy the parsing (takes 3+ hours
>>>>>>>>> and destroys the next
>>>>>>>>> steps of the crawl).
>>>>>>>>> For example, below you can see a shortened logged version of a URL
>>>>>>>>> with an encoded image; the real length of the link is 532572
>>>>>>>>> characters.
>>>>>>>>>
>>>>>>>>> Any idea what should I do with such behavior? Should I modify
>>>>>>>>> the plugin to reject links with length > MAX or use more complex
>>>>>>>>> logic/check extra configuration?
>>>>>>>>> 2018-03-10 23:39:52,082 INFO [main]
>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing
>>>>>>>>> and normalization
>>>>>>>>> 2018-03-10 23:39:52,178 INFO [main]
>>>>>>>>> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
>>>>>>>>> filter for url
>>>>>>>>> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANSUhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNudbnu50253lju... [532572 characters]
>>>>>>>>> 2018-03-11 03:56:26,118 INFO [main]
>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.4. After filteing and
>>>>>>>>> normalization
>>>>>>>>>
>>>>>>>>> Semyon.
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>>
>>
>> --
>> Sorry this was sent from mobile. Will do less grammar and spell check than
>> usual.
>
 

Re: UrlRegexFilter is getting destroyed for unrealistically long links

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Semyon, Yossi, Markus,

> what db.max.anchor.length was supposed to do

it's applied to "anchor texts" (https://en.wikipedia.org/wiki/Anchor_text)
  <a href="url">anchor text</a>
Can we agree to use the term "anchor" in this meaning?
At least, that's how it is used in the class Outlink and hopefully throughout Nutch.

> Personally, I still think the property should be used to limit outlink length in parsing,

Which property, db.max.outlinks.per.page or db.max.anchor.length?

I was about to rename
  db.max.anchor.length -> linkdb.max.anchor.length
This property was forgotten when making the naming more consistent in
  [NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.*

Regarding a property to limit the URL length as discussed in NUTCH-1106:
- it should be applied before URL normalizers
  (that would be the main advantage over adding a regex filter rule)
- but probably for all tools / places where URLs are filtered
  (ugly because there are many of them)
- one option would be to rethink the pipeline of URL normalizers and filters
  as Julien did it for Storm-crawler [1].
- a pragmatic solution to keep the code changes limited:
  do the length check twice at the beginning of
   URLNormalizers.normalize(...)
  and
   URLFilters.filter(...)
  (it's not guaranteed that normalizers are always called)
- the minimal solution: add a default rule to regex-urlfilter.txt.template
  to limit the length to 512 (or 1024/2048) characters (see the sketches below)
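
For illustration, the minimal solution could be a single reject rule placed
before the final accept rule (a sketch only, untested; urlfilter-regex
rejects a URL as soon as a '-' pattern matches):

  # regex-urlfilter.txt: skip over-long URLs (here: more than 2048 characters)
  -^.{2049,}

and the pragmatic solution could be a guard like the following at the top of
URLFilters.filter(...), assuming the Configuration is at hand as conf (the
property name url.max.length is invented here, it does not exist in Nutch):

  int maxLength = conf.getInt("url.max.length", 2048);
  if (urlString == null || urlString.length() > maxLength) {
    return null;  // drop the URL as if a filter had rejected it
  }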


Best,
Sebastian

[1]
https://github.com/DigitalPebble/storm-crawler/blob/master/archetype/src/main/resources/archetype-resources/src/main/resources/urlfilters.json



On 03/12/2018 02:02 PM, Yossi Tamari wrote:
> The other properties in this section actually affect parsing (e.g. db.max.outlinks.per.page). I was under the impression that this is what db.max.anchor.length was supposed to do, and actually increased its value. Turns out this is one of the many things in Nutch that are not intuitive (or in this case, does nothing at all).
> One of the reasons I thought so is that very long links can be used as an attack on crawlers.
> Personally, I still think the property should be used to limit outlink length in parsing, but if that is not what it's supposed to do, I guess it needs to be renamed (to match the code), moved to a different section of the properties file, and perhaps better documented. In that case, you'll need to use Markus' solution, and basically everybody should use Markus' first rule...
> 
>> -----Original Message-----
>> From: Semyon Semyonov <se...@mail.com>
>> Sent: 12 March 2018 14:51
>> To: user@nutch.apache.org
>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
>>
>> So, which is the conclusion?
>>
>> Should it be solved in regex file or through this property?
>>
>> Though, how the property of crawldb/linkdb suppose to prevent this problem in
>> Parse?
>>
>> Sent: Monday, March 12, 2018 at 1:42 PM
>> From: "Edward Capriolo" <ed...@gmail.com>
>> To: "user@nutch.apache.org" <us...@nutch.apache.org>
>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
>> Some regular expressions (those with backtracing) can be very expensive for
>> lomg strings
>>
>> https://regular-expressions.mobi/catastrophic.html?wlr=1
>>
>> Maybe that is your issue.
>>
>> On Monday, March 12, 2018, Sebastian Nagel <wa...@googlemail.com>
>> wrote:
>>
>>> Good catch. It should be renamed to be consistent with other
>>> properties, right?
>>>
>>> On 03/12/2018 01:10 PM, Yossi Tamari wrote:
>>>> Perhaps, however it starts with db, not linkdb (like the other
>>>> linkdb
>>> properties), it is in the CrawlDB part of nutch-default.xml, and
>>> LinkDB code uses the property name linkdb.max.anchor.length.
>>>>
>>>>> -----Original Message-----
>>>>> From: Markus Jelsma <ma...@openindex.io>
>>>>> Sent: 12 March 2018 14:05
>>>>> To: user@nutch.apache.org
>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>> unrealistically
>>> long links
>>>>>
>>>>> That is for the LinkDB.
>>>>>
>>>>>
>>>>>
>>>>> -----Original message-----
>>>>>> From:Yossi Tamari <yo...@pipl.com>
>>>>>> Sent: Monday 12th March 2018 13:02
>>>>>> To: user@nutch.apache.org
>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>> unrealistically long links
>>>>>>
>>>>>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy
>>>>>> paste
>>>>> error...
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Markus Jelsma <ma...@openindex.io>
>>>>>>> Sent: 12 March 2018 14:01
>>>>>>> To: user@nutch.apache.org
>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>> unrealistically long links
>>>>>>>
>>>>>>> scripts/apache-nutch-
>>>>>>> 1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
>>>>>>> maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page",
>>>>>>> 100);
>>>>>>> scripts/apache-nutch-
>>>>>>> 1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:
>>> int
>>>>>>> maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original message-----
>>>>>>>> From:Yossi Tamari <yo...@pipl.com>
>>>>>>>> Sent: Monday 12th March 2018 12:56
>>>>>>>> To: user@nutch.apache.org
>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>>> unrealistically long links
>>>>>>>>
>>>>>>>> Nutch.default contains a property db.max.outlinks.per.page,
>>>>>>>> which I think is
>>>>>>> supposed to prevent these cases. However, I just searched the
>>>>>>> code and couldn't find where it is used. Bug?
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Semyon Semyonov <se...@mail.com>
>>>>>>>>> Sent: 12 March 2018 12:47
>>>>>>>>> To: usernutch.apache.org <us...@nutch.apache.org>
>>>>>>>>> Subject: UrlRegexFilter is getting destroyed for
>>>>>>>>> unrealistically long links
>>>>>>>>>
>>>>>>>>> Dear all,
>>>>>>>>>
>>>>>>>>> There is an issue with UrlRegexFilter and parsing. In average,
>>>>>>>>> parsing takes about 1 millisecond, but sometimes the websites
>>>>>>>>> have the crazy links that destroy the parsing(takes 3+ hours
>>>>>>>>> and destroy the next
>>>>>>> steps of the crawling).
>>>>>>>>> For example, below you can see shortened logged version of url
>>>>>>>>> with encoded image, the real lenght of the link is 532572
>>> characters.
>>>>>>>>>
>>>>>>>>> Any idea what should I do with such behavior? Should I modify
>>>>>>>>> the plugin to reject links with lenght > MAX or use more comlex
>>>>>>>>> logic/check extra configuration?
>>>>>>>>> 2018-03-10 23:39:52,082 INFO [main]
>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing
>>>>>>>>> and normalization
>>>>>>>>> 2018-03-10 23:39:52,178 INFO [main]
>>>>>>>>> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
>>>>>>>>> filter for url
>>>>>>>>>
>>>>>>>
>>>>>
>> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS
>>>>>>>>>
>>>>>>>
>>>>>
>> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
>>>>>>>>>
>>>>>>>
>>>>>
>> Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr
>>>>> 7
>>>>>>>>>
>>>>>>>
>>>>>
>> X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
>>>>>>>>>
>>>>>>>
>>>>>
>> efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
>>>>>>>>> dbnu50253lju... [532572 characters]
>>>>>>>>> 2018-03-11 03:56:26,118 INFO [main]
>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.4. After filteing and
>>>>>>>>> normalization
>>>>>>>>>
>>>>>>>>> Semyon.
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>>
>>
>> --
>> Sorry this was sent from mobile. Will do less grammar and spell check than
>> usual.
> 


RE: UrlRegexFilter is getting destroyed for unrealistically long links

Posted by Yossi Tamari <yo...@pipl.com>.
The other properties in this section actually affect parsing (e.g. db.max.outlinks.per.page). I was under the impression that this is what db.max.anchor.length was supposed to do, and actually increased its value. Turns out this is one of the many things in Nutch that are not intuitive (or, in this case, do nothing at all).
One of the reasons I thought so is that very long links can be used as an attack on crawlers.
Personally, I still think the property should be used to limit outlink length in parsing, but if that is not what it's supposed to do, I guess it needs to be renamed (to match the code), moved to a different section of the properties file, and perhaps better documented. In that case, you'll need to use Markus' solution, and basically everybody should use Markus' first rule...

> -----Original Message-----
> From: Semyon Semyonov <se...@mail.com>
> Sent: 12 March 2018 14:51
> To: user@nutch.apache.org
> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
> 
> So, which is the conclusion?
> 
> Should it be solved in regex file or through this property?
> 
> Though, how the property of crawldb/linkdb suppose to prevent this problem in
> Parse?
> 
> Sent: Monday, March 12, 2018 at 1:42 PM
> From: "Edward Capriolo" <ed...@gmail.com>
> To: "user@nutch.apache.org" <us...@nutch.apache.org>
> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
> Some regular expressions (those with backtracing) can be very expensive for
> lomg strings
> 
> https://regular-expressions.mobi/catastrophic.html?wlr=1
> 
> Maybe that is your issue.
> 
> On Monday, March 12, 2018, Sebastian Nagel <wa...@googlemail.com>
> wrote:
> 
> > Good catch. It should be renamed to be consistent with other
> > properties, right?
> >
> > On 03/12/2018 01:10 PM, Yossi Tamari wrote:
> > > Perhaps, however it starts with db, not linkdb (like the other
> > > linkdb
> > properties), it is in the CrawlDB part of nutch-default.xml, and
> > LinkDB code uses the property name linkdb.max.anchor.length.
> > >
> > >> -----Original Message-----
> > >> From: Markus Jelsma <ma...@openindex.io>
> > >> Sent: 12 March 2018 14:05
> > >> To: user@nutch.apache.org
> > >> Subject: RE: UrlRegexFilter is getting destroyed for
> > >> unrealistically
> > long links
> > >>
> > >> That is for the LinkDB.
> > >>
> > >>
> > >>
> > >> -----Original message-----
> > >>> From:Yossi Tamari <yo...@pipl.com>
> > >>> Sent: Monday 12th March 2018 13:02
> > >>> To: user@nutch.apache.org
> > >>> Subject: RE: UrlRegexFilter is getting destroyed for
> > >>> unrealistically long links
> > >>>
> > >>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy
> > >>> paste
> > >> error...
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Markus Jelsma <ma...@openindex.io>
> > >>>> Sent: 12 March 2018 14:01
> > >>>> To: user@nutch.apache.org
> > >>>> Subject: RE: UrlRegexFilter is getting destroyed for
> > >>>> unrealistically long links
> > >>>>
> > >>>> scripts/apache-nutch-
> > >>>> 1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
> > >>>> maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page",
> > >>>> 100);
> > >>>> scripts/apache-nutch-
> > >>>> 1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:
> > int
> > >>>> maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> -----Original message-----
> > >>>>> From:Yossi Tamari <yo...@pipl.com>
> > >>>>> Sent: Monday 12th March 2018 12:56
> > >>>>> To: user@nutch.apache.org
> > >>>>> Subject: RE: UrlRegexFilter is getting destroyed for
> > >>>>> unrealistically long links
> > >>>>>
> > >>>>> Nutch.default contains a property db.max.outlinks.per.page,
> > >>>>> which I think is
> > >>>> supposed to prevent these cases. However, I just searched the
> > >>>> code and couldn't find where it is used. Bug?
> > >>>>>
> > >>>>>> -----Original Message-----
> > >>>>>> From: Semyon Semyonov <se...@mail.com>
> > >>>>>> Sent: 12 March 2018 12:47
> > >>>>>> To: usernutch.apache.org <us...@nutch.apache.org>
> > >>>>>> Subject: UrlRegexFilter is getting destroyed for
> > >>>>>> unrealistically long links
> > >>>>>>
> > >>>>>> Dear all,
> > >>>>>>
> > >>>>>> There is an issue with UrlRegexFilter and parsing. In average,
> > >>>>>> parsing takes about 1 millisecond, but sometimes the websites
> > >>>>>> have the crazy links that destroy the parsing(takes 3+ hours
> > >>>>>> and destroy the next
> > >>>> steps of the crawling).
> > >>>>>> For example, below you can see shortened logged version of url
> > >>>>>> with encoded image, the real lenght of the link is 532572
> > characters.
> > >>>>>>
> > >>>>>> Any idea what should I do with such behavior? Should I modify
> > >>>>>> the plugin to reject links with lenght > MAX or use more comlex
> > >>>>>> logic/check extra configuration?
> > >>>>>> 2018-03-10 23:39:52,082 INFO [main]
> > >>>>>> org.apache.nutch.parse.ParseOutputFormat:
> > >>>>>> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing
> > >>>>>> and normalization
> > >>>>>> 2018-03-10 23:39:52,178 INFO [main]
> > >>>>>> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> > >>>>>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
> > >>>>>> filter for url
> > >>>>>>
> > >>>>
> > >>
> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS
> > >>>>>>
> > >>>>
> > >>
> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
> > >>>>>>
> > >>>>
> > >>
> Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr
> > >> 7
> > >>>>>>
> > >>>>
> > >>
> X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
> > >>>>>>
> > >>>>
> > >>
> efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
> > >>>>>> dbnu50253lju... [532572 characters]
> > >>>>>> 2018-03-11 03:56:26,118 INFO [main]
> > >>>>>> org.apache.nutch.parse.ParseOutputFormat:
> > >>>>>> ParseOutputFormat.Write.filterNormalize 4.4. After filteing and
> > >>>>>> normalization
> > >>>>>>
> > >>>>>> Semyon.
> > >>>>>
> > >>>>>
> > >>>
> > >>>
> > >
> >
> >
> 
> --
> Sorry this was sent from mobile. Will do less grammar and spell check than
> usual.


Re: UrlRegexFilter is getting destroyed for unrealistically long links

Posted by Semyon Semyonov <se...@mail.com>.
So, what is the conclusion?

Should it be solved in the regex file or through this property?

Then again, how is a crawldb/linkdb property supposed to prevent this problem in Parse?

Sent: Monday, March 12, 2018 at 1:42 PM
From: "Edward Capriolo" <ed...@gmail.com>
To: "user@nutch.apache.org" <us...@nutch.apache.org>
Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
Some regular expressions (those with backtracking) can be very expensive for
long strings.

https://regular-expressions.mobi/catastrophic.html?wlr=1

Maybe that is your issue.

On Monday, March 12, 2018, Sebastian Nagel <wa...@googlemail.com>
wrote:

> Good catch. It should be renamed to be consistent with other properties,
> right?
>
> On 03/12/2018 01:10 PM, Yossi Tamari wrote:
> > Perhaps, however it starts with db, not linkdb (like the other linkdb
> properties), it is in the CrawlDB part of nutch-default.xml, and LinkDB
> code uses the property name linkdb.max.anchor.length.
> >
> >> -----Original Message-----
> >> From: Markus Jelsma <ma...@openindex.io>
> >> Sent: 12 March 2018 14:05
> >> To: user@nutch.apache.org
> >> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> long links
> >>
> >> That is for the LinkDB.
> >>
> >>
> >>
> >> -----Original message-----
> >>> From:Yossi Tamari <yo...@pipl.com>
> >>> Sent: Monday 12th March 2018 13:02
> >>> To: user@nutch.apache.org
> >>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> >>> long links
> >>>
> >>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy paste
> >> error...
> >>>
> >>>> -----Original Message-----
> >>>> From: Markus Jelsma <ma...@openindex.io>
> >>>> Sent: 12 March 2018 14:01
> >>>> To: user@nutch.apache.org
> >>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> >>>> long links
> >>>>
> >>>> scripts/apache-nutch-
> >>>> 1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
> >>>> maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
> >>>> scripts/apache-nutch-
> >>>> 1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:
> int
> >>>> maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> -----Original message-----
> >>>>> From:Yossi Tamari <yo...@pipl.com>
> >>>>> Sent: Monday 12th March 2018 12:56
> >>>>> To: user@nutch.apache.org
> >>>>> Subject: RE: UrlRegexFilter is getting destroyed for
> >>>>> unrealistically long links
> >>>>>
> >>>>> Nutch.default contains a property db.max.outlinks.per.page, which
> >>>>> I think is
> >>>> supposed to prevent these cases. However, I just searched the code
> >>>> and couldn't find where it is used. Bug?
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Semyon Semyonov <se...@mail.com>
> >>>>>> Sent: 12 March 2018 12:47
> >>>>>> To: usernutch.apache.org <us...@nutch.apache.org>
> >>>>>> Subject: UrlRegexFilter is getting destroyed for unrealistically
> >>>>>> long links
> >>>>>>
> >>>>>> Dear all,
> >>>>>>
> >>>>>> There is an issue with UrlRegexFilter and parsing. In average,
> >>>>>> parsing takes about 1 millisecond, but sometimes the websites
> >>>>>> have the crazy links that destroy the parsing(takes 3+ hours and
> >>>>>> destroy the next
> >>>> steps of the crawling).
> >>>>>> For example, below you can see shortened logged version of url
> >>>>>> with encoded image, the real lenght of the link is 532572
> characters.
> >>>>>>
> >>>>>> Any idea what should I do with such behavior? Should I modify
> >>>>>> the plugin to reject links with lenght > MAX or use more comlex
> >>>>>> logic/check extra configuration?
> >>>>>> 2018-03-10 23:39:52,082 INFO [main]
> >>>>>> org.apache.nutch.parse.ParseOutputFormat:
> >>>>>> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and
> >>>>>> normalization
> >>>>>> 2018-03-10 23:39:52,178 INFO [main]
> >>>>>> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> >>>>>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
> >>>>>> filter for url
> >>>>>>
> >>>>
> >> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS
> >>>>>>
> >>>>
> >> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
> >>>>>>
> >>>>
> >> Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
> >>>>>>
> >>>>
> >> X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
> >>>>>>
> >>>>
> >> efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
> >>>>>> dbnu50253lju... [532572 characters]
> >>>>>> 2018-03-11 03:56:26,118 INFO [main]
> >>>>>> org.apache.nutch.parse.ParseOutputFormat:
> >>>>>> ParseOutputFormat.Write.filterNormalize 4.4. After filteing and
> >>>>>> normalization
> >>>>>>
> >>>>>> Semyon.
> >>>>>
> >>>>>
> >>>
> >>>
> >
>
>

--
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.

Re: UrlRegexFilter is getting destroyed for unrealistically long links

Posted by Edward Capriolo <ed...@gmail.com>.
Some regular expressions (those with backtracking) can be very expensive for
long strings.

https://regular-expressions.mobi/catastrophic.html?wlr=1

Maybe that is your issue.
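
A toy illustration (not a pattern from Nutch's rule set) of how such a regex
behaves on long non-matching input:

  import java.util.regex.Pattern;

  public class CatastrophicDemo {
    public static void main(String[] args) {
      // (a+)+$ backtracks exponentially when a long run of 'a'
      // is not followed by the end of the input
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < 30; i++) sb.append('a');
      sb.append('!');
      long start = System.nanoTime();
      boolean found = Pattern.compile("(a+)+$").matcher(sb).find();
      // can already take seconds for ~30 characters, roughly doubling
      // with each extra 'a' -- a 500,000-character URL never finishes
      System.out.println(found + " after "
          + (System.nanoTime() - start) / 1e9 + " s");
    }
  }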

On Monday, March 12, 2018, Sebastian Nagel <wa...@googlemail.com>
wrote:

> Good catch. It should be renamed to be consistent with other properties,
> right?
>
> On 03/12/2018 01:10 PM, Yossi Tamari wrote:
> > Perhaps, however it starts with db, not linkdb (like the other linkdb
> properties), it is in the CrawlDB part of nutch-default.xml, and LinkDB
> code uses the property name linkdb.max.anchor.length.
> >
> >> -----Original Message-----
> >> From: Markus Jelsma <ma...@openindex.io>
> >> Sent: 12 March 2018 14:05
> >> To: user@nutch.apache.org
> >> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> long links
> >>
> >> That is for the LinkDB.
> >>
> >>
> >>
> >> -----Original message-----
> >>> From:Yossi Tamari <yo...@pipl.com>
> >>> Sent: Monday 12th March 2018 13:02
> >>> To: user@nutch.apache.org
> >>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> >>> long links
> >>>
> >>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy paste
> >> error...
> >>>
> >>>> -----Original Message-----
> >>>> From: Markus Jelsma <ma...@openindex.io>
> >>>> Sent: 12 March 2018 14:01
> >>>> To: user@nutch.apache.org
> >>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> >>>> long links
> >>>>
> >>>> scripts/apache-nutch-
> >>>> 1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
> >>>> maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
> >>>> scripts/apache-nutch-
> >>>> 1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:
> int
> >>>> maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> -----Original message-----
> >>>>> From:Yossi Tamari <yo...@pipl.com>
> >>>>> Sent: Monday 12th March 2018 12:56
> >>>>> To: user@nutch.apache.org
> >>>>> Subject: RE: UrlRegexFilter is getting destroyed for
> >>>>> unrealistically long links
> >>>>>
> >>>>> Nutch.default contains a property db.max.outlinks.per.page, which
> >>>>> I think is
> >>>> supposed to prevent these cases. However, I just searched the code
> >>>> and couldn't find where it is used. Bug?
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Semyon Semyonov <se...@mail.com>
> >>>>>> Sent: 12 March 2018 12:47
> >>>>>> To: usernutch.apache.org <us...@nutch.apache.org>
> >>>>>> Subject: UrlRegexFilter is getting destroyed for unrealistically
> >>>>>> long links
> >>>>>>
> >>>>>> Dear all,
> >>>>>>
> >>>>>> There is an issue with UrlRegexFilter and parsing. In average,
> >>>>>> parsing takes about 1 millisecond, but sometimes the websites
> >>>>>> have the crazy links that destroy the parsing(takes 3+ hours and
> >>>>>> destroy the next
> >>>> steps of the crawling).
> >>>>>> For example, below you can see shortened logged version of url
> >>>>>> with encoded image, the real lenght of the link is 532572
> characters.
> >>>>>>
> >>>>>> Any idea what should I do with such behavior?  Should I modify
> >>>>>> the plugin to reject links with lenght > MAX or use more comlex
> >>>>>> logic/check extra configuration?
> >>>>>> 2018-03-10 23:39:52,082 INFO [main]
> >>>>>> org.apache.nutch.parse.ParseOutputFormat:
> >>>>>> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and
> >>>>>> normalization
> >>>>>> 2018-03-10 23:39:52,178 INFO [main]
> >>>>>> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> >>>>>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
> >>>>>> filter for url
> >>>>>>
> >>>>
> >> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS
> >>>>>>
> >>>>
> >> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
> >>>>>>
> >>>>
> >> Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
> >>>>>>
> >>>>
> >> X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
> >>>>>>
> >>>>
> >> efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
> >>>>>> dbnu50253lju... [532572 characters]
> >>>>>> 2018-03-11 03:56:26,118 INFO [main]
> >>>>>> org.apache.nutch.parse.ParseOutputFormat:
> >>>>>> ParseOutputFormat.Write.filterNormalize 4.4. After filteing and
> >>>>>> normalization
> >>>>>>
> >>>>>> Semyon.
> >>>>>
> >>>>>
> >>>
> >>>
> >
>
>

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.

Re: UrlRegexFilter is getting destroyed for unrealistically long links

Posted by Sebastian Nagel <wa...@googlemail.com>.
Good catch. It should be renamed to be consistent with other properties, right?

On 03/12/2018 01:10 PM, Yossi Tamari wrote:
> Perhaps, however it starts with db, not linkdb (like the other linkdb properties), it is in the CrawlDB part of nutch-default.xml, and LinkDB code uses the property name linkdb.max.anchor.length.
> 
>> -----Original Message-----
>> From: Markus Jelsma <ma...@openindex.io>
>> Sent: 12 March 2018 14:05
>> To: user@nutch.apache.org
>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
>>
>> That is for the LinkDB.
>>
>>
>>
>> -----Original message-----
>>> From:Yossi Tamari <yo...@pipl.com>
>>> Sent: Monday 12th March 2018 13:02
>>> To: user@nutch.apache.org
>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
>>> long links
>>>
>>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy paste
>> error...
>>>
>>>> -----Original Message-----
>>>> From: Markus Jelsma <ma...@openindex.io>
>>>> Sent: 12 March 2018 14:01
>>>> To: user@nutch.apache.org
>>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
>>>> long links
>>>>
>>>> scripts/apache-nutch-
>>>> 1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
>>>> maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
>>>> scripts/apache-nutch-
>>>> 1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:    int
>>>> maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
>>>>
>>>>
>>>>
>>>>
>>>> -----Original message-----
>>>>> From:Yossi Tamari <yo...@pipl.com>
>>>>> Sent: Monday 12th March 2018 12:56
>>>>> To: user@nutch.apache.org
>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>> unrealistically long links
>>>>>
>>>>> Nutch.default contains a property db.max.outlinks.per.page, which
>>>>> I think is
>>>> supposed to prevent these cases. However, I just searched the code
>>>> and couldn't find where it is used. Bug?
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Semyon Semyonov <se...@mail.com>
>>>>>> Sent: 12 March 2018 12:47
>>>>>> To: usernutch.apache.org <us...@nutch.apache.org>
>>>>>> Subject: UrlRegexFilter is getting destroyed for unrealistically
>>>>>> long links
>>>>>>
>>>>>> Dear all,
>>>>>>
>>>>>> There is an issue with UrlRegexFilter and parsing. In average,
>>>>>> parsing takes about 1 millisecond, but sometimes the websites
>>>>>> have the crazy links that destroy the parsing(takes 3+ hours and
>>>>>> destroy the next
>>>> steps of the crawling).
>>>>>> For example, below you can see shortened logged version of url
>>>>>> with encoded image, the real lenght of the link is 532572 characters.
>>>>>>
>>>>>> Any idea what should I do with such behavior?  Should I modify
>>>>>> the plugin to reject links with lenght > MAX or use more comlex
>>>>>> logic/check extra configuration?
>>>>>> 2018-03-10 23:39:52,082 INFO [main]
>>>>>> org.apache.nutch.parse.ParseOutputFormat:
>>>>>> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and
>>>>>> normalization
>>>>>> 2018-03-10 23:39:52,178 INFO [main]
>>>>>> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
>>>>>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
>>>>>> filter for url
>>>>>>
>>>>
>> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS
>>>>>>
>>>>
>> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
>>>>>>
>>>>
>> Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
>>>>>>
>>>>
>> X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
>>>>>>
>>>>
>> efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
>>>>>> dbnu50253lju... [532572 characters]
>>>>>> 2018-03-11 03:56:26,118 INFO [main]
>>>>>> org.apache.nutch.parse.ParseOutputFormat:
>>>>>> ParseOutputFormat.Write.filterNormalize 4.4. After filteing and
>>>>>> normalization
>>>>>>
>>>>>> Semyon.
>>>>>
>>>>>
>>>
>>>
> 


RE: UrlRegexFilter is getting destroyed for unrealistically long links

Posted by Yossi Tamari <yo...@pipl.com>.
Perhaps; however, it starts with db rather than linkdb (as the other linkdb properties do), it sits in the CrawlDB part of nutch-default.xml, and the LinkDB code uses the property name linkdb.max.anchor.length.

> -----Original Message-----
> From: Markus Jelsma <ma...@openindex.io>
> Sent: 12 March 2018 14:05
> To: user@nutch.apache.org
> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
> 
> That is for the LinkDB.
> 
> 
> 
> -----Original message-----
> > From:Yossi Tamari <yo...@pipl.com>
> > Sent: Monday 12th March 2018 13:02
> > To: user@nutch.apache.org
> > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> > long links
> >
> > Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy paste
> error...
> >
> > > -----Original Message-----
> > > From: Markus Jelsma <ma...@openindex.io>
> > > Sent: 12 March 2018 14:01
> > > To: user@nutch.apache.org
> > > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> > > long links
> > >
> > > scripts/apache-nutch-
> > > 1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
> > > maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
> > > scripts/apache-nutch-
> > > 1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:    int
> > > maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
> > >
> > >
> > >
> > >
> > > -----Original message-----
> > > > From:Yossi Tamari <yo...@pipl.com>
> > > > Sent: Monday 12th March 2018 12:56
> > > > To: user@nutch.apache.org
> > > > Subject: RE: UrlRegexFilter is getting destroyed for
> > > > unrealistically long links
> > > >
> > > > Nutch.default contains a property db.max.outlinks.per.page, which
> > > > I think is
> > > supposed to prevent these cases. However, I just searched the code
> > > and couldn't find where it is used. Bug?
> > > >
> > > > > -----Original Message-----
> > > > > From: Semyon Semyonov <se...@mail.com>
> > > > > Sent: 12 March 2018 12:47
> > > > > To: usernutch.apache.org <us...@nutch.apache.org>
> > > > > Subject: UrlRegexFilter is getting destroyed for unrealistically
> > > > > long links
> > > > >
> > > > > Dear all,
> > > > >
> > > > > There is an issue with UrlRegexFilter and parsing. In average,
> > > > > parsing takes about 1 millisecond, but sometimes the websites
> > > > > have the crazy links that destroy the parsing(takes 3+ hours and
> > > > > destroy the next
> > > steps of the crawling).
> > > > > For example, below you can see shortened logged version of url
> > > > > with encoded image, the real lenght of the link is 532572 characters.
> > > > >
> > > > > Any idea what should I do with such behavior?  Should I modify
> > > > > the plugin to reject links with lenght > MAX or use more comlex
> > > > > logic/check extra configuration?
> > > > > 2018-03-10 23:39:52,082 INFO [main]
> > > > > org.apache.nutch.parse.ParseOutputFormat:
> > > > > ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and
> > > > > normalization
> > > > > 2018-03-10 23:39:52,178 INFO [main]
> > > > > org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> > > > > ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
> > > > > filter for url
> > > > >
> > >
> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS
> > > > >
> > >
> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
> > > > >
> > >
> Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
> > > > >
> > >
> X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
> > > > >
> > >
> efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
> > > > > dbnu50253lju... [532572 characters]
> > > > > 2018-03-11 03:56:26,118 INFO [main]
> > > > > org.apache.nutch.parse.ParseOutputFormat:
> > > > > ParseOutputFormat.Write.filterNormalize 4.4. After filteing and
> > > > > normalization
> > > > >
> > > > > Semyon.
> > > >
> > > >
> >
> >


RE: UrlRegexFilter is getting destroyed for unrealistically long links

Posted by Markus Jelsma <ma...@openindex.io>.
That is for the LinkDB.

 
 
-----Original message-----
> From:Yossi Tamari <yo...@pipl.com>
> Sent: Monday 12th March 2018 13:02
> To: user@nutch.apache.org
> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
> 
> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy paste error...
> 
> > -----Original Message-----
> > From: Markus Jelsma <ma...@openindex.io>
> > Sent: 12 March 2018 14:01
> > To: user@nutch.apache.org
> > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
> > 
> > scripts/apache-nutch-
> > 1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
> > maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
> > scripts/apache-nutch-
> > 1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:    int
> > maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
> > 
> > 
> > 
> > 
> > -----Original message-----
> > > From:Yossi Tamari <yo...@pipl.com>
> > > Sent: Monday 12th March 2018 12:56
> > > To: user@nutch.apache.org
> > > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> > > long links
> > >
> > > Nutch.default contains a property db.max.outlinks.per.page, which I think is
> > supposed to prevent these cases. However, I just searched the code and couldn't
> > find where it is used. Bug?
> > >
> > > > -----Original Message-----
> > > > From: Semyon Semyonov <se...@mail.com>
> > > > Sent: 12 March 2018 12:47
> > > > To: usernutch.apache.org <us...@nutch.apache.org>
> > > > Subject: UrlRegexFilter is getting destroyed for unrealistically
> > > > long links
> > > >
> > > > Dear all,
> > > >
> > > > There is an issue with UrlRegexFilter and parsing. In average,
> > > > parsing takes about 1 millisecond, but sometimes the websites have
> > > > the crazy links that destroy the parsing(takes 3+ hours and destroy the next
> > steps of the crawling).
> > > > For example, below you can see shortened logged version of url with
> > > > encoded image, the real lenght of the link is 532572 characters.
> > > >
> > > > Any idea what should I do with such behavior?  Should I modify the
> > > > plugin to reject links with lenght > MAX or use more comlex
> > > > logic/check extra configuration?
> > > > 2018-03-10 23:39:52,082 INFO [main]
> > > > org.apache.nutch.parse.ParseOutputFormat:
> > > > ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and
> > > > normalization
> > > > 2018-03-10 23:39:52,178 INFO [main]
> > > > org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> > > > ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter
> > > > for url
> > > >
> > :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS
> > > >
> > UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
> > > >
> > Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
> > > >
> > X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
> > > >
> > efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
> > > > dbnu50253lju... [532572 characters]
> > > > 2018-03-11 03:56:26,118 INFO [main]
> > > > org.apache.nutch.parse.ParseOutputFormat:
> > > > ParseOutputFormat.Write.filterNormalize 4.4. After filteing and
> > > > normalization
> > > >
> > > > Semyon.
> > >
> > >
> 
> 

RE: UrlRegexFilter is getting destroyed for unrealistically long links

Posted by Yossi Tamari <yo...@pipl.com>.
Sorry, I meant db.max.anchor.length, not db.max.outlinks.per.page. Copy-paste error...

> -----Original Message-----
> From: Markus Jelsma <ma...@openindex.io>
> Sent: 12 March 2018 14:01
> To: user@nutch.apache.org
> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
> 
> scripts/apache-nutch-
> 1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
> maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
> scripts/apache-nutch-
> 1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:    int
> maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
> 
> 
> 
> 
> -----Original message-----
> > From:Yossi Tamari <yo...@pipl.com>
> > Sent: Monday 12th March 2018 12:56
> > To: user@nutch.apache.org
> > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> > long links
> >
> > Nutch.default contains a property db.max.outlinks.per.page, which I think is
> supposed to prevent these cases. However, I just searched the code and couldn't
> find where it is used. Bug?
> >
> > > -----Original Message-----
> > > From: Semyon Semyonov <se...@mail.com>
> > > Sent: 12 March 2018 12:47
> > > To: usernutch.apache.org <us...@nutch.apache.org>
> > > Subject: UrlRegexFilter is getting destroyed for unrealistically
> > > long links
> > >
> > > Dear all,
> > >
> > > There is an issue with UrlRegexFilter and parsing. In average,
> > > parsing takes about 1 millisecond, but sometimes the websites have
> > > the crazy links that destroy the parsing(takes 3+ hours and destroy the next
> steps of the crawling).
> > > For example, below you can see shortened logged version of url with
> > > encoded image, the real lenght of the link is 532572 characters.
> > >
> > > Any idea what should I do with such behavior?  Should I modify the
> > > plugin to reject links with lenght > MAX or use more comlex
> > > logic/check extra configuration?
> > > 2018-03-10 23:39:52,082 INFO [main]
> > > org.apache.nutch.parse.ParseOutputFormat:
> > > ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and
> > > normalization
> > > 2018-03-10 23:39:52,178 INFO [main]
> > > org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> > > ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter
> > > for url
> > >
> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS
> > >
> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
> > >
> Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
> > >
> X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
> > >
> efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
> > > dbnu50253lju... [532572 characters]
> > > 2018-03-11 03:56:26,118 INFO [main]
> > > org.apache.nutch.parse.ParseOutputFormat:
> > > ParseOutputFormat.Write.filterNormalize 4.4. After filteing and
> > > normalization
> > >
> > > Semyon.
> >
> >


RE: UrlRegexFilter is getting destroyed for unrealistically long links

Posted by Markus Jelsma <ma...@openindex.io>.
scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:    maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:    int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
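
So the property caps how many outlinks are kept per page, roughly like this
(paraphrased, not the literal Nutch code):

  int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
  int toProcess = (maxOutlinksPerPage < 0)
      ? outlinks.length                           // negative = no limit
      : Math.min(maxOutlinksPerPage, outlinks.length);
  // only the first 'toProcess' outlinks are written out; the character
  // length of each individual URL is never checked here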


 
 
-----Original message-----
> From:Yossi Tamari <yo...@pipl.com>
> Sent: Monday 12th March 2018 12:56
> To: user@nutch.apache.org
> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
> 
> Nutch.default contains a property db.max.outlinks.per.page, which I think is supposed to prevent these cases. However, I just searched the code and couldn't find where it is used. Bug? 
> 
> > -----Original Message-----
> > From: Semyon Semyonov <se...@mail.com>
> > Sent: 12 March 2018 12:47
> > To: usernutch.apache.org <us...@nutch.apache.org>
> > Subject: UrlRegexFilter is getting destroyed for unrealistically long links
> > 
> > Dear all,
> > 
> > There is an issue with UrlRegexFilter and parsing. In average, parsing takes
> > about 1 millisecond, but sometimes the websites have the crazy links that
> > destroy the parsing(takes 3+ hours and destroy the next steps of the crawling).
> > For example, below you can see shortened logged version of url with encoded
> > image, the real lenght of the link is 532572 characters.
> > 
> > Any idea what should I do with such behavior?  Should I modify the plugin to
> > reject links with lenght > MAX or use more comlex logic/check extra
> > configuration?
> > 2018-03-10 23:39:52,082 INFO [main]
> > org.apache.nutch.parse.ParseOutputFormat:
> > ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and normalization
> > 2018-03-10 23:39:52,178 INFO [main]
> > org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> > ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter for url
> > :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS
> > UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
> > Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
> > X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
> > efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
> > dbnu50253lju... [532572 characters]
> > 2018-03-11 03:56:26,118 INFO [main]
> > org.apache.nutch.parse.ParseOutputFormat:
> > ParseOutputFormat.Write.filterNormalize 4.4. After filteing and normalization
> > 
> > Semyon.
> 
> 

RE: UrlRegexFilter is getting destroyed for unrealistically long links

Posted by Yossi Tamari <yo...@pipl.com>.
nutch-default.xml contains a property db.max.outlinks.per.page, which I think is supposed to prevent these cases. However, I just searched the code and couldn't find where it is used. Bug? 
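
For reference, the property as it appears in nutch-default.xml (quoted from
memory, the exact wording may differ):

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>100</value>
    <description>The maximum number of outlinks that we'll process for a
    page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page
    outlinks will be processed for a page; otherwise, all outlinks will be
    processed.</description>
  </property>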

> -----Original Message-----
> From: Semyon Semyonov <se...@mail.com>
> Sent: 12 March 2018 12:47
> To: usernutch.apache.org <us...@nutch.apache.org>
> Subject: UrlRegexFilter is getting destroyed for unrealistically long links
> 
> Dear all,
> 
> There is an issue with UrlRegexFilter and parsing. In average, parsing takes
> about 1 millisecond, but sometimes the websites have the crazy links that
> destroy the parsing(takes 3+ hours and destroy the next steps of the crawling).
> For example, below you can see shortened logged version of url with encoded
> image, the real lenght of the link is 532572 characters.
> 
> Any idea what should I do with such behavior?  Should I modify the plugin to
> reject links with lenght > MAX or use more comlex logic/check extra
> configuration?
> 2018-03-10 23:39:52,082 INFO [main]
> org.apache.nutch.parse.ParseOutputFormat:
> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and normalization
> 2018-03-10 23:39:52,178 INFO [main]
> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter for url
> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS
> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
> Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
> X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
> efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
> dbnu50253lju... [532572 characters]
> 2018-03-11 03:56:26,118 INFO [main]
> org.apache.nutch.parse.ParseOutputFormat:
> ParseOutputFormat.Write.filterNormalize 4.4. After filteing and normalization
> 
> Semyon.