Posted to user@nutch.apache.org by AJ Chen <aj...@web2express.org> on 2010/09/11 17:37:10 UTC

how to skip invalid outlinks

After the crawldb grows big, the percentage of invalid urls in the generated
segments becomes very high. Fetching invalid urls is wasteful - it reduces
throughput significantly. These invalid urls come from the outlinks of various
contents.  Lots of them seem to come from parsing javascript. What's a good
way to prevent invalid outlinks from getting into the crawldb?
thanks,
-aj

Re: how to skip invalid outlinks

Posted by AJ Chen <aj...@web2express.org>.
Using url regex exclusion is one approach. The JS parser also needs to
improve its identification of valid outlinks.

I'm using another approach: adding code to validate each url before it is
added to the crawldb in the updatedb step. One test to do is "new URL(url)",
which throws an exception on an incorrectly formatted url. This test skips
lots of invalid urls.  However, lots of invalid urls still get into the
crawldb.  I may have missed some places that need this url validation code.
Question: where are the right places to insert url validation code in order
to completely prevent invalid urls from being added to the crawldb?
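
For example, something along these lines (a rough sketch - the class name is
made up, and you'd still have to register the filter via plugin.includes):

import java.net.MalformedURLException;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

/** Rejects any url that java.net.URL cannot parse. */
public class ValidatingURLFilter implements URLFilter {
  private Configuration conf;

  /** Return the url unchanged to keep it, or null to drop it. */
  public String filter(String urlString) {
    try {
      new URL(urlString);            // throws on a malformed url
      return urlString;
    } catch (MalformedURLException e) {
      return null;                   // dropped before it reaches crawldb
    }
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}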

-aj

On Sun, Sep 12, 2010 at 9:09 AM, Mike Baranczak <mb...@gmail.com> wrote:

> I had the same problem, and a lot of the bad links did seem to come from
> faulty JavaScript parsing. Jeff's suggestion is probably the best you can do
> for now. The long-term solution would be to fix the JavaScript parser
> plugin.
>
> -MB
>
>
> On Sep 11, 2010, at 3:09 PM, Jeff Zhou wrote:
>
> > There is no way unless you know which urls to exclude. I use regex to
> > remove any outlinks falling into patterns I dislike.
> >
> > On Sat, Sep 11, 2010 at 11:37 AM, AJ Chen <aj...@web2express.org>
> > wrote:
> >
> >> After the crawldb grows big, the percentage of invalid urls in the
> >> generated segments becomes very high. Fetching invalid urls is wasteful -
> >> it reduces throughput significantly. These invalid urls come from the
> >> outlinks of various contents.  Lots of them seem to come from parsing
> >> javascript. What's a good way to prevent invalid outlinks from getting
> >> into the crawldb?
> >> thanks,
> >> -aj
> >>
>
>


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

Re: how to skip invalid outlinks

Posted by Mike Baranczak <mb...@gmail.com>.
I had the same problem, and a lot of the bad links did seem to come from faulty JavaScript parsing. Jeff's suggestion is probably the best you can do for now. The long-term solution would be to fix the JavaScript parser plugin.

-MB


On Sep 11, 2010, at 3:09 PM, Jeff Zhou wrote:

> There is no way unless you know which urls to exclude. I use regex to remove
> any outlinks falling into patterns I dislike.
> 
> On Sat, Sep 11, 2010 at 11:37 AM, AJ Chen <aj...@web2express.org> wrote:
> 
>> After the crawldb grows big, the percentage of invalid urls in the generated
>> segments becomes very high. Fetching invalid urls is wasteful - it reduces
>> throughput significantly. These invalid urls come from the outlinks of various
>> contents.  Lots of them seem to come from parsing javascript. What's a good
>> way to prevent invalid outlinks from getting into the crawldb?
>> thanks,
>> -aj
>> 


Re: how to skip invalid outlinks

Posted by Jeff Zhou <je...@gmail.com>.
There is no way unless you know which urls to exclude. I use regex to remove
any outlinks falling into patterns I dislike.
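
For example, in conf/regex-urlfilter.txt (a leading '-' rejects, a leading
'+' accepts, and the first matching pattern wins; the patterns below are only
illustrations - tune them to your own crawl):

# skip urls containing characters that usually come from javascript junk
-[*!@]
# skip anything that is not plain http or https
-^(?!https?://)
# accept everything else
+.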

On Sat, Sep 11, 2010 at 11:37 AM, AJ Chen <aj...@web2express.org> wrote:

> After the crawldb grows big, the percentage of invalid urls in the generated
> segments becomes very high. Fetching invalid urls is wasteful - it reduces
> throughput significantly. These invalid urls come from the outlinks of various
> contents.  Lots of them seem to come from parsing javascript. What's a good
> way to prevent invalid outlinks from getting into the crawldb?
> thanks,
> -aj
>