Posted to user@nutch.apache.org by Yavuz Selim YILMAZ <yv...@gmail.com> on 2010/09/16 14:25:23 UTC
Junk Links
Hi all,
My crawl picks up links like these:
.../+location.hostname+
.../</div>
.../application/rss+xml
.../</text/css
.../application/text/css
.../text/+location.hostname+
The " .../ " prefix is the same for every link.
I use the crawl with Solr, and when I run a query, these similar URLs are returned.
Does anybody know how to prevent this kind of junk link from being crawled?
Thanks
--
Yavuz Selim YILMAZ
Re: Junk Links
Posted by Mike Baranczak <mb...@gmail.com>.
Take a look at the URLNormalizer plugins.
On Sep 23, 2010, at 4:03 AM, Yavuz Selim YILMAZ wrote:
> Another question;
>
> I have URLs like these:
>
> .....aaa/
> .....aaa
> .....bbb/
> .....ccc
> .....ddd/
> .....ddd
>
> There are duplicates like that.
>
> What I'm trying to explain is that some of them are duplicates, but there are
> links like "...bbb/" and "...ccc" which are valid.
>
> How can I handle this kind of problem?
>
> I don't think url-regex can handle it, because it only checks the URL
> itself and cannot know whether the same URL was already taken with or without the slash.
>
> Any idea?
>
> --
>
> Yavuz Selim YILMAZ
>
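For readers following the thread, the idea behind a URL normalizer can be sketched in a few lines of Python. This is only an illustration of the logic, not Nutch code: the function names and the example.com URLs are made up, and it assumes that a path with and without a trailing slash serves the same page on the site being crawled, which is not guaranteed in general. Every URL is mapped to one canonical form before it is stored, so the filter never needs to know whether the "other" spelling was already seen.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Drop a trailing slash from the path, keeping the bare root "/".

    Assumption: ".../aaa/" and ".../aaa" point at the same content.
    """
    parts = urlsplit(url)
    path = parts.path
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

def unique(urls):
    """Return the normalized URLs in first-seen order, without duplicates."""
    seen = set()
    result = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result

# Hypothetical URLs mirroring the aaa/bbb/ccc/ddd pattern from the question.
crawled = [
    "http://example.com/aaa/",
    "http://example.com/aaa",
    "http://example.com/bbb/",
    "http://example.com/ccc",
    "http://example.com/ddd/",
    "http://example.com/ddd",
]
print(unique(crawled))
```

With this approach "...bbb/" and "...ccc" are still kept, each under its slash-free form, while "...aaa/" and "...ddd/" collapse into their twins.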
Re: Junk Links
Posted by Yavuz Selim YILMAZ <yv...@gmail.com>.
Another question;
I have URLs like these:
.....aaa/
.....aaa
.....bbb/
.....ccc
.....ddd/
.....ddd
There are duplicates like that.
What I'm trying to explain is that some of them are duplicates, but there are
links like "...bbb/" and "...ccc" which are valid.
How can I handle this kind of problem?
I don't think url-regex can handle it, because it only checks the URL
itself and cannot know whether the same URL was already taken with or without the slash.
Any idea?
--
Yavuz Selim YILMAZ
2010/9/17 Yavuz Selim YILMAZ <yv...@gmail.com>
> Thanks Mike,
> I missed that. Configuring *url regex* may be helpful, then.
> --
>
> Yavuz Selim YILMAZ
Re: Junk Links
Posted by Yavuz Selim YILMAZ <yv...@gmail.com>.
Thanks Mike,
I missed that. Configuring *url regex* may be helpful, then.
--
Yavuz Selim YILMAZ
2010/9/16 Mike Baranczak <mb...@gmail.com>
> This was discussed just a few days ago:
>
>
> http://lucene.472066.n3.nabble.com/how-to-skip-invalid-outlinks-td1457591.html#a1457591
>
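As an illustration of the "url regex" approach mentioned above, here is a minimal Python sketch of an ordered accept/reject filter. It is not Nutch's implementation: the rule list, the function names and the example.com URLs are all hypothetical, and the "+regex / -regex, first match wins" convention is assumed here only as a convenient way to write the rules. The patterns target the junk shown in the original question: leftover HTML tags, unevaluated JavaScript such as +location.hostname+, and MIME types mistaken for paths.

```python
import re

# Hypothetical rules. A leading "-" rejects, a leading "+" accepts.
# The first rule whose pattern matches decides; no match means reject.
RULES = [
    "-[<>]",                           # leftover HTML such as </div>
    r"-\+location\.hostname\+",        # unevaluated JavaScript concatenation
    r"-/(application|text)/[a-z+]+$",  # MIME types such as text/css used as a path
    "+.",                              # accept everything else
]

def compile_rules(rules):
    """Turn each rule string into an (accept?, compiled pattern) pair."""
    return [(rule[0] == "+", re.compile(rule[1:])) for rule in rules]

def accept(url, compiled):
    """Return True if the first matching rule is an accept rule."""
    for allowed, pattern in compiled:
        if pattern.search(url):
            return allowed
    return False

compiled = compile_rules(RULES)
for url in [
    "http://example.com/+location.hostname+",
    "http://example.com/</div>",
    "http://example.com/application/rss+xml",
    "http://example.com/news/index.html",
]:
    print(url, "->", "keep" if accept(url, compiled) else "drop")
```

The third rule is deliberately blunt: it would also drop a genuine page whose path ends in something like /text/intro, so on a real site it would need to be narrowed to the MIME types that actually show up in the crawl.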
Re: Junk Links
Posted by Mike Baranczak <mb...@gmail.com>.
This was discussed just a few days ago:
http://lucene.472066.n3.nabble.com/how-to-skip-invalid-outlinks-td1457591.html#a1457591