Posted to user@nutch.apache.org by Yavuz Selim YILMAZ <yv...@gmail.com> on 2010/09/16 14:25:23 UTC

Junk Links

Hi all,

I ran a crawl that produced links like these:

.../+location.hostname+
.../</div>
.../application/rss+xml
.../</text/css
.../application/text/css
.../text/+location.hostname+

The ".../" prefix is the same for every link.

I use the crawl with Solr; when I run a query, these similar URLs are returned.

Does anybody know how to prevent this kind of junk link from being crawled?

Thanks

--

Yavuz Selim YILMAZ

Re: Junk Links

Posted by Mike Baranczak <mb...@gmail.com>.
Take a look at the URLNormalizer plugins. 
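
A minimal sketch of the idea behind a URL normalizer, written in Python rather than as an actual Nutch plugin or configuration: each URL is rewritten to one canonical form before it is stored, so ".../aaa/" and ".../aaa" collapse into a single entry. The function name `normalize_url` and the example.com URLs are made up for illustration, and stripping the trailing slash assumes that both forms serve the same page on the crawled site, which is not guaranteed in general.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return url with trailing slashes removed from a non-root path."""
    parts = urlsplit(url)
    path = parts.path
    # Keep the root path "/" intact; only strip slashes from deeper paths.
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


# Hypothetical URLs standing in for the elided ".....aaa/" style entries.
urls = [
    "http://example.com/aaa/",
    "http://example.com/aaa",
    "http://example.com/bbb/",
    "http://example.com/ccc",
]
# dict.fromkeys removes duplicates while keeping first-seen order.
unique = list(dict.fromkeys(normalize_url(u) for u in urls))
print(unique)
```

Because the rewrite looks at one URL at a time, it does not need to know which other URLs were crawled; the duplicates disappear simply because both forms map to the same string.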


On Sep 23, 2010, at 4:03 AM, Yavuz Selim YILMAZ wrote:

> Another question:
>
> I have these kinds of URLs:
>
> .....aaa/
> .....aaa
> .....bbb/
> .....ccc
> .....ddd/
> .....ddd
>
> There are duplicates like that.
>
> What I'm trying to explain is that some of them are duplicates, but there are
> links like "...bbb/" and "...ccc" which are valid.
>
> How can I handle this kind of problem?
>
> I don't think it can be handled by url-regex, because that only checks the
> URL itself; it cannot know whether the same URL with or without the slash
> has also been crawled.
>
> Any ideas?
> 
> --
> 
> Yavuz Selim YILMAZ
> 

Re: Junk Links

Posted by Yavuz Selim YILMAZ <yv...@gmail.com>.
Another question:

I have these kinds of URLs:

.....aaa/
.....aaa
.....bbb/
.....ccc
.....ddd/
.....ddd

There are duplicates like that.

What I'm trying to explain is that some of them are duplicates, but there are
links like "...bbb/" and "...ccc" which are valid.

How can I handle this kind of problem?

I don't think it can be handled by url-regex, because that only checks the
URL itself; it cannot know whether the same URL with or without the slash
has also been crawled.

Any ideas?

--

Yavuz Selim YILMAZ
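
The point above, that a rule applied to one URL at a time cannot tell whether the other form of that URL was also crawled, can be illustrated with a short Python sketch. It is not Nutch code: `find_slash_pairs` and the example.com URLs are hypothetical, and the sketch assumes the full list of crawled URLs is available in one place.

```python
def find_slash_pairs(urls):
    """Return URLs (without trailing slash) that occur both with and without it."""
    seen = set(urls)
    pairs = []
    for url in sorted(seen):
        # A pair exists when both "x" and "x/" are present in the crawl.
        if not url.endswith("/") and url + "/" in seen:
            pairs.append(url)
    return pairs


# Hypothetical URLs standing in for the elided entries listed above.
urls = [
    "http://example.com/aaa/",
    "http://example.com/aaa",
    "http://example.com/bbb/",
    "http://example.com/ccc",
    "http://example.com/ddd/",
    "http://example.com/ddd",
]
print(find_slash_pairs(urls))
```

Only "aaa" and "ddd" are reported here, since "bbb/" and "ccc" each appear in a single form. The check needs the whole set of URLs, which is exactly what a per-URL filter does not have.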


2010/9/17 Yavuz Selim YILMAZ <yv...@gmail.com>

> Thanks Mike,
> I missed that. Configuring *url regex* may be helpful, then.
> --
>
> Yavuz Selim YILMAZ
>
>
> 2010/9/16 Mike Baranczak <mb...@gmail.com>
>
>> This was discussed just a few days ago:
>>
>>
>> http://lucene.472066.n3.nabble.com/how-to-skip-invalid-outlinks-td1457591.html#a1457591

Re: Junk Links

Posted by Yavuz Selim YILMAZ <yv...@gmail.com>.
Thanks Mike,
I missed that. Configuring *url regex* may be helpful, then.
--

Yavuz Selim YILMAZ
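
As a rough illustration of what regex-based URL filtering could look like for the junk links in this thread, here is a Python sketch. It is not Nutch configuration; it only mimics the idea of ordered accept/reject rules where the first matching rule decides. The rule list, the `accept_url` helper, and the example.com URLs are assumptions made up for this example, and the patterns would need tuning against real crawl data.

```python
import re

# Ordered (accept, pattern) rules; the first pattern that matches decides.
RULES = [
    # Leaked HTML markup such as "</div>" inside the URL.
    (False, re.compile(r"[<>]")),
    # Unevaluated JavaScript expression copied into the link.
    (False, re.compile(r"\+location\.hostname\+")),
    # MIME types that were parsed as if they were relative paths.
    (False, re.compile(r"/(application|text)/(rss\+xml|css)$")),
    # Accept anything that none of the rules above rejected.
    (True, re.compile(r".")),
]


def accept_url(url: str) -> bool:
    """Return True if the URL should be kept, False if it looks like junk."""
    for accept, pattern in RULES:
        if pattern.search(url):
            return accept
    return False  # no rule matched at all (for example, an empty string)


# Hypothetical URLs; the real prefix is elided as ".../" in the thread.
candidates = [
    "http://example.com/+location.hostname+",
    "http://example.com/</div>",
    "http://example.com/application/rss+xml",
    "http://example.com/application/text/css",
    "http://example.com/news/article.html",
]
kept = [u for u in candidates if accept_url(u)]
print(kept)
```

The third rule is deliberately narrow, but it would also reject a genuine page whose path happens to end in "/text/css", so it is worth checking such patterns against the site before relying on them.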


2010/9/16 Mike Baranczak <mb...@gmail.com>

> This was discussed just a few days ago:
>
>
> http://lucene.472066.n3.nabble.com/how-to-skip-invalid-outlinks-td1457591.html#a1457591

Re: Junk Links

Posted by Mike Baranczak <mb...@gmail.com>.
This was discussed just a few days ago:

http://lucene.472066.n3.nabble.com/how-to-skip-invalid-outlinks-td1457591.html#a1457591


