Posted to user@nutch.apache.org by webdev1977 <we...@gmail.com> on 2012/09/18 15:20:30 UTC

Relative urls - outlinks

Is there any way to keep Nutch from generating outlinks for RELATIVE URLs?
I basically don't want to use ANY relative URLs that I find.

Then the next question is: how do I get them out of my crawldb? :-)



--
View this message in context: http://lucene.472066.n3.nabble.com/Relative-urls-outlinks-tp4008601.html
Sent from the Nutch - User mailing list archive at Nabble.com.

RE: Relative urls - outlinks

Posted by webdev1977 <we...@gmail.com>.
NOOOooooo!!!  Just kidding! :-)  

So maybe you can clear something up for me.  In the future, while building a
new crawldb, if I only wanted to accept URLs like the following:

http://myhost:81/site1/test.php?id=1234
http://myhost:81/site1/list.php?page=1234&count=21
http://myhost:81/site1/view.php?id=1234
http://myhost:81/site2/test2.php?id=12233
http://myhost:81/site2/list.php?page=25&count=12344

file:////sharedrive1/share1/

What would the regex-urlfilter rules look like for the PHP pages?

+^http://myhost:81/site1/test.php\?.*    ??? 
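One way to sanity-check candidate rules before a crawl is to emulate the filter outside Nutch. The sketch below is only an approximation in Python, not Nutch's own code: it assumes rules are tried top to bottom, that the first matching pattern decides, and that a URL matching nothing is rejected. The rule set itself is a hypothetical guess built from the URLs listed above (note the escaped dot in `\.php`), and Python's `re` stands in for Java regular expressions.

```python
import re

# Hypothetical rule set in the style of regex-urlfilter.txt.
# '+' accepts, '-' rejects; rules are tried top to bottom.
RULES = r"""
# PHP pages on the two sites from the question
+^http://myhost:81/site1/(test|list|view)\.php\?
+^http://myhost:81/site2/(test2|list)\.php\?
# the shared drive
+^file:////sharedrive1/share1/
# reject everything else
-.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First matching rule decides; no match means the URL is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in [
        "http://myhost:81/site1/test.php?id=1234",
        "http://myhost:81/site2/list.php?page=25&count=12344",
        "http://myhost:81/site1/other.php?id=1",
        "file:////sharedrive1/share1/",
    ]:
        print("accept" if accepts(rules, url) else "reject", url)
```

Grouping the page names in one alternation keeps the rule list short; one rule per page would work just as well under the same assumptions.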






RE: Relative urls - outlinks

Posted by Markus Jelsma <ma...@openindex.io>.
No, relative URLs are resolved to absolute URLs in both parser plugins. You can try to disable that manually. There's no way to remove them from the CrawlDB except with some clever filtering, because they are stored as absolute URLs now.
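To illustrate why the CrawlDB entries can no longer be told apart, here is a small sketch of relative-link resolution. It uses Python's `urljoin` as a stand-in for whatever the parser plugins actually do, and the base page and hrefs are made-up examples based on the URLs earlier in this thread.

```python
from urllib.parse import urljoin


def resolve_outlinks(base_url, hrefs):
    """Resolve each href against the page URL, as a parser would for outlinks."""
    return [urljoin(base_url, href) for href in hrefs]


if __name__ == "__main__":
    # Hypothetical page and links, for illustration only.
    base = "http://myhost:81/site1/list.php?page=1234&count=21"
    hrefs = [
        "view.php?id=1234",                          # relative to the current directory
        "/site2/test2.php?id=12233",                 # relative to the host root
        "http://myhost:81/site1/test.php?id=1234",   # already absolute
    ]
    for href, url in zip(hrefs, resolve_outlinks(base, hrefs)):
        print(href, "->", url)
```

After resolution, a link that was relative in the page looks the same as one that was written out in full, so any cleanup has to filter on the URL pattern itself.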

 
 