
Posted to user@nutch.apache.org by Shobhit <sh...@gmail.com> on 2015/01/27 11:16:27 UTC

OutlinkExtractor is not considering the relative URLs as outlinks.

Hi All,

I am trying to crawl web pages with Nutch 2.1, but relative URLs are not
extracted as outlinks when the HTML content of a page is parsed.

Environment details:

Nutch-2.1
Hadoop-0.20.205
HBase-0.90.6
hbase-gora-0.2.1


The web page contains relative URLs such as:

</audios/?id=1234123>
</audios/?id=1234124>
</audios/?id=1234125>
</audios/?id=1334126>

When I try to extract the outlinks from the page, these URLs are skipped by
the outlink extraction logic because of the regex pattern below, which is
defined in OutlinkExtractor.java:

private static final String URL_PATTERN =
    "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)";
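
To illustrate why this pattern cannot match a relative URL on its own, here is a minimal standalone sketch. It is not Nutch code: the class name UrlPatternDemo, the helper firstMatch, and the host example.com are made up for the example. Only the pattern string is taken from the message. The pattern requires a scheme followed by a colon, so a string with no scheme never matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
public class UrlPatternDemo {

    // Pattern string as quoted in the message above.
    public static final String URL_PATTERN =
        "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)";

    private static final Pattern PATTERN = Pattern.compile(URL_PATTERN);

    // Returns the first substring that matches the pattern, or null if none does.
    public static String firstMatch(String text) {
        Matcher m = PATTERN.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // No scheme and no colon, so the pattern finds nothing.
        System.out.println(firstMatch("/audios/?id=1234123"));

        // With a scheme and host (example.com is an assumed host), it matches.
        System.out.println(firstMatch("http://example.com/audios/?id=1234123"));
    }
}
```

So a relative link can only pass this pattern after it has been turned into an absolute URL by some other step.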

Please let me know whether I am looking at the correct file, or whether
another solution is available.
Your guidance is much appreciated.

http://www.mail-archive.com/nutch-user@lucene.apache.org/msg01285.html


Regards,
Shobhit



--
View this message in context: http://lucene.472066.n3.nabble.com/OutlinkExtractor-is-not-considering-the-relative-URLs-as-outlinks-tp4182187.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: OutlinkExtractor is not considering the relative URLs as outlinks.

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi,

> I am trying to crawl the webpages using Nutch-2.1, but I am not getting
> relative urls as outlinks when parsing the HTML content of webpage.

> Web page is having relative URLs as below :

> </audios/?id=1234123>   </audios/?id=1234124>   </audios/?id=1234125>

Is the page HTML? Then outlinks are extracted via markup, e.g.
 <a href="...">
Relative links are always made absolute.

OutlinkExtractor is only used if there are no outlinks from mark-up.
That is the case for plain text files and other formats without
marked links.
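
As a rough sketch of what "made absolute" means here: a relative href is resolved against the URL of the page it was found on. The snippet below uses only the JDK's java.net.URI and is not the code Nutch itself runs. The class name RelativeLinkDemo, the helper resolve, and the base URL on example.com are assumptions for illustration.

```java
import java.net.URI;

// Hypothetical demo class, not part of Nutch.
public class RelativeLinkDemo {

    // Resolves an href found in a page against that page's own URL.
    public static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Assumed page URL; the real one is not given in this thread.
        String page = "http://example.com/music/index.html";

        // A host-relative link replaces the whole path of the page URL.
        System.out.println(resolve(page, "/audios/?id=1234123"));
        // prints http://example.com/audios/?id=1234123

        // A path-relative link is resolved against the page's directory.
        System.out.println(resolve(page, "tracks/1.html"));
        // prints http://example.com/music/tracks/1.html
    }
}
```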

Sebastian

On 01/27/2015 03:41 PM, Shobhit wrote:
> Or if it is something like, Nutch (Tika Parser) , will convert these relative
> urls to their corresponding absolute urls as an outlinks first and then they
> will get passed from above regex-pattern. 
> 
> Regards,
> Shobhit

Re: OutlinkExtractor is not considering the relative URLs as outlinks.

Posted by Shobhit <sh...@gmail.com>.

Or is it that Nutch (the Tika parser) first converts these relative URLs to
their corresponding absolute URLs as outlinks, and only then passes them
through the regex pattern above?

Regards,
Shobhit






--
View this message in context: http://lucene.472066.n3.nabble.com/OutlinkExtractor-is-not-considering-the-relative-URLs-as-outlinks-tp4182187p4182233.html
Sent from the Nutch - User mailing list archive at Nabble.com.