Posted to user@nutch.apache.org by Shobhit <sh...@gmail.com> on 2015/01/27 11:16:27 UTC
OutlinkExtractor is not considering the relative URLs as outlinks.
Hi All,
I am trying to crawl web pages using Nutch-2.1, but I am not getting
relative URLs as outlinks when parsing the HTML content of a web page.
Below are the environment details:
Nutch-2.1
Hadoop-0.20.205
HBase-0.90.6
hbase-gora-0.2.1
The web page has relative URLs such as:
/audios/?id=1234123
/audios/?id=1234124
/audios/?id=1234125
/audios/?id=1334126
When I try to extract the outlinks from the web page, these URLs are
skipped by the outlink extraction logic because of the regex pattern below,
which is defined in OutlinkExtractor.java:
private static final String URL_PATTERN =
"([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)";
Please let me know whether I am looking at the correct file, or whether
another solution is available for this.
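For anyone who wants to reproduce the observation, here is a minimal sketch that applies the pattern quoted above to a plain string. The class name UrlPatternDemo, the helper extract() and the host example.com are made up for illustration; this is not how Nutch itself calls the pattern, only a standalone check of what the regex accepts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only URL_PATTERN is taken from the message above.
class UrlPatternDemo {
    static final String URL_PATTERN =
        "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)";

    static final Pattern PATTERN = Pattern.compile(URL_PATTERN);

    // Returns every substring of the text that the pattern matches.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = PATTERN.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // No "scheme:" prefix, so the pattern finds nothing here.
        System.out.println(extract("/audios/?id=1234123"));
        // An absolute URL (example.com is an assumed host) is matched.
        System.out.println(extract("http://example.com/audios/?id=1234123"));
    }
}
```

The pattern requires a scheme followed by a colon, so a string that starts with "/" can never match it on its own.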
Your guidance is really appreciated.
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg01285.html
Regards,
Shobhit
--
View this message in context: http://lucene.472066.n3.nabble.com/OutlinkExtractor-is-not-considering-the-relative-URLs-as-outlinks-tp4182187.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: OutlinkExtractor is not considering the relative URLs as outlinks.
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,
> I am trying to crawl the webpages using Nutch-2.1, but I am not getting
> relative urls as outlinks when parsing the HTML content of webpage.
> Web page is having relative URLs as below :
> </audios/?id=1234123> </audios/?id=1234124> </audios/?id=1234125>
Is the page HTML? Then outlinks are extracted via markup, e.g.
<a href="...">
Relative links are always made absolute.
OutlinkExtractor is only used if there are no outlinks from mark-up.
That is the case for plain text files and other formats without
marked links.
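To illustrate what "made absolute" means for the links in question, here is a small sketch using the standard java.net.URI class. It is not Nutch's parser code, which is not shown in this thread; the class name RelativeLinkDemo, the helper resolve() and the base URL on example.com are assumptions chosen for the example.

```java
import java.net.URI;

// Hypothetical demo class, not part of Nutch.
class RelativeLinkDemo {
    // Resolves an href found in a page against the URL of that page.
    static String resolve(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Assumed page URL; the thread does not give the real one.
        String base = "http://example.com/music/index.html";
        System.out.println(resolve(base, "/audios/?id=1234123"));
        System.out.println(resolve(base, "track.html"));
    }
}
```

Under these assumptions the first call yields http://example.com/audios/?id=1234123, i.e. the scheme and host come from the page URL and the path and query come from the href.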
Sebastian
On 01/27/2015 03:41 PM, Shobhit wrote:
> Or is it that Nutch (the Tika parser) first converts these relative URLs
> to their corresponding absolute URLs as outlinks, and only then are they
> matched against the regex pattern above?
>
> Regards,
> Shobhit
Re: OutlinkExtractor is not considering the relative URLs as outlinks.
Posted by Shobhit <sh...@gmail.com>.
Or is it that Nutch (the Tika parser) first converts these relative URLs
to their corresponding absolute URLs as outlinks, and only then are they
matched against the regex pattern above?
Regards,
Shobhit
--
View this message in context: http://lucene.472066.n3.nabble.com/OutlinkExtractor-is-not-considering-the-relative-URLs-as-outlinks-tp4182187p4182233.html
Sent from the Nutch - User mailing list archive at Nabble.com.