Posted to dev@nutch.apache.org by "Tejas Patil (JIRA)" <ji...@apache.org> on 2013/04/30 22:56:16 UTC
[jira] [Closed] (NUTCH-1329) parser not extract outlinks to external web sites
[ https://issues.apache.org/jira/browse/NUTCH-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tejas Patil closed NUTCH-1329.
------------------------------
Resolution: Cannot Reproduce
Closing for now by marking it "cannot reproduce"
> parser not extract outlinks to external web sites
> -------------------------------------------------
>
> Key: NUTCH-1329
> URL: https://issues.apache.org/jira/browse/NUTCH-1329
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.4
> Reporter: behnam nikbakht
> Labels: parse
> Fix For: 2.3, 1.8
>
>
> I found a bug in /src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java: outlinks such as www.example2.com found on www.example1.com are inserted as www.example1.com/www.example2.com.
> I corrected this by testing whether the outlink (www.example2.com) is a valid URL on its own; if it is not, it is resolved against its base URL.
> To do so, I replaced these lines:
> URL url = URLUtil.resolveURL(base, target);
> outlinks.add(new Outlink(url.toString(),
>     linkText.toString().trim()));
> with:
> String host_temp = null;
> try {
>     // new URL(target) throws MalformedURLException when target has no known protocol
>     host_temp = URLUtil.getDomainName(new URL(target));
> } catch (Exception eiuy) {
>     host_temp = null;
> }
> URL url = null;
> if (host_temp == null) { // it is an internal outlink
>     url = URLUtil.resolveURL(base, target);
> } else { // it is an external link
>     url = new URL(target);
> }
> outlinks.add(new Outlink(url.toString(),
>     linkText.toString().trim()));
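
For readers who want to try the two approaches quoted above without a Nutch checkout, here is a minimal sketch. It is not the Nutch code: it assumes plain java.net.URL resolution as a stand-in for URLUtil.resolveURL (whose behaviour may differ), and the class and method names (OutlinkResolutionSketch, resolveAgainstBase, resolveWithAbsoluteCheck) are hypothetical. The second method also simplifies the proposed change by dropping the URLUtil.getDomainName call and keeping only the "does the target parse as a URL on its own" test.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class OutlinkResolutionSketch {

    // Stand-in for the original lines: resolve the target against the base
    // using java.net.URL (an assumption, not Nutch's URLUtil.resolveURL).
    static String resolveAgainstBase(URL base, String target)
            throws MalformedURLException {
        return new URL(base, target).toString();
    }

    // Simplified version of the proposed change: use the target as-is when it
    // parses as an absolute URL, otherwise fall back to the base URL.
    static String resolveWithAbsoluteCheck(URL base, String target)
            throws MalformedURLException {
        URL url;
        try {
            url = new URL(target);        // fails when target has no protocol
        } catch (MalformedURLException e) {
            url = new URL(base, target);  // treated as an internal outlink
        }
        return url.toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://www.example1.com/");
        String[] targets = { "www.example2.com", "http://www.example2.com/" };
        for (String target : targets) {
            System.out.println(target
                + " | base-relative: " + resolveAgainstBase(base, target)
                + " | absolute-check: " + resolveWithAbsoluteCheck(base, target));
        }
    }
}
```

Under these assumptions both methods return the same result for each input. A target with a protocol (http://www.example2.com/) stays external either way. A target without one (www.example2.com) cannot be parsed on its own, so it is resolved against the base in both cases.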