Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2013/01/12 20:18:14 UTC
[jira] [Updated] (NUTCH-1329) parser not extract outlinks to external web sites
[ https://issues.apache.org/jira/browse/NUTCH-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1329:
----------------------------------------
Fix Version/s: 2.2, 1.7
> parser not extract outlinks to external web sites
> -------------------------------------------------
>
> Key: NUTCH-1329
> URL: https://issues.apache.org/jira/browse/NUTCH-1329
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.4
> Reporter: behnam nikbakht
> Labels: parse
> Fix For: 1.7, 2.2
>
>
> I found a bug in /src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java: an outlink such as www.example2.com found on www.example1.com is inserted as www.example1.com/www.example2.com.
> I corrected this by first testing whether the outlink (www.example2.com) is a valid URL on its own; if it is, it is used as-is, otherwise it is resolved against its base URL.
> So I replaced these lines:
>     URL url = URLUtil.resolveURL(base, target);
>     outlinks.add(new Outlink(url.toString(),
>             linkText.toString().trim()));
> with:
>     String host_temp = null;
>     try {
>         // new URL(target) throws if target is not an absolute URL
>         host_temp = URLUtil.getDomainName(new URL(target));
>     } catch (Exception e) {
>         host_temp = null;
>     }
>     URL url = null;
>     if (host_temp == null) { // it is an internal outlink
>         url = URLUtil.resolveURL(base, target);
>     } else { // it is an external link
>         url = new URL(target);
>     }
>     outlinks.add(new Outlink(url.toString(),
>             linkText.toString().trim()));
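
To make the check described above easier to reason about, here is a minimal, standalone sketch that uses only java.net.URL. The class OutlinkResolutionSketch and its helpers isAbsolute and resolve are hypothetical names introduced for illustration. resolve is an assumed stand-in for URLUtil.resolveURL and may not match Nutch's behavior in every case.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class OutlinkResolutionSketch {

    /** True if target parses on its own as an absolute URL, i.e. it carries a scheme. */
    static boolean isAbsolute(String target) {
        try {
            new URL(target);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    /** Resolves target against base with java.net.URL (assumed stand-in for URLUtil.resolveURL). */
    static String resolve(String base, String target) {
        try {
            return new URL(new URL(base), target).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("cannot resolve " + target + " against " + base, e);
        }
    }

    public static void main(String[] args) {
        String base = "http://www.example1.com/";
        String[] targets = {"www.example2.com", "http://www.example2.com/", "/about.html"};
        for (String target : targets) {
            // Mirror the proposed logic: use absolute targets as-is, resolve the rest.
            String outlink = isAbsolute(target) ? target : resolve(base, target);
            System.out.println(target + " -> " + outlink);
        }
    }
}
```

With java.net.URL, a target written without a scheme (www.example2.com) does not parse as an absolute URL, so under this sketch it still takes the "internal" branch and resolves to http://www.example1.com/www.example2.com. Only targets that carry a scheme, such as http://www.example2.com/, take the "external" branch.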
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira