Posted to dev@nutch.apache.org by "Jerome Charron (JIRA)" <ji...@apache.org> on 2005/08/19 23:22:56 UTC
[jira] Closed: (NUTCH-20) Extract urls from plain texts
[ http://issues.apache.org/jira/browse/NUTCH-20?page=all ]
Jerome Charron closed NUTCH-20:
-------------------------------
Fix Version: 0.8-dev
Resolution: Fixed
Revision 233559 - http://svn.apache.org/viewcvs.cgi?rev=233559&view=rev
* Add utility to extract urls from plain text (thanks to Stephan Strittmatter)
* Use the OutlinkExtractor in the parse plugins PDF, MSWord, Text, RTF, Ext
Note: Take a look at the JSParseFilter to see how the OutlinkExtractor could be used in it.
> Extract urls from plain texts
> ------------------------------
>
> Key: NUTCH-20
> URL: http://issues.apache.org/jira/browse/NUTCH-20
> Project: Nutch
> Type: Improvement
> Components: fetcher
> Reporter: Stefan Grroschupf
> Priority: Trivial
> Fix For: 0.8-dev
> Attachments: OutlinkExtractor.java, OutlinkExtractor.java, OutlinkExtractor.java, TestOutlink.java, TestOutlink.java, patch.txt
>
> Some parsers return no Outlinks, e.g. the Word parser.
> This class extracts (absolute) hyperlinks from a plain String (content) and generates outlinks from them.
> This would be very useful for parsers that have no explicit extraction of hyperlinks.
> Example:
> Outlink[] links = OutlinkExtractor.getOutlinks("Nutch is located at http://www.apache.org and ...");
> This returns an array of Outlinks containing the single element "http://www.apache.org".
> ----
> transferred from: http://sourceforge.net/tracker/index.php?func=detail&aid=1109328&group_id=59548&atid=491356
> submitted by: Stephan Strittmatter
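The idea described in the issue, pulling absolute hyperlinks out of a plain String, can be illustrated with a minimal sketch. This is not the attached OutlinkExtractor.java. The class name PlainTextUrlSketch, the method extractUrls, and the regular expression are assumptions made for illustration, and the sketch returns plain strings rather than Nutch Outlink objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the OutlinkExtractor attached to NUTCH-20.
class PlainTextUrlSketch {

    // Assumption: only absolute http, https, and ftp URLs are of interest.
    // A URL is taken to run until whitespace, a quote, or an angle bracket.
    private static final Pattern URL_PATTERN =
            Pattern.compile("\\b(?:https?|ftp)://[^\\s<>\"']+", Pattern.CASE_INSENSITIVE);

    // Returns the distinct absolute URLs found in the text, in order of appearance.
    public static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        if (text == null) {
            return urls;
        }
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            String url = stripTrailingPunctuation(matcher.group());
            if (!urls.contains(url)) {
                urls.add(url);
            }
        }
        return urls;
    }

    // Sentence punctuation directly after a URL is usually not part of it.
    private static String stripTrailingPunctuation(String url) {
        int end = url.length();
        while (end > 0 && ".,;:!?)".indexOf(url.charAt(end - 1)) >= 0) {
            end--;
        }
        return url.substring(0, end);
    }

    public static void main(String[] args) {
        String content = "Nutch is located at http://www.apache.org and ...";
        System.out.println(extractUrls(content)); // prints [http://www.apache.org]
    }
}
```

A regular expression keeps the sketch short but is only a heuristic: relative links and URLs wrapped across lines are not found, and stripping a trailing ")" can truncate a URL that legitimately ends in one.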