- [jira] Updated: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt - posted by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/05/01 10:42:15 UTC, 0 replies.
- [jira] Commented: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt - posted by "Sami Siren (JIRA)" <ji...@apache.org> on 2007/05/01 11:03:15 UTC, 1 replies.
- Build failed in Hudson: Nutch-Nightly #74 - posted by hu...@lucene.zones.apache.org on 2007/05/03 09:00:10 UTC, 0 replies.
- Nutch - Filtering (REGEX) - posted by simon_ece <si...@yahoo.com> on 2007/05/03 09:36:11 UTC, 0 replies.
- [jira] Created: (NUTCH-477) Extend URLFilters to support different filtering chains - posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/05/03 23:53:15 UTC, 0 replies.
- [jira] Updated: (NUTCH-477) Extend URLFilters to support different filtering chains - posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/05/03 23:55:15 UTC, 0 replies.
- Hudson build is back to normal: Nutch-Nightly #75 - posted by hu...@lucene.zones.apache.org on 2007/05/04 09:05:23 UTC, 0 replies.
- [jira] Created: (NUTCH-478) Add function for stopping FetherThread gracefully - posted by "chee.wu (JIRA)" <ji...@apache.org> on 2007/05/05 08:27:15 UTC, 0 replies.
- SIGSEGV - posted by Brian Whitman <br...@variogr.am> on 2007/05/05 23:59:09 UTC, 6 replies.
- [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser - posted by "Antonio Eggberg (JIRA)" <ji...@apache.org> on 2007/05/07 08:06:16 UTC, 3 replies.
- Re: How to install Nutch on Freebsd? - posted by Nuther <nu...@proservice.ge> on 2007/05/07 08:59:57 UTC, 2 replies.
- Who of most pages indexed by means of it nutch and how many? - posted by mr_max <mr...@bk.ru> on 2007/05/07 10:17:50 UTC, 0 replies.
- And where it is possible to esteem about all opportunities nutch? - posted by mr_max <mr...@bk.ru> on 2007/05/07 10:20:37 UTC, 0 replies.
- And if nutch it would be written on With С++ worked more quickly? - posted by mr_max <mr...@bk.ru> on 2007/05/07 10:21:20 UTC, 0 replies.
- Scope-based crawling and indexing - posted by Vikas <vi...@hotmail.com> on 2007/05/07 14:47:14 UTC, 0 replies.
- [jira] Created: (NUTCH-479) Support for OR queries - posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/05/07 21:15:15 UTC, 0 replies.
- [jira] Updated: (NUTCH-479) Support for OR queries - posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/05/07 21:18:15 UTC, 1 replies.
- [jira] Created: (NUTCH-480) Searching multiple indexes with a single nutch instance - posted by "Ravi Chintakunta (JIRA)" <ji...@apache.org> on 2007/05/08 03:11:15 UTC, 0 replies.
- [jira] Updated: (NUTCH-480) Searching multiple indexes with a single nutch instance - posted by "Ravi Chintakunta (JIRA)" <ji...@apache.org> on 2007/05/08 03:13:15 UTC, 0 replies.
- Document Classification - indexing question - posted by Bastian Preindl <ba...@preindl.net> on 2007/05/08 12:30:56 UTC, 3 replies.
- Build failed in Hudson: Nutch-Nightly #80 - posted by hu...@lucene.zones.apache.org on 2007/05/09 09:00:15 UTC, 0 replies.
- [jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser - posted by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/05/09 10:47:15 UTC, 3 replies.
- [jira] Commented: (NUTCH-470) Adding optional terms to a query - posted by "Ronny Næss (JIRA)" <ji...@apache.org> on 2007/05/09 15:34:16 UTC, 1 replies.
- Re: [jira] Updated: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9 - posted by Mike Schwartz <mf...@gmail.com> on 2007/05/09 15:36:35 UTC, 1 replies.
- how is crawl-urlfilter.txt taken care of? - posted by Manoharam Reddy <ma...@gmail.com> on 2007/05/09 17:00:43 UTC, 1 replies.
- [jira] Updated: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9 - posted by "Sami Siren (JIRA)" <ji...@apache.org> on 2007/05/09 18:39:15 UTC, 1 replies.
- [jira] Commented: (NUTCH-477) Extend URLFilters to support different filtering chains - posted by "Sami Siren (JIRA)" <ji...@apache.org> on 2007/05/09 19:16:15 UTC, 0 replies.
- [jira] Commented: (NUTCH-472) NullPointerException in ZipTextExtractor if no MIME type for zipped file - posted by "Sami Siren (JIRA)" <ji...@apache.org> on 2007/05/09 19:20:15 UTC, 2 replies.
- [jira] Assigned: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser - posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/05/09 19:24:15 UTC, 0 replies.
- [jira] Commented: (NUTCH-476) Would like to add a field to the document class for its MD5 signature - posted by "Sami Siren (JIRA)" <ji...@apache.org> on 2007/05/09 19:42:15 UTC, 0 replies.
- [jira] Resolved: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser - posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/05/09 20:03:15 UTC, 0 replies.
- [jira] Resolved: (NUTCH-467) DeleteDuplicate fails if Segment index directory has 0 documents - posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/05/09 20:05:15 UTC, 0 replies.
- Re: svn commit: r536606 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/metadata/ src/java/org/apache/nutch/parse/ src/java/org/apache/nutch/util/ src/plugin/creativecommons/src/test/org/creativecommons/nutch/ src/... - posted by Sami Siren <ss...@gmail.com> on 2007/05/09 20:21:04 UTC, 1 replies.
- [jira] Closed: (NUTCH-418) Fixes parsing of XHTML (e.g. title) - posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/05/09 20:40:15 UTC, 0 replies.
- [jira] Closed: (NUTCH-417) After upgrade to hadoop-0.9.1, parsing and indexing doesn't work. - posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/05/09 20:44:15 UTC, 0 replies.
- [jira] Commented: (NUTCH-393) Indexer doesn't handle null documents returned by filters - posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/05/09 20:51:15 UTC, 0 replies.
- [jira] Resolved: (NUTCH-393) Indexer doesn't handle null documents returned by filters - posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/05/09 21:38:16 UTC, 0 replies.
- Recrawl help - posted by karthik085 <ka...@gmail.com> on 2007/05/09 21:41:20 UTC, 0 replies.
- [jira] Commented: (NUTCH-479) Support for OR queries - posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/05/09 23:48:15 UTC, 0 replies.
- Hudson build is back to normal: Nutch-Nightly #81 - posted by hu...@lucene.zones.apache.org on 2007/05/10 09:07:34 UTC, 0 replies.
- [jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility - posted by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/05/10 14:58:15 UTC, 6 replies.
- [jira] Updated: (NUTCH-424) CLONE - Problem persists with Nutch 0.8.1 (Nekohtml 0.9.4) - NekoHTML's DOMFragmentParser hangs on certain URLs - posted by "Mike Brzozowski (JIRA)" <ji...@apache.org> on 2007/05/10 18:16:15 UTC, 0 replies.
- [jira] Commented: (NUTCH-424) CLONE - Problem persists with Nutch 0.8.1 (Nekohtml 0.9.4) - NekoHTML's DOMFragmentParser hangs on certain URLs - posted by "Mike Brzozowski (JIRA)" <ji...@apache.org> on 2007/05/10 18:16:15 UTC, 0 replies.
- [jira] Resolved: (NUTCH-456) parse msexcel plugin speedup - posted by "Sami Siren (JIRA)" <ji...@apache.org> on 2007/05/10 18:16:16 UTC, 0 replies.
- [jira] Assigned: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt - posted by "Sami Siren (JIRA)" <ji...@apache.org> on 2007/05/10 18:18:15 UTC, 0 replies.
- [jira] Updated: (NUTCH-424) NekoHTML's DOMFragmentParser hangs on certain URLs (CLONE: Problem persists with Nutch 0.9 and 0.8.1 (Nekohtml 0.9.4)) - posted by "Mike Brzozowski (JIRA)" <ji...@apache.org> on 2007/05/10 18:18:16 UTC, 0 replies.
- [jira] Commented: (NUTCH-424) NekoHTML's DOMFragmentParser hangs on certain URLs (CLONE: Problem persists with Nutch 0.9 and 0.8.1 (Nekohtml 0.9.4)) - posted by "Mike Brzozowski (JIRA)" <ji...@apache.org> on 2007/05/10 18:27:15 UTC, 0 replies.
- [jira] Resolved: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt - posted by "Sami Siren (JIRA)" <ji...@apache.org> on 2007/05/10 18:32:15 UTC, 0 replies.
- Will any Nutch/Lucene folks be at the Enterprise Search Summit in week in New York? - posted by Michael McIntosh <mi...@tnrglobal.com> on 2007/05/11 17:17:03 UTC, 0 replies.
- [jira] Created: (NUTCH-481) http.content.limit is broken in the protocol-httpclient plugin - posted by "charlie wanek (JIRA)" <ji...@apache.org> on 2007/05/11 20:24:15 UTC, 0 replies.
- [jira] Updated: (NUTCH-481) http.content.limit is broken in the protocol-httpclient plugin - posted by "charlie wanek (JIRA)" <ji...@apache.org> on 2007/05/11 20:41:15 UTC, 0 replies.
- [jira] Created: (NUTCH-482) Remove redundant plugin lib-log4j - posted by "Sami Siren (JIRA)" <ji...@apache.org> on 2007/05/12 09:54:15 UTC, 0 replies.
- [jira] Created: (NUTCH-483) remove redundant commons-logging jar from ontology plugin - posted by "Sami Siren (JIRA)" <ji...@apache.org> on 2007/05/12 09:56:15 UTC, 0 replies.
- Site nightly API link is broken - posted by Gal Nitzan <ga...@gmail.com> on 2007/05/12 10:00:20 UTC, 3 replies.
- [jira] Created: (NUTCH-484) Nutch Nightly API link is broken in site - posted by "Gal Nitzan (JIRA)" <ji...@apache.org> on 2007/05/12 11:01:15 UTC, 0 replies.
- [jira] Updated: (NUTCH-484) Nutch Nightly API link is broken in site - posted by "Gal Nitzan (JIRA)" <ji...@apache.org> on 2007/05/12 11:06:15 UTC, 0 replies.
- [jira] Created: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object - posted by "Gal Nitzan (JIRA)" <ji...@apache.org> on 2007/05/12 21:50:15 UTC, 0 replies.
- [jira] Updated: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object - posted by "Gal Nitzan (JIRA)" <ji...@apache.org> on 2007/05/12 22:00:15 UTC, 4 replies.
- [jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object - posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/05/12 23:55:15 UTC, 2 replies.
- [jira] Resolved: (NUTCH-484) Nutch Nightly API link is broken in site - posted by "Sami Siren (JIRA)" <ji...@apache.org> on 2007/05/13 16:56:15 UTC, 0 replies.
- [jira] Updated: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility - posted by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/05/13 18:01:15 UTC, 0 replies.
- [jira] Reopened: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser - posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2007/05/13 18:23:15 UTC, 0 replies.
- [jira] Resolved: (NUTCH-482) Remove redundant plugin lib-log4j - posted by "Sami Siren (JIRA)" <ji...@apache.org> on 2007/05/14 16:38:17 UTC, 0 replies.
- [jira] Resolved: (NUTCH-483) remove redundant commons-logging jar from ontology plugin - posted by "Sami Siren (JIRA)" <ji...@apache.org> on 2007/05/14 16:52:17 UTC, 0 replies.
- [jira] Resolved: (NUTCH-457) Create top level dist directory and checkin KEYS file to subversion be standard with Lucene Java and Hadoop - posted by "Sami Siren (JIRA)" <ji...@apache.org> on 2007/05/14 17:16:16 UTC, 0 replies.
- [jira] Created: (NUTCH-486) Break searcher dependency on commons-cli - posted by "Mark Woon (JIRA)" <ji...@apache.org> on 2007/05/15 01:36:16 UTC, 0 replies.
- [jira] Commented: (NUTCH-486) Break searcher dependency on commons-cli - posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/05/15 08:22:16 UTC, 0 replies.
- [jira] Updated: (NUTCH-161) Change Plain text parser to use parser.character.encoding.default property for fall back encoding - posted by "Sami Siren (JIRA)" <ji...@apache.org> on 2007/05/15 20:32:16 UTC, 0 replies.
- [jira] Resolved: (NUTCH-161) Change Plain text parser to use parser.character.encoding.default property for fall back encoding - posted by "Sami Siren (JIRA)" <ji...@apache.org> on 2007/05/15 20:32:16 UTC, 0 replies.
- Re: Issues pending before 0.9 release - posted by rubdabadub <ru...@gmail.com> on 2007/05/17 06:21:26 UTC, 1 replies.
- bug in SegmentReader - posted by Ilya Vishnevsky <Il...@e-legion.com> on 2007/05/21 10:42:20 UTC, 0 replies.
- Bug (with fix): Neko HTML parser goes on defaults. - posted by Marcin Okraszewski <ok...@o2.pl> on 2007/05/21 12:45:51 UTC, 2 replies.
- [jira] Created: (NUTCH-487) Neko HTML parser goes on default settings. - posted by "Marcin Okraszewski (JIRA)" <ji...@apache.org> on 2007/05/21 16:06:16 UTC, 0 replies.
- [jira] Updated: (NUTCH-487) Neko HTML parser goes on default settings. - posted by "Marcin Okraszewski (JIRA)" <ji...@apache.org> on 2007/05/21 16:06:16 UTC, 0 replies.
- [jira] Commented: (NUTCH-25) needs 'character encoding' detector - posted by "Doug Cook (JIRA)" <ji...@apache.org> on 2007/05/21 18:34:16 UTC, 2 replies.
- [jira] Updated: (NUTCH-25) needs 'character encoding' detector - posted by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/05/21 22:48:16 UTC, 0 replies.
- [jira] Created: (NUTCH-488) Avoid parsing uneccessary links and get a more relevant outlink list - posted by "Emmanuel Joke (JIRA)" <ji...@apache.org> on 2007/05/22 09:38:16 UTC, 0 replies.
- [jira] Updated: (NUTCH-488) Avoid parsing uneccessary links and get a more relevant outlink list - posted by "Emmanuel Joke (JIRA)" <ji...@apache.org> on 2007/05/22 09:40:16 UTC, 1 replies.
- [jira] Created: (NUTCH-489) URLFilter-suffix management of the url path when the url contains some query parameters - posted by "Emmanuel Joke (JIRA)" <ji...@apache.org> on 2007/05/22 10:35:16 UTC, 0 replies.
- [jira] Updated: (NUTCH-489) URLFilter-suffix management of the url path when the url contains some query parameters - posted by "Emmanuel Joke (JIRA)" <ji...@apache.org> on 2007/05/22 10:37:16 UTC, 1 replies.
- [jira] Commented: (NUTCH-489) URLFilter-suffix management of the url path when the url contains some query parameters - posted by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/05/22 11:23:16 UTC, 2 replies.
- [jira] Created: (NUTCH-490) Extension point with filters for Neko HTML parser (with patch) - posted by "Marcin Okraszewski (JIRA)" <ji...@apache.org> on 2007/05/22 14:18:16 UTC, 0 replies.
- [jira] Updated: (NUTCH-490) Extension point with filters for Neko HTML parser (with patch) - posted by "Marcin Okraszewski (JIRA)" <ji...@apache.org> on 2007/05/22 14:18:17 UTC, 1 replies.
- [jira] Commented: (NUTCH-427) protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implmentation. - posted by "Vadim Bauer (JIRA)" <ji...@apache.org> on 2007/05/22 14:37:16 UTC, 0 replies.
- IntelliJ & Eclipse Lucene code styles available - posted by Otis Gospodnetic <ot...@yahoo.com> on 2007/05/23 08:20:38 UTC, 0 replies.
- Get meta name="description" and other meta tags from Content - posted by Yakn <bo...@yahoo.com> on 2007/05/23 17:02:21 UTC, 1 replies.
- [jira] Created: (NUTCH-491) dedup fails with ArrayIndexOutOfBoundsException - posted by "Nicolás Lichtmaier (JIRA)" <ji...@apache.org> on 2007/05/23 18:53:16 UTC, 0 replies.
- [jira] Commented: (NUTCH-491) dedup fails with ArrayIndexOutOfBoundsException - posted by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/05/24 13:55:17 UTC, 0 replies.
- NUTCH-348 and Nutch-0.7.2 - posted by karthik085 <ka...@gmail.com> on 2007/05/24 16:01:05 UTC, 1 replies.
- [jira] Updated: (NUTCH-427) protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implmentation. - posted by "Vadim Bauer (JIRA)" <ji...@apache.org> on 2007/05/25 23:05:16 UTC, 1 replies.
- [jira] Created: (NUTCH-492) java.lang.OutOfMemoryError while indexing. - posted by "Nicolás Lichtmaier (JIRA)" <ji...@apache.org> on 2007/05/27 01:42:16 UTC, 0 replies.
- [jira] Work started: (NUTCH-466) Flexible segment format - posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/05/28 11:01:29 UTC, 0 replies.
- proposal for committer - posted by Gal Nitzan <ga...@gmail.com> on 2007/05/28 14:32:54 UTC, 2 replies.
- Plugins initialized all the time! - posted by Nicolás Lichtmaier <ni...@reloco.com.ar> on 2007/05/28 22:47:54 UTC, 12 replies.
- running nutch without http proxy - posted by prem kumar <pr...@gmail.com> on 2007/05/29 16:03:06 UTC, 1 replies.
- [jira] Created: (NUTCH-493) contentType parse not correctly,,,,got empty content using readseg -get - posted by "wangxu (JIRA)" <ji...@apache.org> on 2007/05/30 02:05:15 UTC, 0 replies.
- Committer - posted by Chris Mattmann <ch...@jpl.nasa.gov> on 2007/05/30 15:42:19 UTC, 0 replies.
- OutOfMemoryError - Why should the while(1) loop stop? - posted by Manoharam Reddy <ma...@gmail.com> on 2007/05/30 16:55:57 UTC, 1 replies.
- [jira] Resolved: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content - posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/05/30 20:37:16 UTC, 2 replies.
- Build failed in Hudson: Nutch-Nightly #102 - posted by hu...@lucene.zones.apache.org on 2007/05/31 09:00:07 UTC, 0 replies.
- What is parse-oo and why doesn't parsed PDF content show up in cached.jsp ? - posted by Manoharam Reddy <ma...@gmail.com> on 2007/05/31 09:07:37 UTC, 0 replies.
- How is lib-http plugin called? It is not there in plugins.include! - posted by Manoharam Reddy <ma...@gmail.com> on 2007/05/31 09:10:20 UTC, 1 replies.
- [jira] Updated: (NUTCH-494) FindBugs: CrawlDbReader and DeleteDuplicates - posted by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/05/31 10:52:15 UTC, 0 replies.
- [jira] Created: (NUTCH-494) FindBugs: CrawlDbReader and DeleteDuplicates - posted by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/05/31 10:52:15 UTC, 0 replies.
- [jira] Created: (NUTCH-495) Unnecessary delays in Fetcher2 - posted by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/05/31 17:49:16 UTC, 0 replies.
- [jira] Updated: (NUTCH-495) Unnecessary delays in Fetcher2 - posted by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/05/31 17:51:15 UTC, 0 replies.
- Hudson build is back to normal: Nutch-Nightly #103 - posted by hu...@lucene.zones.apache.org on 2007/05/31 18:56:26 UTC, 0 replies.
- [jira] Updated: (NUTCH-466) Flexible segment format - posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/05/31 20:42:15 UTC, 1 replies.
- [jira] Resolved: (NUTCH-486) Break searcher dependency on commons-cli - posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/05/31 21:01:24 UTC, 0 replies.
- [jira] Commented: (NUTCH-466) Flexible segment format - posted by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/05/31 21:28:16 UTC, 0 replies.
- Making "Hits" work as a normal List - posted by Nicolás Lichtmaier <ni...@reloco.com.ar> on 2007/05/31 22:58:02 UTC, 0 replies.
- [jira] Resolved: (NUTCH-392) OutputFormat implementations should pass on Progressable - posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/05/31 23:25:16 UTC, 0 replies.
- [PATCH] Moving HitDetails construction to a constructor =) - posted by Nicolás Lichtmaier <ni...@reloco.com.ar> on 2007/05/31 23:57:34 UTC, 0 replies.