- RE: Lucene term-vector - posted by Kenji <kk...@pipestone.com> on 2005/12/01 00:03:17 UTC, 0 replies.
- Re: "Good man" is Different than "Man good" in Nutch? - posted by Erik Hatcher <er...@ehatchersolutions.com> on 2005/12/01 01:12:17 UTC, 0 replies.
- Re: RegexURLFilter / testing regex-urlfilter.txt - posted by Bryan Woliner <br...@gmail.com> on 2005/12/01 06:23:47 UTC, 2 replies.
- urlfilter-db usage - posted by Brent Parker <fb...@comcast.net> on 2005/12/01 06:44:06 UTC, 2 replies.
- Re: Help require in local hard-disk crawling with Nutch - posted by Arun Kaundal <ar...@gmail.com> on 2005/12/01 13:43:42 UTC, 0 replies.
- Problems with crawling - posted by Wmelo <wm...@olimpo.com.br> on 2005/12/01 18:20:49 UTC, 0 replies.
- Cookies being sent by fetcher - posted by Matt Zytaruk <ma...@wavefire.com> on 2005/12/01 19:13:09 UTC, 0 replies.
- More Problems with crawling - posted by Wmelo <wm...@olimpo.com.br> on 2005/12/02 01:49:35 UTC, 0 replies.
- mapred branch: IOException in invertlinks (No input directories specified) - posted by Florent Gluck <fl...@busytonight.com> on 2005/12/02 02:34:02 UTC, 5 replies.
- Re: Class Not Found - posted by Jack Tang <hi...@gmail.com> on 2005/12/02 03:16:42 UTC, 5 replies.
- How to crawl Local filesystem, getting error in plugin load and activation -SEVERE org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.net.URLFilter does not exist. - posted by Arun Kaundal <ar...@gmail.com> on 2005/12/02 13:06:22 UTC, 0 replies.
- Segment Slicer - posted by Matt Zytaruk <ma...@wavefire.com> on 2005/12/02 17:41:02 UTC, 1 replies.
- Writing a Plugin - posted by "Vanderdray, Jacob" <JV...@aarp.org> on 2005/12/02 23:36:58 UTC, 0 replies.
- Fetching and Indexing in WEB - Content that has page navigation (Search Result page) - posted by "K.A.Hussain Ali" <Hu...@photoninfotech.com> on 2005/12/03 12:45:12 UTC, 0 replies.
- Unable to load parser from parser factory for html and text files. - posted by Arun Kumar Sharma <sh...@yahoo.co.in> on 2005/12/04 08:39:05 UTC, 0 replies.
- Error in intialization of logger and plugin preferences, while crawling local files system - posted by Arun Kaundal <ar...@gmail.com> on 2005/12/04 09:50:30 UTC, 0 replies.
- Hi how can I do a incremental crawling - posted by Kumar Limbu <ku...@gmail.com> on 2005/12/05 05:38:58 UTC, 1 replies.
- org.apache.nutch.protocol.ProtocolNotFound: protocol - posted by Arun Kaundal <ar...@gmail.com> on 2005/12/05 06:56:47 UTC, 0 replies.
- fetch of file:///F:/xxx/xxx/xxx.txt failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file - posted by Arun Kaundal <ar...@gmail.com> on 2005/12/05 13:57:03 UTC, 5 replies.
- Fetch of file:///abc/xxx/FetcherTask.html failed with: java.lang.Exception: org.apache.nutch.protocol.file.FileError: File Error: 404 - posted by Arun Kaundal <ar...@gmail.com> on 2005/12/05 15:05:24 UTC, 0 replies.
- parsing .wml files - posted by Neil Mooney <ne...@euselect.com> on 2005/12/05 15:49:16 UTC, 0 replies.
- Speed of indexing - posted by "Goldschmidt, Dave" <dg...@globalspec.com> on 2005/12/05 20:23:11 UTC, 5 replies.
- test/extending nutch - posted by ci...@bloglines.com on 2005/12/05 21:40:17 UTC, 0 replies.
- Number of URLs in segment fetchlist vs. Number of URLs in index - posted by Bryan Woliner <br...@gmail.com> on 2005/12/06 02:26:48 UTC, 0 replies.
- nutch war file -> where to go from here - posted by Kasper Hansen <nu...@zizi.dk> on 2005/12/06 09:29:50 UTC, 0 replies.
- Merging two sets of crawled data. - posted by Aled Jones <Al...@comtec-europe.co.uk> on 2005/12/06 10:24:03 UTC, 4 replies.
- Crawling TLD's + injected sites. - posted by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2005/12/06 12:32:10 UTC, 1 replies.
- ad feed for nutch - posted by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2005/12/06 12:37:24 UTC, 7 replies.
- NDFS problem on mapred branch - posted by Hamza Kaya <ha...@gmail.com> on 2005/12/06 14:48:08 UTC, 3 replies.
- try to restart aborted crawl - posted by Daqing Zhao <de...@gmail.com> on 2005/12/06 14:54:08 UTC, 4 replies.
- merge vs. updatedb - posted by "Goldschmidt, Dave" <dg...@globalspec.com> on 2005/12/06 18:09:58 UTC, 0 replies.
- Display on non-ASCII Characters in Search Results? - posted by Bill Goffe <go...@Oswego.EDU> on 2005/12/06 19:17:43 UTC, 2 replies.
- Returning all hits in a document - posted by John Reidy <jo...@reidy.com> on 2005/12/07 04:55:48 UTC, 1 replies.
- searching while crawling. - posted by "K.A.Hussain Ali" <Hu...@photoninfotech.com> on 2005/12/07 07:53:55 UTC, 1 replies.
- Nutch returns irrelevant site - posted by Aled Jones <Al...@comtec-europe.co.uk> on 2005/12/07 11:32:28 UTC, 1 replies.
- Nutch and Google Map togather for Real Estate search. - posted by Benny Krauss <be...@gmail.com> on 2005/12/07 16:48:20 UTC, 3 replies.
- Re: Setting up a crawler for a country. - posted by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2005/12/07 17:06:52 UTC, 0 replies.
- Upgrading from Nutch 0.7.1 to 0.8 - posted by "Goldschmidt, Dave" <dg...@globalspec.com> on 2005/12/07 17:50:30 UTC, 3 replies.
- Luke and Indexes - posted by Bryan Woliner <br...@gmail.com> on 2005/12/07 23:02:23 UTC, 2 replies.
- Re: [Nutch-general] RE: Speed of indexing - posted by og...@yahoo.com on 2005/12/08 06:07:49 UTC, 0 replies.
- Crawling two sites in the same segment.... - posted by rupa priya <ru...@yahoo.com> on 2005/12/08 06:16:12 UTC, 0 replies.
- Re :Re: searching while crawling. - posted by "K.A.Hussain Ali" <Hu...@photoninfotech.com> on 2005/12/08 06:44:59 UTC, 0 replies.
- how to - posted by "Riku | http://kukusky.8800.org" <ku...@gmail.com> on 2005/12/08 09:22:45 UTC, 2 replies.
- Plugin path in Nutch web - posted by Nguyen Ngoc Giang <gi...@gmail.com> on 2005/12/08 10:47:44 UTC, 1 replies.
- Crawling - dynamically generate web pages with paginations - posted by "K.A.Hussain Ali" <Hu...@photoninfotech.com> on 2005/12/08 11:22:09 UTC, 0 replies.
- Too many open file error -while searching using Nutch - posted by "K.A.Hussain Ali" <Hu...@photoninfotech.com> on 2005/12/08 11:30:27 UTC, 1 replies.
- Crawling listing (pagination) pages. - posted by "K.A.Hussain Ali" <Hu...@photoninfotech.com> on 2005/12/08 15:31:56 UTC, 1 replies.
- How to refresh the application context - to use the merged index - posted by "K.A.Hussain Ali" <Hu...@photoninfotech.com> on 2005/12/08 17:08:04 UTC, 0 replies.
- After mergesegs - posted by "Goldschmidt, Dave" <dg...@globalspec.com> on 2005/12/08 17:52:24 UTC, 3 replies.
- Problem with fetching segment - posted by "Håvard W. Kongsgård" <h....@niap.no> on 2005/12/09 01:57:54 UTC, 7 replies.
- How to get page content given URL only? - posted by Nguyen Ngoc Giang <gi...@gmail.com> on 2005/12/09 09:24:03 UTC, 7 replies.
- Cache page, modifying the output - posted by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2005/12/09 12:57:01 UTC, 1 replies.
- User Agent - posted by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2005/12/09 15:55:16 UTC, 1 replies.
- nutch questions - posted by Ken van Mulder <ke...@wavefire.com> on 2005/12/09 17:09:31 UTC, 1 replies.
- Linking Document scores together in a query - posted by Matt Zytaruk <ma...@wavefire.com> on 2005/12/09 19:15:30 UTC, 2 replies.
- Technorati - posted by Paul Harrison <pa...@personifi.com> on 2005/12/09 20:50:20 UTC, 2 replies.
- Incremental crawl w/ map reduce - posted by Florent Gluck <fl...@busytonight.com> on 2005/12/10 02:19:45 UTC, 2 replies.
- Nutch Tomcat5 or.apache.jasper.JasperException - posted by Michael Taggart <mi...@webco.tv> on 2005/12/10 02:39:13 UTC, 6 replies.
- Nutch Tomcat JasperException - posted by Michael Taggart <mi...@webco.tv> on 2005/12/10 02:45:20 UTC, 0 replies.
- nutch mapred + tomcat and a couple other questions - posted by Florent Gluck <fl...@busytonight.com> on 2005/12/12 19:10:11 UTC, 2 replies.
- Is there is simple command to count number of docs in an index? - posted by Bryan Woliner <br...@gmail.com> on 2005/12/12 20:13:44 UTC, 0 replies.
- Calling All Nutch Experts - posted by Michael Taggart <mi...@webco.tv> on 2005/12/12 23:35:35 UTC, 5 replies.
- Q re returning all hits from a document - posted by John Reidy <jo...@reidysystems.com> on 2005/12/13 01:49:08 UTC, 0 replies.
- index filesystem - posted by pa...@cli.di.unipi.it on 2005/12/13 09:59:57 UTC, 0 replies.
- Re: [Nutch-general] index filesystem - posted by Stefan Groschupf <sg...@media-style.com> on 2005/12/13 11:56:09 UTC, 0 replies.
- Re: getting last-modified date from the crawled pages - posted by John Reidy <jo...@reidy.com> on 2005/12/13 12:44:00 UTC, 0 replies.
- Improve retrieval of keywords - posted by Aled Jones <Al...@comtec-europe.co.uk> on 2005/12/13 17:48:29 UTC, 0 replies.
- OT: Alexa Web Search Platform - posted by Howie Wang <ho...@hotmail.com> on 2005/12/13 19:35:20 UTC, 0 replies.
- Map Reduce Errors - posted by Matt Zytaruk <ma...@wavefire.com> on 2005/12/13 19:46:33 UTC, 7 replies.
- Re: [Nutch-general] OT: Alexa Web Search Platform - posted by og...@yahoo.com on 2005/12/13 19:56:45 UTC, 2 replies.
- installation on windows - Tomkat - posted by Tim Archambault <jo...@gmail.com> on 2005/12/13 22:05:14 UTC, 1 replies.
- Nutch & RAM - posted by Mike Peterson <mi...@mail.ru> on 2005/12/14 01:04:09 UTC, 1 replies.
- Nutch and RAM - posted by Michael Bravo <mi...@yahoo.com> on 2005/12/14 05:58:34 UTC, 0 replies.
- About Writing Custom queryFilter - posted by Kumar Limbu <ku...@gmail.com> on 2005/12/14 10:58:10 UTC, 5 replies.
- Re: is there any way to prune webdb? - posted by Tim Archambault <jo...@gmail.com> on 2005/12/14 14:53:57 UTC, 0 replies.
- Retry later - posted by Aled Jones <Al...@comtec-europe.co.uk> on 2005/12/14 15:04:49 UTC, 0 replies.
- Re: [Nutch-general] How to refresh the application context - to use the merged index - posted by "Peter A. Daly" <pe...@gmail.com> on 2005/12/14 21:48:43 UTC, 0 replies.
- java.net.ConnectException: Connection refused - posted by Michael Taggart <mi...@webco.tv> on 2005/12/14 22:49:22 UTC, 9 replies.
- Tomcat with -security? - posted by Bill Goffe <go...@Oswego.EDU> on 2005/12/14 23:41:56 UTC, 0 replies.
- MapReduce on Cluster - posted by Hamza Kaya <ha...@gmail.com> on 2005/12/15 13:51:45 UTC, 3 replies.
- Any nutch architecture doc out there? - posted by Teruhiko Kurosaka <Ku...@basistech.com> on 2005/12/15 16:35:20 UTC, 1 replies.
- Smaller error on Wiki - posted by Dan Glauser <da...@gmail.com> on 2005/12/15 17:06:59 UTC, 0 replies.
- Socket Exceptions - posted by Matt Zytaruk <ma...@wavefire.com> on 2005/12/15 21:06:03 UTC, 0 replies.
- java.io.IOException in dedup (map reduce) - posted by Florent Gluck <fl...@busytonight.com> on 2005/12/15 23:55:16 UTC, 0 replies.
- MapRed searching - posted by Michael Taggart <mi...@webco.tv> on 2005/12/16 02:55:32 UTC, 10 replies.
- Live updating an intranet index - posted by Tyrell Perera <ty...@gmail.com> on 2005/12/16 03:14:16 UTC, 5 replies.
- Q: returning all hits in a document. - posted by John Reidy <jo...@reidysystems.com> on 2005/12/16 03:58:56 UTC, 2 replies.
- Is there any way to check that no duplicate url get inserted through "WebDBInjector" - posted by Arun Kumar Sharma <sh...@yahoo.co.in> on 2005/12/16 09:04:43 UTC, 5 replies.
- best strategy to deal with large index file - posted by Jeff Liang <je...@messagesolution.com> on 2005/12/16 09:41:11 UTC, 2 replies.
- Crawling search engines and cgi scripts - posted by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2005/12/16 17:10:54 UTC, 2 replies.
- how to ignore sections of a web page - posted by ci...@bloglines.com on 2005/12/16 18:36:39 UTC, 0 replies.
- Must plugins be MT safe? - posted by Teruhiko Kurosaka <Ku...@basistech.com> on 2005/12/16 20:01:03 UTC, 1 replies.
- cannot pull up result for fetched document, will a merge help? - posted by Jed Reynolds <jr...@emediawire.com> on 2005/12/16 21:52:24 UTC, 0 replies.
- nutch crawl fails with: org.apache.nutch.indexer.IndexingFilter does not exist. - posted by Stephen Fitch <sc...@gmail.com> on 2005/12/18 00:21:44 UTC, 4 replies.
- PluginRuntimeException: org.apache.nutch.indexer.IndexingFilter does not exist - posted by Alfred Ostermeier <al...@arcor.de> on 2005/12/18 00:39:51 UTC, 3 replies.
- How to recrawl urls - posted by Kumar Limbu <ku...@gmail.com> on 2005/12/19 02:52:12 UTC, 5 replies.
- Re: Filesystem structure for the web front-end. - posted by Bryan Woliner <br...@gmail.com> on 2005/12/19 08:35:10 UTC, 0 replies.
- injecting URLs with '?' - posted by Miguel A Paraz <mp...@gmail.com> on 2005/12/19 11:56:58 UTC, 1 replies.
- is nutch recrawl possible? - posted by "Pushpesh Kr. Rajwanshi" <pu...@gmail.com> on 2005/12/19 14:46:25 UTC, 9 replies.
- DO generate create fetchlist of URL"S unfetched ? - posted by "K.A.Hussain Ali" <Hu...@photoninfotech.com> on 2005/12/19 17:02:37 UTC, 0 replies.
- build instructions? - posted by Teruhiko Kurosaka <Ku...@basistech.com> on 2005/12/19 20:38:17 UTC, 5 replies.
- Appropriate steps for mapred - posted by Michael Taggart <mi...@webco.tv> on 2005/12/19 23:43:19 UTC, 1 replies.
- Multiple anchors on same site - what's better than making these unique? - posted by David Wallace <da...@nzqa.govt.nz> on 2005/12/20 00:49:32 UTC, 1 replies.
- Re: Multiple anchors on same site - what's better than making these unique? - posted by Stefan Groschupf <sg...@media-style.com> on 2005/12/20 01:00:29 UTC, 0 replies.
- Does Search Result Show Similar Pages Like Google? - posted by Victor Lee <vi...@yahoo.com> on 2005/12/20 04:48:18 UTC, 5 replies.
- "Out of memory exception"-while updating - posted by "K.A.Hussain Ali" <Hu...@photoninfotech.com> on 2005/12/20 16:58:53 UTC, 1 replies.
- Can nutch be used as link checker? What does http.max.delay error mean? - posted by Teruhiko Kurosaka <Ku...@basistech.com> on 2005/12/21 01:35:16 UTC, 1 replies.
- Thanks! - posted by Bill Goffe <go...@Oswego.EDU> on 2005/12/21 06:14:09 UTC, 0 replies.
- java.nio.BufferOverflowException while parsing html contents - posted by Arun Kumar Sharma <sh...@yahoo.co.in> on 2005/12/21 08:11:51 UTC, 1 replies.
- Read Time out problem - posted by Nguyen Ngoc Giang <gi...@gmail.com> on 2005/12/21 09:00:54 UTC, 2 replies.
- Re: Problem re-running crawl tool ? Is there any way to run nutch crawl tool any number of times - posted by Stefan Groschupf <sg...@media-style.com> on 2005/12/21 13:43:52 UTC, 0 replies.
- "Too many open file issue" - posted by "K.A.Hussain Ali" <Hu...@photoninfotech.com> on 2005/12/21 14:00:24 UTC, 2 replies.
- Re: Problem re-running crawl tool ? Is there any way to run nutch crawl tool any number of times - posted by Arun Kaundal <ar...@gmail.com> on 2005/12/21 14:16:57 UTC, 2 replies.
- which files/directories are needed after a segment or index merge - posted by Bryan Woliner <br...@gmail.com> on 2005/12/21 18:28:04 UTC, 9 replies.
- finding the segment an url lands in - posted by Jed Reynolds <jr...@emediawire.com> on 2005/12/21 20:53:00 UTC, 1 replies.
- java.io.IOException: Input/output error with bin/nutch updatedb db seg command - posted by Edward Whittaker <ed...@ewdw.com> on 2005/12/22 13:12:20 UTC, 0 replies.
- New Tutorial Needed - posted by carmmello <ca...@globo.com> on 2005/12/22 21:12:25 UTC, 3 replies.
- Crawling password-protected sites - posted by Kumar Limbu <ku...@gmail.com> on 2005/12/23 03:13:41 UTC, 2 replies.
- Page number links and web service interface. - posted by Aled Jones <Al...@comtec-europe.co.uk> on 2005/12/23 10:33:13 UTC, 1 replies.
- Help! 0.7.1 segments don't work in 0.8 - posted by carmmello <ca...@globo.com> on 2005/12/23 16:35:46 UTC, 1 replies.
- Help! 0.7.1 segments don't work in 0.8 - posted by "Goldschmidt, Dave" <dg...@globalspec.com> on 2005/12/23 16:51:08 UTC, 2 replies.
- How to configure Nutch to parse PDF and Word docs? - posted by Teruhiko Kurosaka <Ku...@basistech.com> on 2005/12/23 19:18:45 UTC, 1 replies.
- Best setup for multiple nutch users on one server - posted by Bryan Woliner <br...@gmail.com> on 2005/12/25 00:41:03 UTC, 1 replies.
- "Out of memor error" while updating - posted by "K.A.Hussain Ali" <Hu...@photoninfotech.com> on 2005/12/26 09:27:46 UTC, 3 replies.
- Problem crawling Ms-word , Crawl depth issue, Result problem - posted by Arun Kaundal <ar...@gmail.com> on 2005/12/26 13:17:07 UTC, 0 replies.
- file to http mapping - posted by Jeff Breidenbach <je...@jab.org> on 2005/12/26 15:21:07 UTC, 3 replies.
- How to run Nutch? - posted by carmmello <ca...@globo.com> on 2005/12/26 19:19:15 UTC, 14 replies.
- Running nutch under VmWare - posted by Byron Miller <by...@yahoo.com> on 2005/12/27 00:40:14 UTC, 0 replies.
- Distributed search corrupted output problem - posted by Ed Whittaker <ed...@ewdw.com> on 2005/12/27 05:38:31 UTC, 3 replies.
- Crawl problem in 0.7 and 0.7.1 - posted by Chih How Bong <ch...@gmail.com> on 2005/12/27 06:31:44 UTC, 0 replies.
- max-out links count in Nutch. - posted by "K.A.Hussain Ali" <Hu...@photoninfotech.com> on 2005/12/27 11:38:30 UTC, 1 replies.
- document markup to control indexing - posted by Jeff Breidenbach <je...@jab.org> on 2005/12/27 15:59:39 UTC, 6 replies.
- multibyte character support status - posted by Teruhiko Kurosaka <Ku...@basistech.com> on 2005/12/27 19:44:22 UTC, 0 replies.
- Trouble setting NDFS on multiple machines - posted by Gal Nitzan <gn...@usa.net> on 2005/12/27 22:20:09 UTC, 5 replies.
- How can I set a search server over NDFS - posted by Gal Nitzan <gn...@usa.net> on 2005/12/28 00:42:22 UTC, 1 replies.
- How can I set a search server over NDFS - Revised - posted by Gal Nitzan <gn...@usa.net> on 2005/12/28 01:32:38 UTC, 1 replies.
- Can we search based on two fileds? - posted by Kumar Limbu <ku...@gmail.com> on 2005/12/28 03:51:25 UTC, 0 replies.
- Crawler problem in 0.7 and 0.7.1 - posted by Chih How Bong <ch...@gmail.com> on 2005/12/28 05:33:53 UTC, 1 replies.
- Is any one able to successfully run Distributed Crawl? - posted by "Pushpesh Kr. Rajwanshi" <pu...@gmail.com> on 2005/12/28 07:12:49 UTC, 4 replies.
- Clustering Index job - posted by "R.Mayoran" <ma...@team-lab.com> on 2005/12/28 09:56:28 UTC, 2 replies.
- Setting Search over NDFS - posted by Gal Nitzan <gn...@usa.net> on 2005/12/28 14:25:47 UTC, 5 replies.
- Why does Nutch use n-grams in analysis? - posted by Teruhiko Kurosaka <Ku...@basistech.com> on 2005/12/28 19:16:36 UTC, 3 replies.
- "Out of Memory Error" - posted by "K.A.Hussain Ali" <Hu...@photoninfotech.com> on 2005/12/29 11:10:31 UTC, 0 replies.
- P2P Hyperestraier approach compared to Nutch - posted by Alexandre Dulaunoy <ad...@gmail.com> on 2005/12/29 19:29:41 UTC, 2 replies.
- Writing custom application using nutch - posted by Kumar Limbu <ku...@gmail.com> on 2005/12/30 08:20:11 UTC, 0 replies.
- Nutch freezing on fetch - posted by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2005/12/30 09:57:54 UTC, 6 replies.
- numfetchers in map red - posted by Gal Nitzan <gn...@usa.net> on 2005/12/30 12:55:24 UTC, 0 replies.
- OutOfMemoryError while crawling - posted by Nalin Kumar <se...@gmail.com> on 2005/12/30 13:51:29 UTC, 0 replies.
- "Out of memory error"-while updating - posted by "K.A.Hussain Ali" <Hu...@photoninfotech.com> on 2005/12/30 14:18:56 UTC, 0 replies.
- uses of 'io.sort.mb' and ' io.sort.factor' in nutch-default.xml - posted by "K.A.Hussain Ali" <Hu...@photoninfotech.com> on 2005/12/31 07:17:59 UTC, 0 replies.