- How to tell Nutch that text files are text files? - posted by Hannu Väisänen <hv...@joyx.joensuu.fi> on 2009/07/02 07:32:43 UTC, 0 replies.
- Re: How torunning nutch on 2G memory tasknode - posted by lei wang <nu...@gmail.com> on 2009/07/02 13:58:55 UTC, 0 replies.
- nutch crawldb failed for java heap space - posted by lei wang <nu...@gmail.com> on 2009/07/02 18:21:18 UTC, 4 replies.
- How To Generate the JavaDoc - posted by schroedi <sc...@gmail.com> on 2009/07/02 20:58:51 UTC, 1 replies.
- Optimal size of a segments sub-directory and a couple of other questions relating to Nutch response times - posted by Vijay <vi...@gmail.com> on 2009/07/03 03:15:10 UTC, 0 replies.
- Nutch 1.0 on the limits of the data - posted by Polsnet <po...@163.com> on 2009/07/03 06:03:30 UTC, 2 replies.
- NYC Apache Lucene/Solr/Nutch/etc. Meetup - posted by Grant Ingersoll <gs...@apache.org> on 2009/07/03 14:11:32 UTC, 0 replies.
- what's the relationship between nutch, solr, lucene, and hadoop - posted by xiao yang <ya...@gmail.com> on 2009/07/03 21:06:50 UTC, 1 replies.
- Problems when deploy nutch-1.0.war - posted by xiao yang <ya...@gmail.com> on 2009/07/04 09:41:58 UTC, 5 replies.
- Re: Storing a serialized object ? - posted by MilleBii <mi...@gmail.com> on 2009/07/04 10:22:01 UTC, 1 replies.
- Getting Nutch1.0 example working in tomcat 6 (on ubuntu) - posted by Alex McLintock <al...@gmail.com> on 2009/07/04 13:21:48 UTC, 0 replies.
- Favorite Linux Distribution for Nutch - posted by schroedi <sc...@gmail.com> on 2009/07/04 16:50:40 UTC, 6 replies.
- How to get lastModified or create-date content from html pages? - posted by postusenet <po...@gmail.com> on 2009/07/04 19:26:04 UTC, 0 replies.
- Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out. - posted by xiao yang <ya...@gmail.com> on 2009/07/05 17:33:21 UTC, 3 replies.
- Nutch-1.0: Cannot lock storage error - posted by xiao yang <ya...@gmail.com> on 2009/07/06 08:41:07 UTC, 0 replies.
- How to search Nutch DB - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/06 09:05:06 UTC, 1 replies.
- Authentication Not Occuring - posted by youyou wu <wu...@hotmail.com> on 2009/07/06 11:40:11 UTC, 1 replies.
- what is Non DFS Used in cluster summary ?how to delete it? - posted by Pravin Karne <pr...@persistent.co.in> on 2009/07/06 12:38:11 UTC, 0 replies.
- what is Non DFS Used in cluster summary? how to delete Non DFS Used data - posted by Pravin Karne <pr...@persistent.co.in> on 2009/07/06 12:41:30 UTC, 0 replies.
- how parse chm files - posted by Yaidel Guedes Beltran <yg...@estudiantes.uci.cu> on 2009/07/06 15:02:02 UTC, 0 replies.
- Problems when index .chm files - posted by Yaidel Guedes Beltran <yg...@estudiantes.uci.cu> on 2009/07/06 19:16:27 UTC, 1 replies.
- error nutch recrawl - posted by Maurizio Croci <cr...@gmail.com> on 2009/07/06 19:47:00 UTC, 1 replies.
- Writing Plugins - Documentation? - posted by Alex McLintock <al...@gmail.com> on 2009/07/06 20:58:14 UTC, 0 replies.
- Solr Integration since v1.0 ? - posted by Alex McLintock <al...@gmail.com> on 2009/07/07 14:51:11 UTC, 0 replies.
- Re: How to search Nutch DB - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/08 08:02:44 UTC, 0 replies.
- How to Parse Rss Feed URL - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/08 08:24:41 UTC, 2 replies.
- How to add chinese segment feature to Nutch-1.0 - posted by xiao yang <ya...@gmail.com> on 2009/07/08 13:17:14 UTC, 0 replies.
- Running Nutch on VMs - posted by Jake Jacobson <ja...@gmail.com> on 2009/07/08 17:02:06 UTC, 1 replies.
- Show db_gone in crawlDB - posted by schroedi <sc...@gmail.com> on 2009/07/09 06:05:30 UTC, 1 replies.
- How to crawl URLs getting from RSSParser - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/09 07:21:12 UTC, 0 replies.
- Index weightings of different types of text node...h1, h2 anchor etc.. - posted by Joel Halbert <jo...@su3analytics.com> on 2009/07/09 15:30:23 UTC, 0 replies.
- Weighting different html text nodes - h1,h2 etc.. - posted by Joel Halbert <jo...@storequery.com> on 2009/07/09 15:31:43 UTC, 1 replies.
- Re: Index weightings of different types of text node...h1, h2 anchor etc.. - posted by Magnús Skúlason <ma...@gmail.com> on 2009/07/09 15:39:35 UTC, 0 replies.
- call for answer - posted by postusenet <po...@gmail.com> on 2009/07/09 22:40:12 UTC, 0 replies.
- Script to crawl web - posted by Jake Jacobson <ja...@gmail.com> on 2009/07/09 23:02:00 UTC, 0 replies.
- Arc to segements failed for " Task attempt_200907091108_0001_m_000520_0 failed to report status for 602 seconds. Killing!" - posted by lei wang <nu...@gmail.com> on 2009/07/10 03:56:49 UTC, 0 replies.
- Re: Arc to segements failed for " Task attempt_200907091108_0001_m_000520_0 failed to report status for 602 seconds. Killing!" - posted by Ken Krugler <kk...@transpac.com> on 2009/07/10 04:56:13 UTC, 0 replies.
- indexing each item in separate page - posted by Beats <ta...@yahoo.com> on 2009/07/10 09:01:51 UTC, 3 replies.
- how to change encoding - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/10 11:43:33 UTC, 0 replies.
- [ANN] Luke + Hadoop, alpha version - posted by Andrzej Bialecki <ab...@getopt.org> on 2009/07/10 12:08:14 UTC, 0 replies.
- Re: how to change encoding - posted by Doğacan Güney <do...@gmail.com> on 2009/07/10 12:14:01 UTC, 0 replies.
- Re: How to parse and index content field of RSS-Feed? - posted by Beats <ta...@yahoo.com> on 2009/07/10 13:29:04 UTC, 0 replies.
- How to search part of words? - posted by st...@hartmann.info on 2009/07/10 14:57:13 UTC, 0 replies.
- How to search for part of words? - posted by st...@hartmann.info on 2009/07/10 15:04:59 UTC, 1 replies.
- how to allow every url to b accepted - posted by Beats <ta...@yahoo.com> on 2009/07/10 15:41:57 UTC, 1 replies.
- Problem with nutch - posted by Pranay Gunna <gu...@yahoo.com> on 2009/07/10 21:35:31 UTC, 0 replies.
- Ontology-Clearing Cache... - posted by gunnapranay <gu...@yahoo.com> on 2009/07/10 23:16:53 UTC, 0 replies.
- job failed for "Too many fetch-failures" - posted by lei wang <nu...@gmail.com> on 2009/07/11 04:46:18 UTC, 0 replies.
- how to crawl a page but not index it - posted by Beats <ta...@yahoo.com> on 2009/07/11 09:20:43 UTC, 5 replies.
- Too many fether failures - posted by lei wang <nu...@gmail.com> on 2009/07/12 08:58:15 UTC, 0 replies.
- Changing fieldsNorm at query time - posted by ilayaraja <il...@rediff.co.in> on 2009/07/12 16:24:32 UTC, 0 replies.
- Search results return 0 - posted by Zaihan <za...@unrealasia.net> on 2009/07/12 19:05:33 UTC, 0 replies.
- Nutch Character encoding converter - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/13 06:46:22 UTC, 2 replies.
- Deleting indexes - posted by Beats <ta...@yahoo.com> on 2009/07/13 09:10:41 UTC, 3 replies.
- Nutch OutPut in which UTF format - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/13 10:06:26 UTC, 0 replies.
- prune tool query - posted by Beats <ta...@yahoo.com> on 2009/07/13 10:25:39 UTC, 2 replies.
- Job failed help - posted by Jake Jacobson <ja...@gmail.com> on 2009/07/13 14:53:45 UTC, 7 replies.
- Integrating Nutch frontend with Backend. - posted by Zaihan <za...@unrealasia.net> on 2009/07/13 14:57:59 UTC, 1 replies.
- Re: Nutch OutPut in which UTF format - posted by Doğacan Güney <do...@gmail.com> on 2009/07/13 15:52:40 UTC, 0 replies.
- Search History and Top Searches - posted by Kenan Azam <az...@gmail.com> on 2009/07/13 19:58:03 UTC, 1 replies.
- Nutch Tutorial 1.0 based off of the French Version - posted by Jake Jacobson <ja...@gmail.com> on 2009/07/13 22:26:34 UTC, 5 replies.
- Just getting started w/tutorial- errors in crawl.log - posted by oh...@cox.net on 2009/07/14 02:58:15 UTC, 4 replies.
- url normalizer - posted by Neeti Gupta <ne...@yahoo.com> on 2009/07/14 08:46:31 UTC, 0 replies.
- Re: recrawling - posted by Neeti Gupta <ne...@yahoo.com> on 2009/07/14 08:50:20 UTC, 2 replies.
- Ignoring robots.txt - posted by Beats <ta...@yahoo.com> on 2009/07/14 10:06:54 UTC, 2 replies.
- job failed for "java.io.IOException: Task process exit with nonzero status of 255." - posted by lei wang <nu...@gmail.com> on 2009/07/14 13:05:55 UTC, 0 replies.
- A few questions about crawl-urlfilter.txt - posted by Hrishikesh Agashe <hr...@persistent.co.in> on 2009/07/14 14:12:51 UTC, 0 replies.
- How to crawl page displayed as response to search query in solr - posted by Beats <ta...@yahoo.com> on 2009/07/14 15:36:13 UTC, 0 replies.
- Re: A few questions about crawl-urlfilter.txt - posted by Ken Krugler <kk...@transpac.com> on 2009/07/14 16:54:36 UTC, 2 replies.
- Tutorial followup - Nutch webapp not seeing stuff? - posted by oh...@cox.net on 2009/07/14 17:09:34 UTC, 7 replies.
- Re: job failed for "java.io.IOException: Task process exit with nonzero status of 255." - posted by lei wang <nu...@gmail.com> on 2009/07/15 02:51:44 UTC, 0 replies.
- How to manage the urls in crawlDB? - posted by xiao yang <ya...@gmail.com> on 2009/07/15 15:27:50 UTC, 1 replies.
- Reminder: NYC Lucene et. al Meetup next week - posted by Grant Ingersoll <gr...@lucidimagination.com> on 2009/07/15 17:22:30 UTC, 0 replies.
- [REMINDER] NYC Meetup July 22nd - posted by Grant Ingersoll <gs...@apache.org> on 2009/07/15 17:31:56 UTC, 0 replies.
- mergesegs disk space - posted by Tomislav Poljak <tp...@gmail.com> on 2009/07/15 18:31:37 UTC, 8 replies.
- Error when using language-identifier plugin ? - posted by MilleBii <mi...@gmail.com> on 2009/07/15 19:40:51 UTC, 0 replies.
- Local or Distributed mode? - posted by "Rodrigo Reyes C." <ro...@avity.com> on 2009/07/15 21:35:37 UTC, 1 replies.
- How nutch use ontology - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/16 10:01:12 UTC, 0 replies.
- indexing meta tags in 1.0 - posted by Will Daley <wa...@willdaley.com> on 2009/07/16 12:12:57 UTC, 0 replies.
- Use of lock file - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/16 12:51:51 UTC, 0 replies.
- how to filter pages before indexing - posted by Beats <ta...@yahoo.com> on 2009/07/16 13:11:22 UTC, 3 replies.
- Nutch download speed - posted by Hrishikesh Agashe <hr...@persistent.co.in> on 2009/07/16 15:11:29 UTC, 1 replies.
- Add new conf file. - posted by Beats <ta...@yahoo.com> on 2009/07/16 16:46:45 UTC, 0 replies.
- Crawling with a PKI Cert - posted by Jake Jacobson <ja...@gmail.com> on 2009/07/16 17:52:22 UTC, 0 replies.
- Problem crawling local filesystem - posted by oh...@cox.net on 2009/07/16 19:36:40 UTC, 1 replies.
- Meta tag plugin for 1.0 - posted by wadaley <nu...@willdaley.com> on 2009/07/16 21:26:00 UTC, 0 replies.
- java heap space problem when using the language identifier - posted by MilleBii <mi...@gmail.com> on 2009/07/16 22:53:55 UTC, 6 replies.
- Question about crawling local filesystem and directories - posted by oh...@cox.net on 2009/07/16 22:57:46 UTC, 0 replies.
- Difference between Feed parser and Rss Parser - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/17 08:21:01 UTC, 1 replies.
- How segment depends on depth - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/17 13:03:16 UTC, 1 replies.
- Issue with Parse metaData while crawling RSSFeed URL - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/17 13:15:41 UTC, 1 replies.
- Why cant I inject a google link to the database? - posted by Larsson85 <kr...@hotmail.com> on 2009/07/17 14:04:53 UTC, 14 replies.
- dump all outlinks - posted by reinhard schwab <re...@aon.at> on 2009/07/17 18:43:18 UTC, 2 replies.
- wrong outlinks - posted by reinhard schwab <re...@aon.at> on 2009/07/17 21:48:17 UTC, 0 replies.
- Re: wrong outlinks - posted by Doğacan Güney <do...@gmail.com> on 2009/07/17 23:40:50 UTC, 2 replies.
- error in using generate command - posted by Beats <ta...@yahoo.com> on 2009/07/18 10:32:02 UTC, 5 replies.
- Entities.encode is not UTF-8 compliant - posted by MilleBii <mi...@gmail.com> on 2009/07/18 15:54:40 UTC, 1 replies.
- directories needed for a merge - posted by Alex Basa <al...@yahoo.com> on 2009/07/20 03:30:12 UTC, 0 replies.
- different urlfilter for different seeds - posted by Beats <ta...@yahoo.com> on 2009/07/20 09:05:44 UTC, 2 replies.
- Crawling - posted by Neeti Gupta <ne...@yahoo.com> on 2009/07/20 11:11:10 UTC, 0 replies.
- Nutch 1.0 Fetch failure... - posted by Fred Kuipers <mr...@gmail.com> on 2009/07/20 18:55:00 UTC, 2 replies.
- Using Nutch to crawl PubMed - posted by Arshad Khan <kh...@gmail.com> on 2009/07/21 05:59:36 UTC, 1 replies.
- nutch 0.9 with jetty 6 and jdk 1.6 - posted by Michaela Moesenbacher <Mi...@elements.at> on 2009/07/21 10:42:37 UTC, 0 replies.
- [ApacheCon US] Travel Assistance - posted by Grant Ingersoll <gs...@apache.org> on 2009/07/22 12:49:23 UTC, 0 replies.
- nutch -threads in hadoop - posted by Brian Tingle <Br...@ucop.edu> on 2009/07/23 04:21:38 UTC, 3 replies.
- Querying nutch content using Pig Latin - posted by Ninad Raut <hb...@gmail.com> on 2009/07/23 07:13:10 UTC, 0 replies.
- How to add new field in parseData - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/23 12:38:08 UTC, 1 replies.
- Pages with Specific URLS. - posted by Zaihan <za...@unrealasia.net> on 2009/07/23 15:50:44 UTC, 1 replies.
- Graceful stop in the middle of a fetch phase ? - posted by MilleBii <mi...@gmail.com> on 2009/07/23 20:29:36 UTC, 4 replies.
- adding [-numFetchers numFetchers] to crawl - posted by Brian Tingle <Br...@ucop.edu> on 2009/07/24 05:16:19 UTC, 0 replies.
- IO exception while adding field in Parsedata contentmeta. - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/24 16:21:29 UTC, 0 replies.
- IO exception while adding field in Parsedata parsemeta. - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/24 16:21:33 UTC, 1 replies.
- Can I "chunk" during the crawl? - posted by Paul Tomblin <pt...@xcski.com> on 2009/07/24 16:39:43 UTC, 0 replies.
- Why did my crawl fail? - posted by Paul Tomblin <pt...@xcski.com> on 2009/07/24 16:53:31 UTC, 7 replies.
- Dumping CrawlDB into database - posted by schroedi <sc...@gmail.com> on 2009/07/24 16:59:17 UTC, 0 replies.
- Nutch 1.0 and Hadoop 0.20 - posted by Hrishikesh Agashe <hr...@persistent.co.in> on 2009/07/24 20:44:48 UTC, 0 replies.
- How to search in one specific field? - posted by xiao yang <ya...@gmail.com> on 2009/07/25 07:44:14 UTC, 0 replies.
- crawl-tool.xml - posted by reinhard schwab <re...@aon.at> on 2009/07/26 13:55:03 UTC, 1 replies.
- How to index other fields in solr - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/27 08:34:23 UTC, 2 replies.
- Nutch crawling status - posted by caezar <ca...@gmail.com> on 2009/07/27 16:27:58 UTC, 1 replies.
- question - posted by Jair Piedrahita Vargas <JA...@bancolombia.com.co> on 2009/07/27 18:50:29 UTC, 1 replies.
- Using Nutch (w/custom plugin) to crawl vs. custom Lucene app - posted by oh...@cox.net on 2009/07/27 21:35:36 UTC, 0 replies.
- Support needed - posted by sf30098 <sf...@yahoo.com> on 2009/07/27 23:01:09 UTC, 1 replies.
- Host specific parsing - posted by Koch Martina <Ko...@huberverlag.de> on 2009/07/28 09:24:51 UTC, 2 replies.
- Development support - posted by Koch Martina <Ko...@huberverlag.de> on 2009/07/28 12:30:03 UTC, 0 replies.
- Dumping what I have? - posted by Paul Tomblin <pt...@xcski.com> on 2009/07/28 16:46:04 UTC, 3 replies.
- How to add new field in indexing in SolrIndexer.java - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/29 07:38:15 UTC, 1 replies.
- Include/exclude lists - posted by Paul Tomblin <pt...@xcski.com> on 2009/07/29 10:33:12 UTC, 1 replies.
- How fetcher works - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/30 06:17:49 UTC, 1 replies.
- Nutch and Solr - posted by Paul Tomblin <pt...@xcski.com> on 2009/07/30 14:22:19 UTC, 0 replies.
- Meaning of ProtocolStatus.ACCESS_DENIED - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/30 15:59:50 UTC, 0 replies.
- Dumping Crawl DB with XML - posted by schroedi <sc...@gmail.com> on 2009/07/30 17:19:42 UTC, 0 replies.
- Nutch in C++ - posted by al...@aim.com on 2009/07/30 21:13:16 UTC, 0 replies.
- how to exclude some external links - posted by al...@aim.com on 2009/07/31 03:15:49 UTC, 0 replies.
- Re: how to exclude some external links - posted by Paul Tomblin <pt...@xcski.com> on 2009/07/31 03:26:37 UTC, 0 replies.
- Plugin development - posted by Paul Tomblin <pt...@xcski.com> on 2009/07/31 04:04:59 UTC, 4 replies.
- denied by robots.txt rules - posted by Saurabh Suman <sa...@rediff.com> on 2009/07/31 05:28:29 UTC, 1 replies.
- Re: Hadoop java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) while indexing. - posted by Filipe Antunes <fa...@tecnica.cc> on 2009/07/31 11:03:55 UTC, 1 replies.
- Focussed Web Crawling with Nutch - posted by Alex McLintock <al...@gmail.com> on 2009/07/31 12:07:16 UTC, 2 replies.
- Specific fetch list based on url status or score - posted by MilleBii <mi...@gmail.com> on 2009/07/31 19:12:01 UTC, 0 replies.