user@nutch.apache.org, 2011-12

- How can i crawl some image in content ? - posted by magix <17...@qq.com> on 2011/12/01 00:31:17 UTC, 1 replies.
- Re: Nutch and Sharepoint authentication - posted by remi tassing <ta...@gmail.com> on 2011/12/01 02:21:20 UTC, 3 replies.
- Posting to a secured Solr? - posted by John Whelan <jo...@whelanlabs.com> on 2011/12/01 07:21:34 UTC, 1 replies.
- Re: Fetching just some urls outside domain - posted by Adriana Farina <ad...@gmail.com> on 2011/12/01 09:57:34 UTC, 7 replies.
- Re: Solr Indexing Problem - posted by Markus Jelsma <ma...@openindex.io> on 2011/12/01 11:02:14 UTC, 1 replies.
- Removing crawldb segments - posted by jotta <so...@gmail.com> on 2011/12/01 16:52:27 UTC, 3 replies.
- Resolving Ivy dependencies in eclipse - posted by blaise thomson <bl...@yahoo.com> on 2011/12/01 17:53:25 UTC, 1 replies.
- how to filter outlinks - posted by al...@aim.com on 2011/12/02 02:24:01 UTC, 0 replies.
- ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/rss+xml - posted by magix <17...@qq.com> on 2011/12/02 04:27:17 UTC, 1 replies.
- Filter by content language ID - posted by co...@complexityintelligence.com on 2011/12/02 16:23:42 UTC, 5 replies.
- Best strategy for boundary defined crawling - posted by co...@complexityintelligence.com on 2011/12/02 16:23:53 UTC, 6 replies.
- how give several sites to nutch to crawl? - posted by mina <ta...@gmail.com> on 2011/12/03 08:32:26 UTC, 4 replies.
- Which outlinks on a webpage are stored in the segment? - posted by Marek Bachmann <m....@uni-kassel.de> on 2011/12/03 16:38:42 UTC, 2 replies.
- Ranking of injected urls vs crawled urls - posted by Harris Rappaport <hp...@gmail.com> on 2011/12/05 00:19:08 UTC, 3 replies.
- Persistent Crawldb Checksum error - posted by Danicela nutch <Da...@mail.com> on 2011/12/05 14:06:41 UTC, 2 replies.
- problem with the Nutch and Hadoop Tutorial when starting to deploy Nutch to Single Machine - posted by José Ignacio Ortiz de Galisteo <jo...@salir.com> on 2011/12/05 14:14:43 UTC, 3 replies.
- [ANNOUNCEMENT] Nutch wiki attachments - posted by Lewis John Mcgibbney <le...@gmail.com> on 2011/12/05 14:28:55 UTC, 0 replies.
- Re : Re: Persistent Crawldb Checksum error - posted by Danicela nutch <Da...@mail.com> on 2011/12/05 14:56:32 UTC, 0 replies.
- Can't find fetcher log when running on Hadoop cluster - posted by manishb <ba...@gmail.com> on 2011/12/05 16:41:16 UTC, 1 replies.
- Re: Very large filter lists - posted by Markus Jelsma <ma...@openindex.io> on 2011/12/05 18:37:25 UTC, 1 replies.
- error "java.net.SocketTimeoutException: Read timed out" in crawl with nutch? - posted by mina <ta...@gmail.com> on 2011/12/06 00:36:05 UTC, 1 replies.
- Problems with HeapSpace in Hadoop Cluster - posted by Marek Bachmann <m....@uni-kassel.de> on 2011/12/06 00:37:26 UTC, 1 replies.
- new nutch tool - posted by Tim Pease <ti...@gmail.com> on 2011/12/06 00:38:06 UTC, 2 replies.
- error "java.net.SocketException: Connection reset" in crawl with nutch - posted by mina <ta...@gmail.com> on 2011/12/06 00:44:53 UTC, 1 replies.
- add link - posted by Kevin <ke...@procher.info> on 2011/12/06 01:59:28 UTC, 0 replies.
- search the web and get the number of hits corresponding to a particular request - posted by Jihene Ferchichi Jmal <fe...@hotmail.fr> on 2011/12/06 09:03:09 UTC, 7 replies.
- generate/update times and crawldb size - posted by Danicela nutch <Da...@mail.com> on 2011/12/06 11:33:49 UTC, 4 replies.
- Re : Re: generate/update times and crawldb size - posted by Danicela nutch <Da...@mail.com> on 2011/12/06 12:27:46 UTC, 3 replies.
- best number of threads per host - posted by mina <ta...@gmail.com> on 2011/12/06 16:01:45 UTC, 1 replies.
- Problems with solrindex - posted by DanFernandes <fe...@gmail.com> on 2011/12/07 12:37:56 UTC, 1 replies.
- Nutch plugins - posted by jotta <so...@gmail.com> on 2011/12/07 16:16:59 UTC, 1 replies.
- Trouble running solrindexer from Nutch 1.4 - posted by Chip Calhoun <cc...@aip.org> on 2011/12/07 23:17:08 UTC, 3 replies.
- Re: Problem running Nutch on Win 7 + Cygwin - posted by Jean-François Gingras <je...@gmail.com> on 2011/12/08 03:25:40 UTC, 5 replies.
- Generator: 0 records selected for fetching, exiting ... - posted by Rafael Pappert <rp...@fwpsystems.com> on 2011/12/08 03:50:20 UTC, 2 replies.
- The book "Building Search Applications with Lucene and Nutch" - posted by remi tassing <ta...@gmail.com> on 2011/12/08 13:39:09 UTC, 1 replies.
- Selective fetching without exclusion - posted by Joshua J Pavel <jp...@us.ibm.com> on 2011/12/08 17:36:12 UTC, 1 replies.
- Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data - posted by Muhammad Rizwan <mu...@sigmatec.com.pk> on 2011/12/09 08:40:12 UTC, 10 replies.
- "URLFilterChecker" documentation - posted by remi tassing <ta...@gmail.com> on 2011/12/09 13:32:41 UTC, 9 replies.
- crawl only fetches one page each time! - posted by Mohammad wrk <mh...@yahoo.com> on 2011/12/10 18:27:43 UTC, 3 replies.
- Nutch + Solr + Carrot2 Tutorials - posted by Swapnil Kulkarni <sw...@usc.edu> on 2011/12/11 16:45:49 UTC, 0 replies.
- Re : Re : Re: generate/update times and crawldb size - posted by Danicela nutch <Da...@mail.com> on 2011/12/12 17:05:17 UTC, 0 replies.
- Applying for subscription - posted by shashwat shriparv <dw...@gmail.com> on 2011/12/13 08:09:00 UTC, 0 replies.
- How can i crawl data from hbase using nutch - posted by shashwat shriparv <dw...@gmail.com> on 2011/12/13 08:19:20 UTC, 2 replies.
- I want to subscribe to this group - posted by shashwat shriparv <dw...@gmail.com> on 2011/12/13 11:50:56 UTC, 0 replies.
- Re : Re : Re : Re: generate/update times and crawldb size - posted by Danicela nutch <Da...@mail.com> on 2011/12/13 14:46:04 UTC, 0 replies.
- Running Nutch in Tomcat - path to conf folder - posted by "Avni, Itamar" <It...@verint.com> on 2011/12/13 20:08:25 UTC, 0 replies.
- Bug in o.a.n.n.URLNormalizerChecker? - posted by Lewis John Mcgibbney <le...@gmail.com> on 2011/12/13 20:08:42 UTC, 1 replies.
- how to adjust 'content' - posted by "Hartl, Florian" <fl...@sap.com> on 2011/12/14 02:11:17 UTC, 2 replies.
- Is it possible to crawl hdfs file system using nutch - posted by shashwat shriparv <dw...@gmail.com> on 2011/12/14 11:04:57 UTC, 2 replies.
- Solr Indexing - posted by Rafael Pappert <rp...@fwpsystems.com> on 2011/12/14 12:14:15 UTC, 3 replies.
- SolrIndex java.io.IOException: Job failed! - posted by remi tassing <ta...@gmail.com> on 2011/12/14 14:57:40 UTC, 3 replies.
- check out the iPhone game, that i have developed. - posted by Vijayakrishna <vk...@gmail.com> on 2011/12/14 22:29:15 UTC, 0 replies.
- Nutch readdb shows much more fetched urls than parsed - posted by mikaza <mi...@mediainsight.info> on 2011/12/15 11:39:21 UTC, 2 replies.
- LinkDB usages - posted by Danicela nutch <Da...@mail.com> on 2011/12/15 14:32:21 UTC, 1 replies.
- Success Error? - posted by Christopher Gross <co...@gmail.com> on 2011/12/15 15:36:40 UTC, 8 replies.
- Nutch Hadoop Optimization - posted by Bai Shen <ba...@gmail.com> on 2011/12/15 17:22:23 UTC, 7 replies.
- Crawling Sharepoint - posted by Christopher Gross <co...@gmail.com> on 2011/12/15 21:13:12 UTC, 1 replies.
- Malformed URL: '', skipping (java.net.MalformedURLException - posted by mina <ta...@gmail.com> on 2011/12/15 23:48:50 UTC, 4 replies.
- Re: Content field does not provied fully parsed text. Why? - posted by jepse <jp...@jepse.net> on 2011/12/16 16:20:35 UTC, 0 replies.
- Java out of memory error - posted by Bai Shen <ba...@gmail.com> on 2011/12/16 17:13:45 UTC, 8 replies.
- Crawl fails: Input path does not exist - posted by Dean Pullen <de...@semantico.com> on 2011/12/16 18:26:20 UTC, 5 replies.
- updates to runbot.sh - posted by Christopher Gross <co...@gmail.com> on 2011/12/16 20:56:15 UTC, 2 replies.
- Runaway fetcher threads - posted by Ar...@csiro.au on 2011/12/19 08:32:53 UTC, 4 replies.
- 'A record version mismatch occured' - posted by Danicela nutch <Da...@mail.com> on 2011/12/19 10:17:54 UTC, 2 replies.
- Re : Re: 'A record version mismatch occured' - posted by Danicela nutch <Da...@mail.com> on 2011/12/19 12:14:07 UTC, 0 replies.
- Workaround for "(..) can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus"? - posted by Marek Bachmann <m....@uni-kassel.de> on 2011/12/19 12:50:52 UTC, 3 replies.
- Meta Tags - posted by Marek Bachmann <m....@uni-kassel.de> on 2011/12/19 15:30:12 UTC, 4 replies.
- Missing document - posted by Christopher Gross <co...@gmail.com> on 2011/12/19 17:16:58 UTC, 10 replies.
- problem with tutorial - posted by Christopher Gross <co...@gmail.com> on 2011/12/19 19:41:27 UTC, 10 replies.
- Can't crawl a domain; can't figure out why. - posted by Chip Calhoun <cc...@aip.org> on 2011/12/19 22:53:02 UTC, 5 replies.
- Correct syntax for regex-urlfilter.txt - trying to exclude single path results - posted by Matt Poff <ma...@headfirst.co.nz> on 2011/12/20 06:09:45 UTC, 5 replies.
- error in topN - posted by mina <ta...@gmail.com> on 2011/12/20 12:48:46 UTC, 1 replies.
- Re: topN-help - posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov> on 2011/12/21 07:23:47 UTC, 0 replies.
- nutch parse Tika problem - posted by Xiao Li <sh...@gmail.com> on 2011/12/22 02:06:14 UTC, 0 replies.
- Hadoop .20.205 & Nutch 1.3 - posted by Peyman Mohajerian <mo...@gmail.com> on 2011/12/22 02:47:21 UTC, 4 replies.
- HtmlParser parse-html-plugin - posted by jepse <jp...@jepse.net> on 2011/12/22 13:41:20 UTC, 2 replies.
- Parsing fetcher hangs oocasionally - posted by Markus Jelsma <ma...@openindex.io> on 2011/12/22 15:56:47 UTC, 0 replies.
- Trouble building Nutch - posted by Patrick Durusau <pa...@durusau.net> on 2011/12/22 16:23:35 UTC, 1 replies.
- Fetch Retries - posted by Bai Shen <ba...@gmail.com> on 2011/12/22 19:39:02 UTC, 3 replies.
- nutch solr index process to add tag when indexing solr - posted by abhayd <aj...@hotmail.com> on 2011/12/22 20:20:40 UTC, 1 replies.
- Retrieve the original HTML from nutch-1.4 crawldb - posted by 邓尧 <to...@gmail.com> on 2011/12/23 03:06:26 UTC, 2 replies.
- Nutch and classification (Re: Fwd: Meta Tags) - posted by Marek Bachmann <m....@uni-kassel.de> on 2011/12/23 13:17:22 UTC, 0 replies.
- Re: Multiple values encountered for non multivalued field - posted by Bai Shen <ba...@gmail.com> on 2011/12/23 16:27:21 UTC, 1 replies.
- error in solrindex command in nutch 1.4 - posted by mina <ta...@gmail.com> on 2011/12/26 11:12:25 UTC, 3 replies.
- Re: Authentication issue nutch1.4 - posted by Susam Pal <su...@susam.in> on 2011/12/26 12:36:54 UTC, 0 replies.
- why nutch 1.4 don't set the origin html content field in solrindexer - posted by Cube Agen <ag...@gmail.com> on 2011/12/28 15:29:07 UTC, 5 replies.
- Drupal Integration with Nutch via CSIRO's Arch ? - posted by Nicholas Roberts <ni...@themediasociety.org> on 2011/12/29 07:31:36 UTC, 4 replies.
- Continuous Crawling - posted by Bai Shen <ba...@gmail.com> on 2011/12/29 21:29:02 UTC, 0 replies.
- nutch reindexes all documents after each crawl - posted by Magnús Skúlason <ma...@gmail.com> on 2011/12/30 14:16:00 UTC, 1 replies.
- Working with Twitter - posted by Lewis John Mcgibbney <le...@gmail.com> on 2011/12/30 19:33:01 UTC, 4 replies.
- Happy new year.... - posted by shashwat shriparv <dw...@gmail.com> on 2011/12/31 21:43:50 UTC, 0 replies.
- Happy new year - posted by shashwat shriparv <dw...@gmail.com> on 2011/12/31 21:45:50 UTC, 0 replies.