- Re: Google Analytics in Hadoop ? - posted by Lewis John Mcgibbney <le...@gmail.com> on 2012/05/01 12:06:27 UTC, 0 replies.
- Re: Indexing meta tags in Nutch 1.4 - posted by Lewis John Mcgibbney <le...@gmail.com> on 2012/05/01 12:15:24 UTC, 7 replies.
- Crawl sites with hashtags in url - posted by Roberto Gardenier <r....@simgroep.nl> on 2012/05/01 13:25:25 UTC, 6 replies.
- RE: [jira] [Closed] (NUTCH-1343) Crawl sites with hashtags in url - posted by Roberto Gardenier <r....@simgroep.nl> on 2012/05/01 13:55:28 UTC, 0 replies.
- Hadoop not doing anything - posted by Dean Pullen <de...@semantico.com> on 2012/05/01 17:26:31 UTC, 1 replies.
- Re: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local.. - posted by Sebastian Nagel <wa...@googlemail.com> on 2012/05/01 23:15:15 UTC, 4 replies.
- Meta tags of some URLs are not indexed - posted by "mendoza.juan" <me...@gmail.com> on 2012/05/02 16:03:10 UTC, 1 replies.
- Re: fields foreach document - posted by Lewis John Mcgibbney <le...@gmail.com> on 2012/05/02 20:01:21 UTC, 2 replies.
- Re: Generator OOM - posted by Markus Jelsma <ma...@openindex.io> on 2012/05/03 08:32:16 UTC, 0 replies.
- Avoid crawling nonsense calendar webpage - posted by Xiao Li <sh...@gmail.com> on 2012/05/04 21:13:30 UTC, 2 replies.
- Problem writing SerDe to read Nutch crawldb because Hive seems to ignore the Key and only reads the Value from SequenceFiles. - posted by Ali Safdar Kureishy <sa...@gmail.com> on 2012/05/05 22:05:53 UTC, 4 replies.
- Exception thrown when loading class org.apache.nutch.protocol.Protocol while reading the "crawldb" SequenceFile - posted by Ali Safdar Kureishy <sa...@gmail.com> on 2012/05/07 01:09:35 UTC, 0 replies.
- link without href - posted by Mohammad wrk <mo...@yahoo.com> on 2012/05/07 08:19:19 UTC, 1 replies.
- How do I merge indexes so that the "indexes" folder is merged as well? - posted by "nutch.buddy@gmail.com" <nu...@gmail.com> on 2012/05/07 12:07:59 UTC, 1 replies.
- Re: Client certificate authentication - posted by Siddharth Jain <si...@gmail.com> on 2012/05/07 12:24:05 UTC, 0 replies.
- Re: https authentication - posted by Siddharth Jain <si...@gmail.com> on 2012/05/07 12:30:47 UTC, 0 replies.
- Is it possible to control the segment size? - posted by "nutch.buddy@gmail.com" <nu...@gmail.com> on 2012/05/08 07:31:43 UTC, 4 replies.
- De-duplication of Nutch parsed data - posted by Vikas Hazrati <vi...@knoldus.com> on 2012/05/08 13:14:31 UTC, 5 replies.
- HTML documents with TXT extension - posted by Bai Shen <ba...@gmail.com> on 2012/05/08 14:34:58 UTC, 2 replies.
- Lower case URLs - correct regex? - posted by Dean Pullen <de...@semantico.com> on 2012/05/08 14:37:47 UTC, 3 replies.
- CLASSPATH - posted by Tolga <to...@ozses.net> on 2012/05/09 09:00:53 UTC, 8 replies.
- Consistent Checksum error using SequenceFileInputFormat against /content & /parse_text folders output by Nutch. - posted by Ali Safdar Kureishy <sa...@gmail.com> on 2012/05/09 10:01:52 UTC, 0 replies.
- HTTP ERROR 400 - posted by Stephan Kristyn <kr...@yahoo-inc.com> on 2012/05/09 12:11:21 UTC, 14 replies.
- Working! - posted by Tolga <to...@ozses.net> on 2012/05/09 14:45:34 UTC, 0 replies.
- Make Nutch to crawl internal urls only - posted by James Ford <si...@gmail.com> on 2012/05/09 17:09:09 UTC, 7 replies.
- Focused Crawling with Nutch (IndexingFilter:filter) - posted by Michael Erickson <er...@gmail.com> on 2012/05/09 20:07:11 UTC, 2 replies.
- HTTP error 400 - posted by Tolga <to...@ozses.net> on 2012/05/10 08:10:04 UTC, 19 replies.
- Running nutch in eclipse - posted by Vijith <vi...@gmail.com> on 2012/05/10 13:14:41 UTC, 6 replies.
- Crawl-tool for iterative crawling? - posted by Matthias Paul <ma...@gmail.com> on 2012/05/10 18:18:08 UTC, 8 replies.
- Separate logger for nutch - posted by Vijith <vi...@gmail.com> on 2012/05/11 11:37:40 UTC, 6 replies.
- Indexing HTML metatags from Nutch into Solr - posted by ML mail <ml...@yahoo.com> on 2012/05/11 12:40:36 UTC, 0 replies.
- Re: Indexing HTML metatags from Nutch into Solr - posted by "Ing. Eyeris Rodriguez Rueda" <er...@uci.cu> on 2012/05/11 19:55:47 UTC, 6 replies.
- Nutchgora - SQLTransientConnectionException? - posted by Ramsel Ruiz <si...@gmail.com> on 2012/05/12 01:25:34 UTC, 3 replies.
- Heap space problem when running nutch on cluster - posted by "nutch.buddy@gmail.com" <nu...@gmail.com> on 2012/05/13 17:16:57 UTC, 1 replies.
- Can't retrieve Tika parser for mime-type text/javascript - posted by forwardswing <wa...@sohu.com> on 2012/05/14 05:24:29 UTC, 10 replies.
- java.lang.NullPointerException:org.apache.hadoop.io.Text.encode(Text.java:388) - posted by forwardswing <wa...@sohu.com> on 2012/05/14 05:28:08 UTC, 2 replies.
- Couldn't get robots.txt for site - posted by kh3rad <kh...@gmail.com> on 2012/05/14 11:46:35 UTC, 2 replies.
- webpage download - posted by Taeseong Kim <fl...@gmail.com> on 2012/05/15 05:45:28 UTC, 3 replies.
- nutch 1-4 with solr-4 - posted by ramires <uy...@beriltech.com> on 2012/05/15 09:47:35 UTC, 0 replies.
- solrindex - posted by Tolga <to...@ozses.net> on 2012/05/15 15:27:08 UTC, 1 replies.
- Tika parser exception IndexOutOfBoundsException - posted by LEVILLAIN Olivier <ol...@coface.com> on 2012/05/15 16:17:26 UTC, 2 replies.
- Block irrelevant urls - posted by Vijith <vi...@gmail.com> on 2012/05/15 20:07:42 UTC, 1 replies.
- curl or nutch - posted by Tolga <to...@ozses.net> on 2012/05/16 09:43:49 UTC, 1 replies.
- problem - Failed to set permissions of path: \tmp\hadoop - posted by Florian Hartl <fl...@hartl.info> on 2012/05/16 22:34:42 UTC, 1 replies.
- In-link data scattered/duplicated across multiple folders in Nutch... - posted by Ali Safdar Kureishy <sa...@gmail.com> on 2012/05/16 22:54:26 UTC, 0 replies.
- ERROR solr.SolrIndexer - java.io.IOException: Job failed! - posted by cameron tran <ca...@gmail.com> on 2012/05/18 06:58:39 UTC, 3 replies.
- Exclude certain mime-types - posted by Matthias Paul <ma...@gmail.com> on 2012/05/18 14:56:30 UTC, 1 replies.
- Re: [VOTE] Apache Nutch 1.5 release rc #1 - posted by Matthias Paul <ma...@gmail.com> on 2012/05/18 15:08:30 UTC, 6 replies.
- use nutch to crawl information in google group - posted by haochen <ch...@gmail.com> on 2012/05/21 08:48:50 UTC, 0 replies.
- Using nutch and solr for lotus notes - posted by cameron tran <ca...@gmail.com> on 2012/05/21 09:39:47 UTC, 3 replies.
- org.apache.solr.common.SolrException: ERROR: [doc=null] missing required field: id - posted by Tolga <to...@ozses.net> on 2012/05/21 13:02:56 UTC, 0 replies.
- Setting the Fetch time with a CustomFetchSchedule - posted by Vikas Hazrati <vi...@knoldus.com> on 2012/05/21 13:43:46 UTC, 5 replies.
- Crawl / index files as well - posted by Tolga <to...@ozses.net> on 2012/05/21 13:54:58 UTC, 0 replies.
- error parsing some xml - posted by "Ing. Eyeris Rodriguez Rueda" <er...@uci.cu> on 2012/05/21 17:06:56 UTC, 3 replies.
- Bug in Trunk Generator mapper? - posted by Lewis John Mcgibbney <le...@gmail.com> on 2012/05/21 20:06:50 UTC, 3 replies.
- PDF not crawled/indexed - posted by Tolga <to...@ozses.net> on 2012/05/22 09:48:15 UTC, 18 replies.
- Get Parent of URLs fetched by nutch - posted by blunderboy <sa...@gmail.com> on 2012/05/22 12:40:41 UTC, 2 replies.
- URL filtering and normalization - posted by Bai Shen <ba...@gmail.com> on 2012/05/22 19:39:58 UTC, 3 replies.
- Apache Nutch release 1.5 RC2 - posted by Lewis John Mcgibbney <le...@gmail.com> on 2012/05/22 21:59:24 UTC, 9 replies.
- parse.ParserFactory - posted by Tolga <to...@ozses.net> on 2012/05/22 22:20:16 UTC, 6 replies.
- using less resources - posted by al...@aim.com on 2012/05/22 22:25:44 UTC, 1 replies.
- Common Crawl dataset - posted by "Caklovic, Nenad" <Ne...@amd.com> on 2012/05/23 01:45:35 UTC, 2 replies.
- One last question - posted by Tolga <to...@ozses.net> on 2012/05/23 08:39:35 UTC, 3 replies.
- Need help - posted by abhishek tiwari <ab...@gmail.com> on 2012/05/23 11:30:04 UTC, 3 replies.
- Apparently far from last question :) - posted by Tolga <to...@ozses.net> on 2012/05/23 12:44:32 UTC, 4 replies.
- Need help in configuring nutch in eclipse - posted by Susmita Das <su...@yahoo.com> on 2012/05/23 14:38:36 UTC, 1 replies.
- nutch hadoop only one slave is crawling - posted by Dustine Rene Bernasor <du...@thecyberguardian.com> on 2012/05/24 06:12:44 UTC, 2 replies.
- Large website not fully crawled - posted by Tolga <to...@ozses.net> on 2012/05/24 09:17:39 UTC, 7 replies.
- Multiple nutch jobs on a Hadoop cluster simultaneously - posted by Dustine Rene Bernasor <du...@thecyberguardian.com> on 2012/05/24 12:57:25 UTC, 3 replies.
- Reply: Re: Good workflow for a regular re-indexing job - posted by "xiaodong.han@gmail.com" <xi...@gmail.com> on 2012/05/24 13:01:31 UTC, 0 replies.
- XML parsing - posted by Tolga <to...@ozses.net> on 2012/05/24 20:14:52 UTC, 1 replies.
- Re: RSS parser - posted by Sebastian Nagel <wa...@googlemail.com> on 2012/05/24 21:28:53 UTC, 1 replies.
- Using Nutch for Web Site Mirroring - posted by "vlad.paunescu" <vl...@gmail.com> on 2012/05/25 13:07:31 UTC, 3 replies.
- New Nutch Committer and PMC member : Sebastian Nagel - posted by Julien Nioche <li...@gmail.com> on 2012/05/25 17:56:35 UTC, 1 replies.
- nutchgora NullPointerException during parse at NutchJob.waitForCompletion / avro.util.Utf8. - posted by George Smith <ge...@gmail.com> on 2012/05/25 22:00:19 UTC, 3 replies.
- Add Third party dependency to your nutch plugin - posted by blunderboy <sa...@gmail.com> on 2012/05/28 13:52:16 UTC, 4 replies.
- Nutch IRC ? - posted by Vikas Hazrati <vi...@knoldus.com> on 2012/05/28 18:05:16 UTC, 1 replies.
- No links to process, is the webgraph empty? - posted by Dustine Rene Bernasor <du...@thecyberguardian.com> on 2012/05/29 04:19:41 UTC, 6 replies.
- OSGI bundle of nutch - posted by blunderboy <sa...@gmail.com> on 2012/05/30 06:36:03 UTC, 2 replies.
- Cannot run program "chmod" - posted by "Ing. Eyeris Rodriguez Rueda" <er...@uci.cu> on 2012/05/30 20:46:20 UTC, 2 replies.
- [VOTE] Apache Nutch release 1.5 RC3 - posted by lewis john mcgibbney <le...@apache.org> on 2012/05/30 22:59:59 UTC, 0 replies.
- "nutch-site.xml" not robust - posted by Andy Xue <an...@gmail.com> on 2012/05/31 03:34:04 UTC, 1 replies.
- nutch1.4+hadoop1.0.3+solr3.4 - posted by John <hi...@qq.com> on 2012/05/31 16:36:30 UTC, 1 replies.
- Re: ParseSegment taking a long time to finish - posted by sidbatra <si...@gmail.com> on 2012/05/31 22:23:14 UTC, 0 replies.
- [VOTE] Apache Nutch 1.5 release-1.5RC4 - posted by Lewis John Mcgibbney <le...@gmail.com> on 2012/05/31 22:37:52 UTC, 0 replies.