Posted to user@nutch.apache.org by Fred Zimmerman <wf...@nimblebooks.com> on 2011/10/09 02:22:24 UTC

solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

Hi -- I am having trouble with the solrindexer parameters -- I see that
Lewis had similar problems a few months ago. Any idea what I am doing wrong?

bitnami@ip-10-202-202-68:~/nutch-1.3/nutch-1.3/runtime/local$ bin/nutch
> solrindex http://zimzazsearch3-1.bitnamiapp.com:8983/solr/ crawl/crawldb
> crawl/linkdb crawl/segments/*
> SolrIndexer: starting at 2011-10-09 00:13:24
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_fetch
> Input path does not exist:
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_parse
> Input path does not exist:
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_data
> Input path does not exist:
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_text
> Input path does not exist:
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_fetch
> Input path does not exist:
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_parse
> Input path does not exist:
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/parse_data
> Input path does not exist:
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/parse_text
> Input path does not exist:
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/crawl_parse
> Input path does not exist:
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/parse_data
> Input path does not exist:
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/parse_text
> Input path does not exist:
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/crawl_fetch
> Input path does not exist:
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/crawl_parse
> Input path does not exist:
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/parse_data
> Input path does not exist:
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/parse_text
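>
> A quick way to see what each of those segment directories actually contains, and whether Nutch can read them, is something like the following (paths taken from the output above; double-check the readseg options against your install):
>
> for s in crawl/segments/*/; do echo "== $s"; ls "$s"; done
> bin/nutch readseg -list crawl/segments/20110922143907   # ask Nutch itself what it sees in one segment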



-----------------------------------------------------
Subscribe to the Nimble Books Mailing List  http://eepurl.com/czS- for
monthly updates



On Sat, Oct 8, 2011 at 14:22, lewis john mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi guys,
>
> I have been watching this thread intently and I am very happy to see that
> there is some progress :0)
>
> Radim,
>
> Can I ask that you open a JIRA issue and submit a patch, this way we can
> not
> only track it, but it will also give the community a chance to test and
> validate the patch prior to integration into the source.
>
> Thanks
>
> Lewis
>
> On Fri, Oct 7, 2011 at 5:49 PM, Ramanathapuram, Rajesh <
> Rajesh.Ramanathapuram@turner.com> wrote:
>
> > Hi Radim,
> >
> >  Thank you so much for this. I am not familiar with commit process to the
> > core.
> >  Is there someone who can help us get this committed and help resolve
> this
> > issue?
> >
> > Thanks for all your help.
> >
> > Rajesh Ramana
> >
> > -----Original Message-----
> > From: Radim Kolar [mailto:hsn@sendmail.cz]
> > Sent: Thursday, October 06, 2011 2:18 PM
> > To: user@nutch.apache.org
> > Subject: Re: Nutch not crawling URLs with spanish accented characters (
> ñ)
> >
> > - The REGEX normalizer transforms the special characters, but fails to
> > substitute ‘%F1’ or ‘%C3%B1’ for ‘ñ’
> >  - The fetcher is having trouble interpreting the links with special
> > character ‘ñ’.
> >
> > i can add this transformation to basic-url normalizer if somebody is
> > willing to commit it.
> >
>
>
>
> --
> *Lewis*
>

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

Posted by Markus Jelsma <ma...@openindex.io>.
Besides, the -linkdb param is 1.4, not 1.3;
that's what's wrong here. Bai explicitly mentioned 1.4.
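
Roughly, the difference is (a sketch only; the Solr URL below is a placeholder, the crawl paths are the ones used earlier in this thread):

# Nutch 1.3: the linkdb is a plain positional argument
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
# Nutch 1.4: the linkdb is optional and must be passed via -linkdb (NUTCH-1054)
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*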

> Hi Fred,
> 
> Please ensure that the linkdb command was executed successfully. The output
> logs do not indicate this.
> Looks like you've got a '-' minus character in front of the relative linkdb
> directory as well.
> 
> HTH
> 
> On Wed, Oct 26, 2011 at 1:27 AM, Fred Zimmerman <zi...@gmail.com>wrote:
> > I'm still having trouble with this in 1.3. It looks as if there's something
> > dumb with the syntax or file structure, but I can't get it.
> > 
> > $ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb
> > -linkdb crawl/linkdb crawl/segments/*
> > 
> > SolrIndexer: starting at 2011-10-25 23:26:02
> > org.apache.hadoop.mapred.InvalidInputException: Input path does not
> > exist:
> > file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/crawl_fetch
> > Input path does not exist:
> > file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/crawl_parse
> > Input path does not exist:
> > file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/parse_data
> > Input path does not exist:
> > file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/parse_text
> > Input path does not exist:
> > file:/home/bitnami/nutch-1.3/runtime/local/-linkdb/current
> > 
> > 
> > On Tue, Oct 25, 2011 at 12:49 PM, Markus Jelsma
> > 
> > <ma...@openindex.io>wrote:
> > > From the changelog:
> > > http://svn.apache.org/viewvc/nutch/trunk/CHANGES.txt?view=markup
> > > 
> > > 111     * NUTCH-1054 LinkDB optional during indexing (jnioche)
> > > 
> > > With your command, the given linkdb is interpreted as a segment.
> > > 
> > > https://issues.apache.org/jira/browse/NUTCH-1054
> > > 
> > > This is the new command:
> > > 
> > > Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] (<segment> ... |
> > > -dir <segments>) [-noCommit]
> > > 
> > > On Tuesday 25 October 2011 18:41:09 Bai Shen wrote:
> > > > I'm having a similar issue.  I'm using 1.4 and getting these errors
> > 
> > with
> > 
> > > > linkdb.  The segments seem fine.
> > > > 
> > > > 2011-10-25 10:10:20,060 INFO  solr.SolrIndexer - SolrIndexer:
> > > > starting
> > 
> > at
> > 
> > > > 2011-10-25 10:10:20
> > > > 2011-10-25 10:10:20,110 INFO  indexer.IndexerMapReduce -
> > > 
> > > IndexerMapReduce:
> > > > crawldb: crawl/crawldb
> > > > 2011-10-25 10:10:20,110 INFO  indexer.IndexerMapReduce -
> > > 
> > > IndexerMapReduces:
> > > > adding segment: crawl/linkdb
> > > > 2011-10-25 10:10:20,136 INFO  indexer.IndexerMapReduce -
> > > 
> > > IndexerMapReduces:
> > > > adding segment: crawl/segments/20111025095216
> > > > 2011-10-25 10:10:20,138 INFO  indexer.IndexerMapReduce -
> > > 
> > > IndexerMapReduces:
> > > > adding segment: crawl/segments/20111025100004
> > > > 2011-10-25 10:10:20,207 ERROR solr.SolrIndexer -
> > > > org.apache.hadoop.mapred.InvalidInputException: Input path does not
> > > 
> > > exist:
> > > > file:/opt/nutch-1.4/runtime/local/crawl/linkdb/crawl_fetch
> > > > Input path does not exist:
> > > > file:/opt/nutch-1.4/runtime/local/crawl/linkdb/crawl_parse
> > > > Input path does not exist:
> > > > file:/opt/nutch-1.4/runtime/local/crawl/linkdb/parse_data
> > > > Input path does not exist:
> > > > file:/opt/nutch-1.4/runtime/local/crawl/linkdb/parse_text
> > > > 
> > > > 
> > > > Did something change with 1.4?
> > > > 
> > > > On Sun, Oct 9, 2011 at 6:15 AM, lewis john mcgibbney <
> > > > 
> > > > lewis.mcgibbney@gmail.com> wrote:
> > > > > Hi Fred,
> > > > > 
> > > > > How many individual directories do you have under
> > > > > /runtime/local/crawl/segments/
> > > > > ?
> > > > > 
> > > > > Another thing that raises alarms is the nohup.out dir's! Are these
> > > > > intentional? Interestingly, missing segment data is not the same
> > > > > with these dir's.
> > > > > 
> > > > > Does your log output indicate any discrepancies between various
> > 
> > command
> > 
> > > > > transitions?
> > > > > 
> > > > > 
> > > > > 
> > > > > bitnami@ip-10-202-202-68:~/nutch-1.3/nutch-1.3/runtime/local$ bin/nutch
> > > > > >> solrindex http://zimzazsearch3-1.bitnamiapp.com:8983/solr/ crawl/crawldb
> > > > > >> crawl/linkdb crawl/segments/*
> > > > > >> SolrIndexer: starting at 2011-10-09 00:13:24
> > > > > >> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> > > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_fetch
> > > > > >> Input path does not exist:
> > > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_parse
> > > > > >> Input path does not exist:
> > > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_data
> > > > > >> Input path does not exist:
> > > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_text
> > > > > >> Input path does not exist:
> > > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_fetch
> > > > > >> Input path does not exist:
> > > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_parse
> > > > > >> Input path does not exist:
> > > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/parse_data
> > > > > >> Input path does not exist:
> > > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/parse_text
> > > > > >> Input path does not exist:
> > > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/crawl_parse
> > > > > >> Input path does not exist:
> > > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/parse_data
> > > > > >> Input path does not exist:
> > > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/parse_text
> > > > > >> Input path does not exist:
> > > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/crawl_fetch
> > > > > >> Input path does not exist:
> > > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/crawl_parse
> > > > > >> Input path does not exist:
> > > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/parse_data
> > > > > >> Input path does not exist:
> > > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/parse_text
> > > > >
> > > > > > -----------------------------------------------------
> > > > > > Subscribe to the Nimble Books Mailing List  http://eepurl.com/czS- for
> > > > > > monthly updates
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > On Sat, Oct 8, 2011 at 14:22, lewis john mcgibbney <
> > > > > > 
> > > > > > lewis.mcgibbney@gmail.com> wrote:
> > > > > >> Hi guys,
> > > > > >> 
> > > > > >> I have been watching this thread intently and I am very happy to
> > 
> > see
> > 
> > > > > that
> > > > > 
> > > > > >> there is some progress :0)
> > > > > >> 
> > > > > >> Radim,
> > > > > >> 
> > > > > >> Can I ask that you open a JIRA issue and submit a patch, this
> > > > > >> way
> > 
> > we
> > 
> > > > > >> can not
> > > > > >> only track it, but it will also give the community a chance to
> > 
> > test
> > 
> > > > > >> and validate the patch prior to integration into the source.
> > > > > >> 
> > > > > >> Thanks
> > > > > >> 
> > > > > >> Lewis
> > > > > >> 
> > > > > >> On Fri, Oct 7, 2011 at 5:49 PM, Ramanathapuram, Rajesh <
> > > > > >> 
> > > > > >> Rajesh.Ramanathapuram@turner.com> wrote:
> > > > > >> > Hi Radim,
> > > > > >> > 
> > > > > >> >  Thank you so much for this. I am not familiar with commit
> > 
> > process
> > 
> > > > > >> >  to
> > > > > >> 
> > > > > >> the
> > > > > >> 
> > > > > >> > core.
> > > > > >> > 
> > > > > >> >  Is there someone who can help us get this committed and help
> > > > > >> >  resolve
> > > > > >> 
> > > > > >> this
> > > > > >> 
> > > > > >> > issue?
> > > > > >> > 
> > > > > >> > Thanks for all your help.
> > > > > >> > 
> > > > > >> > Rajesh Ramana
> > > > > >> > 
> > > > > >> > -----Original Message-----
> > > > > >> > From: Radim Kolar [mailto:hsn@sendmail.cz]
> > > > > >> > Sent: Thursday, October 06, 2011 2:18 PM
> > > > > >> > To: user@nutch.apache.org
> > > > > >> > Subject: Re: Nutch not crawling URLs with spanish accented
> > > > > >> > characters
> > > > > 
> > > > > (
> > > > > 
> > > > > >> ñ)
> > > > > >> 
> > > > > >> > - The REGEX normalizer transforms the special characters, but
> > > 
> > > fails
> > > 
> > > > > >> > to substitute ‘%F1’ or ‘%C3%B1’ for ‘ñ’
> > > > > >> > 
> > > > > >> >  - The fetcher is having trouble interpreting the links with
> > > 
> > > special
> > > 
> > > > > >> > character ‘ñ’.
> > > > > >> > 
> > > > > >> > i can add this transformation to basic-url normalizer if
> > 
> > somebody
> > 
> > > is
> > > 
> > > > > >> > willing to commit it.
> > > > > >> 
> > > > > >> --
> > > > > >> *Lewis*
> > > > > 
> > > > > --
> > > > > *Lewis*
> > > 
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

Posted by Fred Zimmerman <zi...@gmail.com>.
will do.  Of course I have already googled these terms without much luck.
 Fred

On Wed, Oct 26, 2011 at 9:34 AM, lewis john mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Fred,
>
> These are clearly Solr-aimed questions, which I would observe are specific
> to your schema. Maybe try the Solr archives for keywords, or else try the
> Solr user lists. I think that you are much more likely to get a
> substantiated response there.
>
> Thank you
>
> On Wed, Oct 26, 2011 at 3:31 PM, Fred Zimmerman <zimzaz.wfz@gmail.com
> >wrote:
>
> > I added just the <content> field ... I have already modified solr's
> > schema.xml to accommodate some other data types.
> >
> > Now when starting solr ...
> >
> > INFO: SolrUpdateServlet.init() done
> > 2011-10-26 13:29:50.849:INFO::Started SocketConnector@0.0.0.0:8983
> > 2011-10-26 13:30:23.129:WARN::/solr/admin/
> > java.lang.IllegalStateException: STREAM
> >        at org.mortbay.jetty.Response.getWriter(Response.java:616) etc ...
> >
> >
> > On Wed, Oct 26, 2011 at 9:16 AM, Markus Jelsma
> > <ma...@openindex.io>wrote:
> >
> > > Add the schema.xml from nutch/conf to your Solr core.
> > >
> > > btw: be careful with your host and port in the mailing lists. If it's
> > > open....
> > >
> > > On Wednesday 26 October 2011 15:07:56 Fred Zimmerman wrote:
> > > > that's it.
> > > >
> > > > org.apache.solr.common.SolrException: ERROR:unknown field 'content'
> > > >
> > > > *ERROR:unknown field 'content'*
> > > >
> > > > request: http://url/solr/update?wt=javabin&version=2
> > > >         at
> > > >
> > >
> >
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttp
> > > > SolrServer.java:436) at
> > > >
> > >
> >
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttp
> > > > SolrServer.java:245) at
> > > >
> > >
> >
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(Abstract
> > > > UpdateRequest.java:105) at
> > > > org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49) at
> > > > org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:82)
> > > >         at
> > > >
> > >
> >
> org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.ja
> > > > va:48) at
> > > >
> org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
> > > >         at
> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
> > > >         at
> > > >
> > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> > > > 2011-10-26 12:58:20,596 ERROR solr.SolrIndexer - java.io.IOException:
> > Job
> > > > failed!
> > > >
> > > >
> > > > On Wed, Oct 26, 2011 at 9:03 AM, Markus Jelsma
> > > >
> > > > <ma...@openindex.io>wrote:
> > > > > Check your hadoop.log and Solr log. If that happens there's usually a
> > > > > field mismatch when indexing.
> > > > >
> > > > > On Wednesday 26 October 2011 14:59:02 Fred Zimmerman wrote:
> > > > > > OK, I've fixed the problem with the parameters giving incorrect
> > paths
> > > > > > to the files. Now I get this:
> > > > > >
> > > > > > $ bin/nutch solrindex
> > > > > > http://search.zimzaz.com:8983/solr crawl/crawldb
> > > > > > crawl/linkdb crawl/segments/*
> > > > > > SolrIndexer: starting at 2011-10-26 12:57:57
> > > > > > java.io.IOException: Job failed!
> > > > >
> > > > > --
> > > > > Markus Jelsma - CTO - Openindex
> > > > > http://www.linkedin.com/in/markus17
> > > > > 050-8536620 / 06-50258350
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
> > >
> >
>
>
>
> --
> *Lewis*
>

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Fred,

These are clearly Solr-aimed questions, which I would observe are specific
to your schema. Maybe try the Solr archives for keywords, or else try the
Solr user lists. I think that you are much more likely to get a substantiated
response there.

Thank you

On Wed, Oct 26, 2011 at 3:31 PM, Fred Zimmerman <zi...@gmail.com>wrote:

> I added just the <content> field ... I have already modified solr's
> schema.xml to accommodate some other data types.
>
> Now when starting solr ...
>
> INFO: SolrUpdateServlet.init() done
> 2011-10-26 13:29:50.849:INFO::Started SocketConnector@0.0.0.0:8983
> 2011-10-26 13:30:23.129:WARN::/solr/admin/
> java.lang.IllegalStateException: STREAM
>        at org.mortbay.jetty.Response.getWriter(Response.java:616) etc ...
>
>
> On Wed, Oct 26, 2011 at 9:16 AM, Markus Jelsma
> <ma...@openindex.io>wrote:
>
> > Add the schema.xml from nutch/conf to your Solr core.
> >
> > btw: be careful with your host and port in the mailing lists. If it's
> > open....
> >
> > On Wednesday 26 October 2011 15:07:56 Fred Zimmerman wrote:
> > > that's it.
> > >
> > > org.apache.solr.common.SolrException: ERROR:unknown field 'content'
> > >
> > > *ERROR:unknown field 'content'*
> > >
> > > request: http://url/solr/update?wt=javabin&version=2
> > >         at
> > >
> >
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttp
> > > SolrServer.java:436) at
> > >
> >
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttp
> > > SolrServer.java:245) at
> > >
> >
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(Abstract
> > > UpdateRequest.java:105) at
> > > org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49) at
> > > org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:82)
> > >         at
> > >
> >
> org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.ja
> > > va:48) at
> > > org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
> > >         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
> > >         at
> > >
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> > > 2011-10-26 12:58:20,596 ERROR solr.SolrIndexer - java.io.IOException:
> Job
> > > failed!
> > >
> > >
> > > On Wed, Oct 26, 2011 at 9:03 AM, Markus Jelsma
> > >
> > > <ma...@openindex.io>wrote:
> > > > Check your hadoop.log and Solr log. If that happens there's usually a
> > > > field mismatch when indexing.
> > > >
> > > > On Wednesday 26 October 2011 14:59:02 Fred Zimmerman wrote:
> > > > > OK, I've fixed the problem with the parameters giving incorrect
> paths
> > > > > to the files. Now I get this:
> > > > >
> > > > > $ bin/nutch solrindex
> > > > > http://search.zimzaz.com:8983/solr crawl/crawldb
> > > > > crawl/linkdb crawl/segments/*
> > > > > SolrIndexer: starting at 2011-10-26 12:57:57
> > > > > java.io.IOException: Job failed!
> > > >
> > > > --
> > > > Markus Jelsma - CTO - Openindex
> > > > http://www.linkedin.com/in/markus17
> > > > 050-8536620 / 06-50258350
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
> >
>



-- 
*Lewis*

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

Posted by Fred Zimmerman <zi...@gmail.com>.
I added just the <content> field ... I have already modified solr's
schema.xml to accommodate some other data types.

Now when starting solr ...

INFO: SolrUpdateServlet.init() done
2011-10-26 13:29:50.849:INFO::Started SocketConnector@0.0.0.0:8983
2011-10-26 13:30:23.129:WARN::/solr/admin/
java.lang.IllegalStateException: STREAM
        at org.mortbay.jetty.Response.getWriter(Response.java:616) etc ...


On Wed, Oct 26, 2011 at 9:16 AM, Markus Jelsma
<ma...@openindex.io>wrote:

> Add the schema.xml from nutch/conf to your Solr core.
>
> btw: be careful with your host and port in the mailing lists. If it's
> open....
>
> On Wednesday 26 October 2011 15:07:56 Fred Zimmerman wrote:
> > that's it.
> >
> > org.apache.solr.common.SolrException: ERROR:unknown field 'content'
> >
> > *ERROR:unknown field 'content'*
> >
> > request: http://url/solr/update?wt=javabin&version=2
> >         at
> >
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttp
> > SolrServer.java:436) at
> >
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttp
> > SolrServer.java:245) at
> >
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(Abstract
> > UpdateRequest.java:105) at
> > org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49) at
> > org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:82)
> >         at
> >
> org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.ja
> > va:48) at
> > org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
> >         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
> >         at
> > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> > 2011-10-26 12:58:20,596 ERROR solr.SolrIndexer - java.io.IOException: Job
> > failed!
> >
> >
> > On Wed, Oct 26, 2011 at 9:03 AM, Markus Jelsma
> >
> > <ma...@openindex.io>wrote:
> > > Check your hadoop.log and Solr log. If that happens there's usually a
> > > field mismatch when indexing.
> > >
> > > On Wednesday 26 October 2011 14:59:02 Fred Zimmerman wrote:
> > > > OK, I've fixed the problem with the parameters giving incorrect paths
> > > > to the files. Now I get this:
> > > >
> > > > $ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb
> > > > crawl/linkdb crawl/segments/*
> > > > SolrIndexer: starting at 2011-10-26 12:57:57
> > > > java.io.IOException: Job failed!
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

Posted by Markus Jelsma <ma...@openindex.io>.
Add the schema.xml from nutch/conf to your Solr core.

btw: be careful with your host and port in the mailing lists. If it's open....
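
In practice that is roughly (a sketch; the destination path assumes the stock Solr example layout, adjust to wherever your core actually lives):

cp $NUTCH_HOME/conf/schema.xml $SOLR_HOME/example/solr/conf/schema.xml   # overwrite the core's schema with Nutch's
# then restart Solr (or reload the core) so fields such as 'content' are known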

On Wednesday 26 October 2011 15:07:56 Fred Zimmerman wrote:
> that's it.
> 
> org.apache.solr.common.SolrException: ERROR:unknown field 'content'
> 
> *ERROR:unknown field 'content'*
> 
> request: http://url/solr/update?wt=javabin&version=2
>         at
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttp
> SolrServer.java:436) at
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttp
> SolrServer.java:245) at
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(Abstract
> UpdateRequest.java:105) at
> org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49) at
> org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:82)
>         at
> org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.ja
> va:48) at
> org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
>         at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> 2011-10-26 12:58:20,596 ERROR solr.SolrIndexer - java.io.IOException: Job
> failed!
> 
> 
> On Wed, Oct 26, 2011 at 9:03 AM, Markus Jelsma
> 
> <ma...@openindex.io>wrote:
> > Check your hadoop.log and Solr log. If that happens there's usually a
> > field mismatch when indexing.
> > 
> > On Wednesday 26 October 2011 14:59:02 Fred Zimmerman wrote:
> > > OK, I've fixed the problem with the parameters giving incorrect paths
> > > to the files. Now I get this:
> > > 
> > > $ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb
> > > crawl/linkdb crawl/segments/*
> > > SolrIndexer: starting at 2011-10-26 12:57:57
> > > java.io.IOException: Job failed!
> > 
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

Posted by Fred Zimmerman <zi...@gmail.com>.
that's it.

org.apache.solr.common.SolrException: ERROR:unknown field 'content'

*ERROR:unknown field 'content'*

request: http://search.zimzaz.com:8983/solr/update?wt=javabin&version=2
        at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:436)
        at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:245)
        at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
        at
org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:82)
        at
org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
        at
org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
        at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2011-10-26 12:58:20,596 ERROR solr.SolrIndexer - java.io.IOException: Job
failed!


On Wed, Oct 26, 2011 at 9:03 AM, Markus Jelsma
<ma...@openindex.io>wrote:

> Check your hadoop.log and Solr log. If that happens there's usually a field
> mismatch when indexing.
>
> On Wednesday 26 October 2011 14:59:02 Fred Zimmerman wrote:
> > OK, I've fixed the problem with the parameters giving incorrect paths to
> > the files. Now I get this:
> >
> > $ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb
> > crawl/linkdb crawl/segments/*
> > SolrIndexer: starting at 2011-10-26 12:57:57
> > java.io.IOException: Job failed!
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

Posted by Markus Jelsma <ma...@openindex.io>.
Check your hadoop.log and Solr log. If that happens there's usually a field
mismatch when indexing.
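
In a local run the indexer writes to logs/hadoop.log under runtime/local; something like this usually surfaces the underlying exception (path assumed, adjust to your install):

grep -B 2 -A 10 ERROR runtime/local/logs/hadoop.log | tail -n 40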

On Wednesday 26 October 2011 14:59:02 Fred Zimmerman wrote:
> OK, I've fixed the problem with the parameters giving incorrect paths to
> the files. Now I get this:
> 
> $ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb
> crawl/linkdb crawl/segments/*
> SolrIndexer: starting at 2011-10-26 12:57:57
> java.io.IOException: Job failed!

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

Posted by Fred Zimmerman <zi...@gmail.com>.
OK, I've fixed the problem with the parameters giving incorrect paths to the
files. Now I get this:

$ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb
crawl/linkdb crawl/segments/*
SolrIndexer: starting at 2011-10-26 12:57:57
java.io.IOException: Job failed!

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Fred,

Please ensure that the linkdb command was executed successfully. The output
logs do not indicate this.
Looks like you've got a '-' minus character in front of the relative linkdb
directory as well.

HTH
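
If the linkdb was never built, rebuilding it is cheap; a sketch using the paths from earlier in this thread (check bin/nutch invertlinks on your install for the exact options):

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
ls crawl/linkdb   # a healthy linkdb has a 'current' subdirectory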

On Wed, Oct 26, 2011 at 1:27 AM, Fred Zimmerman <zi...@gmail.com>wrote:

> I'm still having trouble with this in 1.3. It looks as if there's something
> dumb with the syntax or file structure, but I can't get it.
>
> $ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb
> -linkdb crawl/linkdb crawl/segments/*
>
> SolrIndexer: starting at 2011-10-25 23:26:02
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/crawl_fetch
> Input path does not exist:
> file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/crawl_parse
> Input path does not exist:
> file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/parse_data
> Input path does not exist:
> file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/parse_text
> Input path does not exist:
> file:/home/bitnami/nutch-1.3/runtime/local/-linkdb/current
>
>
> On Tue, Oct 25, 2011 at 12:49 PM, Markus Jelsma
> <ma...@openindex.io>wrote:
>
> > From the changelog:
> > http://svn.apache.org/viewvc/nutch/trunk/CHANGES.txt?view=markup
> >
> > 111     * NUTCH-1054 LinkDB optional during indexing (jnioche)
> >
> > With your command, the given linkdb is interpreted as a segment.
> >
> > https://issues.apache.org/jira/browse/NUTCH-1054
> >
> > This is the new command:
> >
> > Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] (<segment> ... |
> > -dir <segments>) [-noCommit]
> >
> > On Tuesday 25 October 2011 18:41:09 Bai Shen wrote:
> > > I'm having a similar issue.  I'm using 1.4 and getting these errors
> with
> > > linkdb.  The segments seem fine.
> > >
> > > 2011-10-25 10:10:20,060 INFO  solr.SolrIndexer - SolrIndexer: starting
> at
> > > 2011-10-25 10:10:20
> > > 2011-10-25 10:10:20,110 INFO  indexer.IndexerMapReduce -
> > IndexerMapReduce:
> > > crawldb: crawl/crawldb
> > > 2011-10-25 10:10:20,110 INFO  indexer.IndexerMapReduce -
> > IndexerMapReduces:
> > > adding segment: crawl/linkdb
> > > 2011-10-25 10:10:20,136 INFO  indexer.IndexerMapReduce -
> > IndexerMapReduces:
> > > adding segment: crawl/segments/20111025095216
> > > 2011-10-25 10:10:20,138 INFO  indexer.IndexerMapReduce -
> > IndexerMapReduces:
> > > adding segment: crawl/segments/20111025100004
> > > 2011-10-25 10:10:20,207 ERROR solr.SolrIndexer -
> > > org.apache.hadoop.mapred.InvalidInputException: Input path does not
> > exist:
> > > file:/opt/nutch-1.4/runtime/local/crawl/linkdb/crawl_fetch
> > > Input path does not exist:
> > > file:/opt/nutch-1.4/runtime/local/crawl/linkdb/crawl_parse
> > > Input path does not exist:
> > > file:/opt/nutch-1.4/runtime/local/crawl/linkdb/parse_data
> > > Input path does not exist:
> > > file:/opt/nutch-1.4/runtime/local/crawl/linkdb/parse_text
> > >
> > >
> > > Did something change with 1.4?
> > >
> > > On Sun, Oct 9, 2011 at 6:15 AM, lewis john mcgibbney <
> > >
> > > lewis.mcgibbney@gmail.com> wrote:
> > > > Hi Fred,
> > > >
> > > > How many individual directories do you have under
> > > > /runtime/local/crawl/segments/
> > > > ?
> > > >
> > > > Another thing that raises alarms is the nohup.out dir's! Are these
> > > > intentional? Interestingly, missing segment data is not the same with
> > > > these dir's.
> > > >
> > > > Does your log output indicate any discrepancies between various
> command
> > > > transitions?
> > > >
> > > >
> > > >
> > > > bitnami@ip-10-202-202-68:~/nutch-1.3/nutch-1.3/runtime/local$ bin/nutch
> > > > >> solrindex http://zimzazsearch3-1.bitnamiapp.com:8983/solr/ crawl/crawldb
> > > > >> crawl/linkdb crawl/segments/*
> > > > >> SolrIndexer: starting at 2011-10-09 00:13:24
> > > > >> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_fetch
> > > > >> Input path does not exist:
> > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_parse
> > > > >> Input path does not exist:
> > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_data
> > > > >> Input path does not exist:
> > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_text
> > > > >> Input path does not exist:
> > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_fetch
> > > > >> Input path does not exist:
> > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_parse
> > > > >> Input path does not exist:
> > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/parse_data
> > > > >> Input path does not exist:
> > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/parse_text
> > > > >> Input path does not exist:
> > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/crawl_parse
> > > > >> Input path does not exist:
> > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/parse_data
> > > > >> Input path does not exist:
> > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/parse_text
> > > > >> Input path does not exist:
> > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/crawl_fetch
> > > > >> Input path does not exist:
> > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/crawl_parse
> > > > >> Input path does not exist:
> > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/parse_data
> > > > >> Input path does not exist:
> > > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/parse_text
> > > >
> > > > > -----------------------------------------------------
> > > > > Subscribe to the Nimble Books Mailing List  http://eepurl.com/czS- for
> > > > > monthly updates
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Oct 8, 2011 at 14:22, lewis john mcgibbney <
> > > > >
> > > > > lewis.mcgibbney@gmail.com> wrote:
> > > > >> Hi guys,
> > > > >>
> > > > >> I have been watching this thread intently and I am very happy to
> see
> > > >
> > > > that
> > > >
> > > > >> there is some progress :0)
> > > > >>
> > > > >> Radim,
> > > > >>
> > > > >> Can I ask that you open a JIRA issue and submit a patch, this way
> we
> > > > >> can not
> > > > >> only track it, but it will also give the community a chance to
> test
> > > > >> and validate the patch prior to integration into the source.
> > > > >>
> > > > >> Thanks
> > > > >>
> > > > >> Lewis
> > > > >>
> > > > >> On Fri, Oct 7, 2011 at 5:49 PM, Ramanathapuram, Rajesh <
> > > > >>
> > > > >> Rajesh.Ramanathapuram@turner.com> wrote:
> > > > >> > Hi Radim,
> > > > >> >
> > > > >> >  Thank you so much for this. I am not familiar with commit
> process
> > > > >> >  to
> > > > >>
> > > > >> the
> > > > >>
> > > > >> > core.
> > > > >> >
> > > > >> >  Is there someone who can help us get this committed and help
> > > > >> >  resolve
> > > > >>
> > > > >> this
> > > > >>
> > > > >> > issue?
> > > > >> >
> > > > >> > Thanks for all your help.
> > > > >> >
> > > > >> > Rajesh Ramana
> > > > >> >
> > > > >> > -----Original Message-----
> > > > >> > From: Radim Kolar [mailto:hsn@sendmail.cz]
> > > > >> > Sent: Thursday, October 06, 2011 2:18 PM
> > > > >> > To: user@nutch.apache.org
> > > > >> > Subject: Re: Nutch not crawling URLs with spanish accented
> > > > >> > characters
> > > >
> > > > (
> > > >
> > > > >> ñ)
> > > > >>
> > > > >> > - The REGEX normalizer transforms the special characters, but
> > fails
> > > > >> > to substitute ‘%F1’ or ‘%C3%B1’ for ‘ñ’
> > > > >> >
> > > > >> >  - The fetcher is having trouble interpreting the links with
> > special
> > > > >> >
> > > > >> > character ‘ñ’.
> > > > >> >
> > > > >> > i can add this transformation to basic-url normalizer if
> somebody
> > is
> > > > >> > willing to commit it.
> > > > >>
> > > > >> --
> > > > >> *Lewis*
> > > >
> > > > --
> > > > *Lewis*
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
> >
>



-- 
*Lewis*

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

Posted by Fred Zimmerman <zi...@gmail.com>.
I'm still having trouble with this in 1.3. It looks as if there's something
dumb with the syntax or file structure, but I can't get it.

$ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb
-linkdb crawl/linkdb crawl/segments/*

SolrIndexer: starting at 2011-10-25 23:26:02
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/crawl_fetch
Input path does not exist:
file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/crawl_parse
Input path does not exist:
file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/parse_data
Input path does not exist:
file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/parse_text
Input path does not exist:
file:/home/bitnami/nutch-1.3/runtime/local/-linkdb/current
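
That last path is the give-away: 1.3 does not understand the -linkdb flag, so it reads the arguments positionally, takes the literal string '-linkdb' as the linkdb (hence the missing -linkdb/current) and treats crawl/linkdb as a segment. On 1.3 the linkdb is simply the third positional argument, so the command would be:

bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb crawl/linkdb crawl/segments/*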


On Tue, Oct 25, 2011 at 12:49 PM, Markus Jelsma
<ma...@openindex.io>wrote:

> From the changelog:
> http://svn.apache.org/viewvc/nutch/trunk/CHANGES.txt?view=markup
>
> 111     * NUTCH-1054 LinkDB optional during indexing (jnioche)
>
> With your command, the given linkdb is interpreted as a segment.
>
> https://issues.apache.org/jira/browse/NUTCH-1054
>
> This is the new command:
>
> Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] (<segment> ... |
> -dir <segments>) [-noCommit]
>
> On Tuesday 25 October 2011 18:41:09 Bai Shen wrote:
> > I'm having a similar issue.  I'm using 1.4 and getting these errors with
> > linkdb.  The segments seem fine.
> >
> > 2011-10-25 10:10:20,060 INFO  solr.SolrIndexer - SolrIndexer: starting at
> > 2011-10-25 10:10:20
> > 2011-10-25 10:10:20,110 INFO  indexer.IndexerMapReduce -
> IndexerMapReduce:
> > crawldb: crawl/crawldb
> > 2011-10-25 10:10:20,110 INFO  indexer.IndexerMapReduce -
> IndexerMapReduces:
> > adding segment: crawl/linkdb
> > 2011-10-25 10:10:20,136 INFO  indexer.IndexerMapReduce -
> IndexerMapReduces:
> > adding segment: crawl/segments/20111025095216
> > 2011-10-25 10:10:20,138 INFO  indexer.IndexerMapReduce -
> IndexerMapReduces:
> > adding segment: crawl/segments/20111025100004
> > 2011-10-25 10:10:20,207 ERROR solr.SolrIndexer -
> > org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist:
> > file:/opt/nutch-1.4/runtime/local/crawl/linkdb/crawl_fetch
> > Input path does not exist:
> > file:/opt/nutch-1.4/runtime/local/crawl/linkdb/crawl_parse
> > Input path does not exist:
> > file:/opt/nutch-1.4/runtime/local/crawl/linkdb/parse_data
> > Input path does not exist:
> > file:/opt/nutch-1.4/runtime/local/crawl/linkdb/parse_text
> >
> >
> > Did something change with 1.4?
> >
> > On Sun, Oct 9, 2011 at 6:15 AM, lewis john mcgibbney <
> >
> > lewis.mcgibbney@gmail.com> wrote:
> > > Hi Fred,
> > >
> > > How many individual directories do you have under
> > > /runtime/local/crawl/segments/
> > > ?
> > >
> > > Another thing that raises alarms is the nohup.out dir's! Are these
> > > intentional? Interestingly, missing segment data is not the same with
> > > these dir's.
> > >
> > > Does your log output indicate any discrepancies between various command
> > > transitions?
> > >
> > >
> > >
> > > bitnami@ip-10-202-202-68:~/nutch-1.3/nutch-1.3/runtime/local$ bin/nutch
> > > >> solrindex http://zimzazsearch3-1.bitnamiapp.com:8983/solr/ crawl/crawldb
> > > >> crawl/linkdb crawl/segments/*
> > > >> SolrIndexer: starting at 2011-10-09 00:13:24
> > > >> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_fetch
> > > >> Input path does not exist:
> > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_parse
> > > >> Input path does not exist:
> > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_data
> > > >> Input path does not exist:
> > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_text
> > > >> Input path does not exist:
> > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_fetch
> > > >> Input path does not exist:
> > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_parse
> > > >> Input path does not exist:
> > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/parse_data
> > > >> Input path does not exist:
> > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/parse_text
> > > >> Input path does not exist:
> > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/crawl_parse
> > > >> Input path does not exist:
> > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/parse_data
> > > >> Input path does not exist:
> > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/parse_text
> > > >> Input path does not exist:
> > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/crawl_fetch
> > > >> Input path does not exist:
> > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/crawl_parse
> > > >> Input path does not exist:
> > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/parse_data
> > > >> Input path does not exist:
> > > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/parse_text
> > >
> > > > -----------------------------------------------------
> > > > Subscribe to the Nimble Books Mailing List  http://eepurl.com/czS- for
> > > > monthly updates
> > > >
> > > >
> > > >
> > > > On Sat, Oct 8, 2011 at 14:22, lewis john mcgibbney <
> > > >
> > > > lewis.mcgibbney@gmail.com> wrote:
> > > >> Hi guys,
> > > >>
> > > >> I have been watching this thread intently and I am very happy to see
> > >
> > > that
> > >
> > > >> there is some progress :0)
> > > >>
> > > >> Radim,
> > > >>
> > > >> Can I ask that you open a JIRA issue and submit a patch, this way we
> > > >> can not
> > > >> only track it, but it will also give the community a chance to test
> > > >> and validate the patch prior to integration into the source.
> > > >>
> > > >> Thanks
> > > >>
> > > >> Lewis
> > > >>
> > > >> On Fri, Oct 7, 2011 at 5:49 PM, Ramanathapuram, Rajesh <
> > > >>
> > > >> Rajesh.Ramanathapuram@turner.com> wrote:
> > > >> > Hi Radim,
> > > >> >
> > > >> >  Thank you so much for this. I am not familiar with commit process
> > > >> >  to
> > > >>
> > > >> the
> > > >>
> > > >> > core.
> > > >> >
> > > >> >  Is there someone who can help us get this committed and help
> > > >> >  resolve
> > > >>
> > > >> this
> > > >>
> > > >> > issue?
> > > >> >
> > > >> > Thanks for all your help.
> > > >> >
> > > >> > Rajesh Ramana
> > > >> >
> > > >> > -----Original Message-----
> > > >> > From: Radim Kolar [mailto:hsn@sendmail.cz]
> > > >> > Sent: Thursday, October 06, 2011 2:18 PM
> > > >> > To: user@nutch.apache.org
> > > >> > Subject: Re: Nutch not crawling URLs with spanish accented
> > > >> > characters
> > >
> > > (
> > >
> > > >> ñ)
> > > >>
> > > >> > - The REGEX normalizer transforms the special characters, but
> fails
> > > >> > to substitute ‘%F1’ or ‘%C3%B1’ for ‘ñ’
> > > >> >
> > > >> >  - The fetcher is having trouble interpreting the links with
> special
> > > >> >
> > > >> > character ‘ñ’.
> > > >> >
> > > >> > i can add this transformation to basic-url normalizer if somebody
> is
> > > >> > willing to commit it.
> > > >>
> > > >> --
> > > >> *Lewis*
> > >
> > > --
> > > *Lewis*
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

Posted by Markus Jelsma <ma...@openindex.io>.
From the changelog:
http://svn.apache.org/viewvc/nutch/trunk/CHANGES.txt?view=markup

111 	* NUTCH-1054 LinkDB optional during indexing (jnioche) 

With your command, the given linkdb is interpreted as a segment. 

https://issues.apache.org/jira/browse/NUTCH-1054

This is the new command:

Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] (<segment> ... |
-dir <segments>) [-noCommit]
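
Concretely, with the crawl layout used earlier in this thread that becomes something like (the Solr URL is a placeholder):

bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
# or point it at the whole segments directory instead of listing segments
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments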

On Tuesday 25 October 2011 18:41:09 Bai Shen wrote:
> I'm having a similar issue.  I'm using 1.4 and getting these errors with
> linkdb.  The segments seem fine.
> 
> 2011-10-25 10:10:20,060 INFO  solr.SolrIndexer - SolrIndexer: starting at
> 2011-10-25 10:10:20
> 2011-10-25 10:10:20,110 INFO  indexer.IndexerMapReduce - IndexerMapReduce:
> crawldb: crawl/crawldb
> 2011-10-25 10:10:20,110 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
> adding segment: crawl/linkdb
> 2011-10-25 10:10:20,136 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
> adding segment: crawl/segments/20111025095216
> 2011-10-25 10:10:20,138 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
> adding segment: crawl/segments/20111025100004
> 2011-10-25 10:10:20,207 ERROR solr.SolrIndexer -
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/opt/nutch-1.4/runtime/local/crawl/linkdb/crawl_fetch
> Input path does not exist:
> file:/opt/nutch-1.4/runtime/local/crawl/linkdb/crawl_parse
> Input path does not exist:
> file:/opt/nutch-1.4/runtime/local/crawl/linkdb/parse_data
> Input path does not exist:
> file:/opt/nutch-1.4/runtime/local/crawl/linkdb/parse_text
> 
> 
> Did something change with 1.4?
> 
> On Sun, Oct 9, 2011 at 6:15 AM, lewis john mcgibbney <
> 
> lewis.mcgibbney@gmail.com> wrote:
> > Hi Fred,
> > 
> > How many individual directories do you have under
> > /runtime/local/crawl/segments/
> > ?
> > 
> > Another thing that raises alarms is the nohup.out dir's! Are these
> > intentional? Interestingly, missing segment data is not the same with
> > these dir's.
> > 
> > Does your log output indicate any discrepancies between various command
> > transitions?
> > 
> > 
> > 
> > bitnami@ip-10-202-202-68:~/nutch-1.3/nutch-1.3/runtime/local$ bin/nutch
> > 
> > >> solrindex http://zimzazsearch3-1.bitnamiapp.com:8983/solr/ crawl/crawldb
> > >> crawl/linkdb crawl/segments/*
> > >> SolrIndexer: starting at 2011-10-09 00:13:24
> > >> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_fetch
> > >> Input path does not exist:
> > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_parse
> > >> Input path does not exist:
> > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_data
> > >> Input path does not exist:
> > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_text
> > >> Input path does not exist:
> > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_fetch
> > >> Input path does not exist:
> > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_parse
> > >> Input path does not exist:
> > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/parse_data
> > >> Input path does not exist:
> > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/parse_text
> > >> Input path does not exist:
> > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/crawl_parse
> > >> Input path does not exist:
> > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/parse_data
> > >> Input path does not exist:
> > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/parse_text
> > >> Input path does not exist:
> > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/crawl_fetch
> > >> Input path does not exist:
> > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/crawl_parse
> > >> Input path does not exist:
> > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/parse_data
> > >> Input path does not exist:
> > >> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/parse_text
> > 
> > > -----------------------------------------------------
> > > Subscribe to the Nimble Books Mailing List  http://eepurl.com/czS- for
> > > monthly updates
> > > 
> > > 
> > > 
> > > On Sat, Oct 8, 2011 at 14:22, lewis john mcgibbney <
> > > 
> > > lewis.mcgibbney@gmail.com> wrote:
> > >> Hi guys,
> > >> 
> > >> I have been watching this thread intently and I am very happy to see
> > 
> > that
> > 
> > >> there is some progress :0)
> > >> 
> > >> Radim,
> > >> 
> > >> Can I ask that you open a JIRA issue and submit a patch, this way we
> > >> can not
> > >> only track it, but it will also give the community a chance to test
> > >> and validate the patch prior to integration into the source.
> > >> 
> > >> Thanks
> > >> 
> > >> Lewis
> > >> 
> > >> On Fri, Oct 7, 2011 at 5:49 PM, Ramanathapuram, Rajesh <
> > >> 
> > >> Rajesh.Ramanathapuram@turner.com> wrote:
> > >> > Hi Radim,
> > >> > 
> > >> >  Thank you so much for this. I am not familiar with commit process
> > >> >  to
> > >> 
> > >> the
> > >> 
> > >> > core.
> > >> > 
> > >> >  Is there someone who can help us get this committed and help
> > >> >  resolve
> > >> 
> > >> this
> > >> 
> > >> > issue?
> > >> > 
> > >> > Thanks for all your help.
> > >> > 
> > >> > Rajesh Ramana
> > >> > 
> > >> > -----Original Message-----
> > >> > From: Radim Kolar [mailto:hsn@sendmail.cz]
> > >> > Sent: Thursday, October 06, 2011 2:18 PM
> > >> > To: user@nutch.apache.org
> > >> > Subject: Re: Nutch not crawling URLs with spanish accented
> > >> > characters
> > 
> > (
> > 
> > >> ñ)
> > >> 
> > >> > - The REGEX normalizer transforms the special characters, but fails
> > >> > to substitute ‘%F1’ or ‘%C3%B1’ for ‘ñ’
> > >> > 
> > >> >  - The fetcher is having trouble interpreting the links with special
> > >> > 
> > >> > character ‘ñ’.
> > >> > 
> > >> > i can add this transformation to basic-url normalizer if somebody is
> > >> > willing to commit it.
> > >> 
> > >> --
> > >> *Lewis*
> > 
> > --
> > *Lewis*

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

Posted by Bai Shen <ba...@gmail.com>.
I'm having a similar issue.  I'm using 1.4 and getting these errors with
linkdb.  The segments seem fine.

2011-10-25 10:10:20,060 INFO  solr.SolrIndexer - SolrIndexer: starting at
2011-10-25 10:10:20
2011-10-25 10:10:20,110 INFO  indexer.IndexerMapReduce - IndexerMapReduce:
crawldb: crawl/crawldb
2011-10-25 10:10:20,110 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment: crawl/linkdb
2011-10-25 10:10:20,136 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment: crawl/segments/20111025095216
2011-10-25 10:10:20,138 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment: crawl/segments/20111025100004
2011-10-25 10:10:20,207 ERROR solr.SolrIndexer -
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
file:/opt/nutch-1.4/runtime/local/crawl/linkdb/crawl_fetch
Input path does not exist:
file:/opt/nutch-1.4/runtime/local/crawl/linkdb/crawl_parse
Input path does not exist:
file:/opt/nutch-1.4/runtime/local/crawl/linkdb/parse_data
Input path does not exist:
file:/opt/nutch-1.4/runtime/local/crawl/linkdb/parse_text


Did something change with 1.4?

On Sun, Oct 9, 2011 at 6:15 AM, lewis john mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Fred,
>
> How many individual directories do you have under
> /runtime/local/crawl/segments/
> ?
>
> Another thing that raises alarms is the nohup.out dir's! Are these
> intentional? Interestingly, missing segment data is not the same with these
> dir's.
>
> Does your log output indicate any discrepancies between various command
> transitions?
>
>
>
> bitnami@ip-10-202-202-68:~/nutch-1.3/nutch-1.3/runtime/local$ bin/nutch
> >> solrindex http://zimzazsearch3-1.bitnamiapp.com:8983/solr/ crawl/crawldb
> >> crawl/linkdb crawl/segments/*
> >> SolrIndexer: starting at 2011-10-09 00:13:24
> >> org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist:
> >>
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_fetch
> >> Input path does not exist:
> >>
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_parse
> >> Input path does not exist:
> >>
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_data
> >> Input path does not exist:
> >>
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_text
> >> Input path does not exist:
> >>
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_fetch
> >> Input path does not exist:
> >>
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_parse
> >> Input path does not exist:
> >>
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/parse_data
> >> Input path does not exist:
> >>
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/parse_text
> >> Input path does not exist:
> >>
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/crawl_parse
> >> Input path does not exist:
> >>
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/parse_data
> >> Input path does not exist:
> >>
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/parse_text
> >> Input path does not exist:
> >>
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/crawl_fetch
> >> Input path does not exist:
> >>
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/crawl_parse
> >> Input path does not exist:
> >>
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/parse_data
> >> Input path does not exist:
> >>
> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/parse_text
> >
> >
> >
> > -----------------------------------------------------
> > Subscribe to the Nimble Books Mailing List  http://eepurl.com/czS- for
> > monthly updates
> >
> >
> >
> > On Sat, Oct 8, 2011 at 14:22, lewis john mcgibbney <
> > lewis.mcgibbney@gmail.com> wrote:
> >
> >> Hi guys,
> >>
> >> I have been watching this thread intently and I am very happy to see
> that
> >> there is some progress :0)
> >>
> >> Radim,
> >>
> >> Can I ask that you open a JIRA issue and submit a patch, this way we can
> >> not
> >> only track it, but it will also give the community a chance to test and
> >> validate the patch prior to integration into the source.
> >>
> >> Thanks
> >>
> >> Lewis
> >>
> >> On Fri, Oct 7, 2011 at 5:49 PM, Ramanathapuram, Rajesh <
> >> Rajesh.Ramanathapuram@turner.com> wrote:
> >>
> >> > Hi Radim,
> >> >
> >> >  Thank you so much for this. I am not familiar with commit process to
> >> the
> >> > core.
> >> >  Is there someone who can help us get this committed and help resolve
> >> this
> >> > issue?
> >> >
> >> > Thanks for all your help.
> >> >
> >> > Rajesh Ramana
> >> >
> >> > -----Original Message-----
> >> > From: Radim Kolar [mailto:hsn@sendmail.cz]
> >> > Sent: Thursday, October 06, 2011 2:18 PM
> >> > To: user@nutch.apache.org
> >> > Subject: Re: Nutch not crawling URLs with spanish accented characters
> (
> >> ñ)
> >> >
> >> > - The REGEX normalizer transforms the special characters, but fails to
> >> > substitute ‘%F1’ or ‘%C3%B1’ for ‘ñ’
> >> >  - The fetcher is having trouble interpreting the links with special
> >> > character ‘ñ’.
> >> >
> >> > i can add this transformation to basic-url normalizer if somebody is
> >> > willing to commit it.
> >> >
> >>
> >>
> >>
> >> --
> >> *Lewis*
> >>
> >
> >
>
>
> --
> *Lewis*
>

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Fred,

How many individual directories do you have under
/runtime/local/crawl/segments/
?

Another thing that raises alarms is the nohup.out dirs! Are these
intentional? Interestingly, the missing segment data is not the same for
these dirs.

Does your log output indicate any discrepancies between various command
transitions?
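
A quick sanity check is to list what is actually under crawl/segments/; anything there that is not a segment directory (nohup.out, for example) will be picked up by the segments/* glob and produce the same kind of "Input path does not exist" errors:

ls -l crawl/segments/
for s in crawl/segments/*/; do echo "== $s"; ls "$s"; done   # each fetched and parsed segment should contain crawl_fetch, crawl_parse, parse_data, parse_text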



bitnami@ip-10-202-202-68:~/nutch-1.3/nutch-1.3/runtime/local$ bin/nutch
>> solrindex http://zimzazsearch3-1.bitnamiapp.com:8983/solr/ crawl/crawldb
>> crawl/linkdb crawl/segments/*
>> SolrIndexer: starting at 2011-10-09 00:13:24
>> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
>> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_fetch
>> Input path does not exist:
>> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_parse
>> Input path does not exist:
>> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_data
>> Input path does not exist:
>> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_text
>> Input path does not exist:
>> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_fetch
>> Input path does not exist:
>> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_parse
>> Input path does not exist:
>> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/parse_data
>> Input path does not exist:
>> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/parse_text
>> Input path does not exist:
>> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/crawl_parse
>> Input path does not exist:
>> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/parse_data
>> Input path does not exist:
>> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/parse_text
>> Input path does not exist:
>> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/crawl_fetch
>> Input path does not exist:
>> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/crawl_parse
>> Input path does not exist:
>> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/parse_data
>> Input path does not exist:
>> file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/nohup.out/parse_text
>
>
>
> -----------------------------------------------------
> Subscribe to the Nimble Books Mailing List  http://eepurl.com/czS- for
> monthly updates
>
>
>
> On Sat, Oct 8, 2011 at 14:22, lewis john mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
>> Hi guys,
>>
>> I have been watching this thread intently and I am very happy to see that
>> there is some progress :0)
>>
>> Radim,
>>
>> Can I ask that you open a JIRA issue and submit a patch, this way we can
>> not
>> only track it, but it will also give the community a chance to test and
>> validate the patch prior to integration into the source.
>>
>> Thanks
>>
>> Lewis
>>
>> On Fri, Oct 7, 2011 at 5:49 PM, Ramanathapuram, Rajesh <
>> Rajesh.Ramanathapuram@turner.com> wrote:
>>
>> > Hi Radim,
>> >
>> >  Thank you so much for this. I am not familiar with commit process to
>> the
>> > core.
>> >  Is there someone who can help us get this committed and help resolve
>> this
>> > issue?
>> >
>> > Thanks for all your help.
>> >
>> > Rajesh Ramana
>> >
>> > -----Original Message-----
>> > From: Radim Kolar [mailto:hsn@sendmail.cz]
>> > Sent: Thursday, October 06, 2011 2:18 PM
>> > To: user@nutch.apache.org
>> > Subject: Re: Nutch not crawling URLs with spanish accented characters (
>> ñ)
>> >
>> > - The REGEX normalizer transforms the special characters, but fails to
>> > substitute ‘%F1’ or ‘%C3%B1’ for ‘ñ’
>> >  - The fetcher is having trouble interpreting the links with special
>> > character ‘ñ’.
>> >
>> > i can add this transformation to basic-url normalizer if somebody is
>> > willing to commit it.
>> >
>>
>>
>>
>> --
>> *Lewis*
>>
>
>


-- 
*Lewis*