Posted to user@nutch.apache.org by tracy nicol <su...@shiftdirector.com> on 2013/08/22 15:40:57 UTC
Nutch & Solr empty but no error messages
Hi, newbie here. I'm not seeing any results in Solr after what looks like a
successful crawl. The seed URL list is full, the regex is wide open as .+
and still nothing. I'm stumped, so I put a log up on
Pastebin <http://pastebin.com/BUyFai0u>. Can you please tell me where I've
gone wrong?
Thanks,
/G
Re: Nutch & Solr empty but no error messages
Posted by Ahmet Emre Aladağ <em...@agmlab.com>.
Nutch 2.1 once had MySQL support, but as far as I remember it was abandoned in Nutch 2.2. So you could try dotcloud's MySQL with Nutch 2.1.
----- Original Message -----
From: "tracy nicol" <su...@shiftdirector.com>
To: user@nutch.apache.org
Sent: Friday, 23 August 2013 23:25:31
Subject: Re: Nutch & Solr empty but no error messages
I figured out HBase wasn't optional with Nutch 2.x and spent the day trying
to get it running.
I think I've hit a dead end because ZooKeeper, and hence HBase and others, have
particular /etc/hosts requirements that can't be met on the dotcloud PaaS.
I'm now looking into Nutch with HSQLDB. Any success stories or pointers?
Thanks
Re: Nutch & Solr empty but no error messages
Posted by tracy nicol <su...@shiftdirector.com>.
I figured out HBase wasn't optional with Nutch 2.x and spent the day trying
to get it running.
I think I've hit a dead end because ZooKeeper, and hence HBase and others, have
particular /etc/hosts requirements that can't be met on the dotcloud PaaS.
I'm now looking into Nutch with HSQLDB. Any success stories or pointers?
Thanks
On 23 August 2013 01:00, Lewis John Mcgibbney <le...@gmail.com> wrote:
> Hi Tracy,
> Logs are always your friend.
> Take it step by step [0], look at your logs and read the web db after every
> step to see what's going on.
> hth
> Lewis
>
> [0]
>
> http://wiki.apache.org/nutch/NutchTutorial#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling
Re: Nutch & Solr empty but no error messages
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Tracy,
Logs are always your friend.
Take it step by step [0], look at your logs and read the web db after every
step to see what's going on.
hth
Lewis
[0]
http://wiki.apache.org/nutch/NutchTutorial#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling
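The step-by-step approach can be sketched as a small script that prints web db statistics after every step. This is only a sketch under assumptions: the subcommands and flags (`inject`, `generate -topN`, `fetch -all`, `parse -all`, `updatedb`, `readdb -stats`) follow my reading of the Nutch 2.x tutorial and may differ in your version, `urls` is an assumed seed directory, and `step` and `crawl_once` are hypothetical helper names, not part of Nutch.

```shell
#!/bin/sh
# Sketch only: subcommands and flags are assumptions based on the Nutch 2.x
# tutorial; compare them with the usage output of your own nutch script.
NUTCH=${NUTCH:-./nutch}

# Hypothetical helper: run one crawl step, then print web db statistics
# so the effect of that step can be inspected before moving on.
step() {
    echo "== $* =="
    "$NUTCH" "$@" || { echo "step failed: $*" >&2; return 1; }
    "$NUTCH" readdb -stats
}

# Hypothetical helper: one crawl round, stopping at the first failing step.
# "urls" is an assumed seed directory name.
crawl_once() {
    step inject urls &&
    step generate -topN 10 &&
    step fetch -all &&
    step parse -all &&
    step updatedb
}
```

Call `crawl_once` from the directory that holds the nutch script, and compare the statistics printed after `inject` with the number of seed URLs before looking at the later steps or at Solr.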
On Thu, Aug 22, 2013 at 1:44 PM, tracy nicol <su...@shiftdirector.com> wrote:
> Thanks but still no joy. I've reduced the URL list to 4 simple URLS and
> changed the regex filter as suggested.
> I've checked parseChecker and indexChecker, results below look OK. I don't
> know where to look next?
Re: Nutch & Solr empty but no error messages
Posted by tracy nicol <su...@shiftdirector.com>.
Thanks, but still no joy. I've reduced the URL list to 4 simple URLs and
changed the regex filter as suggested.
I've run parsechecker and indexchecker, and the results below look OK. I don't
know where to look next.
Thank you.
$ ./nutch parsechecker -dumpText http://www.ru.ac.za/
fetching: http://www.ru.ac.za/
parsing: http://www.ru.ac.za/
contentType: text/html
signature: 0cf33ede0bc75e70043c5632f3a4f443
---------
Url
---------------
http://www.ru.ac.za/
---------
Metadata
---------
---------
ParseText
---------
Rhodes University News Perspective Digital Publications Virtual Campus
Gallery Intranet >Temp xx°C • Wind x x,
<SNIP>
s University | P.O. Box 94, Grahamstown 6140, South Africa Tel: +27 46
603 8111 | Fax: +27 46 603 7350 | Email: registrar@ru.ac.za Email:
communications@ru.ac.za | Terms & Conditions | PAIA | Powered
by TERMINALFOUR Edit this page
$ ./nutch indexchecker http://www.ru.ac.za/
fetching: http://www.ru.ac.za/
parsing: http://www.ru.ac.za/
contentType: text/html
content : Rhodes University News Perspective Digital Publications Virtual
Campus Gallery Intranet >Temp xx°C
title : Rhodes University
host : www.ru.ac.za
tstamp : 2013-08-22T20:41:10.038Z
url : http://www.ru.ac.za/
On 22 August 2013 15:47, Markus Jelsma <ma...@openindex.io> wrote:
> None of the 2424 seed URLs have been injected, they were rejected by the
> filters
>
> InjectorJob: total number of urls rejected by filters: 0
> InjectorJob: total number of urls injected after normalization and
> filtering: 2424
>
> Also, the regex filter .+ is incorrect and should report an error. Try +.
> instead.
>
> Cheers
RE: Nutch & Solr empty but no error messages
Posted by Markus Jelsma <ma...@openindex.io>.
None of the 2424 seed URLs have been injected; they were rejected by the filters:
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 2424
Also, the regex filter .+ is incorrect and should report an error. Try +. instead.
Cheers
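To illustrate the point about rule syntax: as far as I know, every rule line in conf/regex-urlfilter.txt has to start with `+` (accept) or `-` (reject), followed by the regex, so `+.` accepts any URL while a bare `.+` has no sign at all. Below is a minimal pre-flight check under that assumption. `check_rules` is a hypothetical helper, not part of Nutch, and the temporary file stands in for the real filter file.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): print every rule line that does
# not start with '+' or '-'. Blank lines and '#' comments are skipped.
# Exit status is 0 when at least one bad line was found, 1 otherwise.
check_rules() {
    grep -v -e '^[[:space:]]*$' -e '^#' "$1" | grep -v -e '^[+-]'
}

# Example input: one rule without a sign and one valid accept-all rule.
rules=$(mktemp)
printf '%s\n' '# accept everything' '.+' '+.' > "$rules"

if bad=$(check_rules "$rules"); then
    echo "invalid rule(s): $bad"
else
    echo "all rules have a + or - prefix"
fi
rm -f "$rules"
```

With the example input this prints `invalid rule(s): .+`. Pointing `check_rules` at the real filter file is a quick way to rule out a malformed line before running the inject step again.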