Posted to user@nutch.apache.org by tracy nicol <su...@shiftdirector.com> on 2013/08/22 15:40:57 UTC

Nutch & Solr empty but no error messages

Hi, newbie here. I'm not seeing any results in Solr after what looks like a
successful crawl. The seed URL list is full and the regex filter is wide open
as .+ but nothing shows up. I'm stumped, so I've put a log up on
Pastebin <http://pastebin.com/BUyFai0u>. Can you please tell me where I've
gone wrong?


Thanks,
/G
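
A quick way to confirm whether anything reached Solr at all is to ask it for a document count. The sketch below is only an illustration: the base URL http://localhost:8983/solr is an assumption (the thread never gives the Solr address, and some setups need a core name in the path), and the helper names are made up for this example. It uses the standard select handler with q=*:* and rows=0 and reads numFound from the JSON response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def count_query_url(solr_base: str) -> str:
    """Build a select URL that matches every document but returns none of them."""
    params = urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    return solr_base.rstrip("/") + "/select?" + params


def num_found(response_body: str) -> int:
    """Extract the total hit count from a Solr JSON response body."""
    return json.loads(response_body)["response"]["numFound"]


def count_documents(solr_base: str) -> int:
    """Query a running Solr instance (needs network access to solr_base)."""
    with urlopen(count_query_url(solr_base)) as response:
        return num_found(response.read().decode("utf-8"))


# Hypothetical usage; the address is an assumption, not taken from the thread:
# print(count_documents("http://localhost:8983/solr"))
```

A count of 0 after a crawl suggests the problem lies before the indexing step rather than in Solr itself.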

Re: Nutch & Solr empty but no error messages

Posted by Ahmet Emre Aladağ <em...@agmlab.com>.
Nutch 2.1 had MySQL support, but as far as I remember it was abandoned in Nutch 2.2. So you could try dotcloud's MySQL with Nutch 2.1.




----- Original Message -----
From: "tracy nicol" <su...@shiftdirector.com>
To: user@nutch.apache.org
Sent: Friday, 23 August 2013 23:25:31
Subject: Re: Nutch & Solr empty but no error messages

I figured out HBase wasn't optional with Nutch 2.x and spent the day trying
to get it running.

I think I've hit a dead end because ZooKeeper, and hence HBase and others, have
particular /etc/hosts requirements that can't be met on the dotcloud PaaS.

I'm now looking into Nutch with HSQLDB. Any success stories or pointers?

Thanks



Re: Nutch & Solr empty but no error messages

Posted by tracy nicol <su...@shiftdirector.com>.
I figured out HBase wasn't optional with Nutch 2.x and spent the day trying
to get it running.

I think I've hit a dead end because ZooKeeper, and hence HBase and others, have
particular /etc/hosts requirements that can't be met on the dotcloud PaaS.

I'm now looking into Nutch with HSQLDB. Any success stories or pointers?

Thanks
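
The post does not say which /etc/hosts requirements are meant. One commonly reported one is that the machine's hostname should not resolve to a loopback address such as 127.0.1.1; treating that as the issue here is an assumption. Under that assumption, a minimal check could look like this (the function names are made up for the example):

```python
import ipaddress
import socket
from typing import Optional


def is_loopback(address: str) -> bool:
    """Return True if the address is in a loopback range (e.g. 127.0.0.0/8)."""
    return ipaddress.ip_address(address).is_loopback


def check_hostname(name: Optional[str] = None) -> str:
    """Describe what the given hostname (default: this machine's) resolves to."""
    host = name or socket.gethostname()
    try:
        address = socket.gethostbyname(host)
    except socket.gaierror as error:
        return f"{host}: does not resolve ({error})"
    kind = "loopback" if is_loopback(address) else "non-loopback"
    return f"{host} -> {address} ({kind})"


# Hypothetical usage on the machine that would run HBase:
# print(check_hostname())
```

If the hostname does resolve to a loopback address and /etc/hosts cannot be edited, as on the PaaS described above, that would be consistent with the dead end reported here.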


On 23 August 2013 01:00, Lewis John Mcgibbney <le...@gmail.com> wrote:

> Hi Tracy,
> Logs are always your friend.
> Take it step by step [0], look at your logs and read the web db after every
> step to see what's going on.
> hth
> Lewis
>
> [0]
>
> http://wiki.apache.org/nutch/NutchTutorial#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling

Re: Nutch & Solr empty but no error messages

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Tracy,
Logs are always your friend.
Take it step by step [0], look at your logs and read the web db after every
step to see what's going on.
hth
Lewis

[0]
http://wiki.apache.org/nutch/NutchTutorial#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling
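
As one way of reading the logs after a step, the two InjectorJob counter lines quoted elsewhere in this thread can be pulled out of a log programmatically. This is only a sketch: the patterns are written against those two lines as they appear here, and other Nutch versions may word them differently.

```python
import re
from typing import Dict

# Patterns for the two InjectorJob counter lines quoted in this thread.
# \s+ before "filtering" tolerates the line wrap seen in the quoted copies.
COUNTERS = {
    "rejected": re.compile(
        r"InjectorJob: total number of urls rejected by filters:\s*(\d+)"
    ),
    "injected": re.compile(
        r"InjectorJob: total number of urls injected after normalization "
        r"and\s+filtering:\s*(\d+)"
    ),
}


def injector_counts(log_text: str) -> Dict[str, int]:
    """Return whichever InjectorJob counters are present in the log text."""
    counts = {}
    for name, pattern in COUNTERS.items():
        match = pattern.search(log_text)
        if match:
            counts[name] = int(match.group(1))
    return counts


sample = (
    "InjectorJob: total number of urls rejected by filters: 0\n"
    "InjectorJob: total number of urls injected after normalization and "
    "filtering: 2424\n"
)
print(injector_counts(sample))  # {'rejected': 0, 'injected': 2424}
```

Comparing these numbers with the size of the seed list shows immediately whether the URL filters dropped anything at the inject step.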


On Thu, Aug 22, 2013 at 1:44 PM, tracy nicol <su...@shiftdirector.com> wrote:

> Thanks, but still no joy. I've reduced the URL list to 4 simple URLs and
> changed the regex filter as suggested.
> I've checked parsechecker and indexchecker; the results below look OK. I don't
> know where to look next.
>
> Thank you.
>
> $ ./nutch parsechecker -dumpText http://www.ru.ac.za/
> fetching: http://www.ru.ac.za/
> parsing: http://www.ru.ac.za/
> contentType: text/html
> signature: 0cf33ede0bc75e70043c5632f3a4f443
> ---------
> Url
> ---------------
>
> http://www.ru.ac.za/
> ---------
> Metadata
> ---------
>
> ---------
> ParseText
> ---------
>
> Rhodes University News Perspective Digital Publications Virtual Campus
> Gallery Intranet   >Temp xx°C • Wind x x,
> <SNIP>
> s University   |   P.O. Box 94, Grahamstown 6140, South Africa Tel: +27 46
> 603 8111   |   Fax: +27 46 603 7350   |   Email: registrar@ru.ac.za Email:
> communications@ru.ac.za   |   Terms & Conditions   |   PAIA   |   Powered
> by  TERMINALFOUR Edit this page
>
>
>
> $ ./nutch indexchecker http://www.ru.ac.za/
> fetching: http://www.ru.ac.za/
> parsing: http://www.ru.ac.za/
> contentType: text/html
> content : Rhodes University News Perspective Digital Publications Virtual
> Campus Gallery Intranet   >Temp xx°C
> title : Rhodes University
> host : www.ru.ac.za
> tstamp : 2013-08-22T20:41:10.038Z
> url : http://www.ru.ac.za/



-- 
*Lewis*

Re: Nutch & Solr empty but no error messages

Posted by tracy nicol <su...@shiftdirector.com>.
Thanks, but still no joy. I've reduced the URL list to 4 simple URLs and
changed the regex filter as suggested.
I've checked parsechecker and indexchecker; the results below look OK. I don't
know where to look next.

Thank you.

$ ./nutch parsechecker -dumpText http://www.ru.ac.za/
fetching: http://www.ru.ac.za/
parsing: http://www.ru.ac.za/
contentType: text/html
signature: 0cf33ede0bc75e70043c5632f3a4f443
---------
Url
---------------

http://www.ru.ac.za/
---------
Metadata
---------

---------
ParseText
---------

Rhodes University News Perspective Digital Publications Virtual Campus
Gallery Intranet   >Temp xx°C • Wind x x,
<SNIP>
s University   |   P.O. Box 94, Grahamstown 6140, South Africa Tel: +27 46
603 8111   |   Fax: +27 46 603 7350   |   Email: registrar@ru.ac.za Email:
communications@ru.ac.za   |   Terms & Conditions   |   PAIA   |   Powered
by  TERMINALFOUR Edit this page



$ ./nutch indexchecker http://www.ru.ac.za/
fetching: http://www.ru.ac.za/
parsing: http://www.ru.ac.za/
contentType: text/html
content : Rhodes University News Perspective Digital Publications Virtual
Campus Gallery Intranet   >Temp xx°C
title : Rhodes University
host : www.ru.ac.za
tstamp : 2013-08-22T20:41:10.038Z
url : http://www.ru.ac.za/



On 22 August 2013 15:47, Markus Jelsma <ma...@openindex.io> wrote:

> None of the 2424 seed URLs have been injected; they were rejected by the
> filters
>
> InjectorJob: total number of urls rejected by filters: 0
> InjectorJob: total number of urls injected after normalization and
> filtering: 2424
>
> Also, the regex filter .+ is incorrect and should report an error. Try +.
> instead.
>
> Cheers

RE: Nutch & Solr empty but no error messages

Posted by Markus Jelsma <ma...@openindex.io>.
None of the 2424 seed URLs have been injected; they were rejected by the filters

InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 2424

Also, the regex filter .+ is incorrect and should report an error. Try +. instead.

Cheers
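
The difference between .+ and +. comes from the rule format: in Nutch's regex URL filter file, each rule line starts with + (accept) or - (reject), followed by a regular expression, and the first rule that matches a URL decides. The sketch below imitates that behaviour in Python to show why +. accepts everything while .+ is not a valid rule. It is an approximation for illustration, not Nutch's own code; the function names are made up, and Python's regex dialect differs from Java's in some details.

```python
import re
from typing import List, Tuple

Rule = Tuple[bool, re.Pattern]


def parse_rules(text: str) -> List[Rule]:
    """Parse rule lines of the form '+regex' (accept) or '-regex' (reject)."""
    rules = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(
                f"line {number}: rule must start with '+' or '-': {line!r}"
            )
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules: List[Rule], url: str) -> bool:
    """First matching rule decides; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = parse_rules("+.")
print(accepts(rules, "http://www.ru.ac.za/"))  # True

try:
    parse_rules(".+")
except ValueError as error:
    print("invalid rule:", error)
```

Read this way, +. means "accept any URL containing at least one character", whereas .+ has no leading sign at all.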
 
 
-----Original message-----
> From: tracy nicol <su...@shiftdirector.com>
> Sent: Thursday 22nd August 2013 15:41
> To: user@nutch.apache.org
> Subject: Nutch & Solr empty but no error messages
>
> Hi, newbie here. I'm not seeing any results in Solr after what looks like a
> successful crawl. The seed URL list is full and the regex filter is wide open
> as .+ but nothing shows up. I'm stumped, so I've put a log up on
> Pastebin <http://pastebin.com/BUyFai0u>. Can you please tell me where I've
> gone wrong?
> 
> 
> Thanks,
> /G
>