Posted to user@nutch.apache.org by Sznajder ForMailingList <bs...@gmail.com> on 2013/07/01 00:00:28 UTC
Re: Crawl in Nutch2.2

Sorry, it did not help...

After 2 iterations, there is still only one URL in the DB...

Benjamin


On Sun, Jun 30, 2013 at 9:20 PM, kiran chitturi
<ch...@gmail.com> wrote:

> Hi Sznajder,
>
> Please see the example in the 1.x tutorial here (
> https://wiki.apache.org/nutch/NutchTutorial#Steps). The 3rd step shows
> how to configure the regex URL filter for crawling websites.
>
>
>
>
> On Sun, Jun 30, 2013 at 10:15 AM, Sznajder ForMailingList <
> bs4mailinglist@gmail.com> wrote:
>
> > Thanks for your help.
> >
> > I am copying here the content.
> >
> > # Licensed to the Apache Software Foundation (ASF) under one or more
> > # contributor license agreements.  See the NOTICE file distributed with
> > # this work for additional information regarding copyright ownership.
> > # The ASF licenses this file to You under the Apache License, Version 2.0
> > # (the "License"); you may not use this file except in compliance with
> > # the License.  You may obtain a copy of the License at
> > #
> > #     http://www.apache.org/licenses/LICENSE-2.0
> > #
> > # Unless required by applicable law or agreed to in writing, software
> > # distributed under the License is distributed on an "AS IS" BASIS,
> > # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > # See the License for the specific language governing permissions and
> > # limitations under the License.
> >
> >
> > # The default url filter.
> > # Better for whole-internet crawling.
> >
> > # Each non-comment, non-blank line contains a regular expression
> > # prefixed by '+' or '-'.  The first matching pattern in the file
> > # determines whether a URL is included or ignored.  If no pattern
> > # matches, the URL is ignored.
> >
> > # skip file: ftp: and mailto: urls
> > -^(ftp|mailto):
> >
> > # skip image and other suffixes we can't yet parse
> > # for a more extensive coverage use the urlfilter-suffix plugin
> >
> >
> > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > #-[?*!@=]
> >
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >
> > # accept anything else
> > +.
> >
> >
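A minimal sketch, in Python rather than Nutch's own Java, of the rule semantics stated in the comments of the file above: the first matching pattern decides, and a URL that matches nothing is ignored. The names parse_rules and accepts are hypothetical, the suffix list is shortened for readability, and Python's re.search is assumed to be a close enough stand-in for the matching Nutch applies to these simple patterns.

```python
import re
from typing import List, Tuple

# Shortened copy of the rules quoted above (suffix list trimmed for readability).
RULES_TEXT = r"""
# skip file: ftp: and mailto: urls
-^(ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|css|CSS|js|JS)$
#-[?*!@=]
# accept anything else
+.
"""


def parse_rules(text: str) -> List[Tuple[bool, re.Pattern]]:
    """Turn '+regex' / '-regex' lines into (include, compiled pattern) pairs."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules: List[Tuple[bool, re.Pattern]], url: str) -> bool:
    """First matching rule wins; a URL that matches no rule is ignored."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False


rules = parse_rules(RULES_TEXT)
for url in ("http://en.wikipedia.org/", "http://example.org/logo.png"):
    print(url, "accepted" if accepts(rules, url) else "ignored")
```

Under these assumptions the pasted rules accept ordinary http pages such as the seeds, so the filter alone probably does not explain why no new URLs show up.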
> >
> > On Sun, Jun 30, 2013 at 6:38 PM, h b <hb...@gmail.com> wrote:
> >
> > > What does your conf/regex-urlfilter.txt file contain?
> > > Did you change this file?
> > > On Jun 30, 2013 5:10 AM, "Sznajder ForMailingList" <
> > > bs4mailinglist@gmail.com>
> > > wrote:
> > >
> > > > Thanks a lot for your help
> > > >
> > > > However, I still have not resolved this issue...
> > > >
> > > >
> > > > I attach here the logs after 2 rounds of
> > > > "generate/fetch/parse/updatedb".
> > > >
> > > > The DB still contains only the seed URL, nothing more...
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Jun 27, 2013 at 12:37 AM, Lewis John Mcgibbney <
> > > > lewis.mcgibbney@gmail.com> wrote:
> > > >
> > > >> Try each step with a crawlId and see if this provides you with better
> > > >> results.
> > > >>
> > > >> Unless you truncated all data between Nutch tasks then you should be
> > > >> seeing more data in HBase.
> > > >> As Tejas asked... what do the logs say?
> > > >>
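One way to follow the crawlId suggestion above is to pass the same id to every step of a round. The sketch below only builds the command lines and does not run them; the helper name round_commands and the id "test1" are hypothetical, and the -crawlId flag placement is assumed from the usage text that bin/nutch prints for each Nutch 2.x command, so it is worth checking against your installation.

```python
from typing import List


def round_commands(crawl_id: str, top_n: int) -> List[List[str]]:
    """Build one generate/fetch/parse/updatedb round for a single crawl id."""
    nutch = "bin/nutch"  # assumed to be invoked from NUTCH_HOME
    return [
        [nutch, "generate", "-topN", str(top_n), "-crawlId", crawl_id],
        [nutch, "fetch", "-all", "-crawlId", crawl_id],
        [nutch, "parse", "-all", "-crawlId", crawl_id],
        [nutch, "updatedb", "-crawlId", crawl_id],
    ]


# Print the commands for one round so they can be reviewed before running.
for command in round_commands("test1", 10):
    print(" ".join(command))
```

Each list could be handed to subprocess.run(command, check=True). As far as I understand, the inject and readdb -stats calls would need the same -crawlId value, since otherwise they work against a different table.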
> > > >>
> > > >> On Wed, Jun 26, 2013 at 3:40 AM, Sznajder ForMailingList <
> > > >> bs4mailinglist@gmail.com> wrote:
> > > >>
> > > >> > Hi Lewis,
> > > >> >
> > > >> > Thanks for your reply
> > > >> >
> > > >> > I just set the values:
> > > >> >
> > > >> >  gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
> > > >> >
> > > >> >
> > > >> > I already removed the HBase table in the past. Could that be a cause?
> > > >> >
> > > >> > Benjamin
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Tue, Jun 25, 2013 at 7:34 PM, Lewis John Mcgibbney <
> > > >> > lewis.mcgibbney@gmail.com> wrote:
> > > >> >
> > > >> > > Have you changed from the default MemStore gora storage to something
> > > >> > > else?
> > > >> > >
> > > >> > > On Tuesday, June 25, 2013, Sznajder ForMailingList <
> > > >> > > bs4mailinglist@gmail.com>
> > > >> > > wrote:
> > > >> > > > thanks Tejas
> > > >> > > >
> > > >> > > > Yes, I checked the logs and no error appears in them.
> > > >> > > >
> > > >> > > > I left http.content.limit and parser.html.impl at their default
> > > >> > > > values...
> > > >> > > >
> > > >> > > > Benjamin
> > > >> > > >
> > > >> > > >
> > > >> > > > On Tue, Jun 25, 2013 at 6:14 PM, Tejas Patil <
> > > >> > > > tejas.patil.cs@gmail.com> wrote:
> > > >> > > >
> > > >> > > >> Did you check the logs (NUTCH_HOME/logs/hadoop.log) for any
> > > >> > > >> exception or error messages?
> > > >> > > >> Also you might have a look at these configs in nutch-site.xml
> > > >> > > >> (default values are in nutch-default.xml):
> > > >> > > >> http.content.limit and parser.html.impl
> > > >> > > >>
> > > >> > > >>
> > > >> > > >> On Tue, Jun 25, 2013 at 7:04 AM, Sznajder ForMailingList <
> > > >> > > >> bs4mailinglist@gmail.com> wrote:
> > > >> > > >>
> > > >> > > >> > Hello
> > > >> > > >> >
> > > >> > > >> > I installed Nutch 2.2 on my linux machine.
> > > >> > > >> >
> > > >> > > >> > I defined the seed directory with one file containing:
> > > >> > > >> > http://en.wikipedia.org/
> > > >> > > >> > http://edition.cnn.com/
> > > >> > > >> >
> > > >> > > >> >
> > > >> > > >> > I ran the following:
> > > >> > > >> > sh bin/nutch inject ~/DataExplorerCrawl_gpfs/seed/
> > > >> > > >> >
> > > >> > > >> > After this step:
> > > >> > > >> > the call
> > > >> > > >> > -bash-4.1$ sh bin/nutch readdb -stats
> > > >> > > >> >
> > > >> > > >> > returns
> > > >> > > >> > TOTAL urls:     2
> > > >> > > >> > status 0 (null):        2
> > > >> > > >> > avg score:      1.0
> > > >> > > >> >
> > > >> > > >> >
> > > >> > > >> > Then, I ran the following:
> > > >> > > >> > bin/nutch generate -topN 10
> > > >> > > >> > bin/nutch fetch -all
> > > >> > > >> > bin/nutch parse -all
> > > >> > > >> > bin/nutch updatedb
> > > >> > > >> > bin/nutch generate -topN 1000
> > > >> > > >> > bin/nutch fetch -all
> > > >> > > >> > bin/nutch parse -all
> > > >> > > >> > bin/nutch updatedb
> > > >> > > >> >
> > > >> > > >> >
> > > >> > > >> > However, the stats call after these steps is still:
> > > >> > > >> > the call
> > > >> > > >> > -bash-4.1$ sh bin/nutch readdb -stats
> > > >> > > >> > status 5 (status_redir_perm):   1
> > > >> > > >> > max score:      2.0
> > > >> > > >> > TOTAL urls:     3
> > > >> > > >> > avg score:      1.3333334
> > > >> > > >> >
> > > >> > > >> >
> > > >> > > >> >
> > > >> > > >> > Only 3 URLs?!
> > > >> > > >> > What am I missing?
> > > >> > > >> >
> > > >> > > >> > thanks
> > > >> > > >> >
> > > >> > > >> > Benjamin
> > > >> > > >> >
> > > >> > > >>
> > > >> > > >
> > > >> > >
> > > >> > > --
> > > >> > > *Lewis*
> > > >> > >
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> *Lewis*
> > > >>
> > > >
> > > >
> > >
> >
>
>
>
> --
> Kiran Chitturi
>
> <http://www.linkedin.com/in/kiranchitturi>
>