Posted to user@nutch.apache.org by Nes Yarug <ne...@gmail.com> on 2007/02/01 12:48:48 UTC

Re: New to Nutch, a few questions

Okay, thanks for that. I have updated my configuration and I will now
re-index the site. I'll let you know how it goes.

Many thanks,
Nes

On 1/31/07, Renaud Richardet <re...@oslutions.com> wrote:
>
> As Zaheed pointed out, "You need to activate index-more and query-more
> plugin in nutch-site.xml"
>
> So, copy the entry "plugin.includes" from nutch-default.xml, add
> index-more and query-lang, and insert it in your nutch-site.xml. You
> should have something like this:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|more)|query-(basic|site|url|lang)|summary-basic|scoring-opic</value>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins.
>   </description>
> </property>
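
A quick way to sanity-check a plugin.includes value before re-indexing is to
test plugin directory names against it. The following is a minimal Python
sketch, not Nutch code. It assumes Nutch requires the whole directory name to
match the expression, and the helper name is_included is made up for
illustration:

```python
import re

# Value copied from the plugin.includes property quoted above.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|more)"
    "|query-(basic|site|url|lang)|summary-basic|scoring-opic"
)


def is_included(plugin_id: str, pattern: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole plugin id matches the pattern.

    Assumption: Nutch matches the full directory name against the
    expression; re.fullmatch mirrors that behaviour here.
    """
    return re.fullmatch(pattern, plugin_id) is not None


if __name__ == "__main__":
    # Names mentioned in this thread; adjust to your own plugin directories.
    for name in ("index-basic", "index-more", "query-lang", "query-more"):
        print(f"{name}: {is_included(name)}")
```

With the value above, index-more and query-lang are matched while query-more
is not, so it is worth comparing the names in the expression against the
plugin directories actually present in your install.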
>
> HTH,
> Renaud
>
>
> Nes Yarug wrote:
> > Oops, my previous post should read "I have NOT explicitly activated
> > those plugins"
> >
> > On 1/31/07, Nes Yarug <ne...@gmail.com> wrote:
> >>
> >> I have explicitly activated those plugins. Could you show me how to do
> >> that with an example? I looked through conf/nutch-default.xml and
> >> couldn't find any references to them. I'm using 0.8.1, by the way. I
> >> guess they are enabled in the build, as default.properties lists them:
> >>
> >> #
> >> # Indexing Filter Plugins
> >> #
> >> plugins.index=\
> >>    org.apache.nutch.indexer.basic*:\
> >>    org.apache.nutch.indexer.more*
> >>
> >> #
> >> # Query Filter Plugins
> >> #
> >> plugins.query=\
> >>    org.apache.nutch.searcher.basic*:\
> >>    org.apache.nutch.searcher.more*:\
> >>    org.apache.nutch.searcher.site*:\
> >>    org.apache.nutch.searcher.url*
> >>
> >> Many thanks,
> >> Nes
> >>
> >> On 1/31/07, Zaheed Haque <za...@gmail.com> wrote:
> >> >
> >> > If you haven't yet, you need to activate the index-more and
> >> > query-more plugins in nutch-site.xml.
> >> >
> >> > You can also check the "explain" link on the search results page:
> >> > "lang" will be missing there if you haven't activated the index-more
> >> > and query-more plugins.
> >> >
> >> > Cheers
> >> >
> >> > On 1/31/07, Nes Yarug <ne...@gmail.com> wrote:
> >> > > Thank you everyone for your replies.
> >> > >
> >> > > I have implemented the recrawl script from
> >> > > http://wiki.apache.org/nutch/IntranetRecrawl and it has been running
> >> > > for over 12 hours, so I guess it will index many more pages.
> >> > >
> >> > > That leaves the question about language-specific search. I have tried
> >> > > adding the lang: clause to my search query by appending lang:en, but
> >> > > that is not returning any results (as if lang:en becomes part of the
> >> > > actual query). The URL then looks like this:
> >> > > search.jsp?query=help+lang%3Aen&hitsPerPage=10&lang=en
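
To make it easier to see what ends up in each request parameter, the URL
quoted above can be rebuilt with Python's standard library. This is only an
illustrative sketch: build_search_url is a made-up helper, and the parameter
names are simply the ones visible in that URL.

```python
from urllib.parse import urlencode


def build_search_url(terms: str, language: str, hits_per_page: int = 10) -> str:
    """Build a search.jsp URL with a lang: clause appended to the query."""
    params = {
        # The lang: clause travels inside the query string itself ...
        "query": f"{terms} lang:{language}",
        "hitsPerPage": hits_per_page,
        # ... while "lang" is a separate request parameter.
        "lang": language,
    }
    return "search.jsp?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("help", "en"))
```

The colon in lang:en is percent-encoded as %3A, so the encoded URL is
consistent with a lang: clause having been typed into the query box.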
> >> > >
> >> > > Has anyone used a language-specific search before? Do I need to add a
> >> > > new (hidden) input field on the search form to specify the language
> >> > > instead of appending it to the query? That would be my preference
> >> > > anyway, as I want the language-specific search to be transparent to
> >> > > the user.
> >> > >
> >> > > Again, many thanks for any replies,
> >> > > Nes
> >> > >
> >> > > On 1/30/07, Renaud Richardet <re...@oslutions.com> wrote:
> >> > > >
> >> > > > Nes Yarug wrote:
> >> > > > > Hi all,
> >> > > > >
> >> > > > > I'm new to Nutch and I have a few questions that I hope to get some
> >> > > > > answers to. Thanks in advance for any replies.
> >> > > > >
> >> > > > > I want to use Nutch to index a web site I'm maintaining. I've followed
> >> > > > > the tutorial for intranet crawling and used a list of links (17420
> >> > > > > links to 8710 pages, each page has two unique links) from my site to
> >> > > > > crawl initially.
> >> > > > Actually, you don't need to provide a full list of links to Nutch. You
> >> > > > can let it discover links as it crawls your site, and constrain them
> >> > > > using crawl-urlfilter.txt and regex-urlfilter.txt.
> >> > > > > The command I used was:
> >> > > > >
> >> > > > > bin/nutch crawl urls -dir crawl -depth 20 -topN 100
> >> > > > >
> >> > > > > The crawl completed, but when I tested the search I could see that it
> >> > > > > had not indexed a lot of pages. From the following output, I
> >> > > > > understand that it only indexed 1527 of 21378 pages:
> >> > > > >
> >> > > > > CrawlDb statistics start: crawl/crawldb
> >> > > > > Statistics for CrawlDb: crawl/crawldb
> >> > > > > TOTAL urls:     21378
> >> > > > > retry 0:        20878
> >> > > > > retry 1:        487
> >> > > > > retry 2:        10
> >> > > > > retry 3:        3
> >> > > > > min score:      0.014
> >> > > > > avg score:       84.405266
> >> > > > > max score:      37106.03
> >> > > > > status 1 (DB_unfetched):        19848
> >> > > > > status 2 (DB_fetched):  1527
> >> > > > > status 3 (DB_gone):     3
> >> > > > > CrawlDb statistics: done
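
For reference, the fetched share can be worked out directly from the status
lines in the statistics above. A small Python sketch follows; the sample text
is copied from that output, and parse_counts is a hypothetical helper, not
part of Nutch:

```python
import re

# Lines copied from the CrawlDb statistics quoted above.
STATS = """\
TOTAL urls:     21378
retry 0:        20878
status 1 (DB_unfetched):        19848
status 2 (DB_fetched):  1527
status 3 (DB_gone):     3
"""

LINE = re.compile(r"\s*(?:TOTAL urls|status \d+ \((\w+)\)):\s+(\d+)\s*$")


def parse_counts(text: str) -> dict:
    """Collect the total and the per-status URL counts; skip other lines."""
    counts = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            counts[match.group(1) or "TOTAL"] = int(match.group(2))
    return counts


if __name__ == "__main__":
    counts = parse_counts(STATS)
    share = counts["DB_fetched"] / counts["TOTAL"]
    print(f"fetched {counts['DB_fetched']} of {counts['TOTAL']} ({share:.1%})")
```

The three status counts add up to the total, and the fetched pages make up
roughly 7% of the URLs known to the CrawlDb.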
> >> > > > >
> >> > > > >
> >> > > > > Now my questions:
> >> > > > >
> >> > > > > 1) Will Nutch automatically continue to index the rest of the URLs
> >> > > > > even though the initial crawl finished (through some internal
> >> > > > > scheduler of some sort)?
> >> > > > You will need to refetch, or better, increase the depth until "all
> >> > > > your pages" are fetched.
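
On the depth advice above, a bit of arithmetic shows why the original command
could not have fetched everything. The sketch below assumes that -depth is the
number of generate/fetch rounds and that -topN caps the number of URLs
selected per round; the function names are made up for illustration.

```python
import math


def max_fetched(depth: int, top_n: int) -> int:
    """Upper bound on pages fetched: one batch of at most top_n per round."""
    return depth * top_n


def rounds_needed(total_urls: int, top_n: int) -> int:
    """Minimum number of rounds needed to attempt every known URL once."""
    return math.ceil(total_urls / top_n)


if __name__ == "__main__":
    # Values from the crawl command and CrawlDb statistics in this thread.
    print(max_fetched(depth=20, top_n=100))
    print(rounds_needed(total_urls=21378, top_n=100))
```

Under those assumptions, -depth 20 -topN 100 allows at most 2000 fetches,
which fits the 1527 fetched pages reported, and covering all 21378 known URLs
at -topN 100 would take at least 214 rounds. Raising -topN may therefore be
as relevant as raising -depth.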
> >> > > > >
> >> > > > > 2) All of my site's pages at the moment are available in two languages
> >> > > > > (each page exists in exactly two languages, and the lang attribute on
> >> > > > > the html tag of each page contains the language identifier). When
> >> > > > > searching, is there a way to only return pages in a specific language?
> >> > > > > I know the Nutch UI is localised, but it will still return pages in
> >> > > > > English if my UI language is German, for example. I want it to return
> >> > > > > German pages only (<html lang="de">) when searching through the German
> >> > > > > UI. Is that possible?
> >> > > > Try using "lang:" in your query. I'm not sure it's working, though...
> >> > > > From the javadoc: "LanguageQueryFilter.java should handle "lang:"
> >> > > > query clauses, causing them to search the "lang" field indexed by
> >> > > > LanguageIdentifier" (see also LanguageIndexingFilter.java).
> >> > > >
> >> > > > HTH,
> >> > > > Renaud
> >> > > >
> >> > > >
> >> > > > --
> >> > > > renaud richardet                           +1 617 230 9112
> >> > > > renaud <at> oslutions.com         http://www.oslutions.com
> >> > > >
>
> --
> renaud richardet                           +1 617 230 9112
> renaud <at> oslutions.com         http://www.oslutions.com
>
>