Posted to dev@nutch.apache.org by Hal Fulton <ru...@gmail.com> on 2007/10/01 18:36:28 UTC

Hits estimation?

I am trying to figure out how the "estimation" of the number
of hits works... I have poked around in Hits.java and that
neighborhood, with no insights...

Can anyone assist?

Thanks,
Hal Fulton

Re: Hits estimation?

Posted by Sagar Vibhute <vi...@gmail.com>.
Hi,

I believe you are looking for something entirely different. I had assumed
you only wanted the counts of the terms in a completed crawl.
Anyway, here is the command I was referring to:

nutch org.apache.nutch.indexer.HighFreqTerms ./nutch_crawl/index

where 'nutch_crawl' is the directory in which the crawl results were stored
when I ran the crawl. I am including a sample result as well (below).

----------------------------------------------------------------------------------------
nutch org.apache.nutch.indexer.HighFreqTerms ./nutch_crawl/index

content:into 126
content:some 128
content:our 128
content:nutch 128
content:changes 128
content:for-the 128
content:list 129
url:nutch 129
content:has 130
content:last 130
content:information 130
content:help 131
content:mailing 132
content:under 132
content:content 133
content:html 133
content:license 135
content:java 136
content:source 136
content:one 137
content:open 137
content:faq 138
content:how 138
content:which 139
content:home 140
content:4 140
content:on-the 141
content:http 143
content:projects 145
content:version 146
content:project 146
content:using 147
content:3 147
content:foundation 155
content:the-apache 155
content:is-a 156
content:also 156
content:web 159
content:other 161
content:copyright 161
content:have 161
content:text 161
content:we 162
content:new 163
content:like 164
content:lists 164
content:see 168
content:will 169
content:if 171
content:not 172
content:in-the 172
content:page 173
content:wiki 177
content:org 177
content:to-the 178
content:your 178
content:1 179
content:get 187
content:2 189
content:an 192
content:can 194
content:software 195
content:about 195
content:all 199
content:search 200
content:s 200
content:as 202
content:or 203
content:2007 206
content:site 206
content:it 209
content:use 213
content:at 214
content:be 217
content:apache 221
content:that 224
content:more 225
content:from 227
content:of-the 227
content:you 229
content:are 234
content:0 238
content:with 240
content:on 245
content:by 248
host:apache 251
url:apache 252
content:this 256
content:in 275
content:is 277
content:for 287
content:of 291
content:and 297
content:a 300
host:org 300
content:to 300
url:org 300
content:the 315
url:http 358
----------------------------------------------------------------------------------------
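In case it helps to post-process a listing like the one above, here is a minimal Java sketch that parses the text and picks out the most frequent terms. It assumes every result line has the shape `field:term count`, as in the sample; the class and method names (`HighFreqTermsOutput`, `parse`, `topN`) are hypothetical and not part of Nutch. It only works on the printed text and says nothing about how Nutch computes its hit estimate.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Nutch: parses "field:term count" lines.
public class HighFreqTermsOutput {

    /** One parsed line of the listing, e.g. "content:the 315". */
    static final class TermCount {
        final String field;
        final String term;
        final int count;

        TermCount(String field, String term, int count) {
            this.field = field;
            this.term = term;
            this.count = count;
        }
    }

    /** Parses the listing; lines that do not match the assumed shape are skipped. */
    static List<TermCount> parse(String output) {
        List<TermCount> result = new ArrayList<>();
        for (String raw : output.split("\\R")) {
            String line = raw.trim();
            int space = line.lastIndexOf(' ');   // count follows the last space
            int colon = line.indexOf(':');       // field name precedes the first colon
            if (space < 0 || colon <= 0 || colon > space) {
                continue; // separator rows, the echoed command, blank lines
            }
            try {
                int count = Integer.parseInt(line.substring(space + 1));
                String field = line.substring(0, colon);
                String term = line.substring(colon + 1, space).trim();
                result.add(new TermCount(field, term, count));
            } catch (NumberFormatException e) {
                // last token is not a number, so this is not a result line
            }
        }
        return result;
    }

    /** Returns up to n entries with the highest counts, largest first. */
    static List<TermCount> topN(List<TermCount> terms, int n) {
        List<TermCount> sorted = new ArrayList<>(terms);
        sorted.sort(Comparator.comparingInt((TermCount t) -> t.count).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        // Three lines copied from the sample result above.
        String sample = "content:the 315\nurl:http 358\ncontent:of 291\n";
        for (TermCount t : topN(parse(sample), 2)) {
            System.out.println(t.field + ":" + t.term + " " + t.count);
        }
    }
}
```

Splitting on the first colon and the last space keeps terms such as `for-the` intact and lets the parser ignore the separator rows and the echoed command line.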

- Sagar

Re: Hits estimation?

Posted by Hal Fulton <ru...@gmail.com>.
I am not sure what I mean -- I was told there was an algorithm
somewhere in there for calculating a quick estimate.

Any insight is welcome.

Hal



Re: Hits estimation?

Posted by Sagar Vibhute <vi...@gmail.com>.
Do you mean finding the number of hits in a finished crawl? The frequency
of terms crawled?

- Sagar
