Posted to user@nutch.apache.org by caomanhdat <ca...@gmail.com> on 2011/06/22 14:19:16 UTC

Get frequency of word

Hi all,
I have a problem getting the frequency of a word in Nutch.
In Lucene it is quite easy with this code:

    Directory dir2 = FSDirectory.open(new File(indexDir));
    IndexReader ir = IndexReader.open(dir2);
    TermDocs termDocs = ir.termDocs(new Term("contents", "eBank"));
    int count = 0;
    while (termDocs.next()) {
        count += termDocs.freq();
    }

But in Nutch the indexer is quite weird, so I can't do the same thing:

    Directory dir2 = FSDirectory.open(new File("D:\\nutch\\crawl\\indexes"));
    IndexReader ir = IndexReader.open(dir2);
    TermDocs termDocs = ir.termDocs(new Term("contents", "eBank"));
    int count = 0;
    while (termDocs.next()) {
        count += termDocs.freq();
    }
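
Note for readers: one possible cause, assuming a Nutch 1.x crawl layout (not confirmed in this thread), is that crawl/indexes is not itself a Lucene index but a parent directory holding one index per part-NNNNN subdirectory. Under that assumption, each part directory would have to be opened separately with the same FSDirectory.open / IndexReader.open calls and the counts summed. The sketch below uses only the JDK and merely lists those subdirectories; the class name IndexParts and the method listParts are hypothetical helpers, not part of Nutch or Lucene.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: finds the per-part index directories under an
// "indexes" directory. The "part-" naming is an assumption about the layout.
final class IndexParts {

    /** Returns the part-* subdirectories of indexesDir, sorted by path. */
    static List<File> listParts(File indexesDir) {
        List<File> parts = new ArrayList<File>();
        File[] children = indexesDir.listFiles();
        if (children == null) {
            return parts; // not a directory, missing, or unreadable
        }
        Arrays.sort(children);
        for (File child : children) {
            if (child.isDirectory() && child.getName().startsWith("part-")) {
                parts.add(child);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Default path is only an example; pass the real one as an argument.
        File indexesDir = new File(args.length > 0 ? args[0] : "crawl/indexes");
        for (File part : listParts(indexesDir)) {
            System.out.println(part.getPath());
        }
    }
}
```

Each directory returned by listParts could then be passed to FSDirectory.open in place of the parent path, with the per-directory counts added together. If the count is still zero, the field name and the casing of the term are worth checking against what the indexer actually stored.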



--
View this message in context: http://lucene.472066.n3.nabble.com/Get-frequency-of-word-tp3095236p3095236.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Get frequency of word

Posted by caomanhdat <ca...@gmail.com>.
Thanks for your answer!
So how can I get the frequency of a word across all documents indexed by
Nutch?

--
View this message in context: http://lucene.472066.n3.nabble.com/Get-frequency-of-word-tp3095236p3099835.html

Re: Get frequency of word

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
Are you trying to use Nutch's indexer? AFAIK that's deprecated, isn't it?

-- 
Regards,
K. Gabriele
