
Posted to general@lucene.apache.org by Iraida <to...@yandex.ru> on 2013/03/06 13:15:04 UTC

sorting frequencies

Hello. I've indexed documents and computed the term frequencies. The
output file has the form:
d1.doc 0 14
d1.doc 1 7
d1.doc 2 4
d1.doc 3 1
d1.doc 4 4
d1.doc 5 2
d1.doc 7 1
d1.doc 8 5
d1.doc 9 3
d1.doc aj 1
d1.doc ax 1
d1.doc b 11
d1.doc big 1
d1.doc bjbjlulu 1
d1.doc books 3
d1.doc c 3
d1.doc can 1
d1.doc cj 1
d1.doc come 1
d1.doc country 1
d1.doc cx 1
d1.doc d 8
d1.doc different 2
d1.doc e 7
d1.doc every 1
d1.doc everywhere 1
d1.doc f 7
d1.doc find 1
d1.doc g 6
d1.doc gd 1
d1.doc h 12
d1.doc has 1
d1.doc have 1
d1.doc hx 1
d1.doc i 10
d1.doc j 5
d1.doc jd 1
d1.doc k 5
d1.doc l 11
d1.doc languages 1
d1.doc libraries 1
d1.doc library 2
d1.doc love 1
d1.doc m 12
d1.doc many 1
d1.doc mh 3
d1.doc microsoft 2
d1.doc millions 1
d1.doc msworddoc 1
d1.doc n 16
d1.doc newest 1
d1.doc normal.dot 1
d1.doc o 16
d1.doc office 2
d1.doc oh 1
d1.doc oldest 1
d1.doc our 1
d1.doc p 37
d1.doc p9 1
d1.doc pupils 1
d1.doc q 2
d1.doc r 11
d1.doc r4 1
d1.doc s 4
d1.doc school 1
d1.doc sh 3
d1.doc small 1
d1.doc subjects 1
d1.doc t 9
d1.doc take 1
d1.doc th 1
d1.doc u 5
d1.doc v 1
d1.doc w 3
d1.doc word 2
d1.doc word.document.8 1
d1.doc x 5
d1.doc y 4
d1.doc you 1
How do I sort by frequency?
The problem is that the output contains terms that are not in the
document (I do not know where the extra numbers and letters come from).
The document, d1.doc, was the following:
There are many big and small libraries everywhere in our country. They have
millions of books in different languages. You can find there the oldest and
the newest books.
Every school has a library. Pupils come to the library to take books on
different subjects.
Please help me. Sorry for my bad English.





--
View this message in context: http://lucene.472066.n3.nabble.com/sorting-frequencies-tp4045197.html
Sent from the Lucene - General mailing list archive at Nabble.com.

Re: sorting frequencies

Posted by Iraida <to...@yandex.ru>.

I'm not sure whether I'm doing it right.
Here is the code of the program:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.Version;
import java.io.*;

public class JavaApplication1 {

    public static File dataDir = new File("C:/filestoindex");
    public static File indexDir = new File("C:/fileindex");

    // Builds a fresh index in indexDir from every file under dataDir.
    public static void index(File indexDir, File dataDir) throws IOException {
        if (!dataDir.exists() || !dataDir.isDirectory()) {
            throw new IOException(dataDir + " does not exist or is not a directory");
        }

        Analyzer ac = new StandardAnalyzer(Version.LUCENE_30);
        IndexWriter indexWriter = new IndexWriter(FSDirectory.open(indexDir), ac, true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        indexDirectory(indexWriter, dataDir);
        indexWriter.close();
    }
  
   
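    // Recursively indexes every regular file below dir.
    private static void indexDirectory(IndexWriter writer, File dir)
            throws IOException {
        File[] files = dir.listFiles();
        for (int i = 0; i < files.length; i++) {
            File f = files[i];
            if (f.isDirectory()) {
                indexDirectory(writer, f);
            } else {
                // Only regular files are indexed; opening a directory with FileReader fails.
                indexFile(writer, f);
            }
        }
    }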
    
    // Adds one document per file: tokenized contents with term vectors,
    // plus the stored file name and path.
    private static void indexFile(IndexWriter writer, File f) throws IOException {
        System.out.println("Indexing " + f.getName());
        Document doc = new Document();
        // FileReader decodes the file's raw bytes using the platform default charset.
        doc.add(new Field("contents", new FileReader(f), Field.TermVector.YES));
        doc.add(new Field("filename", f.getName(), Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        doc.add(new Field("path", f.getCanonicalPath(), Field.Store.YES,
                Field.Index.NOT_ANALYZED));

        writer.addDocument(doc);
    }
    
   

    public static void main(String[] args) throws Exception {
        index(indexDir, dataDir);

        IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
        FileOutputStream fr = new FileOutputStream("C:/fileout/f.txt");
        for (int docNum = 0; docNum < reader.numDocs(); docNum++) {
            TermFreqVector tfv = reader.getTermFreqVector(docNum, "contents");
            if (tfv == null) {
                continue;
            }
            // Parallel arrays: freqs[t] is the frequency of terms[t] in this document.
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            String filename = reader.document(docNum).getField("filename").stringValue();

            for (int t = 0; t < terms.length; t++) {
                // One "filename term frequency" line per term, to the file and the console.
                String line = filename + " " + terms[t] + " " + freqs[t];
                fr.write((line + "\r\n").getBytes("UTF-8"));
                System.out.println(line);
            }
        }
        fr.close();
        reader.close();
    }
}




--
View this message in context: http://lucene.472066.n3.nabble.com/sorting-frequencies-tp4045197p4045249.html

Re: sorting frequencies

Posted by Lee <le...@gmail.com>.

Without knowing anything more, I see that your output claims d1.doc
contains the word 'microsoft'. Are you sure you are parsing the document
you think you are parsing?
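
One way to check this, as a rough sketch only: terms such as 'msworddoc',
'word.document.8' and 'normal.dot' in the output suggest that d1.doc may be
a binary Word file whose raw bytes were tokenized. That is an assumption and
is not confirmed anywhere in this thread. Legacy binary Office files start
with the OLE2 compound-file signature D0 CF 11 E0 A1 B1 1A E1, so the first
eight bytes can be tested. The class and method names below are made up for
illustration.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

class BinaryDocCheck {

    // Signature found at the start of OLE2 compound files (legacy .doc/.xls/.ppt).
    private static final int[] OLE2_MAGIC =
            {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    /** Returns true if the given bytes begin with the OLE2 signature. */
    static boolean hasOle2Signature(byte[] head) {
        if (head.length < OLE2_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < OLE2_MAGIC.length; i++) {
            if ((head[i] & 0xFF) != OLE2_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads the first eight bytes of a file and tests them for the signature. */
    static boolean looksLikeOle2(File f) throws IOException {
        byte[] head = new byte[OLE2_MAGIC.length];
        try (InputStream in = new FileInputStream(f)) {
            int n = 0;
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break; // file is shorter than the signature
                }
                n += r;
            }
            return n == head.length && hasOle2Signature(head);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Hypothetical usage: pass the path of the file that was indexed.
            System.out.println(looksLikeOle2(new File(args[0])));
        } else {
            byte[] plainText = "There are many big and small libraries".getBytes("US-ASCII");
            System.out.println(hasOle2Signature(plainText)); // prints "false"
        }
    }
}
```

If the check returns true for d1.doc, the file would need its text extracted
first (for example with a document-parsing library) before it is handed to
the analyzer, instead of being read directly with FileReader.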

The output looks like word stems and frequency to me, at a guess.
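
As for the sorting question, here is a minimal sketch, assuming the terms and
frequencies arrive as the two parallel arrays returned by getTerms() and
getTermFrequencies() in the posted code. The class FrequencySort and its
TermFreq helper are hypothetical names, not part of Lucene. The idea is to
pair each term with its count and sort the pairs by descending count,
breaking ties alphabetically.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class FrequencySort {

    /** One (term, frequency) pair taken from the parallel arrays. */
    static final class TermFreq {
        final String term;
        final int freq;

        TermFreq(String term, int freq) {
            this.term = term;
            this.freq = freq;
        }
    }

    /** Pairs up the arrays and sorts by descending frequency, then by term. */
    static List<TermFreq> sortByFrequency(String[] terms, int[] freqs) {
        List<TermFreq> pairs = new ArrayList<TermFreq>();
        for (int i = 0; i < terms.length; i++) {
            pairs.add(new TermFreq(terms[i], freqs[i]));
        }
        Collections.sort(pairs, new Comparator<TermFreq>() {
            public int compare(TermFreq a, TermFreq b) {
                if (a.freq != b.freq) {
                    return Integer.compare(b.freq, a.freq); // higher count first
                }
                return a.term.compareTo(b.term); // alphabetical tie-break
            }
        });
        return pairs;
    }

    public static void main(String[] args) {
        // Sample values copied from the output in the original post.
        String[] terms = {"big", "books", "library", "n", "p"};
        int[] freqs = {1, 3, 2, 16, 37};
        for (TermFreq tf : sortByFrequency(terms, freqs)) {
            System.out.println("d1.doc " + tf.term + " " + tf.freq);
        }
        // prints p 37, n 16, books 3, library 2, big 1 (one per line)
    }
}
```

In the posted program this would be called once per document, inside the
docNum loop, before the lines are written to the output file.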
