Posted to java-user@lucene.apache.org by Yang Sun <mr...@yahoo.com> on 2003/08/20 08:59:33 UTC

Question on Lucene when indexing big pdf files

Hi,
    I am new to Lucene. I want to index the entire contents of my hard disk for searching, including HTML, PDF, Word, and other files. However, I have run into a problem indexing PDF files and need your help.
    My environment is lucene-1.3-rc (I have also tried lucene-1.2), jdk1.4.02, and pdfbox-0.62. I am trying to index all my PDFs. Indexing appears to run without errors (I use StandardAnalyzer; you can refer to my sources in the attachment). But when I search for a keyword, I get a lot of useless results: the matching PDFs do not contain the content I searched for. Can you help me with this problem?
    My attachment contains my source files and the test PDF files. Indexing the three PDF files with my program appears to work. But when I search the resulting index for the keyword "cisco", I get three Hits, and two of them do not contain "cisco" at all, so they are useless. I wondered whether pdfbox was at fault, so I printed out the indexed content; it does not contain the keyword "cisco" either. I used both Luke and my own searcher program as the search client, and neither seems to be the problem.
    Can anyone help me, or offer any comments on this problem? Everyone is welcome. My email is suny@ffcs.fujitsu.co.jp.
    suny

PS: Sorry, I cannot attach the files; it seems this mailing list does not accept attachments. So I have put my source code here. My three test PDF files total 100k; if someone would like to help me test this, I would really appreciate it.
 
IndexPDF.java

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import java.io.File;
import java.util.Date;
/**
 * FileName:
 * User: Administrator
 * Date: 2003-8-19
 * Time: 23:18:30
 * Functions:
 */
public class IndexPDF {
  File indexFiles;            //the file or directory we want to index
  public static void indexDocs(IndexWriter writer, File file) throws Exception {
    if (file.isDirectory()) {
      String[] files = file.list();
      for (int i = 0; i < files.length; i++) {
        indexDocs(writer, new File(file, files[i]));
      }
    } else if (file.getPath().endsWith(".pdf")) {
      System.out.println("adding " + file);
      writer.addDocument(PdfDocument.Document(file));
    } else {
      System.out.println("Ignoring " + file);
    }
  }
  public static void main(String args[]) throws Exception {
    if (args.length != 1) {
      System.out.println("Usage: IndexPDF <file/directory>");
      return;
    }
    try {
      Date start = new Date();
      IndexWriter writer = new IndexWriter("E:/Index", new StandardAnalyzer(), true);
      indexDocs(writer, new File(args[0]));
      writer.optimize();
      writer.close();
      Date end = new Date();
      System.out.println(end.getTime() - start.getTime());
      System.out.println(" total milliseconds");
    } catch (Exception e) {
      System.out.println(" caught a " + e.getClass() +
          "\n with message: " + e.getMessage());
    }
  }
}

PdfDocument.java

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;
import java.io.File;
import java.io.FileInputStream;
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import net.vicp.resshare.weblucene.document.PDFDocument;
/**
 * FileName:
 * User: Administrator
 * Date: 2003-8-19
 * Time: 23:20:54
 * Functions:
 */
public class PdfDocument {
  /* lucene object which represent a single data file */
  static Document doc = new Document();
  public static Document Document(File f ) {
    //set relative path to the path field in lucene
    doc.add(Field.UnIndexed("path", f.getPath()));
    System.out.println("Path is " + f.getPath());
    // use 10000 as the limit for temporary use
    doc.add(Field.Text("content", getPDFContent(f)));
    doc.add(Field.UnIndexed("filetype", "pdf"));
    doc.add(Field.UnIndexed("title", f.getName()));
    return doc;
  }
  /**
   * get the text content from the specified pdf file.
   * @param f the pdf we should extract the content
   * @return  the string contains the pdf content
   */
  private static String getPDFContent(File f) {
    byte[] contents = null;
    try {
      FileInputStream is = new FileInputStream(f);
      PDFParser parser = new PDFParser( is );
      parser.parse();
      PDDocument nbsp = parser.getPDDocument();
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      OutputStreamWriter writer = new OutputStreamWriter( out );
      PDFTextStripper stripper = new PDFTextStripper();
      stripper.writeText(nbsp.getDocument(), writer);
      writer.close();
      contents = out.toByteArray();
    } catch (Exception e) {
      e.printStackTrace();
      return "";
    }
    String ts = new String(contents);
    System.out.println("the string length is"+contents.length+"\n");
    return ts;
  }
}



---------------------------------
Do you Yahoo!?
Yahoo! SiteBuilder - Free, easy-to-use web site design software

Re: Question on Lucene when indexing big pdf files

Posted by Yang Sun <mr...@yahoo.com>.

Hi,
    When I use Luke to look at my index, it seems all right. The content in the index looks fine; all the text has been extracted from the PDF files. I copied the indexed content of the PDF file (namely the "content" field) and searched it for the keyword, but I could not find the keyword there either. I think there is nothing wrong with the pdfbox program.

    Would you please help me test this situation? I have three PDF files (100k in total); after I index them, I get useless results when I use "cisco" as the keyword. If you would like to help, I will send you my test source files and the three PDF files. I would really appreciate your help.

Ben Litchfield <be...@csh.rit.edu> wrote:


> "cisco". I use Luke and my searcher program as the searching client,
> it seems no problem. Can anyone help me? Or any comments on this

When you use luke to look at your index does it show the correct contents
for those documents?

Ben

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org




Re: Question on Lucene when indexing big pdf files

Posted by Ben Litchfield <be...@csh.rit.edu>.


>     "cisco". I use Luke and my searcher program as the searching client,
>     it seems no problem. Can anyone help me? Or any comments on this

When you use luke to look at your index does it show the correct contents
for those documents?

Ben


Re: Question on Lucene when indexing big pdf files

Posted by Damien Lust <dl...@denali.be>.

I don't know if it will help, but here is my code for extracting the
text of a PDF document:

   /**
    * Extracts text from a pdf document
    *
    * @param in The InputStream representing the pdf file.
    * @return The text in the file
    */
   public String extractText(InputStream in)
   {
     String s = null;
     try
     {
       PDFTextStripper _stripper = new PDFTextStripper();
       PDFParser parser = new PDFParser(in);
       parser.parse();
       s = _stripper.getText(parser.getDocument());
     }
     catch (Throwable t)
     {
       t.printStackTrace();
     }
     return s;
   }
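
One more thing that may be worth checking in the original PdfDocument class: the Lucene Document is held in a static field and the same instance is returned for every file. If that is what happens, each call to PdfDocument.Document(file) adds more fields to that one instance, so the entry indexed for a later file could also carry the text of earlier files. That would be consistent with hits that do not appear to contain the keyword, although nothing in this thread confirms it. The sketch below does not use the Lucene API; FakeDocument, buildShared and buildFresh are hypothetical stand-ins that only illustrate the accumulation and the alternative of creating a new instance per file.

```java
import java.util.ArrayList;
import java.util.List;

class SharedDocumentDemo {

    // Hypothetical stand-in for a Lucene Document: an ordered list of "name=value" fields.
    static class FakeDocument {
        final List<String> fields = new ArrayList<String>();

        void add(String name, String value) {
            fields.add(name + "=" + value);
        }

        // True if any field contains the word, ignoring case.
        boolean mentions(String word) {
            for (String field : fields) {
                if (field.toLowerCase().contains(word.toLowerCase())) {
                    return true;
                }
            }
            return false;
        }
    }

    // Mirrors the pattern in the posted PdfDocument: one static instance reused for every file.
    static FakeDocument shared = new FakeDocument();

    static FakeDocument buildShared(String path, String content) {
        shared.add("path", path);
        shared.add("content", content);
        return shared; // still holds the fields of every earlier file
    }

    // Alternative: a new instance per file, so nothing carries over between files.
    static FakeDocument buildFresh(String path, String content) {
        FakeDocument doc = new FakeDocument();
        doc.add("path", path);
        doc.add("content", content);
        return doc;
    }

    public static void main(String[] args) {
        buildShared("a.pdf", "Cisco router manual");
        FakeDocument second = buildShared("b.pdf", "cooking recipes");
        FakeDocument fresh = buildFresh("b.pdf", "cooking recipes");

        // prints "shared instance, b.pdf mentions cisco: true"
        System.out.println("shared instance, b.pdf mentions cisco: " + second.mentions("cisco"));
        // prints "fresh instance, b.pdf mentions cisco: false"
        System.out.println("fresh instance, b.pdf mentions cisco: " + fresh.mentions("cisco"));
    }
}
```

If this turns out to be the cause, the equivalent change in the posted code would be to create the Document inside the Document(File) method rather than in a static field, and then index the three files again.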





