
Posted to java-user@lucene.apache.org by Subhrajyoti Moitra <su...@contata.co.in> on 2003/04/07 07:54:45 UTC

differential indexing

Hi,
I am trying to append one index to another. How should I do it?

Let me explain my problem; perhaps someone can suggest a better way.

I have indexed a set of PDF documents that are retrieved from the DB. Each document has a unique docId, which is stored as one of the fields in its index entry. I am using PDFBox to parse each PDF document and convert its contents into text for indexing.

Scenario-I
Now someone adds a new document to the DB. What I presently do is retrieve all the documents from the DB,
including the new one, and create a fresh index from this set of documents. The problem here is time.
I have some 10,000 documents in my system, and re-indexing every one of them takes a very long time.
What I want is to "APPEND" the new document's index data to the existing index.

Scenario-II
When an existing document in the DB is changed, I want to remove that document from the index (this is easy, since I have the unique docId) and add the new, modified index data to the existing index, instead of recreating the entire index.

To sum up: how do I do differential indexing? (I hope I am using the proper terminology.)

Could someone please suggest some solutions to this?

Thank you in advance.
Subhro.

Re: differential indexing

Posted by Michael Wechner <mi...@wyona.org>.

Subhrajyoti Moitra wrote:
> Hi,
> I am trying to append one index to another. How should i do it?

Otis Gospodnetic wrote an article about "merging indices" (see the second page):

http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html

HTH

Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

a simple highlight resolvent

Posted by ke...@hotmail.com.

hi everyone,
I am new to Lucene and have gotten a lot of help over the last 2 weeks. Here I want to share a simple highlighting solution. Of course I have tried de.iqcomputing.lucene; this is just a simple approach based on the demo. Here is the source. It's a little clumsy. I don't know why it won't compile if I use String.replaceAll(), so I wrote another method to handle it. Any suggestions are welcome.
          Kerr.

To use it in results.jsp: --------------------------------
hl = new org.apache.lucene.demo.HightLighter(query.toString("contents"));
...
try {
  summary = hl.getHighlightFile(path);
  // fall back to the stored summary when the excerpt is missing or too short
  if (summary == null || summary.length() < 100) {
    summary = doc.get("summary");
  }
} catch (Exception e) {
  summary = doc.get("summary");
}
...
---------------------------------------
package org.apache.lucene.demo;

import java.io.*;
import java.util.*;
import org.apache.lucene.demo.html.HTMLParser;

public class HightLighter {

  String[] keys;
  boolean[] b;

  public HightLighter(String query) {
    HashMap highlight = new HashMap();
    StringTokenizer st = new StringTokenizer(query, "\"");
    String aToken;
    int off = 0;
    while(st.hasMoreTokens()){
      aToken = st.nextToken();
      char aChar;
      String dest = "";
      for(int i=0; i<aToken.length(); i++){
        aChar = aToken.charAt(i);
        if (aChar != '+' &&
            aChar != '-' &&
            aChar != '(' &&
            aChar != ')' &&
            aChar != ' '){
          dest = dest + aChar;
        }
      }
      if (dest.length() > 0) {
        highlight.put(Integer.toString(off) , dest);
        off = off + 1;
      }
    }

    keys = new String[highlight.size()];
    b = new boolean[highlight.size()];
    for(int c=0; c<highlight.size(); c++){
      keys[c] = (String)highlight.get(Integer.toString(c));
      b[c] = false;
    }
  }

  public String replaceKey(String sourceString, String key) {
    String destString = "";
    int i = sourceString.indexOf(key);
    while ( i != -1 ) {
      destString = destString + sourceString.substring(0,i)
                 + "<font color=\"red\">" + key + "</font>";
      sourceString = sourceString.substring(i+key.length());
      i = sourceString.indexOf(key);
    }
    destString = destString+sourceString;
    return destString;
  }

  public String getHighlight(String string){
    for(int i=0; i<keys.length; i++){
      if (string.indexOf(keys[i]) != -1){
        //string = string.replaceAll(keys[i], "<font color=\"red\">" + keys[i] + "</font>");
        string = replaceKey(string, keys[i]);
      }
    }
    return string;
  }

  public String getHighlightFile(String path)
      throws IOException, InterruptedException  {

    HTMLParser parser = new HTMLParser(new File(path));
    Reader fr = parser.getReader();
    String hlString = this.getHighlight(fr);
    fr.close();
    return hlString;
  }

  public String getHighlight(Reader reader){

    try{
      int size = 5 - keys.length;
      if ( size > 0) {
        size = size * 100;
      } else {
        size = 100;
      }
      char[] buffer = new char[size];
      StringBuffer last = new StringBuffer("..");
      String temp;
      boolean end = false;
      for(int c=0; c<b.length; c++){
        b[c] = false;
      }

      int n;
      while (!end && (n = reader.read(buffer)) != -1){
        // use only the chars actually read; the rest of the buffer is stale
        temp = new String(buffer, 0, n);
        //System.out.println(temp);
        //temp = temp.replaceAll("\r\n", " ");
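        boolean matched = false;
        for(int i=0; i<keys.length; i++){
          if (!b[i] && temp.indexOf(keys[i]) != -1){
            b[i] = true;
            matched = true;
            //temp = temp.replaceAll(keys[i], "<font color=\"red\">" + keys[i] + "</font>");
          }
        }
        // append the chunk once, even if several keys first appear in it
        if (matched){
          last.append("." + temp + "..");
        }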
        for(int i=0; i<keys.length; i++){
          if (!b[i]){
            end = false;
            break;
          } else {
            end = true;
          }
        }
      }
      last.append(".");
      if (last.length() < 100){
        return null;
      } else {
        return(this.getHighlight(last.toString()));
      }
    } catch (Exception e){
      return null;
    }
  }
}
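
On the String.replaceAll() question above: that method was added in Java 1.4, so one possible explanation (not confirmed anywhere in this thread) is that the JSP was being compiled against an older JDK. Note also that replaceAll() treats its first argument as a regular expression, so a key such as "c++" would not be matched literally. The following is only a sketch of an alternative to replaceKey(), assuming a newer JDK (Pattern.quote and Matcher.quoteReplacement arrived in Java 5). The class name LiteralHighlightSketch and the choice to ignore case are my own assumptions, not part of the posted code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original HightLighter class.
class LiteralHighlightSketch {

    /**
     * Wraps every occurrence of key in source with red font tags.
     * The key is matched literally (regex metacharacters are quoted)
     * and without regard to case; the matched text keeps its own case.
     */
    static String highlight(String source, String key) {
        if (key.isEmpty()) {
            return source; // nothing to highlight
        }
        Pattern pattern = Pattern.compile(Pattern.quote(key), Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(source);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String replacement = "<font color=\"red\">" + matcher.group() + "</font>";
            // quoteReplacement keeps '$' and '\' in the matched text from being
            // interpreted as group references
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("Lucene is a search library; lucene is fast.", "lucene"));
    }
}
```

Ignoring case is a design choice here: query terms are often lowercased by the analyzer, so a case-sensitive indexOf() can miss capitalized words in the page text.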

Re: differential indexing

Posted by Subhrajyoti Moitra <su...@contata.co.in>.

Thanks a lot, Otis and Michael.

Otis, I am doing what you suggested below, and it seems to solve my
problem.
I compare the dates at the application level rather than at the DB level,
since some pre-processing is being done.
Now it's working great. Thanks a lot.
Subhro.

----- Original Message -----
From: "Otis Gospodnetic" <ot...@yahoo.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Monday, April 07, 2003 10:16 PM
Subject: Re: differential indexing


> Why not just keep a 'last_indexed_date' that stores the last time you
> fetched from the DB and indexed the PDF docs?
> Then, when you want to re-index only those documents that have changed,
> you use this last_indexed_date in your SELECT: SELECT * FROM your_table
> WHERE change_date > last_indexed_date.
> For each docId you get back, you then know you have to do a delete & add
> in the Lucene index.
>
> Otis

Re: differential indexing

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Why not just keep a 'last_indexed_date' that stores the last time you
fetched from the DB and indexed the PDF docs?
Then, when you want to re-index only those documents that have changed,
you use this last_indexed_date in your SELECT: SELECT * FROM your_table
WHERE change_date > last_indexed_date.
For each docId you get back, you then know you have to do a delete & add
in the Lucene index.

Otis
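
The delete & add loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: a plain in-memory map stands in for the Lucene index so the example runs without any library, and the names IncrementalIndexSketch, DocRow, changedSince and update are hypothetical. In a real setup the remove/put pair would become a delete by docId followed by adding the freshly parsed document to the Lucene index.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the map below is a stand-in for the real index.
class IncrementalIndexSketch {

    /** One database row (assumed shape: docId, extracted text, change date). */
    static final class DocRow {
        final String docId;
        final String text;
        final Instant changeDate;

        DocRow(String docId, String text, Instant changeDate) {
            this.docId = docId;
            this.text = text;
            this.changeDate = changeDate;
        }
    }

    private final Map<String, String> index = new LinkedHashMap<>();
    private Instant lastIndexedDate = Instant.EPOCH; // nothing indexed yet

    /** Application-level equivalent of: WHERE change_date > last_indexed_date. */
    static List<DocRow> changedSince(List<DocRow> rows, Instant since) {
        List<DocRow> changed = new ArrayList<>();
        for (DocRow row : rows) {
            if (row.changeDate.isAfter(since)) {
                changed.add(row);
            }
        }
        return changed;
    }

    /**
     * Re-indexes only new or changed rows and returns how many were touched.
     * fetchStartedAt should be the time the rows were fetched, so that a
     * change made while indexing is picked up by the next run.
     */
    int update(List<DocRow> rows, Instant fetchStartedAt) {
        List<DocRow> changed = changedSince(rows, lastIndexedDate);
        for (DocRow row : changed) {
            index.remove(row.docId);        // delete the stale entry, if any
            index.put(row.docId, row.text); // add the current version
        }
        lastIndexedDate = fetchStartedAt;
        return changed.size();
    }

    String get(String docId) {
        return index.get(docId);
    }

    int size() {
        return index.size();
    }

    public static void main(String[] args) {
        IncrementalIndexSketch idx = new IncrementalIndexSketch();
        List<DocRow> rows = new ArrayList<>();
        rows.add(new DocRow("1", "first document", Instant.parse("2003-04-01T00:00:00Z")));
        System.out.println("touched: " + idx.update(rows, Instant.parse("2003-04-07T00:00:00Z")));
    }
}
```

A new document (Scenario-I) and a modified document (Scenario-II) go through the same path here: both have a change date later than the last indexing run, and deleting a docId that is not yet in the index is simply a no-op.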




