Posted to java-user@lucene.apache.org by Subhrajyoti Moitra <su...@contata.co.in> on 2003/04/07 07:54:45 UTC
differential indexing
Hi,
I am trying to append one index to another. How should I do it?
Let me explain my problem; perhaps people can suggest a better way.
I have indexed a set of PDF documents. These documents are retrieved from the DB. Each document has a unique docId, which is one of the fields in the index entries. I am using PDFBox to parse the contents of each PDF document and convert it into text for indexing.
Scenario-I
Now someone adds a new document to the DB. What I presently do is retrieve all the documents from the DB,
including the new one, and create a fresh index from this set of documents. The problem here is time.
I have some 10,000 documents in my system, and re-indexing every one of them takes a hell of a long time.
What I want is to "APPEND" the new document's index data to the existing index.
Scenario-II
When an existing document in the DB is changed, I want to remove that document from the index (this is easy since I have the unique docId) and add the new, modified index data to the existing index, instead of recreating the entire index.
To sum up: how do I do differential indexing? (I hope I am using the proper terminology.)
Could someone please suggest some solutions to this?
Thank you in advance.
Subhro.
Re: differential indexing
Posted by Michael Wechner <mi...@wyona.org>.
Subhrajyoti Moitra wrote:
> Hi,
> I am trying to append one index to another. How should i do it?
Otis Gospodnetic wrote an article that covers "merging indices" (see the second page):
http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html
HTH
Michael
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
a simple highlight resolvent
Posted by ke...@hotmail.com.
hi everyone,
I am new to Lucene and have had a lot of help over the last two weeks. Here I want to share a simple highlighting solution. Of course I have tried de.iqcomputing.lucene; this is just a simple approach based on the demo. Here is the source. It's a little clumsy. I don't know why it won't compile if I use String.replaceAll(), so I wrote another method to handle it. Any suggestions are welcome.
Kerr.
To use it in results.jsp: --------------------------------
hl = new org.apache.lucene.demo.HightLighter(query.toString("contents"));
...
try {
    summary = hl.getHighlightFile(path);
    // Fall back to the stored summary if no usable excerpt was produced.
    if (summary == null || summary.length() < 100) {
        summary = doc.get("summary");
    }
} catch (Exception e) {
    summary = doc.get("summary");
}
...
---------------------------------------
package org.apache.lucene.demo;

import java.io.*;
import java.util.*;
import org.apache.lucene.demo.html.HTMLParser;

/** Builds a short excerpt of an HTML file with the query terms shown in red. */
public class HightLighter {
    String[] keys;   // terms to highlight
    boolean[] b;     // b[i] is true once keys[i] has been seen in the text

    public HightLighter(String query) {
        ArrayList terms = new ArrayList();
        // Split on double quotes only; within each piece, query operators,
        // parentheses, and spaces are stripped and the remainder becomes one key.
        StringTokenizer st = new StringTokenizer(query, "\"");
        while (st.hasMoreTokens()) {
            String aToken = st.nextToken();
            StringBuffer dest = new StringBuffer();
            for (int i = 0; i < aToken.length(); i++) {
                char aChar = aToken.charAt(i);
                if (aChar != '+' && aChar != '-' && aChar != '('
                        && aChar != ')' && aChar != ' ') {
                    dest.append(aChar);
                }
            }
            if (dest.length() > 0) {
                terms.add(dest.toString());
            }
        }
        keys = (String[]) terms.toArray(new String[terms.size()]);
        b = new boolean[keys.length];   // all false initially
    }

    /** Wraps every occurrence of key in sourceString in red font tags. */
    public String replaceKey(String sourceString, String key) {
        StringBuffer destString = new StringBuffer();
        int i = sourceString.indexOf(key);
        while (i != -1) {
            destString.append(sourceString.substring(0, i))
                      .append("<font color=\"red\">").append(key).append("</font>");
            sourceString = sourceString.substring(i + key.length());
            i = sourceString.indexOf(key);
        }
        destString.append(sourceString);
        return destString.toString();
    }

    /** Highlights every key in the given string. */
    public String getHighlight(String string) {
        for (int i = 0; i < keys.length; i++) {
            if (string.indexOf(keys[i]) != -1) {
                //string = string.replaceAll(keys[i], "<font color=\"red\">" + keys[i] + "</font>");
                string = replaceKey(string, keys[i]);
            }
        }
        return string;
    }

    /** Parses the HTML file at path and returns a highlighted excerpt, or null. */
    public String getHighlightFile(String path)
            throws IOException, InterruptedException {
        HTMLParser parser = new HTMLParser(new File(path));
        Reader fr = parser.getReader();
        try {
            return this.getHighlight(fr);
        } finally {
            fr.close();
        }
    }

    /** Collects the chunks in which each key first appears and highlights them. */
    public String getHighlight(Reader reader) {
        try {
            // Chunk size: the fewer the keys, the larger the chunk (minimum 100 chars).
            int size = 5 - keys.length;
            if (size > 0) {
                size = size * 100;
            } else {
                size = 100;
            }
            char[] buffer = new char[size];
            StringBuffer last = new StringBuffer("..");
            boolean end = false;
            int n;
            for (int c = 0; c < b.length; c++) {
                b[c] = false;
            }
            // Stop at end of input or once every key has been found.
            while (!end && (n = reader.read(buffer)) != -1) {
                // Use only the chars actually read; the rest of the buffer is stale.
                String temp = new String(buffer, 0, n);
                boolean found = false;
                for (int i = 0; i < keys.length; i++) {
                    if (!b[i] && temp.indexOf(keys[i]) != -1) {
                        b[i] = true;
                        found = true;
                    }
                }
                // Append the chunk once, even if it contains several new keys.
                if (found) {
                    last.append("." + temp + "..");
                }
                end = keys.length > 0;
                for (int i = 0; i < keys.length; i++) {
                    if (!b[i]) {
                        end = false;
                        break;
                    }
                }
            }
            last.append(".");
            // Too short to be a useful excerpt: let the caller fall back.
            if (last.length() < 100) {
                return null;
            } else {
                return this.getHighlight(last.toString());
            }
        } catch (Exception e) {
            return null;
        }
    }
}
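On the String.replaceAll() question: that method was only added in J2SE 1.4, so one possible cause (an assumption, since the post does not show the compiler error) is that the JSP container compiles against an older JDK. On a newer JDK there is a second pitfall: replaceAll() treats its first argument as a regular expression, so keys containing characters such as '+' or '$' need quoting. A minimal sketch of a literal-safe variant follows; the class and method names are hypothetical and not part of the posted code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: highlights literal keys using String.replaceAll(). */
final class HighlightSketch {

    /** Wraps every literal occurrence of key in text in red font tags. */
    static String highlight(String text, String key) {
        if (key.isEmpty()) {
            return text;   // nothing to highlight
        }
        String replacement = "<font color=\"red\">" + key + "</font>";
        // Pattern.quote makes the key match literally; Matcher.quoteReplacement
        // keeps '$' and '\' in the replacement from being read as group references.
        return text.replaceAll(Pattern.quote(key), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(highlight("a+b and a+b", "a+b"));
    }
}
```

Pattern.quote and Matcher.quoteReplacement require Java 5 or later; on an older JDK the hand-written replaceKey() above remains the simpler choice.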
Re: differential indexing
Posted by Subhrajyoti Moitra <su...@contata.co.in>.
Thanks a lot, Otis and Michael.
Otis, I am doing what you suggested below, and it seems to solve my
problem.
I compare the dates at the application level rather than at the DB level,
since some pre-processing is being done.
Now it's working great. Thanks a lot.
subhro.
----- Original Message -----
From: "Otis Gospodnetic" <ot...@yahoo.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Monday, April 07, 2003 10:16 PM
Subject: Re: differential indexing
> Why not just have 'last_indexed_date' that stores the last time you
> fetched from DB and indexed the PDF docs.
> Then when you want to re-index only those documents that have changed
> you use this last_indexed_date in your SELECT: SELECT * FROM your_table
> WHERE change_date > last_indexed_date.
> For each docId you get back you then know you have to do a delete & add
> in the Lucene index.
>
> Otis
Re: differential indexing
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Why not just keep a 'last_indexed_date' that stores the last time you
fetched from the DB and indexed the PDF docs?
Then, when you want to re-index only those documents that have changed,
you use this last_indexed_date in your SELECT: SELECT * FROM your_table
WHERE change_date > last_indexed_date.
For each docId you get back, you then know you have to do a delete & add
in the Lucene index.
Otis
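
The approach described above can be sketched in plain Java. This is only an illustration under stated assumptions: the DocRecord class, the method names, and the in-memory Map standing in for the Lucene index are all hypothetical, and the list filter stands in for the SELECT. In Lucene itself, the delete step would target a Term on the docId field (IndexReader.delete(Term) in the 1.x releases of that period; later versions offer IndexWriter.updateDocument(Term, Document)), followed by an ordinary addDocument.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of differential indexing; the Map is a stand-in for the Lucene index. */
final class DifferentialIndexSketch {

    /** Hypothetical row of the document table. */
    static final class DocRecord {
        final String docId;
        final long changeDate;   // epoch millis of the last change in the DB
        final String text;       // text already extracted from the PDF

        DocRecord(String docId, long changeDate, String text) {
            this.docId = docId;
            this.changeDate = changeDate;
            this.text = text;
        }
    }

    /** Stand-in for: SELECT * FROM your_table WHERE change_date > last_indexed_date. */
    static List<DocRecord> changedSince(List<DocRecord> table, long lastIndexedDate) {
        List<DocRecord> changed = new ArrayList<>();
        for (DocRecord r : table) {
            if (r.changeDate > lastIndexedDate) {
                changed.add(r);
            }
        }
        return changed;
    }

    /** Deletes and re-adds each changed record; returns the new last_indexed_date. */
    static long reindex(Map<String, String> index, List<DocRecord> changed, long lastIndexedDate) {
        long newest = lastIndexedDate;
        for (DocRecord r : changed) {
            index.remove(r.docId);        // delete the old entry, if there is one
            index.put(r.docId, r.text);   // add the new or modified entry
            newest = Math.max(newest, r.changeDate);
        }
        return newest;
    }

    public static void main(String[] args) {
        Map<String, String> index = new HashMap<>();
        index.put("1", "old text");
        List<DocRecord> table = new ArrayList<>();
        table.add(new DocRecord("1", 200L, "new text"));   // modified document
        table.add(new DocRecord("2", 300L, "added"));      // new document
        long last = reindex(index, changedSince(table, 100L), 100L);
        System.out.println(index.size() + " documents indexed, last_indexed_date=" + last);
    }
}
```

A new document and a modified one take the same path here: the delete is simply a no-op when the docId is not yet in the index, which covers both scenarios from the original question. Documents removed from the DB are not handled by this sketch.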