Posted to java-user@lucene.apache.org by Vijay Veeraraghavan <vi...@gmail.com> on 2010/05/03 09:21:31 UTC

Indexing only newly created files

Dear all,
I am using Lucene 3.0 to index the PDF reports that I generate
dynamically. For each PDF I index the file name (without extension),
the file path and the absolute path as fields. I search by file name
without extension, and the search returns a list, since usually two or
more files with the same name exist in different subdirectories. The
first time I build the index (say, for 100 PDF files in different
directories), it stores the meta info of those files. If I run
indexing again after my report generator scheduler has produced 500
more PDF files, for a total of 600 files in different directories, I
want to add only the new files to the index. At present it indexes
everything again (all 600 files). How can I implement this? Bear in
mind that thousands of PDF files are created on each run.

P.S.: I cannot keep the meta info of the generated PDF files in Java
memory and update the index by looping over that list, because a
single run produces thousands of files.

new IndexWriter(FSDirectory.open(this.indexDir),
        new StandardAnalyzer(Version.LUCENE_CURRENT),
        true,
        IndexWriter.MaxFieldLength.LIMITED);

Is the boolean parameter meant for this purpose? Please guide me.

-- 
Thanks
Vijay Veeraraghavan




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Indexing only newly created files

Posted by Vijay Veeraraghavan <vi...@gmail.com>.
Dear all,

Following the reply below: if I search the index for each document
and skip it when found, otherwise index it, is that not much the same
as indexing all the PDF documents again? Is that not overhead? I am
not indexing the contents of the PDFs, only the paths of the
documents, so if an already indexed PDF is recreated I need not
reindex it.

Vijay





Re: Indexing only newly created files

Posted by Vijay Veeraraghavan <vi...@gmail.com>.
Dear Simon,
Thanks for your reply; I found it very useful.
I have another question. I create the index in a clustered environment
(2 physical systems and 2 virtual). The index will be created on a
system shared among the nodes. The scheduler runs on another remote
system, which creates the PDFs and saves them to a remote file server.
Is it advisable to create a local index and add it to the main index
shared by the nodes? Or should I create a local index and copy it to
the nodes?

Thanks
Vijay





Re: Indexing only newly created files

Posted by Simon Willnauer <si...@googlemail.com>.
Hey there,

you will probably have to implement some kind of unique identifier
using an indexed Lucene field. When indexing, fire a query with the
UUID of your document (perhaps the path to your PDF document) and
check whether the document is already in the index. You could also run
a boolean query combining the UUID with a timestamp and/or a hash
value to see whether the document has changed. If it has, you can
simply update the document by its UUID (something like
indexWriter.updateDocument(new Term("uuid", value), document);).

Unfortunately you have to implement this yourself, but it should not
be that big a deal.
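
The check-then-add logic described above might look roughly like the
sketch below. It is an illustration only and deliberately not Lucene
code: PathIndex is a hypothetical stand-in for the index, backed here
by an in-memory map so the example runs on its own, and the file path
is assumed to serve as the unique key. In Lucene 3.0 the lookup would
typically be an exact-match term on an un-analyzed "uuid" field, and
the ADD and UPDATE branches would map to IndexWriter.addDocument and
IndexWriter.updateDocument.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: decides per file whether to skip, add or update, keyed by path. */
final class IncrementalIndexSketch {

    /** Hypothetical stand-in for the index; this is NOT a Lucene type. */
    interface PathIndex {
        /** Returns the timestamp stored for this key, or null if the key is absent. */
        Long storedTimestamp(String path);

        /** Adds the entry, or replaces the existing entry with the same key. */
        void put(String path, long lastModified);
    }

    /** In-memory implementation so the sketch runs without Lucene on the classpath. */
    static final class InMemoryPathIndex implements PathIndex {
        private final Map<String, Long> entries = new HashMap<String, Long>();

        public Long storedTimestamp(String path) {
            return entries.get(path);
        }

        public void put(String path, long lastModified) {
            entries.put(path, lastModified);
        }
    }

    enum Action { SKIP, ADD, UPDATE }

    /** Unknown key: ADD. Known key with a different timestamp: UPDATE. Otherwise SKIP. */
    static Action decide(PathIndex index, String path, long lastModified) {
        Long stored = index.storedTimestamp(path);
        if (stored == null) {
            return Action.ADD;
        }
        return stored.longValue() == lastModified ? Action.SKIP : Action.UPDATE;
    }

    /** Processes one run over (path, lastModified) pairs; returns the paths written. */
    static List<String> indexRun(PathIndex index, Map<String, Long> files) {
        List<String> written = new ArrayList<String>();
        for (Map.Entry<String, Long> file : files.entrySet()) {
            if (decide(index, file.getKey(), file.getValue()) != Action.SKIP) {
                index.put(file.getKey(), file.getValue());
                written.add(file.getKey());
            }
        }
        return written;
    }

    public static void main(String[] args) {
        PathIndex index = new InMemoryPathIndex();

        // First run: the index is empty, so both files are written.
        Map<String, Long> firstRun = new LinkedHashMap<String, Long>();
        firstRun.put("/reports/a.pdf", 100L);
        firstRun.put("/reports/b.pdf", 200L);
        System.out.println(indexRun(index, firstRun));

        // Second run: a.pdf is unchanged, b.pdf was recreated, c.pdf is new.
        Map<String, Long> secondRun = new LinkedHashMap<String, Long>();
        secondRun.put("/reports/a.pdf", 100L);
        secondRun.put("/reports/b.pdf", 250L);
        secondRun.put("/reports/c.pdf", 300L);
        System.out.println(indexRun(index, secondRun));
    }
}
```

Each file still costs one key lookup per run, but only new or changed
files are written. If recreated PDFs never need reindexing, the
timestamp comparison can be dropped and a plain existence check is
enough. One related point from the Lucene 3.0 API rather than from
this thread: the boolean in the IndexWriter constructor shown in the
original post is the create flag. Passing true creates a new index or
overwrites an existing one, so an incremental run would pass false to
append to the existing index.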

simon
