
Posted to java-user@lucene.apache.org by oh...@cox.net on 2009/08/04 17:40:29 UTC

Slightly Off-topic: How to decide whether or not to add a document?

Hi,

I have an app to initially create a Lucene index, and to populate it with documents. I'm now working on that app to insert new documents into that Lucene index.

In general, this new app, which is based loosely on the demo apps (e.g., IndexFiles.java), is working, i.e., I can run it with a "create" parameter, and it creates a good/valid index from the documents, and then I can run it with an "insert" parameter, and it inserts new documents into the index.

[As I mentioned in an earlier thread, we only have a requirement to insert new documents into the index; there are no requirements for deleting documents or updating documents that have already been indexed.]

Ok, as I said, that works so far.

However, in our case, the processes that are creating the documents that we are indexing are fairly long-lived, and write fairly large documents, and I'm worried that when an insert operation is run, some of the potential documents may still be being written to, and we wouldn't want the indexer to insert the document into the Lucene index until the document is "complete".

As you know, the way that the demos such as IndexFiles work is that they call a method called indexDocs(). indexDocs() then recursively walks the directory tree, calling the writer to add each file to the index.

In this loop, indexDocs() does a few checks (isDirectory(), canRead()), and I think that it would "pick up" (find) some documents that are still "in progress" (being written to, and not closed) in our case.

I was wondering if anyone here has a situation similar to this (having to index large documents that may be "in progress/being written to"), and how you handle this situation?

FYI, this is on Redhat Linux (and on Windows in my test environment).

Thanks!

Jim

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Slightly Off-topic: How to decide whether or not to add a document?

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

I've been working on an indexing solution using Spring Integration and
Lucene. The example project uses JMS to create work items (index add or
update) and then a service that polls for work to do. I should have this
complete soon and will be putting it on Google Code. It's not much help right
now, but hopefully when it's done it should give you an idea, and
you can probably tweak it to your requirements.
Cheers
Amin


Re: Slightly Off-topic: How to decide whether or not to add a document?

Posted by oh...@cox.net.

Hi Ian,

Ok, thanks for the additional info.

I've implemented a check for both file.lastModified() and file.length(), and it seems to work in my dev environment (Windows), so I'll have to test on a "real" system.

Thanks again,
Jim



Re: Slightly Off-topic: How to decide whether or not to add a document?

Posted by Ian Lea <ia...@gmail.com>.

Jim


The sleep is simply

	    try { Thread.sleep(millis); }
	    catch (InterruptedException ie) { Thread.currentThread().interrupt(); } // restore the interrupt flag

No threading issues that I'm aware of, despite the method living in
the Thread class.

But you're right about it possibly impacting performance, if you've
got to sleep for a reasonable amount of time for each doc, if you've
got loads of docs.  You can improve it by getting a list of possible
files + size + lastmod + whatever, sleeping, then checking them all
again i.e. only sleep once for each pass rather than once per file.
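
The "sleep once per pass" idea could be sketched as below. This is only an illustration under assumptions: the class name BatchStabilityCheck, the nested Snapshot type, and the method names are made up here, and an unchanged size and timestamp is a heuristic, not proof that the producer has closed the file.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: snapshot all candidates, sleep once, re-check them all. */
final class BatchStabilityCheck {

    /** Size and last-modified time of one file at one moment. */
    static final class Snapshot {
        final long length;
        final long lastModified;

        Snapshot(long length, long lastModified) {
            this.length = length;
            this.lastModified = lastModified;
        }

        static Snapshot of(File file) {
            // Both calls return 0 for a file that does not exist.
            return new Snapshot(file.length(), file.lastModified());
        }

        boolean sameAs(Snapshot other) {
            return length == other.length && lastModified == other.lastModified;
        }
    }

    /**
     * Returns the candidates whose size and timestamp did not change
     * during a single wait of waitMillis. Returns an empty list if the
     * wait is interrupted.
     */
    static List<File> stableFiles(List<File> candidates, long waitMillis) {
        Map<File, Snapshot> before = new HashMap<>();
        for (File file : candidates) {
            before.put(file, Snapshot.of(file));
        }
        try {
            Thread.sleep(waitMillis); // one sleep for the whole pass
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return new ArrayList<>();
        }
        List<File> stable = new ArrayList<>();
        for (File file : candidates) {
            if (file.isFile() && before.get(file).sameAs(Snapshot.of(file))) {
                stable.add(file);
            }
        }
        return stable;
    }
}
```

The indexer would then add only the files returned by stableFiles() and leave the rest for the next pass.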

Yet another option is to forget about sleeping and check the lastmod
timestamp and only index the doc if it was finished some time ago.
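
A minimal sketch of the timestamp-only check, assuming a "quiet period" threshold that you would have to pick yourself. The class name QuietPeriodCheck and the method names are hypothetical. Note the limits: a producer that pauses for longer than the quiet period would be misjudged as finished, and timestamp resolution varies by file system.

```java
import java.io.File;

/** Illustrative only: treat a file as finished if it has not been modified recently. */
final class QuietPeriodCheck {

    /**
     * Pure form of the rule, convenient for testing: true if the file
     * was last modified at least quietMillis before nowMillis.
     * A timestamp of 0 (what File.lastModified() returns for a missing
     * file) is never considered quiet.
     */
    static boolean isQuiet(long lastModifiedMillis, long nowMillis, long quietMillis) {
        return lastModifiedMillis > 0 && nowMillis - lastModifiedMillis >= quietMillis;
    }

    /** Applies the rule to a real file using the current clock. */
    static boolean isQuiet(File file, long quietMillis) {
        return isQuiet(file.lastModified(), System.currentTimeMillis(), quietMillis);
    }
}
```

For example, isQuiet(file, 60000L) would index only files untouched for at least a minute; the right threshold depends on how the producers write.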

And yet another ... make the producer write to /a/b/c and have a
standalone non-lucene job that reads /a/b/c doing whatever checks you
like, moving files to your input directory.
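
The standalone mover job might look something like this. It is a sketch under several assumptions: Java 7 or later (java.nio.file), a made-up class name MoveWhenReady, a readiness rule based only on the last-modified time, and staging and input directories on the same file system (ATOMIC_MOVE fails otherwise).

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/** Illustrative only: move finished files from a staging directory to the indexer's input directory. */
final class MoveWhenReady {

    /** Destination path for a staged file: same file name, inside inputDir. */
    static Path targetFor(Path inputDir, Path file) {
        return inputDir.resolve(file.getFileName());
    }

    /** Hypothetical readiness rule: a regular file not modified for quietMillis. */
    static boolean isReady(Path file, long quietMillis) throws IOException {
        if (!Files.isRegularFile(file)) {
            return false;
        }
        long age = System.currentTimeMillis() - Files.getLastModifiedTime(file).toMillis();
        return age >= quietMillis;
    }

    /** Moves every ready file and returns how many were moved. */
    static int moveReadyFiles(Path stagingDir, Path inputDir, long quietMillis) throws IOException {
        int moved = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(stagingDir)) {
            for (Path file : files) {
                if (isReady(file, quietMillis)) {
                    // Atomic move: the indexer never sees a half-moved file.
                    Files.move(file, targetFor(inputDir, file), StandardCopyOption.ATOMIC_MOVE);
                    moved++;
                }
            }
        }
        return moved;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java MoveWhenReady <stagingDir> <inputDir>");
            return;
        }
        // 60000 ms is an arbitrary quiet period chosen for the example
        int moved = moveReadyFiles(Paths.get(args[0]), Paths.get(args[1]), 60000L);
        System.out.println("moved " + moved + " file(s)");
    }
}
```

Run from cron or a similar scheduler, this keeps the completeness checks out of the indexer, which can then index everything it finds in the input directory.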


That's more than enough options from me.


--
Ian.


Re: Slightly Off-topic: How to decide whether or not to add a document?

Posted by oh...@cox.net.

Ian,

One question about the 4th alternative:  I was wondering how you implemented the sleep() in Java, esp. in such a way as not to mess up any of the Lucene stuff (in case there's threading)?

Right now, my indexer/inserter app doesn't explicitly do any threading stuff.

Thanks,
Jim



Re: Slightly Off-topic: How to decide whether or not to add a document?

Posted by oh...@cox.net.

Hi Ian,

Thanks for the quick response.

I forgot to mention, but in our case, the "producers" are part of a commercial package, so we don't have a way to get them to change anything, so I think the 1st 3 suggestions are not feasible for us.

I have considered something like the 4th suggestion (check file size, timeout, and check file size again).  I'm worried that it would impact the overall index insertion process, but unless there's something better, that may be our best option :(...

Thanks again,
Jim



Re: Slightly Off-topic: How to decide whether or not to add a document?

Posted by Ian Lea <ia...@gmail.com>.

A few suggestions:

. Queue the docs once they are complete using something like JMS.

. Get the document producers to write to e.g. xxx.tmp and rename to
e.g. xxx.txt at the end

. Get the document producers to write to a tmp folder and move to e.g.
input/ when done

. Find a file, store size, sleep for a while, check size and if changed, skip

I've used all these at one time or another for assorted, mainly
non-lucene, apps, and they are all workable.
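
The fourth suggestion (store the size, sleep, check the size again) could look roughly like this in Java. This is only a sketch: the class name StableSizeCheck, the method names, and the 5-second wait are made up for illustration, and an unchanged size does not prove that the producer has closed the file.

```java
import java.io.File;

/** Illustrative only: "store size, sleep, check size" for a single file. */
final class StableSizeCheck {

    /** True if both readings are the same non-zero size. */
    static boolean sameNonEmptySize(long before, long after) {
        return before > 0 && before == after;
    }

    /**
     * Reads the file length, waits waitMillis, and reads it again.
     * Returns false if the size changed, the file is missing or empty,
     * or the wait was interrupted.
     */
    static boolean isSizeStable(File file, long waitMillis) {
        long before = file.length(); // File.length() returns 0 for a missing file
        try {
            Thread.sleep(waitMillis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to callers
            return false;
        }
        return sameNonEmptySize(before, file.length());
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java StableSizeCheck <file>");
            return;
        }
        File candidate = new File(args[0]);
        // 5000 ms is an arbitrary wait chosen for the example
        String verdict = isSizeStable(candidate, 5000L) ? "index " : "skip for now ";
        System.out.println(verdict + candidate);
    }
}
```

Skipped files are simply picked up on a later run, once their size has stopped changing.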


--
Ian.

