Posted to java-user@lucene.apache.org by Scott Smith <SS...@MainstreamData.com> on 2004/01/28 21:49:18 UTC

The case of the disappearing index files

We have started using Lucene as the indexer for messages on our website.  We
are seeing a problem where some index files seem to disappear (we've seen
the segment file vanish as well as some others).

My first thought after looking through some archives is that maybe we are
getting the "too many open files" problem and this means that a file might
get deleted in preparation for being rewritten, but it can't be rewritten
because there are no file handles (this is on a Windows XP box).  Since the
indexer is pretty straightforward in that it opens an IndexWriter, adds new
messages received in the last minute and then closes the IndexWriter, I'm
pretty sure it's ok.  Besides, we didn't see this problem until we started
doing lots of searches.

I'm feeling less comfortable with the search code.  Here are a couple of
snippets.  The first is a port of some code that I saw in a Doug C. posting
(it was in v1.2 form and I needed it in v1.3):

	private Searcher m_Searcher = null;
	private long m_LastModified;
	private void getSearcher()
		throws IOException
	{
		// has the index been modified since last we looked?
		long newModified = IndexReader.getCurrentVersion(m_IndexDirectory);
		if (m_LastModified != newModified)
		{
			// Get a new searcher and orphan the old one w/o closing
			m_Searcher = new IndexSearcher(m_IndexDirectory);
			m_LastModified = newModified;
		}
	}

Here's a somewhat simplified version (I search more fields) of the search
code that calls it.

	public synchronized Hits SimpleSearch(String a_SearchString)
		throws IOException, ParseException
	{
		Query q = QueryParser.parse(a_SearchString, "Body", m_Analyzer);

		try
		{
			getSearcher();
		}
		catch (IOException e)
		{
			// if we can't generate searcher, then claim
			// nothing is there
			m_lggr.error(e.getMessage());
			return null;
		}

		Hits hits = m_Searcher.search(q);

		return hits;
	}

The caller then can walk through the hits list to get the messages.

Originally, I would close the searcher after I got the hits, but I found
that you couldn't access the documents in the Hits structure once the
IndexSearcher was closed (Looking at the source, it looks like the Hits list
doesn't actually have the documents in it, but simply has references to them
which it uses the Searcher object to get at).  So, I now never close the
Searcher (though I'll create a new one if the index has been modified since
the last time I looked).

One other thing, I know the web guy using this is creating a new object
every time he does a search (which I will talk to him about since I think
that's the wrong thing based on what I've read).  Is that my only problem?
Do I really want to wait until garbage collection reclaims the old Searchers
before the files they have opened get closed?
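
For illustration, one possible alternative to waiting for garbage collection
is to count the callers still using each searcher, and to close a replaced
searcher only once the last of them is done with its Hits. The sketch below
is not Lucene API and rests on assumptions: SharedResource, Lease, acquire
and replace are hypothetical names, and a plain java.io.Closeable stands in
for the IndexSearcher so that the example compiles on its own.

```java
import java.io.Closeable;
import java.io.IOException;

/**
 * Hypothetical sketch: shares one Closeable resource (standing in for an
 * IndexSearcher) and closes a replaced resource only after its last user
 * has released it.
 */
final class SharedResource<T extends Closeable> {

    /** One generation of the resource plus the number of active users. */
    private final class Entry {
        final T resource;
        int users = 0;
        boolean retired = false;
        Entry(T resource) { this.resource = resource; }
    }

    /** Handle given to a caller; close() it when done reading results. */
    final class Lease implements Closeable {
        private final Entry entry;
        private boolean released = false;
        private Lease(Entry entry) { this.entry = entry; }

        T get() { return entry.resource; }

        @Override
        public void close() throws IOException {
            boolean closeNow;
            synchronized (SharedResource.this) {
                if (released) return;          // closing twice is harmless
                released = true;
                entry.users--;
                closeNow = entry.retired && entry.users == 0;
            }
            if (closeNow) entry.resource.close();
        }
    }

    private Entry current;

    SharedResource(T initial) { current = new Entry(initial); }

    /** Borrow the current resource; it stays open until the lease is closed. */
    synchronized Lease acquire() {
        current.users++;
        return new Lease(current);
    }

    /** Install a new resource; the old one is closed as soon as it is idle. */
    void replace(T fresh) throws IOException {
        Entry old;
        boolean closeNow;
        synchronized (this) {
            old = current;
            old.retired = true;
            closeNow = old.users == 0;
            current = new Entry(fresh);
        }
        if (closeNow) old.resource.close();
    }
}
```

Under these assumptions, a caller would hold the Lease for as long as it
walks the Hits and then close it, while getSearcher() would call replace()
whenever the index version changes.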

Does anyone see anything wrong with the above code or anything I should do
to optimize it?  Suggestions anyone?

Scott

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Google search algorithm

Posted by Dror Matalon <dr...@zapatec.com>.
This is not quite related to Lucene but I found a web page that
has quite a few links about this subject:

http://www.google.com/search?q=google+page+rank&sourceid=mozilla-search&start=0&start=0&ie=utf-8&oe=utf-8

:-).


On Wed, Jan 28, 2004 at 11:10:28PM -0800, Ardor Wei wrote:
> We all know the Lucene algorithm (thanks to open source :).
> Does anybody have a general idea of how the Google search
> algorithm works? How is the page ranking (I don't mean
> the paid ones) determined by Google? I have a strong
> interest in knowing this. Any ideas or feedback will be
> appreciated. Thanks!
> 
> Ardor

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com



Re: Google search algorithm

Posted by Magnus Johansson <ma...@technohuman.com>.
I read somewhere that it used a hidden Markov model.

It checks each page and gives each link a click probability.
It also gives a probability that the user will enter a new
address instead of clicking a link.

We then, by using a hidden Markov model, calculate the
probability that the user will be at a particular page
after an infinite time using random browsing according
to the probabilities found.

This probability is then used as a basis for ranking
results.
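
The random-browsing idea described above can be sketched as a small power
iteration. As far as the published descriptions of PageRank go, this is an
ordinary Markov chain over pages (the surfer's state is the page itself, so
nothing is hidden). The code below is only an illustration under
assumptions, not Google's actual algorithm: every link on a page is taken
to be equally likely, a new address is chosen uniformly among all pages,
0.85 is an assumed, commonly cited value for the link-following
probability, and RandomSurferSketch and rank are hypothetical names.

```java
import java.util.Arrays;

/** Hypothetical sketch of the random-surfer computation described above. */
final class RandomSurferSketch {

    /**
     * @param links  links[i] lists the pages that page i links to
     * @param follow probability that the surfer clicks a link rather than
     *               entering a new address
     * @param rounds number of power-iteration steps to run
     * @return estimated long-run probability of being on each page
     */
    static double[] rank(int[][] links, double follow, int rounds) {
        int n = links.length;
        double[] p = new double[n];
        Arrays.fill(p, 1.0 / n);                // start from a uniform guess

        for (int r = 0; r < rounds; r++) {
            double[] next = new double[n];
            double jump = 0.0;                  // probability spread over all pages
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    jump += p[i];               // dead end: the surfer must jump
                } else {
                    jump += (1.0 - follow) * p[i];
                    double share = follow * p[i] / links[i].length;
                    for (int target : links[i]) {
                        next[target] += share;  // each link equally likely (assumption)
                    }
                }
            }
            for (int i = 0; i < n; i++) {
                next[i] += jump / n;            // uniform choice of a new address
            }
            p = next;
        }
        return p;
    }

    public static void main(String[] args) {
        // Three pages: 0 links to 1 and 2, 1 links to 2, 2 links to 0.
        int[][] links = { {1, 2}, {2}, {0} };
        System.out.println(Arrays.toString(rank(links, 0.85, 50)));
    }
}
```

In this toy graph, page 2 is linked from both other pages and page 1 only
from page 0, so page 2 ends up with the higher long-run probability.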

Magnus Johansson


> We all know the Lucene algorithm (thanks to open source :).
> Does anybody have a general idea of how the Google search
> algorithm works? How is the page ranking (I don't mean
> the paid ones) determined by Google? I have a strong
> interest in knowing this. Any ideas or feedback will be
> appreciated. Thanks!
>
> Ardor


Google search algorithm

Posted by Ardor Wei <ar...@yahoo.com>.
We all know the Lucene algorithm (thanks to open source :).
Does anybody have a general idea of how the Google search
algorithm works? How is the page ranking (I don't mean
the paid ones) determined by Google? I have a strong
interest in knowing this. Any ideas or feedback will be
appreciated. Thanks!

Ardor



Re: The case of the disappearing index files

Posted by Rociel Buico <bu...@yahoo.com>.
I also ran into this. What I did was wrap my searcher in a singleton
object and check whether it is being used by another thread. If it is,
the calling thread is put into a wait state until the other thread
finishes using the searcher.
 
maybe it can help
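
A minimal sketch of that idea, assuming plain Java monitors (wait and
notifyAll) and not taken from the actual code: ExclusiveHolder, acquire
and release are hypothetical names, and the application would keep a single
shared instance wrapping its searcher.

```java
/**
 * Hypothetical sketch: hands one shared resource (for example a searcher)
 * to a single thread at a time; other callers wait until it is released.
 */
final class ExclusiveHolder<T> {
    private final T resource;
    private boolean inUse = false;

    ExclusiveHolder(T resource) { this.resource = resource; }

    /** Blocks while another thread holds the resource, then claims it. */
    synchronized T acquire() throws InterruptedException {
        while (inUse) {
            wait();                 // released by notifyAll() in release()
        }
        inUse = true;
        return resource;
    }

    /** Gives the resource back and wakes up any waiting threads. */
    synchronized void release() {
        inUse = false;
        notifyAll();
    }
}
```

Callers would pair acquire() with release() in a finally block. Note that
this serializes all searches, so it trades throughput for simplicity.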
 
buics



"We shape clay into a pot but it is the emptyness inside that holds whatever we want." Lao Tzu
