Posted to java-user@lucene.apache.org by 장용석 <ne...@gmail.com> on 2009/01/05 09:45:37 UTC

about TopFieldDocs

Hi.. :)

I have a simple question..

I have two code samples.

1) TopDocCollector collector = new TopDocCollector(5 * hitsPerPage);
   QueryParser parser = new QueryParser(fieldName, analyzer);
   Query query = parser.parse("keyword");

   searcher.search(query, collector);
   ScoreDoc[] hits = collector.topDocs().scoreDocs;
   for (int i = 0; i < hits.length; i++) {
       Document doc = searcher.doc(hits[i].doc);
   }

2) Sort sort = new Sort(fieldName, true);
   QueryParser parser = new QueryParser(fieldName, analyzer);
   Query query = parser.parse("keyword");
   TopFieldDocs tfd = searcher.search(query, null, 50, sort);

   ScoreDoc[] hits = tfd.scoreDocs;
   for (int i = 0; i < hits.length; i++) {
       Document doc = searcher.doc(hits[i].doc);
   }

In that case, what is the difference between
ScoreDoc[] hits = collector.topDocs().scoreDocs
and ScoreDoc[] hits = tfd.scoreDocs?


And in case 2), it throws java.lang.OutOfMemoryError: Java heap space.
I did not set any JVM options, and my index size is about 1 GB.
After the search, collector.getTotalHits() is 2585.

I think 2585 is not many documents...

What should I do to fix this problem? Just increase the JVM heap size, or
is there another way?

I need some advice..:)

Sorry for my bad English..

thanks.

Jang.

-- 
DEV용식
http://devyongsik.tistory.com

Re: about TopFieldDocs

Posted by 장용석 <ne...@gmail.com>.
Thanks for your help.

It's really helpful for me.

thanks very much. :-)

-Jang.

-- 
DEV용식
http://devyongsik.tistory.com

Re: about TopFieldDocs

Posted by Mark Miller <ma...@gmail.com>.
Erick Erickson wrote:
> The number of documents
> is irrelevant here, what is relevant is the number of
> distinct terms in your "fieldName" field.
>   
Depending on the size of your index, the number of docs will matter
though. You have to store the unique terms in a String[] array, but you
also store an int[] array the size of maxDoc that indexes into the
unique-terms array. Depending on your index, this can cost as much as,
or more than, the unique terms themselves.

It doesn't matter how many documents a particular search returns,
though - it's just how many docs are in the index that counts.
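
As a rough sketch of the memory involved (all figures hypothetical,
and the per-String overhead is only a ballpark estimate):

    // Sort cache for one field: a String[] of unique terms plus an
    // int[maxDoc] ordinal array pointing into it.
    int maxDoc = 10000000;      // every doc in the index, not just hits
    int uniqueTerms = 2000000;  // distinct values in the sort field
    int avgTermChars = 20;      // average term length

    long ordinalBytes = 4L * maxDoc;                          // int[maxDoc]
    long termBytes = uniqueTerms * (40L + 2L * avgTermChars); // the Strings
    System.out.println("ordinals: " + (ordinalBytes / (1024 * 1024)) + " MB");
    System.out.println("terms:    " + (termBytes / (1024 * 1024)) + " MB");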

- Mark



Re: about TopFieldDocs

Posted by Erick Erickson <er...@gmail.com>.
Mostly, the difference is in the sorting. Your
example (1) ranks by document relevance, whereas
your example (2) sorts by whatever is in fieldName.

Example (2), because it is sorting, will try to
cache all the distinct *terms* in your index for that field,
which is probably where your out-of-memory error is
coming from. The number of documents
is irrelevant here; what is relevant is the number of
distinct terms in your "fieldName" field.
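
For illustration, a minimal sketch of the two orderings (method
signatures as of Lucene 2.4; searcher, query, and fieldName as in
your snippets):

    // (1) Relevance order: no field cache is needed.
    TopDocs byScore = searcher.search(query, null, 50);

    // (2) Field order: the first sorted search on fieldName populates
    //     a cache over every distinct term in that field, index-wide.
    Sort sort = new Sort(fieldName, true);
    TopFieldDocs byField = searcher.search(query, null, 50, sort);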

I'd get a copy of Luke to look at your index and see if it's
what you expect.

And yes, increasing memory should help. How much memory
are you running with when you get the OOM error? Sometimes
the default memory allocation is very small, and most of it is taken
up by the program itself, leaving very little left over for caching.
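
For example, launching with something like

    java -Xmx512m YourSearchApp

raises the maximum heap to 512 MB (YourSearchApp is just a placeholder
here; the client VM's default maximum is often only 64 MB).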

Best
Erick


Calculated terms during a query

Posted by Joe MarkAnthony <mr...@comcast.net>.
Greetings,
      I would like to search for items based on 'calculated' terms.
Specifically, say I am using Lucene to search a collection of tasks, with
fields "start_date" and "end_date", among others.

The question to solve is:
"Find all tasks that took longer than 100 days".

So the easy answer is to create a third field "task_duration", and store
that by subtracting start_date from end_date during indexing.  OK, this
works fine (using NumberTools, and so forth).
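
For example, a minimal sketch of that approach (doc, startMillis, and
endMillis are assumed to be in scope; API as of Lucene 2.4):

    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumberTools;
    import org.apache.lucene.search.ConstantScoreRangeQuery;
    import org.apache.lucene.search.Query;

    // Index time: derive the duration once and index it as its own term.
    long days = (endMillis - startMillis) / (24L * 60 * 60 * 1000);
    doc.add(new Field("task_duration", NumberTools.longToString(days),
                      Field.Store.NO, Field.Index.NOT_ANALYZED));

    // Query time: "took longer than 100 days" becomes a range query
    // over the padded values.
    Query q = new ConstantScoreRangeQuery("task_duration",
            NumberTools.longToString(100), NumberTools.MAX_STRING_VALUE,
            false, true);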

However, this solution doesn't work well when you start adding more fields.
For example, in my scenario, there actually are four fields:
"planned_start_date",
"actual_start_date","planned_end_date","actual_end_date".

Now, to support any such reasonable calculated query ("How many tasks were
more than 10 days late?" or "How many tasks started on time?", etc.), you
have to store five 'calculated' terms for each document. This can get out
of hand.

So, is there a better way to do this in Lucene?
I know some people will say this is an example of where a relational
database is best, and perhaps such concepts do not fit within text
indexing...ok, understood.

But perhaps there is a better way - has any thought gone into this for
Lucene?

Thanks in advance,
J


