Posted to user@lucenenet.apache.org by Trevor Watson <tw...@datassimilate.com> on 2009/10/02 16:39:31 UTC

Alternative to looping through Hits

I am currently attempting to create a comma separated list of IDs from a 
given Hits collection.

However, when we end up processing 6,000 or more hits, it takes 25-30 
seconds per collection.  I've been trying to find a faster way to change 
the search results to the comma separated list.  Do any of you have any 
advice?  Thanks in advance.

Trevor Watson


My current code looks like

Lucene.Net.Search.Searcher search =
    new Lucene.Net.Search.IndexSearcher("c:\\sv_index\\" + jobId.ToString());
Lucene.Net.Search.Hits hits = search.Search(query);

string docIds = "";
totalDocuments = hits.Length();

// Test #1: HitIterator
Lucene.Net.Search.HitIterator hi = (Lucene.Net.Search.HitIterator)hits.Iterator();
while (hi.MoveNext())
    docIds += ((Lucene.Net.Search.Hit)hi.Current).GetDocument().GetField("DocumentId").StringValue() + ", ";

// Test #2: indexed access via Hits.Doc
for (int iCount = 0; iCount < totalDocuments; iCount++)
{
    Lucene.Net.Documents.Document docHit = hits.Doc(iCount);
    docIds += docHit.GetField("DocumentId").StringValue() + ", ";
}

RE: Alternative to looping through Hits

Posted by Franklin Simmons <fs...@sccmediaserver.com>.
You could try using TopFieldDocCollector, TopDocs and an extended FieldSelector.  String.Join is fairly quick I think. This might be overkill though ;-)

...

Lucene.Net.Search.TopFieldDocCollector collector =
    new Lucene.Net.Search.TopFieldDocCollector(reader, Lucene.Net.Search.Sort.RELEVANCE, max_hits);

searcher.Search(query, null, collector);

Lucene.Net.Search.TopDocs top_docs = collector.TopDocs();
string[] values = new string[top_docs.scoreDocs.Length];
MyFieldSelector field_selector = new MyFieldSelector("DocumentId");

for (int i = 0; i < values.Length; i++)
{
    Lucene.Net.Search.ScoreDoc score_document = top_docs.scoreDocs[i];
    Lucene.Net.Documents.Document document = searcher.Doc(score_document.doc, field_selector);
    values[i] = document.GetFieldable("DocumentId").StringValue();
}

string csv = String.Join(", ", values);


...
class MyFieldSelector : Lucene.Net.Documents.FieldSelector
{
    string field_name;

    public MyFieldSelector(string field_name)
    {
        this.field_name = field_name;
    }

    // Load only the field we care about; skip everything else (e.g. large OCR text).
    public Lucene.Net.Documents.FieldSelectorResult Accept(string field_name)
    {
        if (this.field_name == field_name) return Lucene.Net.Documents.FieldSelectorResult.LOAD;
        return Lucene.Net.Documents.FieldSelectorResult.NO_LOAD;
    }
}

-----Original Message-----
From: Trevor Watson [mailto:twatson@datassimilate.com] 
Sent: Friday, October 02, 2009 10:40 AM
To: lucene-net-user@incubator.apache.org
Subject: Alternative to looping through Hits

I am currently attempting to create a comma separated list of IDs from a 
given Hits collection.

However, when we end up processing 6,000 or more hits, it takes 25-30 
seconds per collection.  I've been trying to find a faster way to change 
the search results to the comma separated list.  Do any of you have any 
advice?  Thanks in advance.

Trevor Watson


My current code looks like

Lucene.Net.Search.Searcher search =
    new Lucene.Net.Search.IndexSearcher("c:\\sv_index\\" + jobId.ToString());
Lucene.Net.Search.Hits hits = search.Search(query);

string docIds = "";
totalDocuments = hits.Length();

// Test #1: HitIterator
Lucene.Net.Search.HitIterator hi = (Lucene.Net.Search.HitIterator)hits.Iterator();
while (hi.MoveNext())
    docIds += ((Lucene.Net.Search.Hit)hi.Current).GetDocument().GetField("DocumentId").StringValue() + ", ";

// Test #2: indexed access via Hits.Doc
for (int iCount = 0; iCount < totalDocuments; iCount++)
{
    Lucene.Net.Documents.Document docHit = hits.Doc(iCount);
    docIds += docHit.GetField("DocumentId").StringValue() + ", ";
}

RE: Alternative to looping through Hits

Posted by Moray McConnachie <mm...@oxford-analytica.com>.
String concatenation is usually slow. Try a StringBuilder, or better,
build a List<>, then at the end of the loop use String.Join on
List<>.ToArray(); there is a sketch of this below. I expect this will be
faster. GetDocument() can also be slow; if this is a non-distributed
system, you can speed up a live system by caching recently retrieved
documents.

On the Lucene side you could try a custom HitCollector which stores the
List<> of IDs. This would mean you don't have to iterate the hits, which
would be marginally faster. However, it will be marginal.

Presumably (I haven't tested this) the more Lucene fields you store with
your Document, the longer each GetDocument() takes. Check that you are
only storing the fields you need to store.
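
A minimal sketch of the List<> + String.Join idea, assuming the hits object and "DocumentId" field from the original post:

System.Collections.Generic.List<string> ids = new System.Collections.Generic.List<string>();
for (int i = 0; i < hits.Length(); i++)
{
    // Collect each ID once; a single Join at the end avoids repeated string reallocation.
    ids.Add(hits.Doc(i).GetField("DocumentId").StringValue());
}
string docIds = String.Join(", ", ids.ToArray());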

Yours,

Moray



--------------------------------------
Moray McConnachie
Director of IT
Oxford Analytica

+44 1865 261 600 http://www.oxan.com

> -----Original Message-----
> From: Trevor Watson [mailto:twatson@datassimilate.com]
> Sent: 02 October 2009 15:40
> To: lucene-net-user@incubator.apache.org
> Subject: Alternative to looping through Hits
>
> I am currently attempting to create a comma separated list of
> IDs from a given Hits collection.
>
> However, when we end up processing 6,000 or more hits, it
> takes 25-30 seconds per collection.  I've been trying to find
> a faster way to change the search results to the comma
> separated list.  Do any of you have any advice?  Thanks in advance.
>
> Trevor Watson
>
>
> My current code looks like
>
> Lucene.Net.Search.Searcher search =
>     new Lucene.Net.Search.IndexSearcher("c:\\sv_index\\" + jobId.ToString());
> Lucene.Net.Search.Hits hits = search.Search(query);
>
> string docIds = "";
> totalDocuments = hits.Length();
>
> // Test #1: HitIterator
> Lucene.Net.Search.HitIterator hi = (Lucene.Net.Search.HitIterator)hits.Iterator();
> while (hi.MoveNext())
>     docIds += ((Lucene.Net.Search.Hit)hi.Current).GetDocument().GetField("DocumentId").StringValue() + ", ";
>
> // Test #2: indexed access via Hits.Doc
> for (int iCount = 0; iCount < totalDocuments; iCount++)
> {
>     Lucene.Net.Documents.Document docHit = hits.Doc(iCount);
>     docIds += docHit.GetField("DocumentId").StringValue() + ", ";
> }


RE: indexing special characters

Posted by "Monteiro, Alvaro" <Al...@sage.pt>.
Thank you for your quick reply. Unfortunately I have other higher priority tasks at the moment. As soon as I can, I'll send a test case showing the differences.

Al

-----Original Message-----
From: Digy [mailto:digydigy@gmail.com] 
Sent: Friday, 2 October 2009 19:28
To: lucene-net-user@incubator.apache.org
Subject: RE: indexing special characters

I don't remember any backward compatibility related bug report. I used the
following code to test 2.0 & 2.3.2 and didn't see any difference.

 

            

RAMDirectory dir = new RAMDirectory();

IndexWriter wr = new IndexWriter(dir, new StandardAnalyzer(), true);
Document doc = new Document();
Field f = new Field("field1", "café algodão", Field.Store.YES, Field.Index.TOKENIZED);
doc.Add(f);
wr.AddDocument(doc);
wr.Close();

IndexSearcher sr = new IndexSearcher(dir);
QueryParser qp = new QueryParser("field1", new StandardAnalyzer());
Query q = qp.Parse("algodão");
MessageBox.Show(sr.Search(q).Length().ToString());
sr.Close();

 

 

 

Can you send a simple test case showing the difference between versions?

 

DIGY

 

 

 

-----Original Message-----
From: Monteiro, Alvaro [mailto:Alvaro.Monteiro@sage.pt] 
Sent: Friday, October 02, 2009 6:45 PM
To: lucene-net-user@incubator.apache.org
Subject: indexing special characters

 

I've started using the latest build for lucene.net (2.3). 

It is a lot faster than 2.0.

 

I've noticed something very strange: although the indexing process is the
same, when I use the latest dll and I search for a word with a special
character (like "café", "algodão"), no results are returned.

However, if I change the dll to 2.0 and index the exact same thing, a
search for such a word does return results. No change in the code
whatsoever!

 

Does anyone have any idea?

 

Thank you so much.

 

Alvaro Monteiro


RE: indexing special characters

Posted by "Monteiro, Alvaro" <Al...@sage.pt>.
Problem solved.
Well, there's nothing like simple code :-) The thing is, I was using a custom analyzer with the following code:

class CustomAnalyzer : Lucene.Net.Analysis.Standard.StandardAnalyzer
{
    public static readonly System.String[] PORTUGUESE_STOP_WORDS = new System.String[] { "a", "uma", "e", "são", "como", "onde", "ser", "mas", "por", "se", "em", "dentro", "é", "ela", "ele", "não", "de", "ou", "s", "tais", "t", "que", "seus", "então", "lá", "esses", "eles", "isto", "para", "era", "irá", "com" };

    public CustomAnalyzer() : base(StandardAnalyzer.STOP_WORDS)
    {
    }

    public override TokenStream TokenStream(System.String fieldName, System.IO.TextReader reader)
    {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new ISOLatin1AccentFilter(result);
        result = new StopFilter(result, PORTUGUESE_STOP_WORDS);
        return result;
    }
}


This custom analyzer doesn't work well in Lucene 2.3. If I instead just pass the array of stop words to the constructor of StandardAnalyzer, it works quite well.
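
(A likely cause, not confirmed in this thread: Lucene 2.3 introduced reusable token streams, and a StandardAnalyzer subclass that overrides only TokenStream can have that override bypassed during indexing.)

A minimal sketch of the working setup described above, assuming the stop-word-array constructor of StandardAnalyzer in Lucene.Net 2.3:

// Stop words only; accents are left intact, matching the accented query terms.
Lucene.Net.Analysis.Standard.StandardAnalyzer analyzer =
    new Lucene.Net.Analysis.Standard.StandardAnalyzer(PORTUGUESE_STOP_WORDS);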

Thank you very much for your help Digy.

Al

-----Original Message-----
From: Digy [mailto:digydigy@gmail.com] 
Sent: Friday, 2 October 2009 19:28
To: lucene-net-user@incubator.apache.org
Subject: RE: indexing special characters

I don't remember any backward compatibility related bug report. I used the
following code to test 2.0 & 2.3.2 and didn't see any difference.

 

            

RAMDirectory dir = new RAMDirectory();

IndexWriter wr = new IndexWriter(dir, new StandardAnalyzer(), true);
Document doc = new Document();
Field f = new Field("field1", "café algodão", Field.Store.YES, Field.Index.TOKENIZED);
doc.Add(f);
wr.AddDocument(doc);
wr.Close();

IndexSearcher sr = new IndexSearcher(dir);
QueryParser qp = new QueryParser("field1", new StandardAnalyzer());
Query q = qp.Parse("algodão");
MessageBox.Show(sr.Search(q).Length().ToString());
sr.Close();

 

 

 

Can you send a simple test case showing the difference between versions?

 

DIGY

 

 

 

-----Original Message-----
From: Monteiro, Alvaro [mailto:Alvaro.Monteiro@sage.pt] 
Sent: Friday, October 02, 2009 6:45 PM
To: lucene-net-user@incubator.apache.org
Subject: indexing special characters

 

I've started using the latest build for lucene.net (2.3). 

It is a lot faster than 2.0.

 

I've noticed something very strange: although the indexing process is the
same, when I use the latest dll and I search for a word with a special
character (like "café", "algodão"), no results are returned.

However, if I change the dll to 2.0 and index the exact same thing, a
search for such a word does return results. No change in the code
whatsoever!

 

Does anyone have any idea?

 

Thank you so much.

 

Alvaro Monteiro


RE: indexing special characters

Posted by Digy <di...@gmail.com>.
I don't remember any backward compatibility related bug report. I used the
following code to test 2.0 & 2.3.2 and didn't see any difference.

 

            

RAMDirectory dir = new RAMDirectory();

IndexWriter wr = new IndexWriter(dir, new StandardAnalyzer(), true);
Document doc = new Document();
Field f = new Field("field1", "café algodão", Field.Store.YES, Field.Index.TOKENIZED);
doc.Add(f);
wr.AddDocument(doc);
wr.Close();

IndexSearcher sr = new IndexSearcher(dir);
QueryParser qp = new QueryParser("field1", new StandardAnalyzer());
Query q = qp.Parse("algodão");
MessageBox.Show(sr.Search(q).Length().ToString());
sr.Close();

 

 

 

Can you send a simple test case showing the difference between versions?

 

DIGY

 

 

 

-----Original Message-----
From: Monteiro, Alvaro [mailto:Alvaro.Monteiro@sage.pt] 
Sent: Friday, October 02, 2009 6:45 PM
To: lucene-net-user@incubator.apache.org
Subject: indexing special characters

 

I've started using the latest build for lucene.net (2.3). 

It is a lot faster than 2.0.

 

I've noticed something very strange: although the indexing process is the
same, when I use the latest dll and I search for a word with a special
character (like "café", "algodão"), no results are returned.

However, if I change the dll to 2.0 and index the exact same thing, a
search for such a word does return results. No change in the code
whatsoever!

 

Does anyone have any idea?

 

Thank you so much.

 

Alvaro Monteiro


indexing special characters

Posted by "Monteiro, Alvaro" <Al...@sage.pt>.
I've started using the latest build for lucene.net (2.3). 
It is a lot faster than 2.0.

I've noticed something very strange: although the indexing process is the same, when I use the latest dll and I search for a word with a special character (like "café", "algodão"), no results are returned.
However, if I change the dll to 2.0 and index the exact same thing, a search for such a word does return results. No change in the code whatsoever!

Does anyone have any idea?

Thank you so much.

Alvaro Monteiro

RE: Alternative to looping through Hits

Posted by Digy <di...@gmail.com>.
Use "HitCollector" instead of "Hits". "Hits" re-executes the search when you
need more than 100 hits.

DIGY
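
A minimal sketch of the HitCollector approach, assuming Lucene.Net 2.x's Searcher.Search(Query, HitCollector) overload; it gathers raw document numbers in a single pass:

class DocIdCollector : Lucene.Net.Search.HitCollector
{
    public System.Collections.Generic.List<int> Docs = new System.Collections.Generic.List<int>();

    // Called once per matching document; nothing is re-executed.
    public override void Collect(int doc, float score)
    {
        Docs.Add(doc);
    }
}

// Usage: search.Search(query, new DocIdCollector());
// stored fields can then be loaded per doc number with search.Doc(doc).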

-----Original Message-----
From: Trevor Watson [mailto:twatson@datassimilate.com] 
Sent: Friday, October 02, 2009 5:40 PM
To: lucene-net-user@incubator.apache.org
Subject: Alternative to looping through Hits

I am currently attempting to create a comma separated list of IDs from a 
given Hits collection.

However, when we end up processing 6,000 or more hits, it takes 25-30 
seconds per collection.  I've been trying to find a faster way to change 
the search results to the comma separated list.  Do any of you have any 
advice?  Thanks in advance.

Trevor Watson


My current code looks like

Lucene.Net.Search.Searcher search =
    new Lucene.Net.Search.IndexSearcher("c:\\sv_index\\" + jobId.ToString());
Lucene.Net.Search.Hits hits = search.Search(query);

string docIds = "";
totalDocuments = hits.Length();

// Test #1: HitIterator
Lucene.Net.Search.HitIterator hi = (Lucene.Net.Search.HitIterator)hits.Iterator();
while (hi.MoveNext())
    docIds += ((Lucene.Net.Search.Hit)hi.Current).GetDocument().GetField("DocumentId").StringValue() + ", ";

// Test #2: indexed access via Hits.Doc
for (int iCount = 0; iCount < totalDocuments; iCount++)
{
    Lucene.Net.Documents.Document docHit = hits.Doc(iCount);
    docIds += docHit.GetField("DocumentId").StringValue() + ", ";
}


Re: Alternative to looping through Hits

Posted by Matt Honeycutt <mb...@gmail.com>.
With ~6,000 appends, I would expect StringBuilder to be
*significantly* faster.  Most benchmarks I've seen show that it is
faster to use a StringBuilder than string concatenation once you pass
~30 appends... if it was slower, maybe something else is going on?
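
A minimal sketch of the StringBuilder version, assuming the hits object and field name from the original post:

System.Text.StringBuilder sb = new System.Text.StringBuilder();
for (int i = 0; i < hits.Length(); i++)
{
    if (sb.Length > 0) sb.Append(", ");  // separator between IDs, none after the last
    sb.Append(hits.Doc(i).GetField("DocumentId").StringValue());
}
string docIds = sb.ToString();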

On Fri, Oct 2, 2009 at 10:02 AM, Trevor Watson <tw...@datassimilate.com> wrote:

> I had done StringBuilder.Append for the HitsIterator.  It actually
> increased the time by about 5 seconds.  It might be just computer issue at
> that time, however, it didn't seem to be beneficial time-wise.
>
>
>
>
> Gerald Pape wrote:
>
>> Hi,
>> would start with using StringBuilder instead of string, maybe this gives
>> some performance boost.
>>
>>
>>
>>
>>
>> From:   Trevor Watson <tw...@datassimilate.com>
>> To:     lucene-net-user@incubator.apache.org
>> Date:   02.10.2009 16:42
>> Subject:        Alternative to looping through Hits
>>
>>
>>
>> I am currently attempting to create a comma separated list of IDs from a
>> given Hits collection.
>>
>> However, when we end up processing 6,000 or more hits, it takes 25-30
>> seconds per collection.  I've been trying to find a faster way to change the
>> search results to the comma separated list.  Do any of you have any advice?
>>  Thanks in advance.
>>
>> Trevor Watson
>>
>>
>> My current code looks like
>>
>> Lucene.Net.Search.Searcher search =
>>     new Lucene.Net.Search.IndexSearcher("c:\\sv_index\\" + jobId.ToString());
>> Lucene.Net.Search.Hits hits = search.Search(query);
>>
>> string docIds = "";
>> totalDocuments = hits.Length();
>>
>> // Test #1: HitIterator
>> Lucene.Net.Search.HitIterator hi = (Lucene.Net.Search.HitIterator)hits.Iterator();
>> while (hi.MoveNext())
>>     docIds += ((Lucene.Net.Search.Hit)hi.Current).GetDocument().GetField("DocumentId").StringValue() + ", ";
>>
>> // Test #2: indexed access via Hits.Doc
>> for (int iCount = 0; iCount < totalDocuments; iCount++)
>> {
>>     Lucene.Net.Documents.Document docHit = hits.Doc(iCount);
>>     docIds += docHit.GetField("DocumentId").StringValue() + ", ";
>> }
>>
>>
>>
>>
>>
>>
>>
>>
>
>

Re: Alternative to looping through Hits

Posted by Michael Neel <mi...@gmail.com>.
How big is your index?  Is it possible to use a RAMDirectory and run
the search from memory?  A virus scanner or desktop search indexer
might be getting in the way of the I/O calls reading the disk.

I would also play more with getting rid of all string + string code,
either with StringBuilder or by making a list of strings and then using
String.Join at the end.
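
A minimal sketch of the RAMDirectory idea, assuming the index path from the original post and that the whole index fits in memory:

// Copy the on-disk index into RAM once, then search against memory.
Lucene.Net.Store.Directory fsDir =
    Lucene.Net.Store.FSDirectory.GetDirectory("c:\\sv_index\\" + jobId.ToString(), false);
Lucene.Net.Store.RAMDirectory ramDir = new Lucene.Net.Store.RAMDirectory(fsDir);
Lucene.Net.Search.IndexSearcher memSearcher = new Lucene.Net.Search.IndexSearcher(ramDir);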

On Fri, Oct 2, 2009 at 11:02 AM, Trevor Watson
<tw...@datassimilate.com> wrote:
> I had done StringBuilder.Append for the HitsIterator.  It actually increased
> the time by about 5 seconds.  It might be just computer issue at that time,
> however, it didn't seem to be beneficial time-wise.
>
>
>
> Gerald Pape wrote:
>>
>> Hi,
>> would start with using StringBuilder instead of string, maybe this gives
>> some performance boost.
>>
>>
>>
>>
>>
>> From:   Trevor Watson <tw...@datassimilate.com>
>> To:     lucene-net-user@incubator.apache.org
>> Date:   02.10.2009 16:42
>> Subject:        Alternative to looping through Hits
>>
>>
>>
>> I am currently attempting to create a comma separated list of IDs from a
>> given Hits collection.
>>
>> However, when we end up processing 6,000 or more hits, it takes 25-30
>> seconds per collection.  I've been trying to find a faster way to change the
>> search results to the comma separated list.  Do any of you have any advice?
>>  Thanks in advance.
>>
>> Trevor Watson
>>
>>
>> My current code looks like
>>
>> Lucene.Net.Search.Searcher search =
>>     new Lucene.Net.Search.IndexSearcher("c:\\sv_index\\" + jobId.ToString());
>> Lucene.Net.Search.Hits hits = search.Search(query);
>>
>> string docIds = "";
>> totalDocuments = hits.Length();
>>
>> // Test #1: HitIterator
>> Lucene.Net.Search.HitIterator hi = (Lucene.Net.Search.HitIterator)hits.Iterator();
>> while (hi.MoveNext())
>>     docIds += ((Lucene.Net.Search.Hit)hi.Current).GetDocument().GetField("DocumentId").StringValue() + ", ";
>>
>> // Test #2: indexed access via Hits.Doc
>> for (int iCount = 0; iCount < totalDocuments; iCount++)
>> {
>>     Lucene.Net.Documents.Document docHit = hits.Doc(iCount);
>>     docIds += docHit.GetField("DocumentId").StringValue() + ", ";
>> }
>>
>>
>>
>>
>>
>>
>>
>
>



-- 
Michael C. Neel (@ViNull)
http://www.ViNull.com
Microsoft MVP & ASPInsider

Do you FeelTheFunc(.com)?

Re: Alternative to looping through Hits

Posted by Trevor Watson <tw...@datassimilate.com>.
I had done StringBuilder.Append for the HitIterator.  It actually 
increased the time by about 5 seconds.  It might have just been a computer 
issue at the time; however, it didn't seem to be beneficial time-wise.



Gerald Pape wrote:
> Hi, 
>
> would start with using StringBuilder instead of string, maybe this gives 
> some performance boost.
>
>
>
>
>
> From:   Trevor Watson <tw...@datassimilate.com>
> To:     lucene-net-user@incubator.apache.org
> Date:   02.10.2009 16:42
> Subject:        Alternative to looping through Hits
>
>
>
> I am currently attempting to create a comma separated list of IDs from a 
> given Hits collection.
>
> However, when we end up processing 6,000 or more hits, it takes 25-30 
> seconds per collection.  I've been trying to find a faster way to change 
> the search results to the comma separated list.  Do any of you have any 
> advice?  Thanks in advance.
>
> Trevor Watson
>
>
> My current code looks like
>
> Lucene.Net.Search.Searcher search =
>     new Lucene.Net.Search.IndexSearcher("c:\\sv_index\\" + jobId.ToString());
> Lucene.Net.Search.Hits hits = search.Search(query);
>
> string docIds = "";
> totalDocuments = hits.Length();
>
> // Test #1: HitIterator
> Lucene.Net.Search.HitIterator hi = (Lucene.Net.Search.HitIterator)hits.Iterator();
> while (hi.MoveNext())
>     docIds += ((Lucene.Net.Search.Hit)hi.Current).GetDocument().GetField("DocumentId").StringValue() + ", ";
>
> // Test #2: indexed access via Hits.Doc
> for (int iCount = 0; iCount < totalDocuments; iCount++)
> {
>     Lucene.Net.Documents.Document docHit = hits.Doc(iCount);
>     docIds += docHit.GetField("DocumentId").StringValue() + ", ";
> }
>
>
>
>
>
>
>   


Re: Alternative to looping through Hits

Posted by Gerald Pape <ge...@at.ibm.com>.
Hi,

I would start with using StringBuilder instead of string; maybe this gives
some performance boost.





From:   Trevor Watson <tw...@datassimilate.com>
To:     lucene-net-user@incubator.apache.org
Date:   02.10.2009 16:42
Subject:        Alternative to looping through Hits



I am currently attempting to create a comma separated list of IDs from a 
given Hits collection.

However, when we end up processing 6,000 or more hits, it takes 25-30 
seconds per collection.  I've been trying to find a faster way to change 
the search results to the comma separated list.  Do any of you have any 
advice?  Thanks in advance.

Trevor Watson


My current code looks like

Lucene.Net.Search.Searcher search =
    new Lucene.Net.Search.IndexSearcher("c:\\sv_index\\" + jobId.ToString());
Lucene.Net.Search.Hits hits = search.Search(query);

string docIds = "";
totalDocuments = hits.Length();

// Test #1: HitIterator
Lucene.Net.Search.HitIterator hi = (Lucene.Net.Search.HitIterator)hits.Iterator();
while (hi.MoveNext())
    docIds += ((Lucene.Net.Search.Hit)hi.Current).GetDocument().GetField("DocumentId").StringValue() + ", ";

// Test #2: indexed access via Hits.Doc
for (int iCount = 0; iCount < totalDocuments; iCount++)
{
    Lucene.Net.Documents.Document docHit = hits.Doc(iCount);
    docIds += docHit.GetField("DocumentId").StringValue() + ", ";
}


Re: Alternative to looping through Hits

Posted by Trevor Watson <tw...@datassimilate.com>.
Thank you all for your help on this issue.  As always, greatly appreciated.

I took the Store.YES away from the data fields and just left them 
tokenized.  The looping is now almost instant in all cases.  I guess the 
full OCR of the documents was slowing it down more than I thought it would.

Thank you again.

Trevor
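
In Field terms, the change described comes down to something like this (field and variable names here are illustrative, not from the thread):

// Before, the OCR text was stored as well as tokenized, so every hits.Doc() call
// dragged the full stored text along.  After the change, only the small ID field is stored.
doc.Add(new Lucene.Net.Documents.Field("OCRText", ocrText,
    Lucene.Net.Documents.Field.Store.NO, Lucene.Net.Documents.Field.Index.TOKENIZED));
doc.Add(new Lucene.Net.Documents.Field("DocumentId", documentId,
    Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.UN_TOKENIZED));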

Trevor Watson wrote:
> I am currently attempting to create a comma separated list of IDs from 
> a given Hits collection.
>
> However, when we end up processing 6,000 or more hits, it takes 25-30 
> seconds per collection.  I've been trying to find a faster way to 
> change the search results to the comma separated list.  Do any of you 
> have any advice?  Thanks in advance.
>
> Trevor Watson
>
>
> My current code looks like
>
> Lucene.Net.Search.Searcher search =
>     new Lucene.Net.Search.IndexSearcher("c:\\sv_index\\" + jobId.ToString());
> Lucene.Net.Search.Hits hits = search.Search(query);
>
> string docIds = "";
> totalDocuments = hits.Length();
>
> // Test #1: HitIterator
> Lucene.Net.Search.HitIterator hi = (Lucene.Net.Search.HitIterator)hits.Iterator();
> while (hi.MoveNext())
>     docIds += ((Lucene.Net.Search.Hit)hi.Current).GetDocument().GetField("DocumentId").StringValue() + ", ";
>
> // Test #2: indexed access via Hits.Doc
> for (int iCount = 0; iCount < totalDocuments; iCount++)
> {
>     Lucene.Net.Documents.Document docHit = hits.Doc(iCount);
>     docIds += docHit.GetField("DocumentId").StringValue() + ", ";
> }
>
>


RE: Alternative to looping through Hits

Posted by Franklin Simmons <fs...@sccmediaserver.com>.
Trevor,

Is your index optimized? How many documents are in your index? Is the OCR field stored, and if so, is the "DocumentId" field always the first field in the document? My impression is that this can make a difference, though I can't recall the specific discussions behind that conjecture. You might consider adding finely grained time-span measurements to identify the bottleneck.
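
For example, a minimal Stopwatch harness around the two phases (object and field names as in the original post):

System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
Lucene.Net.Search.Hits hits = search.Search(query);
Console.WriteLine("search: {0} ms", sw.ElapsedMilliseconds);

sw.Reset();
sw.Start();
for (int i = 0; i < hits.Length(); i++)
    hits.Doc(i).GetField("DocumentId").StringValue();  // field retrieval only
Console.WriteLine("retrieval: {0} ms", sw.ElapsedMilliseconds);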

-----Original Message-----
From: Trevor Watson [mailto:twatson@datassimilate.com] 
Sent: Friday, October 02, 2009 2:23 PM
To: lucene-net-user@incubator.apache.org
Subject: Re: Alternative to looping through Hits

Thank you so far for the help with this.  I've been trying the different 
suggestions that you all posted here. 

The Lucene index contains a numeric ID (the value I want), 4 text fields 
with simple data (e.g. Form, Publication, Email, or people's names), and 
1 text field with the OCR of the image the record references (very large 
in some cases).  The data is currently stored for the text fields (for 
testing); hopefully making them tokenized-only, without storing the actual 
text, will speed things up.

The following is a list of the times the loops are taking.

Any advice on speeding any of them up?

Thanks in advance.

End Fieldable: 22 seconds
--------------start code--------------
Lucene.Net.Search.TopFieldDocCollector collector =
    new Lucene.Net.Search.TopFieldDocCollector(reader, Lucene.Net.Search.Sort.RELEVANCE, 100000);
search.Search(query, null, collector);
Lucene.Net.Search.TopDocs topDocs = collector.TopDocs();
string[] values = new string[topDocs.scoreDocs.Length];
LuceneUtilities.MyFieldSelector field_selector = new LuceneUtilities.MyFieldSelector("DocumentId");
for (int i = 0; i < values.Length; i++)
{
    Lucene.Net.Search.ScoreDoc score_document = topDocs.scoreDocs[i];
    Lucene.Net.Documents.Document document = search.Doc(score_document.doc, field_selector);
    values[i] = document.GetFieldable("DocumentId").StringValue();
}

string csv = String.Join(", ", values);
--------------end code--------------


End TopDocs, plus string: 30 seconds
--------------start code--------------
string docIds = "";
totalDocuments = hits.Length();

dtStart = DateTime.Now;
docIds = "";
// Lucene.Net.Search.TopDocs topDocs = search.Search(query, null, 100000);
topDocs = search.Search(query, null, 100000);

foreach (Lucene.Net.Search.ScoreDoc sd in topDocs.scoreDocs)
{
    Lucene.Net.Documents.Document docTest = search.Doc(sd.doc);
    docIds += docTest.GetField("DocumentId").StringValue() + ", ";
}
dtCurrent = DateTime.Now;
--------------end code--------------

End HitIterator (string array): 29 seconds
--------------start code--------------
Lucene.Net.Search.HitIterator hi = (Lucene.Net.Search.HitIterator)hits.Iterator();
string[] sTest1 = new string[hits.Length()];
int iCount1 = 0;

dtStart = DateTime.Now;
while (hi.MoveNext())
{
    sTest1[iCount1] = ((Lucene.Net.Search.Hit)hi.Current).GetDocument().GetField("DocumentId").StringValue();
    iCount1++;
    // docIds += ((Lucene.Net.Search.Hit)hi.Current).GetDocument().GetField("DocumentId").StringValue() + ", ";
}
--------------end code--------------

End HitIterator (arrayList): 30 seconds
--------------start code--------------
hi = (Lucene.Net.Search.HitIterator)hits.Iterator();
StringBuilder sb = new StringBuilder();
ArrayList alTest = new ArrayList();

dtStart = DateTime.Now;
while (hi.MoveNext())
    alTest.Add(((Lucene.Net.Search.Hit)hi.Current).GetDocument().GetField("DocumentId").StringValue());
--------------end code--------------

End Hits (array): 30 seconds
--------------start code--------------
string[] sFinalDocs = new string[totalDocuments];
for (int iCount = 0; iCount < totalDocuments; iCount++)
{
    Lucene.Net.Documents.Document docHit = hits.Doc(iCount);
    // docIds += docHit.GetField("DocumentId").StringValue() + ", ";
    sFinalDocs[iCount] = docHit.GetField("DocumentId").StringValue();
}
docIds = String.Join(", ", sFinalDocs);
--------------end code--------------


Re: Alternative to looping through Hits

Posted by Trevor Watson <tw...@datassimilate.com>.
Thank you so far for the help with this.  I've been trying the different 
suggestions that you all posted here. 

The Lucene index contains a numeric ID (the value I want), 4 text fields 
with simple data (e.g. Form, Publication, Email, or people's names), and 
1 text field with the OCR of the image the record references (very large 
in some cases).  The data is currently stored for the text fields (for 
testing); hopefully making them tokenized-only, without storing the actual 
text, will speed things up.

The following is a list of the times the loops are taking.

Any advice on speeding any of them up?

Thanks in advance.

End Fieldable: 22 seconds
--------------start code--------------
Lucene.Net.Search.TopFieldDocCollector collector =
    new Lucene.Net.Search.TopFieldDocCollector(reader, Lucene.Net.Search.Sort.RELEVANCE, 100000);
search.Search(query, null, collector);
Lucene.Net.Search.TopDocs topDocs = collector.TopDocs();
string[] values = new string[topDocs.scoreDocs.Length];
LuceneUtilities.MyFieldSelector field_selector = new LuceneUtilities.MyFieldSelector("DocumentId");
for (int i = 0; i < values.Length; i++)
{
    Lucene.Net.Search.ScoreDoc score_document = topDocs.scoreDocs[i];
    Lucene.Net.Documents.Document document = search.Doc(score_document.doc, field_selector);
    values[i] = document.GetFieldable("DocumentId").StringValue();
}

string csv = String.Join(", ", values);
--------------end code--------------


End TopDocs, plus string: 30 seconds
--------------start code--------------
string docIds = "";
totalDocuments = hits.Length();

dtStart = DateTime.Now;
docIds = "";
// Lucene.Net.Search.TopDocs topDocs = search.Search(query, null, 100000);
topDocs = search.Search(query, null, 100000);

foreach (Lucene.Net.Search.ScoreDoc sd in topDocs.scoreDocs)
{
    Lucene.Net.Documents.Document docTest = search.Doc(sd.doc);
    docIds += docTest.GetField("DocumentId").StringValue() + ", ";
}
dtCurrent = DateTime.Now;
--------------end code--------------

End HitIterator (string array): 29 seconds
--------------start code--------------
Lucene.Net.Search.HitIterator hi = (Lucene.Net.Search.HitIterator)hits.Iterator();
string[] sTest1 = new string[hits.Length()];
int iCount1 = 0;

dtStart = DateTime.Now;
while (hi.MoveNext())
{
    sTest1[iCount1] = ((Lucene.Net.Search.Hit)hi.Current).GetDocument().GetField("DocumentId").StringValue();
    iCount1++;
    // docIds += ((Lucene.Net.Search.Hit)hi.Current).GetDocument().GetField("DocumentId").StringValue() + ", ";
}
--------------end code--------------

End HitIterator (arrayList): 30 seconds
--------------start code--------------
hi = (Lucene.Net.Search.HitIterator)hits.Iterator();
StringBuilder sb = new StringBuilder();
ArrayList alTest = new ArrayList();

dtStart = DateTime.Now;
while (hi.MoveNext())
    alTest.Add(((Lucene.Net.Search.Hit)hi.Current).GetDocument().GetField("DocumentId").StringValue());
--------------end code--------------

End Hits (array): 30 seconds
--------------start code--------------
string[] sFinalDocs = new string[totalDocuments];
for (int iCount = 0; iCount < totalDocuments; iCount++)
{
    Lucene.Net.Documents.Document docHit = hits.Doc(iCount);
    // docIds += docHit.GetField("DocumentId").StringValue() + ", ";
    sFinalDocs[iCount] = docHit.GetField("DocumentId").StringValue();
}
docIds = String.Join(", ", sFinalDocs);
--------------end code--------------