Posted to user@lucenenet.apache.org by Trevor Watson <tw...@datassimilate.com> on 2011/06/21 20:52:25 UTC

[Lucene.Net] MultiSearcher & duplicate IDs

Sorry to keep posting questions like this.  With so many tasks on the go, I 
still haven't had the time to sit down and research Lucene fully.  One 
of these days!

Our development of a piece of software I'm working on has hit a small snag.

We currently have the following layout for our back-end data for our 
projects.

Project Folder\database
Project Folder\Content index
Project Folder\Sub Folder\File Info index

The database is just for quick reads, file counts and other 
information.  In the database we have a file table with the following layout:

FileId
FileName
FilePath
<.... additional info and flags>

The content index contains

<FileId>
<contents of a file, indexed but not stored>

The File Info index contains
<FileId>
<FileName>
<FilePath>
<File MetaData>
<... other file related data>

The reason we have 2 indexes (is that the proper plural?) is that we 
don't store the contents field, so if we want to change info regarding 
the file in the index, we couldn't re-create the row from the existing 
data.  We'd have to re-extract the information from the file (which can 
be very time consuming), but we can easily re-create the FileInfo index row.

We thought that using a MultiSearcher was the best way to do a combined 
search between the FileInfo index and the contents index.  It worked 
like a charm, until we started searching across both indexes.

When we use a MultiSearcher and search for, for example, 
"FileName:test.txt AND contents:eml", we end up with a Hits object with 
duplicate entries in FileId.  This is because the hit from the FileInfo 
index and the hit from the Contents index are both returned.  So 1 file 
gives 2 entries, 1 for each index.

Is there a way around this without looping through the entire Hits 
collection and making my own collection of IDs?
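
[Editor's sketch] One reading of the symptom is that each document only 
carries the fields of its own index, so a file that matches in both indexes 
comes back once per index.  Under that assumption, a possible workaround is 
to run each clause against its own index, read the FileId values out of the 
two result lists, and keep only the ids present in both.  The sketch below 
shows just that last step on plain strings; FileIdJoin and InBoth are 
hypothetical names, nothing here is Lucene.Net API, and reading the ids out 
of the hits is not shown.

```csharp
using System;
using System.Collections.Generic;

static class FileIdJoin
{
    // Returns the ids from fileInfoIds, in their original order,
    // that also occur in contentIds. Each id is emitted at most once.
    public static List<string> InBoth(IEnumerable<string> fileInfoIds,
                                      IEnumerable<string> contentIds)
    {
        var contentSet = new HashSet<string>(contentIds);
        var emitted = new HashSet<string>();
        var result = new List<string>();
        foreach (string id in fileInfoIds)
        {
            // Keep the id only if the content index matched it too
            // and it has not been emitted already.
            if (contentSet.Contains(id) && emitted.Add(id))
            {
                result.Add(id);
            }
        }
        return result;
    }
}
```

This keeps the ranking of the first list and ignores the scores from the 
second, which may or may not be what is wanted.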

Thanks in advance.

Trevor Watson

Re: [Lucene.Net] MultiSearcher & duplicate IDs

Posted by Trevor Watson <tw...@datassimilate.com>.

Wow, this is awesome, there's so much for me to learn yet.

The only problem I'm having with this is the line that reads

string[] temp = new string[(int)(terms.Length * 1.2)];

I'm not sure where the 1.2 comes from, and running the code as it stands 
frequently resulted in an index-out-of-range error.

However, I changed it to read

string[] temp = new string[(int)(i + 1)];


Which probably isn't correct, but seems to work.  I might have to see 
about modifying the software to go back to using a single index again.
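
[Editor's sketch] The out-of-range error is consistent with the growth 
step: (int)(terms.Length * 1.2) is not guaranteed to reach i + 1 (for a 
4-element array it is still 4), so terms[i] can land past the end.  Sizing 
to i + 1 does make index i valid.  A minimal sketch of a middle ground, 
growing by about 20% but never to less than position + 1, is below. 
TermSlots and Put are hypothetical names used only for illustration; the 
same logic would sit inside Map in the TVM class.

```csharp
using System;

static class TermSlots
{
    // Stores term at the given position, growing the array first
    // if the position is past the current end. Returns the array,
    // which may be a new, larger instance.
    public static string[] Put(string[] terms, int position, string term)
    {
        if (position >= terms.Length)
        {
            // Grow by roughly 20%, but always far enough that
            // terms[position] is a valid slot.
            int grown = (int)(terms.Length * 1.2);
            Array.Resize(ref terms, Math.Max(position + 1, grown));
        }
        terms[position] = term;
        return terms;
    }
}
```

Slots for positions that never receive a term stay null, which the 
ToString in the quoted code would render as extra spaces.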

Thanks again DIGY.

Trevor


On 06/21/2011 4:16 PM, Digy wrote:
>> The reason we have 2 indexes (is that the proper plural?) is that we don't
>> store the contents field,
>> so if we want to change info regarding the file in the index, we couldn't
>> re-create the row from the existing data.
>
> If this is the real problem, you can construct a document roughly equal to
> the original one (assuming you use "Field.TermVector.WITH_POSITIONS").
>
> DIGY
>
> class TVM : TermVectorMapper
> {
>     string[] terms;
>     string text = null;
>
>     public override void SetExpectations(string field, int numTerms, bool storeOffsets, bool storePositions)
>     {
>         terms = new string[numTerms];
>     }
>
>     public override void Map(string term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions)
>     {
>         foreach (int i in positions)
>         {
>             if (terms.Length < i + 1)
>             {
>                 string[] temp = new string[(int)(terms.Length * 1.2)];
>                 terms.CopyTo(temp, 0);
>                 terms = temp;
>             }
>             terms[i] = term;
>         }
>     }
>
>     public override string ToString()
>     {
>         if (text == null)
>         {
>             StringBuilder sb = new StringBuilder();
>             foreach (string s in terms) sb.Append(s + " ");
>             text = sb.ToString();
>         }
>         return text;
>     }
> }
>
> ........
>
> TVM tvm = new TVM();
> reader.GetTermFreqVector(0, "text", tvm);
> string doc = tvm.ToString();

RE: [Lucene.Net] MultiSearcher & duplicate IDs

Posted by Digy <di...@gmail.com>.

> The reason we have 2 indexes (is that the proper plural?) is that we don't
> store the contents field,
> so if we want to change info regarding the file in the index, we couldn't
> re-create the row from the existing data.

If this is the real problem, you can construct a document roughly equal to
the original one (assuming you use "Field.TermVector.WITH_POSITIONS").

DIGY

class TVM : TermVectorMapper
{
    string[] terms;
    string text = null;

    public override void SetExpectations(string field, int numTerms, bool storeOffsets, bool storePositions)
    {
        terms = new string[numTerms];
    }

    public override void Map(string term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions)
    {
        foreach (int i in positions)
        {
            if (terms.Length < i + 1)
            {
                string[] temp = new string[(int)(terms.Length * 1.2)];
                terms.CopyTo(temp, 0);
                terms = temp;
            }
            terms[i] = term;
        }
    }

    public override string ToString()
    {
        if (text == null)
        {
            StringBuilder sb = new StringBuilder();
            foreach (string s in terms) sb.Append(s + " ");
            text = sb.ToString();
        }
        return text;
    }
}

........

TVM tvm = new TVM();
reader.GetTermFreqVector(0, "text", tvm);
string doc = tvm.ToString();


RE: [Lucene.Net] MultiSearcher & duplicate IDs

Posted by Franklin Simmons <fs...@sccmediaserver.com>.

Since you say it is possible to weed out redundant hits as a post-process, you should be able to solve the problem with a custom Lucene.Net.Search.Filter.
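
[Editor's sketch] For comparison, the post-processing step referred to 
above could look roughly like this, assuming the FileId values have 
already been read out of the hits in ranked order.  HitDedup and 
DistinctFileIds are hypothetical names; this is plain C# over strings and 
does not show the custom Filter itself.

```csharp
using System;
using System.Collections.Generic;

static class HitDedup
{
    // Returns the ids in their original (ranked) order,
    // keeping only the first occurrence of each id.
    public static List<string> DistinctFileIds(IEnumerable<string> rankedFileIds)
    {
        var seen = new HashSet<string>();
        var result = new List<string>();
        foreach (string id in rankedFileIds)
        {
            // HashSet.Add returns false when the id was already seen.
            if (seen.Add(id))
            {
                result.Add(id);
            }
        }
        return result;
    }
}
```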
