Posted to user@lucenenet.apache.org by Trevor Watson <tw...@datassimilate.com> on 2011/06/21 20:52:25 UTC
[Lucene.Net] MultiSearcher & duplicate IDs
Sorry to keep posting questions like this. With so many tasks on the go, I
still haven't had the time to sit down and research Lucene fully. One
of these days!
Our development of a piece of software I'm working on has hit a small snag.
We currently have the following layout for our back-end data for our
projects.
Project Folder\database
Project Folder\Content index
Project Folder\Sub Folder\File Info index
The database is just for quick reads, file counts, and other
information. In the database we have a file table with the following layout:
FileId
FileName
FilePath
<.... additional info and flags>
The content index contains
<FileId>
<contents of a file, indexed but not stored>
The File Info index contains
<FileId>
<FileName>
<FilePath>
<File MetaData>
<... other file related data>
The reason we have 2 indexes (is that the proper plural?) is that we
don't store the contents field, so if we want to change info regarding
the file in the index, we couldn't re-create the row from the existing
data. We'd have to re-extract the information from the file (which can
be very time consuming), but we can easily re-create the FileInfo index row.
We thought that using a MultiSearcher was the best way to do a combined
search between the FileInfo index and the contents index. It worked
like a charm, until we started searching across both indexes.
When we use a MultiSearcher and search for, say,
"FileName:test.txt AND contents:eml", we end up with a Hits object with
duplicate entries in FileId. This is because the hit from the FileInfo
index and the hit from the Contents index both are returned. So 1 file
gives 2 entries, 1 for each index.
Is there a way around this without looping through the entire Hits
collection and making my own collection of IDs?
Thanks in advance.
Trevor Watson
Re: [Lucene.Net] MultiSearcher & duplicate IDs
Posted by Trevor Watson <tw...@datassimilate.com>.
Wow, this is awesome, there's so much for me to learn yet.
The only problem I'm having with this is the line that reads
string[] temp = new string[(int)(terms.Length * 1.2)];
I'm not sure where the 1.2 comes from, and running the code as it stands
frequently gave me an index-out-of-range error.
However, I changed it to read
string[] temp = new string[(int)(i + 1)];
Which probably isn't correct, but seems to work. I might have to see
about modifying the software to go back to using a single index again.
Thanks again DIGY.
Trevor
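On the out-of-range error: a likely cause, assuming numTerms in SetExpectations is the number of distinct terms while positions run over every token in the field, is that a position can exceed the array length by more than 20%. In that case (int)(terms.Length * 1.2) is still smaller than i + 1 (and for lengths below 5 the cast does not grow the array at all), so terms[i] lands outside the new array. Below is a minimal sketch of a growth step that always covers the requested position. TermArray and Put are hypothetical names used for illustration, not part of Lucene.Net.

```csharp
using System;

static class TermArray
{
    // Stores term at the given position, growing the array first if needed.
    // Hypothetical helper for illustration; not a Lucene.Net API.
    public static void Put(ref string[] terms, int position, string term)
    {
        if (position >= terms.Length)
        {
            // Keep roughly 20% headroom, but never end up shorter than position + 1.
            int grown = (int)(terms.Length * 1.2) + 1;
            Array.Resize(ref terms, Math.Max(grown, position + 1));
        }
        terms[position] = term;
    }
}
```

Inside Map, the body of the foreach loop would then reduce to a single call such as TermArray.Put(ref terms, i, term). The i + 1 replacement works for the same reason, but it reallocates every time a new highest position arrives; the headroom only cuts down on the number of copies.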
On 06/21/2011 4:16 PM, Digy wrote:
>> The reason we have 2 indexes (is that the proper plural?) is that we don't
> store the contents field,
>> so if we want to change info regarding the file in the index, we couldn't
> re-create the row from the existing data.
>
> If this is the real problem, you can construct a document roughly equal to
> the original one(assuming you use "Field.TermVector.WITH_POSITIONS").
>
> DIGY
>
> class TVM : TermVectorMapper
> {
> string[] terms;
> string text = null;
>
> public override void SetExpectations(string field, int numTerms,bool storeOffsets, bool storePositions)
> {
> terms = new string[numTerms];
> }
>
> public override void Map(string term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions)
> {
>
> foreach(int i in positions)
> {
> if (terms.Length< i + 1)
> {
> string[] temp = new string[(int)(terms.Length * 1.2)];
> terms.CopyTo(temp, 0);
> terms = temp;
> }
> terms[i]=term;
> }
> }
>
> public override string ToString()
> {
> if(text==null)
> {
> StringBuilder sb = new StringBuilder();
> foreach(string s in terms) sb.Append(s + " ");
> text = sb.ToString();
> }
> return text;
> }
> }
>
> ........
>
> TVM tvm = new TVM();
> reader.GetTermFreqVector(0, "text", tvm);
> string doc = tvm.ToString();
RE: [Lucene.Net] MultiSearcher & duplicate IDs
Posted by Digy <di...@gmail.com>.
> The reason we have 2 indexes (is that the proper plural?) is that we don't
> store the contents field,
> so if we want to change info regarding the file in the index, we couldn't
> re-create the row from the existing data.
If this is the real problem, you can construct a document roughly equal to
the original one (assuming you use "Field.TermVector.WITH_POSITIONS").
DIGY
class TVM : TermVectorMapper
{
    string[] terms;
    string text = null;

    public override void SetExpectations(string field, int numTerms, bool storeOffsets, bool storePositions)
    {
        terms = new string[numTerms];
    }

    public override void Map(string term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions)
    {
        foreach (int i in positions)
        {
            if (terms.Length < i + 1)
            {
                string[] temp = new string[(int)(terms.Length * 1.2)];
                terms.CopyTo(temp, 0);
                terms = temp;
            }
            terms[i] = term;
        }
    }

    public override string ToString()
    {
        if (text == null)
        {
            StringBuilder sb = new StringBuilder();
            foreach (string s in terms) sb.Append(s + " ");
            text = sb.ToString();
        }
        return text;
    }
}
........
TVM tvm = new TVM();
reader.GetTermFreqVector(0, "text", tvm);
string doc = tvm.ToString();
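To show why the result is only "roughly equal" to the original, here is a small sketch of the same idea with Lucene.Net taken out: each term is placed at every position it occupied, and the slots are joined in order. PositionalRebuild and Rebuild are hypothetical names, and the dictionary stands in for the (term, positions) pairs that Map receives. Tokens the analyzer dropped or altered (stop words, case, punctuation) cannot be recovered this way.

```csharp
using System;
using System.Collections.Generic;

static class PositionalRebuild
{
    // Rebuilds an approximate text from a term -> positions map.
    // Hypothetical stand-in for the data a TermVectorMapper is handed.
    public static string Rebuild(IDictionary<string, int[]> termPositions)
    {
        // A sorted map keyed by position avoids sizing an array up front.
        var byPosition = new SortedDictionary<int, string>();
        foreach (KeyValuePair<string, int[]> entry in termPositions)
        {
            foreach (int position in entry.Value)
            {
                byPosition[position] = entry.Key;
            }
        }
        // Positions with no stored term (for example removed stop words) are simply skipped.
        return string.Join(" ", byPosition.Values);
    }
}
```

For a field whose analyzed tokens were "quick" at position 1 and "fox" at positions 2 and 4, Rebuild returns "quick fox fox"; whatever originally sat at positions 0 and 3 is lost.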
RE: [Lucene.Net] MultiSearcher & duplicate IDs
Posted by Franklin Simmons <fs...@sccmediaserver.com>.
Since you say it is possible to weed out redundant hits as a post-process, you should be able to solve the problem with a custom Lucene.Net.Search.Filter.
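For reference, the post-processing pass mentioned above can be sketched in a few lines. It uses no Lucene.Net API: it assumes the FileId values have already been read from the hits in rank order, and HitDeduper and DistinctFileIds are hypothetical names. A custom Filter would need to encode the same keep-the-first-occurrence rule.

```csharp
using System;
using System.Collections.Generic;

static class HitDeduper
{
    // Returns each FileId once, keeping the order of its first (best-ranked) occurrence.
    // Hypothetical helper; the input is assumed to be FileIds in hit order.
    public static List<string> DistinctFileIds(IEnumerable<string> fileIdsInHitOrder)
    {
        var seen = new HashSet<string>();
        var unique = new List<string>();
        foreach (string id in fileIdsInHitOrder)
        {
            // HashSet.Add returns false when the id was already present.
            if (seen.Add(id))
            {
                unique.Add(id);
            }
        }
        return unique;
    }
}
```

The pass is linear in the number of hits and needs only one set of IDs in memory, so it is usually cheap compared with the search itself.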