Posted to java-user@lucene.apache.org by Dmitry Serebrennikov <dm...@earthlink.net> on 2002/05/01 00:41:56 UTC

Re: Homogeneous vs Heterogeneous indexes (was: FileNotFoundException)

Just a couple of clarification points:
- the number of files that Lucene uses depends on the number of segments 
in the index and the number of *stored* fields
- if your fields are not stored but only indexed, they do not require 
separate files. Otherwise, an .fnn file is created for each field.
- if at least one document uses a given field name in an index, that 
index requires the .fnn file for that field
- index segments are created when documents are added to the index. For 
each 10 docs you get a new segment.
- optimizing the index removes all segments and replaces them with one 
new segment that contains all of the documents
- optimization is done periodically as more documents are added 
(controlled by IndexWriter.mergeFactor), but can be done manually 
whenever needed
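
A rough way to see how the segment count behaves under the points above is to model it. The sketch below is a deliberately simplified model and is not Lucene's actual merge code: it assumes a new segment appears for every 10 documents (the figure given above) and that whenever mergeFactor segments of the same size sit next to each other they are merged into one. The class and method names (SegmentModel, segmentsAfter, topRunLength) are made up for illustration, and documents still waiting in an incomplete batch are ignored.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model of segment growth; NOT Lucene's real merge policy.
class SegmentModel {

    // Number of segments left after adding numDocs documents, assuming a new
    // segment per docsPerSegment documents and a merge whenever mergeFactor
    // equally sized segments are adjacent at the newest end.
    static int segmentsAfter(int numDocs, int docsPerSegment, int mergeFactor) {
        if (docsPerSegment < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException("need docsPerSegment >= 1 and mergeFactor >= 2");
        }
        Deque<Integer> segments = new ArrayDeque<>(); // segment sizes, newest first
        for (int added = docsPerSegment; added <= numDocs; added += docsPerSegment) {
            segments.push(docsPerSegment);
            // Cascade: merging may create another run of equal-sized segments.
            while (topRunLength(segments) >= mergeFactor) {
                int merged = 0;
                for (int i = 0; i < mergeFactor; i++) {
                    merged += segments.pop();
                }
                segments.push(merged);
            }
        }
        return segments.size();
    }

    // Length of the run of equally sized segments at the newest end.
    static int topRunLength(Deque<Integer> segments) {
        if (segments.isEmpty()) {
            return 0;
        }
        int topSize = segments.peek();
        int run = 0;
        for (int size : segments) { // iterates newest to oldest
            if (size != topSize) {
                break;
            }
            run++;
        }
        return run;
    }

    public static void main(String[] args) {
        System.out.println(segmentsAfter(990, 10, 10));
        System.out.println(segmentsAfter(990, 10, 2));
    }
}
```

In this model, 990 documents leave 18 segments with a merge factor of 10 and 4 segments with a merge factor of 2, so a smaller factor keeps fewer segments (and therefore fewer files) around at the cost of merging more often. Real indexes may behave differently.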

With all this, I think Lucene does use too many files...
Some additional info: there is a field on IndexWriter called infoStream. 
If this is set to a PrintStream (such as System.out), various diagnostic 
messages about the merging process will be printed to that stream. You 
might find this helpful in tuning the merge parameters.

Hope this helps.
Good luck.

Dmitry.




--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: Homogeneous vs Heterogeneous indexes (was: FileNotFoundException)

Posted by petite_abeille <pe...@mac.com>.
On Wednesday, May 1, 2002, at 12:41 AM, Dmitry Serebrennikov wrote:

> - the number of files that Lucene uses depends on the number of 
> segments in the index and the number of *stored* fields
> - if your fields are not stored but only indexed, they do not require 
> separate files. Otherwise, an .fnn file is created for each field.

Ok. That's good as all my fields are indexed but not stored in Lucene. 
Only one field is stored in any one index: the uuid of an object (as a 
Keyword).

> - if at least one document uses a given field name in an index, that 
> index requires the .fnn file for that field

Ok. So, in theory, a more homogeneous index should use fewer files, all 
things being equal?
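
To make the comparison concrete: by the rule quoted above, the number of per-field files in an index follows the number of distinct field names used by any document in it. The sketch below only counts distinct field names per index. It does not touch Lucene, and the field names other than "uuid" (as well as the class name FieldFileEstimate) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Counts distinct field names in an index, as a proxy for per-field files.
class FieldFileEstimate {

    // Each document is represented only by the set of field names it uses.
    static int distinctFields(List<Set<String>> docs) {
        Set<String> names = new HashSet<>();
        for (Set<String> doc : docs) {
            names.addAll(doc);
        }
        return names.size();
    }

    static Set<String> fields(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        // Hypothetical document shapes; only "uuid" comes from the thread.
        Set<String> article = fields("uuid", "title", "body");
        Set<String> person = fields("uuid", "author", "job");

        // One heterogeneous index holding both kinds of documents.
        System.out.println(distinctFields(Arrays.asList(article, person)));
        // A homogeneous index holding only one kind.
        System.out.println(distinctFields(Arrays.asList(article, article)));
    }
}
```

With these made-up shapes the mixed index has 5 distinct field names and the homogeneous one has 3. Note that splitting into two homogeneous indexes gives 3 + 3 names in total, so the saving is per index, not necessarily overall.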

> - index segments are created when documents are added to the index. For 
> each 10 docs you get a new segment.
> - optimizing the index removes all segments and replaces them with one 
> new segment that contains all of the documents
> - optimization is done periodically as more documents are added 
> (controlled by IndexWriter.mergeFactor), but can be done manually 
> whenever needed

Ok. When doing the optimization, are there any temporary files getting 
created?

> With all this, I think Lucene does use too many files...

That's my impression also...

> Some additional info: there is a field on IndexWriter called 
> infoStream. If this is set to a PrintStream (such as System.out), 
> various diagnostic messages about the merging process will be printed 
> to that stream.

Yep. I guess I overlooked that.

> You might find this helpful in tuning the merge parameters.

Just to make sure: will using a small merge factor (e.g. 2) reduce the 
number of files, or just optimize (aka merge) the index more often?

> Hope this helps.
> Good luck.

Thanks. Very helpful indeed :-)

R.





best practice for indexing multiple equiv fieldnames

Posted by Landon Cox <lc...@interactive-media.com>.
I'm planning to use Lucene to index scads of XML files whose data model
includes replicated blocks of tags.  Translation: a novice question follows.

My files have a common XML pattern (for illustrative purposes):

<blocks>
   <block id="123">some text 1</block>
   <block id="456">some text 2</block>
   <block id="789">some text 3</block>
</blocks>

Each block has a unique id, but the tagname is identical.  The actual data
model has nested tags within these blocks - ie: metadata with the same
tagnames within each block.  So, in the real data model, there are multiple
identical tagnames that are associated with a specific parent.  Something
more like this:

<blocks>
   <block id="123">
      <author>Joe Blow</author>
      <job>hack</job>
   </block>
   <block id="456">
      <author>Jane Doe</author>
      <job>President</job>
   </block>
</blocks>

In the latter case, I need to be able to search by author or job, for example,
and get the tag's text contents as well as the parent block id.

Adding a field name of "block" or "author" or "job" multiple times to the
same Lucene Document, according to the Lucene javadoc, has the effect of
appending the text for search purposes.  I take that to mean, in order to
use a 'hit' I would need to somehow uniquely identify the field from which
the content came even though the content was appended for search purposes.

If I searched an 'author' field name and got a hit, I would not be able to
disambiguate which block id the actual hit belonged to.  Or if I searched on
"job", how would I know that a hit belonged to the parent block with id 456
rather than the one with id 123?

What is the Lucene approach for indexing a single document that has the same
field name appearing in multiple places and then using the hit to find the
exact association of block id in the above example?
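
The thread itself does not answer this. One possible approach, offered only as a sketch, is to treat each block rather than each file as the unit to index, so that every hit carries its own block id. The code below shows just the flattening step with the JDK's DOM parser and no Lucene calls. The class name BlockFlattener and the record key "blockId" are hypothetical; each resulting record would then presumably become one Lucene Document, with the block id kept as a stored field.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Turns each <block> element into its own flat record (hypothetical helper).
class BlockFlattener {

    static List<Map<String, String>> flatten(String xml) {
        org.w3c.dom.Document dom;
        try {
            dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        List<Map<String, String>> records = new ArrayList<>();
        NodeList blocks = dom.getElementsByTagName("block");
        for (int i = 0; i < blocks.getLength(); i++) {
            Element block = (Element) blocks.item(i);
            Map<String, String> record = new LinkedHashMap<>();
            // Keep the parent id with the record so a hit can be traced back.
            record.put("blockId", block.getAttribute("id"));
            NodeList children = block.getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    record.put(child.getNodeName(), child.getTextContent().trim());
                }
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<blocks>"
                + "<block id=\"123\"><author>Joe Blow</author><job>hack</job></block>"
                + "<block id=\"456\"><author>Jane Doe</author><job>President</job></block>"
                + "</blocks>";
        for (Map<String, String> record : flatten(xml)) {
            System.out.println(record);
        }
    }
}
```

The trade-off is more (and smaller) index entries per XML file. If the file itself also needs to be found, a file identifier could be added to each record in the same way.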

Hope this question makes sense.  I'm sure I'm missing something
obvious/simple in how the API would work in this case.  Thanks,

Landon Cox

