Posted to user@lucenenet.apache.org by Jokin Cuadrado <jo...@gmail.com> on 2007/07/17 10:00:46 UTC

Re: Index physical size

Maybe the Java index is using compression. If you need compression in
Lucene.Net, you must use an external library (SharpZipLib) and tell
Lucene.Net to use it. There should be a how-to on using compression
with Lucene.Net somewhere on the web.
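
As a rough illustration of compressing a field value before it is stored, here is a minimal sketch in Python using the standard zlib module. It does not use Lucene.Net or SharpZipLib; the function names compress_field and decompress_field are hypothetical and only show the round trip.

```python
import zlib


def compress_field(text: str) -> bytes:
    """Encode a field value as UTF-8 and deflate it before storing."""
    return zlib.compress(text.encode("utf-8"))


def decompress_field(blob: bytes) -> str:
    """Inflate a stored blob and decode it back to text."""
    return zlib.decompress(blob).decode("utf-8")


# Repetitive sample text, chosen only so the effect is easy to see.
value = "lucene index optimize " * 200
blob = compress_field(value)
print(len(value.encode("utf-8")), "bytes raw,", len(blob), "bytes compressed")
```

Note that whoever reads the field must apply the matching decompression, so an index written this way is only readable by code that knows about it.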

Jokin

On 7/17/07, Simone Busoli <si...@gmail.com> wrote:
>
>  Hello,
>
>  I discovered that an index optimization done by Lucene.Net with
> IndexWriter.Optimize() is less "optimizing" than the same operation done on
> the same index with Java Lucene. I found this out because I use Luke to
> browse my index, and opening the index with Luke automatically reduces
> its size by 50%, even if it had just been optimized by my application
> running Lucene.Net.
>
>  Did anyone else notice this?
>

Unsubscribe

Posted by "Harris, Tobin" <to...@tobinharris.com>.

Unsubscribe

Re: Index physical size

Posted by Laxmilal Menaria <lm...@chambal.com>.

Hi,

I created an index using Lucene.Net 2.0 and called
IndexWriter.Optimize() after indexing completed. The index size is 191 MB.
Some time later I opened the same index using Luke, but I do not see any
reduction in size; it still shows 191 MB.

--LM

On 7/18/07, Jokin Cuadrado <jo...@gmail.com> wrote:
>
> Which version of Lucene are you using?
> Do you have a reader open?
>
> It seems reasonable to me because, if I remember correctly, old unused
> files are cleaned up when the index is opened. Have you tried opening
> the index with Lucene.Net after creating it to see whether the result
> is the same?
>
> jokin
>



-- 
Thanks,
Laxmilal menaria

http://www.minalyzer.com/
http://www.chambal.com/

Re: Index physical size

Posted by Laxmilal Menaria <lm...@chambal.com>.

My steps are:

1. Create the index using a writer to add stuff.
2. Optimize and close the index writer only.



On 7/18/07, Simone Busoli <si...@gmail.com> wrote:
>
>  I'm using the latest release of Lucene.Net.
>
> Here are the steps of the application:
>
> 1. create index
> 2. open index reader to remove stuff
> 3. close index reader
> 4. open index writer to add stuff
> 5. optimize and close index writer
>
> 2-5 are repeated at intervals. So there's always at most one object
> writing to the index at one point in time.
>
> Not a big issue after all, but thanks for your help.
>
> Simone
>


-- 
Thanks,
Laxmilal menaria

http://www.minalyzer.com/
http://www.chambal.com/

Re: Index physical size

Posted by Jokin Cuadrado <jo...@gmail.com>.

What I described is the expected behavior. Until you open another
reader, Lucene won't delete the old index files left behind. If you open
the index with Luke, you are using a reader, which cleans up the old
files.

To check that I'm correct, you can open a reader with Lucene.Net after
step 5 and see whether it deletes the remaining files.
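
One way to run that check is to list the index directory before and after opening the reader and compare the two listings. The sketch below is plain Python with no Lucene calls; index_files and removed_files are hypothetical helper names, and the index path in the usage note is a placeholder.

```python
import os


def index_files(path):
    """Return {file name: size in bytes} for regular files directly in path."""
    sizes = {}
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                sizes[entry.name] = entry.stat().st_size
    return sizes


def removed_files(before, after):
    """Names present in the first listing but missing from the second."""
    return sorted(set(before) - set(after))
```

Usage would be along the lines of: call index_files("path/to/index") right after step 5, open and close a reader on the same index, call index_files again, and pass both results to removed_files. If old segment files show up in the result, the cleanup happened when the reader opened the index.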

jokin

On 7/18/07, Simone Busoli <si...@gmail.com> wrote:
>
>  I'm using the latest release of Lucene.Net.
>
>  Here are the steps of the application:
>
>  1. create index
>  2. open index reader to remove stuff
>  3. close index reader
>  4. open index writer to add stuff
>  5. optimize and close index writer
>
>  2-5 are repeated at intervals. So there's always at most one object writing
> to the index at one point in time.
>
>  Not a big issue after all, but thanks for your help.
>
>  Simone
>
>

Re: Index physical size

Posted by Jokin Cuadrado <jo...@gmail.com>.

Which version of Lucene are you using?
Do you have a reader open?

It seems reasonable to me because, if I remember correctly, old unused
files are cleaned up when the index is opened. Have you tried opening
the index with Lucene.Net after creating it to see whether the result
is the same?

jokin

On 7/18/07, Simone Busoli <si...@gmail.com> wrote:
> I don't know. This is the situation when I create and optimize the index
> with Lucene.Net:
>
> segments   28 Byte
> _i5.cfs      543 kByte
> deletable   12 Byte
> _bd.cfs      317 kByte
>
> Once the index is opened with Luke only segments and _i5.cfs remain,
> untouched. So the only difference is that _bd.cfs and deletable are
> removed. Well, deletable looked like a good candidate to be deleted, but
> what about _bd.cfs? It looks like it wasn't needed then.
>
> Simone
>

Re: Index physical size

Posted by Simone Busoli <si...@gmail.com>.

I don't know. This is the situation when I create and optimize the index
with Lucene.Net:

segments   28 Byte
_i5.cfs      543 kByte
deletable   12 Byte
_bd.cfs      317 kByte

Once the index is opened with Luke only segments and _i5.cfs remain,
untouched. So the only difference is that _bd.cfs and deletable are
removed. Well, deletable looked like a good candidate to be deleted, but
what about _bd.cfs? It looks like it wasn't needed then.
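
For reference, the share of the directory taken up by the two removed files can be worked out from the listing above. This is a small sketch that assumes "kByte" means 1024 bytes; with these figures the removed files account for about 37% of the directory, not exactly half.

```python
KB = 1024  # assumption: "kByte" in the listing means 1024 bytes

# File sizes as reported in the listing above.
sizes = {
    "segments": 28,
    "_i5.cfs": 543 * KB,
    "deletable": 12,
    "_bd.cfs": 317 * KB,
}

# Files that were gone after the index was opened with Luke.
removed = ["deletable", "_bd.cfs"]

total_before = sum(sizes.values())
freed = sum(sizes[name] for name in removed)
fraction_freed = freed / total_before

print(f"{freed} of {total_before} bytes freed ({fraction_freed:.1%})")
```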

Simone

Jokin Cuadrado wrote:
> I'm only guessing, but could it be an issue with the text encoding
> used? If it's exactly 50%, maybe Lucene.Net is using an encoding that
> needs 2 bytes per character by default, and Luke is using one that
> needs only 1 byte.
>
> Regarding the number of files, maybe Luke doesn't take into account the
> "deletable" file, which lists the files that are no longer used and
> may be deleted, because it doesn't delete files. But I don't think that
> is relevant to the other question.
>
> jokin.
>

Re: Index physical size

Posted by Jokin Cuadrado <jo...@gmail.com>.

I'm only guessing, but could it be an issue with the text encoding
used? If it's exactly 50%, maybe Lucene.Net is using an encoding that
needs 2 bytes per character by default, and Luke is using one that
needs only 1 byte.

Regarding the number of files, maybe Luke doesn't take into account the
"deletable" file, which lists the files that are no longer used and
may be deleted, because it doesn't delete files. But I don't think that
is relevant to the other question.
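
To show what that guess would mean in practice, here is a small Python sketch comparing the byte length of the same ASCII text in a 1-byte-per-character and a 2-byte-per-character encoding. It only illustrates the hypothesis; it says nothing about which encoding Lucene.Net or Luke actually writes, and encoded_sizes is a hypothetical helper name.

```python
def encoded_sizes(text):
    """Return the encoded length in bytes of text under two encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16-le": len(text.encode("utf-16-le")),
    }


# Plain ASCII sample: 1 byte per character in UTF-8, 2 in UTF-16.
sizes = encoded_sizes("index optimization")
print(sizes)
```

For text outside the ASCII range the ratio is no longer a clean factor of two, so an exact 50% difference would only be expected for mostly ASCII content.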

jokin.

On 7/17/07, Simone Busoli <si...@gmail.com> wrote:
>
>  Hi Jokin,
>
>  Actually, I found some information about it. As far as I've discovered,
> compression can be applied to fields of documents before adding them to the
> index, even though Lucene.Net doesn't supply it out of the box. But the issue I
> reported has nothing to do with this, because the index size reduction seems to
> be applied at a higher level by Luke, I mean, to an index already containing
> documents with uncompressed fields. In fact, when reopening the index with
> Lucene.Net after it's been opened (and, you see, optimized) by Luke, I am
> still able to read it, even though I didn't configure support for compression.
> This means that Luke didn't compress the contents of the documents contained
> in the index (that would be weird behavior after all), but instead did
> something like optimizing the format of the index files. Another
> detail is that when I write my index with Lucene.Net I end up with at least
> 3 files, while when I open it with Luke I always get only 2 files. And yes,
> I am calling IndexWriter.Optimize() when I finish indexing. Am I missing
> something?
>
>  Simone