You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by 张志田 <zh...@dianping.com> on 2010/05/05 08:24:09 UTC

How can I merge .cfx and .cfs into a single cfs file?

Hi all,

I have an index task which will index thousands of records with lucene 3.0.1. My confusion is lucene will always create a .cfx and a .cfs file in the file system, sometimes more, while I thought it should create a single .cfs file if I optimize the index data. Is it by design? If yes, is there any way/configuration I can do to merge all of the index files into a singe one?

By the way, I have a logic to validate the index data, if the size of .cfs increases dramatically comparing to the file generated last time, there may be something wrong, a warning message will be threw. This is the reason that I want to generate a single .cfs file. Any other suggestion about the index validation?

Any body can give me a hand?

Thanks in advance.

Garry

Re: How can I merge .cfx and .cfs into a single cfs file?

Posted by 张志田 <zh...@dianping.com>.
Thank you Mike.

Garry
----- Original Message ----- 
From: "Michael McCandless" <lu...@mikemccandless.com>
To: <ja...@lucene.apache.org>
Sent: Wednesday, May 05, 2010 8:24 PM
Subject: Re: How can I merge .cfx and .cfs into a single cfs file?


Lucene considers an index with a single .cfx and a single .cfs as optimized.

Also, note that how Lucene stores files in the index is an impl detail
-- it can change from release to release -- so relying on any of these
details is dangerous.

That said, with recent Lucene versions, if you really want to force
these two files to be consolidated, you can make a custom MergePolicy
that returns a merge from the findMergesForOptimize for this case (the
default MergePolicy returns null since it thinks this case is already
optimized).

Or... if you index with two separate IndexWriter sessions, and then
call optimize, that should also merge down to one file.

Mike

2010/5/5 张志田 <zh...@dianping.com>:
> Uwe, thank you very much.
>
> What is the mechanizm lucene will merge these two kinds of files? Sometimes I found there was only one .cfs file, but in another time there may be one cfs and cfx. I understand the .cfx is used to store the term vectors etc, but why does the index result not seem to be consistent?
>
> Thanks,
> Garry
> ----- Original Message -----
> From: "Uwe Goetzke" <uw...@healy-hudson.com>
> To: <ja...@lucene.apache.org>
> Sent: Wednesday, May 05, 2010 3:57 PM
> Subject: AW: How can I merge .cfx and .cfs into a single cfs file?
>
>
> Index all into a directory and determine the size of all files in it.
>
> From http://lucene.apache.org/java/3_0_1/fileformats.html
> Starting with Lucene 2.3, doc store files (stored field values and term vectors) can be shared in a single set of files for more than one segment. When compound file is enabled, these shared files will be added into a single compound file (same format as above) but with the extension .cfx.
>
> In addition to
> Compound File  .cfs  An optional "virtual" file consisting of all the other index files for systems that frequently run out of file handles.
>
> Uwe
>
>
> -----Ursprüngliche Nachricht-----
> Von: 张志田 [mailto:zhitian.zhang@dianping.com]
> Gesendet: Mittwoch, 5. Mai 2010 08:24
> An: java-user@lucene.apache.org
> Betreff: How can I merge .cfx and .cfs into a single cfs file?
>
> Hi all,
>
> I have an index task which will index thousands of records with lucene 3.0.1. My confusion is lucene will always create a .cfx and a .cfs file in the file system, sometimes more, while I thought it should create a single .cfs file if I optimize the index data. Is it by design? If yes, is there any way/configuration I can do to merge all of the index files into a singe one?
>
> By the way, I have a logic to validate the index data, if the size of .cfs increases dramatically comparing to the file generated last time, there may be something wrong, a warning message will be threw. This is the reason that I want to generate a single .cfs file. Any other suggestion about the index validation?
>
> Any body can give me a hand?
>
> Thanks in advance.
>
> Garry
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: How can I merge .cfx and .cfs into a single cfs file?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Lucene considers an index with a single .cfx and a single .cfs as optimized.

Also, note that how Lucene stores files in the index is an impl detail
-- it can change from release to release -- so relying on any of these
details is dangerous.

That said, with recent Lucene versions, if you really want to force
these two files to be consolidated, you can make a custom MergePolicy
that returns a merge from the findMergesForOptimize for this case (the
default MergePolicy returns null since it thinks this case is already
optimized).

Or... if you index with two separate IndexWriter sessions, and then
call optimize, that should also merge down to one file.

Mike

2010/5/5 张志田 <zh...@dianping.com>:
> Uwe, thank you very much.
>
> What is the mechanizm lucene will merge these two kinds of files? Sometimes I found there was only one .cfs file, but in another time there may be one cfs and cfx. I understand the .cfx is used to store the term vectors etc, but why does the index result not seem to be consistent?
>
> Thanks,
> Garry
> ----- Original Message -----
> From: "Uwe Goetzke" <uw...@healy-hudson.com>
> To: <ja...@lucene.apache.org>
> Sent: Wednesday, May 05, 2010 3:57 PM
> Subject: AW: How can I merge .cfx and .cfs into a single cfs file?
>
>
> Index all into a directory and determine the size of all files in it.
>
> From http://lucene.apache.org/java/3_0_1/fileformats.html
> Starting with Lucene 2.3, doc store files (stored field values and term vectors) can be shared in a single set of files for more than one segment. When compound file is enabled, these shared files will be added into a single compound file (same format as above) but with the extension .cfx.
>
> In addition to
> Compound File  .cfs  An optional "virtual" file consisting of all the other index files for systems that frequently run out of file handles.
>
> Uwe
>
>
> -----Ursprüngliche Nachricht-----
> Von: 张志田 [mailto:zhitian.zhang@dianping.com]
> Gesendet: Mittwoch, 5. Mai 2010 08:24
> An: java-user@lucene.apache.org
> Betreff: How can I merge .cfx and .cfs into a single cfs file?
>
> Hi all,
>
> I have an index task which will index thousands of records with lucene 3.0.1. My confusion is lucene will always create a .cfx and a .cfs file in the file system, sometimes more, while I thought it should create a single .cfs file if I optimize the index data. Is it by design? If yes, is there any way/configuration I can do to merge all of the index files into a singe one?
>
> By the way, I have a logic to validate the index data, if the size of .cfs increases dramatically comparing to the file generated last time, there may be something wrong, a warning message will be threw. This is the reason that I want to generate a single .cfs file. Any other suggestion about the index validation?
>
> Any body can give me a hand?
>
> Thanks in advance.
>
> Garry
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: How can I merge .cfx and .cfs into a single cfs file?

Posted by 张志田 <zh...@dianping.com>.
Uwe, thank you very much.

What is the mechanizm lucene will merge these two kinds of files? Sometimes I found there was only one .cfs file, but in another time there may be one cfs and cfx. I understand the .cfx is used to store the term vectors etc, but why does the index result not seem to be consistent?

Thanks,
Garry
----- Original Message ----- 
From: "Uwe Goetzke" <uw...@healy-hudson.com>
To: <ja...@lucene.apache.org>
Sent: Wednesday, May 05, 2010 3:57 PM
Subject: AW: How can I merge .cfx and .cfs into a single cfs file?


Index all into a directory and determine the size of all files in it.

From http://lucene.apache.org/java/3_0_1/fileformats.html 
Starting with Lucene 2.3, doc store files (stored field values and term vectors) can be shared in a single set of files for more than one segment. When compound file is enabled, these shared files will be added into a single compound file (same format as above) but with the extension .cfx.

In addition to
Compound File  .cfs  An optional "virtual" file consisting of all the other index files for systems that frequently run out of file handles.

Uwe


-----Ursprüngliche Nachricht-----
Von: 张志田 [mailto:zhitian.zhang@dianping.com] 
Gesendet: Mittwoch, 5. Mai 2010 08:24
An: java-user@lucene.apache.org
Betreff: How can I merge .cfx and .cfs into a single cfs file?

Hi all,

I have an index task which will index thousands of records with lucene 3.0.1. My confusion is lucene will always create a .cfx and a .cfs file in the file system, sometimes more, while I thought it should create a single .cfs file if I optimize the index data. Is it by design? If yes, is there any way/configuration I can do to merge all of the index files into a singe one?

By the way, I have a logic to validate the index data, if the size of .cfs increases dramatically comparing to the file generated last time, there may be something wrong, a warning message will be threw. This is the reason that I want to generate a single .cfs file. Any other suggestion about the index validation?

Any body can give me a hand?

Thanks in advance.

Garry

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

AW: How can I merge .cfx and .cfs into a single cfs file?

Posted by Uwe Goetzke <uw...@healy-hudson.com>.
Index all into a directory and determine the size of all files in it.

>From http://lucene.apache.org/java/3_0_1/fileformats.html 
Starting with Lucene 2.3, doc store files (stored field values and term vectors) can be shared in a single set of files for more than one segment. When compound file is enabled, these shared files will be added into a single compound file (same format as above) but with the extension .cfx.

In addition to
Compound File  	.cfs  	An optional "virtual" file consisting of all the other index files for systems that frequently run out of file handles.

Uwe


-----Ursprüngliche Nachricht-----
Von: 张志田 [mailto:zhitian.zhang@dianping.com] 
Gesendet: Mittwoch, 5. Mai 2010 08:24
An: java-user@lucene.apache.org
Betreff: How can I merge .cfx and .cfs into a single cfs file?

Hi all,

I have an index task which will index thousands of records with lucene 3.0.1. My confusion is lucene will always create a .cfx and a .cfs file in the file system, sometimes more, while I thought it should create a single .cfs file if I optimize the index data. Is it by design? If yes, is there any way/configuration I can do to merge all of the index files into a singe one?

By the way, I have a logic to validate the index data, if the size of .cfs increases dramatically comparing to the file generated last time, there may be something wrong, a warning message will be threw. This is the reason that I want to generate a single .cfs file. Any other suggestion about the index validation?

Any body can give me a hand?

Thanks in advance.

Garry

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org