Posted to dev@lucene.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2012/05/12 05:01:56 UTC

[jira] [Created] (LUCENE-4050) Change SegmentInfos format to plain text

Andrzej Bialecki  created LUCENE-4050:
-----------------------------------------

             Summary: Change SegmentInfos format to plain text
                 Key: LUCENE-4050
                 URL: https://issues.apache.org/jira/browse/LUCENE-4050
             Project: Lucene - Java
          Issue Type: Improvement
          Components: core/codecs
            Reporter: Andrzej Bialecki 
             Fix For: 4.0


I propose to change the format of the SegmentInfos file (segments_NN) to use plain text instead of the current binary format.

The SegmentInfos file represents a commit point, and it also declares which codecs were used to write each of the segments that make up the commit point. However, this is a chicken-and-egg situation: in theory the format of this file is customizable via Codec.getSegmentInfosFormat, but in practice we first have to discover which codec implementation wrote the file. So SegmentCoreReaders assumes a certain fixed binary layout for a preamble of this file that contains the codec name... and then the file is read again, only this time using the right Codec.

This is ugly. Instead I propose to use a simple plain text format, either line-oriented properties or JSON, in such a way that newer versions could easily extend it, and which wouldn't require any special Codec to read and parse. Consequently we could remove SegmentInfosFormat altogether, and instead add SegmentInfoFormat (notice the singular) to Codec to read individual per-segment SegmentInfo-s in a codec-specific way. E.g. for the Lucene40 codec we could either add another file or extend the .fnm file (FieldInfos) to also contain this information.

Then the plain text SegmentInfos would contain just the following information:

* list of global files for this commit point (if any)
* list of segments for this commit point, and their corresponding codec class names
* user data map
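
To make the proposal concrete, here is a minimal sketch of the line-oriented properties variant, built on java.util.Properties. It is only an illustration under stated assumptions: the class PlainTextCommitSketch and the key names (version, segments.count, segment.N.name, segment.N.codec, userData.*) are hypothetical and not part of Lucene, and the list of global files is left out for brevity.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch: none of these class or key names exist in Lucene.
final class PlainTextCommitSketch {

    // Serializes segment name -> codec name pairs plus user data as line-oriented properties.
    static String write(Map<String, String> segmentToCodec, Map<String, String> userData)
            throws IOException {
        Properties props = new Properties();
        props.setProperty("version", "1"); // lets future readers detect format changes
        props.setProperty("segments.count", Integer.toString(segmentToCodec.size()));
        int i = 0;
        for (Map.Entry<String, String> e : segmentToCodec.entrySet()) {
            props.setProperty("segment." + i + ".name", e.getKey());
            props.setProperty("segment." + i + ".codec", e.getValue());
            i++;
        }
        for (Map.Entry<String, String> e : userData.entrySet()) {
            props.setProperty("userData." + e.getKey(), e.getValue());
        }
        StringWriter out = new StringWriter();
        props.store(out, null);
        return out.toString();
    }

    // Reads the segment list back; no Codec is needed to parse the text.
    static Map<String, String> readSegments(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        int count = Integer.parseInt(props.getProperty("segments.count"));
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            result.put(props.getProperty("segment." + i + ".name"),
                       props.getProperty("segment." + i + ".codec"));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> segments = new LinkedHashMap<>();
        segments.put("_0", "Lucene40");
        segments.put("_1", "SimpleText");
        Map<String, String> userData = new LinkedHashMap<>();
        userData.put("committedBy", "example");
        String text = write(segments, userData);
        System.out.println(text);
        System.out.println(readSegments(text));
    }
}
```

The point of the sketch is that the reader only needs a generic text parser to learn which codec to load for each segment; everything codec-specific would live elsewhere.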


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-4050) Make segments_NN file codec-independent

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276204#comment-13276204 ] 

Michael McCandless commented on LUCENE-4050:
--------------------------------------------

bq.  However, we could change the two-phase commit implementation to the following:

I think that's a good solution?  It seems important to keep the non-codec-controlled write/read as simple as possible...

The only small thing we lose is if a disk full is going to strike... today we write the 0s ahead (in prepareCommit) so that we'll hit disk full during prepareCommit and not commit... but I think the chance of those 4 bytes hitting the disk full is very low so the simpler code is better...
                
> Make segments_NN file codec-independent
> ---------------------------------------
>
>                 Key: LUCENE-4050
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4050
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/codecs
>            Reporter: Andrzej Bialecki 
>            Assignee: Robert Muir
>             Fix For: 4.0
>
>

[jira] [Commented] (LUCENE-4050) Make segments_NN file codec-independent

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276325#comment-13276325 ] 

Andrzej Bialecki  commented on LUCENE-4050:
-------------------------------------------

bq. The only small thing we lose is if a disk full is going to strike... 
I thought about this too. If it's really a big concern we could use the following trick: more than 99% of filesystems keep data in blocks that are multiples of 512 bytes. We could add filler bytes at the end of the file so that it comes out to a round multiple of 512 B, and only then append the marker and the checksum. This way we know that writing the marker required allocating a new block, and if that succeeded then writing the checksum should succeed too.
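
The padding arithmetic behind this trick could be sketched as follows. This is a rough illustration only: the class BlockPaddingSketch and its methods are hypothetical, and the 512-byte block size is the assumption stated in the comment above, not something the code can verify for a given filesystem.

```java
import java.util.Arrays;

// Hypothetical sketch of the filler-byte calculation; not Lucene code.
final class BlockPaddingSketch {

    // Assumed block granularity, taken from the comment above.
    static final int BLOCK_SIZE = 512;

    // Number of filler bytes needed so that dataLength + filler is a multiple of BLOCK_SIZE.
    static int fillerBytes(long dataLength) {
        int remainder = (int) (dataLength % BLOCK_SIZE);
        return remainder == 0 ? 0 : BLOCK_SIZE - remainder;
    }

    // Returns a copy of data extended with zero bytes up to the next block boundary.
    static byte[] pad(byte[] data) {
        return Arrays.copyOf(data, data.length + fillerBytes(data.length));
    }

    public static void main(String[] args) {
        byte[] commitData = new byte[700];
        byte[] padded = pad(commitData);
        // 700 bytes need 324 filler bytes to reach 1024, so the marker and
        // checksum appended afterwards would start in a fresh block.
        System.out.println(padded.length); // prints 1024
    }
}
```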
                

[jira] [Assigned] (LUCENE-4050) Make segments_NN file codec-independent

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  reassigned LUCENE-4050:
-----------------------------------------

    Assignee: Robert Muir
    

[jira] [Commented] (LUCENE-4050) Change SegmentInfos format to plain text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273927#comment-13273927 ] 

Robert Muir commented on LUCENE-4050:
-------------------------------------

{quote}
In fact Lucene used to use rename to commit the segments file but this
proved problematic on Windows (sometimes the rename would hit "access
denied" error).
{quote}

Well, problematic at least once, right? I don't think it justifies doing
things a strange way.

Surely this is just some problem only on windows 3.1 and java 1.2 or
something and now fixed, since this is how every other linux/cygwin program
(e.g. vi) works.

                

[jira] [Commented] (LUCENE-4050) Change SegmentInfos format to plain text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273858#comment-13273858 ] 

Robert Muir commented on LUCENE-4050:
-------------------------------------

I agree this is a total mess. We should really revisit how we handle:

# commit file (in my opinion this should just be a list of segments! only!)
  currently segmentinfos stores a ton of stuff more than this, it stores
  per-segment metadata within this file when it really should not.
# per-segment metadata. In this case we have a lot of confusion with 
  segmentinfo and fieldinfo. It would be great for the codec to have more
  flexibility here, via abstract classes/interfaces+attributes or something
  that ensures it's lossless yet still lets a codec add what it needs. Really
  for the most part segmentinfo is basically useless since many values actually
  return "well if you want to know this, then go look at the fieldinfos".
# actual commit strategy. We do a lot of funky stuff like writing fake bogus
  data, seeking backwards, etc. Why not just a normal atomic rename like
  any other computer program on the planet????
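
For reference, the rename-based commit raised in the last item might look like the sketch below when written against the Java 7 NIO API (java.nio.file.Files.move with StandardCopyOption.ATOMIC_MOVE). The class AtomicRenameCommitSketch, the ".tmp" suffix, and the file names are hypothetical, and the sketch deliberately omits syncing the file and directory to stable storage. It is not a description of what Lucene does.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of a write-then-rename commit; not Lucene code.
final class AtomicRenameCommitSketch {

    // Writes contents to a temporary file, then renames it to its final name.
    static void commit(Path dir, String finalName, byte[] contents) throws IOException {
        Path tmp = dir.resolve(finalName + ".tmp");
        Path target = dir.resolve(finalName);
        Files.write(tmp, contents);
        // Throws AtomicMoveNotSupportedException if the file system cannot
        // perform the move as a single atomic operation.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("commit-sketch");
        commit(dir, "segments_1", "_0 _1".getBytes(StandardCharsets.UTF_8));
        System.out.println(Files.exists(dir.resolve("segments_1")));
    }
}
```

Readers never see a half-written segments_1 in this scheme: the file either does not exist yet or is complete.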

                

[jira] [Commented] (LUCENE-4050) Change SegmentInfos format to plain text

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273921#comment-13273921 ] 

Michael McCandless commented on LUCENE-4050:
--------------------------------------------

+1 to fully separate (separate files maybe?) the codec-neutral "list
of committed segments" from "the codec-specific details/metadata for
each segment".

Then, a codec can easily store its own stuff in the segment metadata.

And I agree the FieldInfo/SegmentInfo duality is confusing...

Plain text encoding of these files would be really nice but isn't as
important, I think... and will be a fair amount of work (I suspect we
need a JSON or YAML or something that represents lists, maps,
different native types, etc.).  I think this is separate / can come
later.

{quote}
We do a lot of funky stuff like writing fake bogus
data, seeking backwards, etc. Why not just a normal atomic rename like
any other computer program on the planet????
{quote}

In fact Lucene used to use rename to commit the segments file but this
proved problematic on Windows (sometimes the rename would hit "access
denied" error).

                

[jira] [Commented] (LUCENE-4050) Make segments_NN file codec-independent

Posted by "Marvin Humphrey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277041#comment-13277041 ] 

Marvin Humphrey commented on LUCENE-4050:
-----------------------------------------

Ever considered using hard links instead of renaming?
                

[jira] [Commented] (LUCENE-4050) Make segments_NN file codec-independent

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277281#comment-13277281 ] 

Michael McCandless commented on LUCENE-4050:
--------------------------------------------

bq. Ever considered using hard links instead of renaming?

That's a neat option ... but I think it's only in Java 7 that we can create hard links (java.nio.file.Files.createLink)?  And even then it's an optional operation...
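
A minimal sketch of what the hard-link variant could look like with the Java 7 call mentioned above, Files.createLink(link, existing). The class HardLinkCommitSketch, its publish method, and the file names are hypothetical. The UnsupportedOperationException branch reflects the "optional operation" caveat: a caller would need some fallback when it returns false.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of committing via a hard link; not Lucene code.
final class HardLinkCommitSketch {

    // Makes the fully written pending file visible under its committed name.
    // Returns false if the file system provider does not support hard links.
    static boolean publish(Path pending, Path committed) throws IOException {
        try {
            // First argument is the new link, second is the existing file.
            Files.createLink(committed, pending);
            return true;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("hardlink-sketch");
        Path pending = dir.resolve("pending_segments_2");
        Files.write(pending, "_0 _1 _2".getBytes(StandardCharsets.UTF_8));
        boolean linked = publish(pending, dir.resolve("segments_2"));
        System.out.println("hard link created: " + linked);
    }
}
```

Unlike a rename, createLink fails with FileAlreadyExistsException if the committed name already exists, so it cannot silently replace an earlier commit.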
                

[jira] [Commented] (LUCENE-4050) Change SegmentInfos format to plain text

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274064#comment-13274064 ] 

Andrzej Bialecki  commented on LUCENE-4050:
-------------------------------------------

bq. Plain text encoding of these files would be really nice but isn't as important, I think...

Yeah, it could be sufficient if we agreed to separate the plain "list of segments:codec" from the segmentInfo/fieldInfo parts and push those parts down to the codec-specific formats.

Then we could just use a version number as the first element of this file to allow for extensions in the future, like e.g. switching to JSON or to some other format du jour.

bq. Surely this is just some problem only on windows 3.1 and java 1.2 or something and now fixed, since this is how every other linux/cygwin program (e.g. vi) works.

I'm not so sure. I know for a fact that Windows doesn't allow renames or deletes of open files, whether the file is opened by you or by some other process (e.g. a user examining the file in Notepad.exe), and IIRC the issue was that the JVM doesn't release OS file handles quickly enough.
                

[jira] [Commented] (LUCENE-4050) Change SegmentInfos format to plain text

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274271#comment-13274271 ] 

Andrzej Bialecki  commented on LUCENE-4050:
-------------------------------------------

After discussing this further with Robert, it looks like this is a (smaller) part of a larger issue: SegmentInfo+FieldInfo should be made extensible, and the process of reading/writing this information should be *completely codec-specific*. Let's make a separate issue for that part.

The smaller issue discussed here is to record only the information about a commit point in a *completely codec-independent, versioned format*, whatever that format is. Let's call it CommitInfo, or whatever other name fits. This part would be written to a file that is separate from the codec-dependent parts.
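
As a rough illustration of what a codec-independent, versioned commit file could look like, here is a minimal sketch using line-oriented properties (one of the two options mentioned in the issue description). The class name, the key names ("version", "segments", "segment.<name>.codec") and the version number are all assumptions made up for this example; nothing here is an existing Lucene API or an agreed format.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Properties;

/** Hypothetical sketch: a plain-text, versioned commit file that needs no Codec to parse. */
class CommitInfoFormatSketch {
    // Assumed format version; a reader rejects versions it does not understand.
    static final int FORMAT_VERSION = 1;

    /** Serializes segment names and their codec names as line-oriented properties. */
    public static String write(String[] segmentNames, String[] codecNames) throws IOException {
        Properties p = new Properties();
        p.setProperty("version", Integer.toString(FORMAT_VERSION));
        p.setProperty("segments", String.join(",", segmentNames));
        for (int i = 0; i < segmentNames.length; i++) {
            // One key per segment records which codec wrote it (key layout is an assumption).
            p.setProperty("segment." + segmentNames[i] + ".codec", codecNames[i]);
        }
        StringWriter w = new StringWriter();
        p.store(w, null);
        return w.toString();
    }

    /** Returns the codec name recorded for a segment, or null if the segment is unknown. */
    public static String codecOf(String text, String segmentName) throws IOException {
        Properties p = new Properties();
        p.load(new StringReader(text));
        int version = Integer.parseInt(p.getProperty("version", "0"));
        if (version < 1 || version > FORMAT_VERSION) {
            throw new IOException("unsupported commit file version: " + version);
        }
        // Keys this reader does not know about are simply ignored, which leaves room for extension.
        return p.getProperty("segment." + segmentName + ".codec");
    }
}
```

The point of the sketch is only that the codec name for each segment can be discovered without first knowing any codec; the per-segment details would still be read by the codec itself.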

Regarding two-phase commit and checksums: one reason we have SegmentInfosWriter/Reader is the AppendingCodec, because we couldn't otherwise make the commit work on append-only filesystems. However, we could change the two-phase commit implementation to the following:

* write the data to the CommitInfo file
* write a marker indicating "end of data, checksum follows"
* finally, write the checksum

Then the reading code knows that:

* if the marker is missing, the file is invalid
* if the marker is present, the checksum must be present too
* and the checksum must be correct.

This implementation doesn't require seeking back or overwriting, so it works on any filesystem.
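
The write and read steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: the class name, the marker text, the use of CRC32 and the fixed-width hex encoding of the checksum are all choices made for this example, and the sketch assumes the marker text never occurs inside the data itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

/** Hypothetical sketch of an append-only commit file: data, end marker, checksum. */
class CommitInfoSketch {
    // Assumed marker text; the comment above does not specify one.
    static final byte[] END_MARKER = "\n--END--\n".getBytes(StandardCharsets.US_ASCII);

    /** Appends the data, then the marker, then the checksum; nothing is ever rewritten. */
    public static byte[] write(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(data);                                          // 1. the commit data
        out.write(END_MARKER);                                    // 2. "end of data, checksum follows"
        out.write(checksum(data).getBytes(StandardCharsets.US_ASCII)); // 3. the checksum
        return out.toByteArray();
    }

    /** Returns the data if the marker and a correct checksum are present, otherwise throws. */
    public static byte[] read(byte[] file) throws IOException {
        int pos = lastIndexOf(file, END_MARKER);
        if (pos < 0) {
            throw new IOException("end marker missing: file is invalid");
        }
        byte[] data = Arrays.copyOfRange(file, 0, pos);
        int start = pos + END_MARKER.length;
        String stored = new String(file, start, file.length - start, StandardCharsets.US_ASCII);
        if (stored.length() != 8) {
            throw new IOException("checksum missing or truncated after marker");
        }
        if (!stored.equals(checksum(data))) {
            throw new IOException("checksum mismatch");
        }
        return data;
    }

    /** CRC32 of the data as 8 lowercase hex digits. */
    static String checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return String.format("%08x", crc.getValue());
    }

    /** Index of the last occurrence of needle in haystack, or -1 if absent. */
    static int lastIndexOf(byte[] haystack, byte[] needle) {
        for (int i = haystack.length - needle.length; i >= 0; i--) {
            if (Arrays.equals(Arrays.copyOfRange(haystack, i, i + needle.length), needle)) {
                return i;
            }
        }
        return -1;
    }
}
```

A writer that crashes before the marker, or between the marker and the checksum, leaves a file that the reader rejects, which is what makes the commit safe without any seek or overwrite.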
                

[jira] [Updated] (LUCENE-4050) Make segments_NN file codec-independent

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated LUCENE-4050:
--------------------------------------

    Issue Type: Bug  (was: Improvement)

It's actually a bug: it's not possible to cleanly extend the index format via Codec-s without addressing this issue.
                

[jira] [Updated] (LUCENE-4050) Make segments_NN file codec-independent

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated LUCENE-4050:
--------------------------------------

    Summary: Make segments_NN file codec-independent  (was: Change SegmentInfos format to plain text)

Changing the title to better reflect the scope of this issue.
                