Posted to dev@lucene.apache.org by "Michael Busch (JIRA)" <ji...@apache.org> on 2009/04/13 00:51:14 UTC

[jira] Created: (LUCENE-1597) New Document and Field API

New Document and Field API
--------------------------

                 Key: LUCENE-1597
                 URL: https://issues.apache.org/jira/browse/LUCENE-1597
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Index
            Reporter: Michael Busch
            Priority: Minor


This is a super rough prototype of what a new document API could look like. It's basically what I came up with during a long flight across the Atlantic :)

It is not integrated with anything yet (like IndexWriter, DocumentsWriter, etc.) and heavily uses Java 1.5 features, such as generics and annotations.
The general idea sounds similar to what Marvin is doing in KS, which I found out by reading Mike's comments on LUCENE-831; I haven't looked at the KS API myself yet.

Main ideas:
- separate a field's value from its configuration; therefore this patch introduces two classes: FieldDescriptor and FieldValue
- I was thinking that in most cases the documents people add to a Lucene index look alike, i.e. they contain mostly the same fields with the same settings. Yet, for every field instance the DocumentsWriter checks the settings and calls the right consumers, which themselves check settings and return true or false, indicating whether they want to do something with that field. So I was thinking we could design the document API similar to the Class<->Object concept of OO languages. There, a class is a blueprint (as everyone knows :) ), and an object is one instance of it. So in this patch I introduced a class called DocumentDescriptor, which contains all FieldDescriptors with the field settings. This descriptor is given to the consumer (IndexWriter) once in the constructor. Then the Document "instances" are created and added via addDocument().
- A Document instance allows adding "variable fields" in addition to the "fixed fields" the DocumentDescriptor contains. For these fields the consumers have to check the field settings for every document instance (like with the old document API). This is for maintaining Lucene's flexibility that everyone loves.
- Disregard the changes to AttributeSource for now. The code that's worth looking at is contained in a new package "newdoc".

Again, this is not a "real" patch, but rather a demo of how a new API could roughly work.
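
For readers without the patch at hand, the blueprint idea described above could be sketched roughly as follows. This is not code from the attached patch: every class name ending in "Sketch", the stored/indexed flags, and the settingsChecks counter are assumptions invented for this illustration. It only shows how a consumer might inspect the settings of fixed fields once, in its constructor, while variable fields still need a settings check for every document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for FieldDescriptor: settings only, no value.
final class FieldDescriptorSketch {
  final String name;
  final boolean stored;
  final boolean indexed;

  FieldDescriptorSketch(String name, boolean stored, boolean indexed) {
    this.name = name;
    this.stored = stored;
    this.indexed = indexed;
  }
}

// Hypothetical stand-in for DocumentDescriptor: the blueprint shared by all documents.
final class DocumentDescriptorSketch {
  final Map<String, FieldDescriptorSketch> fixedFields = new LinkedHashMap<>();

  DocumentDescriptorSketch add(FieldDescriptorSketch fd) {
    fixedFields.put(fd.name, fd);
    return this;
  }
}

// One document "instance": values for the fixed fields plus optional variable fields.
final class DocumentSketch {
  final Map<String, String> fixedValues = new LinkedHashMap<>();
  final Map<FieldDescriptorSketch, String> variableFields = new LinkedHashMap<>();
}

// Hypothetical consumer standing in for IndexWriter; it only collects stored values.
final class WriterSketch {
  private final List<String> storedFixedFields = new ArrayList<>();
  final List<String> storedValues = new ArrayList<>();
  int settingsChecks = 0; // how often field settings had to be inspected

  WriterSketch(DocumentDescriptorSketch descriptor) {
    // Fixed fields: inspect the settings once, up front.
    for (FieldDescriptorSketch fd : descriptor.fixedFields.values()) {
      settingsChecks++;
      if (fd.stored) {
        storedFixedFields.add(fd.name);
      }
    }
  }

  void addDocument(DocumentSketch doc) {
    // Fixed fields: no per-document settings check is needed.
    for (String name : storedFixedFields) {
      String value = doc.fixedValues.get(name);
      if (value != null) {
        storedValues.add(value);
      }
    }
    // Variable fields: settings are checked per document, as with the old API.
    for (Map.Entry<FieldDescriptorSketch, String> e : doc.variableFields.entrySet()) {
      settingsChecks++;
      if (e.getKey().stored) {
        storedValues.add(e.getValue());
      }
    }
  }
}
```

In this sketch the number of settings checks grows with the number of variable fields added, not with the number of documents, which is the saving the descriptor is meant to enable.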




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Assigned: (LUCENE-1597) New Document and Field API

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch reassigned LUCENE-1597:
-------------------------------------

    Assignee: Michael Busch


[jira] Commented: (LUCENE-1597) New Document and Field API

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698560#action_12698560 ] 

Michael McCandless commented on LUCENE-1597:
--------------------------------------------


This looks great!  Many random thoughts...

This is largely a cleaner restructuring of what's already held in
*Field, cutting over to AttributeSource so that we gain extensibility
to other attrs people would want to store.  It also decouples type
from value, which is great.

It's also quite different from Lucy/KS's approach which is to use
carefully thought out subclasses to represent the type hierarchy.  Ie
Lucy/KS uses "the language" (classes/subclasses) to express things,
and this approach uses AttributeSource (which is sort of our
workaround for Java not allowing multiple inheritance).

This approach subdivides a type into N fully orthogonal attributes, so
a type is some combination of configured instances of these
attributes.  This actually mirrors what Field does today (in that we
have Field.Store.X, Field.Index.X, Field.TermVector.X).

This can sometimes be awkward because attributes are "flat", eg
TermVectorAttribute only makes sense for indexed fields, or for a
BinaryFieldValue most attributes are not allowed.  We don't get strong
type checking of such "mistakes", vs KS/Lucy's approach.

How would you turn on/off [future] CSF storage?  A separate attr?  A
boolean on StoreAttribute?

NumericFieldAttribute seems awkward (one shouldn't have to turn on/off
zero padding, trie; or rather it's better to operate in "use cases"
like "I want to do range filtering" or "I want to sort").  Seems like
maybe we need a SortAttribute and RangeFilterAttribute
(or... something).

Presumably we could make an "iterate over all fields" utility so
that a consumer of document wouldn't have to differentiate b/w fixed &
variable fields.
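
A minimal sketch of such an "iterate over all fields" utility, assuming the fixed and variable fields are each available as an Iterable (the class and method names here are made up for illustration and are not part of the patch):

```java
import java.util.Iterator;

final class AllFieldsSketch {
  // Chains two iterables so a consumer sees the fixed fields first,
  // then the variable ones, without having to tell them apart.
  static <T> Iterable<T> allFields(final Iterable<? extends T> fixed,
                                   final Iterable<? extends T> variable) {
    return new Iterable<T>() {
      @Override
      public Iterator<T> iterator() {
        final Iterator<? extends T> first = fixed.iterator();
        final Iterator<? extends T> second = variable.iterator();
        return new Iterator<T>() {
          @Override
          public boolean hasNext() {
            return first.hasNext() || second.hasNext();
          }

          @Override
          public T next() {
            if (first.hasNext()) {
              return first.next();
            }
            // Throws NoSuchElementException when both are exhausted.
            return second.next();
          }
        };
      }
    };
  }
}
```

A consumer could then write a single for-each loop over allFields(fixedFields, variableFields) regardless of where a field came from.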

In this model, can one re-use FieldValue for maximizing indexing
throughput?  Seems like yes?

StoredFieldsWriter is needing to do instanceof checks & casting,
which'd be nice to [somehow] avoid.

It'd be great to land this before 2.9 (and cut back to Java 1.4) but
maybe that's too ambitious.

Should we make "get me your TokenStream" (get/setAnalyzer) a part of
IndexAttribute?

Can a single FieldDescriptor be shared among many fields?  Seems like
we'd have to take name out of FieldDescriptor (I don't think the name
should be in FieldDescriptor, anyway).

Also how would we correspondingly fix FieldInfos to "generically"
store & merge attribute values?  (EG TermVectorAttribute's
isStoreOffsets/Positions get "merged" and changed whenever segments
are merged, or docs are added to RAM buffer).  Seems like each
attribute needs a write/read/merge?
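
One way to picture "each attribute needs a write/read/merge" is the sketch below. The interface and class names are hypothetical, and the merge rule (a flag ends up on if it was on in either source) is an assumption chosen for illustration, not a statement about how FieldInfos actually behaves.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical contract: every attribute knows how to persist and merge itself.
interface PersistentAttributeSketch<A> {
  void write(DataOutput out) throws IOException;

  void read(DataInput in) throws IOException;

  void merge(A other);
}

// Illustrative attribute holding two term vector flags.
final class TermVectorSettingsSketch
    implements PersistentAttributeSketch<TermVectorSettingsSketch> {
  boolean storeOffsets;
  boolean storePositions;

  TermVectorSettingsSketch(boolean storeOffsets, boolean storePositions) {
    this.storeOffsets = storeOffsets;
    this.storePositions = storePositions;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeBoolean(storeOffsets);
    out.writeBoolean(storePositions);
  }

  @Override
  public void read(DataInput in) throws IOException {
    storeOffsets = in.readBoolean();
    storePositions = in.readBoolean();
  }

  // Assumed rule: a flag is on after the merge if it was on in either source.
  @Override
  public void merge(TermVectorSettingsSketch other) {
    storeOffsets |= other.storeOffsets;
    storePositions |= other.storePositions;
  }
}
```

With such a contract, a generic FieldInfos-like store would not need to know about individual attribute types; it would just call write, read and merge on whatever attributes a field type carries.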

One thing I like about DocumentDescriptor is it can be the basis for
an app-level schema... we could eventually allow serializing/deserializing
(eg XML or JSON) the DocumentDescriptor.  In fact wouldn't
FieldInfos simply store a DocumentDescriptor (having been merged from
all the docs in that segment)?  It also may enable some speedups
during indexing eg I can imagine (future) having an indexing chain
that's provided the DocumentDescriptor it will handle, up front.

Can we maybe rename Descriptor -> Type?  Eg FieldDescriptor ->
FieldType?



[jira] Commented: (LUCENE-1597) New Document and Field API

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698443#action_12698443 ] 

Yonik Seeley commented on LUCENE-1597:
--------------------------------------

Separating FieldDescriptor and FieldValue sounds interesting... but I don't see the need for DocumentDescriptor, or the need to set it on the IndexWriter (and then have to have the distinction between fixed and variable fields).

What about something along the lines of
{code}
import java.util.Map;

class Field {
  FieldDescriptor descriptor;
  String fieldName;  // or alternately, the descriptor could contain the name
  FieldValue[] fieldValues;
  float boost;
}

class InputDocument {
  Map<String, Field> fields;  // keyed by field name; OR alternately: List<Field> fields
}
{code}


[jira] Commented: (LUCENE-1597) New Document and Field API

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703367#action_12703367 ] 

Michael Busch commented on LUCENE-1597:
---------------------------------------

Thanks for the thorough review, Mike. Reading your response made me really excited, because you understood exactly most of the thoughts I put into this code, without me even mentioning them :) Thanks for writing them down!

I've started incorporating your suggestions into my patch and will reply in more detail to your individual points as I work on them.


[jira] Updated: (LUCENE-1597) New Document and Field API

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-1597:
----------------------------------

    Attachment: lucene-new-doc-api.patch

You should start by looking at newdoc/demo/DocumentProducer.java. This class shows how a user of Lucene would add documents to a Lucene index with the new API.


[jira] Updated: (LUCENE-1597) New Document and Field API

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-1597:
----------------------------------

    Fix Version/s: 3.1


[jira] Commented: (LUCENE-1597) New Document and Field API

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703407#action_12703407 ] 

Michael Busch commented on LUCENE-1597:
---------------------------------------

{quote}
Can we maybe rename Descriptor -> Type? Eg FieldDescriptor ->
FieldType?
{quote}

Done.

{quote}
Can a single FieldDescriptor be shared among many fields? Seems like
we'd have to take name out of FieldDescriptor (I don't think the name
should be in FieldDescriptor, anyway).
{quote}

I agree, this should be possible. I removed the name.

{quote}
NumericFieldAttribute seems awkward (one shouldn't have to turn on/off
zero padding, trie; or rather it's better to operate in "use cases"
like "I want to do range filtering" or "I want to sort"). Seems like
maybe we need a SortAttribute and RangeFilterAttribute
(or... something).
{quote}

Yep I agree. Some things in this prototype are quite goofy, because I
mainly wanted to demonstrate the main ideas. The attributes you suggest
make sense to me.



[jira] Commented: (LUCENE-1597) New Document and Field API

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703391#action_12703391 ] 

Michael Busch commented on LUCENE-1597:
---------------------------------------

{quote}
How would you turn on/off [future] CSF storage? A separate attr? A
boolean on StoreAttribute?
{quote}

I was thinking about adding a separate attribute. But here is one
thing I haven't figured out yet: it should actually be perfectly fine
to store a value in a CSF and *also* in the 'normal' store. The
problem is that the type of data input is the limiting factor here: if
the user provides the data as a byte array, then everything works
fine. However, if the data is provide as a Reader, then it's not
guaranteed that the reader can be read more than once. To implement
reset() is optional, as the javadocs say.

So maybe we should state in our javadocs that a reader must support
reset(), otherwise writing the data into more than one data structures
will result in an undefined behavior? Alternatively we could introduce
a new class called ResetableReader, where reset() is abstract, and
change the API in 3.0 to only accept that type of reader?

Btw. the same is true for fields that provide the data as a
TokenStream. 
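
The ResetableReader alternative mentioned above could look roughly like the sketch below. Only the class name comes from the comment; the string-backed subclass and the drain() helper are assumptions added for illustration. Redeclaring reset() as abstract forces every subclass to support rewinding, so the same input can be consumed once for each data structure.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Sketch of the proposed class: reset() is redeclared abstract, so every
// subclass must be able to rewind to the start of its input.
abstract class ResetableReader extends Reader {
  @Override
  public abstract void reset() throws IOException;
}

// Hypothetical implementation backed by an in-memory string.
final class StringResetableReader extends ResetableReader {
  private final String text;
  private StringReader delegate;

  StringResetableReader(String text) {
    this.text = text;
    this.delegate = new StringReader(text);
  }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    return delegate.read(cbuf, off, len);
  }

  @Override
  public void reset() {
    // Rewind by starting over on the same text.
    delegate = new StringReader(text);
  }

  @Override
  public void close() {
    delegate.close();
  }
}

final class ReadTwiceSketch {
  // Reads the reader to the end and returns everything it produced.
  static String drain(Reader reader) throws IOException {
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[16];
    int n;
    while ((n = reader.read(buf, 0, buf.length)) != -1) {
      sb.append(buf, 0, n);
    }
    return sb.toString();
  }
}
```

A writer that needs the value twice (say, once for the normal store and once for a CSF) would call drain(), then reset(), then drain() again.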
