You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Daniel Naber (JIRA)" <ji...@apache.org> on 2007/12/07 23:36:43 UTC

[jira] Created: (LUCENE-1084) increase default maxFieldLength?

increase default maxFieldLength?
--------------------------------

                 Key: LUCENE-1084
                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
    Affects Versions: 2.2
            Reporter: Daniel Naber


To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.

A better new default might be Integer.MAX_VALUE.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Resolved: (LUCENE-1084) increase default maxFieldLength?

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1084.
----------------------------------------

       Resolution: Fixed
    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Steven Rowe
>             Fix For: 2.4
>
>         Attachments: LUCENE-1084.part2.patch, LUCENE-1084.part2.take2.patch, LUCENE-1084.patch
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1084) increase default maxFieldLength?

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566728#action_12566728 ] 

Steven Rowe commented on LUCENE-1084:
-------------------------------------

Mike, I see you added a test for the user-specified max field length - cool.

It made me think of the IndexHTML.java usage, which should probably be changed to conform to the new style:

{code}
      writer = new IndexWriter(index, new StandardAnalyzer(), create, 
                               IndexWriter.MaxFieldLength.LIMITED);
      writer.setMaxFieldLength(1000000);
{code}

should be:

{code}
      writer = new IndexWriter(index, new StandardAnalyzer(), create, 
                               new IndexWriter.MaxFieldLength(1000000));
{code}

Hmm, now that I look, I can see several other new IndexWriter() / setMaxFieldLength() sequences that should be changed ... I'll submit a patch shortly.

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Steven Rowe
>             Fix For: 2.4
>
>         Attachments: LUCENE-1084.part2.patch, LUCENE-1084.part2.take2.patch, LUCENE-1084.patch
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Assigned: (LUCENE-1084) increase default maxFieldLength?

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-1084:
------------------------------------------

    Assignee: Michael McCandless

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1084) increase default maxFieldLength?

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559549#action_12559549 ] 

Michael McCandless commented on LUCENE-1084:
--------------------------------------------


{quote}
Does this break back-compatibility if we change this in a minor version?
{quote}

Technically it does, but I think (guesssing?) most of the time when
this limit kicks in, the application (if they knew about it) would
consider it a bug?  Ie, fixing this would almost always improve
applications?

I think most Lucene users are not aware that this truncation is
occurring (it's not something you'd expect).

Or, are we thinking that there are quite a few applications where
truly enormous documents are indexed and this change is going to cause
massive slowdown for such applications?


> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1084) increase default maxFieldLength?

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559887#action_12559887 ] 

Michael McCandless commented on LUCENE-1084:
--------------------------------------------

{quote}
This kind of limit is common on web search engines. It prevents really big pages that crawlers find causing indexing and search from blowing up (think a 100MB PDF that claims it is a text file). So changing it might indeed hurt folks who're indexing uncontrolled web content.
{quote}

OK, it seems like it's an important safeguard, and risky to change, so
let's wait for 3.0.

Maybe we could increase it from 10K --> 100K to reduce the times when
a legit document is truncated?

{quote}
An alternative to changing the default setting would be to not have a default - make it a required parameter to the IndexWriter constructor. That way, there is no silent loss (or gain) of content - the user must specify.
{quote}

I think this is a good idea; it basically forces the user to confront
the truncation issue up front.


> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1084) increase default maxFieldLength?

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1084:
---------------------------------------

    Fix Version/s: 2.4

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1084) increase default maxFieldLength?

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566791#action_12566791 ] 

Michael McCandless commented on LUCENE-1084:
--------------------------------------------

Good -- I'll commit.  Thanks!

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Steven Rowe
>             Fix For: 2.4
>
>         Attachments: LUCENE-1084.part2.patch, LUCENE-1084.part2.take2.patch, LUCENE-1084.part3.patch, LUCENE-1084.patch
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1084) increase default maxFieldLength?

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1084:
---------------------------------------

    Fix Version/s:     (was: 2.4)
                   3.0

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Michael McCandless
>             Fix For: 3.0
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1084) increase default maxFieldLength?

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1084:
--------------------------------

    Lucene Fields: [New, Patch Available]  (was: [New])

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Michael McCandless
>             Fix For: 3.0
>
>         Attachments: LUCENE-1084.patch
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1084) increase default maxFieldLength?

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566717#action_12566717 ] 

Michael McCandless commented on LUCENE-1084:
--------------------------------------------

This looks good!  Thanks Steven.  I'll commit.

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Steven Rowe
>             Fix For: 2.4
>
>         Attachments: LUCENE-1084.part2.patch, LUCENE-1084.part2.take2.patch, LUCENE-1084.patch
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1084) increase default maxFieldLength?

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1084:
--------------------------------

    Attachment: LUCENE-1084.part2.take2.patch

The javadoc description of the MaxFieldLength parameter (mfl) in the IndexWriter constructors is now "whether or not to limit field lengths".  

This should probably be rewritten in light of the change allowing user specification of values other than LIMITED or UNLIMITED.  How's this?:

{code}
/** ...
 * @param mfl Maximum field length: LIMITED, UNLIMITED, or user-specified
 *   via the MaxFieldLength constructor.
 * ...
 */
{code}

Attaching part2.take2 patch including this javadoc change and the MaxFieldLength extension and constructor changes.

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Steven Rowe
>             Fix For: 2.4
>
>         Attachments: LUCENE-1084.part2.patch, LUCENE-1084.part2.take2.patch, LUCENE-1084.patch
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1084) increase default maxFieldLength?

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1084:
--------------------------------

    Attachment: LUCENE-1084.part2.patch

bq. I think if we stopped subclassing from Parameter (we don't really need Serializable here?), and made the ctor public, then you could pass in "new MaxFieldLength(100000)" to make your own limit?

+1

Attaching a patch implementing this idea.

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Steven Rowe
>             Fix For: 2.4
>
>         Attachments: LUCENE-1084.part2.patch, LUCENE-1084.patch
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1084) increase default maxFieldLength?

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566624#action_12566624 ] 

Michael McCandless commented on LUCENE-1084:
--------------------------------------------

Yeah, the intention here was that you specify LIMITED or UNLIMITED enum parameter on creating the IW, and then change the actual limit using setMaxFieldLength.

But I agree it'd be cleaner if you could actually just set the limit (once) on construction.

I think if we stopped subclassing from Parameter (we don't really need Serializable here?), and made the ctor public, then you could pass in "new MaxFieldLength(100000)" to make your own limit?

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>         Attachments: LUCENE-1084.patch
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Resolved: (LUCENE-1084) increase default maxFieldLength?

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1084.
----------------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 3.0)
                   2.4
    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

I just committed this.  Thanks Daniel for opening it, and thanks Steven for the patch!

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>         Attachments: LUCENE-1084.patch
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1084) increase default maxFieldLength?

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560792#action_12560792 ] 

Michael McCandless commented on LUCENE-1084:
--------------------------------------------


This patch looks great; thanks Steven!

This forces users to confront, up front, when creating IndexWriter,
whether they want to truncate the fields.

It deprecates all existing IndexWriter ctors, noting that in 3.0 they
will be removed, in favor of the version that adds the MaxFieldLength
argument.

We can commit this now, and then removed the deprecated APIs in 3.0.
I think with this change it's OK to leave the default truncation limit
at 10,000?

I think we can leave the IndexHTML.java with that current large limit?

I plan to commit in a day or two!

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Michael McCandless
>             Fix For: 3.0
>
>         Attachments: LUCENE-1084.patch
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1084) increase default maxFieldLength?

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559730#action_12559730 ] 

Doug Cutting commented on LUCENE-1084:
--------------------------------------

This kind of limit is common on web search engines.  It prevents really big pages that crawlers find causing indexing and search from blowing up (think a 100MB PDF that claims it is a text file).  So changing it might indeed hurt folks who're indexing uncontrolled web content.


> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1084) increase default maxFieldLength?

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1084:
--------------------------------

    Attachment: LUCENE-1084.patch

Attaching a patch implementing my suggestion to add an explicit maximum field length parameter to IndexWriter constructors.  I named the inner class MaxFieldLength instead of FieldLength.

The patch deprecates the pre-existing constructors, and adds a new constructor with the added parameter for each of the pre-existing constructors.  As a result, the patch could be applied to the trunk as soon as 2.3 has been released; the deprecated pre-existing constructors would then be removed as part of the 3.0 release.

The patch also replaces all IndexWriter constructor calls in the tree with their equivalents taking the explicit max field length parameter.  (I excluded IndexModifier, because it is already scheduled for removal as part of the 3.0 release.)

I see that in o.a.l.demo.IndexHTML.java, the max field length is set significantly above the current default limit:

71:   writer = new IndexWriter(index, new StandardAnalyzer(), create);
72:   writer.setMaxFieldLength(1000000);

Should the above two lines be replaced with the following? (I did *not* do this in the attached patch):

   writer = new IndexWriter(index, new StandardAnalyzer(), create, 
       IndexWriter.MaxFieldLength.UNLIMITED);

That is, should the demo use Integer.MAX_VALUE instead of 1,000,000 for the maximum field length?

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Michael McCandless
>             Fix For: 3.0
>
>         Attachments: LUCENE-1084.patch
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1084) increase default maxFieldLength?

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559583#action_12559583 ] 

Grant Ingersoll commented on LUCENE-1084:
-----------------------------------------

I don't know the answer to those questions, I just know we faced similar issues with changing the StandardTokenizer.  It was clearly doing something wrong (incorrectly marking acronyms), yet we had to do the deprecation dance to handle it b/c there may be applications that rely on that functionality.  This one seems a little  less problematic, I agree.  There may, however, be applications out there that occasionally hit a really large document and don't want it to throw off their indexing, so they truncate it.

I think we ought to have a separate discussion on our back compatibility standards on java-dev, as I, personally, think we have some room for improvement when it comes to these kinds of things.   I will kick it off, as it has been something I have been mulling for a while.

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1084) increase default maxFieldLength?

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559748#action_12559748 ] 

Steven Rowe commented on LUCENE-1084:
-------------------------------------

An alternative to changing the default setting would be to not have a default - make it a required parameter to the IndexWriter constructor.  That way, there is no silent loss (or gain) of content - the user must specify.  

Maybe the type of this extra parameter could be an enumeration, along the lines of BooleanClause.Occur, e.g. IndexWriter.FieldLength.LIMITED and IndexWriter.FieldLength.UNLIMITED.  LIMITED would set the max to the current default, setMaxFieldLimit() would operate as it does currently, and UNLIMITED would set the max to Integer.MAX_VALUE.

This solution would definitely be a backward-incompatible change, and would have to wait for the 3.0 release.

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Reopened: (LUCENE-1084) increase default maxFieldLength?

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe reopened LUCENE-1084:
---------------------------------

         Assignee: Steven Rowe  (was: Michael McCandless)
    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Reopening to attach a patch to allow for (variable) user specification of maximum field length in IndexWriter constructors.

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Steven Rowe
>             Fix For: 2.4
>
>         Attachments: LUCENE-1084.patch
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1084) increase default maxFieldLength?

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566600#action_12566600 ] 

Grant Ingersoll commented on LUCENE-1084:
-----------------------------------------

Sorry for the lateness on this, was just looking at another patch that had the deprecated constructor and was going to replace it...

How, exactly, is one supposed to specify an alternate value for the MaxFieldLength in the constructor when using the IndexWriter.MaxFieldLength constructor when it is private and not extendable?  Or is it that I pass in a LIMITED or UNLIMITED and then immediately go call indexWriter.setMaxFieldLength()?  That doesn't make sense to me from a user's point of view of trying to understand how to set the maxFieldLength, so I feel like I must be missing something.

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>         Attachments: LUCENE-1084.patch
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1084) increase default maxFieldLength?

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559490#action_12559490 ] 

Grant Ingersoll commented on LUCENE-1084:
-----------------------------------------

Does this break back-compatibility if we change this in a minor version? 

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1084) increase default maxFieldLength?

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1084:
--------------------------------

    Attachment: LUCENE-1084.part3.patch

Patch replacing [IW constructor, MFL setter] sequences with IW constructor taking new'd MFL.

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>            Assignee: Steven Rowe
>             Fix For: 2.4
>
>         Attachments: LUCENE-1084.part2.patch, LUCENE-1084.part2.take2.patch, LUCENE-1084.part3.patch, LUCENE-1084.patch
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1084) increase default maxFieldLength?

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559440#action_12559440 ] 

Michael McCandless commented on LUCENE-1084:
--------------------------------------------

+1

Users frequently trip up on this one, and the fact that we silently truncate at 10,000 tokens is "unexpected".  I think the default should be Integer.MAX_VALUE.

How about doing this for 2.4?

> increase default maxFieldLength?
> --------------------------------
>
>                 Key: LUCENE-1084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Daniel Naber
>             Fix For: 2.4
>
>
> To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.
> A better new default might be Integer.MAX_VALUE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org