Posted to dev@lucene.apache.org by bu...@apache.org on 2003/04/08 22:00:12 UTC
DO NOT REPLY [Bug 18833] -
maxFieldLength design flaw: large documents silently truncated
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18833
maxFieldLength design flaw: large documents silently truncated
------- Additional Comments From cutting@apache.org 2003-04-08 20:00 -------
Silent truncation is fairly common in search engines. For example, Google
silently truncates pages whose HTML is longer than 100kB, around the same point
where Lucene truncates. Without such a limit, crawlers and file system walkers
would attempt to index things like gigantic log files, binaries, etc.
I do see your point, though: for some classes of use, where the set of documents
is tightly controlled and every single word must be indexed, this is a problem.
The workaround is simple, although perhaps not obvious.
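
The comment does not spell the workaround out. A minimal sketch of the idea,
assuming it means raising the per-field term cap: in Lucene releases of this
period that cap is, as far as I know, the public maxFieldLength field on
IndexWriter (default 10,000 terms), so the change would look something like
writer.maxFieldLength = Integer.MAX_VALUE. The class below is a hypothetical
stand-in, not Lucene code; it only imitates a term cap to show why the tail of
a long document disappears without any error and how a larger cap avoids that.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical stand-in for an indexer with a per-field term cap; not Lucene code.
class FieldLengthDemo {

    // Returns the terms that would be indexed: the first maxFieldLength
    // whitespace-separated terms of the text, with the rest silently dropped.
    static List<String> indexedTerms(String text, int maxFieldLength) {
        List<String> terms = new ArrayList<String>();
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens() && terms.size() < maxFieldLength) {
            terms.add(tokens.nextToken());
        }
        return terms;
    }

    public static void main(String[] args) {
        String doc = "alpha beta gamma delta epsilon";

        // With a small cap, "delta" and "epsilon" are dropped and nothing is reported.
        System.out.println(indexedTerms(doc, 3));

        // With the cap raised as far as it goes, every term is kept.
        System.out.println(indexedTerms(doc, Integer.MAX_VALUE));
    }
}
```

Raising the cap this way trades the protection against oversized inputs for
completeness, so it suits the tightly controlled document sets described above
rather than open-ended crawls.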
My concern with changing the default is that it would break all those folks who
depend on the current setting to keep their indexing from blowing up.