Posted to dev@lucene.apache.org by "Uwe Schindler (JIRA)" <ji...@apache.org> on 2009/01/15 23:45:59 UTC

[jira] Created: (LUCENE-1520) OOM errors with CheckIndex with indexes containing a lot of fields with norms

OOM errors with CheckIndex with indexes containing a lot of fields with norms
-----------------------------------------------------------------------------

                 Key: LUCENE-1520
                 URL: https://issues.apache.org/jira/browse/LUCENE-1520
             Project: Lucene - Java
          Issue Type: Bug
          Components: Index
    Affects Versions: 2.9
            Reporter: Uwe Schindler


All index readers (SegmentReader, MultiReader, MultiSegmentReader, ...) have a cache of the last used norms. This cache is never cleaned up, so once you access the norms of a field, that field's byte[maxDoc()] array is not freed until you close or reopen the index.

You can see this problem if you create an index with many fields with norms (I tested with about 4,000 fields) and many documents (half a million). CheckIndex calls norms() for each (!) field in the segment, and each of these calls creates a new cache entry, so you get OutOfMemoryErrors after a short time (I tested with the above index and was not able to run CheckIndex even with "-Xmx16g" on 64-bit Java).

CheckIndex opens and then tests each segment of an index with a separate SegmentReader. The big index with the OutOfMemory problem was optimized, so it consisted of a single segment with about half a million docs and about 4,000 fields. Each byte[] array takes about half a MiB for this index. The CheckIndex function requested the norms for all 4,000 fields and the SegmentReader cached them, which comes to about 2 GiB of RAM. So OOMs are not unusual.
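
The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the class name NormsMemoryEstimate is made up for illustration, and the round figures of 500,000 documents and 4,000 fields are assumptions standing in for the "about" values quoted in the report.

```java
import java.util.Locale;

public class NormsMemoryEstimate {
    /** One norm byte per document per field, so the cache holds maxDoc * fieldsWithNorms bytes. */
    public static long cachedNormBytes(long maxDoc, long fieldsWithNorms) {
        return maxDoc * fieldsWithNorms;
    }

    public static void main(String[] args) {
        long maxDoc = 500_000L; // assumed round value for "about half a million docs"
        long fields = 4_000L;   // assumed round value for "about 4,000 fields"

        double perFieldMiB = maxDoc / (1024.0 * 1024.0);
        double totalGiB = cachedNormBytes(maxDoc, fields) / (1024.0 * 1024.0 * 1024.0);

        // prints "per field: 0.48 MiB"
        System.out.println(String.format(Locale.ROOT, "per field: %.2f MiB", perFieldMiB));
        // prints "all fields: 1.86 GiB"
        System.out.println(String.format(Locale.ROOT, "all fields: %.2f GiB", totalGiB));
    }
}
```

With these round inputs the cached norms alone come to roughly 1.86 GiB, which matches the "about 2 GiB" figure quoted above.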

In my opinion, the best solution would be to use a WeakReference or, better, a SoftReference, so that norms.bytes becomes a java.lang.ref.SoftReference<byte[]> used for caching. With proper synchronization (which is already done on the norms cache in SegmentReader), SoftReference is the best fit, because softly reachable objects are cleared at the garbage collector's discretion in response to memory demand, and always before an OutOfMemoryError is thrown. If the byte[] array is freed (which only happens if no other references exist), a later call to getNorms() creates a new array. While code holds a hard reference to the norms array, it will not be freed, so that is no problem. The same could be done for the other IndexReaders.
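
A minimal sketch of the SoftReference idea follows. It is not the SegmentReader code: the class SoftNormsCache and the NormsLoader interface are hypothetical names, and the loader simply stands in for whatever reads a field's norms from the index.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class SoftNormsCache {
    /** Hypothetical stand-in for the code that reads one field's norms from disk. */
    public interface NormsLoader {
        byte[] load(String field);
    }

    private final Map<String, SoftReference<byte[]>> cache = new HashMap<>();
    private final NormsLoader loader;

    public SoftNormsCache(NormsLoader loader) {
        this.loader = loader;
    }

    /** Returns the cached array, or reloads it if the GC has cleared the soft reference. */
    public synchronized byte[] getNorms(String field) {
        SoftReference<byte[]> ref = cache.get(field);
        byte[] bytes = (ref == null) ? null : ref.get();
        if (bytes == null) {
            bytes = loader.load(field);
            cache.put(field, new SoftReference<>(bytes));
        }
        // The caller now holds a hard reference, so the array cannot be freed while in use.
        return bytes;
    }

    public static void main(String[] args) {
        int[] loads = {0};
        SoftNormsCache cache = new SoftNormsCache(field -> {
            loads[0]++;
            return new byte[1024]; // dummy data in place of real norms
        });
        byte[] first = cache.getNorms("title");
        byte[] second = cache.getNorms("title");
        System.out.println("same array: " + (first == second));
        System.out.println("loads: " + loads[0]);
    }
}
```

As long as the caller keeps the returned array, the second lookup returns the same instance without reloading. Once no hard reference remains, the collector is free to reclaim the array under memory pressure, and the next getNorms() call loads it again.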

Fields without norms do not have this problem, as all these fields share a one-time allocated dummy norm array. So the same index, with norms disabled for most of the fields, checked perfectly.

I will prepare a patch tomorrow.

Mike proposed another quick fix for CheckIndex:
> we could do something first specifically for CheckIndex (eg it could simply use the 3-arg non-caching bytes method instead) to prevent OOM errors when using it.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Resolved: (LUCENE-1520) OOM errors with CheckIndex with indexes containing a lot of fields with norms

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1520.
----------------------------------------

       Resolution: Fixed
    Fix Version/s: 2.9

Committed revision 734967.

Thanks Uwe!



[jira] Commented: (LUCENE-1520) OOM errors with CheckIndex with indexes containing a lot of fields with norms

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664480#action_12664480 ] 

Michael McCandless commented on LUCENE-1520:
--------------------------------------------

OK, even better -- I'll commit.



[jira] Updated: (LUCENE-1520) OOM errors with CheckIndex with indexes containing a lot of fields with norms

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1520:
----------------------------------

    Attachment: LUCENE-1520.patch

Here is another slightly improved patch. The byte[] is now allocated only once for all fields in CheckIndex. The length check is unnecessary, because the array is preallocated to maxDoc. I moved that check, slightly modified, so that it now compares the SegmentInfo docCount with the reader's maxDoc.
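
The buffer reuse described here might look roughly like the following. This is a sketch under assumptions rather than the attached patch: SharedNormsBuffer and the NormsReader interface are hypothetical, and only the shape of the uncached 3-arg norms call and the docCount versus maxDoc comparison are taken from the description.

```java
import java.util.Arrays;
import java.util.List;

public class SharedNormsBuffer {
    /** Hypothetical minimal reader; only the uncached 3-arg norms shape is modelled. */
    public interface NormsReader {
        int maxDoc();
        void norms(String field, byte[] bytes, int offset);
    }

    /** Reads the norms of every field into one shared buffer and returns the number of fields read. */
    public static int checkNorms(NormsReader reader, int segmentDocCount, List<String> fields) {
        // Compare the segment's recorded doc count with the reader once, up front.
        if (segmentDocCount != reader.maxDoc()) {
            throw new IllegalStateException(
                "docCount " + segmentDocCount + " != maxDoc " + reader.maxDoc());
        }
        // Single allocation sized to maxDoc, so no per-field length check is needed.
        byte[] b = new byte[reader.maxDoc()];
        int checked = 0;
        for (String field : fields) {
            reader.norms(field, b, 0); // overwrites the previous field's bytes
            checked++;
        }
        return checked;
    }

    public static void main(String[] args) {
        NormsReader reader = new NormsReader() {
            public int maxDoc() { return 8; }
            public void norms(String field, byte[] bytes, int offset) {
                // Dummy data in place of bytes read from the index.
                Arrays.fill(bytes, offset, offset + maxDoc(), (byte) field.length());
            }
        };
        System.out.println("fields checked: " + checkNorms(reader, 8, Arrays.asList("title", "body")));
    }
}
```

Peak memory for the norms test then stays at one maxDoc-sized array regardless of how many fields the segment has.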



[jira] Assigned: (LUCENE-1520) OOM errors with CheckIndex with indexes containing a lot of fields with norms

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-1520:
------------------------------------------

    Assignee: Michael McCandless



[jira] Commented: (LUCENE-1520) OOM errors with CheckIndex with indexes containing a lot of fields with norms

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664481#action_12664481 ] 

Michael McCandless commented on LUCENE-1520:
--------------------------------------------

Committed revision 734974.  Thanks!



[jira] Updated: (LUCENE-1520) OOM errors with CheckIndex with indexes containing a lot of fields with norms

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1520:
----------------------------------

    Attachment: LUCENE-1520.patch

Here is my last patch again, with the optimized memory usage, now against the current svn trunk after Mike's commit.



[jira] Updated: (LUCENE-1520) OOM errors with CheckIndex with indexes containing a lot of fields with norms

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1520:
----------------------------------

    Attachment: LUCENE-1520.patch

This is a patch implementing Mike's suggestion: it simply changes CheckIndex so that it no longer uses norms(fieldname), which caches, and instead uses the uncached 3-arg variant. TestCheckIndex passes.
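
The difference between the two variants can be illustrated with a toy reader. Everything here is a hypothetical stand-in: FakeReader is not Lucene's SegmentReader, it only mimics a caching norms(field) and an uncached norms(field, bytes, offset) so that the effect on the cache is visible.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UncachedNormsSketch {
    /** Hypothetical stand-in for a segment reader; it only mimics the two norms methods. */
    public static class FakeReader {
        private final int maxDoc;
        private final Map<String, byte[]> normsCache = new HashMap<>();

        public FakeReader(int maxDoc) { this.maxDoc = maxDoc; }

        public int maxDoc() { return maxDoc; }

        /** Caching variant: the array stays referenced by the reader. */
        public byte[] norms(String field) {
            byte[] cached = normsCache.get(field);
            if (cached == null) {
                cached = new byte[maxDoc];
                norms(field, cached, 0);
                normsCache.put(field, cached);
            }
            return cached;
        }

        /** Uncached 3-arg variant: fills a caller-owned array and keeps no reference to it. */
        public void norms(String field, byte[] bytes, int offset) {
            // Dummy data in place of bytes read from the index.
            Arrays.fill(bytes, offset, offset + maxDoc, (byte) 1);
        }

        public int cachedFieldCount() { return normsCache.size(); }
    }

    /** Touches every field's norms without populating the reader's cache. */
    public static void checkNormsUncached(FakeReader reader, List<String> fields) {
        for (String field : fields) {
            byte[] b = new byte[reader.maxDoc()]; // becomes garbage after this iteration
            reader.norms(field, b, 0);
        }
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("title", "body", "author");
        FakeReader reader = new FakeReader(8);

        checkNormsUncached(reader, fields);
        System.out.println("cached after uncached check: " + reader.cachedFieldCount()); // prints 0

        for (String field : fields) {
            reader.norms(field);
        }
        System.out.println("cached after norms(field): " + reader.cachedFieldCount()); // prints 3
    }
}
```

With the uncached variant each array is eligible for collection as soon as its field has been checked, so memory use no longer grows with the number of fields.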

No more OOM error with the many-field-index.



[jira] Commented: (LUCENE-1520) OOM errors with CheckIndex with indexes containing a lot of fields with norms

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664472#action_12664472 ] 

Michael McCandless commented on LUCENE-1520:
--------------------------------------------

Patch looks good; I'll commit shortly.
