
Posted to dev@lucene.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2010/11/16 03:25:13 UTC

[jira] Created: (LUCENE-2766) ParallelReader should support getSequentialSubReaders if possible

ParallelReader should support getSequentialSubReaders if possible
-----------------------------------------------------------------

                 Key: LUCENE-2766
                 URL: https://issues.apache.org/jira/browse/LUCENE-2766
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
            Reporter: Andrzej Bialecki 


Applications that need to use ParallelReader can't currently use per-segment optimizations because getSequentialSubReaders returns null.

Considering the strict requirements on input indexes that ParallelReader already enforces, it's usually the case that the additional indexes are built with knowledge of the primary index, in order to keep the docIds synchronized. If that's the case, then it's conceivable that these indexes could be created with the same number of segments, which in turn would mean that their docIds are synchronized on a per-segment level. ParallelReader should detect such cases, and in getSequentialSubReaders it should return an array of ParallelReaders created from corresponding segments of the input indexes.
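
As a rough illustration of the detection step, here is a minimal sketch that does not use the Lucene API. It assumes each index is summarized as an array of per-segment maxDoc values; SegmentAlignment and isAligned are hypothetical names. Matching sizes are only a necessary condition: they cannot prove that the documents themselves correspond.

```java
import java.util.List;

/** Hypothetical helper (not Lucene API): checks per-segment alignment of parallel indexes. */
final class SegmentAlignment {
  private SegmentAlignment() {}

  /**
   * Each int[] holds the maxDoc of every segment of one index, in segment order.
   * Returns true only if all indexes have the same number of segments and the
   * j-th segments of all indexes have the same maxDoc.
   */
  static boolean isAligned(List<int[]> segmentMaxDocsPerIndex) {
    if (segmentMaxDocsPerIndex.isEmpty()) {
      return false; // nothing to align
    }
    int[] primary = segmentMaxDocsPerIndex.get(0);
    for (int[] other : segmentMaxDocsPerIndex) {
      if (other.length != primary.length) {
        return false; // different segment count
      }
      for (int j = 0; j < primary.length; j++) {
        if (other[j] != primary[j]) {
          return false; // segment j covers a different docId range
        }
      }
    }
    return true;
  }
}
```

When the check fails, the reader would presumably fall back to today's behavior and return null from getSequentialSubReaders.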

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2766) ParallelReader should support getSequentialSubReaders if possible

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932336#action_12932336 ] 

Mark Miller commented on LUCENE-2766:
-------------------------------------

And that's if you don't necessarily need to descend into a deep/non-standard reader graph - but one step at a time.


[jira] Commented: (LUCENE-2766) ParallelReader should support getSequentialSubReaders if possible

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932469#action_12932469 ] 

Yonik Seeley commented on LUCENE-2766:
--------------------------------------

The same merge policy would normally end up giving different results, since the data is different.
If you have the primary index, and are building an aux index, what you want is a policy that won't merge at all, but that you can manually flush at the end of every segment in the primary index.
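
To make the "flush at the end of every primary segment" idea concrete, here is a small sketch under stated assumptions. It only computes where the flushes would fall; AuxIndexFlushPlan and flushPoints are hypothetical names, and the actual no-merge policy and writer calls are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model (not Lucene API) of an aux index that mirrors the primary's segments. */
final class AuxIndexFlushPlan {
  private AuxIndexFlushPlan() {}

  /**
   * Given the doc count of each primary segment, returns the running totals of
   * documents after which the aux writer would be flushed. With merging
   * disabled, each flush would close one aux segment that lines up with one
   * primary segment.
   */
  static List<Integer> flushPoints(int[] primarySegmentDocCounts) {
    List<Integer> points = new ArrayList<>();
    int docsWritten = 0;
    for (int count : primarySegmentDocCounts) {
      docsWritten += count; // write all docs of this primary segment
      points.add(docsWritten); // then flush, ending the matching aux segment
    }
    return points;
  }
}
```

For primary segments of 3, 2 and 4 documents, the aux writer would flush after documents 3, 5 and 9.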


[jira] Updated: (LUCENE-2766) ParallelReader should support getSequentialSubReaders if possible

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated LUCENE-2766:
--------------------------------------

    Attachment: LUCENE-2766.patch

Patch and unit test that implement getSequentialSubReaders. The other part (a suitable MergePolicy) is left as an exercise for the reader for now ;)


[jira] Commented: (LUCENE-2766) ParallelReader should support getSequentialSubReaders if possible

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977379#action_12977379 ] 

Michael McCandless commented on LUCENE-2766:
--------------------------------------------

bq. Or reworded: is there any reason we shouldn't actually only support this way going forwards in the future?

+1 to requiring that PR only handle the sync'd case.

In fact... I think PR should only support atomic readers, and we can have sugar (static method) somewhere that can take N sync'd composite readers, get their subs, assert that they are "sync'd", and make a MultiReader of all the ParallelReaders against the aligned subs (basically the same as the patch here).

I.e., once we only support the sync'd case, I don't see why PR should also be an MR. We should just re-use MR for that and not duplicate code?
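
The shape of such a static helper might look roughly like the sketch below. It uses stand-in types rather than the real reader classes: Leaf plays the role of an atomic reader, ParallelLeaf the role of a ParallelReader over aligned subs, and the returned list (with its docBase offsets) the role of the MultiReader. All names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model (not Lucene API) of the proposed static sugar. */
final class ParallelLeaves {
  private ParallelLeaves() {}

  /** Stand-in for an atomic (single-segment) reader; only maxDoc matters here. */
  static final class Leaf {
    final String index;
    final int maxDoc;
    Leaf(String index, int maxDoc) { this.index = index; this.maxDoc = maxDoc; }
  }

  /** Stand-in for a ParallelReader over aligned leaves, plus its offset in the composite. */
  static final class ParallelLeaf {
    final List<Leaf> parts;
    final int docBase;
    ParallelLeaf(List<Leaf> parts, int docBase) { this.parts = parts; this.docBase = docBase; }
  }

  /** Takes the subs of N composite readers, checks they are sync'd, and groups them per segment. */
  static List<ParallelLeaf> wrap(List<List<Leaf>> composites) {
    if (composites.isEmpty()) {
      throw new IllegalArgumentException("no readers given");
    }
    List<Leaf> first = composites.get(0);
    for (List<Leaf> subs : composites) {
      if (subs.size() != first.size()) {
        throw new IllegalArgumentException("readers have different segment counts");
      }
    }
    List<ParallelLeaf> result = new ArrayList<>();
    int docBase = 0;
    for (int j = 0; j < first.size(); j++) {
      List<Leaf> parts = new ArrayList<>();
      for (List<Leaf> subs : composites) {
        Leaf leaf = subs.get(j);
        if (leaf.maxDoc != first.get(j).maxDoc) {
          throw new IllegalArgumentException("segment " + j + " is not sync'd");
        }
        parts.add(leaf);
      }
      result.add(new ParallelLeaf(parts, docBase));
      docBase += first.get(j).maxDoc; // the next segment starts after this one
    }
    return result;
  }
}
```

Keeping the per-segment grouping in a static helper leaves the parallel reader itself purely atomic, which is the split the comment argues for.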


[jira] Commented: (LUCENE-2766) ParallelReader should support getSequentialSubReaders if possible

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932471#action_12932471 ] 

Mark Miller commented on LUCENE-2766:
-------------------------------------

That's the other side of the coin though (the harder part, it would seem). It doesn't seem too difficult to add support to ParallelReader for getSequentialSubReaders for the right cases - the hard part is keeping the segments in your indexes synced up. But this issue seemed to treat that part separately.


[jira] Commented: (LUCENE-2766) ParallelReader should support getSequentialSubReaders if possible

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932326#action_12932326 ] 

Mark Miller commented on LUCENE-2766:
-------------------------------------

If you forget about detection to start, and put it on the user to declare they will keep segments in sync, then it's pretty simple, isn't it? Something like:

{code}
  // Sketch: assumes a user-set 'synchedSubReaders' flag next to the existing 'readers' list.
  public IndexReader[] getSequentialSubReaders() {
    if (!synchedSubReaders || readers.isEmpty()) {
      return null;
    }
    // Collect the segment readers of every parallel index up front.
    int numReaders = readers.size();
    IndexReader[][] subs = new IndexReader[numReaders][];
    for (int i = 0; i < numReaders; i++) {
      subs[i] = readers.get(i).getSequentialSubReaders();
      // Bail out if any input is atomic or has a different segment count.
      if (subs[i] == null || subs[i].length != subs[0].length) {
        return null;
      }
    }
    int segCnt = subs[0].length;
    IndexReader[] seqSubReaders = new IndexReader[segCnt];
    try {
      for (int j = 0; j < segCnt; j++) {
        // One ParallelReader per segment, joining the j-th segment of each index.
        ParallelReader pr = new ParallelReader();
        for (int i = 0; i < numReaders; i++) {
          pr.add(subs[i][j]);
        }
        seqSubReaders[j] = pr;
      }
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    return seqSubReaders;
  }
{code}


[jira] Commented: (LUCENE-2766) ParallelReader should support getSequentialSubReaders if possible

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932465#action_12932465 ] 

Andrzej Bialecki  commented on LUCENE-2766:
-------------------------------------------

Also, the process of creating secondary indexes needs to use the same merge policy, so that it arrives at segments with exactly the same count and same sequence of docIds...


[jira] Commented: (LUCENE-2766) ParallelReader should support getSequentialSubReaders if possible

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977376#action_12977376 ] 

Robert Muir commented on LUCENE-2766:
-------------------------------------

I'm gonna hold off on LUCENE-2771 until we figure this one out... because it would make your getSequentialSubReaders in the synced=true case quite heavy (without modifications).

This is because in that issue the norms caching is removed from the non-atomic readers
(Dir/MultiReader) and pushed onto SlowMultiReaderWrapper/ParallelReader.

So one idea is to fix ParallelReader so that it doesn't only 'sometimes' return something from
getSequentialSubReaders, but instead has two supported approaches: one that supports the 'synced'
case properly with per-segment search (and a suitable MergePolicy to go with it), and another
(deprecated) one to support the synced=false case?

Or reworded: is there any reason we shouldn't actually *only* support this way going forwards
in the future?



[jira] Commented: (LUCENE-2766) ParallelReader should support getSequentialSubReaders if possible

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977519#action_12977519 ] 

Robert Muir commented on LUCENE-2766:
-------------------------------------

bq. I'm gonna hold off on LUCENE-2771 until we figure this one out... because it would make your getSequentialSubReaders in the synced=true case quite heavy (without modifications).

Sorry, I was wrong on this... I totally forgot the norms cache is always lazy-loaded in that patch. I'll commit LUCENE-2771; it shouldn't affect this!
