Posted to java-user@lucene.apache.org by Shai Erera <se...@gmail.com> on 2014/05/01 10:28:20 UTC

Re: Fields, Index segments and docIds (second Try)

I'm glad it helped you. Good luck with the implementation.

One thing I didn't mention (though it's in the javadocs) -- it's not enough to
have the documents of each index aligned; you also have to have the
segments aligned. That is, if both indexes have documents 0-5 aligned, but
one index contains a single segment and the other contains two segments, that's
not going to work.

It is possible to do with some care -- when you build the German index,
disable merges (use NoMergePolicy) and flush whenever you have indexed enough
documents to match an existing segment in, e.g., the Common index.

Or, if rebuilding all indexes won't take long, you can always rebuild all
of them.
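A minimal sketch of that approach (assuming Lucene 4.x; analyzer, segmentSizes, allDocs and
buildGermanDoc() are placeholders, not actual API) could look like:

    // uses org.apache.lucene.index.*, org.apache.lucene.store.FSDirectory
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, analyzer);
    iwc.setMergePolicy(NoMergePolicy.COMPOUND_FILES); // NoMergePolicy.INSTANCE in newer releases
    iwc.setRAMBufferSizeMB(512); // large enough that Lucene never flushes on its own mid-batch
    IndexWriter germanWriter = new IndexWriter(FSDirectory.open(new File("index-de")), iwc);

    int row = 0;
    for (int segmentSize : segmentSizes) {      // per-segment doc counts of the Common index
        for (int n = 0; n < segmentSize; n++) {
            // return an empty Document for rows that have no German translation
            germanWriter.addDocument(buildGermanDoc(allDocs.get(row++)));
        }
        germanWriter.commit();                  // flush exactly one segment, mirroring Common
    }
    germanWriter.close();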

Shai


On Thu, May 1, 2014 at 12:00 AM, Olivier Binda <ol...@wanadoo.fr>wrote:

> On 04/30/2014 10:48 AM, Shai Erera wrote:
>
>> I hope I got all the details right, if I didn't then please clarify. Also,
>> I haven't read the entire thread, so if someone already suggested this ...
>> well, it probably means it's the right solution :)
>>
>> It sounds like you could use Lucene's ParallelCompositeReader, which
>> already handles multiple IndexReaders that are aligned by their internal
>> document IDs. The way it would work, as far as I understand your scenario
>> is something like the following table (columns denote different indexes).
>> Each index contains a subset of relevant fields, where common contains the
>> common fields, and each language index contains the respective language
>> fields.
>>
>> DocID        LuceneID  Common  English       German        ....
>> "FirstDoc"   0         A,B,C   EN_words,     DE_words,
>>                                 EN_sentences  DE_sentences
>> "SecondDoc"  1         A,B,C
>> "ThirdDoc"   2         A,B,C
>>
>> Each index can contain all relevant fields, or only a subset (e.g. maybe
>> not all documents have a value for the 'B' field in the 'common' index).
>> What's absolutely critical here, though, is that the indexes are
>> created very carefully: if e.g. SecondDoc is not translated into
>> German, *you must still add an empty document* to the German index, or
>> otherwise the document IDs will not align.
>>
>
> That's exactly how I saw it and what I need to do. So, I'll have a very
> good look at
>
> ParallelCompositeReader
>
>
>> Lucene does not offer a way to build those indexes though (patches
>> welcome!!).
>>
>
> This answers my question 1. Thanks. :)
> I somehow hoped that there was already support for that kind of situation
> in Lucene, but now at least I know that I won't find a ready-made solution to my
> problem in the Lucene classes and that I will have to code one myself,
> taking inspiration from the Lucene classes that do similar processing.
>
>> We started some effort a very long time ago on LUCENE-1879
>> (there's a patch and a discussion of an alternative approach), and there
>> is also a very useful suggestion in ParallelCompositeReader's javadocs (use
>> LogDocMergePolicy).
>>
>
> Wow, priceless. This gives me some headstart and inspiration. :)
>
>
>> One challenge is how to support multi-threaded indexing, but perhaps this
>> isn't a problem in your application? Since you write that a user will
>> "download the german index", it sounds like the indexes are built offline?
>>
> Indeed. The index is built offline, in a single thread, and once it is
> built, it is read-only.
> Can't find an easier situation. :)
>
>
>  Another challenge is how to control segment merging, so that the *exact
>> same segments* are merged over the parallel indexes. Again, if your
>> application builds the indexes offline, then this should be easier to
>> accomplish.
>>
>> I assume though that when you index e.g. the German documents, then the
>> already indexed 'common' fields do not change for a document. If they do,
>> you will need to rebuild the 'common' index too.
>>
>> Once you achieve a correct parallel index, it is very easy to open a
>> ParallelCompositeReader on any subset of the indexes, e.g. Common+English,
>> Common+German, or Common+English+German and search it, since the internal
>> document IDs are perfectly aligned.
>>
>> Shai
>>
>
> Many thanks for the awesome answer and the help (I love you).
> As I really, really need this to happen, I'm going to start working
> on this really soon.
>
> I'm definitely not an expert on threads, filesystems, or Lucene's inner
> workings, so I can't promise to contribute a miraculous patch,
> especially since I won't work on the multi-threading aspect of the problem.
> But I'll do the best I can and contribute back whatever code I can produce.
>
> Many thanks, again. :)
>
>>
>>
>> On Wed, Apr 30, 2014 at 7:07 AM, Jose Carlos Canova <
>> jose.carlos.canova@gmail.com> wrote:
>>
>>>  My suggestion is that you not worry about the docId. In practice it is an
>>> "internal Lucene" id, quite similar to a rowId in a database: each index
>>> may generate a different docId (that is its own concern) for a translated
>>> document. You can use your own ID that relates one document to another
>>> across different indexes, mainly because, as you mention, these are
>>> translated documents that in theory can be ranked differently from
>>> language to language (there is no guarantee that the same set of documents
>>> in different languages spans the same rank order, though I am not 100%
>>> sure about this).
>>>
>>> The second reason is that the internal structure of Lucene may change
>>> without warning, and then you lose forward compatibility.
>>>
>>> I am not an expert on Lucene like Schindler, but from reading the
>>> documentation I understood that they pay special attention to
>>> "internal" and "experimental" APIs, which means internal is "not
>>> guaranteed compatible" and experimental "may be removed".
>>>
>>> For example, if they (apache-lucene) discover a "new manner" of relating
>>> documents that is more efficient and change some mechanism, and your
>>> application uses an internal mechanism that is highly coupled to Lucene
>>> version xxx (marked as "internal"), you can get stuck on a specific
>>> version and later have to rewrite some code, and this might
>>> cause some "management conflict" if your project follows continuous
>>> integration and you are subordinate to a management structure (bad for
>>> you).
>>>
>>> I have seen this on several projects that use Lucene: they do not upgrade
>>> their Lucene components in their new releases. One example, if I am not
>>> mistaken, still uses Lucene 3, and another that I saw around (e.g. Luke)
>>> effectively says "the project was abandoned because the way it integrated
>>> with Lucene was not fully functional".
>>>
>>> Another interesting thing is that developing around (rather than inside)
>>> Lucene is more effective: you guarantee that your product works, and they
>>> guarantee that Lucene works too. This is related to design by contract.
>>>
>>> Regards.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Apr 29, 2014 at 7:11 PM, Olivier Binda <olivier.binda@wanadoo.fr
>>>
>>>> wrote:
>>>> Hello.
>>>>
>>>> Sorry to bring this up again. I don't want to be rude and I mean no
>>>> disrespect, but after thinking it through today,
>>>> I need to, and would really love to, have the answer to the following
>>>> question:
>>>>
>>>> 1) At Lucene indexing time, is it possible to rewrite a read-only index so
>>>> that some fields are only found in some segments (and how)?
>>>>
>>>>
>>>> Uwe Schindler suggested using different indexes and a MultiReader for my
>>>> needs, and it probably answers my second question, better formulated as
>>>> "Is it possible to restrict an index to some of its segments?", as a
>>>> CompositeReader with AtomicReaders (or a custom Directory) that read the
>>>> aforementioned segments might do the trick.
>>>>
>>>> Yet, if I am not mistaken (please tell me if I am wrong), it doesn't solve
>>>> my needs, as I have around 300000 documents of the following kind:
>>>>
>>>> READ ONLY Document :
>>>> // common fields shipped with the App that aren't language related
>>>> A:
>>>> B:
>>>> C:
>>>> // fields shipped with the English package (a zip)
>>>> EN:
>>>> EN_Words:
>>>> EN_Sentences:
>>>> some DocValues
>>>> // fields shipped with the German package (a zip)
>>>> DE:
>>>> DE_Words:
>>>> DE_Sentences:
>>>> some DocValues
>>>> ...
>>>> There might be hundreds of language packages that my users might use
>>>>
>>>>
>>>> If I use different indexes
>>>> indexA for the common stuff,
>>>> indexEN for the English package,
>>>> indexDE for the german package,
>>>>
>>>> For sure, I will be able to make a big index out of those by using a
>>>> MultiReader,
>>>> BUT that really makes a union out of the three indexes (right?), which
>>>> means
>>>> I'll have 900000 documents,
>>>> and the documents in indexA won't have any relation to the documents
>>>> in indexEN (right?), unless I give each document an id in each index and
>>>> make a join at query time, which is a big no-no, because I use a
>>>> QueryParser
>>>> and users may enter queries like "A:gah AND (DE:schlaffen OR EN:sleep)".
>>>>
>>>> Or am I mistaken, and there is a way to create a document in three
>>>> different indexes that stay related through the same docId?
>>>>
>>>>
>>>> My solution if question 1 is possible :
>>>>
>>>> In contrast, if I am able to build my index so that my READ ONLY
>>>> Document
>>>> are stored in
>>>>
>>>> SEGMENT 1
>>>> // common fields shipped with the App that aren't language related
>>>> A:
>>>> B:
>>>> C:
>>>>
>>>> SEGMENT 2
>>>> // fields shipped with the English package (a zip)
>>>> EN:
>>>> EN_Words:
>>>> EN_Sentences:
>>>> some DocValues
>>>>
>>>> SEGMENT 3
>>>> // fields shipped with the German package (a zip)
>>>> DE:
>>>> DE_Words:
>>>> DE_Sentences:
>>>> some DocValues
>>>>
>>>>
>>>> I only need to ship SEGMENT 1 in the App and let users download SEGMENT 2
>>>> or SEGMENT 3 depending on whether they want English or German,
>>>> and use a composite reader with atomic readers (right?) to query my
>>>> Frankenstein index with a QueryParser.
>>>>
>>>>
>>>> Also, in case question 1 is possible, I would really like to know
>>>> whether it is possible to remap docIds at build time in a read-only index.
>>>> An application of this would be:
>>>>
>>>> On day 1, I ship my app with 2 language packages: English and German
>>>> (documents are uniquely identified by a docId... or by an external id,
>>>> thanks to a docId <-> external id map).
>>>>
>>>> On day 2, I ship an additional language package (French), because I'm
>>>> able to build an index with English, German, and French with the exact
>>>> same docIds
>>>> for each document as the index shipped on day 1.
>>>>
>>>>
>>>>

Some feedback on parallel index building (Fields, Index segments and docIds)

Posted by Olivier Binda <ol...@wanadoo.fr>.
Some feedback (it might be useful for other users):

I have experimented a bit and it seems that I have been able to build a
parallel index for my use case
(9 different indexes, with docIds in sync, with only 1 segment each).

I had to configure the IndexWriterConfig of all my IndexWriters with
setRAMBufferSizeMB(...)
setMaxBufferedDocs(...)
so that everything is first built in RAM, adding a document (sometimes
empty, to complete a whole row) to every index (the columns),
and then I did a forceMerge(1, true) on each IndexWriter, followed by close().
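In code, the configuration described above looks roughly like this (a sketch, Lucene 4.x;
the analyzer and the directory path are placeholders):

    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, analyzer);
    iwc.setRAMBufferSizeMB(512);        // keep all added documents buffered in RAM ...
    iwc.setMaxBufferedDocs(1000000);    // ... so no intermediate segment gets flushed
    IndexWriter writer = new IndexWriter(FSDirectory.open(new File("index-en")), iwc);

    // addDocument() for every row here, using an empty Document where a row has nothing
    // to contribute to this particular index

    writer.forceMerge(1, true);         // collapse everything into a single segment
    writer.close();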


To test that it was OK, I had added a DocValues field to each document:

// pseudo-code: every row gets the same "added" ordinal in every index
docAddedOrd = 0L
...
++docAddedOrd                  // next row
for (index in indexes) {       // one writer per index (the columns)
    document = Document()
    document.add(NumericDocValuesField("docAddedOrd", docAddedOrd))
    ...                        // plus the index-specific fields, if any
}


And then I checked if the docId was equal to the docValue
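A sketch of such a check (Lucene 4.x API; the path is a placeholder, and since the ordinal
above starts at 1 it should equal the global docId + 1):

    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File("index-common")));
    for (AtomicReaderContext ctx : reader.leaves()) {
        NumericDocValues ords = ctx.reader().getNumericDocValues("docAddedOrd");
        if (ords == null) continue;                  // field missing in this segment
        for (int doc = 0; doc < ctx.reader().maxDoc(); doc++) {
            long expected = ctx.docBase + doc + 1;   // ordinal was incremented before adding
            if (ords.get(doc) != expected) {
                System.out.println("out of sync at global docId " + (ctx.docBase + doc));
            }
        }
    }
    reader.close();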



I had less success without the calls to setRAMBufferSizeMB() and
setMaxBufferedDocs():
I managed to build some small indexes with LogDocMergePolicy, but as
soon as the index got too big,
the docIds went out of sync (merges probably happened and shuffled the
docIds).

I tried to commit() -> it made it worse.
LogByteSizeMergePolicy, NoMergePolicy -> didn't fix it.


There. Now that I'm able to build a parallel index, I'll check if I can
read it with a parallel reader.
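For reference, opening the aligned indexes together should then be as simple as this
(a sketch, Lucene 4.x; paths and analyzer are placeholders):

    DirectoryReader common = DirectoryReader.open(FSDirectory.open(new File("index-common")));
    DirectoryReader german = DirectoryReader.open(FSDirectory.open(new File("index-de")));

    // only valid if both indexes have the same documents *and* the same segment structure
    ParallelCompositeReader parallel = new ParallelCompositeReader(common, german);
    IndexSearcher searcher = new IndexSearcher(parallel);

    QueryParser parser = new QueryParser(Version.LUCENE_48, "A", analyzer);
    TopDocs hits = searcher.search(parser.parse("A:gah AND DE:schlaffen"), 10);

    parallel.close();   // by default this also closes the two wrapped readers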


Best regards,
Olivier



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Fields, Index segments and docIds (second Try)

Posted by Shai Erera <se...@gmail.com>.
I don't think that you need to be concerned with the internal docIDs much.
Just imagine the indexes as a big table with multiple columns, where
columns are grouped together. Each group is a different index. If a
document does not have a value in one column, then you have an empty cell.
If a document doesn't have a value in an entire group of columns, then you
denote that by adding an empty document.
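A sketch of that row-by-row idea (the writers, the source collection and the buildXxxDoc()
helpers are placeholders, not Lucene API):

    for (SourceRow row : allRows) {                    // same order for every index
        commonWriter.addDocument(buildCommonDoc(row));
        for (Map.Entry<String, IndexWriter> e : languageWriters.entrySet()) {
            Document doc = buildLanguageDoc(row, e.getKey());
            if (doc == null) {
                doc = new Document();                  // empty document keeps the docIds aligned
            }
            e.getValue().addDocument(doc);
        }
    }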

Oh, and make sure to use a LogMergePolicy, so segments are merged in the
same order across all indexes.
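If merging stays enabled, that is something like the following (a sketch; the analyzer is
a placeholder):

    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, analyzer);
    iwc.setMergePolicy(new LogDocMergePolicy());   // merges adjacent segments, chosen by doc count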

And given that you rebuild the indexes every time, you can create them
one-by-one. You don't need to do that in parallel to all indexes, unless
it's more convenient for you.

Shai



Re: Fields, Index segments and docIds (second Try)

Posted by Olivier Binda <ol...@wanadoo.fr>.
On 05/02/2014 06:05 AM, Shai Erera wrote:
> If you're always rebuilding, let alone forceMerge, you shouldn't have too
> much trouble implementing it. Just make sure that you add documents in the
> same order to all indexes.
>
> If you're always rebuilding, how come you have deletions? Anyway, you must
> also delete in all indexes.

Indeed, I don't have deletions and I'm mainly concerned with merges.
But I just want to understand the whole docId remapping process,
out of curiosity and also because obtaining a docId (and not losing it)
seems to be the key to parallel indexes.

> On May 2, 2014 1:57 AM, "Olivier Binda" <ol...@wanadoo.fr> wrote:
>
>> On 05/01/2014 10:28 AM, Shai Erera wrote:
>>
>>> I'm glad it helped you. Good luck with the implementation.
>>>
>> Thanks. First, I started looking at the Lucene internal code to understand
>> when/where and why docIds change or need to be changed (in merges and doc
>> deletions).
>> I have always wanted to understand this, and I think the understanding may
>> help me somehow.
>>
>>> One thing I didn't mention (though it's in the jdocs) -- it's not enough
>>> to
>>> have the documents of each index aligned, you also have to have the
>>> segments aligned. That is, if both indexes have documents 0-5 aligned, but
>>> one index contains a single segment and the other one 2 segments, that's
>>> not going to work.
>>>
>> That's good to know.
>>
>>   It is possible to do w/ some care -- when you build the German index,
>>> disable merges (use NoMergePolicy) and flush whenever you indexed enough
>>> documents to match an existing segment on e.g. the Common index.
>>>
>>> Or, if rebuilding all indexes won't take long, you can always rebuild all
>>> of them.
>>>
>> Yes. That's what I am usually doing (it takes less than 1 minute).
>> Yet, I usually do a forceMerge too, to end up with only 1 segment :/
>>
>>   Shai
>>>
>>> On Thu, May 1, 2014 at 12:00 AM, Olivier Binda <ol...@wanadoo.fr>
>>> wrote:
>>>
>>>   On 04/30/2014 10:48 AM, Shai Erera wrote:
>>>>   I hope I got all the details right, if I didn't then please clarify.
>>>>> Also,
>>>>> I haven't read the entire thread, so if someone already suggested this
>>>>> ...
>>>>> well, it probably means it's the right solution :)
>>>>>
>>>>> It sounds like you could use Lucene's ParallelCompositeReader, which
>>>>> already handles multiple IndexReaders that are aligned by their internal
>>>>> document IDs. The way it would work, as far as I understand your
>>>>> scenario
>>>>> is something like the following table (columns denote different
>>>>> indexes).
>>>>> Each index contains a subset of relevant fields, where common contains
>>>>> the
>>>>> common fields, and each language index contains the respective language
>>>>> fields.
>>>>>
>>>>> DocID        LuceneID  Common  English       German        ....
>>>>> "FirstDoc"   0         A,B,C   EN_words,     DE_words,
>>>>>                                   EN_sentences  DE_sentences
>>>>> "SecondDoc"  1         A,B,C
>>>>> "ThirdDoc"   2         A,B,C
>>>>>
>>>>> Each index can contain all relevant fields, or only a subset (e.g. maybe
>>>>> not all documents have a value for the 'B' field in the 'common' index).
>>>>> What's absolutely very important here though is that the indexes are
>>>>> created very carefully, and if e.g. SecondDoc is not translated into
>>>>> German, *you must still have an empty document* in the German index, or
>>>>> otherwise, document IDs will not align.
>>>>>
>>>>>   That's exactly how I saw it and what I need to do. So, I'll have a very
>>>> good look at
>>>>
>>>> ParallelCompositeReader
>>>>
>>>>
>>>>   Lucene does not offer a way to build those indexes though (patches
>>>>> welcome!!).
>>>>>
>>>>>   This answers my question 1. Thanks.  :)
>>>> I somehow hoped that there was already support for that kind of situation
>>>> in lucene but well,
>>>> now at least I know that I won't find an already made solution to my
>>>> problem in the lucene classes and that I will have to code one myself,
>>>> by taking inspiration in the lucene classes that do similar processing.
>>>>
>>>>   We've started some effort very long time ago on LUCENE-1879
>>>>> (there's a patch and a discussion for an alternative approach) as well
>>>>> as
>>>>> there is a very useful suggestion in ParallelCompositeReader's jdocs
>>>>> (use
>>>>> LogDocMergePolicy).
>>>>>
>>>>>   Wow, priceless. This gives me some headstart and inspiration. :)
>>>>
>>>>   One challenge is how to support multi-threaded indexing, but perhaps
>>>>> this
>>>>> isn't a problem in your application? It sounds like, by you writing
>>>>> that a
>>>>> user will "download the german index", that the indexes are built
>>>>> offline?
>>>>>
>>>>>   Indeed. The index is built offline, in a single thread, and once it is
>>>> built, it is read only.
>>>> Cant find an easier situation. :)
>>>>
>>>>
>>>>    Another challenge is how to control segment merging, so that the *exact
>>>>
>>>>> same segments* are merged over the parallel indexes. Again, if your
>>>>> application builds the indexes offline, then this should be easier to
>>>>> accomplish.
>>>>>
>>>>> I assume though that when you index e.g. the German documents, then the
>>>>> already indexes 'common' fields do not change for a document. If they
>>>>> do,
>>>>> you will need to rebuild the 'common' index too.
>>>>>
>>>>> Once you achieve a correct parallel index, it is very easy to open a
>>>>> ParallelCompositeReader on any subset of the indexes, e.g.
>>>>> Common+English,
>>>>> Common+German, or Common+English+German and search it, since the
>>>>> internal
>>>>> document IDs are perfectly aligned.
>>>>>
>>>>> Shai
>>>>>
>>>>>   Many thanks for the awesome answer and the help (I love you).
>>>> As I really really really need this to happen, I'm going to start working
>>>> on this really soon.
>>>>
>>>> I'm definately not an expert on threads/filesystems/and lucene inner
>>>> workings, so I can't promise to contribute a miracoulous patch though.
>>>> Especially since I won't work on the muli-thread aspect of the problem.
>>>> But I'll do the best I can and contribute back whatever code I can
>>>> produce.
>>>>
>>>> Many thanks, again. :)
>>>>
>>>>
>>>>> On Wed, Apr 30, 2014 at 7:07 AM, Jose Carlos Canova <
>>>>> jose.carlos.canova@gmail.com> wrote:
>>>>>
>>>>>    My suggestion is you not worry about the docId, in practice it is an
>>>>>
>>>>>> "internal lucene" id, quite similar with a rowId on a database, each
>>>>>> index
>>>>>> may generate a different docId (it is their problem) from a translated
>>>>>> document, you may use your own ID that relates one document to another
>>>>>> on
>>>>>> different index mainly because like you mention are translated
>>>>>> documents
>>>>>> that on theory can be ranked differently from language to language (it
>>>>>> is
>>>>>> not an obligation that a set of documents on different languages spams
>>>>>> the
>>>>>> same rank order but i am not 100% sure about this),
>>>>>>
>>>>>> Second reason is that 'they may change the internal structure of lucene
>>>>>> without warrant', and then you lose the forward compatibility.
>>>>>>
>>>>>> I am not an expert on Lucene like Schindler, but reading their
>>>>>> documentation understood that they have a special attention on
>>>>>> "internal lucene" and "experimental lucene" which means internal is
>>>>>> "non
>>>>>> warrant compatible", and experimental "may be removed".
>>>>>>
>>>>>> For example they (apache-lucene) discover a "new manner" to relate each
>>>>>> document that is more efficient and change some mechanism, then your
>>>>>> application uses an internal mechanism that is high coupled with lucene
>>>>>> version xxx (marked as "internal-lucene") you can stuck on a specific
>>>>>> version and   on future have to rewrite some code because and this
>>>>>> might
>>>>>> cause some "management conflict" if your project follows a continuous
>>>>>> integration and you are subordinated on a management structure (bad to
>>>>>> you).
>>>>>>
>>>>>> I saw this on several projects that uses Lucene around they do not
>>>>>> upgrade
>>>>>> their lucene components on their new releases one example if i am not
>>>>>> wrong
>>>>>> still uses Lucene 3 and other that i saw around (e.g. Luke) which means
>>>>>> that "The project was abandoned because the manner how they integrate
>>>>>> with
>>>>>> Lucene was not fully functional".
>>>>>>
>>>>>> Another interesting thing is that developing around Lucene is more
>>>>>> effective, you guarantee that your product will work and they guarantee
>>>>>> that Lucene works too. This is related with design by contract.
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 29, 2014 at 7:11 PM, Olivier Binda <
>>>>>> olivier.binda@wanadoo.fr
>>>>>>
>>>>>>   wrote:
>>>>>>> Hello.
>>>>>>>
>>>>>>> Sorry to bring this up again. I don't want to be rudeand I mean no
>>>>>>> disrespect, but after thinking it through today,
>>>>>>> I need to and would really love to have the answer to the following
>>>>>>> question :
>>>>>>>
>>>>>>> 1) At lucene indexing time, is it possible to rewrite a read-only
>>>>>>> index
>>>>>>>
>>>>>>>   so
>>>>>>   that some fields are only found in some segments (and how ?)
>>>>>>>
>>>>>>> Uwe Schindler suggested using different index and a MultiReader for my
>>>>>>> needs and It probably answers my second question, better formulated as
>>>>>>>
>>>>>>>   "Is
>>>>>>   it possible to restrict  an index to some of it's segments ? " as a
>>>>>>> CompositeReader with AtomicReaders (or a custom Directory) that read
>>>>>>> the
>>>>>>> aforementioned segments might do the trick
>>>>>>>
>>>>>>> Yet, if I am not mistaken (please tell me if I am wrong), it doesn't
>>>>>>>
>>>>>>>   solve
>>>>>>   my needs as I have around 300000 documents of the following kind :
>>>>>>> READ ONLY Document :
>>>>>>> // common fields shipped with the App that aren't language related
>>>>>>> A:
>>>>>>> B:
>>>>>>> C:
>>>>>>> // fields shipped with the English package (a zip)
>>>>>>> EN:
>>>>>>> EN_Words:
>>>>>>> EN_Sentences:
>>>>>>> some DocValues
>>>>>>> // fields shipped with the German package (a zip)
>>>>>>> DE:
>>>>>>> DE_Words:
>>>>>>> DE_Sentences:
>>>>>>> some DocValues
>>>>>>> ...
>>>>>>> There might be hundreds of language package that my users might use
>>>>>>>
>>>>>>>
>>>>>>> If I use different indexes
>>>>>>> indexA for the common stuff,
>>>>>>> indexEN for the English package,
>>>>>>> indexDE for the german package,
>>>>>>>
>>>>>>> For sure, I will be able to make a big index out of those by using a
>>>>>>> MultiReader
>>>>>>> BUT it really makes an union out of the three index (right ?) which
>>>>>>> means
>>>>>>> I'll have 900000 documents
>>>>>>> and the documents in the indexA won't have any relations to the
>>>>>>> documents
>>>>>>> in indexEN (right ?) except if I give each document an id in each
>>>>>>> index
>>>>>>>
>>>>>>>   and
>>>>>>   make a join at query time which is a big no no, because I use a
>>>>>>>   queryParser
>>>>>>   and users may enter queries like "A:gah AND (DE:schlaffen OR
>>>>>>> EN:sleep)"
>>>>>>>
>>>>>>> Or I am mistaken and there is a way to create a document in three
>>>>>>> different index that stay in relations with the same docId ?
>>>>>>>
>>>>>>>
>>>>>>> My solution if question 1 is possible :
>>>>>>>
>>>>>>> In contrast, if I am able to build my index so that my READ ONLY
>>>>>>> Document
>>>>>>> are stored in
>>>>>>>
>>>>>>> SEGMENT 1
>>>>>>> // common fields shipped with the App that aren't language related
>>>>>>> A:
>>>>>>> B:
>>>>>>> C:
>>>>>>>
>>>>>>> SEGMENT 2
>>>>>>> // fields shipped with the English package (a zip)
>>>>>>> EN:
>>>>>>> EN_Words:
>>>>>>> EN_Sentences:
>>>>>>> some DocValues
>>>>>>>
>>>>>>> SEGMENT 3
>>>>>>> // fields shipped with the German package (a zip)
>>>>>>> DE:
>>>>>>> DE_Words:
>>>>>>> DE_Sentences:
>>>>>>> some DocValues
>>>>>>>
>>>>>>>
>>>>>>> I only need to ship SEGMENT 1 in the App and let users download
>>>>>>> SEGMENT
>>>>>>> 2
>>>>>>> or SEGMENT 3 whether they want english or german
>>>>>>> and use a composite reader with atomic readers (right ?) to use my
>>>>>>> frankenstein index at query time with a queryparser
>>>>>>>
>>>>>>>
>>>>>>> Also, In case question 1 is possible. I would really like to know too,
>>>>>>> if
>>>>>>> it is possible to remap at build time docIds in a read-only index.
>>>>>>> An application of this would be :
>>>>>>>
>>>>>>> At day 1, I ship my app with 2 language packages : English and
>>>>>>> German
>>>>>>> (documents are uniquely identified by a docId... or by an external id
>>>>>>> (thanks to a docId<-> external id map)
>>>>>>>
>>>>>>> At day 2, I ship an additional language package (French) because I'm
>>>>>>> able
>>>>>>> to build an index with English, German, French with the same exact
>>>>>>> docIds
>>>>>>> for each document that the index shipped at day 1
>>>>>>>
>>>>>>>
>>>>>>>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Fields, Index segments and docIds (second Try)

Posted by Shai Erera <se...@gmail.com>.
If you're always rebuilding, let alone forceMerge, you shouldn't have too
much trouble implementing it. Just make sure that you add documents in the
same order to all indexes.

If you're always rebuilding, how come you have deletions? Anyway, you must
also delete in all indexes.
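
For example, here is a minimal sketch of building the Common and German indexes
in lockstep (Lucene 4.x APIs assumed; the directory paths and sample data are
hypothetical). One document is added to every writer per source record, in the
same order, and an empty placeholder document is added where a translation is
missing, so the internal docIds stay aligned:

import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogDocMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class BuildParallelIndexes {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
    Directory commonDir = FSDirectory.open(new File("common-index"));
    Directory deDir = FSDirectory.open(new File("de-index"));
    IndexWriter commonWriter = new IndexWriter(commonDir, config(analyzer));
    IndexWriter deWriter = new IndexWriter(deDir, config(analyzer));

    // Hypothetical source data: { common text, German text or null }.
    String[][] rows = { { "gah one", "schlafen eins" }, { "gah two", null } };

    for (String[] row : rows) {                    // identical order for every index
      Document common = new Document();
      common.add(new TextField("A", row[0], Field.Store.NO));
      commonWriter.addDocument(common);

      Document de = new Document();                // add an empty placeholder document
      if (row[1] != null) {                        // when there is no German text, so
        de.add(new TextField("DE", row[1], Field.Store.NO));  // docIds keep lining up
      }
      deWriter.addDocument(de);
    }
    // Any deletion would likewise have to be applied to every index.
    commonWriter.forceMerge(1);                    // one segment per index, so the
    deWriter.forceMerge(1);                        // segment structures trivially match
    commonWriter.close();
    deWriter.close();
  }

  private static IndexWriterConfig config(Analyzer analyzer) {
    IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_47, analyzer);
    cfg.setMergePolicy(new LogDocMergePolicy());   // deterministic, doc-count-based merges
    return cfg;
  }
}

If the indexes are not force-merged down to a single segment, the flushes and
merges have to be controlled (e.g. LogDocMergePolicy, or NoMergePolicy plus
manual flushes) so that every index ends up with the same segment boundaries.
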
On May 2, 2014 1:57 AM, "Olivier Binda" <ol...@wanadoo.fr> wrote:

> On 05/01/2014 10:28 AM, Shai Erera wrote:
>
>> I'm glad it helped you. Good luck with the implementation.
>>
>
> Thanks. First I started looking at the Lucene internal code, to understand
> when/where and why docIds change / need to be changed (in merges and doc
> deletions).
> I have always wanted to understand this and I think the understanding may
> help me somehow.
>
>>
>> One thing I didn't mention (though it's in the jdocs) -- it's not enough
>> to
>> have the documents of each index aligned, you also have to have the
>> segments aligned. That is, if both indexes have documents 0-5 aligned, but
>> one index contains a single segment and the other one 2 segments, that's
>> not going to work.
>>
>
> That's good to know.
>
>  It is possible to do w/ some care -- when you build the German index,
>> disable merges (use NoMergePolicy) and flush whenever you indexed enough
>> documents to match an existing segment on e.g. the Common index.
>>
>> Or, if rebuilding all indexes won't take long, you can always rebuild all
>> of them.
>>
> Yes. That's what I am usually doing (it takes less than 1 minute).
> Yet, I usually do a forceMerge too, to only have 1 segment :/
>
>  Shai
>>
>>
>> On Thu, May 1, 2014 at 12:00 AM, Olivier Binda <ol...@wanadoo.fr>
>> wrote:
>>
>>  On 04/30/2014 10:48 AM, Shai Erera wrote:
>>>
>>>  I hope I got all the details right, if I didn't then please clarify.
>>>> Also,
>>>> I haven't read the entire thread, so if someone already suggested this
>>>> ...
>>>> well, it probably means it's the right solution :)
>>>>
>>>> It sounds like you could use Lucene's ParallelCompositeReader, which
>>>> already handles multiple IndexReaders that are aligned by their internal
>>>> document IDs. The way it would work, as far as I understand your
>>>> scenario
>>>> is something like the following table (columns denote different
>>>> indexes).
>>>> Each index contains a subset of relevant fields, where common contains
>>>> the
>>>> common fields, and each language index contains the respective language
>>>> fields.
>>>>
>>>> DocID        LuceneID  Common  English       German        ....
>>>> "FirstDoc"   0         A,B,C   EN_words,     DE_words,
>>>>                                  EN_sentences  DE_sentences
>>>> "SecondDoc"  1         A,B,C
>>>> "ThirdDoc"   2         A,B,C
>>>>
>>>> Each index can contain all relevant fields, or only a subset (e.g. maybe
>>>> not all documents have a value for the 'B' field in the 'common' index).
>>>> What's absolutely very important here though is that the indexes are
>>>> created very carefully, and if e.g. SecondDoc is not translated into
>>>> German, *you must still have an empty document* in the German index, or
>>>> otherwise, document IDs will not align.
>>>>
>>>>  That's exactly how I saw it and what I need to do. So, I'll have a very
>>> good look at
>>>
>>> ParallelCompositeReader
>>>
>>>
>>>  Lucene does not offer a way to build those indexes though (patches
>>>> welcome!!).
>>>>
>>>>  This answers my question 1. Thanks.  :)
>>> I somehow hoped that there was already support for that kind of situation
>>> in lucene but well,
>>> now at least I know that I won't find an already made solution to my
>>> problem in the lucene classes and that I will have to code one myself,
>>> by taking inspiration in the lucene classes that do similar processing.
>>>
>>>  We've started some effort very long time ago on LUCENE-1879
>>>> (there's a patch and a discussion for an alternative approach) as well
>>>> as
>>>> there is a very useful suggestion in ParallelCompositeReader's jdocs
>>>> (use
>>>> LogDocMergePolicy).
>>>>
>>>>  Wow, priceless. This gives me some headstart and inspiration. :)
>>>
>>>
>>>  One challenge is how to support multi-threaded indexing, but perhaps
>>>> this
>>>> isn't a problem in your application? It sounds like, by you writing
>>>> that a
>>>> user will "download the german index", that the indexes are built
>>>> offline?
>>>>
>>>>  Indeed. The index is built offline, in a single thread, and once it is
>>> built, it is read only.
>>> Can't find an easier situation. :)
>>>
>>>
>>>   Another challenge is how to control segment merging, so that the *exact
>>>
>>>> same segments* are merged over the parallel indexes. Again, if your
>>>> application builds the indexes offline, then this should be easier to
>>>> accomplish.
>>>>
>>>> I assume though that when you index e.g. the German documents, then the
>>>> already indexes 'common' fields do not change for a document. If they
>>>> do,
>>>> you will need to rebuild the 'common' index too.
>>>>
>>>> Once you achieve a correct parallel index, it is very easy to open a
>>>> ParallelCompositeReader on any subset of the indexes, e.g.
>>>> Common+English,
>>>> Common+German, or Common+English+German and search it, since the
>>>> internal
>>>> document IDs are perfectly aligned.
>>>>
>>>> Shai
>>>>
>>>>  Many thanks for the awesome answer and the help (I love you).
>>> As I really really really need this to happen, I'm going to start working
>>> on this really soon.
>>>
>>> I'm definitely not an expert on threads, filesystems, and Lucene inner
>>> workings, so I can't promise to contribute a miraculous patch though.
>>> Especially since I won't work on the multi-thread aspect of the problem.
>>> But I'll do the best I can and contribute back whatever code I can
>>> produce.
>>>
>>> Many thanks, again. :)
>>>
>>>
>>>> On Wed, Apr 30, 2014 at 7:07 AM, Jose Carlos Canova <
>>>> jose.carlos.canova@gmail.com> wrote:
>>>>
>>>>   My suggestion is that you not worry about the docId; in practice it is an
>>>>
>>>>> "internal Lucene" id, quite similar to a rowId in a database. Each index
>>>>> may generate a different docId (that is its own concern) for a translated
>>>>> document. You may use your own ID that relates one document to another
>>>>> across different indexes, mainly because, as you mention, these are
>>>>> translated documents that in theory can be ranked differently from
>>>>> language to language (it is not guaranteed that a set of documents in
>>>>> different languages spans the same rank order, but I am not 100% sure
>>>>> about this),
>>>>>
>>>>> A second reason is that they may change the internal structure of Lucene
>>>>> without any compatibility guarantee, and then you lose forward compatibility.
>>>>>
>>>>> I am not an expert on Lucene like Schindler, but reading their
>>>>> documentation I understood that they pay special attention to
>>>>> "internal lucene" and "experimental lucene", which means internal is
>>>>> "not guaranteed to stay compatible", and experimental "may be removed".
>>>>>
>>>>> For example, if they (apache-lucene) discover a "new manner" of relating
>>>>> documents that is more efficient and change some mechanism, and your
>>>>> application uses an internal mechanism that is highly coupled to Lucene
>>>>> version xxx (marked as "internal-lucene"), you can get stuck on a specific
>>>>> version and in the future have to rewrite some code, and this might
>>>>> cause some "management conflict" if your project follows continuous
>>>>> integration and you are subordinate to a management structure (bad for
>>>>> you).
>>>>>
>>>>> I saw this in several projects that use Lucene: they do not upgrade
>>>>> their Lucene components in their new releases. One example, if I am not
>>>>> wrong, still uses Lucene 3, and there is another I saw around (e.g. Luke),
>>>>> which means that "the project was abandoned because the manner in which it
>>>>> integrated with Lucene was not fully functional".
>>>>>
>>>>> Another interesting thing is that developing around Lucene is more
>>>>> effective, you guarantee that your product will work and they guarantee
>>>>> that Lucene works too. This is related to design by contract.
>>>>>
>>>>> Regards.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Apr 29, 2014 at 7:11 PM, Olivier Binda <
>>>>> olivier.binda@wanadoo.fr
>>>>>
>>>>>  wrote:
>>>>>> Hello.
>>>>>>
>>>>>> Sorry to bring this up again. I don't want to be rude and I mean no
>>>>>> disrespect, but after thinking it through today,
>>>>>> I need to and would really love to have the answer to the following
>>>>>> question :
>>>>>>
>>>>>> 1) At lucene indexing time, is it possible to rewrite a read-only
>>>>>> index
>>>>>>
>>>>>>  so
>>>>>
>>>>>  that some fields are only found in some segments (and how ?)
>>>>>>
>>>>>>
>>>>>> Uwe Schindler suggested using different indexes and a MultiReader for my
>>>>>> needs, and it probably answers my second question, better formulated as
>>>>>>
>>>>>>  "Is
>>>>>
>>>>>  it possible to restrict an index to some of its segments ? " as a
>>>>>> CompositeReader with AtomicReaders (or a custom Directory) that read
>>>>>> the
>>>>>> aforementioned segments might do the trick
>>>>>>
>>>>>> Yet, if I am not mistaken (please tell me if I am wrong), it doesn't
>>>>>>
>>>>>>  solve
>>>>>
>>>>>  my needs as I have around 300000 documents of the following kind :
>>>>>>
>>>>>> READ ONLY Document :
>>>>>> // common fields shipped with the App that aren't language related
>>>>>> A:
>>>>>> B:
>>>>>> C:
>>>>>> // fields shipped with the English package (a zip)
>>>>>> EN:
>>>>>> EN_Words:
>>>>>> EN_Sentences:
>>>>>> some DocValues
>>>>>> // fields shipped with the German package (a zip)
>>>>>> DE:
>>>>>> DE_Words:
>>>>>> DE_Sentences:
>>>>>> some DocValues
>>>>>> ...
>>>>>> There might be hundreds of language packages that my users might use
>>>>>>
>>>>>>
>>>>>> If I use different indexes
>>>>>> indexA for the common stuff,
>>>>>> indexEN for the English package,
>>>>>> indexDE for the german package,
>>>>>>
>>>>>> For sure, I will be able to make a big index out of those by using a
>>>>>> MultiReader
>>>>>> BUT it really makes a union out of the three indexes (right ?) which
>>>>>> means
>>>>>> I'll have 900000 documents
>>>>>> and the documents in the indexA won't have any relations to the
>>>>>> documents
>>>>>> in indexEN (right ?) except if I give each document an id in each
>>>>>> index
>>>>>>
>>>>>>  and
>>>>>
>>>>>  make a join at query time which is a big no no, because I use a
>>>>>>
>>>>>>  queryParser
>>>>>
>>>>>  and users may enter queries like "A:gah AND (DE:schlaffen OR
>>>>>> EN:sleep)"
>>>>>>
>>>>>> Or am I mistaken, and there is a way to create a document in three
>>>>>> different indexes that stay related through the same docId ?
>>>>>>
>>>>>>
>>>>>> My solution if question 1 is possible :
>>>>>>
>>>>>> In contrast, if I am able to build my index so that my READ ONLY
>>>>>> Document
>>>>>> are stored in
>>>>>>
>>>>>> SEGMENT 1
>>>>>> // common fields shipped with the App that aren't language related
>>>>>> A:
>>>>>> B:
>>>>>> C:
>>>>>>
>>>>>> SEGMENT 2
>>>>>> // fields shipped with the English package (a zip)
>>>>>> EN:
>>>>>> EN_Words:
>>>>>> EN_Sentences:
>>>>>> some DocValues
>>>>>>
>>>>>> SEGMENT 3
>>>>>> // fields shipped with the German package (a zip)
>>>>>> DE:
>>>>>> DE_Words:
>>>>>> DE_Sentences:
>>>>>> some DocValues
>>>>>>
>>>>>>
>>>>>> I only need to ship SEGMENT 1 in the App and let users download
>>>>>> SEGMENT
>>>>>> 2
>>>>>> or SEGMENT 3 depending on whether they want English or German
>>>>>> and use a composite reader with atomic readers (right ?) to use my
>>>>>> frankenstein index at query time with a queryparser
>>>>>>
>>>>>>
>>>>>> Also, In case question 1 is possible. I would really like to know too,
>>>>>> if
>>>>>> it is possible to remap at build time docIds in a read-only index.
>>>>>> An application of this would be :
>>>>>>
>>>>>> At day 1, I ship my app with 2 language packages : English and
>>>>>> German
>>>>>> (documents are uniquely identified by a docId... or by an external id
>>>>>> (thanks to a docId<-> external id map)
>>>>>>
>>>>>> At day 2, I ship an additional language package (French) because I'm
>>>>>> able
>>>>>> to build an index with English, German, French with the same exact
>>>>>> docIds
>>>>>> for each document that the index shipped at day 1
>>>>>>
>>>>>>
>>>>>>

Re: Fields, Index segments and docIds (second Try)

Posted by Olivier Binda <ol...@wanadoo.fr>.
On 05/01/2014 10:28 AM, Shai Erera wrote:
> I'm glad it helped you. Good luck with the implementation.

Thanks. First I started looking at the Lucene internal code, to understand
when/where and why docIds change / need to be changed (in merges and doc
deletions).
I have always wanted to understand this and I think the understanding 
may help me somehow.
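
As a small, self-contained illustration of that behaviour (a sketch assuming
Lucene 4.x; the field name and values are made up): a delete only marks the
document in the segment's liveDocs, so the surviving docIds keep their numbers,
and it is only when the segment gets rewritten by a merge (for instance a
forceMerge) that the deleted slot disappears and the higher docIds shift down.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class DocIdShiftSketch {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter w = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47)));
    for (int i = 0; i < 5; i++) {
      Document doc = new Document();
      doc.add(new StringField("id", "doc" + i, Field.Store.YES));
      w.addDocument(doc);                        // docIds 0..4, in insertion order
    }
    w.commit();

    w.deleteDocuments(new Term("id", "doc2"));   // only marked as deleted for now;
    w.commit();                                  // docIds 3 and 4 keep their numbers
    DirectoryReader r = DirectoryReader.open(dir);
    System.out.println(r.maxDoc() + " docs, " + r.numDocs() + " live");   // 5 docs, 4 live
    r.close();

    w.forceMerge(1);                             // the merge rewrites the segment: the
    w.commit();                                  // deleted slot is gone, later docIds shift
    r = DirectoryReader.open(dir);
    System.out.println(r.maxDoc() + " docs, " + r.numDocs() + " live");   // 4 docs, 4 live
    r.close();
    w.close();
  }
}
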
>
> One thing I didn't mention (though it's in the jdocs) -- it's not enough to
> have the documents of each index aligned, you also have to have the
> segments aligned. That is, if both indexes have documents 0-5 aligned, but
> one index contains a single segment and the other one 2 segments, that's
> not going to work.

That's good to know.

> It is possible to do w/ some care -- when you build the German index,
> disable merges (use NoMergePolicy) and flush whenever you indexed enough
> documents to match an existing segment on e.g. the Common index.
>
> Or, if rebuilding all indexes won't take long, you can always rebuild all
> of them.
Yes. That's what I am usually doing (it takes less than 1 minute).
Yet, I usually do a forceMerge too, to only have 1 segment :/
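
Since every index is rebuilt from scratch and force-merged down to one segment,
the aligned indexes can then be opened together. A sketch, again assuming
Lucene 4.x and hypothetical index directories:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.ParallelCompositeReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SearchParallelIndexes {
  public static void main(String[] args) throws Exception {
    // Open the aligned indexes and view them as one logical index per document.
    DirectoryReader common = DirectoryReader.open(FSDirectory.open(new File("common-index")));
    DirectoryReader german = DirectoryReader.open(FSDirectory.open(new File("de-index")));
    ParallelCompositeReader parallel = new ParallelCompositeReader(common, german);

    IndexSearcher searcher = new IndexSearcher(parallel);
    QueryParser qp = new QueryParser(Version.LUCENE_47, "A",
        new StandardAnalyzer(Version.LUCENE_47));
    TopDocs hits = searcher.search(qp.parse("A:gah AND DE:schlafen"), 10);
    System.out.println("hits: " + hits.totalHits);

    parallel.close();   // this constructor also closes the sub-readers by default
  }
}

ParallelCompositeReader insists that every sub-reader has the same number of
documents and the same segment structure, which one force-merged segment per
index satisfies.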

> Shai
>
>
> On Thu, May 1, 2014 at 12:00 AM, Olivier Binda <ol...@wanadoo.fr>wrote:
>
>> On 04/30/2014 10:48 AM, Shai Erera wrote:
>>
>>> I hope I got all the details right, if I didn't then please clarify. Also,
>>> I haven't read the entire thread, so if someone already suggested this ...
>>> well, it probably means it's the right solution :)
>>>
>>> It sounds like you could use Lucene's ParallelCompositeReader, which
>>> already handles multiple IndexReaders that are aligned by their internal
>>> document IDs. The way it would work, as far as I understand your scenario
>>> is something like the following table (columns denote different indexes).
>>> Each index contains a subset of relevant fields, where common contains the
>>> common fields, and each language index contains the respective language
>>> fields.
>>>
>>> DocID        LuceneID  Common  English       German        ....
>>> "FirstDoc"   0         A,B,C   EN_words,     DE_words,
>>>                                  EN_sentences  DE_sentences
>>> "SecondDoc"  1         A,B,C
>>> "ThirdDoc"   2         A,B,C
>>>
>>> Each index can contain all relevant fields, or only a subset (e.g. maybe
>>> not all documents have a value for the 'B' field in the 'common' index).
>>> What's absolutely very important here though is that the indexes are
>>> created very carefully, and if e.g. SecondDoc is not translated into
>>> German, *you must still have an empty document* in the German index, or
>>> otherwise, document IDs will not align.
>>>
>> That's exactly how I saw it and what I need to do. So, I'll have a very
>> good look at
>>
>> ParallelCompositeReader
>>
>>
>>> Lucene does not offer a way to build those indexes though (patches
>>> welcome!!).
>>>
>> This answers my question 1. Thanks.  :)
>> I somehow hoped that there was already support for that kind of situation
>> in lucene but well,
>> now at least I know that I won't find an already made solution to my
>> problem in the lucene classes and that I will have to code one myself,
>> by taking inspiration in the lucene classes that do similar processing.
>>
>>> We've started some effort very long time ago on LUCENE-1879
>>> (there's a patch and a discussion for an alternative approach) as well as
>>> there is a very useful suggestion in ParallelCompositeReader's jdocs (use
>>> LogDocMergePolicy).
>>>
>> Wow, priceless. This gives me some headstart and inspiration. :)
>>
>>
>>> One challenge is how to support multi-threaded indexing, but perhaps this
>>> isn't a problem in your application? It sounds like, by you writing that a
>>> user will "download the german index", that the indexes are built offline?
>>>
>> Indeed. The index is built offline, in a single thread, and once it is
>> built, it is read only.
>> Can't find an easier situation. :)
>>
>>
>>   Another challenge is how to control segment merging, so that the *exact
>>> same segments* are merged over the parallel indexes. Again, if your
>>> application builds the indexes offline, then this should be easier to
>>> accomplish.
>>>
>>> I assume though that when you index e.g. the German documents, then the
>>> already indexes 'common' fields do not change for a document. If they do,
>>> you will need to rebuild the 'common' index too.
>>>
>>> Once you achieve a correct parallel index, it is very easy to open a
>>> ParallelCompositeReader on any subset of the indexes, e.g. Common+English,
>>> Common+German, or Common+English+German and search it, since the internal
>>> document IDs are perfectly aligned.
>>>
>>> Shai
>>>
>> Many thanks for the awesome answer and the help (I love you).
>> As I really really really need this to happen, I'm going to start working
>> on this really soon.
>>
>> I'm definitely not an expert on threads, filesystems, and Lucene inner
>> workings, so I can't promise to contribute a miraculous patch though.
>> Especially since I won't work on the multi-thread aspect of the problem.
>> But I'll do the best I can and contribute back whatever code I can produce.
>>
>> Many thanks, again. :)
>>
>>>
>>> On Wed, Apr 30, 2014 at 7:07 AM, Jose Carlos Canova <
>>> jose.carlos.canova@gmail.com> wrote:
>>>
>>>   My suggestion is that you not worry about the docId; in practice it is an
>>>> "internal Lucene" id, quite similar to a rowId in a database. Each index
>>>> may generate a different docId (that is its own concern) for a translated
>>>> document. You may use your own ID that relates one document to another across
>>>> different indexes, mainly because, as you mention, these are translated documents
>>>> that in theory can be ranked differently from language to language (it is
>>>> not guaranteed that a set of documents in different languages spans the
>>>> same rank order, but I am not 100% sure about this),
>>>>
>>>> A second reason is that they may change the internal structure of Lucene
>>>> without any compatibility guarantee, and then you lose forward compatibility.
>>>>
>>>> I am not an expert on Lucene like Schindler, but reading their
>>>> documentation I understood that they pay special attention to
>>>> "internal lucene" and "experimental lucene", which means internal is "not
>>>> guaranteed to stay compatible", and experimental "may be removed".
>>>>
>>>> For example, if they (apache-lucene) discover a "new manner" of relating
>>>> documents that is more efficient and change some mechanism, and your
>>>> application uses an internal mechanism that is highly coupled to Lucene
>>>> version xxx (marked as "internal-lucene"), you can get stuck on a specific
>>>> version and in the future have to rewrite some code, and this might
>>>> cause some "management conflict" if your project follows continuous
>>>> integration and you are subordinate to a management structure (bad for
>>>> you).
>>>>
>>>> I saw this in several projects that use Lucene: they do not upgrade
>>>> their Lucene components in their new releases. One example, if I am not
>>>> wrong, still uses Lucene 3, and there is another I saw around (e.g. Luke),
>>>> which means that "the project was abandoned because the manner in which it
>>>> integrated with Lucene was not fully functional".
>>>>
>>>> Another interesting thing is that developing around Lucene is more
>>>> effective, you guarantee that your product will work and they guarantee
>>>> that Lucene works too. This is related to design by contract.
>>>>
>>>> Regards.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Apr 29, 2014 at 7:11 PM, Olivier Binda <olivier.binda@wanadoo.fr
>>>>
>>>>> wrote:
>>>>> Hello.
>>>>>
>>>>> Sorry to bring this up again. I don't want to be rude and I mean no
>>>>> disrespect, but after thinking it through today,
>>>>> I need to and would really love to have the answer to the following
>>>>> question :
>>>>>
>>>>> 1) At lucene indexing time, is it possible to rewrite a read-only index
>>>>>
>>>> so
>>>>
>>>>> that some fields are only found in some segments (and how ?)
>>>>>
>>>>>
>>>>> Uwe Schindler suggested using different indexes and a MultiReader for my
>>>>> needs, and it probably answers my second question, better formulated as
>>>>>
>>>> "Is
>>>>
>>>>> it possible to restrict an index to some of its segments ? " as a
>>>>> CompositeReader with AtomicReaders (or a custom Directory) that read the
>>>>> aforementioned segments might do the trick
>>>>>
>>>>> Yet, if I am not mistaken (please tell me if I am wrong), it doesn't
>>>>>
>>>> solve
>>>>
>>>>> my needs as I have around 300000 documents of the following kind :
>>>>>
>>>>> READ ONLY Document :
>>>>> // common fields shipped with the App that aren't language related
>>>>> A:
>>>>> B:
>>>>> C:
>>>>> // fields shipped with the English package (a zip)
>>>>> EN:
>>>>> EN_Words:
>>>>> EN_Sentences:
>>>>> some DocValues
>>>>> // fields shipped with the German package (a zip)
>>>>> DE:
>>>>> DE_Words:
>>>>> DE_Sentences:
>>>>> some DocValues
>>>>> ...
>>>>> There might be hundreds of language packages that my users might use
>>>>>
>>>>>
>>>>> If I use different indexes
>>>>> indexA for the common stuff,
>>>>> indexEN for the English package,
>>>>> indexDE for the german package,
>>>>>
>>>>> For sure, I will be able to make a big index out of those by using a
>>>>> MultiReader
>>>>> BUT it really makes a union out of the three indexes (right ?) which
>>>>> means
>>>>> I'll have 900000 documents
>>>>> and the documents in the indexA won't have any relations to the
>>>>> documents
>>>>> in indexEN (right ?) except if I give each document an id in each index
>>>>>
>>>> and
>>>>
>>>>> make a join at query time which is a big no no, because I use a
>>>>>
>>>> queryParser
>>>>
>>>>> and users may enter queries like "A:gah AND (DE:schlaffen OR EN:sleep)"
>>>>>
>>>>> Or am I mistaken, and there is a way to create a document in three
>>>>> different indexes that stay related through the same docId ?
>>>>>
>>>>>
>>>>> My solution if question 1 is possible :
>>>>>
>>>>> In contrast, if I am able to build my index so that my READ ONLY
>>>>> Document
>>>>> are stored in
>>>>>
>>>>> SEGMENT 1
>>>>> // common fields shipped with the App that aren't language related
>>>>> A:
>>>>> B:
>>>>> C:
>>>>>
>>>>> SEGMENT 2
>>>>> // fields shipped with the English package (a zip)
>>>>> EN:
>>>>> EN_Words:
>>>>> EN_Sentences:
>>>>> some DocValues
>>>>>
>>>>> SEGMENT 3
>>>>> // fields shipped with the German package (a zip)
>>>>> DE:
>>>>> DE_Words:
>>>>> DE_Sentences:
>>>>> some DocValues
>>>>>
>>>>>
>>>>> I only need to ship SEGMENT 1 in the App and let users download SEGMENT
>>>>> 2
>>>>> or SEGMENT 3 depending on whether they want English or German
>>>>> and use a composite reader with atomic readers (right ?) to use my
>>>>> frankenstein index at query time with a queryparser
>>>>>
>>>>>
>>>>> Also, In case question 1 is possible. I would really like to know too,
>>>>> if
>>>>> it is possible to remap at build time docIds in a read-only index.
>>>>> An application of this would be :
>>>>>
>>>>> At day 1, I ship my app with 2 language packages : English and German
>>>>> (documents are uniquely identified by a docId... or by an external id
>>>>> (thanks to a docId<-> external id map)
>>>>>
>>>>> At day 2, I ship an additional language package (French) because I'm
>>>>> able
>>>>> to build an index with English, German, French with the same exact
>>>>> docIds
>>>>> for each document that the index shipped at day 1
>>>>>
>>>>>
>>>>>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org