Posted to dev@lucene.apache.org by "Steve Rowe (JIRA)" <ji...@apache.org> on 2013/11/01 17:39:22 UTC
[jira] [Commented] (SOLR-5354) Distributed sort is broken with CUSTOM FieldType

    [ https://issues.apache.org/jira/browse/SOLR-5354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811422#comment-13811422 ] 

Steve Rowe commented on SOLR-5354:
----------------------------------

Thanks for the review, Robert.

bq. Can we please not have the Object/Object stuff in FieldComparatorSource? This is wrong: FieldComparator already has a generic type so I don't understand the need to discard type safety.

I'm not sure what you have in mind - do you think FieldComparatorSource should be generified? In this case I think each extending class will need to provide an implementation for these methods, since there isn't a sensible way to provide a default implementation of conversion to/from the generic type.
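
To make the trade-off concrete, here is a minimal sketch of what a generified source might look like. Every name in it ({{TypedComparatorSource}}, {{marshal}}, {{unmarshal}}, {{LongSource}}, {{BytesSource}}) is hypothetical and is not taken from the patch or from Lucene; it only illustrates why each subclass would have to supply its own conversions, since a generic base class has nothing sensible to do by default.

```java
public class TypedSortValueSketch {

    /** Hypothetical generified source; T is the comparator's value type. */
    abstract static class TypedComparatorSource<T> {
        /** Converts a sort value to a form that can be sent between shards. */
        abstract Object marshal(T value);

        /** Restores a sort value received from another shard. */
        abstract T unmarshal(Object wire);
    }

    /** Numeric values: the wire form may arrive as any Number subtype. */
    static final class LongSource extends TypedComparatorSource<Long> {
        @Override
        Object marshal(Long value) {
            return value;
        }

        @Override
        Long unmarshal(Object wire) {
            return ((Number) wire).longValue();
        }
    }

    /** Binary values: passed through as defensive copies, never decoded as text. */
    static final class BytesSource extends TypedComparatorSource<byte[]> {
        @Override
        Object marshal(byte[] value) {
            return value.clone();
        }

        @Override
        byte[] unmarshal(Object wire) {
            return ((byte[]) wire).clone();
        }
    }

    public static void main(String[] args) {
        // An Integer coming off the wire is widened back to the comparator's Long type.
        System.out.println(new LongSource().unmarshal(Integer.valueOf(7)));
    }
}
```

The two subclasses share no conversion logic, which is the point: the base class can declare the methods but cannot implement them.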

bq. The unicode conversion for String/String_VAL is incorrect and should not exist: despite the name, these types can be any bytes

This is the status quo right now - the patch just keeps that in place.  But I agree.  I think the issue is non-binary (XML) serialization, for which UTF-8 is safe, but arbitrary binary is not.  Serializing all STRING/STRING_VAL as Base64 seems wasteful in the general case.
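
A small illustration of the trade-off, using only the JDK rather than Solr's actual serialization code: the JDK's UTF-8 decoder stands in for the String conversion (it is not the same routine as {{UnicodeUtil.UTF8toUTF16}}), and {{java.util.Base64}} assumes Java 8 or later. The class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class SortValueEncodingDemo {

    /**
     * Decodes the bytes as UTF-8 and encodes them again. The JDK replaces
     * malformed sequences with U+FFFD, so arbitrary bytes do not survive.
     */
    static byte[] utf8RoundTrip(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Base64 is lossless for any input, at roughly 4/3 the size. */
    static byte[] base64RoundTrip(byte[] raw) {
        String text = Base64.getEncoder().encodeToString(raw);
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] term = "solr".getBytes(StandardCharsets.UTF_8);
        // Example bytes that are not valid UTF-8 (0xFF never appears in UTF-8).
        byte[] sortKey = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x41};

        System.out.println("text via UTF-8 intact:   " + Arrays.equals(term, utf8RoundTrip(term)));
        System.out.println("key via UTF-8 intact:    " + Arrays.equals(sortKey, utf8RoundTrip(sortKey)));
        System.out.println("key via Base64 intact:   " + Arrays.equals(sortKey, base64RoundTrip(sortKey)));
        System.out.println("Base64 length for 4 raw bytes: "
                + Base64.getEncoder().encodeToString(sortKey).length());
    }
}
```

Real text passes through the UTF-8 path unchanged, which is why Base64 for every STRING/STRING_VAL value looks wasteful; only values that are not valid UTF-8 need the lossless encoding.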

Relatedly, looks like there's an orphaned {{SortField.Type.BYTES}} (orphaned in that it's not handled in lots of places) - I guess this should go away?

{quote}
As a concrete example the CollationField and ICUCollationField sort with String/String_VAL comparators but contain non-unicode bytes.

These currently do not work distributed today either (which I would love to see fixed on this issue).
{quote}

I'm working on a distributed version of the Solr (icu) collation tests.  Once I get that failing, I'll be able to test potential solutions.

> Distributed sort is broken with CUSTOM FieldType
> ------------------------------------------------
>
>                 Key: SOLR-5354
>                 URL: https://issues.apache.org/jira/browse/SOLR-5354
>             Project: Solr
>          Issue Type: Bug
>          Components: SearchComponents - other
>    Affects Versions: 4.4, 4.5, 5.0
>            Reporter: Jessica Cheng
>            Assignee: Steve Rowe
>              Labels: custom, query, sort
>         Attachments: SOLR-5354.patch
>
>
> We added a custom field type to allow an indexed binary field type that supports search (exact match), prefix search, and sort as unsigned bytes lexicographical compare. For sort, BytesRef's UTF8SortedAsUnicodeComparator accomplishes what we want, and even though the name of the comparator mentions UTF8, it doesn't actually assume so and just performs byte-level operations, so it's good. However, when we do this across different nodes, we run into an issue where in QueryComponent.doFieldSortValues:
>           // Must do the same conversion when sorting by a
>           // String field in Lucene, which returns the terms
>           // data as BytesRef:
>           if (val instanceof BytesRef) {
>             UnicodeUtil.UTF8toUTF16((BytesRef)val, spare);
>             field.setStringValue(spare.toString());
>             val = ft.toObject(field);
>           }
> UnicodeUtil.UTF8toUTF16 is called on our byte array, which isn't actually UTF8. I did a hack where I specified our own field comparator to be ByteBuffer based to get around that instanceof check, but then the field value gets transformed into BYTEARR in JavaBinCodec, and when it's unmarshalled, it gets turned into byte[]. Then, in QueryComponent.mergeIds, a ShardFieldSortedHitQueue is constructed with ShardDoc.getCachedComparator, which decides to give me comparatorNatural in the else of the TODO for CUSTOM, which barfs because byte[] are not Comparable...
> From Chris Hostetter:
> I'm not very familiar with the distributed sorting code, but based on your
> comments, and a quick skim of the functions you pointed to, it definitely
> seems like there are two problems here for people trying to implement
> custom sorting in custom FieldTypes...
> 1) QueryComponent.doFieldSortValues - this definitely seems like it should
> be based on the FieldType, not an "instanceof BytesRef" check (oddly: the
> comment even suggests that it should be using the FieldType's
> indexedToReadable() method -- but it doesn't do that).  If it did, then
> this part of the logic should work for you as long as your custom
> FieldType implemented indexedToReadable in a sane way.
> 2) QueryComponent.mergeIds - that TODO definitely looks like a gap that
> needs to be filled.  I'm guessing the sanest thing to do in the CUSTOM case
> would be to ask the FieldComparatorSource (which should be coming from the
> SortField that the custom FieldType produced) to create a FieldComparator
> (via newComparator - the numHits & sortPos could be anything) and then
> wrap that up in a Comparator facade that delegates to
> FieldComparator.compareValues.
> That way a custom FieldType could be in complete control of the sort
> comparisons (even when merging ids).
> ...But as i said: i may be missing something, i'm not super familiar with
> that code.  Please try it out and let us know if that works -- either way
> please open a Jira pointing out the problems trying to implement
> distributed sorting in a custom FieldType.
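
For reference, the facade idea from the description can be sketched without any Lucene dependency. In the sketch below, {{ValueComparator}} is a hypothetical stand-in for the one method of FieldComparator that matters here ({{compareValues}}); {{UnsignedBytesComparator}} and {{asMergeComparator}} are likewise invented names, not existing Solr or Lucene code. It also shows the unsigned lexicographic byte[] comparison the reporter wants, which is exactly what a natural-order comparator cannot provide because byte[] is not Comparable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class MergeComparatorSketch {

    /** Hypothetical stand-in for the compareValues method of a field comparator. */
    interface ValueComparator<T> {
        int compareValues(T first, T second);
    }

    /** Compares byte arrays lexicographically, treating each byte as unsigned. */
    static final class UnsignedBytesComparator implements ValueComparator<byte[]> {
        @Override
        public int compareValues(byte[] a, byte[] b) {
            int common = Math.min(a.length, b.length);
            for (int i = 0; i < common; i++) {
                // Mask to 0..255 so that 0x80 sorts after 0x7F.
                int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
                if (diff != 0) {
                    return diff;
                }
            }
            // On a shared prefix, the shorter array sorts first.
            return a.length - b.length;
        }
    }

    /** Facade: exposes a typed comparator as a Comparator over untyped shard values. */
    static <T> Comparator<Object> asMergeComparator(ValueComparator<T> delegate, Class<T> type) {
        return (x, y) -> delegate.compareValues(type.cast(x), type.cast(y));
    }

    public static void main(String[] args) {
        // Values as they might arrive from shards after unmarshalling, typed as Object.
        List<Object> shardValues = new ArrayList<>();
        shardValues.add(new byte[] {(byte) 0x80});
        shardValues.add(new byte[] {0x7F, 0x01});
        shardValues.add(new byte[] {0x7F});

        shardValues.sort(asMergeComparator(new UnsignedBytesComparator(), byte[].class));
        for (Object value : shardValues) {
            System.out.println(Arrays.toString((byte[]) value));
        }
    }
}
```

With this shape, the merge step never needs to know the value type; the comparison logic stays with whatever produced the comparator, which is the control over sorting that the description asks for.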



--
This message was sent by Atlassian JIRA
(v6.1#6144)
