Posted to java-user@lucene.apache.org by Yann-Erwan Perio <ye...@gmail.com> on 2014/01/21 16:32:46 UTC

BytesRef equals() method

Hello,

I have been working a bit with BytesRef recently, and I wonder whether
the content of the equals() method, and more specifically the content
of the bytesEquals(BytesRef other) method, is the intended one.

Here is my use case. I work with Lucene 4.6.0. During indexing, using
a custom tokenizer, I have added some payloads onto some tokens. Using
an extension of the Default Similarity, I was then able to retrieve
these payloads, passing them to a collector of mine, so as to perform
aggregation calculations. I then noticed that the BytesRef instances
retrieved were not exactly the same as the ones indexed - namely their
real content was the same, but their offsets would differ.

I was made aware of this because I used a Map<BytesRef, ...> in the
collector, and the map would sometimes give inconsistent results.
Checking out the source code, the hashCode() method looks valid to me,
but the bytesEquals() method looks strange - because prior to
comparing the real value of the BytesRef, it checks their lengths -
and AIUI these may differ, even though the BytesRef instances are
logically equal.

I am not familiar at all with the internals of Lucene (this includes
the BytesRef mechanics), so I may be completely wrong here. FWIW, I
solved my problem by creating fresh BytesRef from the ones sent by the
similarity, using the copyBytes method. I could also have used the
string representation of the BytesRef, but this appears to be slower
than copying the bytes, by a factor of about 2.5.

Kind regards.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: BytesRef equals() method

Posted by Yann-Erwan Perio <ye...@gmail.com>.
On Wed, Jan 22, 2014 at 12:09 PM, Michael McCandless
<lu...@mikemccandless.com> wrote:

Hi,

> DocsAndPositionsEnum.getPayload() is allowed to re-use the returned
> BytesRef under the hood.

Ah, I am starting to get it. The BytesRef would be directly stored in
the key set of the map, but since its properties can change, then I
can imagine how this can invalidate the hash table.
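
A minimal sketch of that failure mode, assuming a 31-based polynomial hash over the used bytes. MutableSlice is a hypothetical stand-in, not Lucene's BytesRef, and the earlier content "K4d" is an invented example picked only because it hashes to 73787 under that assumption; the payload actually held by the reused instance is not known from this thread.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a mutable byte slice; NOT Lucene's BytesRef.
final class MutableSlice {
    byte[] bytes;
    int offset;
    int length;

    MutableSlice(byte[] bytes, int offset, int length) {
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof MutableSlice)) return false;
        MutableSlice other = (MutableSlice) o;
        // Compares only the used ranges (Java 9+ range overload).
        return Arrays.equals(bytes, offset, offset + length,
                other.bytes, other.offset, other.offset + other.length);
    }

    @Override
    public int hashCode() {
        int hash = 0; // assumed 31-based polynomial hash over the used bytes
        for (int i = offset; i < offset + length; i++) {
            hash = 31 * hash + bytes[i];
        }
        return hash;
    }
}

final class MutableKeyDemo {
    // Returns {containsKey before mutation, containsKey after, equal key found by iteration}.
    static boolean[] run() {
        byte[] buffer = {'K', '4', 'd', 'N', 'c', '6'};
        MutableSlice shared = new MutableSlice(buffer, 0, 3); // views "K4d"
        Map<MutableSlice, Integer> counts = new HashMap<>();
        counts.put(shared, 1); // the map records the hash of "K4d" (73787)
        boolean before = counts.containsKey(shared);

        shared.offset = 3; // the same instance now views "Nc6" (hash 78081)
        boolean after = counts.containsKey(shared);

        boolean foundByIteration = false;
        for (MutableSlice key : counts.keySet()) {
            foundByIteration |= key.equals(shared);
        }
        return new boolean[] {before, after, foundByIteration};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run())); // prints [true, false, true]
    }
}
```

The key is still listed by keySet() and equals() the probe, yet containsKey() fails because the map compares against the hash it recorded at insertion time.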

In fact, I do not use DocsAndPositionsEnum.getPayload(), but I believe
that my setup exhibits the same behavior as the one you
have described. I need the payload of the matched token in my custom
collector, so I simply extended the DefaultSimilarity (making the new
similarity aware of my custom collector), and passed the payload to
the collector in the scorePayload() method.

> So, if you want to hold a copy of the payload across two or more calls
> to .getPayload you'll have to make a deep copy
> (BytesRef.deepCopyOf) of the returned BytesRef yourself.

This is actually what I was doing in my workaround, except that I'd
use copyBytes directly, prior to putting the entry in the map. It
somehow worked by coincidence.

Cheers all for your time and replies, it is really appreciated.



Re: BytesRef equals() method

Posted by Michael McCandless <lu...@mikemccandless.com>.
DocsAndPositionsEnum.getPayload() is allowed to re-use the returned
BytesRef under the hood.

So, if you want to hold a copy of the payload across two or more calls
to .getPayload you'll have to make a deep copy
(BytesRef.deepCopyOf) of the returned BytesRef yourself.

Mike McCandless

http://blog.mikemccandless.com
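
A minimal sketch of the copy-before-store pattern described above. ReusedSlice and its deepCopyOf are hypothetical stand-ins so that the example runs without Lucene on the classpath; they only imitate the idea behind BytesRef.deepCopyOf, and the buffer contents are invented.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a reusable byte slice; NOT Lucene's BytesRef.
final class ReusedSlice {
    byte[] bytes;
    int offset;
    int length;

    ReusedSlice(byte[] bytes, int offset, int length) {
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
    }

    // Copies only the used bytes into a private array, starting at offset 0.
    static ReusedSlice deepCopyOf(ReusedSlice src) {
        byte[] copy = Arrays.copyOfRange(src.bytes, src.offset, src.offset + src.length);
        return new ReusedSlice(copy, 0, copy.length);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ReusedSlice)) return false;
        ReusedSlice other = (ReusedSlice) o;
        return Arrays.equals(bytes, offset, offset + length,
                other.bytes, other.offset, other.offset + other.length);
    }

    @Override
    public int hashCode() {
        int hash = 0;
        for (int i = offset; i < offset + length; i++) {
            hash = 31 * hash + bytes[i];
        }
        return hash;
    }
}

final class DeepCopyDemo {
    static Map<ReusedSlice, Integer> countPayloads() {
        // Three invented 3-byte payloads laid out in one shared buffer.
        byte[] buffer = "K4dNc6Nc6".getBytes(StandardCharsets.US_ASCII);
        ReusedSlice reused = new ReusedSlice(buffer, 0, 3);
        Map<ReusedSlice, Integer> counts = new HashMap<>();
        for (int off = 0; off < buffer.length; off += 3) {
            reused.offset = off; // the producer repoints the same instance
            // Store a private copy, never the reused instance itself.
            counts.merge(ReusedSlice.deepCopyOf(reused), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<ReusedSlice, Integer> counts = countPayloads();
        ReusedSlice probe = new ReusedSlice(new byte[] {'N', 'c', '6'}, 0, 3);
        // prints "2 keys, Nc6 seen 2 times"
        System.out.println(counts.size() + " keys, Nc6 seen " + counts.get(probe) + " times");
    }
}
```

Because every stored key owns its bytes, later changes to the reused instance cannot alter the hash of anything already in the map. This sketch copies on every call for simplicity; looking up with the transient instance first and copying only on a miss would avoid some allocations.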




Re: BytesRef equals() method

Posted by Steven Schlansker <st...@likeness.com>.
On Wed, 22 Jan 2014 07:14:59 +0100
Yann-Erwan Perio <ye...@gmail.com> wrote:

> On Tue, Jan 21, 2014 at 7:54 PM, Steven Schlansker
> <st...@likeness.com> wrote:
> 
> Certainly, but my problem still persists if I do not do it. I spent
> the whole night debugging the code, to no avail. As a matter of fact,
> when I run a series of tests on my application, the following happens
> about once out of ten times (this is the resulting log of some sysout
> calls):
> 
> Payload: toString=Nc6, bytes=[4e 63 36], offset=3, length=3, hashcode=78081
> map.keySet()=[[4e 63 36]]
> Now testing map.contains(payload)
> map.contains(payload)==false
> Now testing map.isEmpty()
> map.isEmpty()==false
> Map is not empty. Manually iterating keys.
> Key n°1: toString=Nc6, bytes=[4e 63 36], offset=3, length=3, hashcode=78081
> Verifying key.equals(payload)==true
> Verifying map.containsKey(payload)==false
> Verifying map.containsKey(key)==false
> 
> As you can see, the map provides the key I am looking for, but it
> cannot identify it back! Going through the HashMap data structure, it
> was indeed assigned a different hashCode (73787).
> 
> I do not understand how this could happen. I thought that there was
> maybe a concurrency issue with the payload itself - as if it were
> reused in concurrent scoring processes (I use the payload sent back by
> DefaultSimilarity) - but the faulty hashCode, as far as I can see,
> should not be generated by my test data set.
> 
> I'll try looking again at the code with fresh eyes, but in the
> meanwhile, do not hesitate to tell me if this makes sense to you.

It sounds like you've already considered this somewhat, but I'd guess
that by far the most likely cause here is modification of either the
byte[] or the offset/length somewhere behind your back.  Perhaps the
BytesRef itself is unexpectedly shared such that two different
processes use it for their own purposes.

If you can't find that, perhaps try to pare down a self-contained test
case.



Re: BytesRef equals() method

Posted by Yann-Erwan Perio <ye...@gmail.com>.
On Tue, Jan 21, 2014 at 7:54 PM, Steven Schlansker <st...@likeness.com> wrote:

Hi,

Firstly, thanks to all of you for your insights.

> How can two byte arrays be equal if they have different lengths?
> Same way as two Strings with differing lengths can never be equal, two
> byte arrays with different lengths will never be equivalent.

Indeed. As Michael pointed out, I had misunderstood what "length"
means in the code. Thanks for clearing that up!

> copyBytes doesn’t change the length of the BytesRef, so two unequal BytesRef
> instances cannot become equal solely through a copyBytes call, by my reading?

Certainly, but my problem still persists if I do not do it. I spent
the whole night debugging the code, to no avail. As a matter of fact,
when I run a series of tests on my application, the following happens
about once out of ten times (this is the resulting log of some sysout
calls):

Payload: toString=Nc6, bytes=[4e 63 36], offset=3, length=3, hashcode=78081
map.keySet()=[[4e 63 36]]
Now testing map.contains(payload)
map.contains(payload)==false
Now testing map.isEmpty()
map.isEmpty()==false
Map is not empty. Manually iterating keys.
Key n°1: toString=Nc6, bytes=[4e 63 36], offset=3, length=3, hashcode=78081
Verifying key.equals(payload)==true
Verifying map.containsKey(payload)==false
Verifying map.containsKey(key)==false

As you can see, the map provides the key I am looking for, but it
cannot identify it back! Going through the HashMap data structure, it
was indeed assigned a different hashCode (73787).
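
As a worked check of the logged value, here is a minimal sketch assuming that the hash is a 31-based polynomial over the used bytes (an assumption, though one that the log supports: it yields exactly 78081 for the bytes 4e 63 36). PolyHash is a hypothetical helper, and the three leading buffer bytes are zero placeholders since their real values are not shown in the log.

```java
final class PolyHash {
    // Assumed hash: h = 31 * h + b over bytes[offset, offset + length).
    static int hash(byte[] bytes, int offset, int length) {
        int h = 0;
        for (int i = offset; i < offset + length; i++) {
            h = 31 * h + bytes[i];
        }
        return h;
    }

    public static void main(String[] args) {
        // offset=3, length=3 as in the log; the first three bytes are placeholders.
        byte[] buffer = {0, 0, 0, 0x4e, 0x63, 0x36};
        // 31 * (31 * 78 + 99) + 54 = 78081
        System.out.println(hash(buffer, 3, 3)); // prints 78081
    }
}
```

Under that assumption, the hash currently reported by the key matches its current content, so the stale value is the one recorded by the HashMap when the key was inserted.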

I do not understand how this could happen. I thought that there was
maybe a concurrency issue with the payload itself - as if it were
reused in concurrent scoring processes (I use the payload sent back by
DefaultSimilarity) - but the faulty hashCode, as far as I can see,
should not be generated by my test data set.

I'll try looking again at the code with fresh eyes, but in the
meanwhile, do not hesitate to tell me if this makes sense to you.

> Not all bytes are valid representations of Strings, so don’t do this unless
> you are very sure you are dealing with character data and know the encoding.

This would not be a problem in my use case, as the provided text is
generated by the application, and uses only certain ASCII chars.

> What differently-sized byte arrays would you expect to compare as equals?

Arrays that would contain an equal slice of values (the logical value)
- one would discard some leading bytes, of varying length, regarded
as technical junk. This is how I had understood the BytesRef structure.

Kind regards.



Re: BytesRef equals() method

Posted by Steven Schlansker <st...@likeness.com>.
On Jan 21, 2014, at 7:32 AM, Yann-Erwan Perio <ye...@gmail.com> wrote:

> Hello,
> 
> I have been working a bit with BytesRef recently, and I wonder whether
> the content of the equals() method, and more specifically the content
> of the bytesEquals(BytesRef other) method, is the intended one.
> 
> I was made aware of this because I used a Map<BytesRef, ...> in the
> collector, and the map would sometimes give inconsistent results.
> Checking out the source code, the hashCode() method looks valid to me,
> but the bytesEquals() method looks strange - because prior to
> comparing the real value of the BytesRef, it checks their lengths -
> and AIUI these may differ, even though BytesRef are logically equal.

How can two byte arrays be equal if they have different lengths?
Same way as two Strings with differing lengths can never be equal, two
byte arrays with different lengths will never be equivalent.

> 
> I am not familiar at all with the internals of Lucene (this includes
> the BytesRef mechanics), so I may be completely wrong here. FWIW, I
> solved my problem by creating fresh BytesRef from the ones sent by the
> similarity, using the copyBytes method.

copyBytes doesn’t change the length of the BytesRef, so two unequal BytesRef
instances cannot become equal solely through a copyBytes call, by my reading?

> I could also have used the
> string representation of the BytesRef, but this appears to be slower
> than copying the bytes, by a factor of about 2.5.

Not all bytes are valid representations of Strings, so don’t do this unless
you are very sure you are dealing with character data and know the encoding.

It’s also not surprising that this is slower, given that creating a String
not only involves copying all the bytes but also decoding them into characters.
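
A minimal sketch of why arbitrary bytes do not survive a trip through String, using only the standard library. StringRoundTrip is a hypothetical helper name, UTF-8 is an assumed encoding, and the byte values are invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class StringRoundTrip {
    // Decodes the bytes as UTF-8 and encodes the resulting String back to bytes.
    static byte[] roundTrip(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] ascii = {0x4e, 0x63, 0x36};          // "Nc6", valid ASCII and UTF-8
        byte[] binary = {(byte) 0xff, 0x00, 0x36};  // 0xff can never appear in UTF-8

        System.out.println(Arrays.equals(ascii, roundTrip(ascii)));   // prints true
        System.out.println(Arrays.equals(binary, roundTrip(binary))); // prints false
    }
}
```

The invalid byte is decoded to the replacement character U+FFFD, which encodes back to three different bytes, so two distinct binary payloads could end up sharing the same String key.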


What differently-sized byte arrays would you expect to compare as equals?

Best,
Steven




RE: BytesRef equals() method

Posted by "Rose, Stuart J" <St...@pnnl.gov>.
I agree that comparing the BytesRef lengths in an equals() method seems counter to the purpose of having a BytesRef class. 

I'd recommend taking a look at the BytesRefHash which maps BytesRef objects to unique ids as it 'may' be more efficient than converting to Strings. 

Stuart




Re: BytesRef equals() method

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
Note the comments in the source:

   /** Length of used bytes. */
   public int length;

length is not the same as the size of the internal buffer.  It is the 
number of used bytes, so the length of the "logical" value as you call it.
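
A minimal sketch of that distinction, comparing two slices whose backing arrays differ in size. SliceEquality is a hypothetical helper built on the standard library (Java 9+ range overload of Arrays.equals), not Lucene code, and the padding bytes are invented.

```java
import java.util.Arrays;

final class SliceEquality {
    // Equality over the used ranges only; the backing array sizes play no part.
    static boolean sliceEquals(byte[] a, int aOffset, int aLength,
                               byte[] b, int bOffset, int bLength) {
        if (aLength != bLength) {
            return false; // different numbers of used bytes can never be equal
        }
        return Arrays.equals(a, aOffset, aOffset + aLength,
                b, bOffset, bOffset + bLength);
    }

    public static void main(String[] args) {
        byte[] indexed = {0x4e, 0x63, 0x36};                        // 3-byte buffer
        byte[] retrieved = {0, 0, 0, 0x4e, 0x63, 0x36, 0x7f, 0x7f}; // 8-byte buffer

        // Same logical value at different offsets in differently sized buffers.
        System.out.println(sliceEquals(indexed, 0, 3, retrieved, 3, 3)); // prints true
        // One extra used byte makes the logical values differ.
        System.out.println(sliceEquals(indexed, 0, 3, retrieved, 3, 4)); // prints false
    }
}
```

The length check therefore rejects only slices whose logical values already differ; leading or trailing bytes outside the used range are ignored.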

-Mike
