You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pdfbox.apache.org by Andreas Lehmkuehler <an...@lehmi.de> on 2022/03/05 15:30:56 UTC

COSBase, avoid to have the same hashCode for different objects holding the same value

Hi,

I'm not sure if we dicussed that topic in the past or if I simply mixed it up 
with a discussion about "equals" and "="

However, PDFBOX-5286 shows the we have an issue with objects which aren't the 
same but are treated as the same because of the same hash. This is true for all 
simple objects such as COSInteger, COSFLoat, COSBoolean and COSName.

Think about the following two indirect /Length objects

100 0 obj
512
endobj


200 0 obj
512
endobj

* there two different COSObjects "100 0" and "200 0"
* both COSObjects have different hashes
* both COSObjects are referencing a COSInteger holding the same value "512"
* both COSIntegers are different objects
* both COSIntegers have the SAME hash, as the current implementation of hashCode 
is based on the value of the COSInteger

Or some pseudo code

COSObject(100,0) != COSObject(200,0)
COSInteger(100,0) != COSInteger(200,0)
COSObject(100,0).hashCode != COSObject(200,0).hashCode
COSInteger(100,0).hashCode == COSInteger(200,0).hashCode
COSInteger(100,0).equals(COSInteger(200,0) == true

IMHO we should change the implementation of hashCode so that different objects 
will have different hashCodes.

I expect some side effects
* we are using a lot of hash-based collections and I'm afraid there may be some 
cases where the fact of having the same hash for different objects is wanted 
(knowingly or not)
* we have to remove the static instances for COSInteger values in a range from 
-100 to 256 which will result in an increased number of COSInteger instances
* there are just two static instances of COSBoolean ("true" and "false") which 
have to be replaced too
* COSName is caching a lot of values as static instances as well, which should 
be removed as well
* looks like COSFloat shouldn't be a problem

WDYT? Should we simply start with COSFloat and COSInteger and see how it ends up?

Andreas










---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: COSBase, avoid to have the same hashCode for different objects holding the same value

Posted by "sahyoun@fileaffairs.de" <sa...@fileaffairs.de>.

Am Samstag, dem 05.03.2022 um 16:30 +0100 schrieb Andreas Lehmkuehler:
> Hi,
> 
> I'm not sure if we dicussed that topic in the past or if I simply
> mixed it up 
> with a discussion about "equals" and "="

Not sure that we discussed that in the context of primitives but having
worked on that myself a bit for the primitives there were already
hashCode/equals methods. In addition as you pointed out there are also
static instances for quite a while.

I also had to revert changes done around equals/hashCode in the past as
the current implementation of the COS type classes and (mainly)
COSWriter are dependent on keeping a specific handling.

So to me the question really is what one would expect from a PDF
perspective. If that means that we need to treat equals as being
identical then I'm fine going down that route. This would also resolve
the mutability question we had a while ago. E.g. what happens if the
COSInteger value of COSObject 100 changes but COSObject 200 points to
the same (Java) object?

Currently we have a mix which is inconsitent and the source of
surprises.

On the long run that also means that we need to look at a more
intelligent way of COSWriter as when writing a PDF we should be able to
benefit from storing "same" content only once and reference that.

BR
Maruan

> 
> However, PDFBOX-5286 shows the we have an issue with objects which
> aren't the 
> same but are treated as the same because of the same hash. This is
> true for all 
> simple objects such as COSInteger, COSFLoat, COSBoolean and COSName.
> 
> Think about the following two indirect /Length objects
> 
> 100 0 obj
> 512
> endobj
> 
> 
> 200 0 obj
> 512
> endobj
> 
> * there two different COSObjects "100 0" and "200 0"
> * both COSObjects have different hashes
> * both COSObjects are referencing a COSInteger holding the same value
> "512"
> * both COSIntegers are different objects
> * both COSIntegers have the SAME hash, as the current implementation
> of hashCode 
> is based on the value of the COSInteger
> 
> Or some pseudo code
> 
> COSObject(100,0) != COSObject(200,0)
> COSInteger(100,0) != COSInteger(200,0)
> COSObject(100,0).hashCode != COSObject(200,0).hashCode
> COSInteger(100,0).hashCode == COSInteger(200,0).hashCode
> COSInteger(100,0).equals(COSInteger(200,0) == true
> 
> IMHO we should change the implementation of hashCode so that
> different objects 
> will have different hashCodes.
> 
> I expect some side effects
> * we are using a lot of hash-based collections and I'm afraid there
> may be some 
> cases where the fact of having the same hash for different objects is
> wanted 
> (knowingly or not)
> * we have to remove the static instances for COSInteger values in a
> range from 
> -100 to 256 which will result in an increased number of COSInteger
> instances
> * there are just two static instances of COSBoolean ("true" and
> "false") which 
> have to be replaced too
> * COSName is caching a lot of values as static instances as well,
> which should 
> be removed as well
> * looks like COSFloat shouldn't be a problem
> 
> WDYT? Should we simply start with COSFloat and COSInteger and see how
> it ends up?
> 
> Andreas
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: dev-help@pdfbox.apache.org
> 

-- 
-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahyoun@fileaffairs.de
www.fileaffairs.de

Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: COSBase, avoid to have the same hashCode for different objects holding the same value

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,

@Mel good to see you are still around

Am 07.03.22 um 14:55 schrieb Martinez, Mel - 0441 - MITLL:
> I’m mainly a lurker on PDFBox these days, as I have moved to focus on other things (though ironically my latest tasking does indirectly make use of PDFBox again!) and I am not familiar with the details of the code but I would offer the following advice on this:
> 
> If two objects are:
> 
> a) the same type and
> b) have the same value (i.e., evaluate:  obj1.equals(obj2) == true ) and
> c) are immutable (i.e., their value cannot be changed once constructed)
> 
> Then they should have the same hashcode.
> 
> Conceptually such objects represent the same exact, immutable, information.  This is why two String objects that both hold the same character sequence have the same hashcode.   These sort of immutable objects are considered interchangeable when they have the same data value.  Code execution is exactly the same regardless of which object instance you use for a given sequence of code.  Numbers such as Integers, Longs, Floats and Doubles and Booleans also all represent immutable information and the same rules apply.  The number “5” is informationally identical throughout the universe and indeed all references to “5” are really references to the same immutable information.
> 
> A huge advantage of treating such equal-value, same-type objects interchangeably (by giving them identical hashcodes) is that they can be used with Object Pooling to reduce memory and improve performance.
> 
> If the object types are not immutable, however — i.e., if it is possible for their values to be modified (such as with setters or other mutators) then whether they should have the same hashcode depends on how they are used.  Do they have other data fields that are not being taken into consideration when the hash is calculated?   Do they have transient fields that are not maintained across serialization?
> 
> Hashcodes usually (not always) should be persistent through the life cycle of the object.  If you put an object in a hashmap, the internal bucket it gets dropped into will be based on the hash and you (normally) don’t want that changing while the object reference is stored in the map.
> 
> I can not recall enough detail about the PDFBox codebase or the COSxxxx wrappers in particular to be able to assert how these points apply so I am just offering these concepts here for folks to keep in mind.
I'm afraid it is more complicated than that and I myself didn't think about all 
pitfalls.

@Mel It is correct that immutable objects sharing the same value and type should 
be equal and have the same hashcode at least from a programmers point of view.

Those simple COSBase-objects such as COSBoolean, COSInteger, COSFloat and 
COSName seem to match those characteristics. But the base class COSBase has to 
changeable attributes ("key" and "direct") so that instances of those classes 
aren't immutable and don't qualify for your definition.

But after reading yours and Maruans answer I rethink my proposal and withdraw 
it. The whole thing isn't that easy. I guess we have to overhaul to whole 
direct/indirect stuff and some more. And maybe we will find a more elegant way 
to represent COS-objects, so that we don't have think about equals and hashCode 
anymore.

Looks like there are still a lot interesting challenges left ;-)

Andreas

> 
> I’m sure you guys will make the right design decision.   And thanks a ton for all the work you guys have done over the years.
> 
> Mel
> 
> Dr. Mel Martinez
> m.martinez@ll.mit.edu<ma...@ll.mit.edu>
> 
> 
> On Mar 5, 2022, at 10:30 AM, Andreas Lehmkuehler <an...@lehmi.de>> wrote:
> 
> Hi,
> 
> I'm not sure if we dicussed that topic in the past or if I simply mixed it up with a discussion about "equals" and "="
> 
> However, PDFBOX-5286 shows the we have an issue with objects which aren't the same but are treated as the same because of the same hash. This is true for all simple objects such as COSInteger, COSFLoat, COSBoolean and COSName.
> 
> Think about the following two indirect /Length objects
> 
> 100 0 obj
> 512
> endobj
> 
> 
> 200 0 obj
> 512
> endobj
> 
> * there two different COSObjects "100 0" and "200 0"
> * both COSObjects have different hashes
> * both COSObjects are referencing a COSInteger holding the same value "512"
> * both COSIntegers are different objects
> * both COSIntegers have the SAME hash, as the current implementation of hashCode is based on the value of the COSInteger
> 
> Or some pseudo code
> 
> COSObject(100,0) != COSObject(200,0)
> COSInteger(100,0) != COSInteger(200,0)
> COSObject(100,0).hashCode != COSObject(200,0).hashCode
> COSInteger(100,0).hashCode == COSInteger(200,0).hashCode
> COSInteger(100,0).equals(COSInteger(200,0) == true
> 
> IMHO we should change the implementation of hashCode so that different objects will have different hashCodes.
> 
> I expect some side effects
> * we are using a lot of hash-based collections and I'm afraid there may be some cases where the fact of having the same hash for different objects is wanted (knowingly or not)
> * we have to remove the static instances for COSInteger values in a range from -100 to 256 which will result in an increased number of COSInteger instances
> * there are just two static instances of COSBoolean ("true" and "false") which have to be replaced too
> * COSName is caching a lot of values as static instances as well, which should be removed as well
> * looks like COSFloat shouldn't be a problem
> 
> WDYT? Should we simply start with COSFloat and COSInteger and see how it ends up?
> 
> Andreas
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org<ma...@pdfbox.apache.org>
> For additional commands, e-mail: dev-help@pdfbox.apache.org<ma...@pdfbox.apache.org>
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: COSBase, avoid to have the same hashCode for different objects holding the same value

Posted by "Martinez, Mel - 0441 - MITLL" <m....@ll.mit.edu>.
I’m mainly a lurker on PDFBox these days, as I have moved to focus on other things (though ironically my latest tasking does indirectly make use of PDFBox again!) and I am not familiar with the details of the code but I would offer the following advice on this:

If two objects are:

a) the same type and
b) have the same value (i.e., evaluate:  obj1.equals(obj2) == true ) and
c) are immutable (i.e., their value cannot be changed once constructed)

Then they should have the same hashcode.

Conceptually such objects represent the same exact, immutable, information.  This is why two String objects that both hold the same character sequence have the same hashcode.   These sort of immutable objects are considered interchangeable when they have the same data value.  Code execution is exactly the same regardless of which object instance you use for a given sequence of code.  Numbers such as Integers, Longs, Floats and Doubles and Booleans also all represent immutable information and the same rules apply.  The number “5” is informationally identical throughout the universe and indeed all references to “5” are really references to the same immutable information.

A huge advantage of treating such equal-value, same-type objects interchangeably (by giving them identical hashcodes) is that they can be used with Object Pooling to reduce memory and improve performance.

If the object types are not immutable, however — i.e., if it is possible for their values to be modified (such as with setters or other mutators) then whether they should have the same hashcode depends on how they are used.  Do they have other data fields that are not being taken into consideration when the hash is calculated?   Do they have transient fields that are not maintained across serialization?

Hashcodes usually (not always) should be persistent through the life cycle of the object.  If you put an object in a hashmap, the internal bucket it gets dropped into will be based on the hash and you (normally) don’t want that changing while the object reference is stored in the map.

I can not recall enough detail about the PDFBox codebase or the COSxxxx wrappers in particular to be able to assert how these points apply so I am just offering these concepts here for folks to keep in mind.

I’m sure you guys will make the right design decision.   And thanks a ton for all the work you guys have done over the years.

Mel

Dr. Mel Martinez
m.martinez@ll.mit.edu<ma...@ll.mit.edu>


On Mar 5, 2022, at 10:30 AM, Andreas Lehmkuehler <an...@lehmi.de>> wrote:

Hi,

I'm not sure if we dicussed that topic in the past or if I simply mixed it up with a discussion about "equals" and "="

However, PDFBOX-5286 shows the we have an issue with objects which aren't the same but are treated as the same because of the same hash. This is true for all simple objects such as COSInteger, COSFLoat, COSBoolean and COSName.

Think about the following two indirect /Length objects

100 0 obj
512
endobj


200 0 obj
512
endobj

* there two different COSObjects "100 0" and "200 0"
* both COSObjects have different hashes
* both COSObjects are referencing a COSInteger holding the same value "512"
* both COSIntegers are different objects
* both COSIntegers have the SAME hash, as the current implementation of hashCode is based on the value of the COSInteger

Or some pseudo code

COSObject(100,0) != COSObject(200,0)
COSInteger(100,0) != COSInteger(200,0)
COSObject(100,0).hashCode != COSObject(200,0).hashCode
COSInteger(100,0).hashCode == COSInteger(200,0).hashCode
COSInteger(100,0).equals(COSInteger(200,0) == true

IMHO we should change the implementation of hashCode so that different objects will have different hashCodes.

I expect some side effects
* we are using a lot of hash-based collections and I'm afraid there may be some cases where the fact of having the same hash for different objects is wanted (knowingly or not)
* we have to remove the static instances for COSInteger values in a range from -100 to 256 which will result in an increased number of COSInteger instances
* there are just two static instances of COSBoolean ("true" and "false") which have to be replaced too
* COSName is caching a lot of values as static instances as well, which should be removed as well
* looks like COSFloat shouldn't be a problem

WDYT? Should we simply start with COSFloat and COSInteger and see how it ends up?

Andreas










---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org<ma...@pdfbox.apache.org>
For additional commands, e-mail: dev-help@pdfbox.apache.org<ma...@pdfbox.apache.org>