
Bugs in software have many sources, but ones caused by erroneous assumptions are among the most difficult to find. The problem is that these assumptions are baked in the code, so the reader will tend to take them in at the same time as the code.
Consider the following code that creates some simple histogram of input values. It seems reasonable to assume that the resulting dictionary will have a limited size, with at most steps
entries. This assumption is incorrect.
def bucket(values, min, max, steps): r = max - min assert r > 0 steps = int(steps) assert steps > 0 d = float(steps) / r result = collections.defaultdict(int) for v in values: if v < min: v = min elif v > max: v = max k = math.floor((v - min) * d) / d + min result[k] += 1 return result
This code can return dictionaries which are much larger than steps
entries. The problem here lies in the clamping logic. The assumption here is that a value that is not smaller than min
and not larger than max
is a value in the range ⟦min
…max
⟧.
This assumption is incorrect in Python (and probably many other languages), because in floating point, there is NaN
. Now NaN
in Python has many properties that break the assumptions of the code above:
- Any floating point operation that has
NaN
as an argument returnsNaN
. float('NaN') < X
→False
∀ Xfloat('NaN') > X
→False
∀ Xfloat('NaN') == float('NaN')
→False
The last property means that using floating point values as keys in dictionaries is a bad idea, as the equality operator is used to determine if two keys are the same. This means that you while you can add a value with key NaN
, you can never access it by key, because that key is not equal with itself. So you can also add multiple values with the same key, and they will each have a different entry in the dict:
d[float('NaN')] = 4 d[float('NaN')] = 5 d → {nan: 4, nan: 5}
Python will let you use anything as a dictionary key, even if the key is mutable, or does not implement the equality operator properly. So if you feed the following input to the bucket
function above, you will a return dictionary with 1000 entries, regardless of the value of the step parameter.
v = itertools.repeat(float('nan'), 1000)
Ça me rappelle le NULL en bases de données, avec WHERE monchamp = NULL qui renverra toujours FAUX, et WHERE monchamp NULL qui renverra FAUX aussi.
J’avais commis un billet sur les “assumptions” (je vois pas de mot en français : suppositions ? assumations ?), mais venant du monde réel, là :
http://www.coindeweb.net/blogsanssujetprecis/index.php?post/2008/04/07/466-les-pieges-dans-le-developpement