Python Logo

Bugs in software have many sources, but ones caused by erroneous assumptions are among the most difficult to find. The problem is that these assumptions are baked in the code, so the reader will tend to take them in at the same time as the code.

Consider the following code that creates some simple histogram of input values. It seems reasonable to assume that the resulting dictionary will have a limited size, with at most steps entries. This assumption is incorrect.

def bucket(values, min, max, steps):
  r = max - min
  assert r > 0
  steps = int(steps)
  assert steps > 0
  d = float(steps) / r
  result = collections.defaultdict(int)
  for v in values:
    if v < min:
      v = min
    elif v > max:
      v = max
    k =  math.floor((v - min) * d) / d + min
    result[k] += 1
  return result

This code can return dictionaries which are much larger than steps entries. The problem here lies in the clamping logic. The assumption here is that a value that is not smaller than min and not larger than max is a value in the range ⟦minmax⟧.

This assumption is incorrect in Python (and probably many other languages), because in floating point, there is NaN. Now NaN in Python has many properties that break the assumptions of the code above:

  • Any floating point operation that has NaN as an argument returns NaN.
  • float('NaN') < XFalseX
  • float('NaN') > XFalseX
  • float('NaN') == float('NaN')False

The last property means that using floating point values as keys in dictionaries is a bad idea, as the equality operator is used to determine if two keys are the same. This means that you while you can add a value with key NaN, you can never access it by key, because that key is not equal with itself. So you can also add multiple values with the same key, and they will each have a different entry in the dict:

d[float('NaN')] = 4
d[float('NaN')] = 5 
d → {nan: 4, nan: 5}

Python will let you use anything as a dictionary key, even if the key is mutable, or does not implement the equality operator properly. So if you feed the following input to the bucket function above, you will a return dictionary with 1000 entries, regardless of the value of the step parameter.

v = itertools.repeat(float('nan'), 1000)

One thought on “Assumptions”

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: