Roman numbers are one of those legacy system that hang around without ever really disappearing. Roman numbers are difficult to parse, and their length varies at lot. The mystery sequence in my previous post was just that: the length of roman numeral `n` for each value of `n`. The question I was wondering is: how inefficient is the roman numeral system? The decimal basically needs log_{10}(`n`) symbols to represent number `n`, the same number in binary format will need log_{2}(`n`). My intuition would be that the roman system is less efficient (in terms of symbols needed to represent a value) than the decimal system, but better than the binary representation.

So I wrote a small program to compute how many chars would be needed to represent a given number.

The graph shows the various number system, each point is the mean number of characters needed to represent the values in the range [2^{i} … 2^{i+1}[. Obviously the binary representation is exactly linear. The decimal notation is also linear, with steps each time a power of 10 is crossed. The roman notation is also linear up to 2^{18} where the curves goes up: the reason is simple, there is no symbol for values above 10000 (ↈ). Up until that point roman numeral seem to be as efficient as a ternary notation.

To generate those numbers, I created a small python script to generate roman numerals. For the sake of consistency, I chose to only generate characters from the unicode roman numeral range (`x2160`

to `x2188`

). One interesting thing in this range is that it contains hybrid characters, like Ⅷ (`x2167`

) which represent the roman numeral for eight in one single character. Using that range we get slightly different curve, the asymptotic behaviour is not changed, but numbers are generally shorter by two characters and for small numbers, the system is now as efficient as arabic numerals.

Here is the python that generates the shorter roman numerals using the hybrid characters.

NUMERALS = ( (100000, u'ↈ'), (90000, u'ↂↈ'), (50000, u'ↇ'), (40000, u'ↂↇ'), (10000, u'ↂ'), (9000, u'Ⅿↂ'), (5000, u'ↁ'), (4000, u'Ⅿↁ'), (1000, u'Ⅿ'), (900, u'ⅭⅯ'), (500, u'Ⅾ'), (400, u'ⅭⅮ'), (100, u'Ⅽ'), (90, u'ⅩⅭ'), (50, u'Ⅼ'), (40, u'ⅩⅬ'), (20, u'ⅩⅩ'), # to avoid 24 = 'ⅫⅫ' (12, u'Ⅻ'), (11, u'Ⅺ'), (10, u'Ⅹ'), (9, u'Ⅸ'), (8, u'Ⅷ'), (7, u'Ⅶ'), (6, u'Ⅵ'), (5, u'Ⅴ'), (4, u'Ⅳ'), (3, u'Ⅲ'), (2, u'Ⅱ'), (1, u'Ⅰ'), ) def _roman(n): for value, text in NUMERALS: times, n = divmod(n, value) yield text * times if not n: return def roman(n): return u''.join(_roman(n))