Diversity and data assumptions

It is no secret that diversity is an issue in computer science. A lot has already been said and tried to improve the situation, with little success, and sadly I don’t think there is a silver bullet, as this is a complicated problem.

Still, it makes sense to challenge the way we think about things: computer science tends to be dominated by a few communities which, even though they are quite international, tend to replicate their thought patterns and their preconceptions. I cringe every time someone wants to build another Silicon Valley: one is enough; we need something else.

One enduring pattern is the assumption that text is ASCII: a majority of the people working in IT come from a culture whose written language cannot be expressed properly using only the characters of modern English, yet they build systems where this or that text field cannot contain anything but English characters. A majority reproducing a pattern that does not suit them as users.

How can you challenge assumptions about who works in information technology if you cannot even challenge the idea of what text is? In this case, a de facto standard that is only usable by a minority: the fraction of websites that are pure ASCII has been falling steadily, yet the number of applications and systems that can only properly process ASCII is huge.
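
To make the pattern concrete, here is a minimal sketch in Python (the validator names and rules are mine for illustration, not taken from any particular system): an ASCII-only text-field check of the kind described above, next to a Unicode-aware version.

```python
import re

# Hypothetical "name" validator of the kind described above: it only
# accepts the 26 letters of modern English, so it rejects perfectly
# ordinary names from most of the world's languages.
ASCII_ONLY = re.compile(r"^[A-Za-z ]+$")

def is_valid_name_ascii(name: str) -> bool:
    return bool(ASCII_ONLY.match(name))

# Unicode-aware version: accept letters from any script plus spaces,
# relying on str.isalpha(), which covers the full Unicode range.
def is_valid_name_unicode(name: str) -> bool:
    return name.strip() != "" and all(ch.isalpha() or ch.isspace() for ch in name)

for name in ["John Smith", "José García", "李小龙", "Ólafur Þór"]:
    print(name, is_valid_name_ascii(name), is_valid_name_unicode(name))
```

Only the first name passes the ASCII-only check; the other three, all unremarkable names in their own languages, are rejected.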

I’m certainly not claiming that fixing that particular technical problem would in any way improve the diversity situation, but I have the feeling that the underlying problems are similar: a system that has worked for some time, a large body of evidence showing that it is broken, and an unwillingness to change because doing so would challenge some core processes and assumptions…

3 thoughts on “Diversity and data assumptions”

  1. Why were computers created by the people with the most reduced character set in the world? You could imagine a uchronia where the Chinese invent IT, and computers are capable from the beginning of dealing with huge character sets. Unicode from the beginning.

  2. English does not have the smallest character set in the world: Italian uses fewer letters (J, K, W, X and Y are not used), and ASCII includes a lot of unnecessary characters (separate upper and lower case, control characters that are now dead).

    ASCII really became the standard with the rise of micro-computers. It is true that if computing had originated in China, we would probably have had a text encoding supporting far more characters, but I suspect there would have been the reverse problems:
    • The Chinese writing system has no notion of upper and lower case.
    • The Chinese script has no variable-width characters, ascenders or descenders.
    • Many Western scripts share characters with the same origin: α, a and а (Cyrillic a) are closely related and could be simplified into one.
    • Western accents could be approximated using Japanese trailing inflection marks (i.e. place the accent after the letter).

    • Idea for a uchronia: the European Union is created in 1917; Latin becomes the common language for business purposes; computers are created in the 60s to compute the mess of pan-European taxes; Chinese users complain that computers were created by people with the most basic character set.
