One of the best ways to learn things in computer science is failing. It is also one of the things you practise the least at school. When teachers talk about hard problems, they mention large, complex problems, NP-hard problems, or algorithms with very clever solutions. They never mention problems with crappy data. One of the first projects that confronted me with this came while I was still studying at university. As a side job, I did maintenance for a medical lab associated with the university, which mostly meant looking after the lab’s fleet of Macintoshes.
One day I was asked if I could help them clean up their bibliographical database. The database itself was stored in FileMaker; the problem was that the data had been entered by hand, so author names came in multiple formats: most records used Given Name Family Name, but some were in the reverse format, Family Name, Given Name (notice the comma). This seemed simple enough: export the data to CSV, run it through a small program into another CSV file, re-import the result, voilà.
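The naive first pass could look something like the sketch below. This is a hypothetical reconstruction, not the program I wrote back then: the sample names and the CSV layout (author, title) are made up, and the rule is deliberately simplistic, treating any comma as the reversed-format marker.

```python
import csv
import io

def normalize_name(name: str) -> str:
    """Turn 'Family, Given' into 'Given Family'; pass other names through.

    Naive first version: ignores compound names, particles,
    multiple authors, and broken encodings.
    """
    if "," in name:
        family, _, given = name.partition(",")
        return f"{given.strip()} {family.strip()}"
    return name.strip()

# Hypothetical rows as they might come out of the FileMaker export.
rows = [
    ("Knuth, Donald", "The Art of Computer Programming"),
    ("Donald Knuth", "Literate Programming"),
]
out = io.StringIO()
writer = csv.writer(out)
for author, title in rows:
    writer.writerow([normalize_name(author), title])
print(out.getvalue())
```

On clean single-author records this round-trips exactly as hoped; the trouble, as the next paragraph shows, is that real records are rarely clean.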
There were two problems. First, the file was big, with many thousands of records, so each processing run took around ten minutes. Second, there were many exceptions that will be familiar to anybody who has done human-name processing: compound names, particles, multiple authors separated by commas, the word and, sometimes the Oxford comma, sometimes even the ampersand. Some of the data was ill-formed, some used exotic characters, some had broken encodings. Each time I fixed one issue, two others popped up. In the end, I had a program that fixed most of the names, but I had to warn the good doctors that there were still many issues in the data. They were not really pleased.
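To see why these exceptions compound, consider a naive author-list splitter. The sketch below (again hypothetical, with invented sample names) splits on commas, the word and, and the ampersand; it handles simple lists, but the very same comma that separates authors also marks the reversed Family Name, Given Name format, so the two rules cannot both be right for every record:

```python
import re

def split_authors(field: str) -> list[str]:
    """Naively split an author field on commas, 'and', and '&'."""
    parts = re.split(r"\s*(?:,|\band\b|&)\s*", field)
    return [p for p in parts if p]

# Fine for a simple list (Oxford comma included):
print(split_authors("Alice Martin, Bob Stone, and Carol Reed"))
# But a single reversed name is torn into two "authors":
print(split_authors("Martin, Alice"))
```

Resolving that ambiguity needs context beyond the field itself, which is exactly why each fix kept exposing new breakage.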
Interestingly enough, this is still a complex problem today. I can now see other ways of approaching it, but they would mostly involve crowd-sourcing or data-mining the correct names out of the web. Given the data I had at that time, I’m not sure I would do much better…