Google’s Japanese Search

Google Eiga Sushi Screen Capture

Last saturday I had a traffic spike, nothing serious, but still above the 50 hits a day I usually get. This spike was caused mostly by japanese surfers. The main readers of this blog have, until now, had their browser set-up for english or french, but sudently, Japanese surfers represented 50% of the traffic.

The culprit was this page about a sushi movie. As there was no new referal source, this traffic came from search engines. I realized that my blog turn up in fourth position if you search for the terms 映画 鮨eiga sushi on google, that is if you search for movie and sushi. This seems very wrong to me. It can mean one of the following things:

  • That my blog’s entry is somehow the fourth most relevant page on the theme of movies and sushis. Difficult to believe.
  • That google’s indexing algorithm has trouble with japanese pages. Actually, I already suspected that search engines had trouble with the japanese side of the web this spring, when for some time my page, was highly ranked on queries about the 川北祭りKawakita Matsuri. (This is not the case anymore.)

So what is the problem? I suspect part of the problem comes for the dual writing systems: Hirgana and Kanji. A single word can be written using either system, but google returns completely different results for queries in the two systems:

So why doesn’t google transform the data to one canonical writing system? The problem is that the transformation is not always trivial. Transforming kanji to hiragana is reasonably easy: there are web sites that add hiragana reading to a web page. The problem of this transformation is that it loses semantic information: a kanji has a meaning, hiragana is just a phonetic writing. Multiple kanjis can map to the same hiragana (in fact this is very frequent). The reverse transformation (converting hiragana to kanji) is very difficult to do automatically, as it implies understanding the context. The best solution in my opinion would be to convert kanji to hiragana when indexing a page and indexing both the kanji and hiragana version in the index. When a query is entered in hiragana, suggest the kanji based on statistical data, in the same way the google engine suggests corrections for mis-spellings. The main drawback of this approach is that we have doubled the size of word index for a page. Although you might be able to avoid this by making the page point to the kanji and the kanji point on the hiragana reading.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: