In a previous post, I opined about how Ising models could be applied to learning relationships between musical artists based solely on their co-occurrence patterns. So I got to thinking: what if the “artists” were actually words in two languages? By looking at the connections between words in language 1 and words in language 2, we could automatically learn translations between the two languages.
To do this, we need a corpus where sentences in the two languages are aligned, so that we can get good co-occurrence statistics. Luckily, Europarl is a great resource for this sort of thing. From this point on, we’ll be talking about the French-English Europarl.
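To make the setup concrete, here’s a minimal sketch of reading a sentence-aligned corpus like this. The file names and the one-sentence-per-line layout (line i of the English file aligned with line i of the French file) are assumptions, and the tokenizer is as simple as it sounds:

```python
import re

def tokenize(sentence):
    # The only NLP tool in play: a lowercasing, punctuation-stripping tokenizer.
    return re.findall(r"\w+", sentence.lower())

def load_aligned(en_path="europarl.fr-en.en", fr_path="europarl.fr-en.fr"):
    # Hypothetical file names; assumes one sentence per line, aligned by line number.
    with open(en_path, encoding="utf-8") as en_f, open(fr_path, encoding="utf-8") as fr_f:
        for en_line, fr_line in zip(en_f, fr_f):
            yield tokenize(en_line), tokenize(fr_line)
```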
Next we need to decide what sort of co-occurrence statistics to include. There are two roads here. In the first case, we include all the co-occurrences within the concatenated English-French sentence. So “delegation” might co-occur with both “foreign” (because they tend to occur together in English sentences) and “délégation” (because it occurs in the corresponding French sentences). In the second case, we keep only co-occurrences between language 1 and language 2, and none of the co-occurrences within language 1 or within language 2. In the example above, this would mean that we would NOT consider co-occurrences of “delegation” and “foreign”.
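In code, the two routes differ only in which pairs get counted. Here’s a rough sketch, assuming the tokenized sentence pairs from above; word types are tagged with their language so that English “delegation” and French “délégation” stay distinct:

```python
from collections import Counter
from itertools import combinations, product

def cooccurrences(pairs, route=2):
    """Count word-type co-occurrences over aligned sentence pairs.

    route=1: treat the concatenated English+French sentence as one bag, so
             intra-language pairs (e.g. delegation/foreign) are counted too.
    route=2: keep only cross-language pairs (an English word with a French word).
    """
    counts = Counter()
    for en_tokens, fr_tokens in pairs:
        en_types = {("en", w) for w in en_tokens}
        fr_types = {("fr", w) for w in fr_tokens}
        if route == 1:
            word_pairs = combinations(sorted(en_types | fr_types), 2)
        else:
            word_pairs = product(en_types, fr_types)
        for a, b in word_pairs:
            counts[(a, b)] += 1
    return counts
```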
When I was deciding which of these to do, I thought route 1 might be better: with more information, the model might be more capable of teasing out the fact that “delegation” does not translate to “étrangère”, since “delegation” co-occurs with “foreign” and the model can explain away the co-occurrences with “étrangère.”
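I won’t rehash the fitting procedure from the earlier post, but to give a flavor: a standard way to estimate Ising-style couplings from binary occurrence data is node-wise L1-regularized logistic regression (pseudo-likelihood, a.k.a. neighborhood selection). The sketch below is that generic recipe, not necessarily my exact setup; X is a binary sentence-pair-by-word occurrence matrix and vocab holds the language-tagged word types from above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_couplings(X, vocab, C=0.1):
    """Estimate pairwise couplings by regressing each word on all the others.

    X: (n_sentence_pairs, n_words) binary occurrence matrix.
    vocab: list of language-tagged word types, e.g. ("en", "delegation").
    Returns a dict mapping (word_i, word_j) -> coupling estimate.
    """
    n, d = X.shape
    couplings = {}
    for j in range(d):
        y = X[:, j]
        if y.min() == y.max():
            continue  # word occurs in all or no sentence pairs; nothing to fit
        others = np.delete(np.arange(d), j)
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(X[:, others], y)
        for idx, w in zip(others, clf.coef_[0]):
            if w != 0.0:
                couplings[(vocab[j], vocab[idx])] = w
    return couplings
```

For route 2 you would restrict each regression to features from the other language only, which is exactly the “drop the intra-language edges” choice described above.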
How did it do? Here are dictionaries for route 1 and route 2. (Note that the sets of translated words are a little bit different because I automatically prune the entries we’re less sure about.) These dictionaries show the top French translation for each English word (i.e., the French word with the highest correlation parameter in the Ising model). There’s lots of interesting stuff in the 2nd, 3rd, etc. best translations too (for example, while the top translation for “reply” is “réponse,” the next one on the list is “répondre”), but for economy of space I’m only putting up the top one.
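Reading a dictionary off the learned parameters is then just a matter of taking, for each English word, the French word(s) with the largest coupling. A sketch, assuming the couplings dict and language tags from above:

```python
def top_translations(couplings, k=1):
    """For each English word, return the k French words with the largest couplings."""
    candidates = {}
    for (a, b), w in couplings.items():
        # Keep only cross-language entries, oriented English -> French.
        for en, fr in ((a, b), (b, a)):
            if en[0] == "en" and fr[0] == "fr":
                candidates.setdefault(en[1], []).append((w, fr[1]))
    return {en: [fr for _, fr in sorted(cands, reverse=True)[:k]]
            for en, cands in candidates.items()}
```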
I’ve also pruned a lot of the stop words automatically. Note that those get translated very poorly, as do extremely polysemous words, so you’re not liable to get very good translations for words like “that.” Multi-word idioms also do poorly (“once” -> “fois” rather than “une fois”, “course” -> “sûr” rather than “of course” -> “bien sûr”). Words with no direct translation are also problematic, such as English modal verbs that are often captured by inflection in French (“will” -> “sera” instead of “will be” -> “sera”, “would” -> “voudrais” instead of “would like” -> “voudrais” [which incidentally tells you a lot about how members of the European Parliament use the word “would”]).
But for specific and somewhat rare words, it does amazingly well (“talk” -> “parler”, “rights” -> “droits”, “stress” -> “souligner”). I don’t have exact numbers (perhaps I can make AMT do this for me), but it seems to do pretty well, and ironically I think route 2 does better. I’m not entirely sure why, but I suspect it’s because, without the intra-language co-occurrences, the model can focus on the connections that explain the inter-language co-occurrences.
In summary, the approach is fast, easy, and works pretty well (considering that the only NLP tool I used was a simple tokenizer; no morphological analysis or stemming was done!).