Google Translate idea

03.08.2009 18:40

Google Translate is currently pretty useless for any serious translation (regardless of what translators of user manuals for consumer electronics think).

However, I sometimes find myself searching for, say, a Slovene equivalent to a specific English technical term or phrase. I'm fluent in both languages; it's just that the term may be too far outside my expertise for me to know the correct translation - although I can usually spot the correct one when I see it in context.

Scientific terminology is usually very strict, with one specific name for a phenomenon. And if you want the translation to appear correct in the eyes of someone knowledgeable in that field, a simple by-the-dictionary translation of that name may make you a popular target of in-jokes (take note, authors of subtitles on Slovenian television networks).

So, if I don't have someone fluent in that terminology at hand, what I usually do is first check for equivalent pages in different language editions of Wikipedia. More often than not, that approach fails miserably. Then I'm off to Google, where I search for the term I want to translate (in quotes) plus some terms in the target language that I estimate must appear nearby in the translation.

However, that's not really a good task for a general search engine. Translations often live on different web pages (sites frequently keep their English and Slovenian sections on separate pages), while my search only returns results when a single page contains both English and Slovene text. A lot of potentially useful results are therefore missed.

Here's my idea: machine translation tools like Google Translate already recognize pairs of texts on the web that are direct translations of each other. That's the input for their machine learning algorithms, where they learn how to (badly) translate free text.

Wouldn't it be nice if you could enter just an English phrase and select a language, and get back a list of English texts containing that phrase, plus the matching texts in Slovene? It's not even necessary to point out where exactly in the text the equivalent phrases are - paragraph-level precision would be more than enough. I can find the exact spot myself while reading the context (which I must do anyway, to make sure I'm using the correct translation).
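A minimal sketch of what such a lookup could look like, assuming the aligned paragraph pairs are already available as data. The `AlignedParagraph` type, the `find_translations` function and the two sample pairs are all made up for illustration; nothing here reflects how Google actually stores or exposes its parallel texts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedParagraph:
    """One paragraph pair from a hypothetical parallel corpus."""
    source: str  # paragraph in the source language (English here)
    target: str  # the matching paragraph in the target language
    origin: str  # where the pair came from, to help judge reliability


def find_translations(phrase, corpus):
    """Return every pair whose source paragraph contains the phrase.

    Matching is a case-insensitive substring test. No attempt is made to
    locate the equivalent phrase inside the target paragraph; the reader
    does that while reading the context.
    """
    needle = phrase.casefold()
    return [pair for pair in corpus if needle in pair.source.casefold()]


# Made-up sample data, for illustration only.
CORPUS = [
    AlignedParagraph(
        source="The speed of light in vacuum is constant.",
        target="Hitrost svetlobe v vakuumu je konstantna.",
        origin="example.org/physics",
    ),
    AlignedParagraph(
        source="Water boils at 100 degrees Celsius.",
        target="Voda vre pri 100 stopinjah Celzija.",
        origin="example.org/chemistry",
    ),
]

for pair in find_translations("speed of light", CORPUS):
    print(pair.source)
    print(pair.target)
    print("from:", pair.origin)
```

The query only ever touches the source-language side, which is the point: the alignment does the work that guessing target-language search terms does today.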

So in effect you would be using the infrastructure that most likely already exists, but for human instead of machine learning. I'm sure that would be a most useful tool for people translating technical texts. At least until machine translation becomes a little more accurate.

Posted by Tomaž | Categories: Ideas

Cumulus backups

01.08.2009 20:55

Cloud computing is all the rage these days. However, I'm kind of reluctant to trust some company in a country halfway across the globe with all my personal data, even if they picture it floating in fluffy heaven. So in a kind of old-fashioned way, I still store all my private documents, photos and emails on hard disks that are my own property.

However, backing up my home server to an external USB drive has become a little bit inconvenient lately. So I decided to give it a go and try backing up to Amazon Simple Storage Service (S3) - i.e. the cloud.

My server is on a residential ADSL line with relatively poor upstream bandwidth (1 Mb/s), so incremental backups must use the bandwidth efficiently. There's approximately 4 GB of compressed data to be backed up, which is just on the limit of what I would consider feasible for a setup like this - theoretically it should take around 12 hours to upload a full snapshot, but I don't plan to do these very often.
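The upload estimate can be checked with a few lines of arithmetic. This is a rough sketch: the `efficiency` parameter is an assumption added here to stand in for protocol overhead and other losses, and 1 GB is taken as 10^9 bytes. At the raw line rate, 4 GB over 1 Mb/s comes out nearer 9 hours; a figure of around 12 hours corresponds to roughly three quarters of the nominal rate being usable.

```python
def upload_hours(size_gb, rate_mbit_s, efficiency=1.0):
    """Hours needed to upload size_gb gigabytes (10**9 bytes each).

    rate_mbit_s is the nominal upstream rate in megabits per second.
    efficiency is the assumed usable fraction of that rate (0 < e <= 1).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    bits = size_gb * 1e9 * 8
    seconds = bits / (rate_mbit_s * 1e6 * efficiency)
    return seconds / 3600


# Raw line rate versus an assumed 75% usable rate.
print(f"ideal: {upload_hours(4, 1):.1f} h")
print(f"at 75% efficiency: {upload_hours(4, 1, efficiency=0.75):.1f} h")
```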

After some investigation I decided to use Duplicity: first, because it's advertised as efficient in exactly this use case, and second, because I already use it to back up my computer at work. Although the official man page is a little short on details about S3 storage, there are quite a few articles floating around.
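For orientation, here is a sketch of how a Duplicity run against S3 might be assembled from a script. It is not the configuration used on this server. The bucket name, prefix, key ID and the monthly full-backup interval passed in are placeholders, and the `s3+http://` URL form is the one older Duplicity releases accept; the scheme has changed between versions, so check the man page for the installed one. Duplicity reads the S3 credentials from the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables.

```python
import os
import subprocess


def duplicity_command(source_dir, bucket, prefix, gpg_key_id, full_every="1M"):
    """Build the argument list for one duplicity backup run.

    A full backup is forced when the newest full one is older than
    full_every (duplicity time syntax, "1M" = one month); otherwise
    duplicity makes an incremental backup.
    """
    return [
        "duplicity",
        "--full-if-older-than", full_every,
        "--encrypt-key", gpg_key_id,       # encrypt with this GPG public key
        source_dir,
        f"s3+http://{bucket}/{prefix}",    # older-style S3 target URL
    ]


def run_backup(cmd):
    """Run duplicity, refusing to start if the S3 credentials are missing."""
    for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
        if var not in os.environ:
            raise RuntimeError(f"{var} is not set")
    subprocess.run(cmd, check=True)


# Placeholder values; nothing is executed here, the command is only printed.
cmd = duplicity_command("/home", "example-bucket", "homeserver", "ABCD1234")
print(" ".join(cmd))
```

Called from cron once a day, `run_backup(cmd)` would give incremental uploads most of the time and a fresh full snapshot once the previous one is over a month old.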

The cost of S3 storage is pretty minimal: I never plan to store more than around 20 GB worth of backups, and if I count in a monthly 4 GB full snapshot, that comes to $3.50 per month. Granted, this is very expensive if you compare it to the price of an external USB drive, but it has the benefit of being off-site and conveniently accessible from anywhere on the internet.
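A back-of-the-envelope version of that bill, as a sketch. The two unit prices below are assumptions (roughly what S3 charged in the US region around this time) and should be replaced with the current price list. Per-request charges and the incremental uploads are left out, so the result lands slightly under the $3.50 quoted above.

```python
# Assumed unit prices in USD; check the current S3 price list.
STORAGE_USD_PER_GB_MONTH = 0.15
UPLOAD_USD_PER_GB = 0.10


def monthly_cost(stored_gb, uploaded_gb,
                 storage_price=STORAGE_USD_PER_GB_MONTH,
                 upload_price=UPLOAD_USD_PER_GB):
    """Estimated monthly bill: storage plus inbound transfer only."""
    return stored_gb * storage_price + uploaded_gb * upload_price


# 20 GB kept in the bucket, one 4 GB full snapshot uploaded per month.
print(f"${monthly_cost(20, 4):.2f} per month")
```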

Of course, the tiny paranoid voice in my head made me check all the worst case scenarios: if Amazon suddenly disappears from the face of the Earth, I would be left without backups. But I judge that the possibility of that happening and me needing the backups in the same instant is too low to worry about. The data I'm sending over the Atlantic is encrypted with GPG, so it's presumably safe even in the unlikely case someone in the US would want to browse through my stuff pretending to be looking for terrorists or some such nonsense.

One problem I do see is that these backups are not safe in case someone breaks into my server, since they could be altered or erased by the attacker - but that's the case with most if not all automated backups. In addition to that, Sysadminman makes an interesting point that in case someone gets my Amazon credentials they can run up a huge bill in my name, since it's not possible to put bandwidth limits on an account. That's not a possibility that would make me lose sleep at night, but I did make a note to occasionally check my account activity.

Finally, how does this work in practice? A full backup takes 14 hours, while an incremental one is finished in a little less than 15 minutes. One thing I still have on the to-do list is to look into Linux QoS settings and make some adjustments, so that I can still comfortably read my email over IMAP and the NTP client doesn't panic once a month when the full backup is made.

So right now, after two weeks of use, it looks like I'll stick to this backup scheme. Still, it's nice to know I can cancel the service at any moment should any serious problems come up.

Posted by Tomaž | Categories: Code