A new kind of mess

29.04.2010 15:29

One of the features of the Zemanta API is image suggestion. For example, if you are writing an article about bowls of petunias, Zemanta will suggest nice photographs of that particular kind of plant life to go with your post.

A part of those suggested images comes from English Wikipedia and Wikimedia Commons. And since computer vision just isn't there yet, Zemanta's back-end can only learn about the content of the images from text and various other machine-readable data that is related to that image.

The problem is that while usable data is relatively abundant, it is scattered around the English and Commons wikis. It's present directly or in various more or less complicated templates that appear on image description pages. Articles that include an image, and the captions they use, also hold clues to the image content. Sometimes it's even necessary to go through 5 or more jumps before you can connect a useful piece of metadata to an image, for example between an article including a "Wikimedia Commons has more media related to" box and the actual picture.

Previously, this data extraction was performed in a traditional way: a series of Python scripts read the dumps provided by Wikimedia and stored the information in various MySQL tables. In its last incarnation, this system took around 30 hours to process both wikis. This might not seem like much, but with every new dump something inevitably breaks. Perhaps a critical template has been renamed, or a minor markup change has exposed an odd bug in a Python script. This meant that two or three sessions were often required and a new dump quickly consumed a week's worth of work.

Old image processing system using Python scripts and MySQL tables.

Two weeks ago I replaced this monster with a new one built upon the map-reduce paradigm that's all the rage these days. It uses the Disco framework and performs the job in a little over 2 hours. This basically trades indexed MySQL table access (lots of expensive hard-disk seeks) for multiple passes over sorted flat files. These use sequential reads and are thus much faster. In practice, however, there's still a lot of disk seeking going on, because Disco sorts these huge files on disk (actually using GNU sort behind the curtains). Even so, the performance improvement is significant.
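To make the map, sort and reduce passes a bit more concrete, here is a minimal sketch in plain Python. It does not use the Disco API, and the record layout, the function names and the sample data are all made up for illustration. An in-memory sort stands in for the on-disk sort between the two steps.

```python
from itertools import groupby
from operator import itemgetter


def map_clues(record):
    """Map step: emit (image, clue) pairs from one hypothetical article record."""
    article, image, caption = record
    yield image, "article:" + article
    yield image, "caption:" + caption


def reduce_clues(sorted_pairs):
    """Reduce step: one sequential pass over pairs already sorted by key."""
    for image, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield image, [clue for _, clue in group]


def run_job(records):
    pairs = [pair for record in records for pair in map_clues(record)]
    pairs.sort()  # stands in for the external sort of a huge flat file
    return dict(reduce_clues(pairs))


# Made-up sample input: (article title, image name, caption).
records = [
    ("Petunia", "Petunia_flower.jpg", "A petunia in bloom"),
    ("Solanaceae", "Petunia_flower.jpg", "Petunia, a member of the family"),
    ("Bowl", "Bowl.jpg", "A ceramic bowl"),
]
result = run_job(records)
```

Because the pairs arrive sorted by image name, the reduce step never has to look anything up by key. It only reads forward, which is what replaces the indexed table access.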

I should also mention that Disco took some significant hacking before it became useful, so I can't really say it's a mature solution.

New image processing system using map-reduce.

As far as the complexity of Zemanta's system goes, it hasn't gone down either. The part of the code that was directly affected by this change went from around 560 lines of Python to a bit over 1100. On the other hand, the boxes in the graph above (representing individual map-reduce jobs) are now much more separated from each other. I guess only time will tell if this will be easier or harder to maintain. One thing is certain: the development cycle has become much faster.

Finally, the time improvement of an order of magnitude starts to look way less impressive when you take into account that the old system used one CPU on one machine while the new one takes two machines with 12 CPUs each. But I guess that doesn't matter if you have processors sitting idle in the rack.
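A rough back-of-the-envelope check of that comparison, as a sketch only: it rounds "a little over 2 hours" down to 2 and assumes all 24 CPUs are busy for the whole run, which is an upper bound rather than a measurement.

```python
def cpu_hours(wall_hours, cpus):
    """CPU time consumed if every CPU is busy for the whole run."""
    return wall_hours * cpus


old_wall, old_cpus = 30.0, 1      # old system: one CPU on one machine
new_wall, new_cpus = 2.0, 2 * 12  # new system: two machines, 12 CPUs each

speedup = old_wall / new_wall                # wall-clock improvement
old_cost = cpu_hours(old_wall, old_cpus)     # 30 CPU-hours
new_cost = cpu_hours(new_wall, new_cpus)     # at most 48 CPU-hours
```

Under these assumptions the wall-clock time drops by a factor of about 15, while the total CPU time spent may actually go up.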

Posted by Tomaž | Categories: Code | Comments »

The faulty default

26.04.2010 12:34

In case anyone else is trying to use the mlmmj mailing list manager in a procmail rule: you have to unset the DEFAULT environment variable in your procmailrc before calling mlmmj-recieve. For example, like this:

DEFAULT

:0w
|/usr/bin/mlmmj-recieve -F -L $MLMMJ_SPOOL_DIR/$LOCAL_PART

The issue is that DEFAULT is set automatically by procmail and contains the path to the default mailbox. On the other hand, the same environment variable is for some reason used by mlmmj as an override for the LOCAL_PART_SUFFIX or Delivered-To: header.

In the most well respected tradition of broken software, neither the procmail nor mlmmj behavior is documented. In fact the procmail man page specifically says that DEFAULT is NOT SET automatically when -m option is used. And the reason for the fault is in no way apparent from error messages from mlmmj, which send you on a wild goose chase through the system.

Oh, and by the way, the binary really is called mlmmj-recieve, not mlmmj-receive.

Posted by Tomaž | Categories: Code | Comments »

Adrenalin rush at 30 km/h

24.04.2010 21:34

Today I attended a full day safe driving course at AMZS' (Automobile Association of Slovenia) new center near Vransko.

I don't have any particular fascination with cars (Ok, I have been known to forget that). But I did come to realize that owning and driving one is basically a must at this time and in this country. So when I do drive my car I try to drive safely and economically (i.e. usually somewhere between "slow" and "get-out-of-my-way-you-fucking-moron" by regional standards).

I wanted to attend the course because I wished to know if my perception of safe driving was valid and because I wanted to experience first hand how a car behaves when you push it outside of its normal operation.

AMZS safe driving centre Vransko

I can now say that all my questions and doubts are very well answered and cleared out. You get a brief theory lesson that explains the basic concepts like over- and understeer followed by almost 6 hours of (exhausting) practice behind the wheel of your own car. They have slippery surfaces and machinery that simulate dangerous circumstances at low speeds, so you can try out defensive maneuvers in relative safety (relative considering the amount of car-inflicted damage on the buildings and fences around the track, as pointed out by the instructor).

Still, at least for me, it was quite an adrenalin rush to try to recover control in a car that's trying to spin out of control. It gives you an appreciation of how hard it is to react correctly in an actual emergency. In one exercise, for example, even though I knew approximately when a hydraulic plate would violently push my car into a drift, I was still shocked every time it happened. You can't think and there's no time for rational thought. It was only after a number of tries that my hands just started doing the right things by themselves.

I would certainly recommend this course to anyone who spends any time behind the wheel. I learned things that I was surprised I didn't hear when getting a license. I should also mention that I was pleasantly surprised by the friendly and professional attitude of the staff - until today I was under the impression that professional drivers have an even worse attitude towards amateurs than IT professionals.

Posted by Tomaž | Categories: Life | Comments »

Year of the Linux Desktop

18.04.2010 16:56

Seriously, I was surprised today to see this list of minimum system requirements on a "Language Leader" CD my mum got in an English language class. I was just about to say that there's no chance this is going to work on her Ubuntu laptop.

LanguageLeader CD

Redhat, Mandrake and Debian distributions are mentioned, although only Mandrake has a version number (and there's a funny overload of "GNU" around Debian).

Of course, things didn't work as expected when we inserted the CD into her computer. Ubuntu is smart these days and tries to start programs under Wine if it sees a CD with a Windows autorun.inf file on it. Obviously, this isn't the right way to go in this case (and the installer failed to start anyway due to the "Cannot find the autorun program" bug). Is there a multiplatform way of specifying software that should automatically start after inserting media?

There's a LanguageLeader_Linux ELF executable in the root directory which promptly dies with a segmentation fault if started.

Further digging around the CD revealed a main.swf Adobe Flash file. It turns out this one will happily run in the Flash Firefox plug-in, sound and all. Problem solved.

So, it's a nice surprise, but obviously the user experience is lacking compared to other supported systems. The CD has a copyright date in 2008 and by coincidence I was doing this on Ubuntu 8.10 which was also released in 2008. I wonder how much testing, if any, this product received on Linux.

Posted by Tomaž | Categories: Code | Comments »

Facebook is not Vegas

12.04.2010 21:40

Dnevnik, one of the Slovenian daily newspapers, published a parent's guide to Facebook this Saturday. The second guideline says that parents should never ask their children about the stuff they post on social networks. "What happens on Facebook, stays on Facebook."

Although the entire article is written in a slightly satirical tone, I find this logic deeply flawed.

Activities on such websites are basically public knowledge, no matter what their owners try to tell you. Kids should learn that messing around on Facebook can have consequences in real life. What better way to learn that than from a parent asking uncomfortable questions? "Yes, everyone can see that, and everyone includes your parents." And even if they obey the rule of separating virtual and real lives, sooner or later someone won't. That teaches common sense: if you want privacy, don't do it on a social networking site!

That said, I do believe in the gist of xkcd #137, it's just that you should be doing bold things out of conviction, not ignorance.

Posted by Tomaž | Categories: Life | Comments »

On ATI Radeon documentation

10.04.2010 17:12

Two years ago AMD released documentation for Radeon graphics cards from ATI. And everyone was happy and said how this would finally bring good, open source drivers for that hardware.

When I was buying a new desktop computer a month ago, that was the deciding factor in my choice of graphics card. I was tired of the complete lack of run-time configuration support (XRandr, etc.) and the really bad compositing performance of the binary Nvidia drivers. And Intel graphics, which works like a charm on my EeePC, turned out to be nearly impossible to get for a desktop computer. So I went with the cheapest, oldest ATI card I could get - a fan-less Radeon HD 4350 - thinking that it would have the best chance of working properly on Linux.

To be honest, it does work relatively well with the latest radeonhd driver from the git repository. There is some screen flicker when moving windows and scrolling in the browser, but nothing I couldn't live with. Even 3D acceleration is working (Ok, kind-of. Some xscreensaver modules have the habit of Oops-ing the kernel).

A bigger problem turned out to be non-working HDMI audio (I want to also use this computer for watching movies on the TV). So I thought, hardware documentation is out there, driver source is available, I can fix this. No problem, right?

Well, not really. First, the docs released by AMD aren't as good as all those news articles led you to believe. I couldn't find any reference to the registers controlling HDMI audio in there and the driver code is full of comments warning about how the function of this and that register was reverse engineered. For instance support for audio on RV710 and RV730 was only possible once it was discovered that a previously unknown register had to be set.

In the end I did in fact find the cause of the problem and made a workaround in the code (see bug #27575), but the bad feeling that I would probably be better off with some other graphics card remained.

Posted by Tomaž | Categories: Code | Comments »

Thyristor regulator AC model

08.04.2010 12:40

It turns out that having a thyristor voltage regulator in a feedback loop isn't as simple as it seems. Actually, it's quite hard to keep the regulation stable over a wide range of load conditions and prevent it from oscillating. At least my previous experiments proved somewhat hard to adapt to work reliably in a pre-regulator regime where the load current can go from 0 to 2 amperes.

The stability problem is usually attacked by studying the circuit from the perspective of various well-known loop stability criteria, like phase and gain margin. However, this circuit is as non-linear as they get. Therefore Bode diagrams and such are out of the question without some major simplifications.

Basic thyristor preregulator diagram

Here's what the output voltage (thick) looks like when the reference voltage (thin) swings around a preset DC value.

Time diagram of input and output voltages versus time

Obviously, the output contains frequencies not found in the input, hence the claim that this is a non-linear circuit. The output looks like this because the output voltage

  • can only increase once every 10 ms (1 over 100 Hz, two times the line frequency), when the thyristors can fire, and
  • can only decrease due to load current discharging the capacitors.

This circuit behaves more like a switch-mode power supply than a linear circuit. Unfortunately, the usual state-space averaging technique doesn't work here.

However, it does seem possible to approximate the gain and phase diagrams for a single input frequency with a simple one-pole transfer function:

A(j\omega) = \frac{1}{1 + j\frac{\omega}{\omega_p}}
\omega_p = \frac{2 \cdot I_{out}}{U_{in} \cdot C}

Where C is the output capacitance, I_{out} the output current and U_{in} the amplitude of the AC component of the reference voltage. Of course, this holds only for input frequencies significantly lower than the line frequency (say 10 Hz and lower).

This appears to be a good enough approximation to use in practice when applying phase and gain margin criteria. Of course, it only allows rough approximations (think nearest order-of-magnitude, not nearest third-decimal). But it does show how stability relates to current and capacitance. It also shows that the frequency of the pole depends on the amplitude, so a circuit that's otherwise stable may go into oscillation if it receives a large enough disturbance from its set point - something I've also seen in practice.
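As a quick way to play with the model, here is a small Python sketch that evaluates the one-pole approximation above. The function names and the component values are made up for illustration, and ω_p is taken to be an angular frequency in rad/s, as the notation A(jω) suggests.

```python
import cmath
import math


def pole_frequency(i_out, u_in, c):
    """omega_p = 2 * I_out / (U_in * C), taken as an angular frequency in rad/s."""
    return 2.0 * i_out / (u_in * c)


def response(freq_hz, i_out, u_in, c):
    """Gain in dB and phase in degrees of the one-pole model at freq_hz."""
    omega = 2.0 * math.pi * freq_hz
    a = 1.0 / (1.0 + 1j * omega / pole_frequency(i_out, u_in, c))
    return 20.0 * math.log10(abs(a)), math.degrees(cmath.phase(a))


# Hypothetical example values, not taken from the actual circuit.
C = 4700e-6    # output capacitance in farads
I_OUT = 0.1    # load current in amperes
U_IN = 1.0     # amplitude of the AC reference component in volts

omega_p = pole_frequency(I_OUT, U_IN, C)
gain_db, phase_deg = response(omega_p / (2.0 * math.pi), I_OUT, U_IN, C)
```

At the pole frequency the model gives the usual -3 dB and -45 degrees. Doubling U_IN halves omega_p, which is the amplitude dependence described above.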

In conclusion, here are a couple of simulated gain and phase diagrams compared with the approximation given above. The dots are the result of a time-domain simulation of the circuit, while the lines represent the transfer function of the model above.

As you can see, the simulated gain follows the model quite nicely. On the other hand, the phase plot is a bit off the mark. However, the error is usually on the side of safety, since the simulated phase shift is less than the one predicted by the model.

Bode gain plot of a thyristor regulator

Bode phase plot of a thyristor regulator

Posted by Tomaž | Categories: Analog | Comments »