Slices and dices

31.08.2010 19:33

Somehow this little corner of Python successfully evaded my discovery until now. Did you know there's a slice object that describes how an array is sliced? It can be passed in most places where a start:stop construct would go.

For instance:

>>> b = range(10)
>>> b[2:4]
[2, 3]
>>> a = slice(2,4)
>>> b[a]
[2, 3]

But more importantly, also:

>>> from operator import itemgetter
>>> f = itemgetter(slice(2,4))
>>> f(range(10))
[2, 3]

Which somewhat expands the usefulness of itemgetter in combination with itertools functions for efficient inner loops.

Interestingly, start:stop doesn't actually construct a slice object, so this:

>>> a = 2:4

is a syntax error. Funny. As far as I know this is the only place where Python's everything-is-an-object philosophy breaks. I would expect this to create an object just like a comma transparently generates a tuple.

Posted by Tomaž | Categories: Code | Comments »

Lucid Lynx automatic updates

29.08.2010 11:22

In case anyone else will find this helpful: my mum's laptop stopped receiving automatic update notifications after upgrade to Ubuntu 10.4. After digging through the lengthy pipeline that sits between an updated package available on an Ubuntu mirror and a window popping up on the screen the fix was trivial: somewhere during the upgrade process the /etc/cron.daily/apt script lost its execute permission bit. So to fix only run:

$ sudo chmod 755 /etc/cron.daily/apt

Oh, and there's a bug open on Launchpad, of course: #390319. However it's quite impossible to find it unless you know exactly what you are looking for. Search for "automatic updates not working" or something similar just gives you hundreds of forum threads full of clueless people blindly guessing at solutions to this and similar problems.

Posted by Tomaž | Categories: Code | Comments »

New Nelma release

27.07.2010 12:27

This is a kind of a late announcement: A few weeks ago I released version 3.2 of Nelma, the numerical electromagnetics package. Changelog isn't very impressive, with the most significant change being a patch that fixes compilation on libpng 1.4.x (contributed by Thomas Klausner of the NetBSD's pkgsrc team).

To put some light in this almost forgotten project, here's a brief description:

Nelma is a command line tool for numerically calculating various electrical properties of printed circuit boards.

It is currently capable of calculating capacitances between objects - nets on a PCB. It returns a spice-compatible description of an equivalent circuit of stray capacitances that can be for example used for more accurate circuit simulation. Alternatively it can also produce field data that can be plotted for example with Gnuplot.

There's an export plugin available for gEDA PCB software that automatically creates a Nelma configuration from a PCB layout.

Nelma is available under the GNU General Public License version 2.

I admit I haven't really used this since I left the Faculty, but back then it did help me with a couple of projects. The emphasis was mostly on the simplicity of the algorithm, so I stuck with the relatively inefficient finite difference method (ouch, O(n3) complexity) instead of the usual finite elements. Right now it could really use some love in form of algorithm parallelization or conversion to finite elements method. Any takers for a summer project?

Posted by Tomaž | Categories: Code | Comments »

Ampersand Ay Em Pee Semicolon

25.07.2010 14:10

Most people don't seem to get content encodings. The most popular way of working with them seems to be randomly throwing encode() and decode() functions at the problem until it compiles/runs/displays in the browser.

I found the most recent example of such practice the other day at work. An ampersand is a pretty common character in URLs since it's used as a parameter separator in GET requests. One common place to find URLs is in RSS and Atom feeds. These two formats are dialects of XML, where ampersand is a special character, denoting a character reference or an external entity. Hence it must be escaped. See the problem?

It turns out a surprising amount of RSS and Atom feeds on the web have URLs that contain obviously incorrectly escaped ampersands. The most common mistake looks like this:

...
<link>http://www.example.com/?a=1&amp;amp;b=2</link>
...

This is valid XML, but the URL is:

http://www.example.com/?a=1&amp;b=2

when it obviously should have been:

http://www.example.com/?a=1&b=2

But since those GET parameters that get garbled are in most cases used for tracking people around, nobody ever notices that because pages still display fine in the browser. The corollary is that nobody ever looks at the results of that tracking or somebody would surely notice the missing traffic coming from RSS feeds.

So, what is the correct way of dealing with these URLs? In theory, a valid URL could actually contain the sequence of characters "&amp;". So there is no way to know for sure if this is an error on the part of the XML author or not.

In practice, the solution is to just call decode() on the URL until it eats up all &amps;. It seems to work every time.

But are there sites out there that legitimately use such URLs? If there are, they seem to be ignored by Google. "inurl:&amp;" query returns only sites that have an ampersand look-alike character in urls (U+FF06, "fullwidth ampersand"). So the query obviously runs in Google's index, but pages with such URLs are either missing or filtered out.

As a side note, the URL http://en.wikipedia.org/wiki/%26amp%3B actually always resolves to a server error page.

Posted by Tomaž | Categories: Code | Comments »

Can't do that Dave

21.07.2010 9:36

Recently Debian desktops (and probably other distributions) got smart and won't allow you, a regular user, to shutdown the system if they think other users are logged in. The warning says System policy prevents stopping the system when other users are logged in.

I'm sure this is a most welcome addition to those 0.1% of installations that are actually used by multiple (human) users and allow shutting down the whole system by non-administrators. For the rest of us, this is just an irritation that pops up whenever you want to turn off the laptop and there's a root console or a sandboxed application running somewhere in the background.

It turns out this can be fixed in a rather trivial way. This post on Arch Linux Forums showed the way.

Drop a file named multiple-users.pkla into /etc/polkit-1/localauthority/50-local.d (not under /var/lib as suggested in the forum, as that is under package manager's control) with the following content:

[shutdown privs]
Identity=unix-user:username
Action=org.freedesktop.consolekit.system.restart-multiple-users;org.freedesktop.consolekit.system.stop-multiple-users
ResultAny=no
ResultInactive=no
ResultActive=yes

Of course add your user name above.

What this does is grant the restart and shutdown privileges to processes running under your user name while you have an "active local session". Not sure how they define what a local session is. Maybe an utmp record for a local terminal. See man pklocalauthority(8).

Posted by Tomaž | Categories: Code | Comments »

Partition alignment tester

01.07.2010 19:52

Lenovo Thinkpad T61 I use at work has abysmal IO performance. The whole system becomes unusable if some process starts moving gigabytes of data around the disk (not so uncommon when dealing with, say, Wikipedia dumps).

This is also noticeable at boot: 75 seconds from power-on to login prompt, then 45 seconds to get to a usable GNOME desktop and a further half a minute to start Firefox. All the time the hard drive disk LED is constantly lit up (which always reminds me of how Windows 95 used to boot). This is on stock Debian Squeeze (but was the pretty much same on Lenny)

Since colleagues running Windows on the same hardware don't complain about such problems I thought that maybe the problem was in the way I partitioned the disk. See, the computer came preinstalled with Vista, which I promptly deleted. I also repartitioned the disk while installing Debian. If the disk is one of those with 4 KB physical block size perhaps the new partitions I made are misaligned?

This laptop is now a bit over 2 years old and as far as I know 4 KB blocks weren't used back then (hence my ignorant repartitioning). But perhaps Lenovo was a little ahead of time?

Unfortunately it's impossible to reliably tell the physical block size from the information reported by the disks' firmware (dirty details here). For instance, WD EARS disk in my Desktop plainly reports 512 B physical sectors even though I'm certain it uses 4 KB ones.

So I made a little tool that reads or writes 4 KB blocks from a partition with different alignments and measures the throughput. On a disk where logical and physical block size differ, throughput should be significantly lower on unaligned accesses because the disk must do multiple physical accesses per single logical operation.

Here are the measurements on the Thinkpad:

Writing to /dev/raw/raw1
Size 514805760 bytes, reported FS block size 0
10 processes, each writing 100 * 4096 bytes
align	elapsed	throughput
0	3 s	1.3 MB/s
1	3 s	1.3 MB/s
2	3 s	1.3 MB/s
3	4 s	1.0 MB/s
4	3 s	1.3 MB/s
5	3 s	1.3 MB/s
6	3 s	1.3 MB/s
7	3 s	1.3 MB/s

And here on my desktop with WD EARS drive:

Writing to /dev/raw/raw1
Size 514805760 bytes, reported FS block size 0
10 processes, each writing 100 * 4096 bytes
align	elapsed	throughput
0	3 s	1.3 MB/s
1	35 s	0.1 MB/s
2	37 s	0.1 MB/s
3	38 s	0.1 MB/s
4	39 s	0.1 MB/s
5	38 s	0.1 MB/s
6	38 s	0.1 MB/s
7	39 s	0.1 MB/s

As you can see Thinkpad has the same throughput regardless of the alignment while unaligned accesses get a huge penalty on my desktop. Head seek time dominates in these measurements, hence the relatively low maximum throughput on both machines.

I tried to cancel the effect of any caches the best I knew in my code but with all the magic that sits between a write() call and the physical platter on a modern system it's impossible to be sure. Still I think this is a pretty strong indicator that partition alignment isn't the cause of the performance problems here. And it's also a pleasant confirmation that I did properly set up partitions on my desktop (Ted has a really good guide how to do that - just ignore the fact that he is talking about SSDs)

You can get the alignment tester code from git:

git clone http://chandra.tablix.org/~avian/git/alignment_tester.git

Make sure you read the source and understand what it does before running. Also you will need a scratch partition because tests are destructive (I used the swap partition on both machines for that). Oh, and do a full backup first.

Posted by Tomaž | Categories: Code | Comments »

Another mlmmj weirdness

22.06.2010 19:47

Here's another mlmmj weirdness. If you're running this mailing list manager on Debian Lenny (package version 1.2.15-1.1), be advised that this package is missing an English template file maxmailsize, which is used when sending an automated warning to a user if the mail he sent was too large.

This causes messages to get stuck in the mail spool and produces syslog messages like below (love the assertion failed: Success part):

/usr/bin/mlmmj-process[17335]: prepstdreply.c:173: Could not open std mail maxmailsize: No such file or directory
/usr/bin/mlmmj-process[17335]: mlmmj-process.c:656: assertion failed: Success

The simplest fix is to grab an official source distribution tarball and drop mlmmj-1.2.15/listtexts/en/maxmailsize into the text/ subdirectory of each list's spool directory.

Also, it appears this is fixed in Squeeze, so I didn't even bother to open a bug report.

Posted by Tomaž | Categories: Code | Comments »

Did F-Spot eat your photos?

11.06.2010 10:32

F-Spot has this weird feature. Sometimes when you import a photograph it doesn't copy the actual file to its library but only makes a reference to the file's current location in its database. And yes, I know about the "Copy files" checkbox in the Import dialog.

I don't know about you, but the library is the most important reason I use F-Spot instead of my old unwieldy sort-files-into-directories-manually system.

I have yet to discover when this happens, but when it does, it is easy to miss it. The thumbnail for the image appears normally and there is no indication in the GUI of where the image is actually stored. It's only when you open the image in full view and the original files have been moved (or worse, deleted) that the first sign of trouble appears.

On two occasions now I have stumbled upon old images missing from the library. On one of them I was able to recover some of the lost images from a wiped SD card using some tool which name I can't remember now (possibly either recoverjpeg or PhotoRec).

Here's how to check if any photo in F-Spot's database is located outside of the library:

$ sqlite3 $HOME/.config/f-spot/photos.db
sqlite> select * from photos where base_uri not like "file:///home/avian/images/Library%";
502|1213290088|file:///home/avian/Desktop/|dsc00669.jpg||2|1|0|
503|1213116525|file:///home/avian/Desktop/|dsc00668.jpg||2|1|0|
504|1213360534|file:///home/avian/Desktop/|dsc00670.jpg||2|1|0|

Of course, replace the path in the query with the path where your photo library lives.

If the SQL query returns any rows (like in the real-life example above) all you can do is curse the F-Spot developers and hope those images are still in Recycle bin or non-reused portion of some memory card.

Posted by Tomaž | Categories: Code | Comments »

Awesome suggestions

27.05.2010 22:28

Every once in a while we have a hack day at Zemanta. It's supposed to be a day when you use the office hours to explore any interesting ideas that are always pushed back by more important work. At least in my case, the most interesting ideas usually pop up just after the hack day.

This is also how this little hack came to life. My thinking went into this direction:

Zemanta's in-text link suggestions appear fine and fancy at first, but over time two problems became apparent. First, I myself rarely find the suggested explanatory links useful. It is reasonable to presume that the readers of my posts are also familiar enough with the topic that they find links to such basic explanations as Wikipedia redundant. And secondly, I often manually include in-text links that are much more related to the topic (previous posts on this topic, tickets in bug trackers, other blogs, etc.) Zemanta uses the "Related articles" list on the bottom of the post for such links, but I prefer to include them in-line. This means they can easily get overlooked if the text is flooded with Zemanta's automatic explanatory in-text links.

Now what if there was a way to automatically suggest those in-text links that I currently include manually and save me some copy-and-paste between the browser's location bar and the text editor? Destination URLs of all those links are already in the browser, because I either have them bookmarked or I recently read them when researching the topic of the blog post.

And the best thing is the Firefox browser already has a wonderful suggestion engine out of the box. It's just hard to recognize because by default it's limited to the location text entry field.

Yes, I'm talking about the so-called AwesomeBar.

So I made a tiny Firefox extension that exports the nsIAutoCompleteSearch interface of the Places system through a socket, so that any program on the same computer can access it. Then I wrote a client in the form of a Vim plug-in that throws at it pieces of text you're writing and displays the suggested links for you to choose. Yes, I know Vim was perhaps a bad choice, but this is what I use for blogging and I know how to use its scripting interface.

The plug-in adds a new keybinding (F10 by default). It invokes a function that takes the last typed phrase and runs it through the AwesomeBar magic. Here's how it looks in practice:

Awesomeggest demonstration, pressing F10

Press F10...

Awesomeggest demonstration, choose a web page from the list

choose a web page from the list...

Awesomeggest demonstration, HTML link is inserted

and a HTML link is inserted into the text.

Javascript isn't exactly my cup of tea, so the code is a bit clunky and sometimes the extension just refuses to work for no apparent reason. In any case, see the source for more technical details. There's really not much of it.

The code is available in a git repository under the GPLv2 for your enjoyment. All the usual warnings apply - this might eat your cookies and it certainly exposes your browsing history to every process on the host and possibly the entire internet.

$ git clone http://www.tablix.org/~avian/git/awesomeggest.git

I've been using this for a while now and it's useful to some degree. For one it's pretty good in practice at finding the correct web page, if only the anchor phase appears in the page URL or title (which seems to most of the times). The slowness of the search function in Firefox makes it a little less convenient to use and the Vim interface is pretty bad. An asynchronous search (like in the Firefox URL bar) would make performance problems much less noticeable, but is as far as I see beyond the capability of Vim scripting. Perhaps there is some brave soul out there that will prove how much better some other editor can be at this?

As always, any comments or patches are most welcome.

Posted by Tomaž | Categories: Code | Comments »

A new kind of mess

29.04.2010 15:29

One of the features of Zemanta API is image suggestion. For example, if you are writing an article about bowls of petunias, Zemanta will suggest nice photographs of that particular kind of plant life to go with your post.

A part of those suggested images comes from English Wikipedia and Wikimedia Commons. And since computer vision just isn't there yet, Zemanta's back-end can only learn about the content of the images from text and various other machine-readable data that is related to that image.

The problem is that while usable data is relatively abundant, it is scattered around English and Commons wikis. It's present directly or in various more or less complicated templates that appear on image description pages. Articles that include an image and captions they use also hold clues to the image content. Sometimes it's even necessary to go through 5 of more jumps between you can connect a useful piece of metadata, like for example between an article including a "Wikimedia Commons has more media related to" box and the actual picture.

Previously, this data extraction was performed in a traditional way: a series of Python scripts read the dumps provided by Wikimedia and stored the information in various MySQL tables. In it's last incarnation, this system took around 30 hours to process both wikis. This might not appear much, but in every new dump something inevitably breaks. Perhaps it's a critical template that has been renamed or a minor markup change exposed an odd bug in a Python script. This meant that often two or three sessions were required and a new dump quickly consumed a week worth of work.

Old image processing system using Python scripts and MySQL

Old image processing system using Python scripts and MySQL tables.

Two weeks ago I replaced this monster with a new one built upon the map-reduce paradigm that's all the rage these days. It performs the job in a little over 2 hours and uses Disco framework. This basically trades indexed MySQL table access (lots of expensive hard-disk seeks) for multiple passes over sorted flat files. These use sequential reads and are thus much faster. In practice however, there's still a lot of disk seeking going on, because Disco will sort these huge files on-disk (actually using GNU sort behind the curtains). But obviously the performance improvement is still significant.

I should also mention that Disco took some significant hacking before it became useful, so I can't really say it's a mature solution.

New image processing system using map-reduce

New image processing system using map-reduce.

As far as complexity of Zemanta's system goes, it hasn't gone down either. The part of the code that was directly affected by this change went from around 560 lines of Python to a bit over 1100. On the other hand the boxes in the graph above (representing individual map-reduce jobs) are now much more separated from each other. I guess only time will tell if this will be easier of harder to maintain. One thing is certain: the development cycle has become much faster.

Finally, the time improvement of an order of magnitude starts to look way less impressive when you take into account that the old system used one CPU on one machine while the new one takes two machines with 12 CPUs each. But I guess that doesn't matter if you have processors sitting idle in the rack.

Posted by Tomaž | Categories: Code | Comments »

The faulty default

26.04.2010 12:34

In case anyone else is trying to use mlmmj mailing list manager in a procmail rule. You have to unset the DEFAULT environment variable in your procmailrc before calling mlmmj-recieve. For example, like this:

DEFAULT

:0w
|/usr/bin/mlmmj-recieve -F -L $MLMMJ_SPOOL_DIR/$LOCAL_PART

The idea is that DEFAULT is set automatically by procmail and contains a path to the default mailbox. On the other hand the same environment is for some reason used by mlmmj as an override for the LOCAL_PART_SUFFIX or Delivered-To: header.

In the most well respected tradition of broken software, neither the procmail nor mlmmj behavior is documented. In fact the procmail man page specifically says that DEFAULT is NOT SET automatically when -m option is used. And the reason for the fault is in no way apparent from error messages from mlmmj, which send you on a wild goose chase through the system.

Oh, and by the way, the binary really is called mlmmj-recieve, not mlmmj-receive.

Posted by Tomaž | Categories: Code | Comments »

Year of the Linux Desktop

18.04.2010 16:56

Seriously, I was surprised today to see this list of minimum system requirements on a "Language Leader" CD my mum got in an English language class. I was just about to say that there's no chance this is going to work on her Ubuntu laptop.

LanguageLeader CD

Redhat, Mandrake and Debian distributions are mentioned, although only Mandrake has a version number (and there's a funny overload of "GNU" around Debian).

Of course, things didn't work as expected when we inserted the CD into her computer. Ubuntu is smart these days and tries to start programs under Wine if it sees a CD with a Windows autorun.inf file on it. Obviously, this isn't the right way to go in this case (and the installer failed to start anyway due to the "Cannot find the autorun program" bug). Is there a multiplatform way of specifying software that should automatically start after inserting media?

There's a LanguageLeader_Linux ELF executable in the root directory which promptly dies with a segmentation fault if started.

Further digging around the CD revealed a main.swf Adobe Flash file. It turns out this one will happily run in the Flash Firefox plug-in, sound and all. Problem solved.

So, it's a nice surprise, but obviously the user experience is lacking compared to other supported systems. The CD has a copyright date in 2008 and by coincidence I was doing this on Ubuntu 8.10 which was also released in 2008. I wonder how much testing, if any, this product received on Linux.

Posted by Tomaž | Categories: Code | Comments »

On ATI Radeon documentation

10.04.2010 17:12

Two years ago AMD released documentation for Radeon graphics cards from ATI. And everyone was happy and said how this would finally bring good, open source drivers for that hardware.

When I was buying a new desktop computer a month ago that was the deciding factor on my choice of the graphic card. I was tired of complete lack of run-time configuration support (XRandr, etc.) and really bad compositing performance of binary Nvidia drivers. And Intel graphics, which works like a charm on my EeePC, turned out to be nearly impossible to get for a desktop computer. So I went with the cheapest, oldest ATI card I could get - a fan-less Radeon HD 4350 - thinking that that would have the best chance of working properly on Linux.

To be honest, it does work relatively well with the latest radeonhd driver from the git repository. There is some screen flicker when moving windows and scrolling in the browser, but nothing I couldn't live with. Even 3D acceleration is working (Ok, kind-of. Some xscreensaver modules have the habit of Oops-ing the kernel)

A bigger problem turned out to be non-working HDMI audio (I want to also use this computer for watching movies on the TV). So I thought, hardware documentation is out there, driver source is available, I can fix this. No problem, right?

Well, not really. First, the docs released by AMD aren't as good as all those news articles led you to believe. I couldn't find any reference to the registers controlling HDMI audio in there and the driver code is full of comments warning about how the function of this and that register was reverse engineered. For instance support for audio on RV710 and RV730 was only possible once it was discovered that a previously unknown register had to be set.

In the end I did in fact find the cause of the problem and made a workaround in the code (see bug #27575), but the bad feeling that I would probably be better off with some other graphics card remained.

Posted by Tomaž | Categories: Code | Comments »

Streaming gzip decompression in Python

24.03.2010 17:59

Python has a handy module called gzip that can be used to read gzip-compressed files just like if they were uncompressed. However, it has one nasty flaw - it needs support for the seek() method in file objects it reads, which isn't available for example in network streams.

Since the gzip command-line utility can decompress from a pipe, Python's requirement for seekable streams obviously isn't a requirement of the file format.

I hit upon this limitation when trying to implement a map reader function for compressed files for Disco map-reduce framework I'm experimenting with at work. It appears I'm not alone with this problem.

I nearly implemented a fix myself, when I found not one, but two patches gathering dust in the official Python's Bugzilla that solve this issue: 914340 and 1675951.

I tried the second one and it works beautifully. The description also claims that it improves read performance by 20%, although I can't confirm that.

This was submitted back in 2007 and as far as I see fits the description of a patch any free software maintainer should be most happy to accept. I can only guess why 3 years later this is still not in the official Python builds.

Posted by Tomaž | Categories: Code | Comments »

Decoding IEEE 754

19.03.2010 13:20

There's a discussion going on today on Zemanta's staff mailing list about floating point formats and precision. I thought the following C code might also be useful to anyone trying to understand how floating point values are handled in modern hardware.

This manually decodes and prints parts of the floating point number stored in C's double type. Of course, it assumes that hardware uses the IEEE 754 "double" for that. That should be true at least on all descendants of the Intel i386 architecture.

#include <stdint.h>
#include <stdio.h>

int main() {
	double a = 0.1f;
	uint8_t* ac = (uint8_t*) &a;

	int n;
	for(n=7;n>=0;n--) printf("%02x ", ac[n]);
	printf("\n\n");

	int64_t e, m, s;

	m = 1;		// 1 (implied bit)
	m<<=4;
	m += ac[6] & ((1<<4)-1);	// 5
	m<<=8;
	m += ac[5];	// 13
	m<<=8;
	m += ac[4];	// 21
	m<<=8;
	m += ac[3];	// 29
	m<<=8;
	m += ac[2];	// 37
	m<<=8;
	m += ac[1];	// 45
	m<<=8;
	m += ac[0];	// 53

	printf("mantissa = %lld * 2^(-52)\n", m);

	e = ac[7] & ((1<<7)-1);	// 7
	e<<=4;
	e += ac[6]>>4;	// 11

	e -= 1023;	// exponent bias

	printf("exponent = %lld\n", e);

	s = ac[7]>>7;

	printf("sign bit = %lld\n\n", s);

	printf("%f ~= (-1)^%lld * %lld * 2^(%lld)\n", a, s, m, e - 52);

	return 0;
}

Adjust the value of a to taste. By default it will print:

3f b9 99 99 a0 00 00 00 

mantissa = 7205759511166976 * 2^(-52)
exponent = -4
sign bit = 0

0.100000 ~= (-1)^0 * 7205759511166976 * 2^(-56)

By the way, bc, the arbitrary precision calculator, comes very handy when doing calculations that involve limits of floating point formats.

Posted by Tomaž | Categories: Code | Comments »