Steampunk telephony

27.04.2011 19:04

Yesterday I went to the Museum of Post and Telecommunications in Polhov Gradec, which is a part of the national technical museum. With a relatively small exhibition it covers the history of telecommunications from a horse carrying a bag of mail to digital mobile telephony. Certainly the most interesting exhibit there was a working electromechanical telephone exchange with three phones connected to it, so you can actually watch the stepping switches move while dialing a number.

I would say these machines have a similar appeal for electrical engineers as steam engines probably have for mechanical engineers. Everything is exposed and you can inspect their mode of operation simply by observing them with an unaided eye. Even with just basic engineering knowledge you can enjoy deducing interesting details of how they work.

More modern systems are certainly a valuable part of the collection, but their core components are simply much too small, too fast or too complex for this kind of observation. While observing a vintage transistor circuit can be just as fun, I doubt any museum allows you to poke their exhibits with an oscilloscope.

It is no wonder then that people are still building relay-logic devices (or steam locomotives for that matter).

Posted by Tomaž | Categories: Digital | Comments »

Pain station

25.04.2011 21:37

I'm almost ashamed to admit that there is a PlayStation 3 on the desk in front of me. You know, the piece of hardware that lost the official support to run Linux last year and whose manufacturer's master signing keys were revealed earlier this year.

For the moment I won't go into how its 7-core Cell microprocessor can be used for all sorts of useful number crunching. Instead I want to describe my first experience of the system from the standpoint of someone whose last encounter with a game console was probably a friend's NES countless years ago.

If you've seen a modern game up close I guess you already know where this is heading. From the first boot on things did not go as expected. Somehow I still thought of these things as simple devices where you plug in a ROM module (OK, it's a disc now) and you can enjoy a few minutes of playing, without bothering to find proper settings or worrying whether your system is powerful enough to run the latest game.

The 10-something settings sub-menus and the list of system requirements on the bundled game quickly disposed of those ideas.

Seriously, I don't get it. This thing is connected with a digital connection (an HDMI cable I had to buy separately, by the way) to a digital display (an LCD TV also made by Sony). It has all the capabilities to figure out everything it needs to display a picture on the screen. And the first thing it does is to ask me to manually set up the visible area of the screen!? Just to check that this isn't something left over from the analogue days I connected it to another Sony-made HDMI-enabled TV and the visible area was indeed a few percent different. With both TVs claiming to support the same 1080p HDTV resolution! The mind boggles. I won't even go into the details of getting the sound working on one of these TVs, which involved 30 minutes of digging through those damned settings menus.

Also, the bundled game (a well-known racing simulation I'm told) is far from a thing you might pick up in a few spare minutes. First of all, it requires installation and setup of its own. After that you have to earn "money", buy "things", gather "experience", set up a "profile", keep up a "reputation", live a "life". I thought people play games to get away from all those things for a while.

Actually, everything is surprisingly set up in a way to keep people from enjoying a game in company. You get only one controller in the box. You have to create a user account for each person. The fact that the game keeps up your statistics, virtual bank balance and what not probably means that you can't even give someone a try without worrying that he will mess up this or that part of your account.

I really did not see that coming.

Posted by Tomaž | Categories: Life | Comments »

Google Refine tips

23.04.2011 14:15

Google Refine is currently the best free software tool for cleaning up messy data. It's perfect for correcting unescaped HTML strings, catching an odd typo or fetching additional data about entities from Freebase.

We use it extensively at Zemanta to clean up and reconcile customers' datasets before importing them into our proprietary triple store. What used to be a pile of Python scripts with questionable reusability is now a Refine project using a bunch of string transforms and reconciliations with either Freebase or our own reconciliation service.

While the screencasts and tutorials get you started pretty quickly there are quite a few details that took me a while to figure out. I'll list them here in case they might be useful to someone else and for my own reference.

  • When loading data from a CSV file Refine autodetects its encoding. However, it seems to look only at the first few rows. So in the common case where the majority of rows are ASCII, it is quite likely that all the detection code sees is 7-bit ASCII. In that case it falls back to ISO 8859-1 and will garble strings further down that are encoded in, say, UTF-8. Right now it is impossible to force an encoding, but by inserting a properly encoded non-ASCII character in the header row you can give the detection logic something to work on. I usually just prepend "ČŽŠ" to a file and that has worked so far.

    There's also a bug that occasionally causes corruption of non-ASCII characters when reconciling cells. See issue 314.

  • You should use caution with the "Auto-detect value types" option when importing data. Numeric database IDs look like numerical values to Refine and at least in one case Refine helpfully transformed large integer IDs to exponential representation which made them useless.

  • Beware that while there are Boolean operators in GREL (the default language when defining transforms and facets) there is no Boolean type. Boolean operators return a string "true" or "false", both of which are considered as a true value for any subsequent Boolean operators. So "(1 == 1) and (1 == 0)" will return "true". If you are writing a facet, the simplest solution is to do "(1 == 1) + (1 == 0)", which returns "truefalse" and then work with that.

  • Refine has this weird idea of multi-valued cells. If the value in the first cell in a row is blank, other cells are interpreted as additional values for the first row above with the non-blank first cell. The most common way to encounter this is when fetching data from Freebase. If the topic has more than one value for a property you are fetching from Freebase, additional rows will be inserted into your project. This commonly happens with "/type/object/mid" since one topic can have multiple MIDs resulting from merges.

  • There's a really annoying bug in Refine 2.0 that is triggered by reconciling a cell that consists only of stop words ("the", "and" and so on). It creates a state where the project's data will get corrupted on save, making it impossible to open again once you close it in Refine. That's issue 358 in the tracker. There's also a patch there, which I strongly suggest you apply.

  • To find rows without a single reconciliation candidate, use the numeric facet "Best candidate's score", check "Error" and uncheck "Numeric", "Non-numeric" and "Blank".

  • The undo / redo feature is great for browsing through the history. Use caution though, because a single misplaced click on the star or flag in the nearby cell will destroy all undone history. Days of work have been lost that way.

  • When exporting a project to a CSV file, beware that only cells currently shown are exported. Remove all facets and filters if you want to export the complete project.

  • You can't delete a single row. I usually use the flag field to mark the rows I want to delete, then filter for flagged rows and use the "Remove all matching rows" option.
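
The header trick from the first tip can be scripted. This is only a minimal Python sketch, assuming the CSV file is already UTF-8 encoded; the function names are made up for illustration and nothing here is part of Refine itself. Note that the marker becomes part of the first column's name, so that column has to be renamed after import.

```python
from pathlib import Path

# Marker with non-ASCII characters, as suggested in the tip above.
MARKER = "ČŽŠ"


def add_encoding_hint(text, marker=MARKER):
    """Return text with the marker prepended to the header row.

    The text is left unchanged if it already starts with the marker.
    """
    if text.startswith(marker):
        return text
    return marker + text


def add_hint_to_file(src, dst, encoding="utf-8"):
    """Copy src to dst with the marker in front (hypothetical helper).

    Assumes src really is encoded in the given encoding.
    """
    text = Path(src).read_text(encoding=encoding)
    Path(dst).write_text(add_encoding_hint(text), encoding=encoding)
```

For example, `add_hint_to_file("customers.csv", "customers-hinted.csv")` would write a copy whose very first bytes are non-ASCII, so a detector that only samples the start of the file has something to work with. The file names are placeholders.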

To conclude, Refine is a very useful tool and a significant improvement over the old Freebase Gridworks; however, it still has some rough edges. I maintain my own fork which has fixes for some of the bugs I mentioned above and some additional improvements to the user interface specific to Zemanta's use case.

I should also mention that hacking on Refine is really simple. Compilation worked out of the box on my Debian box according to the Developer's guide. For fixing most of the user interface issues you don't even have to recompile the Java back-end - just modify the client-side HTML and/or JavaScript files you can find in the binary distribution.

Posted by Tomaž | Categories: Code | Comments »

Aiming for the center

11.04.2011 21:08

Software that exposes its user interface over the web offers limitless opportunities for monitoring the behavior of people using it. You can literally watch each move of the mouse and each key press. And it's so tempting to gather all this information and try to make sense out of it that some are pushing this idea of data-centered design, where each decision must come from a bump in a graph or two. From my point of view as an engineer (UX is not in my verbal vocabulary) I find that kind of approach shortsighted at best.

You are invariably optimizing for an average user, even if you divide people into neat little numbered bins. Consider an analogy from the physical world: Center of gravity is the mean location of mass in an object. In many objects, like in this ring for instance, it is also devoid of any mass. Optimizing something for the average user can mean you simply make no one happy and everyone equally miserable.

Ring with a center of gravity mark
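
The ring analogy is easy to check numerically. The following is just an illustrative Python sketch, assuming equal point masses spread evenly around a unit circle; the function names are made up for this example.

```python
import math


def ring_points(count, radius=1.0):
    """Return `count` points spread evenly around a circle."""
    return [
        (radius * math.cos(2 * math.pi * k / count),
         radius * math.sin(2 * math.pi * k / count))
        for k in range(count)
    ]


def centroid(points):
    """Mean position of equally weighted points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def distance(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


points = ring_points(360)
center = centroid(points)
# The "average" position sits in the middle of the ring, yet even
# the closest point is still a full radius away from it.
closest = min(distance(p, center) for p in points)
```

Here `center` comes out at the origin (up to rounding error) while `closest` stays at about one radius: the average position describes none of the actual points.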

A/B testing and similar ideas are just hill-climbing optimization algorithms for some variable that ultimately depends on processes in a wet human brain. Such simple approaches already fall short in the physical world, where the governing laws are infinitely less complex. How can they be the main guide for development in user interface design, where a linear law is replaced by a chaotic mess of firing neurons? I don't doubt that at some time in the future we'll be able to understand and model it (and that certainly won't be a two-dimensional graph with vaguely defined axes). Until then it will take a human to understand a human.

Some may argue that at large scales groups of people are surprisingly predictable. My answer is that it's also well known that entropy increases in the universe. That doesn't bring you any closer to designing an internal combustion engine.

I'll stop myself here and won't go into the impossibility of doing controlled experiments with people or the moral problem I see in choosing profit as the universal optimization variable for all design questions. I'll offend enough liberal arts students as it is.

Measurements are important in evaluation, but anything involving a human in the loop is good to one significant figure at best. Modifying things based on a second-decimal increase in some metric instead of going with the gut feeling of what is best just takes everyone a little further from happiness.

Even if you reap mad profits from that bouncing blinking purchase button.

Posted by Tomaž | Categories: Ideas | Comments »

Samsung GT-I5700 disassembly

05.04.2011 22:38

I've already said plenty of bad things about the Android-powered mobile phone I bought nearly a year ago, so I'll refrain from repeating myself. I tore it down the other day to see if anything could be done to fix the highly annoying contact bounce the lock-unlock button developed lately. And also to satisfy my curiosity by poking around this made-for-the-scrapyard device.

Samsung GT-I5700 with the PCB exposed

Removing the back to expose the (single) PCB was pretty straightforward: six screws and some plastic brackets. There's a black GSM/UMTS antenna block attached to the back of the shell (under the keys) and three more similar contacts at the top in the shell itself. It looks like something is embedded into the white plastic, possibly a Wi-Fi, Bluetooth or GPS antenna. Connections to all of these are made merely by gold-plated springs that push against the back side when the phone is put together.

Samsung GT-I5700 PCB turned over

The backside of the PCB contains all the integrated circuits and is connected to everything else via tiny SMD connectors and bits of flexible PCB. The CPU has an RF shield that is basically a conductive rubber ring that presses against the PCB and the aluminum frame. There's also a tiny backup battery on the right side.

Flexible PCB connections in Samsung GT-I5700

The PCB is held in place with a single screw at the top. The bottom hangs freely, giving the phone a distinctive tuning-fork feeling if you knock on it.

I admit I haven't seen the inside of many phones, but this one got surprisingly dusty in less than a year of use (in what I consider a clean environment). The rubber seals around the holes in the casing obviously proved to be insufficient protection against the environment.

In the end, I tested the tiny SMD push button and it didn't seem to behave any differently than the other buttons, so I left it as it is. It's probably a better idea to replace everything else around that button.

Posted by Tomaž | Categories: Digital | Comments »

Reverse DNS lookups and Apache

02.04.2011 13:13

While debugging some unrelated problem (another case of the #349) I noticed that logs from an Apache web server I take care of sometimes contain complete hostnames of clients instead of just IP addresses. That is usually considered bad practice - making reverse DNS requests for every HTTP connection your server gets makes the web site slower and loads DNS servers for no good reason. I decided to look into it.

A tcpdump session quickly confirmed that Apache was doing double reverse lookups even though the HostnameLookups directive was clearly turned off. What was puzzling was that only a part of logged requests contained hostnames (that is, LogFormat "%h" expanded to a hostname instead of an IP). Most, but not all of those were for automatically generated index pages (mod_autoindex) and I could reliably reproduce only some requests.

In the end, the culprits turned out to be .htaccess files that contained "allow from" directives with hostnames instead of IP ranges. Naturally those can only be verified by Apache by doing double reverse lookups. The surprising thing that made this problem harder to find is that with index pages Apache will check the .htaccess file for the current directory and all subdirectories. So if any subdirectory has a restriction based on a hostname, the lookups will also be made when a client requests its parent directory.
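
Hunting for such files by hand gets tedious on a large document tree. Here is a rough Python sketch that walks a directory and lists .htaccess files whose allow / deny rules name hosts rather than addresses. It assumes the Apache 2.2 style "Allow from" / "Deny from" syntax, and the function names are made up for illustration. It only inspects .htaccess files, so rules in the main server configuration would need a separate check.

```python
import ipaddress
import os
import re

# "Allow from ..." / "Deny from ..." as used by mod_authz_host in Apache 2.2.
DIRECTIVE = re.compile(r"^\s*(allow|deny)\s+from\s+(.+)$", re.IGNORECASE)
# Full or partial dotted IPv4 address such as "10.1" or "192.168.".
PARTIAL_IPV4 = re.compile(r"\d{1,3}(\.\d{1,3}){0,3}\.?")


def is_ip_spec(token):
    """True if the token can be checked without a DNS lookup."""
    lowered = token.lower()
    if lowered == "all" or lowered.startswith("env="):
        return True
    if PARTIAL_IPV4.fullmatch(token):
        return True
    try:
        # Accepts single addresses, CIDR and address/netmask notation.
        ipaddress.ip_network(token, strict=False)
        return True
    except ValueError:
        return False


def hostname_rules(text):
    """Return the hostname arguments of allow / deny rules in text."""
    found = []
    for line in text.splitlines():
        match = DIRECTIVE.match(line)
        if match:
            found.extend(t for t in match.group(2).split()
                         if not is_ip_spec(t))
    return found


def scan_tree(root):
    """Map each .htaccess path under root to its hostname-based rules."""
    results = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if ".htaccess" in filenames:
            path = os.path.join(dirpath, ".htaccess")
            with open(path, encoding="utf-8", errors="replace") as handle:
                hosts = hostname_rules(handle.read())
            if hosts:
                results[path] = hosts
    return results
```

Calling `scan_tree` on the document root (wherever that is on your server) returns a dictionary of offending files, including those in subdirectories that would trigger lookups for a parent's index page.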

I also found out that when persistent connections are used, the hostname will be logged for all requests done through that connection. So once a URL requiring a lookup is retrieved by a client, all subsequent requests will have a hostname instead of an IP logged (even though only one DNS lookup was made).
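
To get a feeling for how many logged requests are affected, you can count how often the client field is a hostname rather than an address. This is a small Python sketch that assumes "%h" is the first whitespace-separated field on each line, as in the common and combined log formats; the function names are invented for this example.

```python
import ipaddress
from collections import Counter


def classify_client(field):
    """Return "ip" for an IPv4/IPv6 address, "hostname" otherwise."""
    try:
        ipaddress.ip_address(field)
        return "ip"
    except ValueError:
        return "hostname"


def count_clients(lines):
    """Count log lines by the kind of value in their first field."""
    counts = Counter()
    for line in lines:
        parts = line.split(None, 1)
        if parts:  # skip blank lines
            counts[classify_client(parts[0])] += 1
    return counts
```

Feeding it an open access log, for instance `count_clients(open(path))` with `path` pointing at your own log file, gives a quick ratio of the two kinds of entries.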

By the way, Google turns up this blog post on the topic. It might be a bit outdated, but while investigating I did also look into the Apache 2.2.9 source and I can add that:

  • The only place in the code where double reverse lookups are made is the mod_authz_host module. So if you see double lookups in tcpdump, a hostname-based allow / deny rule is the only possible cause.
  • Using "%h" in LogFormat most certainly does not cause a DNS lookup on its own. Replacing it with "%a" will, however, hide the fact that the server is doing lookups, because that one always expands to the remote IP address, whether the hostname is known to Apache or not.

Posted by Tomaž | Categories: Code | Comments »