Power supply voltage shifts

02.05.2016 20:16

I'm a pretty heavy Munin user. In recent years I've developed a habit of adding a graph or two (or ten) for every service that I maintain. I also tend to monitor as many aspects of computer hardware as I can conveniently write a plugin for. At the latest count, my Munin master tracks a bit over 600 variables (not including a separate instance that monitors 50-odd VESNA sensor nodes deployed by IJS).

Monitoring everything and keeping a long history allows you to notice subtle changes that would otherwise be easy to miss. One of the things that I found interesting is the long-term behavior of power supplies. Pretty much every computer these days comes with software-accessible voltmeters on various power supply rails, so this is easy to do (using lm-sensors, for instance).

Take for example voltage on the +5 V rail of an old 500 watt HKC USP5550 ATX power supply during the last months of its operation:

Voltage on ATX +5 V rail versus time.

From the start, this power supply seemed to have a slight downward trend of around -2 mV/month. Then for some reason the voltage jumped up for around 20 mV, was stable for a while and then sharply dropped and started drifting at around -20 mV/month. At that point I replaced it, fearing that it might soon endanger the machine it was powering.

The slow drift looks like aging of some sort - perhaps a voltage reference or a voltage divider before the error amplifier. Considering that it disappeared after the PSU was changed it seems that it was indeed caused by the PSU and not by a drifting ADC reference on the motherboard or some other artifact in the measurements. Abrupt shifts are harder to explain. As far as I can see, nothing important happened at those times. An application note from Linear mentions that leakage currents due to dirt and residues on the PCB can cause output voltage shifts.

It's also interesting that the +12 V rail on the same power supply showed a bit different pattern. The last voltage drop is not apparent there, so whatever caused the drop on the +5 V line seemed to have happened after the point where regulation circuit measures the voltage. The +12 V line isn't separately regulated in this device, so if the regulation circuit would be involved, some change should have been apparent on +12 V as well.

Perhaps it was just a bad solder joint somewhere down the line or oxidation building up on connectors. At 10 A, a 50 mV step only corresponds to around 5 mΩ change in resistance.

Voltage on ATX +12 V rail versus time.

This sort of voltage jumps seem to be quite common though. For instance, here is another one I recently recorded on a 5 V, 2.5 A external power supply that came with CubieTruck. Again, as far as I can tell, there were no external reasons (for instance, power supply current shows no similar change at that time).

Voltage on CubieTruck power supply versus time.

I have the offending HKC power supply opened up on my bench at the moment and nothing looks obviously out of place except copious amounts of dust. While it would be interesting to know what the exact reasons were behind these voltage changes, I don't think I'll bother looking any deeper into this.

Posted by Tomaž | Categories: Analog | Comments »

Measuring interrupt response times, part 2

27.04.2016 11:40

Last week I wrote about some typical interrupt response times you get from an Arduino and Raspberry Pi, if you follow basic examples from documentation or whatever comes up on Google. I got some quite unexpected results, like for instance a Python script that responds faster than a compiled C program. To check some of my guesses as to what caused those results, I did another set of measurements.

For Arduino, most response times were grouped around 9 microseconds, but there were a few outliers. I checked the Arduino library source and it indeed always enables AVR timer/counter0 overflow interrupt. If timer interrupt happens at the same time as the GPIO interrupt I was measuring, the GPIO interrupt can get delayed. Performing the measurement with the timer interrupt masked out indeed removes the outliers:

Effect of timer interrupt on Arduino response time.

With timer off, all measured response times are between 9.1986 to 8.9485 μs. This is a 0.2501 μs long interval. It fits perfectly with theory - at 16 MHz CPU clock and instruction length between 1 and 5 cycles, uncertainty for interrupt latency is 0.25 μs.

The second weird thing was the aforementioned discrepancy between Python and C on Raspberry Pi. The default Python library uses an ugly hack to bypass the kernel GPIO driver and control GPIO lines directly from user space: it mmaps a range of physical memory containing GPIO registers into its own process memory space using /dev/mem. This is similar to how X servers on Linux (used to?) access graphics hardware from user space. While this approach is very unportable, it's also much faster since you don't need to do context switches into kernel for every operation.

To check just how much faster mmap method is on Raspberry Pi, I copied the GPIO access code from the RPi.GPIO library into my test C program:

Response times using sysfs and mmap methods on Raspberry Pi.

As you can see, the native program is now faster than the interpreted Python script. This also demonstrates just how costly context switches are: the sysfs version is more than two times slower on average. It's also worth noting that both RPi.GPIO and my C program still use epoll() or select() on a sysfs file to wait for the interrupt. Just output pin change can be done with direct memory accesses.

Finally, Raspberry Pi was faster when the CPU was loaded which seemed counterintuitive. I tracked this down to automatic CPU frequency scaling. By default, Raspberry Pi Zero seems to be set to run between 700 MHz and 1000 MHz using ondemand governor. If I switch to performance governor, it keeps the CPU running at 1 GHz at all times. In that case, as expected, the CPU load increases the average response time:

Effect of cpufreq governor on Raspberry Pi response time.

It's interesting to note that Linux kernel comes with pluggable idle loop implementations (CONFIG_CPU_IDLE). The idle loop can be selected through /sys/devices/system/cpu/cpuidle in a similar way to the CPU frequency governor. The Raspbian Jessie release however has that disabled. It uses the default idle loop for ARMv6 processors. Assembly code has been patched though. The ARM Wait For Interrupt WFI instruction in the vanilla kernel has been replaced with some mcreq (write to coprocessor?) instructions. I can't find any info on the JIRA ticket referenced in the comment and the change has been added among other BCM-specific changes in a single 6400-line commit. Idle loop implementation is interesting because if it puts the CPU into a power saving mode, it can affect the interrupt latency as well.

As before, source code and raw data is on GitHub.

Posted by Tomaž | Categories: Digital | Comments »

Measuring interrupt response times

18.04.2016 15:13

Embedded systems were traditionally the domain of microcontrollers. You programmed them in C on bare metal, directly poking values into registers and hooking into interrupt vectors. Only if it was really necessary you would include some kind of a light-weight operating system. Times are changing though. These days it's becoming more and more common to see full Linux systems and high-level languages in this area. It's not surprising: if I can just pop open a shell, see what exceptions my Python script is throwing and fix them on the fly, I'm not going to bother with microcontrollers and the whole in-circuit debugger thing. Some even say it won't be long before we will all be just running web browsers on our devices.

It seems to be common knowledge that the traditional approach really excels at latency. If you're moderately careful with your code, you can get your system to react very quickly and consistently to events. Common embedded Linux systems don't have real-time features. They seem to address this deficiency with some combination of "don't care", "it's good enough" and throwing raw CPU power at the problem. Or as the author of RPi.GPIO library puts it:

If you are after true real-time performance and predictability, buy yourself an Arduino.

I was wondering what kind of performance you could expect from these modern systems. I tend to be very conservative in my work: I have a pile of embedded Linux-running boards, but they are mostly gathering dust while I stick to old-fashioned Cortex M3s and AVRs. So I thought it would be interesting to do some experiments and get some real data about these things.

Measuring interrupt response times on Arduino.

To test how fast a program can respond to an event, I chose a very simple task: Raise an output digital line whenever a rising edge happens on an input digital line. This allowed me to very simply measure response times in an automated fashion using an USB-connected oscilloscope and a signal generator.

I tested two devices: An Arduino Uno using a 16 MHz ATmega328 microcontroller and an Raspberry Pi Zero using a 1 GHz ARM-based CPU running Raspbian Jessie. I tried several approaches to implementing the task. On Arduino, I implemented it with an interrupt and a polling loop. On Raspberry Pi, I tried a kernel module, a native binary written in C and a Python program. You can see exact source code on GitHub.

Measuring interrupt response times on Raspberry Pi.

For all of these, I chose the most obvious approach possible. My implementations were based as much as possible on the preferred libraries mentioned in the documentation or whatever came up on top of my web searches. This meant that for Arduino, I was using the Arduino IDE and the library that comes with it. For Raspberry Pi, I used the RPi.GPIO Python library, the GPIO sysfs interface for native code in user space and the GPIO consumer interface for the kernel module (based on examples from Stefan Wendler). Definitely many of these could be further hand-optimized, but I was mostly interested here in out-of-the-box performance you could get in the first try.

Here is a histogram of 500 measurements for the five implementations:

Histogram of response time measurements.

As expected, Arduino and the Raspberry Pi kernel module were both significantly faster and more consistent than the two Raspberry Pi user space implementations. Somewhat shocking though, the interpreted Python program was considerably faster than my C program compiled into native code.

If you check the source, RPi.GPIO library maps the hardware registers directly into its process memory. This means that it does not need any syscalls for controlling the GPIO lines. On the other hand, my C implementation uses the kernel's sysfs interface. This is arguably a cleaner and safer way to do it, but it requires calls into the kernel to change GPIO states and these require expensive context switches. This difference is likely the reason why Python was faster.

Histogram of response time measurements (zoomed)

Here is the zoomed-in left part of the histogram. Raspberry Pi kernel module can be just as fast as the Arduino, but is less consistent. Not surprising, since the kernel has many other interrupts to service and not that impressive considering 60 times faster CPU clock.

Arduino itself is not that consistent out-of-the-box. While most interrupts are served in around 9 microseconds (so around 140 CPU cycles), occasionally they take as long as 15 microseconds. Probably Arduino library is to blame here since it uses the timer interrupt for delay functions. This interrupt seems to be always enabled, even when a delay function is not running, and hence competes with the GPIO interrupt I am using.

Also, this again shows that polling on Arduino can sometimes be faster than interrupts.

Effect of CPU load on response time.

Another interesting result was the effect of CPU load on Raspberry Pi response times. Somewhat counter intuitively, response times are smaller on average when there is some other process consuming CPU cycles. This happens even with the kernel module, which makes me think it has something to do with power saving features. Perhaps this is due to CPU frequency scaling or maybe the kernel puts an idle CPU into some sleep mode from which it takes longer to wake up.

In conclusion, I was a bit impressed how well Python scores on this test. While it's an order of magnitude slower than Arduino, 200 microseconds on average is not bad. Of course, there's no hard upper limit on that. In my test, some responses took two times as much and things really start falling apart if you increase the interrupt load (like for instance, with a process that does something with the SD card or network adapter). Some of the results on Raspberry Pi were quite surprising and they show once again that intuition can be pretty wrong when it comes to software performance.

I will likely be looking into more details regarding some of these results. If you would like to reproduce my measurements, I've put source code, raw data and a notebook with analysis on GitHub.

Posted by Tomaž | Categories: Digital | Comments »

Clockwork, part 2

10.04.2016 19:39

I hate to leave a good puzzle unsolved. Last week I was writing about a cheap quartz mechanism I got from an old clock that stopped working. I said that I could not figure out why its rotor only turns in one direction given a seemingly symmetrical construction of the coil that drives it.

There is quite a number of tear downs and descriptions of how such mechanisms work on the web. However, very few seem to address this issue of direction of rotation and those that do don't give a very convincing argument. Some mention that the direction has something to do with the asymmetric shape of the coil's core. This forum post mentions that the direction can be reversed if a different pulse width is used.

So, first of all I had a closer look at the core. It's made of three identical iron sheets, each 0.4 mm thick. Here is one of them on the scanner with the coil and the rotor locations drawn over it:

Coil location and direction of rotation.

It turns out there is in fact a slight asymmetry. The edges of the cut-out for the rotor are 0.4 mm closer together on one diagonal than on the other. It's hard to make that out with unaided eye. It's possible that the curved edge on the other side makes it less error prone to construct the core with all three sheets in same orientation.

Dimension drawing of the magnetic core.

The forum post about pulse lengths and my initial thought about shaded pole motors made me think that there is some subtle transient effect in play that would make the rotor prefer one direction over the other. Using just a single coil, core asymmetry cannot result in a rotating magnetic field if you assume linear conditions (e.g. no part of the core gets saturated) and no delay due to eddy currents. Shaded pole motors overcome this by delaying magnetization of one part of the core through a shorted auxiliary winding, but no such arrangement is present here.

I did some measurements and back-of-the-envelope calculations. The coil has approximately 5000 turns and resistance of 215 Ω. The field strength is nowhere near saturation for iron. The current through the coil settles somewhere on the range of milliseconds (I measured a time constant of 250 μs without the core in place). It seems unlikely any transients in magnetization can affect the movements of the rotor.

After a bit more research, I found out that this type of a motor is called a Lavet type stepping motor. In fact, its operation can be explained completely using static fields and transients don't play any significant role. The rotor has four stable points: two when the coil drives the rotor in one or the other direction and two when the rotor's own permanent magnetization attracts it to the ferromagnetic core. The core asymmetry creates a slight offset between the former and the latter two points. Wikipedia describes the principle quite nicely.

To test this principle, I connected the coil to an Arduino and slowly stepped this clockwork motor through it's four states. The LED on the Arduino board above shows when the coil is energized. The black dot on the rotor roughly marks the position of one of its poles. You can see that when the coil turns off, the rotor turns slightly forward as its permanent magnet aligns it with the diagonal on the core that has a smaller air gap (one step is a bit more pronounced than the other on the video above). This slight forward advancement from the neutral position then makes the rotor prefer the forward over the backward motion when the coil is energized in the other direction.

It's always fascinating to see how a mundane thing like a clock still manages to have parts in it whose principle of operation is very much not obvious from the first glance.

Posted by Tomaž | Categories: Life | Comments »


28.03.2016 15:17

Recently one of the clocks in my apartment stopped. It's been here since before I moved in and is probably more than 10 years old. The housing more or less crumbled away as I opened it. On the other hand the movement inside looked like it was still in a good condition, so I had a look if there was anything in it that I could fix.

Back side of the quartz clock movement.

This is a standard 56 mm quartz wall clock movement. It's pretty much the same as in any other cheap clock I've seen. In this case, its makers were quick to dispel any myths about its quality: no jewels in watchmaker's parlance means no quality bearings and I'm guessing unadjusted means that the frequency of its quartz oscillator can't be adjusted.

Circuit board in the clock movement.

As far as electronics is concerned, there's not much to see in there. There's a single integrated circuit, bonded to a tiny PCB and covered with a blob of epoxy. It uses a small tuning-fork quartz resonator to keep time. As the cover promised, there's no sign of a trimmer for adjusting the quartz load capacitance. Two exposed pads on the top press against some metallic strips that connect to the single AA battery. The life time of the battery was probably more than a year since I don't remember the last time I had to change it.

Coil from the clock movement.

The circuit is connected to a coil on the other side of the circuit board. It drives the coil with 30 ms pulses once per second with alternating polarity. The oscilloscope screenshot below shows voltage on the coil terminals.

Voltage waveform on the two coil terminals.

When the mechanism is assembled, there's a small toroidal permanent magnet sitting in the gap in the coil's core with the first plastic gear on top of it. The toroid is laterally magnetized and works as a rotor in a simple stepper motor.

Permanent magnet used as a rotor in clock movement.

The rotor turns half a turn every second and this is what gives off the audible tick-tock sound. I'm a bit puzzled as to what makes it turn only in one direction. I could see nothing that would work as a shaded pole or something like that. The core also looks perfectly symmetrical with no features that would make it prefer one direction of rotation over the other. Maybe the unusual cutouts on the gear for the second hand have something to do with it.

Update: my follow-up post explains what determines direction of rotation.

Top side of movement with gears in place.

This is what the mechanism looks like with gears in place. The whole construction is very finicky and a monument to material cost reduction. There's no way to run it without the cover in place since gears fall over and the impulses in the coil actually eject the rotor if there's nothing on top holding it in place (it's definitely not as well behaved as one in this video). In fact, I see no traces that the rotor magnet has been permanently bonded in any way with the first gear. It seems to just kind of jump around in the magnetic field and drive the mechanism by rubbing against the inside of the gear.

In the end, I couldn't find anything obviously wrong with this thing. The electronics seem to work correctly. The gears also look and turn fine. When I put it back together it would sometimes run, sometimes it would just jump one step back and forth and sometimes it would stand still. Maybe some part wore down mechanically, increasing friction. Or maybe the magnet lost some of its magnetization and no longer produces enough torque to reliably turn the mechanism. In any case, it's going into the scrap box.

Posted by Tomaž | Categories: Life | Comments »

The problem with gmail.co

14.03.2016 19:41

At this moment, the gmail.co (note missing m) top-level domain is registered by Google. This is not surprising. It's common practice these days for owners of popular internet services to buy up domains that are similar to their own. It might be to fight phising attacks (e.g. go-to-this-totally-legit-gmail.co-login-form type affairs), prevent typosquatting or purely for convenience to redirect users that mistyped the URL to the correct address.

$ whois gmail.co
Registrant Organization:                     Google Inc.
Registrant City:                             Mountain View
Registrant State/Province:                   CA
Registrant Country:                          United States

gmail.co currently serves a plain 404 Not Found page on the HTTP port. Not really user friendly, but I guess it's good enough to prevent web-based phising attacks.

Now, with half of the world using ...@gmail.com email addresses, it's not uncommon to also mistakenly send an email to a ...@gmail.co address. Normally, if you mistype the domain part of the email address, your MTA will see the DNS resolve fail and you would immediately get either a SMTP error at the time of submission, or a bounced mail shortly after.

Unfortunately, gmail.co domain actually exists, which means that MTAs will in fact attempt to deliver mail to it. There's no MX DNS record, however SMTP specifies that MTAs must in that case use the address in A or AAAA records for delivery. Those do exist (as they allow the previously mentioned HTTP error page to be served to a browser).

To further complicate the situation, the SMTP port 25 on IPs referenced by those A and AAAA records is blackholed. This means that a MTA will attempt to connect to it, hang while the remote host eats up SYN packets, and fail after the TCP handshake timeouts. A timeout looks to the MTA like an unresponsive mail server, which means it will continue to retry the delivery for a considerable amount of time. The RFC 5321 says that it should take at least 4-5 days before it gives up and sends a bounce:

Retries continue until the message is transmitted or the sender gives up; the give-up time generally needs to be at least 4-5 days. It MAY be appropriate to set a shorter maximum number of retries for non- delivery notifications and equivalent error messages than for standard messages. The parameters to the retry algorithm MUST be configurable.

In a nutshell, what all of this means is that if you make a typo and send a mail to @gmail.co, it will take around a week for you to receive any indication that your mail was not delivered. Needless to say, this is bad. Especially if the message you were sending was time critical in nature.

Update: Exim will warn you when a message has been delayed for more than 24 hours, so you'll likely notice this error before the default 6 day retry timeout. Still, it's annoying and not all MTAs are that friendly.

The lesson here is that, if you register your own typosquatting domains, do make sure that mail sent to them will be immediately bounced. One way is to simply set an invalid MX record (this is an immediate error for SMTP). You can also run a SMTP server that actively rejects all incoming mail (possibly with a friendly error message reminding the user of the mistyped address), but that requires some more effort.

As for this particular Google's blunder, a workaround is to put a special retry rule for gmail.co in your MTA so that it gives up faster (e.g. see Exim's Retry configuration).

Posted by Tomaž | Categories: Life | Comments »

Some SPF statistics

21.02.2016 19:04

Some people tend their backyard gardens. I host my own mail server. Recently, there has been a push towards more stringent mail server authentication to fight spam and abuse. One of the simple ways of controlling which server is allowed to send mail for a domain is the Sender Policy Framework. Zakir Durumeric explained it nicely in his Neither Snow Nor Rain Nor MITM talk at the 32C3.

The effort for more authentication seems to be headed by Google. That is not surprising. Google Mail is e-mail for most people nowadays. If anyone can push for changes in the infrastructure, it's them. A while ago Google published some statistics regarding the adoption of different standards for their inbound mail. Just recently, they also added visible warnings for their users if mail they received has not been sent from an authenticated server. Just how much an average user can do about that (except perhaps pressure their correspondents to start using Google Mail) seems questionable though.

Anyway, I implemented a SPF check for inbound mail on my server some time ago. I never explicitly rejected mail based on it however. My MTA just adds a header to incoming messages. I was guessing the added header may be picked out by the Bayesian spam filter, if it became significant at any point. After reading about Google's efforts I was wondering what the situation regarding SPF checks looks like for me. Obviously, I see a very different sample of the world's e-mail traffic as Google's servers.

For this experiment I took a 3 month sample of inbound e-mail that was received by my server between November 2015 and January 2016. The mail was classified by Bogofilter into spam and non-spam mail, mostly based on textual content. SPF records were evaluated by spf-tools-perl upon reception. Explanation of results (what softfail, permerror, etc. means) is here.

SPF evaluation results for non-spam messages.

SPF evaluation results for spam messages.

As you can see, the situation in this little corner of the Internet is much less optimistic than the 95.3% SPF adoption rate that Google sees. More than half of mail I see doesn't have a SPF record. A successful SPF record validation also doesn't look like that much of a strong signal for spam filtering either, with 22% of spam mail successfully passing the check.

It's nice that I saw no hard SPF failures for non-spam mail. I checked my inbox for mail that had softfails and permerrors. Some of it was borderline-spammy and some of it was legitimate and appeared to be due to the sender having a misconfigured SPF record.

Another interesting point I noticed is that some sneaky spam mail comes with their own headers claiming SPF evaluation. This might be a problem if the MTA just adds another Received-SPF header at the bottom and doesn't remove the existing one. If you then have a simple filter on Received-SPF: pass somewhere later in the pipeline it's likely the filter will hit the spammer's header first instead of the header your MTA added.

Posted by Tomaž | Categories: Life | Comments »

IEEE 802.11 channel survey statistics

14.02.2016 20:17

I attended the All our shared spectrum are belong to us talk by Paul Fuxjaeger this December at the 32C3. He talked about a kernel patch that adds radio channel utilization data to the beacons sent out by a wireless LAN access point. The patched kernel can be used with OpenWRT to inform other devices in the network of the state of the radio channel, as seen from the standpoint of the access point. One of the uses for this is reducing the hidden node problem and as I understand they used it to improve robustness of a Wi-Fi mesh network.

His patch got my attention because I did not know that typical wireless LAN radios kept such low-level radio channel statistics. I previously did some surveys of the occupancy of the 2.4 GHz band for a seminar at the Institute. My measurements using a TI CC2500 transceiver on a VESNA sensor node showed that the physical radio channel very rarely reached any kind of significant occupancy, even in the middle of very busy Wi-Fi networks. While I did not pursue it further, I remained suspicious of that result. Paul's talk gave me an idea that it would be interesting to compare such measurements with statistics available from the Wi-Fi adapter itself.

I did some digging and found the definition of struct survey_info in include/net/cfg80211.h in Linux kernel:

 * struct survey_info - channel survey response
 * @channel: the channel this survey record reports, mandatory
 * @filled: bitflag of flags from &enum survey_info_flags
 * @noise: channel noise in dBm. This and all following fields are
 *	optional
 * @channel_time: amount of time in ms the radio spent on the channel
 * @channel_time_busy: amount of time the primary channel was sensed busy
 * @channel_time_ext_busy: amount of time the extension channel was sensed busy
 * @channel_time_rx: amount of time the radio spent receiving data
 * @channel_time_tx: amount of time the radio spent transmitting data
 * Used by dump_survey() to report back per-channel survey information.
 * This structure can later be expanded with things like
 * channel duty cycle etc.
struct survey_info {
	struct ieee80211_channel *channel;
	u64 channel_time;
	u64 channel_time_busy;
	u64 channel_time_ext_busy;
	u64 channel_time_rx;
	u64 channel_time_tx;
	u32 filled;
	s8 noise;

You can get these values for a wireless adapter on a running Linux system using ethtool or iw without patching the kernel:

$ ethtool -S wlan0
NIC statistics:
     noise: 161
     ch_time: 19404258681
     ch_time_busy: 3082386526
     ch_time_ext_busy: 18446744073709551615
     ch_time_rx: 2510607684
     ch_time_tx: 371239068
$ iw dev wlan0 survey dump
Survey data from wlan0
	frequency:			2422 MHz [in use]
	noise:				-95 dBm
	channel active time:		19404338429 ms
	channel busy time:		3082398273 ms
	channel receive time:		2510616517 ms
	channel transmit time:		371240518 ms

It's worth noting that ethtool will print out values even if the hardware does not set them. For example the ch_time_ext_busy above seems invalid (it's 0xffffffffffffffff in hex). ethtool also incorrectly interprets the noise value (161 is -95 interpreted as an unsigned 8-bit value).

What fields contain valid values depends on the wireless adapter. In fact, grepping for channel_time_busy through the kernel tree lists only a handful of drivers that touch it. For example, the iwlwifi-supported Intel Centrino Advanced-N 6205 in my laptop does not fill any of these fields, while the Atheros AR9550 adapter using the ath9k driver in my wireless router does.

Short of the comment in the code listed above, I haven't yet found any more detailed documentation regarding the meaning of these fields. I checked the IEEE 802.11 standard, but as far as I can see, it only mentions that radios can keep the statistic on the percentage of the time the channel is busy, but doesn't go into details.

Weekly Wi-Fi channel statistic

Weekly Wi-Fi channel noise statistic

I wrote a small Munin plugin that keeps track of these values. For time counters the graph above shows the derivative (effectively the percentage of time the radio spent in a particular state). With some experiments I was able to find out the following:

  • noise: I'm guessing this is the threshold value used by the energy detection carrier sense mechanism in the CSMA/CA protocol. It's around -95 dBm for my router, which sounds like a valid value. It's not static but changes up and down by a decibel or two. It would be interesting to see how the radio dynamically determines the noise level.
  • channel_time: If you don't change channels, this goes up by 1000 ms each 1000 ms of wall-clock time.
  • channel_time_busy: Probably the amount of time the energy detector showed input power over the noise threshold, plus the time the radio was in transmit mode. This seems to be about the same as the sum of RX and TX times, unless there is a lot of external interference (like another busy Wi-Fi network on the same channel).
  • channel_time_ext_busy: This might be a counter for the secondary radio channel used in channel binding, like in 802.11ac. I haven't tested this since channel binding isn't working on my router.
  • channel_time_rx: On my router this increments at around 10% of channel_time rate, even if the network is completely idle, so I'm guessing it triggers at some very low-level, before packet CRC checks and things like that. As expected, it goes towards 100% of channel_time rate if you send a lot of data towards the interface.
  • channel_time_tx: At idle it's around 2% on my router, which seems consistent with beacon broadcasts. It goes towards 100% of channel_time rate if you send a lot of data from the interface.

In general, the following relation seems to hold:

\frac{t_{channel-tx} + t_{channel-rx}}{t_{channel}} < \frac{t_{channel-busy}}{t_{channel}} < 100\%
Posted by Tomaž | Categories: Life | Comments »

Rapitest Socket Tester

29.01.2016 17:47

John Ward has a series of videos on YouTube where he discusses the Rapitest Socket Tester. This is a device that can be used to quickly check whether a UK-style 230 V AC socket has been wired correctly. John explains how a device like that can be dangerously misleading, if you trust its verdict too much. Even if Rapitest shows that the socket passed the test, the terminals in the socket can still be dangerously miswired.

Rapitest Socket Tester (Part 1)

(Click to watch Rapitest Socket Tester (Part 1) video)

I have never seen a device like this in person. Definitely they are not common in this part of the world. Possibly because the German "Schuko" sockets we use don't define the positions of the live and neutral connections and hence there are fewer mistakes to make in wiring them. The most common testing apparatus for household wiring jobs here is the simple mains tester screwdriver (about which John has his own strong opinion and I don't completely agree with him there).

From the first description of the Rapitest device, I was under the impression that it must contain some non-linear components. Specifically after hearing that it can detect when the line and neutral connections in the socket have been reversed. I was therefore a bit surprised when I saw that the PCB inside the device contains just a few resistors. I was curious how it manages to do its thing with such a simple circuit, so I went slowly through the part of the video that shows the disassembly and sketched out the schematic:

Schematic of the Rapitest Socket Tester

S1 through S3 are the neon indicator lamps that are visible on the front of the device, left to right. L, N and E are line, neutral and earth pins that fit into the corresponding connections in the socket. It was a bit hard to read out the resistor values from the colors on the video, so there might be some mistakes there, but I believe the general idea of the circuit is correct.

It's easy to see from this circuit how the device detects some of the fault conditions that are listed on the front. For instance, if earth is disconnected, then S3 will not light up. In that case, voltage on S3 is provided by the voltage divider R7 : R8+R1+R2 which does not provide a high enough voltage to strike an arc in the lamp (compared to R7 : R8, if earth is correctly connected).

Similarly, if line and neutral are reversed, only the R3 : R5 divider will provide enough voltage and hence only S1 will light up. S3 has no voltage since it is connected across neutral and earth in that case. For S2, the line voltage is first halved across R2 and R1 and then reduced further due to R4 and R6.

Rapitest 13 Amp Socket Tester

Image by John Ward

However, it's hard to intuitively see what would happen in all 64 possible scenarios (each of the 3 terminals can in theory be connected to either line, neutral, earth or left disconnected, hence giving 43 combinations). To see what kind of output you would theoretically get in every possible situation, I threw together a simple Spice simulation of the circuit drawn above. A neon lamp is not trivial to simulate in Spice, so I simplified things a bit. I modeled lamps as open-circuits and only checked whether the voltage on them would reach the breakdown voltage of around 100 V. If the voltage across a lamp was higher, I assumed it would light up.

The table below shows the result of this simulation. First three columns show the connection of the tree socket terminals (NC means the terminal is not connected anywhere). I did not test situations where a terminal would be connected over some non-zero impedance. An X in one of the last three columns means that the corresponding lamp would turn on in that case.

  L N E S1 S2 S3
1 L L L      
2 L L N     X
3 L L E     X
4 L L NC      
5 L N L X    
6 L N N X X X
7 L N E X X X
8 L N NC X X  
9 L E L X    
10 L E N X X X
11 L E E X X X
12 L E NC X X  
13 L NC L      
14 L NC N   X X
15 L NC E   X X
16 L NC NC      
17 N L L X X X
18 N L N X    
19 N L E X    
20 N L NC X X  
21 N N L     X
22 N N N      
23 N N E      
24 N N NC      
25 N E L     X
26 N E N      
27 N E E      
28 N E NC      
29 N NC L   X X
30 N NC N      
31 N NC E      
32 N NC NC      
33 E L L X X X
34 E L N X    
35 E L E X    
36 E L NC X X  
37 E N L     X
38 E N N      
39 E N E      
40 E N NC      
41 E E L     X
42 E E N      
43 E E E      
44 E E NC      
45 E NC L   X X
46 E NC N      
47 E NC E      
48 E NC NC      
49 NC L L      
50 NC L N      
51 NC L E      
52 NC L NC      
53 NC N L      
54 NC N N      
55 NC N E      
56 NC N NC      
57 NC E L      
58 NC E N      
59 NC E E      
60 NC E NC      
61 NC NC L      
62 NC NC N      
63 NC NC E      
64 NC NC NC      

I marked with blue the six combinations (7, 8, 15, 19, 37, 55) that are shown on the front of the device. They show that in those cases my simulation produced the correct result.

Five rows marked with red show situations where the device shows "Correct" signal, but the wiring is not correct. You can immediately see two classes of problems that the device fails to detect:

  • It cannot distinguish between earth and neutral (combinations 6, 10 and 11). This is obvious since both of these are on the same potential (in my simulation and to some approximation in reality as well). However, if a residual-current device is installed, any fault where earth and neutral have been swapped should trip it as soon as any significant load is connected to the socket.
  • It also fails to detect when potentials have been reversed completely (e.g. line is on both neutral and earth terminals and either neutral or earth is on the line terminal - combinations 17 and 33). This is the deadly as wrong as you can get situation shown by John in the second part of his video.

Under the assumption that you only have access to AC voltages on the three terminals in the socket, both of these fault situations are in fact impossible to distinguish from the correct one with any circuit or device.

It's worth noting also that the Rapitest can give a dangerously misleading information in other cases as well. For instance, the all-lights-off "Line not connected" result might give someone the wrong impression that there is no voltage in the circuit. There are plenty of situations where line voltage is present on at least one of the terminals, but all lamps on the device are off.

Posted by Tomaž | Categories: Analog | Comments »

Display resolutions in QEMU in Windows 10 guest

21.01.2016 18:08

A while back I posted a recipe on how to add support for non-standard display resolutions to QEMU's stdvga virtual graphics card. For instance, I use that to add a wide-screen 1600x900 mode for my laptop. That recipe still works on Windows 7 guest and the latest QEMU release 2.5.0.

On Windows 10, however, the only resolution you can set with that setup is 1024x768. Getting it to work requires another approach. I should warn though that performance seems quite bad. Specifically, opening a web page that has any kind of dynamic elements on it can slow down the guest so much that just closing the offending window can take a couple of minutes.

First of all, the problem with Windows 10 being stuck at 1024x768 can be solved by switching the VGA BIOS implementation from the Bochs BIOS, which is shipped with QEMU upstream by default, to SeaBIOS. Debian packages switched to SeaBIOS in 1.7.0+dfsg-2, so if you are using QEMU packages from Jessie, you should already be able to set resolutions other than 1024x768 in Windows' Advanced display settings dialog.

If you're compiling QEMU directly from upstream, switching the VGA BIOS is as simple as replacing the vgabios-stdvga.bin file. Install the seabios package and run the following:

$ rm $PREFIX/share/qemu/vgabios-stdvga.bin
$ ln -s /usr/share/seabios/vgabios-stdvga.bin $PREFIX/share/qemu

where $PREFIX is the prefix you used when installing QEMU. If you're recompiling QEMU often, the following patch for QEMU's top-level Makefile does this automatically on make install:

--- qemu-2.5.0.orig/Makefile
+++ qemu-2.5.0/Makefile
@@ -457,6 +457,8 @@ ifneq ($(BLOBS),)
 	set -e; for x in $(BLOBS); do \
 		$(INSTALL_DATA) $(SRC_PATH)/pc-bios/$$x "$(DESTDIR)$(qemu_datadir)"; \
+	rm -f "$(DESTDIR)$(qemu_datadir)/vgabios-stdvga.bin"
+	ln -s /usr/share/seabios/vgabios-stdvga.bin "$(DESTDIR)$(qemu_datadir)/vgabios-stdvga.bin"
 ifeq ($(CONFIG_GTK),y)
 	$(MAKE) -C po $@

Now you should be able to set resolutions other than 1024x768, but SeaBIOS still doesn't support non-standard resolutions like 1600x900 by default. For that, you need to amend the list of video modes and recompile SeaBIOS. Get the source (apt-get source seabios) and apply the following patch:

--- seabios-1.7.5.orig/vgasrc/bochsvga.c
+++ seabios-1.7.5/vgasrc/bochsvga.c
@@ -99,6 +99,9 @@ static struct bochsvga_mode
     { 0x190, { MM_DIRECT, 1920, 1080, 16, 8, 16, SEG_GRAPH } },
     { 0x191, { MM_DIRECT, 1920, 1080, 24, 8, 16, SEG_GRAPH } },
     { 0x192, { MM_DIRECT, 1920, 1080, 32, 8, 16, SEG_GRAPH } },
+    { 0x193, { MM_DIRECT, 1600, 900,  16, 8, 16, SEG_GRAPH } },
+    { 0x194, { MM_DIRECT, 1600, 900,  24, 8, 16, SEG_GRAPH } },
+    { 0x195, { MM_DIRECT, 1600, 900,  32, 8, 16, SEG_GRAPH } },
 static int dispi_found VAR16 = 0;

You probably want to also increment the package version to keep it from being overwritten next time you do apt-get upgrade. Finally, recompile the package with dpkg-buildpackage and install it.

Now when you boot the quest you should see your new mode appear in the list of resolutions. There is no need to recompile or reinstall QEMU again.

Posted by Tomaž | Categories: Code | Comments »