Further adventures in Chromebook-land

28.01.2013 21:33

As you might remember from my previous blog post, I have a disassembled ARM-based Samsung Chromebook lying around, occupying various horizontal surfaces that might otherwise be put to better use. After an initial success with replacing the original, feature-challenged OS with the armhf port of Debian Wheezy, I hit a couple of snags. First, I found out that running the computer with a non-Google-signed OS means having to look at an annoying warning message at each boot and having to press Ctrl-D (while being careful not to turn off developer mode by touching any other keys by mistake) or wait for a minute or so. And second, by carelessly playing around with alsamixer, I managed to get the left built-in speaker to melt through the bottom casing of the laptop.

Naturally, the smart decision would at that point be to return the laptop to the shop and demand my money back. Of course, I chose the other way.

Chromebook left speaker close-up

As far as the speaker is concerned, it's quite beyond repair. While the body (which I guess is kind of a resonance chamber?) and the magnet are quite unharmed, the membrane and the coil (made with a piece of flexible PCB as far as I could see) ended up in a puddle of molten plastic. I suppose the other speaker is still working correctly, but I didn't test it as I have it disconnected from the motherboard for now.

Chromebook motherboard, top side

Once you remove the human interface, the business part of the laptop is a surprisingly small motherboard, containing little more than the Exynos system-on-chip surrounded by a bunch of memory chips and power supply circuits.

Chromebook motherboard, bottom side

The problem with the annoying bootloader turned out to be harder to solve than I thought. As I understand the boot process, the CPU first runs pre-boot code (apparently some proprietary initialization code from Samsung). This then loads a secondary program loader, which in turn loads U-Boot. This one then annoys you and goes on to load anything you want from the SSD, provided you are in developer mode, of course. I'm not sure about the first two parts, but U-Boot is stored on a Winbond 25Q32DW series serial flash chip with an SPI interface.

Serial flash IC on Chromebook motherboard

This chip has an active-low write-protect pin. The pin is pulled low by default somehow, which prevents the main Exynos CPU from writing to flash. It doesn't seem to be tied to ground though, so I'm guessing it might be controlled from the embedded system controller (or a GPIO pin from Exynos, but that would be kind of stupid). If you browse Google's documentation there are some mentions of a mysterious servo2 debug board that apparently allows you to overwrite the flash and even boot the computer if the flash is corrupted. I haven't been able to find any kind of details about it, not even where you are supposed to connect it. There are no special debug connectors on the laptop's motherboard as far as I can see, so it's either plugged into one of the externally accessible connectors (USB, HDMI, SD card, audio) and does some magic through there (possibly with the help of ESC), or servo2 refers to a special version of the motherboard that has some additional debug capabilities.

In any case, without the debug board's magic, replacing the bootloader doesn't look simple. I can rewire the write-protect pin, but that will give me exactly one try at programming the flash. If I botch it, Exynos will crash on boot and I won't get another chance. I'm not sure I'm capable of desoldering the flash chip without destroying it, reprogramming it externally and soldering it back without messing up any of the tiny SMD components around it. There does seem to be an Arduino-based programmer already available for these chips though, so at least I would be spared the task of coding that myself.

I've built the Chromium OS development environment which in turn can also be used to build the flash images. While everything seems to build without problems, I'm kind of confused as to whether the images built in this way still include the annoying warning or not. The build process itself turned out to be quite convoluted, involving surprising amounts of complex Python code (what's wrong with Makefiles?) hidden behind Gentoo's Portage scripts and has so far resisted my attempts to find out how the flash image is actually constructed.

Unfortunately, while poking inside the laptop I managed to add a third problem on top of the previous two. While re-attaching the copper cooling plate my screwdriver slipped and shattered one of the tiny bare-die flip-chip packages around the STM32F10086 controller.

Shattered flip-chip component

These seem to be bare silicon soldered directly to the PCB without any kind of packaging and are surprisingly brittle. From what's left of it and by looking at similar components on the board, I'm guessing the laser-etched back-side marking originally said 2822HN. I'm not sure what its function was and I can't find any references on the web for these components (the other type of a similar component used on the motherboard is 28DCV7). Perhaps a discrete logic gate? In any case, it's quite beyond my capabilities to replace, even if I managed to get a replacement part.

Surprisingly, with this component in the broken state as it is, the laptop still boots. So far I haven't tested whether all peripherals still work. One effect seems to be that the power button is now kind of unreliable, taking several presses before the computer turns on. But that might also not be related - with the casing open, everything is kind of wobbly and I wouldn't be surprised if the keyboard isn't properly supported in this setup.

In any case, this Chromebook seems to be a failure as far as replacing my EeePC goes. In this broken form it certainly won't become a computer I can rely on when traveling, even if I manage to replace the bootloader. Might eventually turn out useful for some other project though.

Posted by Tomaž | Categories: Life | Comments »

GPG key transition

13.01.2013 20:22

I've been using the same GnuPG key pair for signing and encrypting my mail since 2001. If you are not using an email client that is OpenPGP-aware you might have noticed that all my electronic correspondence seems to have a piece of robot barf appended at the end. I've been stubbornly insisting on at least signing all of my out-going mail, even for recipients that I know don't use public-key cryptography, in a futile attempt to raise awareness about these things.

This secret 1024 bit DSA/ElGamal pair will soon be 12 years old. It has been moved between many machines and, while I'm quite careful about these things, it's at least probable that in all these years it has leaked somewhere. It's also hopelessly outdated by any modern standard and quite within the reach of modern code-breakers. Listening to the RSA factorization in the real world talk at 29C3 finally reminded me to take the plunge and replace it with a modern 4096 bit RSA key. I've also moved to SHA256 digests, as recommended by Debian. And finally, to prevent the new key from getting this far beyond its best-before date, I've also set the expiry date to 5 years.

So, my new key is:

pub   4096R/0A822E7A 2013-01-13 [expires: 2018-01-12]
      Key fingerprint = 4EC1 9BBE DE7A 4AA1 E6EB  A82F 059A 0D2C 0A82 2E7A

I will be immediately switching all signatures to it. I will not revoke my old key for the next 90 days, but if you encrypt your mail, please use my new key instead. Also, if you got one of my Moo cards recently, please note that the GPG fingerprint on the back side refers to my old key.

You can import my new public key into your key chain by using the following command:

$ gpg --keyserver subkeys.pgp.net --recv-key 0A822E7A

I would appreciate it if you signed my new key to integrate it into the web of trust. If you meet me in person in the future, I will probably give you the key fingerprint, so you can be sure it's the correct one. Otherwise, if you trust my old key, you can check my official key transition statement, which is signed by both my old and my new key.


Pinkie sense debugging

12.01.2013 18:29

Here's another story about a debugging session that took an embarrassing amount of time and effort and confirmed once again that a well thought-out design for debuggability will pay for itself many times over in saved stomach acid and general developer well-being.

As you might remember, the UHF receiver I designed a while back has been deployed on VESNA sensor nodes in a cognitive radio testbed. When the first demos of the deployment were presented however it became apparent that nodes equipped with it had a problem: experiments would often fail mysteriously and had to be repeated a few times before valid measurements could be retrieved.

In this particular case firmware running on VESNA's ARM CPU implements a very simple, home-brew scheduler. An experimenter sends a list of tasks over the management network to the sensor node and the scheduler attempts to execute them roughly at the specified times. After a while, the experimenter asks the node if the given tasks have been completed, and if they have, requests a download of the recorded data. Often however nodes would simply continue replying that the tasks have not been completed well after the last command should have been concluded.

This kind of problem was specific to nodes equipped with the UHF receiver (not that other nodes don't have problems of their own) and seemingly limited to sensor nodes deployed high on light-poles as the one test article on my desk refused to exhibit this bug.

A hint of what might be going on came when I started monitoring the uptime of sensor nodes with Munin. When the bug manifested itself the number of seconds since the CPU reboot fell to zero, making apparent that the node was resetting itself during experiments. Upon reset the scheduler would forget about the scheduled tasks stored in volatile memory, leaving the non-volatile state that was queried by experimenters in a perpetual running state. This oversight on the part of scheduler design and the fact that the node can't signal errors back to the infrastructure over the management network protocol has already made debugging this issue harder than necessary.
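One way to avoid the perpetual "running" state described above is to reconcile the non-volatile records at boot. The following is only a minimal sketch of that idea: the state names, the `task_record` structure and the `recover_interrupted_tasks` function are all hypothetical and are not taken from the actual VESNA firmware.

```c
#include <stddef.h>

/* Hypothetical task states; the real scheduler's states are not shown in the post. */
enum task_state { TASK_PENDING, TASK_RUNNING, TASK_DONE, TASK_FAILED };

/* Assumed to be the part of a task that survives a reset (non-volatile). */
struct task_record {
    enum task_state state;
};

/* Meant to be called once at boot, before the scheduler starts.
 * Any record still marked RUNNING must have been interrupted by a reset,
 * so it is marked FAILED. Returns the number of records changed. */
size_t recover_interrupted_tasks(struct task_record *records, size_t count)
{
    size_t recovered = 0;
    for (size_t i = 0; i < count; i++) {
        if (records[i].state == TASK_RUNNING) {
            records[i].state = TASK_FAILED;
            recovered++;
        }
    }
    return recovered;
}
```

With something like this in place, an experimenter polling the node would see a failed task instead of one that never finishes.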

Next step was to determine what was causing these resets. Fortunately the STM32F103 CPU used on VESNA provides a helpful set of flags in the RCC_CSR register that allow you to distinguish between six different CPU reset reasons. Unfortunately, the bootloader on deployed nodes clobbers the values in the register, leaving no way to determine its value after reboot.
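To illustrate what those flags offer when they are left intact, here is a small sketch that decodes a saved copy of the register value. The bit positions are the ones I recall from the STM32F10x reference manual and should be checked against it; the enum and the function name `decode_reset_reason` are made up for this example, and the function takes a plain integer so it does not touch real hardware.

```c
#include <stdint.h>

/* Reset flag bits in RCC_CSR (as I read the STM32F10x reference manual). */
#define CSR_PINRSTF   (1u << 26)  /* reset from the NRST pin */
#define CSR_PORRSTF   (1u << 27)  /* power-on / power-down reset */
#define CSR_SFTRSTF   (1u << 28)  /* software reset */
#define CSR_IWDGRSTF  (1u << 29)  /* independent watchdog reset */
#define CSR_WWDGRSTF  (1u << 30)  /* window watchdog reset */
#define CSR_LPWRRSTF  (1u << 31)  /* low-power management reset */

/* Hypothetical names for the six reset reasons. */
enum reset_reason {
    RESET_UNKNOWN, RESET_PIN, RESET_POWER_ON, RESET_SOFTWARE,
    RESET_IWDG, RESET_WWDG, RESET_LOW_POWER
};

/* Decode a copy of RCC_CSR taken early at boot, before anything clears it.
 * Several flags can be set at once, so the pin flag is checked last;
 * this ordering is a judgment call, not something the manual prescribes. */
enum reset_reason decode_reset_reason(uint32_t csr)
{
    if (csr & CSR_LPWRRSTF) return RESET_LOW_POWER;
    if (csr & CSR_WWDGRSTF) return RESET_WWDG;
    if (csr & CSR_IWDGRSTF) return RESET_IWDG;
    if (csr & CSR_SFTRSTF)  return RESET_SOFTWARE;
    if (csr & CSR_PORRSTF)  return RESET_POWER_ON;
    if (csr & CSR_PINRSTF)  return RESET_PIN;
    return RESET_UNKNOWN;
}
```

The important part is that the value has to be read and stored somewhere before any other code clears the flags, which is exactly what the deployed bootloader prevented.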

VESNA with SNE-ISMTV-UHF in a weather-proof box.

Back to square one, I reasoned that resets might be hardware related. Since the UHF tuner is quite power hungry I guessed that it might have something to do with power supply on deployed nodes. I also suspected hangs in the STM32's I2C interface, which is supposedly notoriously buggy when presented with marginal signals on the bus. Add the fact that weather-proof plastic boxes used to house sensor nodes turned out not to be as resistant to rain as we originally hoped and hardware related problems did not seem that far-fetched.

However poking around the circuit did not reveal anything obviously wrong and with no way to reproduce the problem in a lab I came to another dead end.

The next breakthrough came when I managed to reproduce the problem on a node that had been unmounted and brought back to the lab. It turned out that on this node a specific command would result in a node reset in around 2 cases out of 100. This might not sound like much, but a real-life experiment would typically consist of many such tasks, adding up to a much higher probability of failure. Still, it took around two hours to reliably reproduce a reset in this way and this resulted in careful monitoring of failure probabilities versus the physical node and firmware versions. This monitoring later proved that all nodes exhibited this bug, even the test article I initially marked as problem-free.
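To make the "adding up" concrete: if each task independently triggers a reset with probability p, an experiment of n tasks fails with probability 1 - (1 - p)^n. The sketch below computes that; it assumes the tasks fail independently, and the function name is made up for illustration.

```c
/* Probability that at least one of n independent tasks triggers a reset,
 * given a per-task reset probability p. Computes 1 - (1 - p)^n with a
 * simple loop so no math library is needed. */
double experiment_failure_probability(double p, unsigned n)
{
    double all_ok = 1.0;
    for (unsigned i = 0; i < n; i++)
        all_ok *= 1.0 - p;
    return 1.0 - all_ok;
}
```

For example, with p = 0.02 and a hypothetical experiment of 50 tasks, the result is roughly 0.64, so such an experiment would fail more often than it succeeds.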

Having a reproducible test case on my desk however did little to help the issue. With no JTAG or serial console available on the production configuration, it was impossible to use an on-chip debugger. It did make it possible to upload a fixed bootloader though, and with that the cause of the resets was revealed to be the hardware watchdog.

VESNA uses the STM32 independent watchdog, which is a piece of hardware in the microcontroller that resets the CPU state unless a specific register write occurs every minute or so. The fact that watchdog was resetting the CPU pointed to a software problem. Unfortunately the wise developers in STM did not provide any way to actually determine what the CPU has been doing before the watchdog killed it. There is no way to get last program counter value or even a pointer to the last stack frame and hence no sane way of debugging watchdog-related issues (I did briefly toy with the idea of patching the CPU reset vector and dumping memory contents on reset, but soon decided that does not classify as a sane method).

This led to another fruitless hunt around the source code for functions that might be hanging the CPU for too long.

Then I noticed that one of the newer firmware versions had a much lower chance of failure - 1 failure in 2500 tries. Someone, probably unknowingly since I didn't see any commit messages that would tell me otherwise, already fixed, or nearly fixed, the bug. Since such accidental fixes tend also to be accidentally removed it still made sense to figure out what exactly was causing this bug to make sure the fix stayed put. I fired up git bisect and after a few days of testing I came up with the following situation:

git-bisect result

Note that git bisect is being run in reverse here. Since I was searching for a commit that fixed a bug, not introduced it, bisect/bad marks a revision with the bug fixed while bisect/good marks a revision with a bug.

But if you look closely, you can see that the first revision that fixed it was a merge commit of two branches, both of which exhibited the bug. To make things even more curious, this was a straightforward merge with no conflicts. It made no sense that this merge would introduce any timing changes large enough to trip the watchdog timer. However the fact that the changes had to do with the integrated A/D converter did curiously point to a certain direction.

After carefully testing for and excluding bootloader and programming issues (after all, I did find a bug once where firmware would not be uploaded to flash properly if it was exactly a multiple of 512 bytes long), I came upon this little piece of code:

ADC_Cmd(ADC1, ENABLE);
while (!(ADC_GetFlagStatus(ADC1, ADC_FLAG_EOC)));

Using STM's scarily-licensed firmware library, this triggers a one-shot A/D conversion and waits for the end of conversion flag to be set by hardware. When I carefully added a timeout to this loop the bug disappeared in all firmware versions that previously exhibited it. I say carefully because at this point I was not trusting the compiler either, which means I first added some timeout code to the loop that had a larger-than-realistic timeout, made sure the bug was still there, and only then changed the timeout value without touching the code itself.
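For illustration, a bounded version of such a polling loop could look like the sketch below. This is not the code that went into the firmware: `wait_for_flag`, the callback type and `fake_conversion_done` are hypothetical names, and the hardware flag check is replaced by a callback so the example can run on a desktop machine.

```c
#include <stdbool.h>
#include <stdint.h>

/* Callback that returns true once the awaited flag is set. On the real
 * target this would wrap the end-of-conversion flag check. */
typedef bool (*flag_poll_fn)(void *ctx);

/* Poll until the flag is set or max_polls attempts have been made.
 * Returns true if the flag was seen, false if the loop timed out. */
bool wait_for_flag(flag_poll_fn poll, void *ctx, uint32_t max_polls)
{
    while (max_polls-- > 0) {
        if (poll(ctx))
            return true;
    }
    return false;
}

/* Stand-in for hardware: reports "done" once the counter has reached zero. */
bool fake_conversion_done(void *ctx)
{
    uint32_t *polls_left = ctx;
    if (*polls_left == 0)
        return true;
    (*polls_left)--;
    return false;
}
```

The caller then has to decide what a timeout means, for instance retrying the conversion or reporting an error, instead of spinning until the watchdog resets the CPU.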

Now it would be most convenient to blame everything on a buggy microcontroller peripheral. Thus far most clues seem to point to the fact that some minor timing issues during ADC calibration and turn-on may cause the ADC to sporadically hang or take far longer than expected to finish a conversion (turning off the watchdog did not usually result in a hang).

But even if that is the case (and I'm still not completely convinced, although this case is now closed as far as I'm concerned), this journey mostly showed what happens when debugging a 40,000-line embedded code base with few to no internal debugging tools at your disposal. It's a tremendous time sink and takes careful planning and note taking - I'm left with pages of calculations I made to make sure tests were running long enough to ensure a good probability of a correct git-bisect result, knowing the prior probabilities of encountering a bug.
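A much simplified version of such a calculation, not the one from my notes, asks how many consecutive clean runs are needed before a revision can be called bug-free. If a buggy revision fails each run with probability p, it survives n runs with probability (1 - p)^n, and we want that to be at most some tolerance alpha. The function name and the choice of alpha below are assumptions made for this sketch, and runs are assumed to be independent.

```c
/* Smallest number of consecutive clean runs n such that a revision that
 * still had the bug (per-run failure probability p) would pass all of
 * them with probability at most alpha, i.e. (1 - p)^n <= alpha.
 * Returns 0 for inputs that have no finite answer. */
unsigned clean_runs_needed(double p, double alpha)
{
    if (p <= 0.0 || alpha <= 0.0)
        return 0;

    unsigned n = 0;
    double pass_all = 1.0;
    while (pass_all > alpha) {
        pass_all *= 1.0 - p;
        n++;
    }
    return n;
}
```

With the 2-in-100 failure rate mentioned earlier and a 5% tolerance for a wrong verdict, this comes out to 149 clean runs per bisect step, which goes some way towards explaining why the bisect took days.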

So, if you made it this far and are writing code, please make sure proper error reporting facilities are your top priority. Make sure you fail as early as possible and give as much feedback as possible to people that might be trying to debug things. Try to be reasonably robust to failures in code and hardware outside of your control. And above all, make absolutely sure you don't interfere with any kind of debugging facilities your platform provides. As limited as they might appear to be, they are still better than nothing.


Galaksija updates

05.01.2013 12:07

One of the results of me squatting a table in the retro-gaming assembly of 29C3 is the following short list of news regarding Galaksija:

Applebloom simulation on Galaksija at 29C3 retro-gaming assembly.

I've uploaded a new version of Galaksija development tools today. Version 0.2.2 includes the assembly source for the Not my department scroller demo that I wrote on a whim on day 1 of the Congress and that could be sporadically seen running on the TV screen in the retro-gaming area. The demo uses the built-in video driver routine in ROM, but tweaks the timings of the video signal to achieve smoother vertical motion of the text than what the usual low-resolution graphics would allow. This is a trick similar to what ROM terminal emulation routines use to smoothly scroll screen contents upwards. Here's a pretty horrible video of the running demo.

Before the congress I also updated the CMOS Galaksija page. It now finally includes complete design documentation of my Galaksija-compatible motherboard and keyboard, including schematics and PCB artwork, as well as full text of my thesis that covers a lot of details of how the circuit works (in Slovene language, for information in English it's still best to just search for "galaksija" on this blog). Over the years many people asked for these documents and now I finally managed to dig out the final version from my old CVS repository and verify that it reflects correctly the circuit in my single working specimen. I would love to hear from anyone that would like to attempt building his own CMOS Galaksija using this documentation.

Finally, if you missed my Galaksija talk at 29C3, the video has been uploaded to YouTube and you can find slides from the presentation in the PDF format on the Fahrplan.


29C3 wrap-up

03.01.2013 10:40

On Sunday the latest iteration of the Chaos Communication Congress concluded and now, a few days later, I feel that I have paid back enough of my sleep debt to be able to write a coherent wrap-up post.

29C3 in the Congress Centrum Hamburg

As you probably heard, the Congress moved from Berlin to Hamburg in search of a bigger venue, as the Berlin Congress Center was getting increasingly crowded. At least for our small delegation from Kiberpipa this complicated travel arrangements a bit, since it turned out that Hamburg does not have any cheap airline connections from our corner of the world. So we had to opt for a day of car and train travel. Actually, I always enjoyed traveling on German ICE trains, looking out the window at 300 km/h with a cup of coffee in my hand. Although for next year we certainly need to find a way that does not include driving in a sleep-deprived caffeine haze as the last leg of the journey home.

None of the fears that the new venue would somehow negatively affect the Congress were realized. The Congress Center Hamburg certainly is huge - four floors with intermediate ones in between, one huge auditorium and two somewhat smaller lecture rooms. The place was so large in fact that I regularly got lost and made a few extra circles around the building to find an assembly or a workshop I was looking for. The size made the name Ten Forward for the lounge at the top of the building quite appropriate.

Printing all tweets with the #29c3 tag

Even though more than 6000 tickets were sold, the place was not crowded. You could always find a couch to sit on and a power socket for your laptop. Except for the more popular talks in the smaller lecture rooms, where you sometimes still had to be half an hour early to get a seat. Oh, and the huge queue on the first day (getting a wrist band on day 0 was definitely a good idea). But despite that, the organization team deserves all respect for keeping such an enormous event running so smoothly. Apparently one tenth of the visitors also volunteered as angels, which I just find amazing and I think it casts a very good light on the kind of crowd that gathers each year at these Congresses.

Network conditions were on the same level and it should suffice to say that Wi-Fi was truly ubiquitous and my old EeePC 901 had no problems connecting to it. This was the first congress where I didn't feel the need to jack into an Ethernet port to download something. Also worth mentioning was the rumor of dropping IPv4 connectivity next year, which will be interesting if true. Definitely a good excuse to get all of my machines reachable from IPv6 by the end of the next year.

Ang Cui and Michael Costello on hacking Cisco phones

As far as talks were concerned, I really enjoyed those that I attended. It might be that some of the speakers opted to stay in Berlin for BerlinSides or EHSM, and while I would love to attend some of the talks there, I also didn't feel like there was a lack of talks on some particular interesting topic in Hamburg this year. Here are some of the more memorable ones off the top of my head: Natalie Silvanovich gave a wonderful talk about reverse engineering Tamagotchis, from decapping chips to dumping ROMs and IR protocols. Ang Cui and Michael Costello showed plenty of ways someone can install spyware on your Cisco IP phone and gave a nice overview of the security of the Unix-like system that runs on them. Two talks concerning weird machines are also well worth a look: it turns out that Turing-completeness can be found just about everywhere these days, you just have to look closely enough. The talk about real-world RSA factorization convinced me it's time to get a larger GPG key. Finally, it would be interesting to try low-cost chip microprobing, although a suitable microscope doesn't seem to be that cheap.

Again, this short list doesn't do the whole Fahrplan justice. There were also a lot of ad-hoc organized meet-ups and workshops and these four days were just not enough to peek into all corners of the Congress. I myself will be slowly going through the back-log of video recordings of events I missed for the next few weeks at least. It's also worth mentioning that a lot of talks were presented by female speakers. While that might not mean much and I was told that I am not supposed to have an opinion on this topic, it does give one data point against criticisms that this community is hostile to women.

Seriously: Fragile Infrastructure beyond this point

Also missing from the list of technical talks above are the keynote and other talks connected to this year's motto, Not my department (a quote from a satire about Wernher von Braun I believe, who supposedly didn't care where his rockets fell since that was not his department). The main topics discussed were government surveillance and whistle-blowing and the general message was that people should take a broader look at what effects their work has on society. It's important to occasionally take a step back and see if your work is being used for things that you would otherwise object to. All of the talks in English were about the situation in the United States though. It will be interesting to check some of the translated German talks on this topic as well.

To conclude, the Congress was as awesome as ever. I couldn't wish for a better way to spend the last days of the year and I couldn't care less that I slept through midnight on New Year's Eve because of it. The sheer number of people I meet trying their best to make tomorrow a better place always manages to fill me with optimism that lasts for months afterwards. Presenting a talk this year made it a somewhat different experience, but the feedback I got made all of the extra hours of sleep I lost well worth it, plus it made me appreciate all of the effort going into the Congress so much more. Thanks for everything and see you next year!
