Reflashing the MS Surface 2.0

Or: how I nearly bricked 8000 € worth of equipment

Update Aug. 2013: I've picked up work on the kernel driver again, and managed to temporarily corrupt another SUR40 device. After some brief discussion on the linux-usb mailing list, Alan Stern spotted that I had also erroneously used USB_RECIP_ENDPOINT in my command instead of USB_RECIP_DEVICE. So there were actually two different bugs hiding in that tiny block of code. I guess the moral is to never reuse code when writing drivers...

After I had successfully tested my userspace driver for using the Surface 2.0 under Linux, I decided that the logical next step would be to write a Linux kernel driver for later inclusion into the mainline kernel.

So I simply took a kernel driver I had written previously (idmouse.c) as a template, swapped out the USB IDs and the control methods and loaded it into the kernel, resulting in a character device (/dev/surface) from which touch data could be read.

So far, so good.

A couple of days later, I got an email from the university team telling me that the Surface's entire display had stopped working (an external monitor was still OK) and asking whether there might be any connection to my Linux experiments. At first, I dismissed that idea, since I assumed I hadn't done anything different from my previous experiments. However, after I had booted Linux to see what I could find out about the issue, I noticed this:
$ lsusb
Bus 002 Device 002: ID 04b4:8613 Cypress Semiconductor Corp. CY7C68013 EZ-USB FX2 USB 2.0 Development Kit
Moreover, the expected Surface device was missing, so apparently something nasty had happened to the USB interface. More specifically, the USB interface chip (Cypress EZ-USB FX2) had obviously lost its firmware and was falling back to the factory configuration. I still wasn't sure how my code might have caused this, but the two events were too clearly related to be dismissed as coincidence.

To see if there was anything different after all compared to my previous code, I checked my SVN log and found this:
--- surface-input.c	(Revision 700)
+++ surface-input.c	(Revision 701)
@@ -56,7 +56,7 @@
 #define SURFACE_GET_SENSORS 0xb1 /*  8 bytes sensors   */
 #define surface_command(dev, command, index, buffer, size) \
-	usb_control_msg (dev->udev, usb_sndctrlpipe (dev->udev, 0), command, \
+	usb_control_msg (dev->udev, usb_rcvctrlpipe (dev->udev, 0), command, \
 	USB_TYPE_VENDOR | USB_RECIP_ENDPOINT | USB_DIR_IN, 0x00, index, buffer, size, 1000)
 MODULE_DEVICE_TABLE(usb, surface_table);
Since I had reused a different driver as a template, I had overlooked that the previous code used control output transfers to send messages to the device, whereas the Surface uses control input transfers. When I tested the code, I noticed that the control messages returned error codes, fixed the message direction and didn't think about it any further.
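
To make the two bugs a bit more tangible, here is a small userspace sketch of how the bmRequestType byte of a control transfer is put together. The bit values are the ones from chapter 9 of the USB specification (the same ones the kernel headers use); the two helper functions are hypothetical and only serve as illustration, they are not part of the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values from chapter 9 of the USB 2.0 specification, as also found in
 * Linux's <linux/usb/ch9.h>. Redefined here so the sketch builds in userspace. */
#define USB_DIR_OUT        0x00
#define USB_DIR_IN         0x80
#define USB_TYPE_VENDOR    (0x02 << 5)
#define USB_RECIP_MASK     0x1f
#define USB_RECIP_DEVICE   0x00
#define USB_RECIP_ENDPOINT 0x02

/* Hypothetical helper: bmRequestType for a vendor-specific IN request
 * addressed to the given recipient. */
uint8_t vendor_in_request(uint8_t recipient)
{
    assert((recipient & ~USB_RECIP_MASK) == 0);
    return (uint8_t)(USB_TYPE_VENDOR | USB_DIR_IN | recipient);
}

/* Hypothetical check: does the pipe direction picked by the caller
 * (usb_rcvctrlpipe -> IN, usb_sndctrlpipe -> OUT) agree with the
 * direction bit inside bmRequestType? */
int directions_agree(int pipe_is_in, uint8_t bm_request_type)
{
    int request_is_in = (bm_request_type & USB_DIR_IN) != 0;
    return (pipe_is_in != 0) == request_is_in;
}
```

With these definitions, vendor_in_request(USB_RECIP_ENDPOINT) yields 0xC2 and vendor_in_request(USB_RECIP_DEVICE) yields 0xC0, and directions_agree(0, ...) returns 0 for either value. That last case corresponds to the combination in revision 700: an OUT pipe paired with a request type that says IN.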

However, after coming to the conclusion that the device's firmware EEPROM had somehow become corrupted, I started thinking again: what if the controller had actually interpreted the wrong message direction as a write command and dumped the following data into the EEPROM? As the command was sent with a 12-byte buffer, that would probably mean that the first 12 bytes of the firmware had been overwritten.

To get some idea of the consequences, I checked the FX2LP Technical Reference Manual. As it turns out, the first 12 bytes are used for the header and the first set of load addresses. So good news after all: it should be possible to reconstruct these bytes, either from a stock firmware image or from the original firmware plus some educated guesses.

Since no stock firmware is available anywhere (this is a pretty niche product after all), I needed some way to retrieve the original firmware. A bit of Googling pointed me to the site of Chris McClelland, who wrote the excellent fx2loader tool. By default, the FX2LP chip accepts downloads into on-chip RAM, but not into the EEPROM. However, fx2loader can upload a second-stage loader to RAM which then enables both read and write access to the EEPROM.

After I had managed to retrieve an EEPROM image using that secondary loader, I checked it using a hex editor and got the following:
$ hexdump -C firmware_damaged.iic | head -4
00000000  00 00 00 00 32 31 32 31  33 2e 30 31 02 14 5c 00  |....21213.01..\.|
00000010  03 00 0b 02 1c 3a 00 03  00 33 02 1e 36 00 03 00  |.....:...3..6...|
00000020  43 02 13 00 00 03 00 4b  02 0d 9f 00 03 00 53 02  |C......K......S.|
00000030  13 00 03 ff 00 80 12 10  e1 f5 33 ed f5 34 90 e6  |..........3..4..|
This confirmed my suspicion: the first 12 bytes contain part of the string which the original control transfer was supposed to retrieve from the device. Checking the Technical Reference Manual again gave me some idea of what the header should look like (see the table below).

Offset  Content
0x00    Boot type byte (0xC2 for a firmware image, according to the TRM)
0x01    Vendor ID L
0x02    Vendor ID H
0x03    Product ID L
0x04    Product ID H
0x05    Revision L
0x06    Revision H
0x07    Configuration byte
0x08    Length H
0x09    Length L
0x0A    Address H
0x0B    Address L
Now, I had to somehow reconstruct these 12 header bytes. The first 8 were actually rather straightforward: I knew the expected vendor/product/revision IDs from my previous code, and for the configuration byte, I just selected the default value of 0x00.
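
As a rough sketch of this first step, the following function fills in the first 8 header bytes. It assumes the EEPROM uses the "C2 load" layout from the Technical Reference Manual, i.e. a leading 0xC2 byte followed by the little-endian vendor, product and revision IDs and the configuration byte. The function name and constants are made up for this example, and the IDs are passed in as parameters since the actual values depend on the device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define C2_BOOT_BYTE     0xC2 /* assumed: marks a firmware ("C2 load") EEPROM */
#define C2_ID_HEADER_LEN 8    /* boot byte + VID + PID + revision + config */

/* Hypothetical helper: write the 8 ID bytes of the header into hdr,
 * which must have room for at least C2_ID_HEADER_LEN bytes. */
void c2_fill_id_header(uint8_t *hdr, uint16_t vendor_id, uint16_t product_id,
                       uint16_t revision, uint8_t config)
{
    assert(hdr != NULL);
    hdr[0] = C2_BOOT_BYTE;
    hdr[1] = (uint8_t)(vendor_id & 0xff);  /* Vendor ID L  */
    hdr[2] = (uint8_t)(vendor_id >> 8);    /* Vendor ID H  */
    hdr[3] = (uint8_t)(product_id & 0xff); /* Product ID L */
    hdr[4] = (uint8_t)(product_id >> 8);   /* Product ID H */
    hdr[5] = (uint8_t)(revision & 0xff);   /* Revision L   */
    hdr[6] = (uint8_t)(revision >> 8);     /* Revision H   */
    hdr[7] = config;                       /* 0x00 = default configuration */
}
```

Calling c2_fill_id_header(hdr, 0x1234, 0xabcd, 0x0001, 0x00) with these placeholder IDs would produce the bytes C2 34 12 CD AB 01 00 00.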

Length and address, however, were a different story. Each data block is preceded by a 2-byte length and a 2-byte load address, and after the end of the data block there's another length/address header with data, and so on. Looking at the following bytes showed a certain pattern (the 4 missing header bytes are shown as dots):
.. .. .. .. 02 14 5C
00 03 00 0B 02 1C 3A
00 03 00 33 02 1E 36
00 03 00 43 02 13 00
00 03 00 4B 02 0D 9F 
All of these represent 3-byte blocks of code, written to addresses 0x0B, 0x33, 0x43, 0x4B and so on. Another look into the datasheet confirmed my suspicion that all these locations are interrupt vectors (usually located at the very beginning of a controller's memory). Since the most important one, the reset vector at 0x0000, was missing, I concluded that this was the memory location being written by the very first block and that the 4 missing bytes had to contain the values 0x00 0x03 0x00 0x00.
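
The same reasoning can be checked mechanically. The sketch below walks the length/address records that follow the 8 ID bytes and collects the load addresses. It assumes big-endian length and address fields (as in the table of header offsets) and that a set top bit in the length field marks the final record, which is how I read the Technical Reference Manual. All names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define C2_ID_HEADER_LEN 8      /* boot byte, VID, PID, revision, config */
#define C2_LAST_RECORD   0x8000 /* assumed: top length bit marks the final record */

/* Hypothetical helper: store the load address of each data record in addrs
 * (at most max_addrs entries) and return the number of records found.
 * Stops at the final record or when the image is truncated. */
size_t c2_list_load_addresses(const uint8_t *img, size_t img_len,
                              uint16_t *addrs, size_t max_addrs)
{
    size_t off = C2_ID_HEADER_LEN;
    size_t count = 0;

    assert(img != NULL && addrs != NULL);

    while (off + 4 <= img_len && count < max_addrs) {
        uint16_t len  = (uint16_t)((img[off] << 8) | img[off + 1]);
        uint16_t addr = (uint16_t)((img[off + 2] << 8) | img[off + 3]);

        if (len & C2_LAST_RECORD)
            break;                       /* final record, no more data blocks */
        if (off + 4 + (size_t)len > img_len)
            break;                       /* record runs past the end of the image */

        addrs[count++] = addr;
        off += 4 + (size_t)len;          /* skip record header and payload */
    }
    return count;
}
```

Fed with the repaired start of the image (bytes 8 to 11 set to 00 03 00 00), this should report the load addresses 0x0000, 0x000B, 0x0033, 0x0043 and 0x004B for the first five records, matching the interrupt vector table. With the damaged bytes 33 2E 30 31 in that position, the first length field would be 0x332E and the walk would immediately go off the rails.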

Having fixed the firmware image, I used fx2loader to put the image into RAM for testing, and the Surface screen came back to life! I thought I was nearly done and tried to simply write the fixed image back to the EEPROM. However, that turned out not to be quite so easy: when I read back the EEPROM contents for verification afterwards, I got the following result:
$ hexdump -C verify.iic | head -8
00000000  43 02 13 00 00 03 00 4b  02 0d 9f 00 03 00 53 02  |C......K......S.|
00000010  13 00 03 ff 00 80 12 10  e1 f5 33 ed f5 34 90 e6  |..........3..4..|
00000020  43 02 13 00 00 03 00 4b  02 0d 9f 00 03 00 53 02  |C......K......S.|
00000030  13 00 03 ff 00 80 12 10  e1 f5 33 ed f5 34 90 e6  |..........3..4..|
00000040  02 01 91 14 70 03 02 01  b5 14 70 03 02 02 09 14  |....p.....p.....|
00000050  70 03 02 03 2f 24 f5 70  03 02 03 0c 04 24 fa 40  |p.../$.p.....$.@|
00000060  02 01 91 14 70 03 02 01  b5 14 70 03 02 02 09 14  |....p.....p.....|
00000070  70 03 02 03 2f 24 f5 70  03 02 03 0c 04 24 fa 40  |p.../$.p.....$.@|
As you can see, bytes 0x00-0x1F are identical to 0x20-0x3F, and the same pattern repeats every 64 bytes. Every second block of 32 bytes correctly corresponds to the original/fixed firmware, but is incorrectly duplicated into the preceding 32 bytes (including, in particular, the first 12 bytes).

Since it was getting a little late at that point and my head was starting to feel a little fuzzy, I asked Chris if he had any idea about the reason for this error. He immediately suggested that this might be because small EEPROMs such as the 64-kbit one used here only have a 32-byte write buffer instead of the 64 bytes on larger EEPROMs. He also wrote a bugfix for his second-stage loader in a matter of hours, which then allowed me to write the fixed image to the EEPROM and bring the Surface back from the dead. :-)
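
The effect is easy to reproduce in a toy model. The sketch below is not Chris's loader code, just a simulation under the assumption that the EEPROM behaves like a typical I2C part: during a page write, the internal address counter wraps around within the current 32-byte page, so any bytes beyond the page size overwrite the beginning of that same page. All names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EEPROM_PAGE_SIZE 32 /* write buffer size of the small EEPROM */

/* Model of a single page write: the address counter wraps inside the
 * page that contains addr instead of advancing to the next page. */
void eeprom_page_write(uint8_t *mem, size_t mem_len, size_t addr,
                       const uint8_t *data, size_t n)
{
    size_t page_base = addr - (addr % EEPROM_PAGE_SIZE);
    size_t off = addr % EEPROM_PAGE_SIZE;

    assert(page_base + EEPROM_PAGE_SIZE <= mem_len);
    for (size_t i = 0; i < n; i++) {
        mem[page_base + off] = data[i];
        off = (off + 1) % EEPROM_PAGE_SIZE; /* wrap within the page */
    }
}

/* Model of the writer: programs the image in chunks of `chunk` bytes,
 * i.e. it believes the EEPROM page size is `chunk`. */
void eeprom_program(uint8_t *mem, size_t mem_len,
                    const uint8_t *image, size_t image_len, size_t chunk)
{
    assert(chunk > 0);
    for (size_t addr = 0; addr < image_len; addr += chunk) {
        size_t n = image_len - addr < chunk ? image_len - addr : chunk;
        eeprom_page_write(mem, mem_len, addr, image + addr, n);
    }
}
```

Programming an EEPROM that still holds the old image with chunk set to 64 leaves every odd 32-byte page untouched (and therefore still correct), while every even page ends up holding a copy of the page after it, which is the pattern in the readback. With chunk set to 32, the image is written as intended.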

  • A firmware which trashes its own EEPROM due to a single wrongly set direction bit is perhaps a little fragile.
  • Fixing nearly 8000 € worth of equipment feels pretty good, particularly if you broke it in the first place ;-)
  • A final hat-tip to Chris McClelland and the team from the Media Informatics Group - thanks everyone!