Zune Bug: Test Driven Bug Fix

The Microsoft Zune 30G had a well known crash to bring in the new year. Here is the snippet of code that is the alleged culprit, from one of MS’s suppliers (Freescale).

The job of this function is to convert the input days to the current day, month, year, and day of the week. The input, days, is the number of days since January 1, 1980.

BOOL ConvertDays(UINT32 days, SYSTEMTIME* lpTime)
{
    int dayofweek, month, year;
    UINT8 *month_tab;
 
    //Calculate current day of the week
    dayofweek = GetDayOfWeek(days);
 
    year = ORIGINYEAR;
 
    while (days > 365)
    {
        if (IsLeapYear(year))
        {
            if (days > 366)
            {
                days -= 366;
                year += 1;
            }
        }
        else
        {
            days -= 365;
            year += 1;
        }
 
    }
 
    // Determine whether it is a leap year
    month_tab = (UINT8 *)((IsLeapYear(year))
                      ? monthtable_leap : monthtable);
 
    for (month=0; month<12; month++)
    {
        if (days <= month_tab[month])
            break;
        days -= month_tab[month];
    }
 
    month += 1;
 
    lpTime->wDay = (WORD) days;
    lpTime->wDayOfWeek = (WORD) dayofweek;
    lpTime->wMonth = (WORD) month;
    lpTime->wYear = (WORD) year;
 
    return TRUE;
}

Isn’t this exactly the kind of problem that QA is supposed to find? Someone at least should have advanced the clock to check that leap year was handled. Ahhh, but it was released in 2006, who would be thinking of leap year :-) Date problems have been out since the start of the millennium!

Like most bugs, they have simple roots. But some have dire consequences. Thankfully this bug did not reveal itself in some safety critical device. Or maybe there are safety issues for a Zune. Not having music for the New Years Eve party might drive someone to do things they normally would not.

As it happens, what seems to be a simple typo leads to an infinite loop on the last day of a leap year. This code had never seen the last day of a leap year, until it saw December 31, 2008 in real time!

It looks to me like defensive coding coupled with a missed keystroke are at least partially to blame. My hypothesis is that “if (days > 366)” should have been “if (days >= 366)”. My guess is the well-intentioned programmer did not press that “=” key quite hard enough. I used to work with a guy who spent seeks looking for these little typos. These boundaries are never as easy as they look.

Test Driven Bug Fixing

Let’s just say that from now on, if you need to fix a bug, you first demonstrate the bug with some unit tests. I’ll test drive fixing this bug. I first have to have an idea of how to test this problem area. If I can feed in the days parameter, I can test the behavior of various days.

To test the specific failure date, I need to feed in the number of days that represents December 31, 2008, the day of the Zune crash. Where will I get the number of days since 1980 for the crash date? Counting them manually sounds like a drag. I need a helper function to generate my input test data. I am not sure how to code the helper yet, but to get my TDD legs under me I write this test…

TEST(Rtc, Days1980to1981)
{
    days = daysSince1980(1981);
    LONGS_EQUAL(366, days);
}

1980 was a leap year, so the days from the start of 1980 to the start of 1981 should be 366 days. I can get that to pass with a simple hard coded return result.

What would be the answer for January 1, 1982? I think I’ll add a second parameter to represent the day of the year. This test expresses it.

TEST(Rtc, Days1980to1982)
{
    int days = daysSince1980(1982, 1);
    LONGS_EQUAL(366+365+1, days);
}

In the production code there is an IsLeapYear() function. It’s a static function; I expose it for test purposes. Here is the test data generator that passes those two tests.

int daysSince1980(int endYear, int dayOfYear)
{
    int days = 0;
 
    for (int year = 1980; year < endYear; year++)
    {
        days += 365;
        if (IsLeapYear(year))
            days++;
    }
    return days + dayOfYear;
}

Now a little experiment to get the days up until just before the Zune bug date.

TEST(Rtc, calcDaysFor20081229)
{
    int days = daysSince1980(2008, 364);
 
    LONGS_EQUAL(2, days);
}

The failing test reports the actual number of days: 10591. I get the calculator out and do a sanity check… 10591/365.25 -> 28.996 and then some. It could be right. That’s almost 29 years. Let’s write a test and check all the other data about this number of days. It should be Tuesday, 12-Dec-2008. Notice this checks the day before the problem day. It seems that the week starts on Monday, with day == 1.

TEST(Rtc, check20081229)
{
    days = daysSince1980(2008, 364);
 
    ConvertDays(days, &time);
    LONGS_EQUAL(MON, time.wDayOfWeek);
    LONGS_EQUAL(29, time.wDay);
    LONGS_EQUAL(12, time.wMonth);
    LONGS_EQUAL(2008, time.wYear);
}

I’ll be writing a few more tests like this, so I might as well add a helper function to do the asserts, and add an enum to make day of the week more readable. All the tests so far have an “int days” so it should become part of the test fixture. I’ll show the whole fixture when we’re done.

..snip..
enum Day {MON=1, TUE, WED, THU, FRI, SAT, SUN };
..snip...
 
void assertDate(Day dayOfWeek, int day, int month, int year)
{
    LONGS_EQUAL(dayOfWeek, time.wDayOfWeek);
    LONGS_EQUAL(day, time.wDay);
    LONGS_EQUAL(month, time.wMonth);
    LONGS_EQUAL(year, time.wYear);
}

If the bug report is correct, this test should never return. Eclipse/cdt is not going to like this. I have eclipse running my tests with every file save. Eclipse does not like makefiles that never return.

TEST(Rtc, check20081231)
{
    days = daysSince1980(2008, 366);
 
    ConvertDays(days, &time);
    assertDate(WED, 31, 12, 2008);
}

Sure enough, it’s hung! There is no killing it from eclipse. .. Be right back..

I’ll rig this code so it can’t get stuck. Conveniently enough this function has a BOOL return type, but currently only returns TRUE, or not at all! I’ll add in a safety relief valve.

BOOL ConvertDays(UINT32 days, SYSTEMTIME* lpTime)
{
    int safetyReleaseValve = 0;
    ..snip..
 
    while (days > 365)
    {
        if (IsLeapYear(year))
        {
            if (days > 366)
            {
                days -= 366;
                year += 1;
            }
        }
        else
        {
            days -= 365;
            year += 1;
        }
 
        if (safetyReleaseValve++ > 32000)
            return FALSE;
    }
    ..snip..

When the valve is triggered, it returns FALSE. I can test for that.

TEST(Rtc, check20081231)
{
    days = daysSince1980(2008, 366);
 
    CHECK(ConvertDays(days, &time));
    assertDate(WED, 31, 12, 2008);
}

The valve releases and the test fails. Now the popular fix for the problem is to change “if (days > 366)” to if “(days >= 366)”. Let’s try it…

tests/zune/RtcTest.cpp:39: error: Failure in TEST(Rtc, check20081231)
    expected <31 0x1f>
    but was  < 0 0x00>

Adding the “=” is part of the fix. It no longer blows the pressure relief valve, but the Zune reports the wrong day of the month, unless the 31-Dec-2008 is somehow interpreted as 0-Dec-2009! Interesting things happen at boundaries. That’s why you better test them.

The algorithm needs to exit on 366, the last day of the leap year. It should transition the year on 367 or greater.

while (days > 365)
{
    if (IsLeapYear(year))
    {
        if (days >= 367)
        {
            days -= 366;
            year += 1;
        }
        else
            break;
    }
    else
    {
        days -= 365;
        year += 1;
    }
 
    if (safetyReleaseValve++ > 32000)
        return FALSE;
}

It works, but it’s not pretty. In another blog, I’ll refactor this function to get it to better tell its story. But, we need a couple more tests. The zune won’t have to live through the year 2000. But would it?

TEST(Rtc, check20001231)
{
    days = daysSince1980(2000, 366);
 
    CHECK(ConvertDays(days, &time));
    assertDate(SUN, 31, 12, 2000);
}

This test fails the first time. It turns out that days of the week start with Sunday (==0).
After fixing the enum the test passes. What about the 1-Jan the day after a leap year ends?

TEST(Rtc, check20090101)
{
    days = daysSince1980(2009, 1);
 
    CHECK(ConvertDays(days, &time));
    assertDate(THU, 1, 1, 2009);
}

That’s good too. But those dates are all in the past. The Zune needs to work in the future. What happens at the end of this year and the next leap year? It’s so easy to find out now that the test fixture is in place.

TEST(Rtc, check20091231)
{
    days = daysSince1980(2009, 365);
 
    CHECK(ConvertDays(days, &time));
    assertDate(THU, 31, 12, 2009);
}
 
TEST(Rtc, check20100101)
{
    days = daysSince1980(2010, 1);
 
    CHECK(ConvertDays(days, &time));
    assertDate(FRI, 1, 1, 2010);
}

And the next leap year.

TEST(Rtc, check20121231)
{
    days = daysSince1980(2012, 366);
 
    CHECK(ConvertDays(days, &time));
    assertDate(MON, 31, 12, 2012);
}
 
TEST(Rtc, check20130101)
{
    days = daysSince1980(2013, 1);
 
    CHECK(ConvertDays(days, &time));
    assertDate(TUE, 1, 1, 2013);
}

I can be confident in the day to date conversion now, at least with regards to leap year. I am not sure that inspections alone would find this problem. Even if one does, the only sure way to know if code works is to run it, testing that it meets its requirements.

I hear all the time that low-level device driving code cannot be tested in a platform independent way. Sure there are parts of a driver that can only be tested in the real hardware, but that excuse does not work here.

It only took me 30 minutes to write these tests. I bet the original developer spent more than 30 minutes manually testing this code. Look at the results: the developer spent time manually testing, and a critical bug gets through anyway.

There is quite a bit more test code than production code. So What. It makes you go faster and helps avoid bugs.

The final test fixture looks like this:

#include "CppUTest/TestHarness.h"
#include <memory.h>
 
extern "C"
{
#include "Rtc.h"
}
 
enum Day
{
    SUN = 0, MON, TUE, WED, THU, FRI, SAT
};
 
TEST_GROUP(Rtc)
{
    SYSTEMTIME time;
    int days;
 
    void setup()
    {
        memset(&time, 0, sizeof(time));
    }
 
    void teardown()
    {
    }
 
    int daysSince1980(int endYear, int dayOfYear)
    {
        int days = 0;
 
        for (int year = 1980; year < endYear; year++)
        {
            days += 365;
            if (IsLeapYear(year))
                days++;
        }
        return days + dayOfYear;
    }
 
    void assertDate(Day dayOfWeek, int day, int month, int year)
    {
        LONGS_EQUAL(dayOfWeek, time.wDayOfWeek);
        LONGS_EQUAL(day, time.wDay);
        LONGS_EQUAL(month, time.wMonth);
        LONGS_EQUAL(year, time.wYear);
    }
};

The production code shown in this article the property of Freescale semiconductor.

2 thoughts on “Zune Bug: Test Driven Bug Fix

  1. Yeah James, but a whole stinking half hour? We can’t do this sort of thing with real code, it’ll take too long to ship.

    I wonder how much money was lost on handling this defect, not to mention sales now lost forever to that *other* evil empire. Maybe even more than your rate! :-)

    Cool post.

  2. Pingback: Accessing static Data and Functions in Legacy C — Part 1 | James Grenning’s Blog

Leave a Reply

Your email address will not be published. Required fields are marked *


nine + = 17

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="">