A nasty bit of undefined timezone behavior in Golang

7 min read

Go is a great language. Really, it is! We complain about the rough edges, but on the whole it's been a great choice for us, and we're not sorry we picked it.

go is a great language... but

But just for fun, let's talk about some of those rough edges at length.

eng managers hate him

That was a pot shot. There are some actual foot-guns hidden in the runtime:

Honestly, you get used to all these things. No language is without its quirks, and terrible design choices are no barrier to massive popularity. Just ask Rasmus Lerdorf, creator of PHP, by any measure one of the world's most successful languages.

rasmus lerdorf quote

But we found a new bug in golang, one that stumped an entire room of seasoned industry veterans for a solid hour. I actually hesitate to even call it a bug, because it's so obscure, and the trigger mechanism so specific to our unit test that it broke, that most people will probably never run into it. So, maybe surprising undefined behavior is a better description?

Discovering the bug

A new customer came to us with a problem: his R program writing rows to Dolt wasn't actually writing any rows. This was pretty surprising, since at this point we've integrated customers on a variety of platforms and languages, and they all manage to get basic read / write functionality working. That's the benefit of being MySQL compliant: if it works for one language or connector, that's pretty decent evidence it will work for others. But he sent us a repro case that demonstrated the problem. Something wasn't working.

statement <- enc2utf8(new("SQL", .Data = "INSERT INTO `mtcars`\n  (`mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`)\nVALUES\n  (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)"))
params <- unname(as.list(mtcars))
rs <- new("MariaDBResult",
          sql = statement,
          ptr = RMariaDB:::result_create(conn@ptr, statement, is_statement = TRUE),
          bigint = conn@bigint,
          conn = conn)

# No rows affected both before and after binding data to the the result
rs
#> <MariaDBResult>
#>   SQL  INSERT INTO `mtcars`
#>   (`mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`)
#> VALUES
#>   (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
#>   ROWS Fetched: 0 [incomplete]
#>        Changed: NA
dbGetRowsAffected(rs)
#> [1] NA
if (!is.null(params)) {
  dbBind(rs, params)
}

After a little digging into the intricacies of the R MySQL client, Max figured out what the problem was: the result schema from a COM_STMT_PREPARE should return a response indicating 0 columns for INSERT, UPDATE, and DELETE statements, since these statements don't have a return schema, just status metadata like rows edited. We were returning a column with our internal OKResult schema. Most clients don't care about this difference, but R's does.

The fix was easy: just don't return a schema.

But that's not the bug. That's just the bug fix that made us discover the real bug, the one that had us scratching our heads. Because when we went to release it to unblock our customer, we also picked up another change, an innocuous change to how our SQL server's log messages were formatted.

Broken unit test

When we went to release the above fix, our CI runs started failing with a suspicious timezone error.

    --- FAIL: TestSelect/select_*_from_branches_system_table (0.03s)
        sqlselect_test.go:1558: 
            	Error Trace:	sqlselect_test.go:1558
            	            				sqlselect_test.go:1423
            	Error:      	Not equal: 
            	            	expected: []sql.Row{sql.Row{"master", "so275enkvulb96mkckbun1kjo9seg7c9", "billy bob", "bigbillieb@fake.horse", time.Time{wall:0x0, ext:62135596800, loc:(*time.Location)(0xc0008fa310)}, "Initialize data repository"}}
            	            	actual  : []sql.Row{sql.Row{"master", "so275enkvulb96mkckbun1kjo9seg7c9", "billy bob", "bigbillieb@fake.horse", time.Time{wall:0x0, ext:62135596800, loc:(*time.Location)(0x3245200)}, "Initialize data repository"}}

I knew that our newest hire had recently made some timezone related changes in the SQL server, so I demonstrated outstanding leadership and blamed him without any further evidence.

zach blames vinai

But this didn't hold up under scrutiny once we actually looked at the nature of the breakage. Something else was going on.

The test that was breaking was asserting deep equality with a time.Time struct constructed like this:

	{
		Name:  "select * from branches system table",
		Query: "select * from dolt_branches",
		ExpectedRows: []sql.Row{
			{
				"master",
				"so275enkvulb96mkckbun1kjo9seg7c9",
				"billy bob", "bigbillieb@fake.horse",
				time.Date(1970, 1, 1, 0, 0, 0, 0, &time.Location{}),
				"Initialize data repository",
			},
		},

Deep equality was failing because the time struct returned by the query had a different timezone than the expected value. Specifically, we weren't giving an initialized time.Location for the expected time object, but the server was returning one.

Diff:
--- Expected
+++ Actual
@@ -10,9 +10,26 @@
    loc: (*time.Location)({
-    name: (string) "",
-    zone: ([]time.zone) <nil>,
-    tx: ([]time.zoneTrans) <nil>,
-    extend: (string) "",
-    cacheStart: (int64) 0,
-    cacheEnd: (int64) 0,
-    cacheZone: (*time.zone)(<nil>)
+    name: (string) (len=5) "Local",
+    zone: ([]time.zone) (len=1) {
+     (time.zone) {
+      name: (string) (len=3) "UTC",
+      offset: (int) 0,
+      isDST: (bool) false
+     }
+    },
+    tx: ([]time.zoneTrans) (len=1) {
+     (time.zoneTrans) {
+      when: (int64) -9223372036854775808,
+      index: (uint8) 0,
+      isstd: (bool) false,
+      isutc: (bool) false
+     }
+    },
+    extend: (string) (len=4) "UTC0",
+    cacheStart: (int64) 9223372036854775807,
+    cacheEnd: (int64) 9223372036854775807,
+    cacheZone: (*time.zone)({
+     name: (string) (len=3) "UTC",
+     offset: (int) 0,
+     isDST: (bool) false
+    })

After a lot of excited and confused shouting as we argued about how to ship our crucial bug fix when CI was failing for unrelated reasons, somebody figured out that the test started passing again when you commented out another test case, one that failed and wrote a log message. And it wrote a log message using a new format I had just merged, one that did this:

msg := fmt.Sprintf("%s %s [conn %d] %s {%s}\n", entry.Time.Format(time.RFC3339), lvl, connectionId, entry.Message, dataFormat.String())

But that's crazy. Surely formatting another, unrelated time.Time struct in another part of the code base doesn't change the default timezone value in unrelated code.

it does though

How is this possible?

As it turns out, our SQL server was using the time.Local location, which has some interesting behavior. golang doesn't try to determine what your timezone info is without asking you. That would be rude! Instead, it waits for you to tell it to load this info explicitly, after which further calls to fetch the Local location will include this cached, detailed information. But calling Format on a time object, for some format strings (such as time.RFC3339), needs to load this information, so will do so implicitly as a side effect. After such a call to time.Format, all time.Time structs with the time.Local location will include this info, when they didn't before.

This is the kind of surprising, spooky action-at-a-distance nonsense that well-designed libraries tend to avoid. Or they at least try!

If Rob Pike happens to be reading this, know that I criticize because I care.

apology to Rob Pike

The fix was pretty easy: load the info ourselves, making explicit what was once implicit.

// We are doing structural equality tests on time.Time values in some of these
// tests. The SQL value layer works with times in time.Local location, but
// go standard library will return different values (which will have the same
// behavior) depending on whether detailed timezone information has been loaded
// for time.Local already. Here, we always load the detailed information so
// that our structural equality tests will be reliable.
var loadedLocalLocation *time.Location

func LoadedLocalLocation() *time.Location {
	var err error
	loadedLocalLocation, err = time.LoadLocation(time.Local.String())
	if err != nil {
		panic(err)
	}
	if loadedLocalLocation == nil {
		panic("nil LoadedLocalLocation " + time.Local.String())
	}
	return loadedLocalLocation
}

...

time.Date(1970, 1, 1, 0, 0, 0, 0, time.UTC).In(LoadedLocalLocation())

Conclusion

All of us with a few years of experience under our belts have seen worse bugs than this. This one was kind of fun, because it only set us back a few hours, never hit production or impacted customers, and was just so wacky in its root cause that it crossed over into entertaining. Thanks for the memes, Rob. But still: even an entertaining foot-gun is still a foot-gun. Learn from our mistake, kids. Never trust a datetime library will do what you want, especially any defaults.

Like the article? Interested in Dolt? Come join us on Discord to say hi and let us know what you think!

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt