Tuesday, November 17, 2009

Iterating with mysql_data_seek()

Looking back, it makes sense... but I've only just realised that the mysql_data_seek() function can be a real drag on your code.

Sometimes things just hit you like a brick: this happened to me just recently while looking through some code. Before I go too far, though, maybe a little back story is necessary.

There comes a time, it seems, that every PHP developer wants to write a DB access class. It almost seems to be a right of passage: everyone codes one and eventually ends up using another library written by someone else.

I was working on a piece of code that used such a class. For the longest time, this class worked well. It worked across a number of projects and became one of those "pieces of code that you just use all the time". In a relatively intense piece of code, we were butting up against a pretty heavy time cost: we were looping through a little over 100,000 rows, but this loop was taking around 15 minutes to complete.

We looked to make sure there was nothing wrong with the inner workings of the loop. Were we doing some other loop inside it, even accidentally? Maybe there was a function call that was taking a long time to complete? We worked through each possibility, ignoring the way we were actually grabbing the data.

The way we would grab the data is similar to what (I believe) a lot of other, homegrown access classes does: pass in a row number, seek to that row and fetch the result. This would let us skip rows, short circuit out of a loop or jump results more easily if the PHP logic needed it. The way we would loop over it would be to get a row count at the start then do a typical, for ($i = 0; ($i < $row_count); $i++) loop, calling a GetDataRow($i) within the loop. As I mentioned earlier, GetDataRow($i) would call mysql_data_seek($i) then mysql_fetch_assoc().

As we drilled down into our code, we started to realise our problem: things were just taking longer and longer to process, the further into the row count we went. This is probably why we never noticed anything: we had just never called this function on such a large result set.

Just to be sure, I tried starting the loop with $i equal to 50,000. The loop ran a bit quicker, but still that last 50,000 took a lot longer than the first 50,000. We had found our problem.

But what to do?

I did the same thing any sane PHP developer would do: hit the documentation at php.net. One, maybe two minutes later, I had all the data I needed to at least test something. It turns out that mysql_fetch_assoc() automatically moves the row pointer ahead one. Instead of explicitly calling the mysql_data_seek($i) for each $i, we only needed to continually call mysql_fetch_assoc() until it returned false.

While I found that interesting I couldn't see why this would be a problem. I mean surely the mysql_data_seek call would have a pretty good Big-O implementation... especially if we were meant to just be seeking to the row we were already on!

Unfortunately the data proved otherwise. Eventually it became obvious that in reality, we have a resource totally separated from the PHP process itself: there is no real way for us to just jump into a particular row.

So what kind of increase in performance did we see? To our absolute dismay, the 15 minute loop completed in approximately 4.5 seconds.

I know that that sounds incredible, but it's true. I've attached some code that should help to prove my point. If, however, you'd rather just see the numbers for yourself, they're here, too :)

Let me explain the numbers in the following table: we iterate through 100,000 rows, so l only show times for every 10,000 rows. Each number in the table is a count in seconds, from the start. That is to say, the number at the 30,000 row point is how long it took to get to those 30,000 rows. To see how long it took to get from one row to the next, you can see the number with the plus sign. The three different counts we do represent the three different tests we ran. "Iterating" is the time it took to get through the rows, without running mysql_data_seek for each row. "Seeking" is what happened when we did call mysql_data_seek and "Seeking (starting at 50000)" is, unsurprisingly, the times it took when calling mysql_data_seek, but starting at 50,000 rows. This test was run to see if the time was related to how many times we had called mysql_data_seek, or if it was related to which row number we were actually seeking to. As you can see, it's the latter.

At row number... Iterating Seeking Seeking (starting at 50000)
10000 0.029
+0.029
0.358
+1.157
N/A
20000 0.058
+0.029
1.514
+3.481
N/A
30000 0.087
+0.029
4.995
+7.563
N/A
40000 0.116
+0.029
12.558
+10.838
N/A
50000 0.145
+0.029
23.396
+12.973
N/A
+12.445
60000 0.175
+0.031
36.369
+14.185
12.445
+14.425
70000 0.206
+0.029
50.554
+16.671
26.869
+17.510
80000 0.235
+0.030
67.225
+19.212
44.379
+20.559
90000 0.264
+0.030
86.437
+21.122
64.938
+21.273
100000 0.294 107.559 86.211

If you're more of a visual person, here's a really basic graph:




For what it's worth, yes, the "Iterating" line is essentially flat. That's because of how much faster it was.

If you'd like to test this yourself, I've uploaded some code to http://files.codefromaway.net/mysql_data_seek.tar.gz : it can prepare some random data for you then run the tests so that you can compare the numbers yourself.

So that's that, and that's why I don't think I'll ever stop learning: it doesn't matter what you take for granted, some things will always work differently to what you expect...

No comments:

Post a Comment