[personal profile] shannon_a
Here's what I wrote at RPGnet about the conversion of the forums from vB4 to XF, an arduous process that took up much of my week.

I previously ran a test conversion over on a test machine. There were two notable problems in the test conversion: first, smilies in signatures broke the conversion; and second, the whole process died on trying to convert avatars. I figured out both those problems (with help from XenForo folks on the first), and was able to run the test conversion process in 32-33 hours. Afterward we discovered that most things had converted properly, with the biggest problem being permissions, which were frankly a mess. I figured out how to get our old URLs converted to the new URLs using an add-on, and that was the last possible show stopper. So, we had a potential new forum software. After comparing it to vBulletin 5, which we also testbedded, we decided we liked the feel of XF better, with some prime advantages being the very fluid use of dynamic HTML5 features and the vibrant add-on community. To be specific: we thought it was sufficiently better for our purposes that it offset the increased work that a conversion would take over a simple upgrade, as well as the work users would have to do to learn the new system.

So I started the conversion process on the real forums Saturday evening, with the expectation that it'd be done Sunday night, and I could bring the forums back up Monday morning. Unfortunately, we hit three problems in the process which notably slowed things down. The total conversion time ended up being more like 100+ hours, about three times what was planned.

First, I made the mistake of running the conversion through our normal RPGnet setup rather than using a special machine for it. This means that the web server (where I was running the conversion) was separated from the MySQL server (where the conversion actually occurred). I'd considered this, but figured any latency would be minimal, because the two machines were on the same network, talking via private IPs, where there's little network contention. The round trip time between them is about half-a-millisecond. Though that's a lot bigger than the twentieth-of-a-millisecond or better round trip time when a machine talks to itself through network ports, I felt certain that any slowdowns would be from disk access on the MySQL side of things and that the network lag would be irrelevant.

I was wrong there, and at a guess that more than doubled the upgrade time. It's possible that there were other issues contributing to the general slowdown. These servers are built on cloud computers; they're not like the shared computers of the '00s, where you could really be hosed if someone else whose virtual machine shared hardware with you was doing a lot of work. But I have seen definite variation in disk access speed for some of my computers that do LOTS of disk work. So it could be we got lucky on the test machine or unlucky on our real server. But I find it most likely that putting that half-a-millisecond network connection between the web server and the MySQL server created most of our slowdown.
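To put rough numbers on that guess, here's a back-of-the-envelope sketch in Python. The two round trip times and the 32-33 hour test run come from the figures above. Everything else is an assumption: in particular that the conversion issues its queries strictly one at a time, so each one waits out a full round trip, and the function names are made up for illustration.

```python
# Back-of-the-envelope latency arithmetic.
# ASSUMPTION: queries are issued strictly one at a time, so every query
# pays one full network round trip before the next can start.

REMOTE_RTT_MS = 0.5   # web server <-> MySQL server over private IPs
LOCAL_RTT_MS = 0.05   # a machine talking to itself through network ports


def extra_hours(round_trips: int) -> float:
    """Hours of added waiting for a given number of sequential round trips."""
    extra_ms = round_trips * (REMOTE_RTT_MS - LOCAL_RTT_MS)
    return extra_ms / 1000 / 3600


def round_trips_for(hours: float) -> int:
    """Sequential round trips needed to add the given number of hours."""
    return round(hours * 3600 * 1000 / (REMOTE_RTT_MS - LOCAL_RTT_MS))


if __name__ == "__main__":
    # Hypothetical query counts, just to show the scale.
    for n in (10_000_000, 100_000_000, 250_000_000):
        print(f"{n:>11,} round trips -> {extra_hours(n):5.1f} extra hours")
    print(f"{round_trips_for(33):,} round trips would add 33 hours")
```

Under those assumptions, it would take a few hundred million sequential round trips to double a 33-hour run. I never measured how many round trips the conversion actually makes, so this only shows the scale at which half a millisecond stops being irrelevant.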

Second, sometime in the first 30 hours or so, a batch of a bit more than 100 entries from RSS feeds got put into the new database in a set of sequential and incorrect post ids. This caused a problem on night #2, when the conversion program, which seems to convert posts in a somewhat random order, found the first of the correct entries for those post ids, tried to create it, and couldn't because there was something already there. This halted the conversion process until I woke up in the middle of the night, stumbled to my computer, found it halted, stumbled back to my bedside table to find my glasses, then stumbled back to the computer to figure out what was going on and how to fix it, all at about 5AM. This then happened 100+ more times over the course of the upgrade, but with less stumbling. Most of it was right at the end, where the process was halting every minute or so.

It's less obvious what happened here, but my best guess is that even though the forums were off, some other process wrote to the forums and somehow this caused the disruption. This might have been the RSS feed readers built into vBulletin (though the feeds that got duplicated were for old messages, so I'm not really sure). It also could have been the automated morning posts of reviews, columns, and news — especially since the problem occurred the first time that any of those automated messages got written during the upgrade process (Monday morning, around 1am PT, which would have been an hour after they were written).
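One way to catch this sort of collision before it halts a run would be to compare ids up front. This is a minimal Python sketch, not anything the XenForo importer does or that I actually ran. The function names are hypothetical, and it assumes you've already pulled the not-yet-converted source post ids and the ids present in the new database into memory.

```python
def find_id_collisions(pending_source_ids, existing_target_ids):
    """Return sorted ids still waiting to be converted that are already taken."""
    return sorted(set(pending_source_ids) & set(existing_target_ids))


def contiguous_runs(sorted_ids):
    """Group sorted, unique ids into (start, end) runs.

    The stray entries were sequential, so they should show up as one
    or two long runs rather than scattered single ids.
    """
    runs = []
    for i in sorted_ids:
        if runs and i == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], i)   # extend the current run
        else:
            runs.append((i, i))           # start a new run
    return runs


if __name__ == "__main__":
    # Made-up ids for illustration only.
    pending = range(1, 11)
    existing = [4, 5, 6, 20]
    clashes = find_id_collisions(pending, existing)
    print(clashes, contiguous_runs(clashes))
```

Checking for a block of unexpected rows like this once would have been quicker than clearing 100+ halts one at a time.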

Third, the process halted two times for absolutely no reason, and recovered itself when I woke up the computer that was running the web browser. No idea why that was. I restarted my web browser and it went away, so maybe there'd been some leak in the browser that was causing problems.

Problems two and three both had an uncanny ability to knock the conversion offline while I was either sleeping or away from the keyboard. At least three times they knocked it out within half-an-hour of my going to sleep, even though they'd been working fine for many hours beforehand. All told they probably wasted 10-12 hours of the update doing nothing, which I found super-frustrating.

So, what would I do differently? To combat problem #1, I would make sure the web server and MySQL database were on the same machine. To combat problem #2, I would make sure the database was moved from its normal position, so that nothing unexpectedly wrote to it. I probably wouldn't worry about problem #3, but if I did, I could just restart the web browser after a day or two of work.

To accomplish those things, I would:
  1. Turn off the forums.
  2. Clone the database machine. (Rebuild it from its backup.)
  3. Get the cloned machine into order at its new IP address.
  4. Add a web server.
  5. Install XenForo.
  6. Run the upgrade on the cloned machine.
  7. Copy the converted database over to the main database machine.
  8. Copy the XenForo install over to the main web server.
One of the reasons I *didn't* do that this time is that it seemed like it added opportunities for problems to crop up, with all the moving of files between slightly unlike machines. But in retrospect, those potential problems were better than the slower upgrade.

But, lessons learned, not that I plan on doing this again :).

There were two other lesser problems.

First, for some reason one of the board notices (our "Trump" notice as it happens) totally broke the boards when they came back up. I couldn't access them at all until someone over at XenForo very quickly pointed me to that problem. That message went up after I branched off the testbed database, so it was something I couldn't have known in advance. (The date was causing the problem, and I reset that and it was fine, but I think we've since cleared out all the notices.) This was such a bad problem that I thought the upgrade was toast for about an hour on Thursday morning.

Second, page URLs weren't redirecting right, even though I'd had an add-on that was working right over on my testbed. This turned out to be because our real machine is slightly different from our test machine: it uses a more efficient web server called LiteSpeed rather than the standard Apache used on our testbed. I assumed this was the problem from pretty early on; it just took a while to figure out how to get the redirects working in the different environment. If I were being 100% professional, I would have made sure my testbed was identical to our real setup, but we run RPGnet a little by the seat of our pants, using the time and resources we can eke out to do so, and in this case I was comfortable with the difference, and confident that we could find solutions if the variation caused problems (and we did).

And that's the story of the upgrade.
