Sunday, August 30, 2009

Why You Want a Dual or Better

The Macbook headphone problem seems to be solved, most of the time, by blowing into the jack. Why that works is beyond me. I'm still quite upset about it, but at least I can get my music playing now while I write.

That is, of course, not the point of this post, just an aside. I thought I'd write up the principal arguments for a dual socket system and then turn the next post over to a friend of mine who holds the opposing view.

First of all, if you do any database work using a process-based database such as Oracle or Postgres, more hardware threads are generally better. Since, in most architectures, each database process maps to at least one thread or process in the middleware or front end, things go faster when you have plenty of hardware threads to spread them across. When, due to the constraints of the problem, your system ends up with hundreds of threads and tens of database connections in a time-critical application, you will discover that the dual socket system, with its much higher hardware thread count, completes the job in a lot less time.

One argument that has been put forward is that a cluster is more efficient than a single large machine. It may be more cost-effective, but it cannot be faster, because the latency of communicating over even a well-designed gigabit network is much, much higher than communicating locally. You can architect around this, using lots more processes and so on, but at the end of the day, a single bigger machine will beat several smaller machines for database manipulation.

There are plenty of work profiles for which this is not true, such as video processing, where, oddly enough, a single socket machine easily eats up the bandwidth of any sufficiently large drive, meaning that a second socket really isn't helpful. The problem with video becomes one of getting the video to the processor, not one of getting enough processor threads.

However, I and many like me find it much more convenient to have one large box rather than many smaller boxes, primarily because the maintenance burden is so much lower. I don't have multiple machines to constantly shepherd.

It is true, barely, that multiple fast single socket machines are cheaper, per unit of processing power, than one large machine, but there are mitigating factors even here. For starters, any enterprise geek will tell you that disk performance matters, and once you're done outfitting all those single socket machines with adequate disk, the price tilts in favor of a single large machine. This comes down to the price of SSDs, which may decline enough in the future to reverse the position, but, for now, a single fast SSD is enough for most profiles in the one big machine, while you would need one per small machine to achieve the same level of performance.

Further, ram aggregation is a good thing. There are consumer boards that can take more than 8gb of ram, but they are frightfully expensive, both for the board and for the ram, which invariably means pricey 4gb modules. A decent dual socket board will have a minimum of 8 dimm slots, for up to 32gb of ram, and 16gb of ram is very cheap at the moment. Now, on a single big server, you will find that the majority of that ram goes to buffers, which are shared among your processes. In other words, much of your data is common to the processes you're running, but with multiple machines, it all gets duplicated across boxes.

Now, if you're doing 3D rendering or video processing and have already solved the throughput problem mentioned above (I recommend a beefy NFS server with at least 2gb, preferably 8gb, of cache), then you don't really care, because neither of those workloads really burns ram. However, if you are working with a database such as Postgres, you will find an 8gb system to be inadequate once you add up the base system and the shared memory (I was running 3gb of shared memory for Postgres alone!!!). You end up with as little as 3gb left for buffers and cache, which means lots of disk access.
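
For reference, this is roughly what carving out a 3gb shared memory segment for Postgres looks like on Debian; the numbers below are made up for illustration, not copied from my config:

    # let the kernel hand out a ~3gb SysV shared memory segment
    # (shmall is in 4kb pages: 3221225472 / 4096 = 786432)
    sysctl -w kernel.shmmax=3221225472
    sysctl -w kernel.shmall=786432

    # the postgresql.conf knobs that actually consume it:
    #   shared_buffers = 2560MB
    #   max_connections = 200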

Switch to 16gb and you can run everything at once. You can run two 1gb Java instances and a 2gb Eclipse instance, with 3gb of shm for Postgres, about a 1gb Postgres footprint, and a 1.2gb os footprint, for 9.2gb, leaving nearly 7gb for buffers and cache, which is normally enough to avoid significant disk access. And, yes, you'd have to run at least two machines to achieve that same ram footprint using consumer hardware, meaning you either have to fiddle with 10g (don't tempt me; I already have one card) or accept much higher database latency. Alternatively, you can squeeze the whole thing into one machine by shrinking some of the java footprints, but that significantly increases the cost of garbage collection, which slows the java processes down.
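
If you want to see where your ram is actually going, the kernel will tell you; this is the sort of check I mean, nothing here is specific to my box:

    # how much ram applications hold vs. what the kernel is using for cache
    free -m
    # the "-/+ buffers/cache" line is the one to watch: "used" there is what
    # your processes actually occupy, "free" is what can go to caching
    grep -i -e memtotal -e buffers -e cached /proc/meminfo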

So, the reality is that, while the main reason I have such a big machine remains that dual socket machines are just more interesting to me, I really could not live with anything less broad-shouldered. And yes, I am trying to save for a quad socket system in the future, which is why I have the 8000 series Opteron processors rather than the 2000 series; it would only cost about $300 to get two more 8358se procs, $900 for a quad socket board, and around $240 for 16gb more ram. My case is large enough, and my psu is both large enough and equipped with the appropriate connectors. I do need to replace the stock fans (100mm case fans, four to a side) with Scythe Kaze2s, for around two to three times the airflow, as those 8358se procs run very hot.

I am also eventually going to add a second Radeon HD 3870, because I am running into a problem with dual screens where, if both screens are receiving significant amounts of data, they tend to stutter. It is possible the problem is actually in XFCE, which I need to investigate first, but I have to say the machine is pretty much bored out of its mind playing hulu on one screen and freeciv on the other. When the freeciv app starts moving stuff around in a hurry, the flash hulu player stutters, and the problem is worse full-screen. Since I only ever tax at most three cores, as evidenced by the fact that only three cores go to full speed with the ondemand governor, and since the hulu playback never pushes a cpu above the floor frequency (1.2ghz), I think I can reasonably assume it is not a cpu load problem. I think I can also reasonably assume it is not a bandwidth problem, as the video card is in an x16 PCI-e slot and the system has four DDR2-667 memory busses, with interleaving on top of that.
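
For the record, this is how I watch the cores to reach that conclusion; the paths are standard cpufreq sysfs, nothing exotic:

    # per-core clocks while the load is running
    watch -n1 'grep "cpu MHz" /proc/cpuinfo'

    # what the ondemand governor thinks each core should be doing right now
    cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq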

One more thing I'm fiddling with is getting OpenGL working on this radeon card under linux. I just want the pretty OpenGL screensavers. Seriously.
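
When I get around to it, the first sanity check will be something like this (assuming the mesa glxinfo utility is installed):

    # is 3D acceleration actually wired up?
    glxinfo | grep -i "direct rendering"
    glxinfo | grep -i "opengl renderer"
    # "direct rendering: No" or a software rasterizer here means the
    # radeon 3D driver isn't loaded, and the screensavers will crawl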

Saturday, August 29, 2009

Heat Problems, Again

Mash has been, well, running a tad hot; not really overheating. It's the same problem as before, where a dead spot in the airflow has developed over the chipset near the video card, leading to elevated temperatures there. I pulled the cover off and found the culprit: a failed fan. It turns out to be a 100mm fan, which is a pretty rare size. I found a Scythe Kaze rated around 42 cfm, which ought to be substantially better than the 15-25 cfm the fans in there are currently doing.

To fix it temporarily, I wedged a 120mm Ultra Kaze in there, which produced substantially higher flow despite not being sealed into the fan gang. I am very impressed with that fan. Many say the Ultra Kazes are too loud, but my machine is close to a vacuum cleaner at full speed anyway.

I did find a 92mm Delta fan that can blow around 100 cfm, but it is also a lot more expensive than the Scythe, and I'd have to re-engineer the holder. I'm still considering it, because with four of them in the gang, 400 cfm would be nice.

Well, I Did Pay $1500 For It

I am speaking, of course, of my shiny aluminum unibody Macbook, the thorn in my side. The networking is all working, which is a relief, but the stupid red led of death is back. I have spent a lot of time trying to figure out what triggers it and how to fix it, even to the point of keeping an old headphone plug around to stuff in there occasionally, but the one thing that seems to work most often is blowing into the hole. What is odd is that canned air does not work. Go figger. Every time it happens, I want to launch into a rant about badly-designed headphone jacks. Basically, they're subjected to every torque and abuse in the book, so they absolutely must be isolated from the mainboard and shock mounted, and the jack itself has to be constructed of very high grade material. This is not a place you want to save a few pennies, especially with netbooks nipping at your heels. Seriously, Apple, don't let me spend too much time wondering why I spent $1500 for this thing...

I guess I have to take it out to a 'genius' who will hem and haw, take it apart, look at it, and tell me he can have a logic board in by monday or whatever, seeing as it is still under warranty, but I hate dealing with that sort of thing. That aura of being somehow a cut above is one major reason I've traditionally bought Apple hardware, but the current Apple is on a relentless drive to produce stuff whose internals are ever less well-designed. There's a hundred dollars' worth of machined aluminum casing, glass and so on, but the actual system inside sux.

Its networking is sub-par. Using the same disk, a Macbook Pro gets much higher throughput. Some of this may also be down to the drive subsystem, which is likewise sub-par. To be fair, the Macbook Pro has a processor with adequate cache, but no computer should spend more than ten percent of its cpu handling nfs copying. Of course, a lot of that is the hideously bloated internals of OS X, and I do use Linux, so I have the bar set pretty high, but some of it is the network controller they use, which is a bargain-bin part. The ancient HP laptop I have carries a Broadcom nic, and it didn't cost as much as this thing did new.

So, yeah, not a bad machine to write blogs on, but a disappointment in most other ways. I understand the new 13.3-inch Macbook Pro is supposed to be a lot better, but, hey, I already paid $1500 for this one.

Friday, August 28, 2009

The Macbook Is Working Right Again

The goofy headphone port problem (the red led of death, where removing a headphone sometimes switches the output to digital for no apparent reason) is gone. I've been using this thing for a few days since the last system update, and it appears fixed. I had often remarked that the problem went away after a reboot, so it seemed to be a software problem. Also, my Macbook can connect to the network again, oh frabjous day. It is a real pain to have to dump everything on the nfs server because the Macbook won't make an smb connection to anything. The most recent system update appears to have fixed that as well.

I have a good feeling about Snow Leopard. It is supposed to seriously reduce the footprint of the OS by jettisoning the PPC code; sorry, all you diehards, I feel your pain. However, for those of us on Intel Macs, losing the extra code path should really help with performance and stability. For starters, maintaining the PPC code requires what amounts to a separate development effort. Also, over time, Apple has been improving optimization for Intel hardware at the expense of every other platform. Now they can quit worrying about another platform entirely, allowing greater optimization.

Here's hoping there will be fewer 'I paid $1500 for this pile' days in my future.

Mash Has Been Up For a Bit

I had to rework the system a bit. The extra video card never worked right under X, and after a power outage left me unable to get Xinerama working again, I decided to quit messing with it. As I was pulling the card, I noticed a lot of heat building up in the top of the case around the fanless raid controller, so I ended up pulling the video card, moving the raid controller to the number two PCI-X slot, and putting a fan card in the number one PCI-X slot. The fan is made by Thermaltake, iirc, and has an led, w00t, but we can't always get what we want. It has reduced the thermal load in the top of the case a bit, but more importantly, I closed off all the open PCI slots with blanks.

That last bit is very important if you have a hot video card with a large fan, as the fan ends up sucking hot air back in through those openings, which was a big part of my problem. This case has more than enough fannage to take on any system, with four 80mm fans aimed at the motherboard tray and another four 80mm fans evacuating air on the drive side. The case is split in two, with twelve 5-1/4 and three 3-1/2 bays open to the front and no internal bays, and all my drives are in carriers with extra fans. That's 14 fans in total. I believe in keeping things cool.

I guess I'd never really messed with a hot video card before, so it never occurred to me there could be a problem. So far as I can tell, only the video card (a Sapphire ATI Radeon HD 3870 with 512MB of ram) and the 3ware 9500s-8 were running hot; the chipset, which sits in that area, didn't feel too hot to the touch.

So, I've been using the new Mash, monster mash, for a few days now, and I can't emphasize enough how much faster the 8358se procs are for everyday usage. A buddy of mine and I have a long-running argument about faster procs vs. more procs, and I always end up with more procs, due to an inherent personality trait. He always ends up with faster procs, and we compare notes and swap abuse about it, slinging benchmarks at each other like monkeys in a zoo.

The final analysis, in my humble opinion, is that if you actually need more than, say, four simultaneous processes, then you need more processors. If you don't, then don't buy the bigger system with more processors, as the latency of a dual socket system eats into performance, particularly single-threaded performance.

I guess I was stunned to find out most people do only one thing at a time on their computer. Sure, they think they multitask, with a little web surfing, music playing and a compiler in the background, or word processing or whatever, and that's the way I used to use a computer back in the day, which is why I always had more than one. However, these new dual quad core systems are a revelation. I've replaced almost every machine with one. The one I haven't replaced is, of course, the Macbook.

With eight cores and sixteen gigabytes of ram running Debian and XFCE, my common footprint is around 3.5GB of ram, which also happens to be how much space I'm currently using on the SSD. Because of how I've reconfigured the ram, turning on interleaving and enabling memory hole remapping so Linux can get at the low memory, I have about 14.5GB available. Subtract the 3.5GB footprint and that leaves 11GB for cache and buffers, meaning that the system essentially never reads from disk and absolutely never swaps. Leaving swap on, of course, allows the kernel to manage memory more effectively, temporarily paging something out to relocate large blocks and so on. The swap is around 30GB on two separate swap partitions, one on each 15krpm SAS disk.
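
For what it's worth, the two swap partitions are given equal priority so the kernel stripes across both disks; the device names below are made up for illustration:

    # /etc/fstab (hypothetical partition names):
    #   /dev/sda2  none  swap  sw,pri=1  0  0
    #   /dev/sdb2  none  swap  sw,pri=1  0  0
    # equal priorities make the kernel interleave pages across both disks
    swapon -a
    swapon -s    # shows size, usage and priority per swap area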

Now, this machine used to have a job, but so did I, and now we're both pretty bored. It seems like a lot of horsepower for playing hulu, youtube and freeciv, with the occasional light browsing. I occasionally fire up Eclipse and mess with it, but I'm doing far more on the Macbook, learning to write code for the iPhone, which I can't do on Linux. As a result, I can't give you any real performance numbers comparing, say, database loads on the old setup versus the new, but I can say that the user interface is around 50% faster, that disk access is essentially instantaneous (sorry, hard to benchmark that), and that everything I've tried to do with it has worked without a hitch, as these procs don't have the wide assortment of bugs the old ones had. It gives me hope for tomorrow.

Wednesday, August 26, 2009

Escape from Teh Suck

I decided to consolidate marklar and mash, my two duals, into a single box.  Marklar has the faster procs and video; mash has the faster disk subsystem.  I also decided to install linux on my intel x25-e 32gb slc ssd.

The very first thing I did was pull the procs while marklar was still on.  Something was in the way of the power light, and I didn't notice until I went to pull the video card.  Fortunately, only the standby power and the SMU were live, and no damage was done.  Off to install the CPUs in mash...

No joy.  The dreaded quiet fan spin: no beeps, nothing.  Well, maybe the CPUs did burn up.  Pull them back out.  Whoops, how did that pin get bent?  Hmm.  Go to fix it with an awl, slip, bend a bunch more.  Pull the array controller card (meaning all the sata cables now have to be rerouted), pull the video cards, pull the power, pull the motherboard tray, take out all the ram, and spend half an hour with an awl and a doublet I have lying around straightening pins.  Put it all back together.  I ran out of silver thermal compound with just barely enough, so if I get this wrong, it'll be a while before I can try again.

Flip the switch with just one video card, two ram sticks and both CPUs.  It works.  Whew.  Put everything back together, plug the dvd drive into the LSI 1068 controller, fire her up.   #$^@#$%@#$.  The DVD drive won't work with the LSI 1068.  And if I plug the DVD into the MCP-55 NVRAID SATA controller, I know from long experience that the 3ware and the LSI controllers will both quit working.

So, after much experimentation, I got the NVRAID running with the 3ware pulled.  Yay.  Install debian.  Reboot, hit the power switch, plug in the 3ware, GRUB error 17.  !#$^%@#$%@$#.  The dual SAS disks are still in the system; with the 3ware card removed, they are sda and sdb.  Use the old /boot partition and get ready to get GRUBby.  However, if you put the 3ware back in and make sure its array (sda now) is set last in the boot order, the machine boots, then panics because the filesystems are no longer where it left them.  No problem, off to edit fstab and menu.lst, and I have a perfectly working machine.  Then some messing around with X and I have one video card working with two monitors attached.  Later, after some shut-eye, I will see about the other card and the other monitor.
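
The fstab and menu.lst edits amount to nothing more than shifting device letters; these are illustrative entries with made-up partition numbers, not my actual config:

    # /boot/grub/menu.lst (grub legacy): root moves from sda1 to sdb1
    #   kernel /vmlinuz-... root=/dev/sdb1 ro
    # /etc/fstab follows suit:
    #   /dev/sdb1  /     ext3  errors=remount-ro  0  1
    #   /dev/sdb2  none  swap  sw                 0  0
    # using UUID= entries instead would avoid the shuffle next time; blkid prints them:
    blkid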

Some impressions so far: the 8358se processors are seriously fast, lots faster than the 8347s.  I use the 8000 series because I have this idea in the back of my mind to build a four socket monster at some point.  I buy all my processors used on ebay, so there is a limit to the amount I'm willing to spend, and when they get that cheap, the 8000 and 2000 series are normally about the same price.

The addition of the ssd has pretty much ended any jerkiness the os may have had.  It simply never stops to think.  Installs fly by.  Downloads are faster, too, which is odd, but explainable given the lower latency of the ssd.

Normal machine-buying seems to consist of simply picking up the fastest processor, largest disk and most memory money can buy.  I've always thought that a machine should start with the disk, because no matter how fast your processor is, if the disk can't keep up, the processor will be idle most of the time.  So I've had a lot of SCSI hardware, some SAS hardware, and now SSDs.  I won't be buying any more SAS hardware; that is how impressive these disks are.  They are hideously expensive, but the price is dropping, and fast.

I think that, in the end, people are going to see what a huge speed bump an ssd is and begin to appreciate what I've been saying all along, but maybe not; I've used other machines that were considered, like, seriously hot, dude, and they seem slow, laggy and easily hammered to me, while my machines seem slow to most other people.  However, with any of my machines, if you have a hundred things to do in a hurry, it will get them done and leave enough performance lying around to surf the web.

And that's what the bandwidth and the dual sockets and the really fast disk are all about: getting lots of things done at once rather than getting one or two things done really fast.

So, finally, I have XFCE4, Opera, Flash, Firefox and Thunderbird working on this machine, making it a reasonable dev box.  Oh, and there's Sun's Java 6 and eclipse as well.  Oh, and the full gcc/g++ dev stack and the whole kernel tree.  3.4GB.
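
If you want to check a footprint like that on your own box, this is the sort of thing I mean; nothing here is specific to my setup:

    # total space the installed system occupies; -x stays on the root
    # filesystem so /proc, /sys and other mounts don't get counted
    du -shx /
    # or just look at the root filesystem usage directly
    df -h /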

I had this ssd in my macbook for a bit, and it made the macbook quite snappy, but the OS X install took up 16GB!!!!  That left around 14GB for me to do things with, which rapidly turned into 8 after a while.  I began to fear the dreaded running-out-of-disk crash in OS X, the one that so often leads to the trash-all-preferences-on-dot-mac-and-reinstall solution.  Here in linuxland I have a full desktop with all the tools necessary to develop applications, and it's taking less than a quarter of the space OS X took.  Hmm.

Tuesday, August 25, 2009

Linux NFS and SCP Performance

I have been messing around with my Linux boxen.  I have two, connected through an HP gigabit wire speed switch.  The profiles are as follows:

Marklar: 'client'

2x Opteron 8358se in an ASUS KFN5-D SLI with 8gb of ram, an Hitachi Ultrastar 1tb and a Broadcom 5751 on a PCI-e x1 card.

Mash: 'server'

2x Opteron 8347 in a Tyan S2915 with 16gb of ram, a 3ware 9500s-8 with 6 Seagate 750gb SATA disks as the array, dual Fujitsu 147gb 15krpm SAS disks in raid0 as root and using the built-in forcedeth controller with Marvell phy.

I tried four different benchmarks.  For all of them, the copy was initiated on the client, and the file copied was the same 3.1gb Windows 7 beta ISO.  After the first copy the file was pretty much all in memory, so the first run was thrown out.

The four, roughly as sketched below, were:

scp from the client to the array

copy to the array mounted through nfs, tuned for large files

scp from the client to my home directory on the raid0 SAS disks

nfs copy to the SAS disks, mounted with general-purpose options
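
For the curious, the runs looked more or less like this; the mount options, export path and file name are illustrative, not lifted from my shell history:

    # nfs mount tuned for large sequential transfers (illustrative options)
    mount -t nfs -o rsize=32768,wsize=32768,async mash:/array /mnt/array

    # nfs copy: run twice, keep the second number, sync so the timing is honest
    time sh -c 'cp win7-beta.iso /mnt/array/ && sync'

    # scp of the same file straight to the server
    time scp win7-beta.iso mash:/array/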

The results were as follows:

Copy to array via nfs: 58MB/s average

SCP to array: 42MB/s average

Copy to raid0 SAS via nfs: 88MB/s average

SCP to raid0: 42MB/s

Several things come to mind.  I have managed over 100MB/s across this link between two linux boxes before, so 88MB/s on a large file is respectable.  The raid0 SAS disks ought to be able to do nearly 200MB/s easily using linux' MD driver.  The local disk, an Hitachi Ultrastar terabyte drive, is perfectly capable of averaging over 100MB/s, and, as noted above, the whole file was in memory at the time, as confirmed by watching disk activity with saidar during the copies.  Essentially, ssh's crypto limits the transfer to about 42MB/s on the opteron 8347, which clocks at 1.9GHz.  I believe the 8358se in the client is capable of substantially higher crypto throughput; most benchmarks I've messed with show it around 50% faster, with even bigger gains if SSE is used.
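
One easy way to confirm that the wall is crypto rather than network would be to rerun the scp with the cheapest cipher openssh offers; arcfour is the usual pick.  The file name and path below are placeholders:

    # same copy, lighter cipher; if this jumps well past 42MB/s,
    # the bottleneck is the cipher, not the wire
    scp -c arcfour win7-beta.iso mash:/array/
    # openssl's built-in benchmark gives a rough per-core ceiling, too
    openssl speed aes-128-cbc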

Another thing is that it is quite difficult to reach the theoretical 125MB/s limit of gigabit networking.  Even though most modern systems have more than enough throughput and processing power to handle it, many don't have low enough latency.  For instance, my Macbook can only do around 35MB/s to nfs.  Part of that is the very slow internal disk, which, although 7200 rpm, doesn't come near the Hitachi, but a lot of it is the relatively high latency of the system.