[personal profile] lensman

So I mentioned there was a server problem at work...
The external HD array on the server at work has been having problems all week, and the department has grown to the point where we can't take this much downtime... Time to look at failover solutions.

That's an area that's starting to get outside my current experience, so could any techies share some ideas or caveats?



So far what I'm thinking is VMware ESX/ESXi servers running Windows Server VMs, connected to a private, non-routable back-end NAS (about 6 TB of data now, but easily expandable, and with failover backup). I'd like to keep the users' access to the rollback snapshots of the file shares that Windows Server provides, and I'd also like to keep the accounts in a central AD domain (so that I don't have to worry about account provisioning & maintenance)...

I have a slight advantage leaning towards VMware, as the institute has a pretty good licensing deal, and the ESX & ESXi licenses don't cost me anything, at least. I'll have to double-check on the vSphere management license; I think there would be a cost on that. Basically, I'm looking to remove every reasonably avoidable single point of failure, with the exception of the client-facing network, which is run by our central IT; I have no control past that point.

It's the end of the current budget cycle, so I'd like to move quickly, but we also want to take the time to design and build something that will be stable and scale as we continue to grow. We're looking at needing to provision and roll out machines pretty quickly as the need arises for some of our joint work internationally; they are going to be looking into additional distance-learning tools. I'd also like to provide the users with access to their data through a Dropbox-like sync app, but one that lets us retain the data. (Maybe something like iFolder, although, surprisingly, that project doesn't look like it has much activity, so I'm a little leery. Other suggestions for this are welcome too.)

I've heard a couple of good things about EMC NASes. The only NASes I've had direct experience with have been cheap a$$ Buffalo TeraStations. (Which are actually not bad little Linux boxes, but their upgrades are a pain or not available, and their domain connectivity is next to useless for an AD as large as ours, at least using the default Samba they ship with. Prior to the current situation I had been thinking about the Windows Storage Server versions of the Buffalos as a possible NAS upgrade for one of my sub-groups, as they've filled both a 2 TB and a 4 TB array.)

Date: 2011-06-07 12:33 am (UTC)
From: [identity profile] ninjarat.livejournal.com
I'm not going to recommend hardware because I don't know your systems.

I am going to provide some advice. Doing HA right can be big and expensive. Redundant computers with redundant system disks and internal storage. Redundant public-facing network interfaces with redundant switches. Redundant private network interfaces with redundant switches there. Redundant heartbeat network interfaces with redundant switches there, too. Redundant Fibre Channel interfaces with redundant fibre switches connecting the redundant back-end storage. Redundant power everywhere. Redundant everything. Literally.

Doing HA wrong is easy, and there are two ways to do it. One is to skimp on the hardware, such as using only one switch for the heartbeat or public-facing networks. Lose that switch and the whole cluster is effectively dead, or worse, in a split-brain state, which can lead to data corruption on the back end. The other is to use the "cold" nodes as live nodes. This is easy to rationalize: they're consuming power and cooling, so you might as well use them. But when something faults and those services fail over, you'll find yourself operating over capacity, and the whole thing will fall apart.
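To make the split-brain point concrete, here's a minimal sketch of the majority-quorum rule most HA stacks use to decide which partition may keep running. This is illustrative Python, not any particular product's actual logic, and the node counts are made up:

    # Minimal sketch of majority quorum, the standard guard against
    # split-brain. Node counts are illustrative, not product-specific.

    TOTAL_NODES = 3  # an odd count guarantees any split has a majority side

    def has_quorum(visible: int, total: int = TOTAL_NODES) -> bool:
        """A partition may keep running services only if it sees a strict
        majority of the cluster; everyone else must stop touching storage."""
        return visible > total // 2

    # A failed heartbeat switch splits the cluster into a pair and a loner:
    print(has_quorum(2))  # True  -> this partition keeps the services
    print(has_quorum(1))  # False -> this partition must stand down, not write

    # With an even node count (or sloppy ">=" logic), a clean 2/2 split can
    # leave both halves believing they own the storage -- that is the
    # split-brain case that corrupts the back end.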

Do it right. Do it right the first time. Pay the expense. And then demonstrate it working. Yank the power out of something "critical" and watch it keep on ticking without anyone noticing. Put it back, clear the fault, and do it again with something else. Test everything.

Or do it wrong, and demonstrate the catastrophe when the critical point faults.

Read this:
http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
Specifically, point 3.
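And when you run those pull-the-plug demonstrations, measure them. A once-a-second TCP probe running on a client machine turns "seemed fine" into a hard number for the failover gap. A rough Python sketch; the hostnames and ports are placeholders, substitute your own services:

    #!/usr/bin/env python3
    """Once-a-second availability probe to run from a client machine
    during a failover drill. Hostnames and ports are placeholders.
    Stop it with Ctrl-C when the drill is over."""
    import socket
    import time

    SERVICES = [
        ("fileserver.example.edu", 445),  # SMB file share (hypothetical)
        ("dc1.example.edu", 389),         # AD LDAP        (hypothetical)
    ]

    def is_up(host: str, port: int, timeout: float = 1.0) -> bool:
        """True if a TCP connection to host:port succeeds within timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    down_since = {}  # (host, port) -> timestamp the outage started
    while True:
        now = time.time()
        for host, port in SERVICES:
            key = (host, port)
            if is_up(host, port):
                if key in down_since:
                    gap = now - down_since.pop(key)
                    print(f"{host}:{port} back after {gap:.1f}s")
            elif key not in down_since:
                down_since[key] = now
                print(f"{host}:{port} DOWN at {time.ctime(now)}")
        time.sleep(1)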

Date: 2011-06-07 06:01 am (UTC)
From: [identity profile] lensman.livejournal.com
Yes, after a quick email from another colleague, I'm currently thinking this will be two EMC VNXe NASes (I've already gotten two good internal references for these) cross-connected to two back-end, non-routed gigabit switches. (Not sure if I should interconnect the two switches or not, but I don't think that will matter too much.)
The switches in turn will be cross-connected to six VMware ESXi servers (three are existing ESXi boxes, each with its datastore local to the machine; one is currently a standalone server, but it will be repurposed, with its functions moved into the VMware cloud). These are all a minimum of 2x quad-core with RAID 5 (unfortunately, two of them do not have redundant power supplies), and of course everything has UPS battery backup and lives in a data center on campus (not on city power).
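To convince myself the cross-connect really removes the single points of failure, I can model the back end as a graph and check that killing any one switch, array, or host still leaves every surviving host a path to at least one array. A quick throwaway Python sketch; the names are just labels for the design above, not real hostnames:

    # Toy single-point-of-failure check for the proposed back end:
    # 6 ESXi hosts cross-connected to 2 switches, which are in turn
    # cross-connected to 2 arrays. Names are labels, not hostnames.
    from itertools import chain

    hosts = [f"esxi{i}" for i in range(1, 7)]
    switches = ["sw1", "sw2"]
    arrays = ["vnxe1", "vnxe2"]

    links = set()
    for h in hosts:
        for s in switches:
            links.add((h, s))  # each ESXi host uplinks to both switches
    for s in switches:
        for a in arrays:
            links.add((s, a))  # each switch is cabled to both arrays

    def reachable(start, dead, links):
        """Nodes reachable from `start`, treating node `dead` as failed."""
        adj = {}
        for u, v in links:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in adj.get(node, ()):
                if nxt != dead and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    # Fail each component in turn; every surviving host must still reach
    # at least one array, or we've found a single point of failure.
    for dead in chain(switches, arrays, hosts):
        for h in hosts:
            if h == dead:
                continue
            ok = any(a != dead and a in reachable(h, dead, links)
                     for a in arrays)
            if not ok:
                print(f"SPOF: losing {dead} cuts {h} off from all storage")
    print("check complete")

For this layout the check comes up clean with or without an inter-switch link, which matches my hunch that interconnecting the two switches doesn't matter much for surviving a single failure.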

I was able to confirm that the vSphere license is already covered by central. :-) So it'll just be hardware and OS licenses.

The outward-facing network is run by central, and we're comfortable washing our hands of it at that point. Otherwise I think I'd have to look at replicating off campus, and while we're growing, we're not THAT big yet. Although I could see the potential for our partner countries to handle one or more of those :-) which would make things "interesting"... (Road trip) :-)
