How one tool solved all 5 risks

6 02 2008

Note: this post focuses on one specific commercial solution (R-1 from RepliWeb), which I have been working with for the past 3.5 years.

RepliWeb’s R-1 has 2 built-in Transport Engines designed for high bandwidth/low error rate (usually LAN) and low bandwidth/high error rate (typically WAN) networks. Both include a special Per-File Data Integrity Assurance feature that guarantees that no broken/half-baked files will ever reach the Target.

The way it works is very simple yet ultra robust: each file is first replicated to an administrator-defined temporary (hidden) directory on the Target host. Only after the file has been verified as completely and successfully replicated is it moved from the temporary location to its final Target destination via a native OS rename command, and only then does R-1 replicate the next file.

This one little gadget actually kills two birds with one stone!! It absolutely guarantees that no broken/half-baked files will ever reach the Target destination (note that the original file is not touched until the final rename action). It also protects the file from being accessed during replication, because the replication is performed into a temporary (hidden) directory where no User or Application will look for the file. The virtue of the rename command is that the whole file is linked into the Target directory at once, so there is zero exposure there.
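For readers who like to see the idea in code, here is a minimal Python sketch of the same copy-to-temporary-then-rename pattern. It is not R-1's code – just an illustration of the general technique, with made-up directory names – and it assumes the staging directory sits on the same filesystem as the Target so the rename is atomic.

```python
import os
import shutil

def safe_deploy(src_path, target_dir, staging_dir=".repl_staging"):
    """Copy a file into a hidden staging directory first, then atomically
    rename it into place so the live copy is never half-written."""
    staging = os.path.join(target_dir, staging_dir)
    os.makedirs(staging, exist_ok=True)

    temp_path = os.path.join(staging, os.path.basename(src_path))
    final_path = os.path.join(target_dir, os.path.basename(src_path))

    # 1. Replicate into the temporary (hidden) location. If this fails or is
    #    interrupted, the original file at final_path is untouched.
    shutil.copy2(src_path, temp_path)

    # 2. Verify the copy is complete (size check here; a checksum is stronger).
    if os.path.getsize(temp_path) != os.path.getsize(src_path):
        raise IOError("incomplete copy, original left intact")

    # 3. Atomically swap the verified copy into place. On the same filesystem
    #    this is a single rename, so readers see either the old file or the
    #    new one, never a partial one.
    os.replace(temp_path, final_path)
```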

Risks #1 and #2 – 100% mitigated!

Another mechanism built into RepliWeb’s R-1 is Transactional Deployment, which is a Quantum-Update Integrity Assurance mechanism. ‘Quantum-update’ refers to the inclusive group of files that need to be deployed on a specific host. The mode of operation resembles Two-Phase Commit or, in layman’s terms, “All or None”: ALL files are first replicated into an administrator-defined temporary (hidden) directory on the Target host, and are moved to the final Target destination only after two conditions are fulfilled (a rough sketch follows the list below). The first condition is ‘hardcoded’ – all files must have been completely and successfully replicated (no point in moving forward if at least one file is damaged/missing). The second condition is selectable by the administrator and can be one of the following:

  • As soon as all files have made it successfully to the temporary directory
  • At a specific time of day
  • When a designated Trigger File is created by an external application/process
  • Upon approval by a named User
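Here is the promised sketch of the “All or None” flow, again in plain Python and not RepliWeb’s implementation. The commit_check callable stands in for the administrator-selectable condition (e.g. lambda: os.path.exists("deploy.trigger") for the Trigger File option); everything else is assumed purely for the sake of the example.

```python
import os
import shutil

def transactional_deploy(files, target_dir, staging_dir, commit_check):
    """Phase 1: stage every file; Phase 2: commit them all only if the
    hardcoded condition (all files staged successfully) and the
    administrator-selected condition (commit_check) both hold."""
    os.makedirs(staging_dir, exist_ok=True)
    staged = []

    # Phase 1: replicate ALL files into the staging (temporary) directory.
    try:
        for src in files:
            dst = os.path.join(staging_dir, os.path.basename(src))
            shutil.copy2(src, dst)
            staged.append(dst)
    except OSError:
        # Any failure aborts the whole quantum-update; nothing in the
        # Target directory has been touched yet.
        for path in staged:
            os.remove(path)
        raise

    # Phase 2: wait for the selectable condition (time of day, trigger
    # file, user approval, ...) before committing.
    if not commit_check():
        return False  # not committed yet; staged files stay put

    # Commit: move every staged file into its final location.
    for path in staged:
        os.replace(path, os.path.join(target_dir, os.path.basename(path)))
    return True
```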

Risks #1, #2 and #3 – knocked out!!

RepliWeb’s R-1 includes another powerful mechanism – inter-deployment synchronization points when running Distribution (1-to-Many) deployments. Simply put, R-1 can make sure that concurrent deployments to multiple Target hosts progress through the same steps simultaneously. When set to synchronize just before the Transaction Commit point, an administrator can extend the Transactional Deployment boundary across multiple hosts: unless ALL hosts have received their Quantum-update in whole, NONE of them will be updated!
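To illustrate the synchronization-point idea, here is a simplified sketch that uses an in-process barrier. A real 1-to-Many deployment would of course coordinate across machines, so treat the function names and the threading model as assumptions made purely for illustration.

```python
import threading

def deploy_to_farm(targets, stage_one_target, commit_one_target):
    """Stage content to every target concurrently, then pause at a barrier
    just before the commit point so that no host commits unless all of
    them staged successfully."""
    barrier = threading.Barrier(len(targets))
    results = {}

    def worker(target):
        try:
            stage_one_target(target)          # phase 1: stage to this host
            results[target] = True
        except Exception:
            results[target] = False
        barrier.wait()                        # synchronization point

        # Commit only if EVERY host staged its quantum-update in whole.
        if all(results.get(t, False) for t in targets):
            commit_one_target(target)         # phase 2: atomic rename(s)

    threads = [threading.Thread(target=worker, args=(t,)) for t in targets]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```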

Risks #1, #2, #3 and #4 – busted!!!

Last, but certainly not least, is RepliWeb’s R-1 Rollback mechanism. It works in conjunction with the Per-File Data Integrity Assurance and Transactional Deployment mechanisms, providing an insurance policy against a wide variety of harmful effects of deployment, including successful deployment of the wrong content. When enabled, the Rollback engine sets aside every original file that has been touched by R-1 (i.e. overwritten or deleted) and records any new file that has been introduced by R-1. At any time the Administrator can instruct R-1 to roll back in time and reverse the effects of all deployments as of a specific point-in-time.
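The rollback concept can be sketched as a simple journal: set aside every file you are about to overwrite or delete, record every file you introduce, and replay the journal in reverse to undo. The class below is only an illustration of that idea (all the names are mine, not R-1’s).

```python
import os
import shutil
import time

class RollbackJournal:
    """Record what a deployment changed so it can later be reversed.
    This is only an illustration of the idea, not R-1's implementation."""

    def __init__(self, archive_dir):
        self.archive_dir = archive_dir
        self.entries = []          # list of (action, final_path, saved_copy)
        os.makedirs(archive_dir, exist_ok=True)

    def before_overwrite_or_delete(self, path):
        # Set aside the original file before it is touched.
        if os.path.exists(path):
            saved = os.path.join(self.archive_dir,
                                 "%d_%s" % (time.time_ns(), os.path.basename(path)))
            shutil.copy2(path, saved)
            self.entries.append(("restore", path, saved))

    def record_new_file(self, path):
        # A brand-new file: rolling back simply means deleting it.
        self.entries.append(("delete", path, None))

    def rollback(self):
        # Reverse the effects of the deployment, newest change first.
        for action, path, saved in reversed(self.entries):
            if action == "restore":
                shutil.copy2(saved, path)
            elif action == "delete" and os.path.exists(path):
                os.remove(path)
```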

Risks #1, #2, #3, #4 and #5 – eradicated from the dictionary!!!!

Happy Deployments.





How to mitigate the 5 risks?

1 02 2008

The answer is not so simple. The risks stem mainly from the deployment tool you’re using – and this is the BIGGEST reason why you should carefully pick your weapon. So my first and foremost advice to you is to ask your vendor 5 difficult questions AND don’t take their word for it – test it yourself in the lab!

Here are some additional ideas; I welcome your thoughts (add a comment just below this post):

Risk #1 (Broken/Half-baked file) – I don’t think there is much you can do except keep your fingers crossed and/or pray. One thing I’ve seen people do is run all deployments interactively so that any error message is immediately reviewed and handled accordingly. I don’t think this is very practical in real-life enterprise environments.

Risk #2 (potential access to file while it is being copied) – Again, this is typically out of your control. Perhaps the only thing you can do is eliminate User and Application access to the Target host(s) while replication is running. For example, disengage a webserver from a load-balancer before kicking off replication to make sure it’s ‘empty’. I’m not sure this is very practical in large-scale environments.

Risk #3 (partial update of single target) – Not much you can do, except carefully sniffing the logfiles (and I do hope your deployment solution generates good enough logs) after each and every deployment.
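One cheap sanity check you can add yourself (a sketch, not a feature of any particular product) is to compare the Source and Target trees after each deployment, assuming the Target content is reachable from wherever the script runs:

```python
import hashlib
import os

def tree_digest(root):
    """Return {relative_path: md5} for every file under root."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            with open(full, "rb") as f:
                digests[rel] = hashlib.md5(f.read()).hexdigest()
    return digests

def verify_deployment(source_root, target_root):
    """List files that are missing or different on the Target."""
    src, dst = tree_digest(source_root), tree_digest(target_root)
    missing = sorted(set(src) - set(dst))
    different = sorted(p for p in src if p in dst and src[p] != dst[p])
    return missing, different
```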

Risk #4 (partial update of some targets) – same as Risk #3

Risk #5 (successful deployment of the wrong content) – stay close to your blackberry/cellphone/beeper!

Happy Deployments!!





More about the 5 Risks you take every time…

26 01 2008

A couple of readers asked for more information about the 5 Risks you take… so here it is:

Risk #1: broken or half-baked file at the Target.
This simple problem is actually one of the deadliest there is. The reason: it is close to impossible for IT to spot such issues, so they are left to be discovered by the business and the users… not a good position to be in, you will surely agree.

The primary cause of such problems is the old-fashioned replication technology used. If you’re using FTP, copy, cp, Robocopy, Rsync, Xcopy or a handful of commercial products, you are betting each and every file replication will be successful – but in reality you have no guarantee. All of these tools replicate each file DIRECTLY into the Target location, so the first thing they’ll do is KILL the ORIGINAL file and only then replicate the new file to that location.

So what will happen if for whatever reason replication breaks halfway (network jitter, no more room on the Target, etc.) or if some of the packets came through corrupted? You’ve guessed it – the original file is lost and in its place you got KABOOM. It’ll be extremely difficult for you to recognize the problem because from the outside the file seems to be there, so a dir/ls check will mislead you. The problem is internal to the file and thus will be noticed only by end users (who will see partial or corrupted data, or may get an error message).
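If you want to see why this happens, the snippet below mimics what a direct-to-Target transfer effectively does; the function name is made up, but the behaviour is the point:

```python
import shutil

def naive_deploy(src_path, final_path):
    # This is roughly what a plain copy/FTP-style transfer does: the
    # destination is opened for writing (truncated) immediately, so the
    # ORIGINAL file is gone from the first byte onward. If the transfer
    # breaks halfway, what remains at final_path is a half-baked file
    # that still shows up in dir/ls under a perfectly plausible name.
    with open(src_path, "rb") as src, open(final_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```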

Risk #2: potential access attempt to the file while it is being replicated.
This is not really an IT issue but the user’s ‘fault’. File replication is never a zero-duration operation. Depending on the size of the file and the speed of the network/host/storage, it will take anywhere from a fraction of a second to minutes and even hours in some cases. When a file is replicated directly into the Target directory (just like FTP, copy, cp, Robocopy, Rsync, Xcopy and many others do), it appears to be available for use from the very first moment (try to run a dir/ls command and see for yourself), so humans and applications may attempt to access it WHILE IT IS BEING COPIED.

The result is either access to partial data (only the portion that made it through can be read) or, more seriously, an access-violation error on either the accessing application’s side or the replication tool’s.

Risk #3: partial update of single Target
This is perhaps the classic problem that everyone thinks about when considering replication. In simple terms: “not all the files made it” – the update was for 57 files but only 55 made it – 2 files are MIA!!

The risks are twofold: (1) the Target is not identical to the Source as expected, and (2) since the Target is not fully updated there may be functional problems (such as broken URLs) or compliance problems (with regulations such as SOX and HIPAA, or regulatory bodies such as SEC and FINRA).

Risk #4: partial update to several Targets
Well, if running a single Webserver may be difficult, think what happens when out of 5 web servers in a farm only 3 are completely and successfully updated. There is now an imbalance in your farm, hence some users will hit the updated servers and some will hit the partially updated ones.

As with most of these problems, this one is noticed more quickly by the business and the end users than by the IT team!! The special difficulty is pinpointing the problematic host(s).

Risk #5: successful deployment of the wrong content
To put it plain and simple: who said life’s fair? Everything ‘IT’ went well and it’s really not your fault that stupid Joe decided to deploy the wrong content. So what – it is your a__ that is on the hot grill, and you need to correct the situation ASAP!!

In my next post I’ll discuss techniques to mitigate those 5 deadly risks, but before we get there, one comment about statistics. Yes, some of you will say “What is he talking about??!! None of this can/will happen in my environment!”, and the answer is pretty simple – it is all a numbers game. Today’s hosts, storage and networks (even WAN links) are pretty good and fairly reliable, but on the other hand the amount of web content deployed these days is ever increasing and in some cases reaches imaginary (aka astronomical) proportions. Furthermore, the criticality of web environments to day-to-day business operations reaches new heights from quarter to quarter. So with so much traffic, an error is just a matter of time – statistically it WILL happen (even to YOU!). The questions you should ask yourself are: What does it mean to me? What will be the cost to my Organization (and will I be the one to pay the price)?
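To put some (purely illustrative, assumed) numbers behind that claim: even a one-in-a-million chance of a broken transfer per file adds up quickly once you deploy thousands of files a day.

```python
# Illustrative assumptions, not measurements: a 1-in-a-million chance of a
# broken transfer per file, 3,000 changed files deployed per day, one year.
p_per_file = 1e-6
files_per_day = 3_000
days = 365

p_at_least_one = 1 - (1 - p_per_file) ** (files_per_day * days)
print(f"P(at least one broken file this year) = {p_at_least_one:.0%}")
# Prints roughly 67% - better than a coin flip that it happens to YOU.
```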





The 5 untold risks you take every time you synchronize a Web Farm

12 01 2008

“Just copying few files to few servers – what can go wrong?”

A little counter-intuitive, right? People have been copying files day in and day out for years and years, so what’s the fuss? Well, the difference in our case is that web farm updating is highly automated (unattended) and frequently executed (typically several times a day), which increases the statistical probability of something going wrong.

To make things more interesting, a typical WWWROOT comprises anywhere from 15,000-30,000 to 120,000-300,000 individual files. Indeed, not ALL files change frequently, but even a minor change to 1%-2% of the site translates to quite a bunch of files to be replicated.

Finally, one has to keep in mind a couple of factors that further amplify any glitch: the first is the immense business criticality of Web environments today (one cannot say enough about the cost of downtime), and the second is the fact that Web glitches are usually noticed BY END USERS – and that is the least desired option of them all…

The 5 risks you take every time you synchronize a Web farm are:

  1. Half-baked/Corrupted files in Production
  2. Potential access to files while they’re being copied
  3. Incomplete update of specific Web Server
  4. Incomplete update of the whole Web Farm
  5. Successful deployment of wrong content

How does your Content Deployment solution handle such situations? Are your operational procedures adequate?