Load Balancer best practice

Working closely with load balancers over the years has taught me quite a few tricks and hard-learned lessons about the problems and the best ways of implementing a load balancer in your web farm.

I hope you enjoy what I have to say here. I will listen to what you have to say, and correct and update this article accordingly.

1. Web application, clock synchronization.

In a load-sharing setup, you must not rely on the servers’ clocks when writing data to the database or presenting data to the user.

Every server’s clock drifts, and some drift more than others, so even a clock sync once a day does not guarantee that the servers are synchronized. Let’s look at an example:

A user logs in to your website; your application takes the server’s time and records the login time as 14:00.

Now the user submits a form, and this second request goes through a different server. Another entry is written to the database with a time of 13:59, because there was a few seconds’ difference between the clocks.

A later action is now registered in the database as if it happened before the earlier one.

The best way to avoid this is to use the database clock for all read and write actions. That ensures a consistent time across actions, no matter how many servers you have in the pool.
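Here is a minimal sketch of the idea in Python. It uses an in-memory SQLite database only as a stand-in for the shared database server, and the table and column names are made up for the example. The point is that the application never sends its own time; the database stamps every row itself.

```python
import sqlite3

# SQLite stands in for the shared database server here (an assumption for
# the sake of a runnable example). Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_actions ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " username TEXT NOT NULL,"
    " action TEXT NOT NULL,"
    " created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
)


def record_action(conn, username, action):
    """Insert an action and let the database stamp the time.

    No timestamp is passed in from the web server, so clock differences
    between the servers in the pool cannot reorder the rows.
    """
    conn.execute(
        "INSERT INTO user_actions (username, action) VALUES (?, ?)",
        (username, action),
    )
    conn.commit()


def actions_in_order(conn, username):
    """Return (action, created_at) rows ordered by the database's clock.

    The id column breaks ties, since CURRENT_TIMESTAMP in SQLite only has
    one-second resolution.
    """
    rows = conn.execute(
        "SELECT action, created_at FROM user_actions"
        " WHERE username = ? ORDER BY created_at, id",
        (username,),
    )
    return rows.fetchall()


record_action(conn, "alice", "login")        # could be handled by server A
record_action(conn, "alice", "submit_form")  # could be handled by server B
history = actions_in_order(conn, "alice")
```

With a real database server you would use its own time function in the same way, instead of a value computed by the web server.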

2. Application session type

If your application uses sessions, it’s better to use database-based sessions. This keeps the user’s session alive no matter which server in your pool handles the request.

Be sure to keep your servers equally loaded. You should be able to set your content group up as round robin with no stickiness.
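As a rough illustration, here is a small Python sketch of a database-based session store. The class name, the table layout and the use of SQLite as the shared database are all assumptions made for the example; a real web framework would normally provide its own session backend.

```python
import json
import sqlite3
import uuid


class DatabaseSessionStore:
    """Hypothetical session store that keeps session data in a shared database."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions ("
            " session_id TEXT PRIMARY KEY,"
            " data TEXT NOT NULL)"
        )

    def create(self, data):
        """Store the session data and return the new session id."""
        session_id = uuid.uuid4().hex
        self.conn.execute(
            "INSERT INTO sessions (session_id, data) VALUES (?, ?)",
            (session_id, json.dumps(data)),
        )
        self.conn.commit()
        return session_id

    def load(self, session_id):
        """Return the session data, or None if the session does not exist."""
        row = self.conn.execute(
            "SELECT data FROM sessions WHERE session_id = ?",
            (session_id,),
        ).fetchone()
        return json.loads(row[0]) if row else None


# One shared database, two "servers" in the pool using it.
shared_db = sqlite3.connect(":memory:")
server_a = DatabaseSessionStore(shared_db)
server_b = DatabaseSessionStore(shared_db)

session_id = server_a.create({"user": "alice"})  # login handled by server A
seen_by_b = server_b.load(session_id)            # next request lands on server B
```

Because both servers read the same table, it does not matter which one the Load Balancer picks for the next request.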

If you must use server-based sessions, then you must set up some kind of stickiness on your Load Balancer content group. If you do, make sure the stickiness is based on source IP AND source port.

Pay attention: with some Load Balancers, when you choose sticky source IP they disregard the source port. This results in all requests coming from a certain IP going to one server, regardless of the number of requests or of the different people initiating them. Make sure the stickiness is based on source port as well. This prevents search engine bots from loading a single server while indexing your site, which could degrade performance on that server.
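The difference can be sketched in a few lines of Python. This is not how any particular Load Balancer implements stickiness; it is a simple hash-based stand-in, with a made-up server pool and a made-up crawler address, just to show how the choice of key changes the distribution. Keep in mind that with the source port in the key, a new connection from a different port may land on a different server.

```python
import hashlib

SERVERS = ["web1", "web2", "web3", "web4"]  # hypothetical pool


def pick_server(key, servers=SERVERS):
    """Map a stickiness key to a server with a stable hash."""
    digest = hashlib.sha256(key.encode("ascii")).digest()
    index = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[index]


def sticky_source_ip(ip, port):
    """Stickiness on source IP only: the port is ignored."""
    return pick_server(ip)


def sticky_source_ip_and_port(ip, port):
    """Stickiness on source IP and source port together."""
    return pick_server(f"{ip}:{port}")


# A crawler opening many connections from one address (documentation range).
crawler_ip = "203.0.113.7"
ports = range(40000, 40200)

servers_hit_ip_only = {sticky_source_ip(crawler_ip, p) for p in ports}
servers_hit_ip_and_port = {sticky_source_ip_and_port(crawler_ip, p) for p in ports}
```

With the IP-only key every connection from the crawler ends up on the same server, while the IP and port key spreads the connections over the pool.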

4. Server keep alive

If your Load Balancer supports a keep-alive URI, which the Load Balancer uses to make sure the server is alive, be sure to use a keep-alive file that also exercises the server’s scripting engine (.php, .aspx, etc.). Sometimes there is a problem with the scripting engine while the server itself is fine and still handing out responses; you want that server removed from the pool rather than causing application errors for your users.

Using a URI keep alive rather than other methods such as ICMP lets you remove a server from the pool easily by renaming the keep-alive file that the Load Balancer checks for.

This can be used for maintenance or version updates, without taking any action on the Load Balancer itself.
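To make this concrete, here is a small Python sketch of the logic behind such a keep-alive URI. The function names, the file name and the status codes chosen are assumptions for the example; the application check is just a callable standing in for whatever exercises your scripting engine.

```python
import os
import tempfile


def keepalive_response(keepalive_path, app_check):
    """Return (status_code, body) for a Load Balancer health probe.

    The probe fails if the keep-alive file is missing (server taken out of
    the pool on purpose) or if the application check raises an error.
    """
    if not os.path.exists(keepalive_path):
        return 404, "keepalive file not found"
    try:
        app_check()
    except Exception as exc:
        return 500, f"application check failed: {exc}"
    return 200, "OK"


def healthy_check():
    """Hypothetical application check that succeeds."""
    return None


def broken_check():
    """Hypothetical application check that simulates a broken scripting engine."""
    raise RuntimeError("scripting engine error")


with tempfile.TemporaryDirectory() as tmp:
    keepalive = os.path.join(tmp, "keepalive")
    open(keepalive, "w").close()

    in_pool = keepalive_response(keepalive, healthy_check)
    engine_down = keepalive_response(keepalive, broken_check)

    # Renaming the file takes the server out of the pool for maintenance.
    os.rename(keepalive, keepalive + ".maintenance")
    drained = keepalive_response(keepalive, healthy_check)
```

Renaming the file back puts the server into the pool again on the next successful probe.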

5. Alerts

If possible, configure email alerts or SNMP traps for major incidents, such as server unavailable, service unavailable, etc.

This helps identify problems in the system and tells you when the status of your environment is degrading.

6. Monitoring

Graph your interfaces, the bandwidth of each service, the hits of each service, and anything else you can possibly graph.

When investigating a fault, graphs of past behavior will help you understand what could have gone wrong.

7. Access lists

Set up access lists to restrict traffic going through the Load Balancer. You may be serving different internal networks, and some Load Balancers by default behave like a router between networks, which results in unfiltered traffic between supposedly filtered networks.

Put explicit allow access lists on all interfaces, and use logging. This will help identify access problems and/or unwanted traffic generated on the network.

Feel free to comment,

Lior.

SQL Server Cluster setup failure

This one time I was supposed to set up an SQL Server 2005 cluster.

The setup kept failing its checkup time after time, no matter what I did.

Apparently the problem was that the other node had a logged-in session.

Strange as it sounds, logging off the other node stopped the checkup from failing, and the installation finished successfully.

Strange, eh?

Cisco CSS Load Balancer – Traffic Drop

I just wanted to comment here on a problem I had which took me a while to understand, and for which I couldn’t find a solution anywhere.

The setup is a Cisco CSS 11500 Series with 2 web servers and 2 application servers.

It was a Cisco ASA Firewall (192.168.1.1) -> Cisco CSS (192.168.1.2)
                                         |-> Servers (192.168.1.10, 11, 12, 13)

So the Cisco CSS and the servers were sitting on the same network as the internal interface of the ASA Firewall. The servers, of course, had the CSS as their default gateway.

The application content group was configured with sticky sessions at the time.

The problem was that now and then our connection to the website was dropped and we couldn’t connect back for a few minutes, and then it was back up. During that time the website was accessible from anywhere else.

This problem drove us nuts, and it took us a while to resolve.

The CSS was reporting quite a lot of SYN attacks. (The CSS never drops traffic because of a DoS attack; it just marks it.)

So we were thinking maybe there was a DDoS mitigation device along the way suspecting an attack and blocking the traffic. That was not it.

The problem was ICMP redirect packets sent from the CSS, telling the servers about a better route to the external addresses: the firewall.

For example, for traffic coming from IP 2.2.2.34, the CSS would send an ICMP redirect message to the servers (Windows Server 2003), and the Windows machine would then inject this route into its routing table.

Traffic would come from the CSS to the server, but the packets back were sent to the firewall and got lost there. This was causing the blockage.

By default, ICMP redirect routes stayed in the routing table for 10 minutes and were then removed.

So to keep this from happening, you either disable ICMP redirects on the CSS (I don’t know why it’s turned on by default anyway),

or change a registry value in Windows to disable ICMP redirect route injection (it’s possible to disable this in Unix as well):

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

Change the setting of the EnableICMPRedirect entry to 0 (in most cases it will be set to active, 1, by default).
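If you have several servers to fix, the registry change can be scripted. The sketch below is only an illustration in Python: it builds the command line for the standard Windows reg tool and, by default, does not run it. The function names and the dry-run behaviour are my own choices for the example, and you would need administrator rights (and possibly a restart) for the change to take effect.

```python
import subprocess
import sys

# Same key as above, using the HKLM short form accepted by the reg tool.
TCPIP_PARAMETERS_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def build_disable_redirect_command():
    """Build the reg command that sets EnableICMPRedirect to 0."""
    return [
        "reg", "add", TCPIP_PARAMETERS_KEY,
        "/v", "EnableICMPRedirect",
        "/t", "REG_DWORD",
        "/d", "0",
        "/f",  # overwrite the existing value without prompting
    ]


def disable_icmp_redirects(dry_run=True):
    """Return the command; run it only on Windows and only if dry_run is False."""
    command = build_disable_redirect_command()
    if dry_run or sys.platform != "win32":
        return command
    subprocess.run(command, check=True)
    return command


command = disable_icmp_redirects(dry_run=True)
print(" ".join(command))
```

Calling disable_icmp_redirects(dry_run=False) on the Windows server itself would apply the change.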

Hope this helps others as well,

Lior

Reorganizing your patch cables

Check out these guys, they seem to enjoy it :)

Minimize your reaction time to DDoS attacks

Intro

Even though a recent report shows fewer DDoS attacks over the past year, the risk is still out there.

Nowadays most DDoS attacks can be mitigated quite effectively, provided that you possess some kind of IPS.
However, most of the time is wasted in the initial process of understanding that your network is under attack and deciding what to do from here.

The main scenario shown here is one where the network under attack is your remote network sitting in a data center far away,
but the actions and precautions can, and sometimes should, be implemented in all kinds of networks regardless of location.

A few of the reasons why more time is wasted in remote networks:

1. You have no physical access to the devices.
2. It takes time to understand why exactly the network is down.
3. Most of the time you are wasting your time talking to the NOC person and waiting too long to get the net-op to help you.
4. The pressure rises with every wasted minute.

It’s usually a hardware failure (ours or the ISP’s) that we suspect whenever our network is down, rather than putting on a helmet and shouting “we are under attack!”
Most of the time, if you tell someone less technical that the problem is one thing, he will go along with you rather than question you and tell you something like “uhmm, you know… I can see you are using 150% MORE than your usual traffic… does that sound alright to you?”

So after a few tests with the NOC guy (still based on the assumption that it is a hardware failure) we go and ask him about anything suspicious on their end.
Then he tells you something like “No sir, everything is fine. I mean, you have 150% more traffic, but I doubt that it matters…”

Now the panic starts…
You wasted more than an hour figuring out that it’s something else completely: it’s a DDoS attack, you have no IPS, and there is nothing you can do! (Or is there?)
Well, the good news is that there is still something you can do. The bad news is that even if you had an IPS, you have already lost valuable time, and you will continue to lose time over the next steps if you are not prepared correctly.

Prepare yourself

Even if you DO have an IPS and your network is under attack in a remote data center, you need to prepare yourself for the fastest and best response.
IPSs are in learning mode most of the time, and for a good reason: traffic in most networks changes (new features, new versions, new traffic, etc.) and you don’t want to get to a point where your IPS is dropping packets just because it “thinks” the network is under attack.

You need to minimize your response time, and your decision needs to be as refined as possible.
An IPS activated in a panic may stop the attack, but it may also stop some services that you need. This can be avoided most of the time.

A few basic rules:

1. If you don’t have a router in that network, meaning you have a firewall connected directly to the data center’s infrastructure and you have no access to the routers,
get a decent router NOW. Firewalls need protection as well.
A router can handle much more than a firewall in most cases. A firewall usually NATs and filters the traffic, and it keeps a table of live connections; too many of them can cripple it.
2. Monitor, monitor, monitor. Monitor the interfaces, memory and CPU usage of all networking devices. Use syslog to gather all the logs from the devices.
You most probably have a load balancer as well. Get its statistics: VIPs, groups, services, traffic/hits.
3. You have the monitored graphs and logs? Great! Make sure you can ALWAYS access the data (dedicated management link/modem/data backup to an outside location).
If you can’t get that data when you need it most, you worked for nothing.
4. Talk to your ISP/data center IT manager about their procedure for these kinds of incidents (it’s true, some of them don’t have one). I like to have the cell phone number of one of the net-ops just in case, so I can hurry things up if needed.


About monitoring: some people monitor the interfaces/CPU/memory of the routers and firewall. This is all nice and better than no monitoring, but you really want to monitor the switches and anything else networking-wise.
This is how you trace a peak or a drop in traffic from the outside all the way to the exact port on the switch and the server connected to it.

The simplest monitoring system will give you more than any IDS/IPS or sniffer in your network, and when I say simple monitoring systems I mean MRTG/Cacti/etc. and an SQL-based syslog.
A network without equipment graphs and logs is a crippled network. It’s a network that has a lot of problems and will keep having a lot of problems.
So before anything else, get some monitoring tools.

If you know what is going on in your network at every point in time, you will ask the NOC guy the right questions and you will reach conclusions faster, saving a lot of valuable time.
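Once you have the numbers, even a very simple comparison against your usual traffic answers the “150% more than usual” question in seconds. Here is a minimal Python sketch; the sample values, the function names and the 100% threshold are all made up for the example, and in real life the samples would come from your monitoring system.

```python
from statistics import mean


def traffic_increase_percent(baseline_samples, current):
    """Return how much current traffic exceeds the baseline average, in percent."""
    baseline = mean(baseline_samples)
    if baseline <= 0:
        raise ValueError("baseline traffic must be positive")
    return (current - baseline) / baseline * 100.0


def is_suspicious(baseline_samples, current, threshold_percent=100.0):
    """Flag traffic that exceeds the baseline by at least threshold_percent."""
    return traffic_increase_percent(baseline_samples, current) >= threshold_percent


# Hypothetical readings in Mbps, e.g. the same hour on previous days.
usual_mbps = [40, 42, 38, 40]
current_mbps = 100

increase = traffic_increase_percent(usual_mbps, current_mbps)
alarm = is_suspicious(usual_mbps, current_mbps)
```

The same check can be run per interface or per switch port to narrow down where the extra traffic is going.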


We are under attack, what now?


So, whether you wasted 2 hours because you have no monitoring tools or you got the idea after 5 minutes because of your monitoring tools, now you need to deal with the problem.

The panic way is to activate the IPS to start mitigating the attack; the IPS will now filter ALL the traffic.
In most cases only one IP address is under attack, and if your IPS is now filtering your class C network with 50 active addresses, you are just draining valuable IPS resources.

If you have the most important data available, which is the graphs and logs, you can easily see whether you are under attack and what exactly is under attack.
As said before, most attacks are on one IP address, and it will usually be your corporate web server, but it could easily be something else.
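Finding the address under attack from your logs can be as simple as adding up traffic per destination. The Python sketch below assumes you already have flow records as (source, destination, bytes) tuples; that record format, the function name and the sample addresses (taken from the documentation ranges) are assumptions for the example.

```python
from collections import Counter


def most_targeted_destination(flows):
    """Return (destination_ip, share_of_total_bytes) for the busiest destination.

    flows is an iterable of (source_ip, destination_ip, byte_count) tuples.
    Returns (None, 0.0) when there is no traffic at all.
    """
    bytes_per_destination = Counter()
    for _source, destination, byte_count in flows:
        bytes_per_destination[destination] += byte_count

    total = sum(bytes_per_destination.values())
    if total == 0:
        return None, 0.0

    destination, byte_count = bytes_per_destination.most_common(1)[0]
    return destination, byte_count / total


# Hypothetical flow records.
flows = [
    ("203.0.113.5", "198.51.100.10", 9000),
    ("203.0.113.6", "198.51.100.10", 8000),
    ("203.0.113.7", "198.51.100.20", 500),
    ("192.0.2.9", "198.51.100.30", 500),
]

target, share = most_targeted_destination(flows)
```

If one address takes almost all of the traffic, that is the one to deal with; the rest of the network should be left alone.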

This is where your router comes into the picture: use the router to divert traffic for the targeted IP address to the IPS; all other traffic should flow cleanly with no interruptions.
Your firewall will thank you, since it can do its job again, AND you are now maximizing the effectiveness of the IPS.

If you don’t have an IPS, there is still something you can do about it.
You or your ISP can blackhole the targeted IP address, OR, if the attack originates in only one part of the world, you can temporarily filter the originating address block.
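The actual filtering is done on the router or at the ISP, but the matching logic behind “filter the originating address block” is easy to sketch in Python with the standard ipaddress module. The function name and the blocked range below (a documentation range) are assumptions for the example.

```python
import ipaddress


def build_source_filter(blocked_cidrs):
    """Return a function that tells whether a source IP falls in a blocked block."""
    networks = [ipaddress.ip_network(cidr) for cidr in blocked_cidrs]

    def is_blocked(source_ip):
        address = ipaddress.ip_address(source_ip)
        return any(address in network for network in networks)

    return is_blocked


# Hypothetical address block that the attack originates from.
is_blocked = build_source_filter(["203.0.113.0/24"])

attacker_blocked = is_blocked("203.0.113.77")
customer_blocked = is_blocked("198.51.100.1")
```

Keep the list of blocked ranges short and temporary, and remove it once the attack is over.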

In any case, with an IPS or without, with a router or without, with the capability to mitigate an attack or not,
always, but always, work with your ISP’s/data center’s net-op or NOC. Sometimes you will need the attack to be mitigated further up the chain, as bandwidth plays an important role here.
Your ISP needs to know why your bandwidth is high, and if you are bandwidth capped, you had better get your ISP to temporarily remove the cap.


Conclusion


The things mentioned earlier are actions and precautions you can take to deal with DDoS attacks and/or network failures more quickly,
rather than sitting helplessly and waiting for the data center guys to sort it out. Even more important, nobody will (or at least nobody is supposed to) know your network or your traffic better than you.
You are supposed to know better than anyone at your ISP how to handle your traffic (which addresses can be whitelisted and not filtered, which countries you don’t do business with at all and can be blocked completely as traffic originators, etc.).


A few words about IPSs and other DDoS mitigation devices.
An IPS will not help if you don’t have the bandwidth to handle the attack in addition to the legitimate traffic.
If your network is limited on bandwidth, ask your ISP/data center to host the device at their end. You can also get together with a few more customers and join efforts (money and political power) to get some kind of DDoS protection at your ISP; this is a very legitimate action in many places in the world.

And don’t forget: please monitor your network…