Oh Noes, my load is high! Part 1

Server Load:

Server load is a measure of the amount of work that a computer system performs. The load average represents the average system load over a period of time.

A server's load average measures the number of active processes over time. The load average shown in top is a simple figure, although several variables go into it. The “nominal” or normal load varies with the processor and memory available. High load averages are usually accompanied by higher than average swap usage: generally, Linux uses the memory it has available and falls back on swap to relieve higher than average load.

Linux splits its usable RAM into chunks called pages. To free up memory, Linux writes some of these pages to a predefined space on the hard disk, called swap space. The total of RAM and swap space equals the amount of virtual memory a system has.
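As a rough illustration of that last sum, here is a minimal Python sketch that adds RAM and swap from text in the /proc/meminfo format. The function name virtual_memory_kb and the sample text are made up for this example; on a real server you would pass in the contents of /proc/meminfo.

```python
def virtual_memory_kb(meminfo_text):
    """Return RAM + swap in kB from text in the /proc/meminfo format."""
    totals = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("MemTotal", "SwapTotal"):
            totals[key] = int(rest.split()[0])  # first field is the size in kB
    return totals["MemTotal"] + totals["SwapTotal"]


# Hypothetical sample: 524288 kB of RAM and no swap configured.
sample = "MemTotal:  524288 kB\nMemFree:  197628 kB\nSwapTotal:  0 kB\n"
print(virtual_memory_kb(sample))  # 524288
```

To try it on a live system, something like virtual_memory_kb(open("/proc/meminfo").read()) should work.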

When viewing the results of, say, top, the load averages cover time frames of 1, 5 and 15 minutes. There are several ways to monitor your server's load. The first thing you will need to do is log in to your server via SSH.

1) uptime: The uptime command produces the following output:



uptime

 14:08:20 up 26 days,  3:46,  1 user,  load average: 0.08, 0.07, 0.02

According to the man page, uptime gives a one-line display of the following information: the current time, how long the system has been running, how many users are currently logged on, and the system load averages for the past 1, 5, and 15 minutes.
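If you want to use those three numbers in a script, a small Python sketch like the one below can pull them out of an uptime line. The function name parse_uptime_loads is my own, and the pattern assumes the English "load average:" label shown above.

```python
import re


def parse_uptime_loads(line):
    """Extract the 1-, 5- and 15-minute load averages from an uptime line."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % line)
    return tuple(float(value) for value in match.groups())


line = " 14:08:20 up 26 days,  3:46,  1 user,  load average: 0.08, 0.07, 0.02"
print(parse_uptime_loads(line))  # (0.08, 0.07, 0.02)
```

When the script runs on the server itself, Python's os.getloadavg() returns the same three values without any parsing.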

2) procinfo: On Linux systems, the procinfo command produces the following output:

procinfo -a

Linux 2.6.9-023stab046.2-enterprise (root@rhel4-32) (gcc 3.4.5 20051201 ) #1 SMP Mon Dec 10 15:22:33 MSK 2007 4CPU [host]

Memory:      Total        Used        Free      Shared     Buffers
Mem:        524288      326660      197628           0           0
Swap:            0           0           0

Bootup: Tue Jul  8 20:39:12 2008    Load average: 0.03 0.06 0.02 1/67 5086

user  :       8:55:25.21   0.3%  page in :        0
nice  :       8:42:41.50   0.3%  page out:        0
system:       9:28:11.27   0.3%  swap in :        0
idle  : 102d 23:25:21.71  98.4%  swap out:        0
steal :       0:00:00.00   0.0%
uptime:  26d  3:50:49.00         context :4294967295    interrupts:        0

Kernel Command Line:
  quiet

Modules:

File Systems: ext3 ext2 [proc] [tmpfs] [devpts]

Procinfo gives a wealth of information, including:

Memory: The amount of memory available including Total, Used, Free, Shared, Buffers.

Bootup Time: The time the system was booted.

Load average: The average number of jobs running, followed by the number of runnable processes and the total number of processes (if your kernel is recent enough), followed by the PID of the last process run (idem).

user: The amount of time spent running jobs in user space.

nice: The amount of time spent running niced jobs in user space.

system: The amount of time spent running in kernel space.

idle: The amount of time spent doing nothing.

steal: The amount of time the virtual CPU spent waiting for a physical CPU.

uptime: The time that the system has been up.

page in: The number of disk blocks paged into core from disk.

page out: The reverse of the above.

swap in: The number of memory pages (chunks) paged (read) in from swap space.

swap out: The number of memory pages (chunks) paged (written) out to swap space.
(Swap in and swap out refer only to transferring pages between RAM and dedicated swap space or a swap file.)

context: The total number of context switches since bootup.

disk 1-4: The number of times your hard disks have been accessed.

Interrupts: This is the two rows of numbers for each IRQ channel if your kernel is at version 1.0.5 or later.

Modules: The modules (device drivers) installed on your machine, with their sizes in kilobytes.

Character and Block Devices: All available devices with their major numbers.

File Systems: All available file systems.
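The five values on the "Load average" line described above (three averages, runnable/total processes, last PID) are laid out the same way as the contents of /proc/loadavg on Linux. A minimal Python sketch for splitting them apart could look like this; the function name and dictionary keys are my own choices for illustration.

```python
def parse_loadavg(text):
    """Split a '/proc/loadavg'-style line into named fields."""
    fields = text.split()
    runnable, total = fields[3].split("/")  # e.g. "1/67"
    return {
        "load_1": float(fields[0]),
        "load_5": float(fields[1]),
        "load_15": float(fields[2]),
        "runnable": int(runnable),      # processes currently runnable
        "total": int(total),            # total number of processes
        "last_pid": int(fields[4]),     # PID of the last process run
    }


# Values taken from the procinfo output shown earlier.
info = parse_loadavg("0.03 0.06 0.02 1/67 5086")
print(info["load_1"], info["runnable"], info["total"], info["last_pid"])
# 0.03 1 67 5086
```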

3) w: The w command produces the following output:

w

 14:38:03 up 26 days,  4:16,  1 user,  load average: 0.01, 0.04, 0.00
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
root     ttyp1    x.x.x.x           14:08   550days  0.03s  0.00s w

Notice that the first line of the output is identical to the output of the uptime command.

4) top: The top program provides a dynamic real-time view of a running system: summary information plus a list of the tasks currently being managed by the Linux kernel. The top command ranks processes according to the amount of CPU time they consume.

top

Output:

top - 14:41:33 up 26 days,  4:20,  1 user,  load average: 0.04, 0.04, 0.00
Tasks:  58 total,   1 running,  57 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1% us,  0.1% sy,  0.0% ni, 99.8% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:    524288k total,   323680k used,   200608k free,        0k buffers
Swap:        0k total,        0k used,        0k free,        0k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12014 root      16   0  1908  984  780 R  0.3  0.2   0:00.01 top
    1 root      16   0  1640  604  524 S  0.0  0.1   0:02.15 init
27817 root      16   0  1544  528  444 S  0.0  0.1   0:11.73 syslogd
27821 root      16   0  1484  376  316 S  0.0  0.1   0:02.10 klogd
27834 named     15   0 68224 3204 1944 S  0.0  0.6   0:17.46 named

5) sar: The sar command reports the accumulated system activity for a specific time frame, reading it from a selected data file and writing it to standard output (the monitor). You can select specific information about system activity using flags. (Run the command ‘man sar’ for more information regarding these flags.)

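sar -q

Output:

14:40:01            3        80      0.00      0.02      0.00
14:50:02            3        77      0.03      0.05      0.01
15:00:01            4        84      0.00      0.02      0.00
15:10:02            3        87      0.06      0.07      0.02
Average:            4        73      0.11      0.09      0.08

After the timestamp, the columns are the run queue length, the number of tasks in the process list, and the load averages for the past 1, 5, and 15 minutes.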

Load Average:

Servers calculate the load average as the exponentially damped/weighted moving average of the load number. The three values of load average refer to the past one, five, and fifteen minutes of system operation.
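To make "exponentially damped" a little more concrete, here is a simplified Python sketch of such a moving average. It assumes the load number is sampled every 5 seconds (the interval the Linux kernel uses) and works in floating point rather than the kernel's fixed-point arithmetic; the function names are hypothetical.

```python
import math

SAMPLE_INTERVAL = 5.0  # seconds between samples (assumed, as on Linux)


def update_load(load, active, window_seconds):
    """Blend the previous average with the current number of active processes."""
    decay = math.exp(-SAMPLE_INTERVAL / window_seconds)
    return load * decay + active * (1.0 - decay)


def simulate(active, seconds, window_seconds, load=0.0):
    """Feed a constant number of active processes for the given duration."""
    for _ in range(int(seconds / SAMPLE_INTERVAL)):
        load = update_load(load, active, window_seconds)
    return load


# One busy process for 60 seconds on a previously idle machine:
print(round(simulate(1, 60, 60), 3))    # 1-minute average:  0.632
print(round(simulate(1, 60, 300), 3))   # 5-minute average:  0.181
print(round(simulate(1, 60, 900), 3))   # 15-minute average: 0.064
```

Note how the same minute of work barely moves the 15-minute figure. That is why the three values can differ so much, and why the 1-minute value does not reach 1.00 immediately even when a process is running the whole time.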

To explain further:

If you have a single CPU, the load average can be read as a percentage of system utilization for a specific time period (1.00 means 100%).
If you have multiple CPUs, you must divide the load average by the number of processors in order to get a comparable percentage.

For example, with a single CPU, you can interpret a load average of “1.75 0.40 9.28” as:

during the previous minute, the CPU was overloaded by 75% (1 CPU with 1.75 runnable processes, so that 0.75 processes had to wait for a turn)

during the last 5 minutes, the CPU was underloaded by 60% (it was busy only 40% of the time, and no processes had to wait for a turn)

during the last 15 minutes, the CPU was overloaded by 828% (1 CPU with 9.28 runnable processes, so that 8.28 processes had to wait for a turn)

This means that this CPU could have handled all of the work scheduled for the last minute if it were 1.75 times as fast, or if there were two (1.75 rounded up) times as many CPUs, but that over the last five minutes it was two and a half times (1 / 0.40) as fast as necessary to prevent runnable processes from waiting their turn.
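The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name interpret_load is made up, and it treats the load average purely as a count of runnable processes, as the explanation above does.

```python
def interpret_load(load, cpus=1):
    """Return (utilization in percent, average number of waiting processes)."""
    per_cpu = load / cpus               # normalize by the number of CPUs
    waiting = max(0.0, load - cpus)     # processes beyond what the CPUs can run
    return round(per_cpu * 100), round(waiting, 2)


for load in (1.75, 0.40, 9.28):
    print(load, interpret_load(load))
# 1.75 (175, 0.75)
# 0.4 (40, 0.0)
# 9.28 (928, 8.28)

# The same 1-minute load on a 4-CPU machine is well under capacity:
print(interpret_load(1.75, cpus=4))  # (44, 0.0)
```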

What is the right load for my server?

In a single CPU environment, anything around 1.0 and below is fine; try to stay under 1.0 for regular load averages. If your server slows down, check the load. A large traffic spike may cause the load to rise.

When your regular load average starts to rise to around 2.0, your server is very busy and you should consider upgrading your RAM if your hardware allows it. A regular average is the load while the server is doing what it was intended for, such as serving up web pages, not while it is processing logs or doing backups.
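As a rule-of-thumb check, the thresholds above could be coded up roughly as follows. The function name and the "elevated" label for the range between 1.0 and 2.0 are my own additions, and dividing by the CPU count applies the per-processor normalization described earlier.

```python
def classify_load(load, cpus=1):
    """Apply the 1.0 / 2.0 per-CPU rules of thumb to a regular load average."""
    per_cpu = load / cpus
    if per_cpu <= 1.0:
        return "fine"
    if per_cpu < 2.0:
        return "elevated"   # label assumed; between the two thresholds
    return "very busy"


print(classify_load(0.08))         # fine
print(classify_load(1.5))          # elevated
print(classify_load(2.0))          # very busy
print(classify_load(4.0, cpus=4))  # fine
```

Feed it the 15-minute value measured during normal operation rather than a reading taken during backups or log processing.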

My next article will deal with what to do when you see the load constantly above normal.