Tuning Process Activity - Beginning Ubuntu LTS Server Administration

If something isn’t going well on your server, you want to know about it. So, before you can conduct any process management, you need to tune process activity. Linux has an excellent tool that allows you to see exactly what’s happening on your server: the top utility. It shows the system load, CPU and memory usage, and the most active processes. It is very easy to start top: use the top command.

When the utility starts, you’ll see something like Figure 6-1.

Figure 6-1. The top utility gives you everything you need to know about the current state of your server.

Using top to Monitor System Activity

The top window consists of two major parts. The first (upper) part provides a generic overview of the current state of your system. These are the first five lines in Figure 6-1. In the second (lower) part of the output, you can see a list of processes, with information about the activity of these processes.

The first line of the top output starts with the current system time. This time is followed by the “up” time; in Figure 6-1, you can see that the system has been up for only a few minutes.

Next, you see the number of users currently logged in to your server. The end of the first line contains some very useful information: the load average. It consists of three numbers. The first is the load average for the last minute, the second is the load average for the last 5 minutes, and the third is the load average for the last 15 minutes.

The load average is a number that indicates the activity of the process queue: the average number of processes that are running or waiting to be handled by the CPU on your system. (On Linux, processes in uninterruptible sleep, typically waiting for disk I/O, are counted as well.) On a system with one CPU, a load average of 1.00 indicates that the system is completely busy handling the processes in the queue, but there are no processes waiting in the queue. If the value increases past 1.00, the processes are lining up, and users may experience delays while communicating with your server. It’s hard to say exactly what a critical value is. On many systems, a value anywhere between 1 and 4 indicates that the system is just busy, but if you want your server to run as smoothly as possible, make sure that this value exceeds 1.00 only rarely.

If an intensive task (such as a virus scanner) becomes active, the load average can easily rise to a value of 4. It may even happen that the load average reaches an extreme number like 254. In this case, it’s very likely that processes will wait in the queue for so long that they will die spontaneously. What exactly indicates a healthy system can be determined only by doing some proper baselining of your server. In general, 1.00 is the ideal number for a one-CPU system. If your server has hyperthreading, dual-core, or two CPUs, the value would be 2.00. And, on a 32-CPU system with hyperthreading enabled on all CPUs, the value would be 64. So the bottom line is that each (virtual) CPU counts as 1 toward the overall value.
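The one-per-CPU rule of thumb can be sketched in a few lines of Python. This is only an illustration, not part of top itself: the helper names load_per_cpu, describe_load, and current_load_report are made up for this example, and the threshold of 1.0 per (virtual) CPU is simply the guideline given above, not a hard limit.

```python
import os


def load_per_cpu(load_average, cpu_count):
    """Return the load average divided by the number of (virtual) CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


def describe_load(load_average, cpu_count):
    """Classify a load average with the one-per-CPU rule of thumb."""
    if load_per_cpu(load_average, cpu_count) <= 1.0:
        return "within capacity"
    return "processes are queuing"


def current_load_report():
    """Classify the 1-, 5- and 15-minute load averages of this machine."""
    cpus = os.cpu_count() or 1  # cpu_count() may return None
    return [(load, describe_load(load, cpus)) for load in os.getloadavg()]


# The examples from the text: 2.00 on two CPUs and 64 on 64 virtual CPUs
# are both "full but not queuing"; 4.00 on one CPU means a backlog.
print(describe_load(2.0, 2))
print(describe_load(64.0, 64))
print(describe_load(4.0, 1))
```

On a Linux server, calling current_load_report() applies the same check to the three values that top shows at the end of its first line.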

The second line of the top output shows you how many tasks currently are active on your server and also shows you the status of these tasks. A task can have four different statuses:

• Running: The process is currently using the CPU or is ready to run as soon as a CPU becomes available. You will normally see that this number is rather low.

• Sleeping: The process is waiting for an event, such as input, before it can continue. This is a typical status for an inactive daemon process.

• Stopped: The process has been suspended, for example by a job-control signal (such as pressing Ctrl+Z in a shell) or because a debugger is tracing it. It stays in this status until it is resumed or killed.

• Zombie: The process has terminated, but its parent process has not yet collected its exit status. A zombie that lingers is a typical example of bad programming in the parent. Zombie processes will sometimes disappear after a while and will always disappear when you have rebooted your system.
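To see where such a summary can come from, here is a minimal Python sketch that tallies process states the way the second line of top does. It assumes the Linux /proc file system, where each /proc/<pid>/stat line has the form "pid (command) state ...". The function names and the grouping in STATE_NAMES are choices made for this example; in particular, counting uninterruptible sleep (D) as sleeping is an assumption.

```python
import glob
from collections import Counter

# One-letter kernel state codes mapped to the four statuses described above.
STATE_NAMES = {
    "R": "running",
    "S": "sleeping",
    "D": "sleeping",  # uninterruptible sleep, grouped with sleeping here
    "T": "stopped",
    "Z": "zombie",
}


def parse_state(stat_line):
    """Extract the state letter from one /proc/<pid>/stat line."""
    # The command name is in parentheses and may itself contain spaces
    # or ')', so split at the last ')' rather than on whitespace.
    after_command = stat_line.rsplit(")", 1)[1]
    return after_command.split()[0]


def count_states(stat_lines):
    """Tally the statuses for an iterable of stat lines."""
    counts = Counter()
    for line in stat_lines:
        counts[STATE_NAMES.get(parse_state(line), "other")] += 1
    return counts


def read_stat_lines():
    """Yield the stat line of every process visible under /proc."""
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as handle:
                yield handle.read()
        except OSError:
            continue  # the process exited while we were reading


# Made-up sample lines, truncated after the fields that matter here.
sample = [
    "1 (init) S 0 1 1",
    "812 (cron) S 1 812 812",
    "2044 (top) R 1990 2044 1990",
    "2101 (old (child)) Z 2050 2050 1990",
]
print(dict(count_states(sample)))
```

On a live Linux system, count_states(read_stat_lines()) gives the current totals.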

The third row of top provides information about current CPU activity. This activity is separated into different statistics:

• us: CPU activity in user space. These are the programs and daemons that run outside the kernel, including the commands that have been started by normal users.

• sy: CPU activity in system space. Typically, these are kernel routines that are doing their work. Although the kernel is the operating system, kernel routines are still often conducting work on behalf of user processes or daemons.

• id: CPU inactivity, also known as the idle loop. A high value here just indicates that your system is doing nothing.

• wa: For “waiting,” this is the percentage of time that the CPU has been idle while waiting for I/O to complete. This should normally be a very low value; if not, it’s time to make sure that your hard disk can still keep up with the other system activity. If your server feels slow, this should be one of the first parameters to check, because what looks like a high workload on the CPU might be caused by a slow I/O channel.

• hi: For “hardware interrupt,” this is the time the CPU has spent servicing interrupts from hardware. It will be rather high if, for example, you’re reading large amounts of data from an optical drive.

• si: For “software interrupt,” this is the time your CPU has spent servicing software interrupts, which the kernel uses to finish work that hardware interrupt handlers have deferred, such as processing network packets. It should normally be rather low.

• st: This parameter indicates the time that is stolen by the virtualization hypervisor (see Chapter 13 for more details about virtualization and the hypervisor) from a virtual machine, that is, time during which the virtual machine was ready to run but the hypervisor gave the physical CPU to something else. On a server that doesn’t use any virtualization, this parameter should be 0 at all times. In a virtualized environment with a very active virtual machine, this parameter will increase from time to time, because the host operating system and its virtual machines have to share CPU cycles.
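The percentages in this row are differences between two samples of cumulative counters. The following Python sketch shows the arithmetic, assuming the Linux /proc/stat format, whose first line lists cumulative ticks for user, nice, system, idle, iowait, irq, softirq, and steal, in that order. The names FIELDS, parse_cpu_line, and cpu_percentages are invented for this example, and the ni (nice) field is included only because /proc/stat reports it between us and sy.

```python
# Labels in the order in which /proc/stat reports the first eight counters.
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu' line of /proc/stat into a dict of ticks."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(value) for value in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Share of each category in the ticks elapsed between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Two made-up samples taken one polling interval apart.
first = parse_cpu_line("cpu  100 0 50 800 50 0 0 0 0 0")
second = parse_cpu_line("cpu  200 0 100 1600 100 0 0 0 0 0")
for name, value in cpu_percentages(first, second).items():
    print(f"{name}: {value:.1f}%")
```

To use live data, read the first line of /proc/stat twice with a pause in between and pass both samples to cpu_percentages. A single sample only tells you the totals since boot.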

The fourth and fifth lines of the top output display the memory statistics. These lines show you information about the current use of physical RAM (memory) and swap space. (Similar information can also be displayed using the free utility.) An important thing that you should see here is that not much swap space is in use. Swapping is bad because the disk space used to compensate for the lack of physical memory is approximately 1,000 times slower than real RAM.

If all memory is in use, you should take a look at the balance between buffers and cache.

Cache is memory that is used to store files recently read from the server’s hard drive. When files are stored in the cache, the request can be handled very quickly the next time a user requests the same file, thus improving the general speed of the server. Cache is good, and having a lot of it isn’t bad at all.

A buffer is a region of memory reserved for data that still has to be written to the server’s hard drive. After the data has been written to the buffers, the process that owns the data is told that the data has been written. This means that the process can go on doing what it was doing and has to wait no longer. Once the disk controller has time to flush the buffers, it will flush them. Although using buffers is helpful, there is a disadvantage: everything that was stored in the server’s buffers will be lost in case of a power outage if it hasn’t yet been written to disk. And that’s where the journal becomes useful: when the server reboots, the journal is read to recover the damaged files as fast as possible (see Chapter 4 for more information about journaling).

The bottom line about cache and buffers is that they are occupied memory that can be freed as soon as it is needed for something else. If a process has a memory request, the server can clear these memory areas immediately to give the memory that becomes available to the process in need.
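As a rough illustration of this bookkeeping, the Python sketch below adds up buffers and cache from text in the format of the Linux /proc/meminfo file, which reports sizes in kB. The function names are invented for this example, and the sum is only an approximation: part of the cache (shared memory, for instance) cannot simply be dropped, which is why newer kernels also publish their own MemAvailable estimate.

```python
def parse_meminfo(text):
    """Parse 'Name:   value kB' lines into a dict of integers."""
    values = {}
    for line in text.splitlines():
        name, separator, rest = line.partition(":")
        fields = rest.split()
        if separator and fields:
            values[name.strip()] = int(fields[0])
    return values


def reclaimable_kb(meminfo):
    """Memory held by buffers and cache that can be handed out on demand."""
    return meminfo.get("Buffers", 0) + meminfo.get("Cached", 0)


def swap_used_kb(meminfo):
    """Swap space currently in use; ideally this stays close to zero."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


# Made-up sample in the /proc/meminfo layout.
SAMPLE = """\
MemTotal:        2048000 kB
MemFree:          102400 kB
Buffers:           51200 kB
Cached:           819200 kB
SwapTotal:       1048576 kB
SwapFree:        1048576 kB
"""

info = parse_meminfo(SAMPLE)
print("free:", info["MemFree"], "kB")
print("free plus buffers and cache:", info["MemFree"] + reclaimable_kb(info), "kB")
print("swap in use:", swap_used_kb(info), "kB")
```

In this sample only about 5 percent of the RAM is free in the strict sense, yet almost half of it could be given to a process that asks for memory. Passing the contents of /proc/meminfo instead of SAMPLE gives the figures for a live system.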

The lower part of the top window provides details about individual processes. By default, the list is sorted by CPU usage, so the process that’s most active is listed first. Each line displays some usage statistics for one process:

• PID: Every process has a unique process ID. Many tools such as kill need this PID for process management.

• User: This is the name of the user ID the process is using. Many processes run as root, so you will see the user name root rather often.

■ Note

For well-programmed processes, it’s generally not a problem that they’re running as root. It’s a different story, though, for logging in as the user root.

• PR: This is the priority indication for the process. This number is an indication of when the process will get some CPU cycles again. A lower value indicates a higher priority so that the process will have its share of CPU cycles sooner. The value RT indicates that it is a real-time process and is therefore given top priority by the scheduler.

• NI: The nice value of the process. See “Setting Process Priority” later in this chapter for more details on nicing processes.

• VIRT: The total amount of memory that is claimed by the process.

• RES: The resident memory size is the amount of memory that is actually mapped to physical memory. In other words, it represents the amount of process memory that is not swapped.

• SHR: The amount of shared memory is what the process shares with other processes. You’ll see this quite often because processes often share libraries with other processes.

• S: This is the status of the process; the status indications are the same as the ones in the second line of the top screen.

• %CPU: This is the share of CPU time that the process has used in the last polling cycle (which is three seconds by default).

• %MEM: This is the percentage of physical memory that the process currently uses.

• TIME+: This indicates the total amount of CPU time that the process has used since it was first started. To measure the CPU time that a single command uses from start to finish, you can use the time command, followed by the command that you want to measure.

• Command: This is the command that started the process.
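The TIME+ value is derived from a tick counter that the kernel keeps for every process. The Python sketch below converts such a counter into the minutes:seconds.hundredths notation that top uses. It assumes 100 clock ticks per second, which is common on Linux but not guaranteed (os.sysconf("SC_CLK_TCK") reports the actual value), and the function name format_cpu_time is made up for this example.

```python
def format_cpu_time(ticks, ticks_per_second=100):
    """Format a CPU tick count as minutes:seconds.hundredths."""
    if ticks < 0 or ticks_per_second < 1:
        raise ValueError("ticks must be >= 0 and ticks_per_second >= 1")
    # Work in whole hundredths of a second to avoid rounding surprises.
    total_hundredths = ticks * 100 // ticks_per_second
    minutes, remainder = divmod(total_hundredths, 6000)
    seconds, hundredths = divmod(remainder, 100)
    return f"{minutes}:{seconds:02d}.{hundredths:02d}"


# 12,345 ticks at 100 ticks per second are 123.45 seconds of CPU time.
print(format_cpu_time(12345))  # prints 2:03.45
```

A process that has been running for days can still show a small TIME+ value, because only the time it actually spent on a CPU is counted.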

As you have seen, the top command really provides a lot of information about current system activity. Based upon this information, you can tune your system so that it works in the optimal way.
