Another Computerworld journalist, Tom Thompson, was saved for last here because this is the gen, the 'skinny' as the Yanks would have it: how a processor actually works. Yes, you know what's in the box you're using, but how does it do it? Read on…

The processors in today's computers have grown tremendously in performance, capabilities and complexity over the past decade.

Clock speeds have skyrocketed and chip sizes have dwindled, even as the number of transistors packed onto each processor has soared. A processor from 1983 made do with 30,000 transistors, while some current CPUs have upwards of 40 million.

Any computer program consists of many instructions for operating on data. A processor executes the program through four operating stages: fetch, decode, execute and retire (or complete).

The fetch stage reads a program's instructions and any needed data into the processor.

The decode stage determines the purpose of the instruction and passes it to the appropriate hardware element.

The execution stage is where that hardware element, now freshly fed with an instruction and data, carries out the instruction. This might be an add, bit-shift, floating-point multiply or vector operation.

The retire stage takes the results of the execution stage and places them into other processor registers or the computer's main memory. For example, the result of an add operation might be stored in memory for later use.
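The four stages can be sketched as a toy interpreter. This is a minimal illustration only: the instruction format, the opcode names and the register names (`r0` to `r3`) are invented for the example and do not correspond to any real processor.

```python
# Toy model of the four stages: fetch, decode, execute, retire.
# The instruction format and opcode names are invented for illustration.

def run(program, registers):
    """Run a list of (opcode, dest, src_a, src_b) tuples against a register dict."""
    pc = 0  # program counter: index of the next instruction to fetch
    while pc < len(program):
        # Fetch: read the next instruction into the "processor".
        instruction = program[pc]
        pc += 1

        # Decode: work out what the instruction does and pick the hardware for it.
        opcode, dest, src_a, src_b = instruction
        operation = OPERATIONS[opcode]

        # Execute: the chosen unit carries out the operation on its operands.
        result = operation(registers[src_a], registers[src_b])

        # Retire: write the result back to a register for later use.
        registers[dest] = result
    return registers


OPERATIONS = {
    "add": lambda a, b: a + b,
    "shl": lambda a, b: a << b,  # bit-shift left
    "mul": lambda a, b: a * b,
}

regs = run(
    [("add", "r2", "r0", "r1"),   # r2 = r0 + r1
     ("shl", "r3", "r2", "r1")],  # r3 = r2 << r1
    {"r0": 5, "r1": 2, "r2": 0, "r3": 0},
)
print(regs["r2"], regs["r3"])  # 7 28
```

A real processor overlaps these stages across many instructions at once; the sketch handles one instruction at a time to keep the sequence visible.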

An important part of a microprocessor is its built-in clock, which determines the maximum speed at which other units can operate and helps synchronise related operations.

Clock speed is measured in megahertz and, increasingly, gigahertz. Today's fastest commercial processors operate at 2GHz, or two billion clock cycles per second.
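To put those figures in perspective, the length of a single clock cycle is simply the reciprocal of the clock rate. The helper below is a hypothetical convenience function for the arithmetic, not part of any library.

```python
def cycle_time_ns(clock_hz):
    """Return the duration of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz

print(cycle_time_ns(2e9))    # 2GHz   -> 0.5
print(cycle_time_ns(500e6))  # 500MHz -> 2.0
```

At 2GHz, then, each cycle lasts half a nanosecond.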

Some hobbyists run the clock faster than its rated speed (a practice called overclocking) to get more performance. However, this raises the chip's operating temperature considerably, which can cause early failure.

Processor circuitry is organised into separate logic elements, perhaps a dozen or more, called execution units.

The execution units work in concert to implement the four operating stages. The capabilities of the execution units often overlap among the processing stages. The following are some of the common processor execution units:

1) Arithmetic logic unit: processes integer arithmetic and logic operations. Sometimes this unit is divided into subunits, one to handle all integer add and subtract instructions, and another for the computationally complex integer multiply and divide instructions.

2) Floating-point unit (FPU): deals with all floating-point (non-integer) operations. In earlier times, the FPU was an external coprocessor; today, it’s integrated on-chip to speed up operations.

3) Load/store unit: manages the instructions that read or write to memory.

4) Memory-management unit (MMU): translates an application's addresses into physical memory addresses. This allows an operating system to map an application's code and data in different virtual address spaces, which lets the MMU offer memory-protection services.

5) Branch processing unit (BPU): predicts the outcome of a branch instruction, aiming to reduce disruptions in the flow of instructions and data into the processor when an execution thread jumps to a new memory location, typically as the outcome of a comparison operation or the end of a loop.

6) Vector processing unit (VPU): handles vector-based, single-instruction multiple data (SIMD) instructions that accelerate graphics operations.
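The address translation described for the MMU can be sketched with a simple page table. This is a simplified model under stated assumptions: the 4KB page size, the dictionary-based tables and the function name `translate` are all chosen for illustration, and real MMUs do this in hardware with multi-level tables.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

def translate(page_table, virtual_address):
    """Map a virtual address to a physical one via a per-application page table."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    if page_number not in page_table:
        # Memory protection: the application touched memory it does not own.
        raise MemoryError(f"page fault at virtual address {virtual_address:#x}")
    return page_table[page_number] * PAGE_SIZE + offset

# Two applications use the same virtual page 0 but land in different physical frames.
app_a = {0: 7, 1: 3}
app_b = {0: 12}

print(hex(translate(app_a, 0x0010)))  # 0x7010
print(hex(translate(app_b, 0x0010)))  # 0xc010
```

Because each application has its own table, neither can reach the other's memory: an address outside its table raises an error instead of returning someone else's data.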

Such vector-based instructions include Intel's MMX multimedia extensions and Streaming SIMD Extensions (SSE), 3DNow from AMD and AltiVec from Motorola.
In some cases, there's no discrete VPU section; Intel and AMD incorporate those functions into the FPU of their Pentium 4 and Athlon CPUs.
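The idea behind SIMD can be shown in plain code: one instruction operates on several data items at once. The sketch below only models the instruction count; the four-lane width and the function names are assumptions for illustration, and real vector units perform the lane additions in parallel hardware.

```python
def simd_add(lanes_a, lanes_b):
    """Model one SIMD instruction: add several pairs of operands as a unit."""
    assert len(lanes_a) == len(lanes_b)
    return [x + y for x, y in zip(lanes_a, lanes_b)]

def brighten(pixels, amount, width=4):
    """Add `amount` to every pixel, `width` pixels per modelled SIMD instruction."""
    out, instructions = [], 0
    for i in range(0, len(pixels), width):
        chunk = pixels[i:i + width]
        out.extend(simd_add(chunk, [amount] * len(chunk)))
        instructions += 1
    return out, instructions

result, count = brighten([10, 20, 30, 40, 50, 60, 70, 80], 5)
print(result)  # [15, 25, 35, 45, 55, 65, 75, 85]
print(count)   # 2 instructions instead of 8 scalar adds
```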

Not all CPU elements execute instructions. Considerable effort goes into ensuring that the processor gets its instructions and data as fast as possible.

A fetch operation that accesses main memory (somewhere not on the CPU chip itself) will use many clock cycles while the processor does nothing (stalls).

Branch prediction by the BPU helps keep the right instructions flowing into the processor ahead of time. However, the BPU can do only so much, and eventually more code or data must be fetched.
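One classic way for a BPU to guess a branch's outcome is a two-bit saturating counter. The article does not say which scheme any particular processor uses, so treat this as an illustrative technique; the class and function names are hypothetical.

```python
class TwoBitPredictor:
    """Counter states 0-1 predict 'not taken'; states 2-3 predict 'taken'."""

    def __init__(self):
        self.counter = 2  # start at weakly "taken"

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        # Nudge the counter towards the actual outcome, saturating at 0 and 3.
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def count_mispredictions(outcomes):
    """Feed a sequence of actual branch outcomes through one predictor."""
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses

# A loop branch taken nine times, then falling through once at the end.
loop_branch = [True] * 9 + [False]
print(count_mispredictions(loop_branch))  # 1
```

The predictor is wrong only at the loop exit, which is why loops are cheap to predict. A branch that alternates unpredictably defeats it far more often.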

Another way to minimise stalls is to store frequently accessed code and data in an on-chip cache.

The CPU can access code or data in the cache in one clock cycle. The primary on-chip cache (called Level 1, or L1) is typically only about 32KB and can hold only part of a program or data.

The trick to cache design is finding an algorithm that gets key information into L1 cache when it's needed. This is so important to performance that more than half of a processor's transistors may be used for a large on-chip cache.
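As a rough illustration of why the choice of algorithm matters, the sketch below simulates a tiny cache that evicts the least recently used (LRU) entry. LRU is an assumption made for the example, not necessarily the policy of any processor mentioned here, and the capacities are far smaller than a real L1 cache.

```python
from collections import OrderedDict

def hit_rate(addresses, capacity):
    """Fraction of accesses served from a cache holding `capacity` entries."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for address in addresses:
        if address in cache:
            hits += 1
            cache.move_to_end(address)     # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used entry
            cache[address] = True
    return hits / len(addresses)

# A loop that touches the same four addresses 25 times over.
loop = [0, 1, 2, 3] * 25
print(hit_rate(loop, capacity=4))  # 0.96
print(hit_rate(loop, capacity=3))  # 0.0
```

When the working set fits, almost every access is a hit. Make the cache one entry too small and this access pattern misses every time, because each entry is evicted just before it is needed again.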

However, multitasking operating systems and a bevy of concurrent applications can overwhelm even a well-designed L1 cache.

To address this problem, vendors several years ago added a dedicated bus interface that the processor could use to access a secondary Level 2 (L2) cache at high speed, typically half or one-third of the processor's clock rate.

Today's newest processors, the Pentium 4 and PowerPC 7450, go further and place the L2 cache on the CPU chip itself, providing high-speed support for a tertiary Level 3 external cache.

In the future, chip vendors may even integrate an on-CPU memory controller to speed things up further.