The page table schemes we've talked about so far do work. However, consider performance. Remember that a processor accesses memory very often: at least once per instruction. Many processors need several memory accesses just to fetch each instruction, and, depending on the instruction, a few more while executing it. Under paging, every one of those memory accesses requires that the page table be consulted first.
On the surface, then, paging imposes a 100% performance overhead: every memory access takes twice as long as it would without the page table, since the PTE must be read first. With several memory accesses per instruction, this is terrible. Multilevel page tables, which require multiple memory accesses just to read the page table, make the situation even worse.
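To see how much worse: with an n-level page table and no caching, every program reference costs n memory accesses to walk the page table, plus the data access itself:

    access time = (n + 1) × (one main memory access)

so a 4-level table makes each reference five times slower than raw memory (this is just a generalization of the figures worked through below).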
So at one extreme we have page tables stored entirely in fast, on-chip registers, which have very little lookup overhead but only have room for small tables. At the other end of the spectrum are large, multilevel page tables stored in main memory (the MC68030 uses up to 4 levels of page table). Memory-resident page tables can be much larger, and are faster on context switches, but are potentially far slower at actually accessing pages.
The solution is a special cache memory in the MMU, which stores a small number of page table entries (between 16 and 512). Before making an expensive reference to the page table in main memory, the MMU checks whether it has a copy of the PTE in this cache. If it does, it pays only the low on-chip lookup cost seen before; if it does not, it has to go to memory. This cache is commonly called a translation lookaside buffer, or TLB.
So the sequence is one of the following. On a TLB hit:

  1) check to see whether the page table entry is already in the TLB: yes, it is!
  2) using the frame number from the cached PTE, translate the address and access physical memory.

On a TLB miss:

  1) check to see whether the page table entry is already in the TLB: no, it's not!
  2) look up the PTE in the off-chip page table (a slow main-memory access);
  3) translate the address and access physical memory (a second memory access); and
  4) store the PTE in the TLB for later use.
The processor takes care of maintaining the TLB, adding new PTEs and removing ones which have not been used for a while.
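To make the two paths concrete, here is a minimal sketch of a TLB lookup in C (a real MMU does all of this in hardware; the direct-mapped layout, the 4KB page size, and the identity page_table_lookup are assumptions of this sketch, not details from the text). Real TLBs are typically associative and evict the least-recently-used entry, as described above; this sketch simply overwrites the conflicting slot to stay short.

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 64      /* real TLBs hold between 16 and 512 entries */
    #define PAGE_SHIFT  12      /* assume 4KB pages */

    /* One cached page table entry: virtual page number -> frame number. */
    struct tlb_entry {
        uint32_t vpn;           /* virtual page number (the tag) */
        uint32_t frame;         /* physical frame number */
        bool     valid;
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Stand-in for the slow walk of the memory-resident page table;
       an identity mapping here, just to keep the sketch self-contained. */
    static uint32_t page_table_lookup(uint32_t vpn) { return vpn; }

    /* Translate a virtual address, consulting the TLB first. */
    uint32_t translate(uint32_t vaddr)
    {
        uint32_t vpn    = vaddr >> PAGE_SHIFT;
        uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);
        struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];

        if (e->valid && e->vpn == vpn) {
            /* TLB hit: cheap on-chip lookup, no extra memory reference. */
            return (e->frame << PAGE_SHIFT) | offset;
        }

        /* TLB miss: walk the page table in main memory (one access per
           level), then cache the PTE for later use, evicting whatever
           occupied this slot. */
        uint32_t frame = page_table_lookup(vpn);
        e->vpn   = vpn;
        e->frame = frame;
        e->valid = true;
        return (frame << PAGE_SHIFT) | offset;
    }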
The fraction of translations satisfied by the TLB, instead of requiring a trip to the page table in slow main memory, is known as the hit rate of the TLB. (Generally, the success rate of any cache is known as its hit rate.)
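In general, then, the average memory access time is the hit-rate-weighted average of the two costs; writing h for the hit rate:

    effective access time = (h × cache-hit cost) + ((1 - h) × cache-miss cost)

The example below plugs concrete numbers into this formula.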
An example: main memory accesses cost 80ns, TLB accesses cost 16ns, and the hit rate is 98%.

Without paging, the memory access time would be 80ns.

If the page table were stored entirely in registers, the memory access time becomes

    16ns + 80ns = 96ns

With a one-level page table in memory and no TLB, the memory access time becomes

    80ns + 80ns = 160ns

So the cost of accessing memory on a cache miss is (the cost of accessing the TLB, plus the memory access time with a memory-based page table), or 16ns + (80ns + 80ns), or 176ns.

With the TLB and one-level paging, the memory access time becomes

    (cache-hit cost × 98%) + (cache-miss cost × 2%)
    = (96ns × 98%) + (176ns × 2%)
    = 94.08ns + 3.52ns
    = 97.6ns, or about 98ns

(This is an overhead of 22% over raw, physical memory access speed.)

Even with 4-level page tables, where the cost of a cache miss is 16ns + (4 × 80ns) + 80ns, or 416ns, the average memory access time is still only

    (cache-hit cost × 98%) + (cache-miss cost × 2%)
    = (96ns × 98%) + (416ns × 2%)
    = 94.08ns + 8.32ns
    = 102.4ns

(an overhead of only 28%).
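As a cross-check on the arithmetic, a few lines of C reproduce these figures (the constants are just the example's numbers; nothing else is assumed):

    #include <stdio.h>

    #define MEM_NS   80.0       /* one main-memory access */
    #define TLB_NS   16.0       /* one TLB access */
    #define HIT_RATE 0.98

    /* Effective access time as the hit-rate-weighted average: a hit pays
       the TLB check plus the data access; a miss pays the TLB check, one
       memory access per page-table level, then the data access. */
    static double effective_ns(int levels)
    {
        double hit_cost  = TLB_NS + MEM_NS;
        double miss_cost = TLB_NS + levels * MEM_NS + MEM_NS;
        return HIT_RATE * hit_cost + (1.0 - HIT_RATE) * miss_cost;
    }

    int main(void)
    {
        printf("1-level: %.1f ns\n", effective_ns(1));   /* 97.6 ns  */
        printf("4-level: %.1f ns\n", effective_ns(4));   /* 102.4 ns */
        return 0;
    }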