M3 Architecture Research Group

Computer Systems Laboratory
361 Frank H.T. Rhodes Hall
Ithaca, NY 14853 USA
m3 at csl.cornell.edu

Checkpointed Processor Architectures

Modern out-of-order processors tolerate long-latency memory operations by supporting a large number of in-flight instructions. This is achieved in part through proper sizing of critical resources. As the gap between processor and memory speeds widens, tolerating future latencies in this way would require resource sizes that could significantly lengthen the clock cycle time. A closer look, however, reveals that such enlarged resources are in fact grossly underutilized on average [CAL'03].

To continue using many of the successful micro-architectural advances in future processor cores, whether standalone or as part of multicore architectures, it is imperative to break with some of the inefficiencies of traditional instruction processing. In this project we investigate novel micro-architectures that seek high performance and high efficiency through a balanced mix of conventional instruction processing and selective checkpointing of the architectural state.

In [MICRO'02] we propose Checkpointed Early Resource Recycling ("Cherry"), a pioneering micro-architectural technique that uses selective processor checkpointing to allow optimistic recycling of critical processor resources, such as physical registers or load/store queue entries. This aggressive recycling mechanism improves the effective utilization of these resources [CAL'03], and results in higher single-thread performance without increasing their size. In [TACO'04] we introduce Ephemeral Registers, a technique that allows simultaneous late allocation and early recycling of physical registers. Later, in [MICRO'05], we describe Cherry-MP, a set of architectural extensions that enable correct integration of Cherry in multicore chips. In that work, we also provide quantitative evidence that Cherry can help compensate for the ILP loss in multicore chips built from a larger number of smaller processor cores.

Cherry, however, does not rid the processor of stalls due to long-latency loads, as it still requires instructions to complete execution before they can retire. In [VPW'04, HPCA'05] we propose Checkpointed Early Load Retirement ("Clear"), which speculatively retires unresolved long-latency loads blocked at the head of the reorder buffer (ROB) by supplying a "back-end" (at retirement) value prediction. This allows subsequent instructions to execute and retire. To handle mispredictions, we rely on our checkpointing support. This work received the 2005 HPCA Best Paper Award.
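The control flow described above (checkpoint, retire the load with a predicted value, then commit or roll back once the real value arrives) can be illustrated with a small software toy. This is only a sketch under simplifying assumptions, not the hardware mechanism from [VPW'04, HPCA'05]: the class `ToyCore`, its methods, and the single-checkpoint, dictionary-based register file are all hypothetical names and structures chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class ToyCore:
    """Hypothetical toy model: a register file plus one checkpoint."""
    regs: Dict[str, int] = field(default_factory=dict)
    checkpoint: Optional[Dict[str, int]] = None
    pending: Optional[Tuple[str, int]] = None  # (dest register, predicted value)

    def retire_load_early(self, dest: str, predicted: int) -> None:
        # Save the architectural state before speculating past the load.
        self.checkpoint = dict(self.regs)
        # Retire the unresolved load using the predicted value.
        self.regs[dest] = predicted
        self.pending = (dest, predicted)

    def retire_alu(self, dest: str, fn: Callable[[Dict[str, int]], int]) -> None:
        # Younger instructions retire on top of the (possibly predicted) state.
        self.regs[dest] = fn(self.regs)

    def resolve_load(self, actual: int) -> bool:
        """Return True if the prediction was correct; otherwise roll back."""
        dest, predicted = self.pending
        self.pending = None
        if predicted == actual:
            self.checkpoint = None       # prediction held: discard the checkpoint
            return True
        self.regs = self.checkpoint      # misprediction: restore the checkpoint
        self.checkpoint = None
        self.regs[dest] = actual         # the load now retires with the real value
        return False


# Correct prediction: speculative work is kept.
ok = ToyCore(regs={"r1": 1})
ok.retire_load_early("r2", 10)
ok.retire_alu("r3", lambda r: r["r1"] + r["r2"])
hit = ok.resolve_load(10)

# Wrong prediction: speculative work is discarded.
bad = ToyCore(regs={"r1": 1})
bad.retire_load_early("r2", 10)
bad.retire_alu("r3", lambda r: r["r1"] + r["r2"])
miss = bad.resolve_load(7)
```

In the mispredicted case the toy simply drops the result written to `r3`; in a real processor the instructions after the load would be re-executed from the checkpoint.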

Aggressive CMOS scaling will make future chip multiprocessors increasingly susceptible to transient faults, hard errors, manufacturing defects, and process variations. Existing CMP proposals that implement dual modular redundancy (DMR) do so by statically binding pairs of adjacent cores via dedicated communication channels and buffers. This can result in unnecessary performance degradation when one core is defective (in which case its DMR pair must be disabled), or in performance/power losses when the DMR pair exhibits frequency/leakage variations. Static binding also puts additional pressure on thermal management, since DMR pairs running code with similar thermal characteristics are necessarily placed next to each other. In [DSN'07] we describe the use of Cherry's checkpointing support and a variation of Cherry-MP's coherence mechanism to construct a CMP where arbitrary cores can verify each other's execution. This results in hardware that degrades half as fast as mechanisms that rely on static binding, and provides support for on-demand triple modular redundancy (TMR) to overcome hard faults. It also allows for more flexible management of thermal density and variation-induced hardware inefficiencies.
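The capacity argument against static binding can be made concrete with a little arithmetic. The sketch below is a toy count only, not the evaluation from [DSN'07], and it does not attempt to reproduce the degradation figure quoted above. The function names and the assumption that static binding pairs cores 2k and 2k+1 are hypothetical choices for illustration.

```python
from typing import List


def static_dmr_pairs(healthy: List[bool]) -> int:
    """Usable DMR pairs when cores 2k and 2k+1 are bound at design time.

    A statically bound pair is usable only if both of its cores work.
    """
    return sum(
        1
        for i in range(0, len(healthy) - 1, 2)
        if healthy[i] and healthy[i + 1]
    )


def dynamic_dmr_pairs(healthy: List[bool]) -> int:
    """Usable DMR pairs when any two working cores may verify each other."""
    return sum(healthy) // 2


# Eight cores with cores 1 and 4 defective (an assumed fault pattern).
cores = [True, False, True, True, False, True, True, True]
static_pairs = static_dmr_pairs(cores)    # pairs (2,3) and (6,7) survive
dynamic_pairs = dynamic_dmr_pairs(cores)  # six working cores form three pairs
```

With this fault pattern, static binding loses two whole pairs to two defective cores, while arbitrary pairing loses only one.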

Support

This work is supported in part by NSF award CCF-0429922 and by equipment donated by Intel.