M-Machine Publications
Exploiting Fine-Grain Thread Level Parallelism on the MIT Multi-ALU Processor
Abstract
Much of the improvement in computer performance over the last twenty
years has come from faster transistors and architectural advances that
increase parallelism. Historically, parallelism has been exploited
either at the instruction level, with a grain size of a single
instruction, or by partitioning applications into coarse threads with
grain sizes of thousands of instructions. Fine-grain threads fill
the parallelism gap between these extremes by enabling tasks with run
lengths as short as 20 cycles. Because this fine-grain parallelism is
orthogonal to ILP and coarse threads, it complements both methods and
provides an opportunity for greater speedup. This paper describes the
efficient communication and synchronization mechanisms implemented in
the Multi-ALU Processor (MAP) chip, including a thread creation
instruction, register communication, and a hardware barrier. These
register-based mechanisms provide 10 times faster communication and 60
times faster synchronization than mechanisms that operate via a shared
on-chip cache. With a three-processor implementation of the MAP,
fine-grain speedups of 1.2 to 2.1 are demonstrated on a suite of
applications.