The superscalar design, with its huge centralised control structures such as result buses, seems to be hitting a design wall that it will not be able to pass, because wire delays prevent a signal from propagating across the whole chip within a single clock cycle. This effect is already visible in some superscalar designs, such as the Pentium 4, which uses several pipeline stages purely for signal propagation. Tiled architectures address this problem by composing a processor from several simple tiles (these might be small, simple processors with a cache, or just parts of a processor). While communication within a tile completes within a clock cycle, communication that crosses tile boundaries takes at least one cycle.
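The latency rule described above can be illustrated with a small model. This is only a sketch under stated assumptions: the text says nothing about how tiles are connected, so the 2D mesh layout, the hop-by-hop routing, the one-cycle-per-hop cost, and the names `Tile` and `min_latency_cycles` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    """Position of a tile on an assumed 2D mesh (hypothetical layout)."""
    x: int
    y: int


def min_latency_cycles(src: Tile, dst: Tile, cycles_per_hop: int = 1) -> int:
    """Lower bound on the extra cycles needed to move a value from src to dst.

    Assumption: a message travels between neighbouring tiles one hop at a
    time, and each tile boundary crossed costs `cycles_per_hop` cycles.
    Communication inside a tile costs no extra cycle.
    """
    hops = abs(src.x - dst.x) + abs(src.y - dst.y)  # Manhattan distance
    return hops * cycles_per_hop


if __name__ == "__main__":
    a = Tile(0, 0)
    print(min_latency_cycles(a, a))           # same tile
    print(min_latency_cycles(a, Tile(1, 0)))  # neighbouring tile
    print(min_latency_cycles(a, Tile(3, 2)))  # distant tile
```

Under this model the cost grows with the distance between tiles rather than with the size of the chip, which is why a structure that every tile must consult, such as a global bus, scales poorly on such a design.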
Several studies have already analysed the potential of tiled architectures for exploiting loop-level or instruction-level parallelism. However, thread-level parallelism, which builds on the popular shared-memory model, has been excluded so far. The problem is that the proposed architectures cannot implement a hardware cache coherence protocol and therefore shift the burden of preventing cache incoherence onto the programmer. I will try to present a distributed solution that allows the use of the shared-memory model without relying on global buses to ensure cache coherence.