Apple M1 – Back to the 90s

Apple presented its first computer processor based on the ARM architecture, the M1. The transition was announced a few months back, and it had been clear for some time that this was on the roadmap. What was a surprise, to me at least, was the power of the chip they announced: something that could rival Intel processors at a fraction of the power consumption. There are many articles out there explaining why this processor is so powerful; for me it brings a strong sense of déjà vu.

Of course, this is hardly the first architecture transition for Macintosh computers. While the most recent one in people’s minds is the switch from PowerPC to Intel x86, this reminds me more of the earlier one, from Motorola 68K to PowerPC. Basically, in 2020 Apple leveraged many ideas that were nice theories in the 90s, many of which hardware manufacturers tried back then without success. There are some differences, which are interesting, because they mean this time it might work.

The first concept that comes up is RISC (Reduced Instruction Set Computer): the idea was to tailor the instruction set for efficient execution, not for assembler coding, which nobody does. In theory these simpler processors could do more work with less silicon; this was a very sexy idea when I was at university. PowerPC was a RISC architecture, and so is ARM. The low component count is what allows ARM processors to consume little power, and is the reason this architecture became prevalent in the mobile space. The same RISC architecture is also what allows the M1 processor to have a very deep reorder buffer (ROB), which lets it handle many instructions in parallel.
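
One commonly cited reason a regular instruction set helps here is decoding: to keep a deep reorder buffer fed, the front end has to find many instruction boundaries per cycle. The sketch below is only a toy illustration in Python, not a model of any real decoder. It assumes fixed 4-byte instructions (as in 64-bit ARM) on one side and, on the other, a made-up variable-length encoding whose first byte gives the instruction length; the function names and byte streams are hypothetical.

```python
# Toy illustration only: the encodings below are made up, not real ARM or x86.


def fixed_boundaries(code: bytes, width: int = 4) -> list[int]:
    """Start offsets when every instruction is `width` bytes long.

    Each offset depends only on its index, so a decoder could compute
    all of them independently (in hardware: in parallel).
    """
    return list(range(0, len(code), width))


def variable_boundaries(code: bytes) -> list[int]:
    """Start offsets when the first byte of each instruction holds its length.

    The start of instruction n+1 is only known once instruction n has
    been examined, so this scan is inherently sequential.
    """
    offsets = []
    pos = 0
    while pos < len(code):
        offsets.append(pos)
        length = code[pos]  # hypothetical encoding: byte 0 = total length
        if length == 0:
            raise ValueError("zero-length instruction")
        pos += length
    return offsets


fixed_stream = bytes(16)  # four 4-byte instructions
variable_stream = bytes([2, 0, 3, 0, 0, 1, 4, 0, 0, 0])  # lengths 2, 3, 1, 4

print(fixed_boundaries(fixed_stream))        # [0, 4, 8, 12]
print(variable_boundaries(variable_stream))  # [0, 2, 5, 6]
```

With a fixed width every offset can be computed independently, whereas the variable-length scan has to walk the stream one instruction at a time. Real x86 decoders use additional tricks to soften this, so treat it as an intuition rather than a measurement.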

The second idea that was popular in the 80s and 90s was dedicated helper chips. It was already present in 8-bit computers: the Commodore machines had powerful graphics chips that could do a lot of things autonomously, and the Amiga took the idea to the extreme, with very complex chips to handle graphics and sound. The NeXTstation had a DSP, and so did the last 68K Macintoshes. The PC, in contrast, had a design where the central CPU did all the heavy lifting; floating point units started out as discrete components and only became common once they were integrated into the CPU.

The idea of helper chips did not disappear: Graphics Processing Units (GPUs), as typically found on graphics cards, gained in importance in PCs around the turn of the century, but they tended to be more distant from the CPU, with their own memory. Having a common memory system and minimising copies when moving data between CPU and GPU has been a goal ever since calculations were moved onto the GPU, and the M1 does that. Yet the Amiga already had a shared memory pool (the so-called chip RAM). One thing that is new is the presence of a unit dedicated to neural-network-style computations.
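
The cost of separate memories can be illustrated with a loose analogy in plain Python, with no GPU involved. In this sketch, which is a toy and not how any graphics driver works, `gpu_copy` stands for a discrete card holding its own copy of a buffer and `gpu_view` for a unit sharing memory with the CPU; all names are made up for the illustration.

```python
# Analogy only: the two "processors" are just two Python names here.

frame = bytearray(8)  # buffer produced by the "CPU", initially all zeros

# Discrete-memory style: the "GPU" gets its own copy of the data.
gpu_copy = bytes(frame)

# Shared-memory style: the "GPU" gets a view onto the very same bytes.
gpu_view = memoryview(frame)

frame[0] = 0xFF  # the "CPU" updates the buffer

print(gpu_copy[0])  # 0: the copy is stale and would have to be sent again
print(gpu_view[0])  # 255: the view sees the update without any transfer
```

Every update to copied data implies another transfer, while the shared view costs nothing extra; that is the kind of saving a common memory pool is after.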

So all these approaches were known in the past and failed in one way or another; what is different now? There are, in my opinion, two factors. The first is scale: the approach used by PCs ended up dominating because of their huge market share, which meant more resources to improve the components. This was enough to compensate for a less than optimal architecture. Mobile devices redefined the landscape: Apple sells nearly as many iPhones (230M/year) as there are PCs sold in total (260M/year).

A somewhat related factor is focus. The PowerPC alliance had three players with very different roles: IBM, which sold custom PPC chips for gaming consoles and supercomputers; Motorola, which mostly sold low-power embedded chips; and Apple, which wanted competitive desktop processors. Sony was also involved, with the Cell chips, which never got anywhere. Here the situation is simpler: Apple developed a processor to put into its own products.

The second factor is power consumption. PCs became very powerful by using a lot of electrical power. My Commodore 64 had no fan, nor did the early Macintoshes. Modern desktop PCs often have multiple fans and large heatsinks. This is kind of OK for desktop computers, but a major issue for laptops: not only do they require frequent charging, they also need to dissipate the heat somehow. Even for desktop computers there is not much headroom: 300 watts is a lot of heat to dissipate; double that, and your computer is basically a small heater.

Interestingly, my old NAS had a PowerPC processor; my current one has a quad-core ARM. Plus ça change…

Illustration of Apple’s M1 processor © Henriok – Creative Commons CC0 1.0 Universal Public Domain Dedication

5 thoughts on “Apple M1 – Back to the 90s”

Pierre Palatin on 2020/12/03

Fwiw, the reorder buffer is mostly there to allow out-of-order (ooo) execution. That is by no means specific to RISC – the Intel x86 line (which is usually the poster child for non-RISC) has had out-of-order execution since the mid-90s. The Wikipedia page ( https://en.wikipedia.org/wiki/Out-of-order_execution ) has some interesting details – for example, the PPC 601 was one of the first more mainstream processors with ooo execution, and also powered the transition of Macs to PPC.

These days, any processor worth its salt for single-core performance has ooo execution. Intel did try to do without a while ago – the first generations of Itanium did not have ooo exec, on the assumption that compilers would be smart enough to leverage instruction-level parallelism (ILP) – and that failed. The SPEs of the Cell did not have out-of-order execution, and it was known for neither its convenience nor its power (though that is unlikely to be attributable only to this; the Cell was quite an interesting experiment). GPUs are different beasts – there is certainly no ooo directly on each individual unit, but the scheduling across those units is getting more and more dynamic (though I don’t know much about that).

Also, nowadays RISC vs CISC is not really that meaningful anymore. In practice, if you look at ARM, the instruction set is not particularly simple anymore. And all x86 implementations are in practice built on top of a simpler internal instruction set (i.e., more RISC-like, with more registers). I guess one can say that it is this underlying internal arch which actually allows for ooo exec in x86 :p

- Thias on 2020/12/03
  
  My understanding is that ROBs are easier to implement and scale when the instruction set is regular, i.e. you can determine in advance where the instruction boundaries are. Having a fixed instruction length (à la RISC) makes this easier.
  
  - Pierre Palatin on 2020/12/03
    
    Sure, and Intel processors have been using a low-level internal instruction set (IS) to make things easier for ooo since at least the Pentium Pro. Also, afaik, the ARM IS is not that straightforward – at least the Thumb IS is more imaginative when it comes to instruction length (but I don’t know much about ARM, nor what the M1 supports).
    
    The 2 layers of Intel chips had the unexpected advantage of also allowing processor behavior to be patched. Given the complexity of processors nowadays, if I were to completely speculate, I would imagine that most complex processors (incl. M1 or PowerPC) have something similar anyway. All in all, the IS (incl. decoding) really does not matter that much – hence CISC/RISC being quite a fuzzy thing.
    
    Now, the performance of a computer these days is largely dependent on things other than just instruction processing – i.e., what we often think of as the processor. There was a fun example of that on one line of Chromebooks, where the one with the most underpowered processor was actually the most responsive in actual use. And to take advantage of that, custom chips tailored to your use case and system help a lot – I would expect that to be Apple’s main motivation. But for the instruction set, Intel basically does not offer licences for x86. From there, Apple had the choice of either negotiating hard with Intel, or going with another instruction set. Given that Apple is not afraid of breaking compatibility, going for ARM was probably a no-brainer. I would not expect performance to have been a big factor for the IS here though.
    
    (now don’t get me wrong – I’m all in favor of simpler instruction sets. Alpha4ever. Even if it does not suddenly make things all better, having the right abstractions tends to help in unexpected ways. The counter to death by a thousand cuts, in a sense)
    
    - Thias on 2020/12/03
      
      I doubt negotiating with Intel to use the x86 instruction set was ever a consideration. The M1 processor is basically a beefed-up version of the A14 present in iPhones and iPads. Apple has been producing its own mobile processors since the A4, 10 years ago. The performance curves just happened to cross. The architecture choice goes back to the first iPod (PortalPlayer 5002).
      This migration also means Apple only has to handle one architecture across iOS and Mac OS X, which simplifies porting between the two.
      
Antoine Boegli on 2020/12/04

We started from here: https://debugger.medium.com/why-is-apples-m1-chip-so-fast-3262b158cba2

Thias の blog

Probablement n'importe quoi…
