For many years, Moore’s law accurately predicted improvements in CPU performance: you could expect chip performance to roughly double every 18 months. Manufacturing improvements brought smaller transistors, allowing more of them to fit on a die of the same size. Today, chips with structures as small as 10 nm are available, and foundries are preparing to introduce 7 nm technology.
As transistors continue to shrink, new challenges appear that make it harder for the industry to keep progressing at this pace. Fortunately, there are other ways to increase the performance of a system without increasing cost or power consumption.
For example, instead of using one universal computer architecture for all the tasks in an embedded system, a variety of specialized hardware components can be used. Consider the common task of rendering graphics. While CPUs are capable of rendering graphics, graphics processing units (GPUs), built specifically for this purpose, handle the task more efficiently.
ARM systems on chip (SoCs) take full advantage of this concept, combining CPUs with additional hardware accelerators. Typically, such systems contain a GPU, video encoders and decoders, audio processors, cryptographic accelerators, and other specialized hardware. Many different SoC combinations are available from a range of manufacturers.
Some accelerators inside SoCs, like video decoders, have been commonplace for many years; they allow video playback with very little CPU load. Other components and concepts are newer, like ARM’s big.LITTLE concept. In this case, a single SoC contains different Cortex-A CPU cores: some optimized for maximum performance, like the Cortex-A72, and others optimized for low power consumption, like the Cortex-A53. All of them are Cortex-A application processors, so program code can be moved seamlessly from one core to another, depending on the workload.
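On Linux, one simple way to influence this placement is CPU affinity. The following is a minimal sketch, assuming a hypothetical board where CPUs 4 and 5 are the big Cortex-A72 cores (the actual numbering is board-specific and can be read from the device tree or sysfs); it pins the calling process to the performance cores using the standard sched_setaffinity() call:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);

    /* Assumption: on this hypothetical board, CPUs 4 and 5 are the
     * Cortex-A72 "big" cores. Check /sys/devices/system/cpu/ or the
     * device tree for the real topology of your SoC. */
    CPU_SET(4, &set);
    CPU_SET(5, &set);

    /* Restrict the calling process to the performance cores. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }

    printf("Now running on the big cores only.\n");
    /* ... run the demanding workload here ... */
    return EXIT_SUCCESS;
}
```

The same mechanism works in reverse: a background task can be pinned to the low-power Cortex-A53 cores so the big cores stay idle and can be power-gated.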
Other SoCs, like the NXP i.MX7, take it a step further and deploy different classes of ARM cores. On the i.MX7, two ARM Cortex-A7 application processor cores handle the general computing load. These cores are ideal for running an operating system like Linux, with or without a graphical user interface. The SoC also contains an ARM Cortex-M4 microcontroller core. This core is less powerful than the A7, but due to its simpler architecture it draws less power, and its execution time is more predictable, with more deterministic latency. This makes it well suited to running a real-time operating system (RTOS) that handles critical real-time tasks. Traditionally, assembling this kind of architecture required an external microcontroller, which complicates the hardware design. Alternatively, the real-time task can share the A7 cores with the rest of the system, but technologies like the Linux RT patch or hypervisors require trade-offs.
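In practice, Linux on the A7 cores and the RTOS firmware on the M4 need a way to exchange messages, typically via the RPMsg framework. As a minimal sketch, assuming a BSP that exposes the RPMsg channel as a tty device (the node name /dev/ttyRPMSG and the "ping" protocol here are assumptions; both vary between BSPs and firmware), a Linux application could talk to the M4 firmware like this:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Assumption: the BSP exposes the RPMsg channel to the Cortex-M4
     * as a tty device; the exact node name depends on the BSP. */
    int fd = open("/dev/ttyRPMSG", O_RDWR);
    if (fd < 0) {
        perror("open /dev/ttyRPMSG");
        return 1;
    }

    /* Send a command to the firmware running on the Cortex-M4. */
    const char *cmd = "ping\n";
    write(fd, cmd, strlen(cmd));

    /* Read the M4's reply. */
    char reply[64];
    ssize_t n = read(fd, reply, sizeof(reply) - 1);
    if (n > 0) {
        reply[n] = '\0';
        printf("M4 replied: %s\n", reply);
    }

    close(fd);
    return 0;
}
```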
Another interesting use case for this architecture is low-power IoT applications. The higher-performance Cortex-A7 cores can be switched off while the low-power M4 core continues to monitor the environment and wakes up the rest of the system when required. This strategy combines the advantages of a microcontroller with the versatility of an application processor.
This form of heterogeneous multi-core system is becoming increasingly popular. The upcoming NXP i.MX 8QuadMax will contain two Cortex-A72 cores, four Cortex-A53 cores, and two Cortex-M4 cores, and other SoC providers have produced conceptually similar designs.
This kind of ARM SoC is a good fit for many embedded applications due to its efficiency. However, integrating such an SoC requires several external components, such as high-speed DDR RAM, flash storage, Ethernet PHYs, and complex power management circuits to power the different hardware accelerators and cores separately. This increases the initial development cost and risk of a project, which hits small- and medium-volume products the hardest.
System on modules (SoMs), such as the Toradex Colibri and Apalis families of SoMs, provide the latest ARM SoCs in an easy-to-use package. These modules include all the common external components that accompany a SoC.
It’s important to note that hardware alone is insufficient to take advantage of a heterogeneous system architecture. Most software is designed to run on a single CPU architecture rather than leveraging the multiple hardware architectures of a heterogeneous SoC. To get the most out of a given SoC, the software must be optimized for that particular system.
Toradex maintains a large software team that optimizes and maintains the board support packages (BSPs) provided by SoC vendors, including optimizing them to use the available hardware features to accelerate embedded computing workloads. The system-on-module concept allows Toradex to provide these optimizations to many different customers, since they all use the same module. This lets you take advantage of a highly optimized BSP even if your project doesn’t ship 100,000 units.
Toradex also invests in educating customers to make sure they can understand and take advantage of these new computing architectures in their own designs and software development, including how to use a heterogeneous multi-core system combining ARM Cortex-A and Cortex-M cores.
Another application of heterogeneous system architecture that is rapidly gaining popularity is general-purpose computing on graphics processing units (GPGPU), as many modern SoCs come with powerful GPUs. Because rendering graphics is easily parallelized, GPUs contain many relatively simple processing cores, and many other tasks can be parallelized to run on these cores as well. NVIDIA’s CUDA and the open standard OpenCL are the most common frameworks for programming GPUs for general-purpose tasks.
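To illustrate the programming model rather than any particular vendor’s stack, the following sketch adds two vectors on the GPU through the OpenCL C API. It assumes an OpenCL driver and ICD are installed (link with -lOpenCL), and error handling is trimmed for brevity:

```c
#include <CL/cl.h>
#include <stdio.h>

#define N 1024

/* The kernel runs once per work-item; each work-item adds one element. */
static const char *src =
    "__kernel void vadd(__global const float *a,\n"
    "                   __global const float *b,\n"
    "                   __global float *c) {\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void)
{
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Pick the first platform and its first GPU device. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Compile the kernel source at runtime for this device. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", NULL);

    /* Copy the input arrays into device buffers. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof(a), a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof(b), b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, NULL);

    clSetKernelArg(k, 0, sizeof(da), &da);
    clSetKernelArg(k, 1, sizeof(db), &db);
    clSetKernelArg(k, 2, sizeof(dc), &dc);

    /* Launch one work-item per array element, then read back the result. */
    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);

    printf("c[42] = %.1f (expected %.1f)\n", c[42], a[42] + b[42]);

    clReleaseMemObject(da); clReleaseMemObject(db); clReleaseMemObject(dc);
    return 0;
}
```

The same pattern scales from a thousand additions to the millions of multiply-accumulate operations in a neural network layer, which is precisely what makes GPGPU attractive for the workloads discussed next.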
One technology that can benefit greatly from GPGPU-accelerated computing is neural networks. Deep neural networks are currently a topic of very high interest in AI and machine learning, and the technology has already had some high-profile successes. DeepMind made headlines when its AI beat the best human players at Go, an ancient board game originating in East Asia. All current self-driving car technologies involve deep neural networks. The technology also powers voice assistants like Siri, Alexa, Watson, and Cortana, and is found in many other applications. It is expected to impact many areas of embedded computing and, more generally, many areas of our lives.
NVIDIA’s Tegra line of SoCs, including the TK1, TX1, and TX2, all feature large GPUs to accelerate this kind of task. The upcoming NXP i.MX 8QuadMax SoC features two dedicated GPUs with compute capabilities; for example, one GPU can be dedicated entirely to running a deep neural network while the other handles user interface tasks and image preprocessing.
ARM has announced the ARM Compute Library, which takes advantage of heterogeneous compute resources such as NEON SIMD units and GPUs to improve the performance of deep neural network frameworks such as Google’s TensorFlow. As expected, several companies are going a step further and creating new hardware accelerators built specifically for running neural networks. Notable examples include Google’s Tensor Processing Unit (TPU) and NVIDIA’s Deep Learning Accelerator (DLA), which will be open source. Microsoft has also just announced a coprocessor for accelerating deep neural networks for the HoloLens.
As you can see, the industry is adapting to new processing workloads. Despite Moore’s law slowing down, innovation in other areas promises a steady stream of exciting new developments.