Sunday, July 25, 2010

About the Processor in your pocket

Do you know what is working inside your mobile?

It is ARM.

ARM processors are used in nearly every mobile phone and every PDA made. There are lots of them in every car, running systems like airbags, fuel injection and ABS; in fact, they are embedded in most of the world's electronic devices. Professor Stephen Furber is the principal designer of the ARM 32-bit RISC microprocessor, found in most handheld electronic devices and in more than 98% of the world's mobile phones. The development of the fast, energy-efficient 32-bit processor 25 years ago unlocked the world of consumer electronics, and to date more than 18 billion ARM-based chips have been manufactured.

The ARM is a 32-bit reduced instruction set computer (RISC) instruction set architecture (ISA) developed by ARM Holdings. It was known as the Advanced RISC Machine, and before that as the Acorn RISC Machine. If the number of different instructions a microprocessor can execute is reduced, it can execute each of them more quickly. This makes a big difference, as each instruction takes fewer clock cycles (usually just one on the ARM). The processor is also much simpler, so it is cheaper to make and uses less power.

The ARM architecture includes the following RISC features:

  • Uniform 16 × 32-bit register file.
  • Load / store architecture.
  • No support for misaligned memory accesses (now supported in ARMv6 cores, with some exceptions related to load/store multiple word instructions).
  • Fixed instruction width of 32 bits to ease decoding and pipelining, at the cost of decreased code density. Later, "Thumb mode" increased code density.
  • Mostly single-cycle execution.

To compensate for the simpler design, compared with contemporary processors like the Intel 80286 and Motorola 68020, some additional design features were used:

  • Conditional execution of most instructions, reducing branch overhead and compensating for the lack of a branch predictor.
  • Arithmetic instructions alter condition codes only when desired.
  • 32-bit barrel shifter which can be used without performance penalty with most arithmetic instructions and address calculations.
  • Powerful indexed addressing modes.
  • A link register for fast leaf function calls.
  • Simple, but fast, 2-priority-level interrupt subsystem with switched register banks.

Pipelines and other implementation issues

The ARM7 and earlier implementations have a three-stage pipeline: fetch, decode and execute. Higher-performance designs have deeper pipelines: the ARM9 has five stages, and the Cortex-A8 has thirteen.

The architecture provides a non-intrusive way of extending the instruction set using "coprocessors" which can be addressed using MCR, MRC, MRRC, MCRR, and similar instructions.

Wednesday, July 14, 2010

Freescale Introduces ARM based MCUs

Freescale has been known for a range of devices in the microprocessor category. At the Freescale Technology Forum on 22 June 2010, the company announced that it has adopted the ARM family to offer 32-bit embedded microcontrollers. The Kinetis family of 90-nanometer (nm) 32-bit MCUs will be based on the Cortex-M4 processor. "Kinetis represents one of the most scalable portfolios of low power, mixed signal ARM Cortex-M4 processor-based MCUs in the industry," says the announcement.

Sampling is set for 3 to 6 months out, and volume production would probably begin about this time next year.

Monday, July 12, 2010

TI Da Vinci series Video Processor

Video is everywhere! There is a video camera in everybody's mobile phone! Well, almost. Security devices like video doorbells and remote monitoring use video too. So there is a need for processing video in the most unlikely places, and you need to be able to do it with minimal real estate on the device circuit board. With fast changes happening in video technology, and H.264 (MPEG-4 AVC) becoming so common, you need devices capable of processing it. With HD and the push to get 1080p video everywhere, the processing loads are enormous.

Texas Instruments has developed a family of devices that are pushing the frontier in video processing with SoCs, systems on chip, in its signal-processing line. TMS320DMxxx are the typical part numbers in the family, DM standing for digital media applications.

One such recent device is the TMS320DM365. This page has a download link for the data sheet; take a look to familiarize yourself with what a device like this may contain. The DM365 comes in three variations: 216, 270 and 300 MHz devices.

Besides the CPU, an ARM9 RISC processor, these devices have hardware support for video processing and accelerators for typical functions, including a video resizer, on-screen display (OSD), previewer and a hardware statistics-collection module. Two coprocessors offload a lot of video processing from the main CPU and make video codecs, including H.264, available to your application. An interface for analog cameras eases applications that use cameras with analog video out. The block diagram, in two parts below, will give you a quick look at the kind of resources available on the SoC.










Saturday, July 10, 2010

First Textbook On Programming Massively Parallel Processors

Students of advanced architecture need to read this book, released in January this year. The massively parallel processor architecture that this book talks about is implemented in silicon by NVIDIA's CUDA architecture. The book, "Programming Massively Parallel Processors: A Hands-on Approach", launched on January 28, 2010. It is written by Dr. David B. Kirk, NVIDIA Fellow and former chief scientist, and Dr. Wen-mei Hwu, who serves at the University of Illinois at Urbana-Champaign as Chair of Electrical and Computer Engineering in the Coordinated Science Laboratory, co-director of the Universal Parallel Computing Research Center and principal investigator of the CUDA Center of Excellence. A joint effort by academia and industry.


Visit the microsite for more details. The site also has commentary by the authors about the book. The book is available to purchase directly from Elsevier or Amazon.

Thursday, July 8, 2010

CUDA, The Computing Unified Device Architecture

We discussed the CPU vs. the GPU in the last post here. What came out was that NVIDIA's CUDA architecture is gaining quite a reputation. So what is it? The process flow is as below, courtesy of Wikipedia.

Read about the architecture in detail here. The purpose of this post is to point out that when identical operations are to be carried out on hundreds or thousands of execution units, this works very well. Remember, we pointed out that this is great for SIMD-style computation. The GPU memory ensures that data is supplied to all the cores/graphics processors/execution units. The controlling CPU supplies the instructions, which apply to all of the cores. After the computation is done, results are returned to main memory through the GPU memory.

Saturday, July 3, 2010

CPU vs. GPU

There has been some debate in the industry about the right way to go about increasing computer performance significantly. The debate is about whether multi-core processors are going to deliver that, or whether a handful of GPUs, or graphics processing units, working with general-purpose processors will. With multi-core CPUs, the overlaps, and hence throughput improvements, happen at the application or perhaps task level. Multithreading can improve the granularity of the overlaps further.


The recent controversy was started by Intel. The promoter of the multi-core CPU fired the first salvo with the paper "Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU", which essentially said that the claims made by companies like NVIDIA of performance increases in the hundreds are not quite correct. According to this paper, the performance increase of the NVIDIA GeForce GTX 280 (an older-generation device, by the way) over the Core i7 960 was only 14x rather than 100x or more.


Andy Keane of NVIDIA contests this in his blog, saying that the company has delivered improvements in the 100x-and-more range, and he provides a list of companies that achieved them. Both parties agree that the speedup depends on the application being run.


At a basic level, GPUs are SIMD devices where the overlap, as discussed in the introduction of this post, can be at the data level, which provides the opportunity for large increases in throughput. The multi-core CPU, on the other hand, fits the MIMD model, so its speedup depends on how the applications can be executed in parallel. We are not comparing apples to apples here!