Parallel Ultra Low Power Embedded System

João Pedro Alves Vieira

Thesis to obtain the Master of Science Degree in

Electrical and Computer Engineering

Supervisor(s): Prof. Aleksandar Ilic
Prof. Leonel Augusto Pires Seabra de Sousa

Examination Committee

Chairperson: Prof. Gonçalo Nuno Gomes Tavares
Supervisor: Prof. Leonel Augusto Pires Seabra de Sousa

Member of the Committee: Prof. Paulo Ferreira Godinho Flores

December 2017


Acknowledgments

First of all, a special thank you goes to my family and closest friends, who supported me along this journey, especially when it got tough.

I would like to thank Professor James C. Hoe and Professor Peter Milder from Carnegie Mellon University, who were tireless in helping to debug a major issue that was found.

I would also like to thank my supervisors for their guidance and insights.


Resumo

The future of the portable electronics market will be built around the Internet of Things, where everyday objects will be connected to the internet and possibly controlled by other devices. These devices have started to appear in our daily activities, with examples such as health monitors, light bulbs, thermostats and fitness wristbands, and they are expected to grow substantially in the near future. Most of these wireless devices with sensors depend on batteries, for which an energy-efficient mode of operation is essential, achieved by developing devices with architectures capable of meeting both low-power and real-time performance needs. This Thesis aims to improve the energy efficiency of a low-power processor, namely PULPino. To achieve this, hardware accelerators were attached to it in a modular fashion, with the objective of encouraging the development of new accelerators by the open-source community. To test the viability of this approach, two different kinds of accelerators were individually attached. The first is a cryptographic SHA-3 accelerator, which implements a hash algorithm and can improve the security of IoT devices. The second is an FFT accelerator, widely used in digital signal processing applications. Both accelerators were tested on PULPino with respect to their speedup and energy-efficiency capabilities, achieving energy savings of up to 99% and 66% and speedups of 185 and 3 times for SHA-3 and FFT respectively, in comparison to a non-accelerated version of the algorithms executed on PULPino with a RI5CY core.

Keywords: Internet of Things, Power Consumption, Embedded System, Energy Efficiency.


Abstract

The future of the portable electronics market will be built around the Internet of Things (IoT), where everyday objects will be connected to the internet and possibly controlled by other devices. In fact, examples of these devices, such as health monitors, light bulbs, thermostats and fitness wristbands, have already started to take part in our daily activities and are expected to experience tremendous growth in the near future. Most of these devices rely on battery-powered wireless transceivers combined with sensors, where it is essential to sustain energy-efficient execution by developing device architectures capable of delivering both low power consumption and real-time computing performance. Within the scope of IoT applications, this Thesis aims to boost the energy-efficiency of a state-of-the-art ultra-low-power processor, namely PULPino. This challenge was tackled by modularly attaching hardware accelerators to it. They connect to PULPino through a low-power, plug-n-play custom AXI-lite interface, with the objective of encouraging the development of new accelerators by the growing PULPino open-source community. To test the viability of this approach, two kinds of accelerators were individually attached: first, a cryptographic SHA-3 accelerator, implementing a commonly used hash algorithm that could improve the security of IoT applications; and second, an FFT accelerator, implementing an algorithm widely used in Digital Signal Processing (DSP) applications. Both accelerators were tested on PULPino for their speedup and energy-efficiency capabilities, achieving energy savings of up to 99% and 66% and speedups of 185 and 3 times for SHA-3 and FFT respectively, in comparison to a non-accelerated software version of the algorithms executed on the PULPino RI5CY core configuration.

Keywords: Internet of Things, Ultra-low-power, Embedded System, Energy-Efficiency.


Contents

Resumo
Abstract
List of Figures
Glossary
1 Introduction
1.1 Motivation
1.2 Main Objectives
1.3 Main Contribution of this Thesis
1.4 Outline
2 Background
2.1 State-of-the-Art: PULP - Parallel Ultra Low Power Platform
2.2 PULPino
2.3 Additional PULPino's Core Configurations
2.4 Interconnect Networks
2.4.1 Cache Coherent Interconnect for Accelerators (CCIX)
2.4.2 GEN-Z
2.4.3 Open Coherent Accelerator Processor Interface (OpenCAPI)
2.4.4 Standards Comparison
2.5 Hardware Accelerators
2.6 Summary
3 Hardware/Software Co-design
3.1 AXI Protocol
3.1.1 AXI Interconnect
3.2 Overall System Architecture
3.2.1 Hardware Interface
3.2.2 Software Interface
3.3 Hardware Accelerators
3.4 Summary
4 Implementation and Experimental Work
4.1 Target Device
4.2 System Configuration
4.3 New AXI Interconnect Slave
4.4 New Accelerator
4.5 Summary
5 Experimental Results
5.1 Software vs Hardware
5.1.1 SHA-3
5.1.2 FFT
5.2 Power Efficiency
5.2.1 SHA-3
5.2.2 FFT
5.3 Summary
6 Conclusions and Future Work
References
A Software-only Algorithms
A.1 SHA-3
A.2 FFT

List of Figures

2.1 PULP cluster with 4 cores
2.2 Comparison between RI5CY and ARM's Cortex-M4
2.3 RISC-V pipeline
2.4 LSU Software vs Hardware
2.5 Shuffle instruction diagram
2.6 Area breakdown of three core configurations
2.7 Energy consumption comparison between three core configurations
2.8 Use cases of CCIX
2.9 Comparison between typical CPU-memory interface and Gen-Z Media Controller
2.10 Gen-Z architecture aggregating different types of media devices
2.11 Comparison of CCIX, Gen-Z and OpenCAPI main features
2.12 Comparison between SPIRAL generated design and Xilinx LogiCore FFT v4.1
3.1 PULPino's SoC block diagram
3.2 PULPino's memory map
3.3 AXI4 node overview
3.4 PULPino with attached accelerators block diagram
3.5 SHA-3 kernel overview architecture
3.6 SHA-3 padding module's architecture
3.7 SHA-3 permutation module's architecture
3.8 SHA-3 accelerator data path
3.9 SPIRAL Fast Fourier Transform (FFT) iterative architecture
3.10 SPIRAL Fast Fourier Transform (FFT) fully streaming architecture
3.11 FFT accelerator's data path
4.1 Xilinx Zynq-7000 SoC block diagram overview
4.2 Implementation block diagram
5.1 SHA-3 computation speedup using hardware accelerator
5.2 FFT computation speedup using hardware accelerator
5.3 SHA-3 computation power versus energy ratio
5.4 SHA-3 accelerator energy saved at multiple frequencies
5.5 FFT accelerator, dynamic and static on-chip power consumption
5.6 FFT accelerator, static on-chip power consumption
5.7 FFT accelerator, energy saved vs computation time
5.8 FFT accelerator, computation energy vs energy ratio (SW/HW)

Acronyms

AMBA Advanced Microcontroller Bus Architecture.

APB Advanced Peripheral Bus.

AXI Advanced eXtensible Interface.

CCIX Cache Coherent Interconnect for Accelerators.

CNN Convolutional Neural Networks.

DCT Discrete Cosine Transforms.

DMA Direct Memory Access.

DSP Digital Signal Processing.

DVFS Dynamic Voltage and Frequency Scaling.

FFT Fast Fourier Transform.

FIR Finite Impulse Response.

FPU Floating Point Unit.

FSBL First-Stage Boot Loader.

I2C Inter-Integrated Circuit.

IoT Internet of Things.

IPC Instructions per Cycle.

LANs Local Area Networks.

LNU Logarithmic Number Unit.

LSU Load-Store Unit.

LUTs Look-Up Tables.


MCU Microcontroller.

NoCs Networks-on-chip.

OpenCAPI Open Coherent Accelerator Processor Interface.

PL Programmable Logic.

PS Processing System.

PULP Parallel Ultra Low Power Platform.

PWM Pulse Width Modulation.

SAIF Switching Activity Interchange Format.

SANs System/Storage Area Networks.

SHA Secure Hash Algorithm.

SoC System on Chip.

SPI Serial Peripheral Interface.

ULP Ultra-Low-Power.

WANs Wide Area Networks.


Chapter 1

Introduction

In order to satisfy the growing demands of the current consumer electronics market, it is estimated that there will be around 50 billion Internet-connected devices ("things") by 2020. Every year, Internet of Things (IoT) semiconductor device revenues increase by more than 30%, compared to the total semiconductor revenue growth rate of about 5.5% [1]. A large part of IoT relies on battery-powered wireless transceivers combined with sensors, usually designated motes (short for remote), enabling an interaction with the environment. Such devices demand ultra-low-power circuits and are usually controlled by a Microcontroller (MCU) responsible for sensor interaction (i.e., for gathering the data) and light-weight processing. In the IoT topology, besides the basic motes that in their most primitive form gather data, there are also end nodes that can be configured as gateways for the motes, gathering data from them and possibly pre-processing it before sending it to the cloud; as a consequence, they reduce the amount of data to be communicated. One solution for end nodes, PULP, is presented in Section 2.1 of this Thesis; it is able to satisfy both the computing and energy-efficiency requirements of IoT applications by taking advantage of parallelism. Building on the PULP solution, the main goal of this Thesis is its further simplification into a more basic unit (PULPino) that fits the motes' requirements, providing an adequate substitute for the MCU with the same kind of functionalities (PWM, timers, SPI, I2C, etc.).

IoT nodes address a wide range of applications, which might be optimized in different ways to achieve enhanced energy-efficiency and performance. One approach that fits this goal is to attach hardware accelerators to the IoT node. In this Thesis, a set of well-known application types from which an IoT node might benefit was carefully chosen from the Digital Signal Processing (DSP) and cryptography research areas. DSP applications include a wide variety of kernels which are recursively executed and might require considerable computational power. Cryptography on IoT systems has been a hot topic in recent years: due to the reduced computational capabilities and low-power characteristics of such systems, achieving the required security can be challenging. Higher levels of security in such embedded systems require not only a reduction in the computation time of cryptographic algorithms, but also a reduction in the associated energy consumption.


To tackle these challenges and take a step further towards the objective of boosting the energy-efficiency of an ULP system targeting IoT nodes, open-source kernels were integrated into accelerators, one for each application (DSP [2] and cryptography [3]), following a modular approach. An interface between the accelerators and the processor was developed, so that all accelerators attach in the same way and follow the same communication protocol with the processor. This approach encourages the open-source community to develop and share their own custom accelerator IPs, continuing the path towards open-source hardware that the release of PULPino, presented further ahead, was intended to start.

1.1 Motivation

In general, IoT nodes can be characterized by a combination of different features and requirements, such

as power constraints, size, communication, processing capabilities and availability. Most use cases of

IoT nodes' applications require Ultra-Low-Power (ULP) operation and high energy efficiency [4].

A multi-core energy-efficient platform appears as a good solution to address this issue, with the purpose of satisfying the computational requirements of recent IoT applications, which demand flexible processing of data originating from several sensors such as heartbeat monitors, ECG, accelerometers, cameras and microphones [4]. The multi-core platform presented here (PULP) was partially released as open-source hardware, by making available one of its single cores, enabling anyone to develop on top of it. Due to their intrinsic low-power profile, the released cores are suited for IoT applications, matching the restricted energy and hardware flexibility requirements of IoT nodes.

1.2 Main Objectives

The main objective of this Thesis is to investigate and further exploit the capabilities of PULPino [5], an ultra-low-power core capable of working within power envelopes of a few mW, in the emergent area of IoT. The final goal of this Thesis is to boost the energy-efficiency of PULPino-based systems, satisfying IoT constraints and requirements.

The existing approaches in the literature are mainly focused on exploiting the capabilities of the PULP platform [6] for different purposes, such as computer vision applications [7–9], extensions and hardware acceleration [10–13] and heterogeneous programmable accelerators [14]. However, there are no existing scientific works that focus on PULPino (a single-core derivation of PULP), especially for IoT purposes. The work proposed in this Thesis aims at closing this gap.

This Thesis differs from the state-of-the-art by developing modular and easily attachable hardware accelerators that further boost the energy-efficiency and performance of PULPino-based systems. The development of each accelerator is based on the integration of an open-source kernel for the target application. For the kernels to communicate with the processor, an interface was developed and defined as their communication protocol. This implementation requires extra hardware that wraps each kernel into one top-level module, the accelerator. In the same way that PULPino encourages open-source hardware, this modular approach to PULPino accelerators stimulates the open-source community to further enhance and develop accelerators for different kinds of applications. Each user can test and deploy the accelerator that best fits their application, which may spark the start of an open-source library of accelerators for a wide variety of IoT applications.

1.3 Main Contribution of this Thesis

The main contribution of this Thesis is the development of an accelerator-processor interface targeting low-power embedded devices. The interface is based on the AXI-lite protocol, which is compatible with AXI-based buses, widely used in embedded hardware designs. The interface provides an easy, plug-n-play manner of deploying an accelerator onto the processor's AXI bus. It does not require any further configuration besides the connection of the standard AXI signals, enabling faster and less painful custom hardware design and development targeting PULPino-based systems.
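From the software side, such a memory-mapped AXI-lite accelerator is typically driven by writing and reading a handful of registers. The following minimal C sketch illustrates the idea; the base address, register offsets and bit meanings are hypothetical, chosen only to show the access pattern, not the actual register map of the accelerators developed in this Thesis.

#include <stdint.h>

/* Hypothetical register map of an AXI-lite mapped accelerator. */
#define ACC_BASE   0x1A400000u  /* assumed slave base address          */
#define ACC_CTRL   (*(volatile uint32_t *)(ACC_BASE + 0x00))
#define ACC_STATUS (*(volatile uint32_t *)(ACC_BASE + 0x04))
#define ACC_DATAIN (*(volatile uint32_t *)(ACC_BASE + 0x08))
#define ACC_RESULT (*(volatile uint32_t *)(ACC_BASE + 0x0C))

uint32_t acc_run(const uint32_t *in, int n)
{
    for (int i = 0; i < n; i++)
        ACC_DATAIN = in[i];      /* stream the input words over the bus */
    ACC_CTRL = 1;                /* start the computation               */
    while ((ACC_STATUS & 1) == 0)
        ;                        /* poll until the done bit is set      */
    return ACC_RESULT;           /* read back one result word           */
}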

1.4 Outline

This Thesis is structured as follows. Chapter 2 presents the state-of-the-art approaches together with a detailed description of the PULPino core's architecture, i.e., the fundamental topics for the scope of this Thesis. Chapter 3 provides details about the hardware/software co-design and the overall system architecture used to attach and integrate the hardware accelerators. Chapter 4 provides details about the experimental work, the system's configuration and how to integrate and deploy an accelerator in the targeted platform. Chapter 5 presents an analysis of the results obtained from the experimental work. Finally, Chapter 6 draws the final conclusions and discusses future work.


Chapter 2

Background

This chapter presents the state-of-the-art, featuring relevant scientific works and the main characteristics of PULP, a novel cluster platform intended to be released as open-source hardware in 2018. Since PULPino represents a small part of PULP (one core), has already been released, and is the main focus of this Thesis, its most relevant features are subsequently described. Additionally, the core configurations released in August 2017 are also featured. Since the goal of this Thesis is the deployment of hardware accelerators on the PULPino platform, which interfaces through an interconnect bus, state-of-the-art interconnect networks specially designed for such applications are addressed. Hardware acceleration is covered in the last section of this chapter, featuring an overview of state-of-the-art hardware accelerator implementations related to the kinds of applications targeted in the scope of this Thesis.

2.1 State-of-the-Art: PULP - Parallel Ultra Low Power Platform

PULP is a joint project between the Integrated Systems Laboratory (IIS) of ETH Zurich and the Energy-efficient Embedded Systems (EEES) group of UNIBO. This project aims to develop an open-source, scalable hardware and software platform with the objective of breaking the pJ/operation barrier within power envelopes of a few mW [15]. It supports OpenMP, OpenCL and OpenVX, as presented in [6], thus enabling easier development of parallel algorithms, and it overcomes the power constraints of battery-powered applications, which are restricted to a power envelope of a few mW. PULP's architecture is tuned for efficient near-threshold operation, being optimized for 28nm UTBB FD-SOI technology, which provides an extended range of supply voltage, body bias and improved electrostatic control [6].

A PULP cluster embeds a configurable number of RISC-V based cores with a shared instruction cache and scratchpad memory. Figure 2.1 illustrates a PULP cluster with 4 cores [5]. It has already been taped out with OpenRISC based and RISC-V based cores, and achieves an energy efficiency of 193MOps/mW in a 28nm FDSOI technology [16–18].

Figure 2.1: PULP cluster with 4 cores and 8 TCDM banks of 8kB SCM and 8kB SRAM each, and a shared instruction cache of 4kB [5].

It is also possible to scale the system, adjusting it to the computing demands and power consumption of the application by clock-gating the SRAM blocks and the individual cores, whose number can be reduced down to a single core. The scaling is controlled by the power manager unit of the

cluster and is tightly coupled with dynamic voltage and frequency scaling (DVFS). As a result, the

performance of a 28nm FDSOI implementation can be adjusted up to 2GOPS by scaling the voltage

from 0.32V to 1.15V with the cores operating at 500MHz.

The PULP cluster is perfectly suited for IoT endpoint devices due to its efficiency and low power con-

sumption while still keeping high computational power [5]. Since only the individual cores are available at the moment under an open hardware license and their architecture is already optimized for ULP, they are suitable for IoT remote nodes that do not require as much computational power as an endpoint device.

Although PULP represents a relatively recent platform (introduced in 2014), it has been the subject of several

scientific works targeting different application areas and architecture extensions, as referenced further

ahead. In the following text, some of the most relevant state-of-the-art works are described.

Computer Vision Applications

PULP has been used in a set of applications regarding energy-efficient computer vision by taking advan-

tage of its parallel computing and support for OpenMP [19]. In [7], as a use case for PULP, it is shown

that a computationally demanding vision kernel based on Convolutional Neural Networks (CNN) can be

quickly and efficiently switched from a low power, low frame-rate operating point to a high frame-rate

one when a detection is performed. Therefore, PULP performance was scaled from 1x to 354x, thus

reaching peak performance/power efficiency of 211 GOPS/W.

A similar approach to the one from [7] was used in [8], considering a different use case that addresses a motion estimation algorithm for smart surveillance. A CNN-based algorithm was implemented

for video surveillance, in which it is possible to scale from a low-power, low frame-rate state up to a

high-performance state. Based on this, a sample benchmark was developed intended for applications in

the nano-UAV field, where PULP was used to accelerate estimation of optical flow from frames produced

by an ULP imager, with the objective of autonomous hovering and navigation, achieving results of 14µJ


per frame at 60fps. Future work aims to improve PULP with the purpose of being competitive with HW accelerators while maintaining the possibility of being programmed with general-purpose software.

Further advances were made regarding smart camera sensors targeting ultra-low-power vision applications using PULP, and its usage within a case study of moving object detection, as presented in [9]. Using PULP, a 10.6µW low-resolution contrast-based imager featuring internal analog pre-processing was developed. This local processing allows a reduction of the total amount of digital data to be sent out of the node by 91%. This is done by having a context-aware analog circuit as the imager, which only dis-

patches meaningful post-processed data to the processing unit, thus reducing the sensor-to-processor

bandwidth by 31x with respect to transmitting a full pixel frame.

Extensions and HW Acceleration

Hardware extensions were also explored in order to bring new energy-efficient solutions, e.g., in the case of the shared Logarithmic Number Unit (LNU) implemented in [10], which serves as an energy-efficient alternative to a conventional Floating Point Unit (FPU). This LNU, optimized for ultra-low-power operation on the PULP multi-core system, is efficiently shared by all the cores. For typical nonlinear processing tasks, this design can be up to 4.2x more energy-efficient than a private-FPU design. This work was continued in [11], where a novel transformation for a hardware-efficient LNU implementation was considered; an area reduction of up to 35% was achieved while supporting additional functionality. Implemented in a 65nm technology, the novel LNU was demonstrated to be up to 4.35x faster than a standard FPU.

In [12] work was done towards the development of Hardware Convolution Engines (HWCEs), i.e., ultra-

low energy co-processors for accelerating convolutions. First implementations concluded that augment-

ing the PULP cluster with HWCEs could lead to an average boost of 40x or more in energy efficiency

in convolutional workloads. Moreover, to take advantage of these improvements, the previously referred implementation was applied to computer vision in [13]. The ability of CNN-based classifiers to "compress" low-information-density data, such as images, into highly informative classification tags makes them suitable for use in IoT scenarios. A 65nm system-on-chip implementing a hybrid HW/SW CNN accelerator that meets the energy requirements of IoT targets was proposed.

Heterogeneous Programmable Accelerator

PULP was also used as an accelerator in heterogeneous systems for speeding up computation-intensive

algorithms. In [14], a heterogeneous architecture was developed by coupling a Cortex-M series MCU

with PULP, supporting offload of parallel computational kernels from the MCU to PULP by taking advan-

tage of the OpenMP programming model, supported by PULP.

Within the IoT scope, Fulmine was proposed in [20]: a SoC based on a tightly-coupled multi-core cluster for near-sensor data analysis, as a promising path to Internet of Things (IoT) endpoints. It minimizes the energy spent on communication as well as the network load, and at the same time addresses security by making hardware-supported encryption functions available, while also supporting software programmability for regular computing tasks.

In the context of computer vision, but also using PULP as a heterogeneous programmable accelerator, [21] presents a novel implementation of an ultra-low-power system based on PULP together with a TI MSP430 microcontroller. It proposes a solution for the wearable devices market, which cannot offer continuous data monitoring due to very short battery life. It could bring new functionalities, such as sports performance enhancement, elderly monitoring, disease management and several other applications in sports, fitness, gaming or even entertainment. PULP enables context classification supported by convolutional neural networks, operating at very low power with 2.2mJ per classification while achieving speedups of up to 500x with respect to the TI MSP430 operating under the same power restrictions.

2.2 PULPino

PULPino is a small single-core system based on PULP, as previously mentioned in Section 2.1. As such, PULPino represents a first step towards the release of the full PULP as an open-source multi-core platform. Being part of PULP, PULPino inherits its IPs and cores, focusing on ease of use and simplicity. Its open-source release took place in February 2016 under the Solderpad hardware license, including the complete RTL sources, all IPs, the RI5CY core based on RISC-V, an environment for RTL simulation and the complete FPGA build flow. In January 2016, the first ASIC, called Imperio, was taped out with PULPino on it.

Figure 2.2: Comparison between RI5CY and ARM's Cortex-M4 [22].

The main characteristics of the PULPino core are presented in Figure 2.2. It offers reduced power consumption and area for the same manufacturing technology and conditions when compared with

ARM’s Cortex-M4. It also features an IPC close to one, full support for the base integer instruc-

tion set (RV32I), compressed instructions (RV32C) and partial support for the multiplication instruction

set extension (RV32M). Non-standard extensions have been implemented featuring hardware loops,

post-incrementing load and store instructions, ALU and MAC operations. Dot-product and sum-of-dot-

products instructions on 8-bit and 16-bit data types allow up to 4 multiplications and accumulations to be performed in one cycle, consuming the same power as a 32-bit MAC operation. Support for real-time operating systems such as FreeRTOS was also added. A low power mode is available, in which only a simple event unit remains active while all other components are clock-gated, consuming minimal power until an event or interrupt arrives to wake the core up.

The core with the extended ISA is on average 37% faster for general purpose applications. It can also

achieve average speedups on convolutions of up to 3.9x [5]. With extensions, the core is only 15x less energy-efficient than a state-of-the-art hardware accelerator, but has the advantage of being a general

purpose architecture that can be used for a wide range of applications [5, 15].

Pipeline Architecture

Since the target is ULP operation, some considerations in the pipeline should be taken into account.

The number of pipeline stages is one key aspect of the design, since a high number of stages increases the overall throughput and allows higher frequencies, but also increases the latency. A large number of stages might also increase the likelihood of data and control hazards. In this case, some high-end opti-

mizations (such as speculation and branch prediction) might not be the best approaches to overcome

this issue, since they provoke an increase in the overall power consumption.

Figure 2.3: Simplified block diagram of the RISC-V four-stage pipeline [5].

The organization of the pipeline is illustrated in Figure 2.3, which consists of four stages: instruction fetch (IF), instruction decode

(ID), execute (EX) and write-back (WB). The ALU is extended with fixed-point arithmetic and enhanced

multipliers that support dotp operations (while still keeping the same timing), since the critical path is

mainly determined by the memory interface.

The cluster can achieve frequencies of 350-400MHz under typical conditions with a 65nm implementation,

being able to reach higher frequencies than the commercially available MCUs that usually operate in the

range of 200MHz [5].


Instruction Fetch Unit

The RISC-V standard supports compressed instructions, which are 16 bits wide. Since the instruction cache stores both standard and compressed instructions, it is possible to get misaligned instructions when an odd number of 16-bit instructions is followed by a 32-bit instruction, which would require an additional fetch cycle and stall the processor. Due to the pre-fetch buffer, shown in the block diagram of Figure 2.3, it is possible to fetch 128 bits (a complete cache line) instead of a single instruction, which reduces the accesses to the shared instruction cache, since 4 to 8 instructions are fetched in one access. The misaligned instruction problem is solved by having an additional register that stores the last instruction. This register contains the lower 16-bit part of the 32-bit instruction, which can be combined with the higher part and forwarded to the Instruction Decode stage. This prevents stalls when a misaligned instruction occurs, except for jumps, branches or hardware loops [5].

Hardware Loops

Hardware loops, or zero-overhead loops, allow a piece of code to be executed multiple times without the overhead of branches or counter updates, being a very common feature in DSP applications. The hardware loop controller can be configured through programming, by defining the start address (pointing to the first instruction of the loop), the end address (pointing to the last instruction to be executed in the loop), and by setting the counter value that is decremented every time the loop is completed. These configuration registers are mapped in the Control/Status Register (CSR) block illustrated in Figure 2.3. This configuration can be done during interrupts or exceptions through a set of dedicated instructions, which are automatically inserted by the modified GCC compiler [5, 23]. Moreover, improvements were made regarding a loop buffer that acts as a cache holding the loop instructions, thus eliminating fetch delays [24].
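To illustrate, consider the kind of fixed-trip-count loop that the modified GCC can map onto the hardware-loop registers; the C source needs no special annotation, since the compiler fills the start/end address and counter CSRs itself. The function below is a generic sketch, not code from this Thesis.

#include <stdint.h>

/* A small accumulation loop with a compile-time trip count: a natural
 * hardware-loop candidate, where the per-iteration branch and counter
 * update are handled by the loop controller instead of instructions. */
int32_t dot16(const int16_t *a, const int16_t *b)
{
    int32_t acc = 0;
    for (int i = 0; i < 16; i++)
        acc += a[i] * b[i];
    return acc;
}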

Load-store Unit

The load-store unit (LSU) is responsible for accessing the data memory, and it can load and store 32-bit, 16-bit and 8-bit words. A post-increment addressing mode was added to this unit, which performs a load/store instruction while simultaneously increasing the base address by the specified offset. This feature leads to speedups of up to 20% when memory access patterns are regular, as normally found in loops (e.g. matrix multiplication). The address increment is embedded in the memory access instructions, thus eliminating the need to use separate instructions to handle pointers.
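To make the benefit concrete, the sketch below shows a regular-stride copy loop in C; with post-incrementing loads and stores, each pointer update is folded into the memory instruction itself rather than issued as a separate add. The function is illustrative only.

#include <stdint.h>

/* Regular-stride copy: each load/store can use the post-increment
 * addressing mode, so "src++" and "dst++" cost no extra instructions. */
void copy_words(uint32_t *dst, const uint32_t *src, int n)
{
    while (n-- > 0)
        *dst++ = *src++;
}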

Support for unaligned data memory accesses, which can happen frequently, is also provided. The hardware detects when an unaligned access occurs and stores the high-word data in a temporary register, which is combined with the lower word when the second access is issued. This feature has advantages in code size, as shown in Figure 2.4, and in the number of cycles required to access unaligned data.


Figure 2.4: (a) Support for unaligned access in software (5 instructions/cycles) and (b) with hardware support (1 instruction, 2 cycles) [5].

Packed SIMD Support

Support for sub-word parallelism is provided in PULPino, which consists in packing multiple sub-words into one word in order to process the whole word at once. This allows data-level parallelism to be exploited through small-scale SIMD processing, since the same instruction is applied to all the sub-words within the whole word [25]. Since this is a 32-bit processor, it can compute up to four bytes in parallel. This is advantageous for IoT applications because data acquired from sensors is frequently 8-bit or 16-bit. To accomplish this, the ALU integrates a vectorized datapath segmented into two or four lanes, where vectorial operations like addition and subtraction are computed as four sub-operations.

A shuffle instruction was implemented that is able to generate an output composed of any combination of sub-words of the two input operands, as illustrated in Figure 2.5, where the third operand sets the selection criteria [5].

Figure 2.5: Shuffle instruction example diagram. Allows the combination of bytes from r1 and r2 through mask encoding [5].
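As a scalar reference for what a single vectorial add does, the following SWAR-style C sketch adds four packed bytes at once; the masking keeps carries from crossing the four 8-bit lanes. On RI5CY the same result is produced by one packed-SIMD add instruction; this function is only an illustration of the semantics.

#include <stdint.h>

/* 4-way packed byte addition in plain C: add the low 7 bits of each
 * lane, then patch each lane's MSB back in, so no carry leaks between
 * lanes. One vectorial add instruction computes this in hardware. */
static uint32_t add4_u8(uint32_t x, uint32_t y)
{
    uint32_t sum7 = (x & 0x7F7F7F7Fu) + (y & 0x7F7F7F7Fu);
    return sum7 ^ ((x ^ y) & 0x80808080u);
}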

Fixed-Point Support

Fixed-point operations can be thought of as a low-cost alternative to floating-point. Considering that many applications do not require floating-point accuracy, simpler fixed-point arithmetic can be used, saving power and area [5].

Fixed-point numbers are usually given in the Q-format, meaning that a number is represented in the format Qm.n, where m is the number of integer bits and n the number of fractional bits. These core extensions were designed to support any Q-format, limited only by m + n < 32.

Conversions from one fixed-point representation to another require the number to be normalized by shifting its bits; before the extra bits are discarded, it may be desirable to perform a rounding operation to improve accuracy. Hence, an add-round-normalize instruction is provided, which can save code size and execution time by adding the rounding constant while shifting the number by the immediate value. A clip instruction is also available, which checks whether the number lies between two values and saturates the result to the upper or lower bound if it is out of range [5].
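For illustration, the following scalar C sketch performs a Q1.15 multiplication with round, normalize and clip steps, i.e. the sequence that the dedicated instructions collapse into far fewer cycles. The Q1.15 format is just an example choice; any Qm.n with m + n < 32 works the same way.

#include <stdint.h>

/* Q1.15 multiply: round, normalize (shift) and clip, written out in
 * scalar C as a reference for the hardware instructions' behavior. */
static int16_t mul_q15(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;    /* Q2.30 intermediate product      */
    p += 1 << 14;                  /* round: add half an output LSB   */
    p >>= 15;                      /* normalize back to Q1.15         */
    if (p >  32767) p =  32767;    /* clip to upper bound             */
    if (p < -32768) p = -32768;    /* clip to lower bound             */
    return (int16_t)p;
}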

Iterative Divider and Multiplication

The long integer division algorithm was chosen because it reuses the existing comparators, shifters and adders

of the ALU. This divider has a variable latency from 2 to 32 cycles, depending on the input operands.

Despite being slower than a dedicated divider, it has low area overhead [5].

The multiplier block has three modules: a 32x32 multiplier, a fractional multiplier, and two dot-product multipliers. It is capable of multiplying two vectors and accumulating the result in one cycle. The dot-product multiplier performs up to four multiplications and three additions in one operation, as follows:

d = a[0] · b[0] + a[1] · b[1] + a[2] · b[2] + a[3] · b[3],

where a and b are 32-bit registers, a[i] and b[i] correspond to individual bytes of these registers, and d is a

32b result. This multiplier also offers fixed-point support, which may require rounding and normalization

after the operation, both accomplished in a single instruction. It is important to note that, for these

operations, rounding and then normalizing reduces the overall error [5].
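A scalar C reference of what the single dot-product instruction computes (here for signed bytes; the 16-bit variants behave analogously with two lanes) might look as follows; it illustrates the semantics, not the hardware implementation.

#include <stdint.h>

/* Reference for the dot-product operation: four byte-wise signed
 * multiplications plus three additions, which the core's dot-product
 * multiplier completes in a single cycle. */
static int32_t dotp4_s8(uint32_t a, uint32_t b)
{
    int32_t d = 0;
    for (int i = 0; i < 4; i++) {
        int8_t ai = (int8_t)(a >> (8 * i));  /* byte i of a */
        int8_t bi = (int8_t)(b >> (8 * i));  /* byte i of b */
        d += (int32_t)ai * bi;
    }
    return d;
}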

2.3 Additional PULPino’s Core Configurations

After the initial release of PULPino in February 2016, based on the previously presented RI5CY architecture [5], new cores with different configurations were additionally released in August 2017:

• RI5CY + FPU: the same previously presented RI5CY core, enhanced with a single-precision Floating Point Unit (FPU) compliant with the IEEE-754 standard for floating-point arithmetic [26].

• Zero-riscy: an area-efficient 2-level pipelined core, implementing the RISC-V RV32-ICM instruction set like the RI5CY core [26].

• Micro-riscy: an even smaller core than the previous ones, implementing the RV32-EC instruction set. It features only 16 general purpose registers and no hardware multiplication support [26].

The new core alternatives bring improved area-efficiency management from the hardware designer's point of view, allowing the designer to choose the core that best fits the target application and moving towards more efficient and area-effective embedded system designs. Fig. 2.6 shows an area breakdown among the three different core architectures.

Figure 2.6: Area breakdown of three core configurations. Results from an ASIC synthesis run [26].

As can be noticed, the RI5CY core has the biggest area footprint, which might seem a drawback. However, it presents itself as the best choice for DSP applications,

featuring a set of hardware extensions and optimizations for the purpose, as presented before in Section

2.2.

Zero-riscy has its area reduced by more than half with respect to the full-featured RI5CY core. Designed to be small and efficient, it operates as a 2-level pipelined, area-efficient RISC-V based core. It can be configured to fully support four ISA configurations: the RV32I base integer instruction set, the RV32E base integer instruction set, the RV32C standard extension for compressed instructions and, finally, the RV32M integer multiplication and division instruction extension. Moreover, to reduce the area footprint, some enhanced instructions supported by RI5CY and RI5CY + FPU were removed, namely hardware loops, post-increment load/store, fixed-point, bit manipulation and packed-SIMD instructions. The addition of the FPU, although not shown in the provided graph, implies an extra area of 28.6kGE (kilo Gate Equivalents), a 1.7x area increase, when added to the RI5CY core. No further details about area or energy consumption are available for the RI5CY + FPU core.

Micro-riscy, the smallest core, is 3.5x smaller than RI5CY. Even more optimized than the previous ones, it mainly targets ULP control applications in which area and power consumption requirements prevail over performance. As depicted in Fig. 2.7, micro-riscy has the lowest energy consumption when executing Runtime (a control-intensive application with very few ALU operations). In that figure, all the cores run at a low frequency of 100kHz, implemented in a UMC65 technology at 1.2V and 25°C. As can be seen, depending on the application benchmark some cores perform better, due to their architectures being optimized for such applications. For 2D convolution, RI5CY has the lowest energy consumption, owing to its enhanced instructions that improve performance on DSP applications [26].


Figure 2.7: Energy consumption comparison between three core configurations [26].

2.4 Interconnect Networks

Computer-based systems consist of individual components connected together, communicating with each other. This logic does not only apply to internal computer components, but may also be applied to computers themselves, resulting in networks of connected computers. These networks rely on communication standards to establish rules on how data is converted and transferred among the several components.

There are four different networking domains in which interconnect networks may be classified, according to the number of connected devices and their proximity:

1. Wide Area Networks (WANs) - Connect a wide range of distributed computer systems around the

globe over large distances.

2. Local Area Networks (LANs) - Usually used to connect computer systems across small areas of a few kilometers. They may also be used in machine rooms or throughout buildings.

3. System/Storage Area Networks (SANs) - Used to connect multiple processors, or in processor-memory connections within multiprocessor or multicomputer systems. SANs may also be used within server and data center environments, where connections between storage and I/O components are required, usually spanning distances of a few tens of meters.

4. Networks-on-chip (NoCs) - Networks used for interconnecting micro-architectural functional units within chips, e.g. caches, processors, register files or IP cores. Currently, this kind of network supports a few tens to a few hundred of such devices connected together, within distances on the order of centimeters. Nowadays, some proprietary designs are gaining wider use, e.g. Sonics' Smart Interconnect, IBM's CoreConnect or ARM's AMBA. There are also recent standards with the purpose of improving interconnectivity between accelerators and other system components (e.g. CCIX, Gen-Z and OpenCAPI) [27].


The three new standards previously categorized as NoCs were announced in 2016, being developed towards the goal of optimizing and easing the connection between accelerators and processors in a tightly-coupled manner. The driving forces behind these new standards are related to better exploitation of new and emerging memory/storage technologies (streamlined software stacks, direct memory access, etc.) and to solutions based on open standards.

There are three distinct standards mainly because different groups of companies have been working to solve similar problems, and therefore each approach has its differences; in the near future, a convergence of standards is likely. In the next sections, all three standards are explained in more detail.

2.4.1 Cache Coherent Interconnect for Accelerators (CCIX)

CCIX was founded with the purpose of enabling a new class of interconnect, driven by emerging acceleration applications such as 4G/5G wireless technology, in-memory databases, network processing and machine learning. It allows processors based on different ISAs to process as peers with multiple acceleration devices such as FPGAs, GPUs, custom ASICs, etc. CCIX uses a tightly coupled interface between processor, accelerators and memory, together with hardware cache coherence across the links. Data sharing does not require drivers or interrupts.

Figure 2.8: Use cases of CCIX [28]

Fig. 2.8 presents some possible system configurations using the CCIX specification. Some of CCIX's main targets are low-latency main memory expansion and extending the processor's cache coherency to network/storage adapters, accelerators, etc. [28].


2.4.2 GEN-Z

GEN-Z is defined as a high-performance, low-latency, memory-semantic fabric enabling communication among all the devices in a system. GEN-Z enables the creation of an ecosystem in which a wide variety of high-performance solutions can communicate together, allowing a unification of communication paths and simplifying software through load/store memory semantics.

(a) Typical CPU-Memory interface (b) Decoupling media from the SoC

Figure 2.9: Comparison between typical CPU-memory interface and Gen-Z Media Controller

A typical CPU/memory interface is shown in Fig. 2.9a, in which the SoC (containing a media controller) connects to a DRAM interface over a memory bus. In comparison, as depicted in Fig. 2.9b, the media controller is decoupled from the SoC, being placed where it makes more sense to be: with the media module. This important change, made possible by the Gen-Z fabric, allows every compute entity to be agnostic and disaggregated.

The Gen-Z architecture easily aggregates different types of media, devices or resources and allows them to scale independently of any other resource in the system, as shown in Fig. 2.10. It may also function as a gateway to other networks, e.g. Ethernet and InfiniBand [29].

Figure 2.10: Gen-Z architecture aggregating different types of media devices [29].


2.4.3 Open Coherent Accelerator Processor Interface (OpenCAPI)

OpenCAPI is an open interface protocol that allows any processor to attach to coherent user-level accelerators, I/O devices and advanced memories (accessible via read/write or DMA semantics). The semantics used to communicate with the multiple components are agnostic to the processor architecture. The key attributes of OpenCAPI are high bandwidth, low latency and being based on virtual addresses, which are handled on the host processor in order to simplify the attached devices and, consequently, to ease device interoperability between systems of different architectures [28].

2.4.4 Standards Comparison

The three emerging standards referred to before are likely to converge into one consortium between several companies, such that they might complement each other and evolve into a more advanced and complete standard. Figure 2.11 summarizes the main specifications detailed before, making it possible to compare their main features and specs.

In essence, Gen-Z is a new data access technology which allows operations with directly attached or

disaggregated memory/storage. CCIX enables coherency among several heterogeneous components.

OpenCAPI provides coherent accesses between system memory and accelerators, through virtual

addresses supported by the host processor.

Figure 2.11: Comparison of CCIX, Gen-Z and OpenCAPI main features [28].

All these standards are suited to high-performance IPs and applications; they usually do not target low-power applications and require high-throughput data rates (as seen in Fig. 2.11). Within the scope of this Thesis, a low-power interconnect technology is required, one which aims at simplicity and a reduced hardware footprint while still meeting the processor/accelerator data bandwidth requirements. Therefore, none of the state-of-the-art standards presented is suited for implementation as the processor-accelerator interface in a PULPino-based system.


2.5 Hardware Accelerators

A hardware accelerator is a specialized unit designed to perform a very specific task or set of tasks, achieving higher performance and energy efficiency than a general-purpose CPU for that specific application. The use of accelerators is not new; in fact, it dates back to the 1980s, when the deployment of floating-point co-processors was one of the first adoptions of accelerators. Since then, they have been widely featured in SoC architectures for embedded system designs over the past few decades [30].

Nowadays, hardware accelerator design faces challenges such as flexibility and design cost [31]. A design is flexible when it may address a large set of applications with the same initial design. Most accelerators are based on fixed functions only used in a specific target application; therefore, providing high flexibility requires a large number of accelerators to cover a wide set of applications. Accelerators are especially suited for real-time applications, I/O processing, data streaming (network traffic, video, audio, etc.), specific "complex" functions (DCT, FFT, exp, log, etc.) or specific "complex" algorithms such as neural networks. Designing such systems in hand-written RTL implementations is highly tedious, time-consuming and, consequently, costly. Possible solutions to overcome these issues may be found in high-level synthesis tools, as an automated way to generate digital hardware by interpreting an algorithm written in a higher-level language (C, C++, SystemC, Chisel, etc.) and afterwards translating it into synthesizable hardware that fits the application's goals. Xilinx has been paving the way to such implementations with the Vivado HLS tool [32]. Another approach is to create solutions for universal and optimized interfaces between processors and accelerators, as shown in Section 2.4.

The development of custom hardware solutions is usually associated with high design costs, due to the many hours, weeks or even years that some designs take to complete, mainly because of traditional design methods. The design is mostly intuition-driven, requiring decisions to be made upfront, which might lead to costly miscalculations [31]. A possible and attainable solution is presented in this Thesis, by defining a light-weight interface between the processor and the accelerator, based on the AXI-Lite standard. It allows the reuse of accelerator hardware designs in any system that supports the AXI specification, which is widely used. Associated with an embedded processor, it enhances flexibility and decreases design cost by reusing the same accelerators in different platforms and scenarios.

Accelerator performance analysis takes into account a critical parameter, the speedup. It measures how many times the use of an accelerator can reduce the execution time that a non-accelerated system would take to complete the exact same task. The speedup is influenced by the accelerator's computing time, the synchronization with the processor and the data transfer bandwidth, which are usually the main bottlenecks of the system. This analysis translates into the equation of the accelerator's total execution time:

$T_{acc} = t_{in} + t_{comp} + t_{out}$,

in which $t_{in}$ and $t_{out}$ represent the time taken to transfer the input and output data, respectively, into and out of the accelerator, and $t_{comp}$ the accelerator's computing time.

Some problems in specific applications were tackled by using PULP itself as an accelerator, as referred in Section 2.1, although none of them combines PULPino with hardware acceleration as done in this Thesis. In the mentioned section, some applications of PULP target security in IoT nodes [20] and Digital Signal Processing (DSP) [10][11]. However, all of them are tightly coupled with PULP's architecture, making it impossible to easily switch between different kinds of accelerators without the effort of redesigning the whole architecture and hardware once again. As said before, the high cost of custom hardware design encourages the reuse of IPs, which is one of the main goals of this Thesis, while still targeting low-power embedded applications, just like PULP.

Cryptographic and Digital Signal Processing (DSP) functions are highly suitable for hardware acceleration. They usually require a considerable number of operations per input and are frequently used along the application's algorithm. These functions are generally inefficient when computed on general-purpose CPUs, which do not have extensions for these specific functions. Examples of improvements to this kind of functions are [33] and [34], in which SIMD instruction set extensions, applied to the ARM NEON and Intel AVX (only in [33]) architectures, were developed to support applications such as SHA-3, Keyak and Ketje. They achieved significant performance improvements over software, in comparison with cryptographic applications executed on general-purpose CPUs. SHA-3 has not only been improved by SIMD instruction extensions, but also by specialized hardware accelerators, as proposed in [35]. That accelerator, implemented as a co-processor compliant with the RoCC interface, is based on a parametrized implementation using automated tools and is integrated with a Rocket RISC-V processor. Developed in the new hardware construction language Chisel, it promises different levels of configuration regarding performance, energy efficiency and size. A tool was also developed to help the designer choose the configuration that achieves an optimal point of operation, based on the requirements of the system under development.

Besides the previous cryptographic applications presented, RSA and Blowfish cryptography are additionally explored in [36], where the goal was to provide a design that focuses on optimizing the critical path of each cryptographic algorithm. [36] is based on a customized co-processor to improve the overall throughput on an FPGA platform, achieving reductions in energy consumption and improvements in performance over a standard software implementation.

Besides security, Digital Signal Processing (DSP) is also a huge field on which many of today's compute-intensive applications are based [37], with functions like adaptive Finite Impulse Response (FIR) filters, the Fast Fourier Transform (FFT) and Discrete Cosine Transforms (DCT). These are required in a wide range of applications, such as audio/video compression and decompression, convolution, encryption, computer vision, digital switches and many other systems [38]. Some of these applications are currently used in novel IoT or ultra-low-power systems, as targeted in [7], [8], [9], [19] and [21] regarding computer vision (as presented in Section 2.1). Apart from computer vision, PULPino has several improvements available that enhance performance and reduce energy consumption (see Section 2.2) over applications such as those DSP addresses.

Today’s main FPGA manufacturers already provide full featured DSP cores, e.g., for FFT take for in-

stance from Xilinx and Intel, respectively [39, 40]. Also, in FIR applications, some cores like [41] and

[42] were made available by Xilinx and Altera, respectively. A more optimized solution that claims to


beat such proprietary core implementations is SPIRAL, a novel hardware generation framework and system for linear transforms introduced in [43]. It takes a problem specification as input, along with additional configurations that will shape the datapath; the configuration method resembles the ones used in the previously mentioned cores. The system automatically customizes the algorithm, maps it to a datapath and finally produces a synthesizable RTL Verilog description, ready for FPGA or ASIC implementation. A comparison between these generated designs and the previously mentioned Xilinx FFT core was performed and is portrayed in Fig. 2.12. These graphs expose throughput and latency for a DFT design with 1024 samples (16-bit fixed point) on a Xilinx Virtex-5 FPGA. It was synthesized with Xilinx ISE, and all the data relative to performance and cost were fetched after the place and route stage. As seen in Figure 2.12, the SPIRAL design outperforms the Xilinx LogiCore FFT core in both latency and throughput. It might have an extra area cost in comparison with the Xilinx one, although that can be managed by the designer in the generation tool [2] (further detailed in Section 3.3).


(a) Latency (µs) vs. area (slices)

(b) Throughput (million samples/s) vs. area (slices) vs. performance (Gop/s)

Figure 2.12: Comparison between the SPIRAL-generated design and Xilinx LogiCore FFT v4.1. Results from a 1024-sample DFT (16-bit fixed point) on a Xilinx Virtex-5 FPGA [2].


2.6 Summary

In this Chapter, the state-of-the-art of PULP platform was presented, starting by describing its main

functional principle of operation and further elaborating the relevant scientific works performed on this

platform, namely in computer vision applications, extensions and HW acceleration and heterogeneous

programmable accelerators. Afterwards, a more detailed description of the PULPino core was pre-

sented, addressing its main features that are relevant for scope of this Thesis. Furthermore, are pre-

sented new core configurations for PULPino. As the objective of this Thesis is related with hardware

acceleration, the topic was addressed as the state-of-the-art interconnect networks related to it.


Chapter 3

Hardware/Software Co-design

In this Chapter, the proposed hardware/software architecture is presented which, from a high-level system view, is composed of PULPino and several attached accelerators. PULPino, the processing unit, made available as open source, was adapted and modified to accommodate attachable accelerators on the existing AXI bus.

The chapter starts with a presentation of the main protocol used (AXI) and its main sub-components. In order to introduce the hardware developed under the goal of this Thesis, an overall system architecture is presented, complemented by a detailed description of the software and hardware interfaces. Furthermore, the architecture of each accelerator is explained.

3.1 AXI Protocol

This section introduces the specifications, communication protocols and general operation of the AXI protocol, which allows the main system components of PULPino to be memory mapped. It enables those components to be accessed by the core with simple load/store instructions.

AMBA AXI4, succeeding its antecedent AXI3, is an open standard specified by ARM. It facilitates the connection and management of functional blocks in SoC designs, especially in those with a large number of controllers and peripherals. AXI4 was introduced with Advanced Microcontroller Bus Architecture (AMBA)-4 in 2011, having backwards compatibility with the previous AXI3 specification. It is now the de facto standard for embedded processors, being royalty-free and having well-documented specifications. AMBA facilitates the way design blocks connect to each other, encouraging modular systems that do not depend on technology and can be reused across different systems and applications, while maintaining high-performance and low-power communication.

In essence, AMBA integrates a set of protocol specifications [44]:

• Advanced eXtensible Interface (AXI4) - further explained in Section 3.1.

• AXI Coherency Extensions (ACE) - extends AXI with additional coherency signaling that allows multiple processors to share memory in a coherent manner.


• Advanced High-performance Bus (AHB) - targets larger bus widths (64/128 bit), using full-duplex parallel communication, and has increased performance over AXI4.

• Advanced Peripheral Bus (APB) - designed to interface with peripherals that have simple interfaces and low-power profiles. It provides low-bandwidth control accesses using a reduced-complexity signal list derived from AHB.

AXI4 has several subsets, namely AXI-Full, AXI-Lite and AXI-Stream, which may interface together through bridges and protocol converters. AXI-Full targets high-performance, high-clock-frequency system designs, mainly featuring support for: multiple region interfaces; quality-of-service signaling; unaligned data transfers using data bursts; individual control and data phases; individual read and write data channels; simultaneous reads and writes; multiple outstanding addresses; and out-of-order transactions.

All AXI transactions operate on the basis of a valid/ready signal handshake. The source of the data asserts the valid signal when it has valid data to transfer. Once the destination is ready to receive data, it asserts the ready signal. When both valid and ready are asserted, data is transferred between source and destination. Another useful signal is tlast, responsible for signaling the last data packet in a burst transaction. The protocol operates in a master-slave paradigm, meaning that each end of the connection is required to be either a master or a slave. It uses 5 different channels: read address, read data, write address, write data and write response. It is capable of bursts of up to 256 beats, meaning that it allows 256 individual data transfers in a single transaction, based upon the same address.

AXI-Lite is a lighter implementation of the AXI-Full protocol. It uses the same 5 channels, although bursts are not allowed and each transfer is limited to a data width of 32 or 64 bits. AXI-Lite is suited for simple implementations that do not have high bandwidth requirements, such as set-up and control components that require proper configuration upon utilization. The ease of configuration is related to the fact that AXI-Lite supports from 4 to 512 individually addressable slave registers, each of which may be written to or read from. With its simpler implementation comes a smaller footprint, advantageous in ultra-low-power systems. Despite its reduced performance, it is possible to bridge back to AXI-Full, the more complex specification that PULPino uses, allowing interaction between both protocol specifications and even easily bridging between low- and high-throughput systems.

AXI-Stream uses a single data channel in which the data only flows one way: from master to slave. It is designed for applications that require high-bandwidth data transfers and low latency. By having only a data channel, it does not require addresses to proceed with data transactions; these transactions start once the slave and master are ready to receive and send data, respectively. It supports unlimited data bursts, suited for applications that require data streaming. To indicate the end of a transaction, it uses the tlast signal, asserted on the last word sent over the data channel.

AXI-Full is used in the current PULPino design to connect components such as the debug unit and the SPI slave. Other components, like the instruction and data memories, the core and the peripherals, make use of AXI-Full indirectly, since they require a bridge to interact between different interfaces and to be able to communicate through an AXI interconnect (further detailed in Section 3.1.1).

On the other hand, AXI-Lite, being a much simpler and lightweight protocol, is responsible in this Thesis' proposed design for handling communications between core and accelerator. One of the reasons AXI-Stream is not used in the proposed design is that AXI-Lite provides a better generic interface, able to support both configuration and a "stream" of data. AXI-Stream is mainly suited for data streaming, and its lack of individually addressable slave registers makes it less desirable for interfacing with components. For instance, an AXI-Lite slave register might be used to provide means of synchronization between processor and accelerator.

3.1.1 AXI Interconnect

PULPino integrates multiple components using an interconnect network, which is supported by the AMBA standard. As depicted in Fig. 3.1, it uses a main AXI interconnect block and a bridge to the Advanced Peripheral Bus (APB) to connect peripherals, both featuring 32-bit wide data channels. It also includes an advanced debug unit that enables access to both RAMs, the core registers and memory-mapped IOs via JTAG. PULPino's components are connected through an AXI interconnect block that allows all of them to be mapped into a memory space, providing a homogeneous view of the system.

Figure 3.1: PULPino's SoC block diagram [45].

A memory map has been defined in which all the components have user-configurable address spaces, as shown in Fig. 3.2. This map might be extended by the user, adding new address ranges to incorporate additional memory-mapped hardware components. For instance, if an accelerator needs to be added, a new set of addresses (start and end) is configured in the AXI interconnect configuration sources. Those components are ideally meant to be as plug-and-play as possible and have a "universal" interface, towards the goal of being easily deployed in any given IP.
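As an illustration of this memory-mapped view, the C sketch below shows how software would reach a newly added accelerator once its address range is configured in the interconnect. The data-RAM range matches the memory map of Fig. 3.2; the accelerator base address is a hypothetical example, not a value from the original map.

#include <stdint.h>

#define DATA_RAM_START 0x00100000u  /* data memory range from Fig. 3.2      */
#define DATA_RAM_END   0x00108000u
#define ACCEL_BASE     0x1A400000u  /* assumed user-added accelerator range */

/* The accelerator's AXI-Lite slave registers appear as ordinary memory. */
static volatile uint32_t *const accel = (volatile uint32_t *)ACCEL_BASE;

static inline void accel_write(unsigned reg, uint32_t v) { accel[reg] = v; }
static inline uint32_t accel_read(unsigned reg)          { return accel[reg]; }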

Figure 3.2: PULPino's memory map [45].

The interconnect block provides a way for multiple masters and slaves to be connected to several blocks at once. A typical implementation of an AXI interconnect provides clock conversion mechanisms, data width conversions and even FIFOs if necessary. To route all the slaves and masters together, it has a central crossbar. The routing mechanism may be configured to be based on an address space for each component (as used in the presented design) or be as simple as a round-robin solution. PULPino has a custom AXI interconnect block, whose design objectives are the following [44]:

• Suitable for high-bandwidth and low-latency designs;

• Operation at high-frequency without using complex bridges;

• Meet the interface needs of a wide range of components;

• Indicated for memory controllers that have high initial access latency;


• More flexible implementation of interconnect architectures;

• Be compatible with previously existing APB and AHB interfaces.

Figure 3.3: AXI4 node overview for a NxM interconnect node [46].

axi_node is a SystemVerilog soft IP, defined as the top-level source file for the AXI4 crossbar. Fig. 3.3 depicts a block diagram for an NxM interconnect node, composed mainly of four parts [46]:

1. Slices: optional buffers inserted on each master/slave port. They provide buffering and cut the critical paths that may lead to timing degradation;

2. Request trees: one for each target port, containing the arbitration tree, request decoders and error management;

3. Response trees: one for each initiator port, including the arbitration tree for requests and decoders for responses;

4. Configuration block: used to map the initiator regions (memory mapped). It can also be used to implement a first-layer protection mechanism, despite not being used in PULPino, by limiting the connectivity from a target port to an initiator port.

The initial memory map is defined at the instantiation of axi_node, by configuring a set of parameters:

• NB_MASTER: number of memory-mapped master components;

• NB_SLAVE: number of memory-mapped slave components;

• {start_addr_i, end_addr_i}: arrays of 32-bit start/end addresses, e.g. if NB_MASTER = 4 then four pairs of start and end addresses must follow. As another example, for the data memory start_addr_i = 0x0010_0000 and end_addr_i = 0x0010_8000, as shown in Fig. 3.2.

3.2 Overall System Architecture

This section addresses the proposed overall system architecture, based on PULPino. In order to improve PULPino's energy efficiency, hardware acceleration was provided in a loosely-coupled manner. The following sections describe how the acceleration was implemented, allowing the execution time of a given task to be reduced. The goal is to develop a generic plug-and-play interface for the accelerators, so they can be easily attached to the core (Section 3.2.1) without having to adapt the interface each time a different accelerator needs to be introduced. The processor is able to interact with the accelerator through simple load/store instructions, which were also used as processor-accelerator synchronization mechanisms, further detailed in Section 3.2.2.

The work developed under the scope of this Thesis was to attach the accelerators that interface with the processor through the implemented AXI connection. The difference between the original PULPino architecture and the newly developed one may be noticed by comparing Figures 3.1 and 3.4. The accelerator interfaces with an AXI-Lite to AXI-Full converter, as shown in Fig. 3.4. The AXI-Full interface of the converter is connected to PULPino's AXI interconnect block, which handles the communications coming from the processor. When the processor issues a load/store to the accelerator's memory space, the AXI interconnect block does all the required address translation to redirect the read/write to it. It is possible to have more than one accelerator at a time, by defining multiple address spaces, one for each.

Figure 3.4: PULPino with attached accelerators block diagram.


3.2.1 Hardware interface

The hardware interface between the accelerator and the remaining system only requires the standard AXI signals to operate, which was one of the main design goals. The accelerator block has one unique AXI-Lite port as its interface, as depicted in Fig. 3.4. This block behaves like a wrapper that integrates the kernel and contains all the hardware required to handle the interface between it and the AXI-Lite slave registers. This integration might consist, for instance, of storing the processor's incoming data and feeding it into the kernel's inputs according to its timing requirements. The same applies to the handling of the kernel's output data, which is afterwards read by the processor. All the kernel's control signals are provided as well.

The AXI-Lite interface that connects to the accelerator has the following configuration:

• Address width: 2 bits, addressing 4 slave registers of 32 bits each;

• Data width: 32 bits;

• Read/Write mode, in which both read and write channels are enabled.

The data width is restricted to 32 bits, since the processor only allows 32-bit operations. The number of registers used was a design choice, with the goal of using the minimum hardware possible while still meeting the functionality required for the interface. Once the kernel has been properly integrated within the wrapper, the accelerator block is ready to be attached to the AXI-Lite to AXI-Full protocol converter block, which allows each accelerator to be connected to the AXI interconnect block and therefore be memory mapped and addressable by the processor. Each accelerator to be plugged in has to have an independent memory region configured in the AXI interconnect block and its own protocol converter, as depicted in Fig. 3.4. The design option of giving each accelerator its own converter might seem a bad choice for an energy-efficient SoC; however, this is a lightweight protocol converter, corresponding to less than 1% of the hardware of the whole PULPino platform.

Both the converter and the accelerator are instantiated in PULPino's core_region.sv top module file. This module instantiates all core-related components: data and instruction memories, the RISC-V core and its debug module. To connect the accelerator to the converter, a new AXI4 slave bus was instantiated, which contains all the signals required to connect two AXI-Full interfaced components. The AXI slave port of the converter is connected to the AXI interconnect block's master AXI port. On the other hand, the accelerator's AXI-Lite slave port connects to the AXI-Lite master port of the converter.

Once all these requirements are met, the accelerator is ready to be plugged into the converter and connected to the AXI interconnect block. In essence, this architecture aims to deliver an infrastructure with a well-defined interface between processor and accelerator, encouraging the open-source community to develop a set of PULPino-compatible accelerators and creating an easy way for developers to test new kernels on the go, reducing the development time and complexity of such hardware systems.


3.2.2 Software Interface

The processor needs to interact with the accelerator in order to benefit from its capabilities. This is done through load/store instructions addressing the memory region predefined for it. In essence, the processor needs to send the data to be processed and afterwards fetch the computed results. The processor needs to know when the results are ready to be fetched, therefore a synchronization mechanism is required. To meet these requirements, a processor-accelerator communication protocol was defined, based on reads/writes to the available AXI-Lite slave registers. Table 3.1 provides an overview of each register and its functionality depending on the type of operation (read/write).

Table 3.1: AXI-Lite slave register write/read map functionalities.

Register   | Write      | Read
slv_reg0   | Reset      | Done
slv_reg1   | Input Data | N.A.
slv_reg2   | Last Data  | N.A.
slv_reg3   | Optional   | Result

For instance, when the user wants to reset the accelerator, a write to its first address (for demonstration purposes, let us assume the first address is 0) must be performed, sending the hexadecimal value 0x01010101. This is the first step before starting to send new data to the accelerator. Then, the address should be increased by one unit and the data streamed into address 1. The last data value should be sent to address 2, indicating to the accelerator that this is the last data input to be sent. Once the computation is done, the output values are available to be read at address 3. The processor may start to fetch results when the Done signal, represented by the hexadecimal value 0xdeadbeef, is read from address 0. This method requires the processor to check this register in polling mode. An alternative to this method is to have the Done signal associated with an interrupt vector, which would trigger a flag once the accelerator's computation is done, redirecting the program counter to the proper interrupt service routine, which would handle the output data as desired. Although this was not implemented, it is a strong recommendation for future work. From the energy-efficiency point of view, it would allow the processor to sleep in a low-power operation mode while waiting for the computation, awaking upon the interrupt. The feature of stepping out of the sleep stage when an interrupt is triggered is available on PULPino, as mentioned in Section 2.2. An optional functionality for slv_reg3 was added to fill a need that some accelerators might have: for instance, it might be used to pass a configuration value or information about the input data, as happens in the SHA-3 accelerator, detailed in Section 3.3.
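The protocol above maps directly onto a handful of loads and stores. The following C sketch shows one possible processor-side driver, assuming word addressing of the four slave registers and a hypothetical base address; the magic values (0x01010101 for reset, 0xdeadbeef for Done) are the ones defined in the text.

#include <stdint.h>

#define ACC_BASE      0x1A400000u  /* assumed accelerator address range  */
#define SLV_REG(n)    (*(volatile uint32_t *)(ACC_BASE + 4u * (n)))
#define ACC_RESET_CMD 0x01010101u  /* write to slv_reg0 resets the block */
#define ACC_DONE      0xdeadbeefu  /* read from slv_reg0 when done       */

uint32_t accel_run(const uint32_t *in, unsigned n)
{
    SLV_REG(0) = ACC_RESET_CMD;        /* step 1: reset the accelerator  */
    for (unsigned i = 0; i + 1 < n; i++)
        SLV_REG(1) = in[i];            /* step 2: stream the input data  */
    SLV_REG(2) = in[n - 1];            /* step 3: mark the last word     */
    while (SLV_REG(0) != ACC_DONE)
        ;                              /* step 4: poll the Done flag     */
    return SLV_REG(3);                 /* step 5: fetch one result word  */
}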

3.3 Hardware Accelerators

The integration of open-source kernels into one top-level accelerator is one of the main targets of this Thesis, intended to demonstrate its energy-efficiency impact on the overall system operation. The kernels were chosen with the first criterion of being open source, to match the kind of licensing provided by PULPino's team. This way, the hardware development made in this Thesis might contribute to PULPino's open-source community and keep growing the framework. The kernels used in the accelerators presented below were not internally modified. Based on their interfaces, control specifications and timing requirements, the hardware required to integrate each kernel within the accelerator block was developed. The developed hardware that wraps the kernel inside the accelerator acts as a bridge between the accelerator's AXI-Lite interface and the kernel's interface.

In the following subsections, two accelerators are proposed. Due to the kernels' different requirements, different hardware designs were needed for each, although the main structure is the same, given their similar requirement of streaming input data on each clock cycle. The processor does not meet such a requirement, since each read/write through AXI-Lite takes 10/11 clock cycles, respectively. Therefore, additional hardware is needed to accommodate the input data from the processor and then feed it into the kernel. Apart from this requirement shared by both kernels, the control signals to operate them are different.

SHA-3

In any kind of modern computer system, security is an extremely important feature. The most basic security or authentication systems use hashing algorithms. These take a stream of data as input and return a fixed-size hash for that specific input message. These hash functions are required to generate unique hashes for any given message; the message cannot be recovered from the hash, and the hash should be easy to compute [35]. NIST has standardized the Secure Hash Algorithm (SHA) family, of which SHA-3 is the most recent member. It is a cryptographic hash function originally known as Keccak, developed after successful attacks on MD5 and SHA-0 and theoretical attacks on SHA-1.

A cryptographic kernel was chosen due to the recurring importance of security in low-power IoT devices, which still represents a challenge for today's common applications. The SHA-3 accelerator aims to provide a faster computation of the hash function with reduced energy consumption for the targeted low-power processor. The hardware implementation of this function exploits parallelism, which would not be possible with the single-core processor under analysis.

The input message may take any size, while the output length remains the same. The output may take the lengths $n \in \{224, 256, 384, 512\}$ bits; the current implementation uses 512 bits, the highest security level among all SHA-3 variants. Keccak is based on the sponge construction approach, which, through random permutation functions, allows inputting any amount of data, leading to great flexibility. The padding of a message $M$ to a sequence of 576-bit blocks is denoted by $M \| \mathrm{pad}[576](|M|)$. It makes use of multi-rate padding, denoted by pad10*1, appending a single 1 bit, followed by a sufficient number of 0 bits, followed by a single 1 bit, such that the length of the result is a multiple of the block length (576 in the current design) [47]. The number of blocks of $P$ is designated by $|P|_{576}$, and the $i$-th block of $P$ by $P_i$. The number of blocks determines the number of times the permutation $f$ is executed, as shown in Algorithm 1.


Algorithm 1 The sponge construction [3]
1: procedure SPONGE
2:   Interface: $Z = \mathrm{SHA\text{-}3\text{-}512}(M)$, $M \in \mathbb{Z}_2^*$, $Z \in \mathbb{Z}_2^{512}$
3:   $P = M \| \mathrm{pad}[576](|M|)$
4:   $s = 0^{1600}$
5:   for $i = 0$ to $|P|_{576} - 1$ do
6:     $s = f(s \oplus (P_i \| 0^{1600-576}))$
7:   end for
8:   return $\lfloor s \rfloor_{512}$
9: end procedure

The kernel was developed by Homer Hsing and is available on the OpenCores website under the Apache license (version 2) [3]. The code is FPGA-vendor independent and fully optimized. It uses only one clock domain, without any latches. Capable of computing a 512-bit hash result in 29 clock cycles, it is based on a padding module followed by a permutation module, as shown in Fig. 3.5.

Figure 3.5: SHA-3 kernel overview architecture [3].

The input is limited to a width of 32 bits. Since this is far less than 576 bits, the padding module, whose architecture is shown in Fig. 3.6, uses a buffer to assemble the user input. When this buffer reaches maximum capacity, the permutation module is notified that a valid buffer output is ready. Then, the permutation module starts the calculation, the buffer is cleared, and the padding module once again waits for new input.

Figure 3.6: SHA-3 padding module’s architecture [3].

After the padding is complete, its output becomes the permutation block's new input, as depicted in Fig. 3.7. The permutation is performed by combinational logic over 24 rounds, with the round constant selected on the first iteration. A 1600-bit register stores the output, although only 512 bits are selected from it, resulting in the final hash value.


Figure 3.7: SHA-3 permutation module’s architecture [3].

While the permutation module is computing the current input, the padding module is preparing the next one. The permutation takes 24 clock cycles, meaning that the padder should get the next 576 bits ready in time. The kernel presents the input/output ports shown in Table 3.2.

To start computing a hash value, the core must be reset by holding the reset signal synchronously high during one clock cycle. This procedure must be repeated for every new hash value computation; for instance, if the kernel computed SHA-3-512("FooBar"), it should be reset before computing SHA-3-512("XPTO"). The padder uses the input signal byte_num in the last input block to indicate how many bytes that input has, since the message M may not be a multiple of 32 bits. If the last input block is reduced to 1 byte, it should be aligned to the most significant bits. Letting "A" be a 1-byte message, the input signal in should be driven as in[31:24] = "A". Notice that if the message is a multiple of the in width, an additional zero-length input block should be provided. For example, let the input be:

in = "ABCD", is_last = 0

Then set:

is_last = 1, byte_num = 0

Table 3.2: SHA-3 kernel's input/output ports.

Port        | Width | Direction | Description
clk         | 1     | In        | Clock
reset       | 1     | In        | Synchronous positive-asserted reset
in          | 32    | In        | Input data
byte_num    | 2     | In        | Number of bytes of in
in_ready    | 1     | In        | Input is valid or not
is_last     | 1     | In        | Current input is last or not
buffer_full | 1     | Out       | Buffer is full or not
out         | 512   | Out       | Hash result
out_ready   | 1     | Out       | Result is ready or not

To comply with these requirements, a wrapper accommodating the SHA-3 kernel was developed. It complies with the specifications provided in Section 3.2 and is based on the data path depicted in Fig. 3.8.

Figure 3.8: SHA-3 accelerator data path.

The design aims for efficiency and reduced hardware usage, using only one FIFO (32 words of 32 bits) and a set of control signals. The input data is redirected from the processor into the kernel as soon as it arrives (every 11 clock cycles). The FIFO acts as a buffer for the event that the kernel's buffer reaches full capacity, in which case the buffer_full signal is asserted. Upon such an event, the input data is held in the FIFO until the kernel is ready to continue the computation; afterwards, the data is fed into the kernel's input at the previous rate. When the last input data is sent by the processor, the FIFO might hold excess data due to previous buffer_full events. Such data is streamed into the kernel on every clock cycle after the last input data, which is the one sent to the slv_reg2 AXI-Lite register, asserting is_last and selecting the previously received byte_num value. As explained before, the byte_num signal has to follow certain rules, based on the length of the last message block. The user is responsible for this handling, by sending a last dummy input with the value zero into the slv_reg2 register after the last message block, whenever the message is a multiple of the input width. When the out_ready signal is asserted, the 512-bit hash is made available at the output port, being iteratively fetched by the processor in blocks of 32 bits. This output is redirected to the slv_reg3 register; since it is limited to a 32-bit word, an auxiliary counter is used to iterate through the 512-bit result, starting with the least significant word and iterating 16 times until the most significant word is sent. Afterwards, the accelerator needs to be reset in order to proceed with further computation.
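Putting the pieces together, a processor-side routine for this accelerator could look like the hedged C sketch below. The base address is hypothetical, and the exact moment at which byte_num is written through slv_reg3 is an assumption; the reset/stream/last/poll sequence and the 16-word, least-significant-word-first result read-out follow the text above.

#include <stdint.h>

#define SHA3_BASE 0x1A400000u  /* assumed address range of the accelerator */
#define SLV(n)    (*(volatile uint32_t *)(SHA3_BASE + 4u * (n)))
#define RESET_CMD 0x01010101u
#define DONE_FLAG 0xdeadbeefu

/* byte_num: bytes valid in the last word (0 if the message is word-aligned). */
void sha3_512(const uint32_t *msg, unsigned nwords, uint32_t byte_num,
              uint32_t hash[16])
{
    SLV(0) = RESET_CMD;                       /* reset before every hash   */
    SLV(3) = byte_num;                        /* optional config register  */
    unsigned body = byte_num ? nwords - 1 : nwords;
    for (unsigned i = 0; i < body; i++)
        SLV(1) = msg[i];                      /* stream full message words */
    SLV(2) = byte_num ? msg[nwords - 1] : 0;  /* partial last word, or a
                                                 dummy zero when aligned   */
    while (SLV(0) != DONE_FLAG)
        ;                                     /* poll Done                 */
    for (unsigned i = 0; i < 16; i++)
        hash[i] = SLV(3);                     /* 512 bits, LSW first       */
}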

Metrics regarding the accelerator's performance, obtained through simulation and hardware tests, are presented in Chapter 5.

FFT

The Fast Fourier Transform (FFT) is widely used in DSP and many other application fields, as a fast and more efficient algorithm to compute the Discrete Fourier Transform (DFT). It allows the conversion of a signal from its original domain to the frequency domain. While the DFT has a complexity of $O(n^2)$, the FFT's complexity is $O(n \log n)$, where $n$ is the data size [48]. Many applications require a highly efficient computation of this operation, often leading to a hardware implementation. The algorithm may be mapped onto many different architectures, depending on the hardware restrictions and performance requirements of the application [43].

This Thesis addresses a low-power processor targeting IoT applications, which often require DSP algorithms, namely FFTs. Therefore, it is important to have a commonly used DSP accelerator such as this one, aiming at high energy efficiency and performance. The purpose was not to develop an FFT kernel from scratch, but rather to select the most appropriate one, with a low-power profile and low hardware resource requirements. At the same time, it should be under an open-source license, as stated before. The FFT kernel was wrapped together with an AXI-Lite interface, through which it communicates with the processor. All the additional hardware bridging the kernel's and AXI-Lite interfaces was designed with the goal of reducing hardware usage, while efforts were made to profit from the kernel's performance. The AXI-Lite interface might limit the input data feed rate, and thus the kernel's throughput, in case the AXI-Lite data transactions cannot keep up with it.

The chosen FFT kernel was developed by SPIRAL - Software/Hardware Generation for DSP Algorithms [43], which won the ACM TODAES Best Paper Award 2014. SPIRAL provides an online tool [2] for hardware generation, outputting a generated FFT kernel in Verilog according to the chosen parameters. These parameters may take the values specified in Table 3.3: the "Value" column lists the parameters chosen for the used kernel, the remaining available options (if applicable) are described in the "Range" column, and a short description of each parameter's meaning is presented in the last column.

Table 3.3: SPIRAL FFT kernel online configuration parameters [2].

Problem specification
Parameter       | Value       | Range                   | Description
transform size  | 256         | 4-32768                 | number of samples
direction       | forward     | forward or inverse      | direction of the DFT
data type       | fixed point | fixed or floating point | -
precision       | 32          | 4-32 bits               | fixed-point precision
mode            | unscaled    | scaled or unscaled      | arithmetic mode

Parameters controlling implementation
Parameter          | Value          | Range                        | Description
architecture       | iterative      | iterative or fully streaming | -
radix              | 2              | 2, 4, 16                     | size of DFT basic block
streaming width    | 2              | 2-256                        | number of complex words per cycle
data ordering      | natural in/out | natural or digit-reversed    | data order
BRAM budget        | -1             | -                            | maximum number of BRAMs (-1 = no limit)
permutation method | JACM'09        | JACM'09 [49] or DATE'09 [50] | -

A kernel with n = 256 samples was chosen, for the forward DFT defined as:

$y = \mathrm{DFT}_n\, x, \qquad \mathrm{DFT}_n = \left[e^{-2\pi j k l / n}\right]_{k,l=0,\dots,n-1}$

where $y$ is the $n$-point output vector and $x$ the $n$-point input vector. The "fixed point" data type was chosen due to the processor's lack of hardware support for floating-point operations. Since the current AXI-Lite configuration is set to 32-bit messages, the fixed-point precision was set to 32 bits, in unscaled arithmetic mode.
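As a software cross-check of the accelerator's fixed-point output, the definition above can be evaluated directly with a naive $O(n^2)$ reference in C; this sketch is illustrative and is not part of the original work.

#include <math.h>

/* Naive forward DFT: y_k = sum_l x_l * e^{-2*pi*j*k*l/n}. */
void dft_ref(int n, const double *xr, const double *xi,
             double *yr, double *yi)
{
    for (int k = 0; k < n; k++) {
        yr[k] = yi[k] = 0.0;
        for (int l = 0; l < n; l++) {
            double a = -2.0 * M_PI * (double)k * (double)l / n;
            /* complex multiply-accumulate: (xr + j*xi) * (cos a + j*sin a) */
            yr[k] += xr[l] * cos(a) - xi[l] * sin(a);
            yi[k] += xr[l] * sin(a) + xi[l] * cos(a);
        }
    }
}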


Figure 3.9: SPIRAL Fast Fourier Transform (FFT) iterative architecture [2].

Figure 3.10: SPIRAL Fast Fourier Transform (FFT) fully streaming architecture [2].

The parameters controlling the implementation are addressed next. Both architectures, iterative and fully streaming, were tested within the scope of this Thesis, and the developed hardware is prepared to accommodate both. The iterative one is slower than the streaming version, since the data stream has to iterate over a single stage $O(\log n)$ times ($n$ being the size of the DFT to be computed), as shown in Fig. 3.9, and a new input vector cannot begin before the last vector is processed. In the case of a streaming architecture, the data stream flows in and out of the system continuously; the architecture consists of multiple ($O(\log n)$) cascaded stages, each composed of computation and data reordering components, as depicted in Fig. 3.10. The radix defines the size of the DFT block, controlling the number of points processed in a basic computational block. In the case of an iterative architecture, the problem size $n$ must be a power of the chosen radix, meaning that $n = r^k$ for an integer $k$; such a restriction does not apply to a streaming architecture. The streaming width must be a multiple of the chosen radix. It controls the input data stream width (defined as $w$ in Figs. 3.9 and 3.10); increasing this parameter by a factor of $k$ consequently increases the system's parallelism by a factor of $k$. As specified in Table 3.3, multiple data orderings may be selected from natural in/out, natural input/reversed output and reversed input/natural output; in the reversed ordering, the MSBs become the LSBs and vice-versa. The "BRAM budget" option allows choosing the maximum number of BRAM blocks when Xilinx FPGAs are targeted, by adding synthesis directives, interpreted by the Xilinx tools, to the generated Verilog code; for an unrestricted number of BRAMs, the value -1 should be inserted. The permutation method (last line of Table 3.3) defines, as the name indicates, how the permutation is done. Of the two available methods, DATE'09 [50] is patent-free, but requires almost twice the amount of SRAM and has reduced performance and higher logic costs. The second method, JACM'09 [49], which is patented (protected by U.S. Patent No. 8,321,823), is based on a different and improved technique [2].

Given the previous configurations for the FFT kernel, a single Verilog file is generated by SPIRAL's online tool [2]. For the kernel's integration into the accelerator, it is instantiated within a wrapper, interfacing with the processor through AXI-Lite. This integration was done with additional hardware, whose data path is depicted in Fig. 3.11, acting as a bridge between the FFT kernel and the AXI-Lite interface. The resulting top-level hardware block is denominated the accelerator.

Figure 3.11: FFT accelerator's data path.

The input data is redirected from the AXI-Lite slave registers, which contain the data sent by the processor, based on the protocol defined in Section 3.2.2. The input data, with its width defined by w, might be 32-bit or 16-bit, since the kernel is ready to

compute both types of data; the hardware design, however, only has the 32-bit word width option available. The input data is redirected by the demultiplexer into a buffer register at the moment it arrives. Its purpose is to store the data until it has been fully filled, since the processor sends one 32-bit word in each communication. The register's data width corresponds to $2 \cdot N \cdot w$, with $N$ being the FFT kernel's number of complex inputs, each composed of a real and an imaginary part, individually $w$ bits wide. After the buffer is full, its data is stored in the FIFO, which thus has the same data width as the buffer and a data depth equal to the number of input samples, as defined in the online tool [2] explained before. After the processor is done sending all the data samples, the FIFO will be full and ready to stream all the stored data into the FFT kernel. The need for a streamed input on each clock cycle after the assertion of the kernel's next input signal is the main reason to include a FIFO memory in the design, since the AXI-Lite interface can only provide one new message from the processor every 11 clock cycles.

After the first kernel’s input it takes a known amount of clock cycles until the computed output is ready,

defined by the latency, which depends on the kernel’s configuration chosen in [2]. Upon assertion of the

signal next out by the kernel, its output is streamed into the same FIFO. The signal sel in fifo selects the

FIFO’s inputs between the data to be computed and the output results to be stored. Similarly, the signal

sel out fifo multiplexes the FIFO’s output data, redirecting it to either the kernel’s input or the output

multiplexer operated by sel out reg control signal. This last mux on the data path splits the one word

FIFO’s output 2Nw sized, into smaller words with w width compliant with the supported slave registers

width of AXI-Lite.

The hardware design whose data path is shown in Fig. 3.11 is ready to handle different FFT kernel configurations, with minor adjustments to the accelerator's input parameters and to the kernel's component declaration and instantiation. These parameters adjust the data width of several signals, as well as the counters used to control them. All the configuration options of the online generator tool [2] are covered, with the exception of the fixed-point precision, which is fixed to 32 bits, and the number of samples, fixed to 256. The defined accelerator top module parameters are the following (a usage sketch is given after this list):

• FFT_INOUT_NR: number of inputs/outputs which the current kernel configuration can handle.

• DATA_WIDTH_FIFO: the data width of each word stored in the FIFO, calculated as FFT_INOUT_NR * w, with w = 32 bits.

• DATA_DEPTH_FIFO: the data depth of the FIFO, i.e., how many words of DATA_WIDTH_FIFO width are stored in it. The value is based on the number of data input samples divided by the chosen radix, e.g. samples/radix = 256/2 = 128.

3.4 Summary

This chapter introduced the target platform on which the development was done, followed by a description of the AXI protocol, on which the accelerators are based, detailing the developed interface from both the hardware and the software points of view. The architectures of each accelerator and kernel were presented, as well as the hardware developed to wrap all the components together.


Chapter 4

Implementation and Experimental Work

This chapter presents the basic steps and procedures to set up a working environment to start with PULPino on the chosen development board: from the first step of configuring the system and getting it up and running, to deploying the bitstream into the FPGA and executing a program on the core. Along with the development process, simulation played an essential role, so it is also presented how to simulate such a system. During all these development phases, some drawbacks and problems were found; they will be detailed and possible solutions discussed.

As stated before, PULPino is an open-source project, therefore all the base sources were retrieved from the project's GitHub page [51]. Knowing that this was its first release, and not a mature project, it was expected to have incompatibilities and unsolved issues/bugs. This was one of the main setbacks found: having to deal with a release that was not well documented from a technical point of view (only a very basic user manual and a datasheet are available). Due to the recent release (dating from 2016), there is still a very small active open-source community working with this platform, making it difficult to resolve many of the issues that arose.

Regarding the development board, a ZedBoard was chosen to conduct the necessary tests and obtain results, as presented in the next section.

4.1 Target Device

The development board was chosen in accordance with the PULPino developers' specifications. PULPino is mainly targeted at RTL simulation and ASICs, although there is also an FPGA version supported on the ZedBoard. The FPGA version is not optimized for performance and efficiency, since it is mainly used for emulation rather than as a standalone platform. The ZedBoard carries a Xilinx Zynq-7000 family All Programmable System on Chip (SoC), the XC7Z020-CLG484-1. This device series enables extensive system-level integration and flexibility through its main hardware,


software and I/O programmability. Most of its internal system-level components have GUI configuration tools available, which help reduce development time and ease debugging by auto-generating the required source code from the user's hardware specifications/requirements.

Figure 4.1: Xilinx Zynq-7000 SoC block diagram overview [52]

In Fig. 4.1, the main components of the Zynq device range are represented. The main components of this design are the Programmable Logic (PL) and the Processing System (PS). The PL is derived from Xilinx 7-series FPGA technology, namely Artix-7 for the present XC7Z020 device. The integrated PL block is available for the user to deploy custom-designed hardware, as in any other FPGA, through hardware description languages such as Verilog, VHDL or SystemVerilog. The given low-range Zynq-7000 PL features 85,000 programmable logic cells, 53,200 Look-Up Tables (LUTs), 220 Digital Signal Processing (DSP) slices and 4.9 Mb of BRAM memory.

The PS itself features an Application Processing Unit composed of two ARM Cortex-A9 hard cores, with a maximum frequency of 866 MHz on the featured device and capable of 1 GHz on higher-end SoCs. It also features two cache levels, L1 and L2, with 32 KB (for each of the instruction and data caches) and 256 KB, respectively, in addition to 256 KB of On-Chip Memory, and each processor has its own FPU unit. Together with the memory controller, the processors have access to 512 MB of external DDR memory. I/O peripherals, composed of SPI, CAN, UARTs, I2C, USB and an Ethernet interface, are also available to the PS through the central interconnect block.


The Zynq-7000 SoC is booted in a multi-stage process and includes a boot ROM and a First-Stage Boot Loader (FSBL). The boot process initializes and cleans up the system and prepares it to boot from the selected external boot device. Once that is concluded, the FSBL is executed and the system's main components (PS and PL) are configured accordingly, for instance by loading a lightweight operating system into the PS and a bitstream into the PL.

4.2 System Configuration

Previous to pulpino’s deployment in FPGA, some essential system configurations and main steps were

taken. The recommended toolchain used is Vivado 2015.1 from Xilinx, for synthesis and implementation,

and ModelSim 10.2c from Mentor for simulation. To compile the source C/C++ files, which are meant

to be executed on the addressed core, was used a riscv-toolchain. It may be provided by University of

California - Berkeley if it is the RISC-V official toolchain, or the custom one from ETH Zurich University.

The last one was used, due to its support for all ISA extensions present in RI5CY core (see Section 2.2).

A dual-ARM Cortex A9 is also part of Xilinx SoC in use, therefore its compilers are a important. Hence

it is compiled by Xilinx SDK, gcc-arm-none-eabi and gcc-arm-linux-gnueabi are the ones required, ad-

ditionally with lib32 libraries.

The configuration stages presented below describe a path that must be followed in order to get PULPino up and running, ready to be tested. An external hardware view of the system's block diagram is shown in Fig. 4.2, wherein the Zynq-7000 SoC, carried by the ZedBoard, communicates with the PC via a serial interface (UART).

Figure 4.2: Implementation block diagram. Communications between PS-PL and PC-PS.

Boot

There are multiple methods for booting a Linux system on a Zynq SoC; here, the SD flash memory method was used. The ARM processor boot is a three-stage process: an internal boot ROM stores stage-0 boot code, which configures the processor and the necessary peripherals to start fetching the first-stage bootloader (FSBL) code from the SD card, which is composed of a root and a boot partition and was configured according to the Xilinx specifications [53]. The FSBL is copied to the SoC's on-chip memory and afterwards executed. The FSBL includes all the required initialization code for the peripherals used in the PS, and configures the PL with the bitstream. The third step is to bring the OS into the SoC's memory from the SD card, because when the processor is powered on the memory is empty. This is done by the bootloader u-boot. Apart from loading the Linux buildroot OS, it performs other tasks that the kernel might not be able to, such as configuring the clock frequency, loading the device tree and enabling boot commands. The loaded buildroot OS has all the necessary drivers and configurations to initialize along with its peripherals, using the hardware description present in the device tree [54].

Generating the Bitstream

Unlike most kinds of Vivado projects, this one operates according to a makefile that sets up the environment and executes Tcl commands (Vivado commands) to create the project, add the required sources, set properties, set the compile order, and synthesize and implement according to an area-optimized strategy.

The flow is composed of two different projects. A top project contains the PS, the AXI buses and the AXI converters required to interface both with the core (via SPI) and with the FPGA GPIO ports. A second project contains PULPino and all its RTL sources, meaning that only this second Vivado project would be needed in case of a standalone implementation of PULPino on an FPGA. If this second project could be added to the top one as a design block, it would facilitate all the development/debugging processes and allow the use of Vivado's full block-design potential. However, due to an incompatibility between Vivado's block design and PULPino's sources, this is not possible. Consequently, all of Vivado's debug tools and automatic block-connection features (one of its remarkable development advantages) are unavailable, translating into much more difficult and time-consuming problem solving and development processes.

Deploy the Bitstream

Once the bitstream is generated, it can be programmed into the FPGA in at least two different ways. On Processing System (PS) boot, the bitstream file is loaded from the SD card and the FPGA is programmed while u-boot is booting Linux. Otherwise, it can be deployed using the Xilinx XMD tool on the PC the board is connected to, by issuing the command "fpga -f bitstream.bit" to upload it into the FPGA. Note that if this method is chosen, the core needs to be reset before uploading any pre-compiled program into it. The reset is done by issuing the spiload script on PULPino (via the serial port), loading an empty stimulus file into the memories and thus forcing the core to reset.

Execute a Program on PULPino

When the bitstream is deployed and the core reset, it is ready to receive the proper stimulus files, which contain all the data to be uploaded into the memories. Once the C/C++ code is compiled, a stimulus file is generated containing each memory address and the value to be stored at that address. This file can be uploaded to the FPGA either by saving it directly on the SD card or by connecting to the board via ssh and using secure copy (scp). Once it is on the board, a script (spiload) that loads the stimulus file into the FPGA is executed on the PS. It not only loads the file via SPI into the memories, but also defines the boot address, resets the core, and listens to the core outputs, which are redirected to the Linux stdout. This process could also be performed "manually", step by step, using a JTAG debug tool that connects to the internal AXI bus and has access to the whole address space. However, a working version that complies with this core version is not up to date; such a tool is only available for previous versions of the core (e.g. the or1k or or10n core versions).
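For illustration only, the sketch below shows how such a stimulus file could be produced on the host. The "address_word" line layout and the file name are assumptions made for this example; the real file is generated by PULPino's own conversion scripts from the compiled binary.

#include <cstdint>
#include <cstdio>
#include <vector>

// One stimulus entry: a target memory address and the 32-bit word to store there.
struct StimEntry {
    uint32_t addr;
    uint32_t word;
};

int main() {
    // Hypothetical image: first instruction words placed at the memory base.
    std::vector<StimEntry> image = {
        {0x00000000u, 0x00000093u},
        {0x00000004u, 0x00000113u},
    };

    std::FILE *f = std::fopen("spi_stim.txt", "w");
    if (!f) return 1;
    for (const StimEntry &e : image)
        std::fprintf(f, "%08X_%08X\n", e.addr, e.word);  // assumed line layout
    std::fclose(f);
    return 0;
}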

Simulation

An important phase of development is testing an application in a controlled environment, in which it is easier to debug and to verify all of its intrinsic functioning, for example through waveform analysis. ModelSim 10.2c is the default platform on which PULPino was tested. All the simulation scripts (available with the PULPino project) were conceived to fit this platform's requirements and interface. They reproduce the behavior of PULPino as if it were running on the ZedBoard, meaning that they either load the stimulus file over SPI, reproducing the behavior of spiload (see Section 4.2), or load the stimulus directly into the memory. The loading method is chosen by presetting the ModelSim MEMLOAD argument.

If there is a need to simulate PULPino on another simulation engine, all the simulation files, including the required libraries and simulation sources, need to be redesigned to fit the new simulator's requirements. For instance, the simulation sources of PULPino are not compatible with the Vivado simulator, mainly due to some SystemVerilog features that it does not support (e.g. dynamic arrays). This makes porting to new simulation platforms or hardware description languages difficult and time-consuming.

The simulation environment is built using CMake. A bash script located in the sw folder of PULPino's git repository needs to be configured with the paths to the RI5CY toolchain, ModelSim and the PULPino git directory, and to enable the use of compressed instructions. The next step is compiling all RTL libraries using ModelSim, which is done with the previously generated makefile. In this step, many compilation errors may appear if the ModelSim version is not exactly the recommended one; even with the same version, some extra features might not be enabled by the license, also leading to compilation errors. Once all libraries are ready, the simulation can be launched by issuing, for instance, "make helloworld.vsim", which opens the ModelSim GUI.

Some faulty ModelSim RTL sources and libraries were detected when optimization is enabled. Therefore, it is not recommended to use optimizations during simulation, as they may lead to errors and untrustworthy results.

When a post-synthesis or post-implementation simulation is needed, a set of possible approaches were tested:

1. The use of the Vivado simulator. This method requires an adaptation of the SystemVerilog simulation sources and testbench, since it does not support all kinds of structures, e.g. dynamic data structures.

2. Generating a post-synthesis/implementation netlist to be simulated in ModelSim. A netlist contains all the hardware blocks (LUTs, DSPs, etc.) and the connections between them, from which the synthesized/implemented schematic is generated. Compiling the netlist in ModelSim requires that all of Vivado's Unisim libraries be properly imported into ModelSim; those libraries contain all the hardware elements used by Vivado upon netlist generation. However, some Unisim elements are not compatible with the ModelSim compiler. Even if all the libraries compile without errors, this method only allows verifying the correct functioning of the post-synthesis/implementation design. During a development phase, if there is a need to debug the generated hardware in simulation, this method is not recommended, since all the nets have names that were automatically generated by the tool and do not resemble the original, user-defined ones.

3. Using ModelSim as the default simulator in Vivado. After synthesizing/implementing the project, it is possible to simulate it directly from Vivado, using ModelSim to compile and run PULPino's simulation sources.


4.3 New AXI Interconnect Slave

This section tackles the challenge of attaching a new accelerator to the AXI interconnect bus through an AXI-full to AXI-lite converter, without any automatic configuration tools (such as the ones Vivado makes available). A new RAM memory was attached in order to test communications with the core by issuing load/store instructions. It interfaces with the AXI-full specification in a newly and custom-set address region, through a wrapper that translates between AXI-full and the RAM read/write interface. The wrapper used is the same one that comes along with PULPino's data memory, in which the protocol conversion is already implemented. The objective is to test the custom connections made in the core's top-level design, where the components are instantiated and connected among themselves.

It was necessary to choose a free AXI interconnect address space region to house the new memory, in accordance with PULPino's memory map previously presented in Fig. 3.2. The address region chosen is next to the existing data memory: from 0x00108100 to 0x00110100.

Listing 4.1: AXI Interconnect Instantiation in System Verilog

axi_node_intf_wrap
#(
    .NB_MASTER      ( 4                    ),
    .NB_SLAVE       ( 3                    ),
    .AXI_ADDR_WIDTH ( `AXI_ADDR_WIDTH      ),
    .AXI_DATA_WIDTH ( `AXI_DATA_WIDTH      ),
    .AXI_ID_WIDTH   ( `AXI_ID_MASTER_WIDTH ),
    .AXI_USER_WIDTH ( `AXI_USER_WIDTH      )
)
axi_interconnect_i
(
    .clk       ( clk_int    ),
    .rst_n     ( rst_n_int  ),
    .test_en_i ( testmode_i ),

    .master    ( slaves     ),
    .slave     ( masters    ),

    .start_addr_i ( { 32'h0010_8100, 32'h1A10_0000, 32'h0010_0000, 32'h0000_0000 } ),
    .end_addr_i   ( { 32'h0011_0100, 32'h1A11_FFFF, 32'h0010_7FFF, 32'h0008_FFFF } )
);

The memory map is defined with a start and an end vector of addresses, as shown in Listing 4.1. Associated with each region is a new AXI master bus; therefore the number of masters is defined by the parameter NB_MASTER = 4, matching the number of defined address regions. These buses connect the AXI interconnect to the remaining components attached to the core (instruction/data memories, peripherals and other additional components).

The newly instantiated memory was tested using GDB as a debug tool, issuing writes and reads on the defined address range, proving functionality and validating the configurations and new components. Based on these tests, the work moved on to the next stage of adding an AXI converter, which would need to be capable of converting from AXI-full to AXI-lite, supporting communications between processor and accelerator.
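As an illustration of the kind of check performed, the minimal sketch below walks over the newly mapped region with plain load/store accesses. The write/read pattern is an assumed stand-in for the GDB-driven tests described above; the address constants mirror the range configured in Listing 4.1.

#define NEW_RAM_START 0x00108100
#define NEW_RAM_END   0x00110100

int test_new_ram(void) {
    volatile unsigned int *p;
    int errors = 0;

    /* Store an address-derived pattern across the whole region. */
    for (p = (volatile unsigned int *)NEW_RAM_START;
         p < (volatile unsigned int *)NEW_RAM_END; p++)
        *p = (unsigned int)p ^ 0xA5A5A5A5u;

    /* Load it back and count mismatches. */
    for (p = (volatile unsigned int *)NEW_RAM_START;
         p < (volatile unsigned int *)NEW_RAM_END; p++)
        if (*p != ((unsigned int)p ^ 0xA5A5A5A5u))
            errors++;

    return errors;  /* 0 means every load/store reached the new slave */
}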

4.4 New Accelerator

Following the previous work of Section 4.3, the same procedure of defining a memory region applies when attaching a new accelerator to the AXI interconnect; a top-level overview is shown in Figure 3.4. The accelerator is instantiated in the core_region.sv file, wherein all the core-related hardware blocks also reside (debug unit, data and instruction memories, protocol converters, memory multiplexers and the RISC-V core). In the same core region, the accelerator, holding an AXI-lite slave interface, is connected to the master interface of the AXI-full to AXI-lite converter. This converter is based on the Vivado AXI protocol converter block, although it was optimized to fit the requirements of PULPino's AXI-full interface, complying with the protocol specifications already in use. All the additional compatibility with AXI3 (see Section 3.1) was removed, since all AXI communications are based on the AXI4 specification.

The new accelerator inputs all the AXI-lite signals and contains all the logic needed to handle them. Additionally, the kernel block needs to be instantiated, and any further necessary hardware added, so that it works with the AXI-lite mode of operation (using slave registers). The additional hardware added to the addressed accelerators is covered in Sections 3.2 and 3.3.

4.5 Summary

The previous sections of this chapter presented all the required steps to set up the environment in which PULPino was tested. After system configuration and boot, generating the bitstream and deploying it onto the FPGA are required before uploading and executing programs on it. Additionally, this chapter detailed the approach taken to deploy a new AXI interconnect slave/accelerator.


Chapter 5

Experimental Results

This chapter presents the experimental results obtained from the tests performed on PULPino, with and without the hardware accelerators addressed in this Thesis. Both the FFT and SHA-3 algorithms are under analysis. The goal is to compare the performance of the software and hardware implementations, drawing conclusions on the attainable speedup and energy savings. Moreover, energy consumption and efficiency are also compared.

5.1 Software vs Hardware

To measure the speedup that a hardware accelerator provides over a software-only implementation, test benches were set up to verify the performance and energy efficiency of both. This section presents the software-only algorithms (which do not require an accelerator) and the ones that interact with the accelerators, used to compute SHA-3 and the FFT. The software-only algorithms were adjusted to perform under conditions similar to the accelerators', thus enabling a fair comparison between hardware and software implementations.

Both accelerators were synthesized and implemented with the same tools and optimization strategies. For synthesis, the strategy used was "Flow Area Optimized High", while for implementation the optimization strategy was "Area Explore". These strategies were chosen by PULPino's developers as the most adequate for the release under analysis, taking into account that the purpose is to operate under the restricted power envelopes of the IoT domain, while still fulfilling the computational requirements.

5.1.1 SHA-3

The algorithm used to implement SHA-3 in software was based on the implementation of [55] (Appendix A), written in C++. It was compiled using gcc with -O3 optimization and all the available instruction extensions enabled, and configured with the same number of permutation rounds and the same 512-bit hash output as the hardware accelerator kernel. The base input test message used was: "The quick brown fox jumps over the lazy dog". With a size of 44 bytes when translated from ASCII to binary, it is a well-known pangram (a sentence that includes all the letters of the alphabet), commonly used as a test message for different


kinds of hash and encryption algorithms.

To test the accelerator, the algorithm shown in Listing 5.1 was developed. It interfaces with the accelerator according to the specifications given in Section 3.2.2. In this case, the optional functionality of slv_reg3 was used to send the byte_num configuration value needed by the SHA-3 kernel.

The AXI-lite register addresses are defined at the beginning of Listing 5.1; the address region was set as described in Section 4.3. The program starts by resetting the accelerator at line 14, followed by the byte_num value written through slv_reg3. It is then ready to start sending the input message NR_MSG times (set for test-bench purposes), at line 19. The last input word is sent to slv_reg2; in this case it is a dummy word, because the byte_num value is equal to zero (see Section 3.3). After the last input is sent, the program waits for the computation to finish (line 36). When the output hash is ready to be fetched, the value of slv_reg0 equals 0xdeadbeef. Finally, the resulting 512-bit hash is read from slv_reg3.

To acquire the number of clock cycles the algorithm takes to finish, a timer incrementing at every clock cycle was set up, reset and started at the beginning of the algorithm (lines 11 and 12) and stopped at its end (line 42). Afterwards, code to print the outputs, for instance to the serial port, might be added for debugging or user-interface purposes.

Listing 5.1: SHA-3 interface with accelerator C++ code

1  #define SLV_REG0 0x00200000
2  #define SLV_REG1 0x00200004
3  #define SLV_REG2 0x00200008
4  #define SLV_REG3 0x0020000C
5  #define NR_MSG   10
6
7  void main(){
8      volatile int *axi_lite_reg = (volatile int *)(SLV_REG0);
9      unsigned int aux[16];
10
11     reset_timer();
12     start_timer();
13
14     *axi_lite_reg = 0x01010101;                // reset doing a write on slv_reg0
15     axi_lite_reg = (volatile int *)(SLV_REG3);
16     *axi_lite_reg = 0x00000000;
17     axi_lite_reg = (volatile int *)(SLV_REG1); // increment addr to write on slv_reg1
18
19     for (int k = 0; k < NR_MSG; k++){
20         *axi_lite_reg = 0x54686520;  // "The "
21         *axi_lite_reg = 0x71756963;  // "quic"
22         *axi_lite_reg = 0x6b206272;  // "k br"
23         *axi_lite_reg = 0x6f776e20;  // "own "
24         *axi_lite_reg = 0x666f7820;  // "fox "
25         *axi_lite_reg = 0x6a756d70;  // "jump"
26         *axi_lite_reg = 0x73206f76;  // "s ov"
27         *axi_lite_reg = 0x65722074;  // "er t"
28         *axi_lite_reg = 0x6865206c;  // "he l"
29         *axi_lite_reg = 0x617a7920;  // "azy "
30         *axi_lite_reg = 0x646f6720;  // "dog "
31     }
32     axi_lite_reg = (volatile int *)(SLV_REG2); // last write
33     *axi_lite_reg = 0x00000000;                // dummy write when byte_num=0
34
35     axi_lite_reg = (volatile int *)(SLV_REG0);
36     while ((unsigned)*axi_lite_reg != 0xdeadbeef); // wait computation completion
37
38     axi_lite_reg = (volatile int *)(SLV_REG3); // read on addr 3
39     for (int j = 0; j <= 15; j++)              // fetch Hash-512
40         aux[j] = *axi_lite_reg;
41
42     stop_timer();
43 }
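The word constants written in lines 20-30 are simply the message bytes packed big-endian into 32-bit words. A small helper makes the encoding explicit (illustrative only; the listing hard-codes the words):

// Four ASCII characters packed big-endian into one 32-bit word.
unsigned int pack4(const char *s) {
    return ((unsigned)s[0] << 24) | ((unsigned)s[1] << 16) |
           ((unsigned)s[2] << 8)  |  (unsigned)s[3];
}
// pack4("The ") == 0x54686520, matching line 20 of Listing 5.1.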

A set of tests was performed with different message sizes defined by NR_MSG, always with the same text as before but replicated up to 10 times. The hashes were checked for every message, to verify the correct functioning of both systems (with and without acceleration). After executing the tests to evaluate the number of clock cycles required by both implementations, with a single 40MHz clock (the maximum frequency on the FPGA), the speedup can be verified in Figure 5.1.

Figure 5.1: SHA-3 computation speedup using hardware accelerator. Multiple message sizes were tested.

From an analysis of the graph, it can be concluded that a significant speedup is achieved over a software-only implementation by adding a hardware accelerator for this purpose. A speedup of 104 times is achieved for a 44-byte message, meaning that the hardware-accelerated implementation is 104 times faster than the software-only one. A speedup of up to 185 times is verified for a 440-byte message, in comparison with the non-accelerated version. The speedup tends to increase along with the length of the input message, as the linear regression line indicates.


5.1.2 FFT

A well-known algorithm was used to test the performance of the FFT in software: the Cooley-Tukey FFT, whose implementation, written in C++, was based on the algorithm of [56] (Appendix A). As for SHA-3, it was compiled with the top optimization level and with all of PULPino's instruction extensions enabled. It requires only the stock implementation of PULPino to be executed, since no hardware acceleration is at stake in this software-only test bench.

Unfortunately, the developed hardware presented in Section 3.3, implementing the FFT accelerator, is only fully functional in simulation. Due to the lack of debugging tools, it was only possible to conclude that a problem is present in the multiplier units: after mapping the design into hardware, these units were not outputting the most significant 32 bits of the 64-bit result of a 32x32-bit multiplication. Despite the malfunctioning of this unit, the hardware was properly mapped after synthesis and implementation, making it possible to experimentally evaluate the performance and efficiency of the FFT accelerator. Both speedup and power consumption results are therefore valid and comparable with the SHA-3 accelerator, relying on the same tools, optimization settings and platform.

In order to test the software-only and accelerated algorithms that implement the FFT, a data set of 256 complex samples was defined as input. Since a complex number is composed of a real and an imaginary part, the total number of inputs is 512 words of 32 bits each, corresponding to 2Kbytes of input and, likewise, of output data.
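The buffer handed to the accelerator in Listing 5.2 can thus be pictured as 256 interleaved real/imaginary pairs. The interleaved layout below is an assumption consistent with the 512-word count described above (a sketch, not the kernel's documented format):

#define N_SAMPLES 256

int buf[2 * N_SAMPLES];  /* 512 words = 2Kbytes, as used in Listing 5.2 */

void fill_input(void) {
    for (int n = 0; n < N_SAMPLES; n++) {
        buf[2 * n]     = 0; /* Re{x[n]}, 32-bit fixed point */
        buf[2 * n + 1] = 0; /* Im{x[n]}, 32-bit fixed point */
    }
}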

The chosen FFT kernel, integrated in the accelerator as detailed in Section 3.3, has several possible configurations. This allows a deeper analysis of different kinds of setups, to establish a comparison between them and conclude which one is more beneficial, either from an energy-efficiency or a performance point of view. The defined configuration has fixed and variable parameters. The fixed ones are the problem specification (see Table 3.3):

• Transform size: 256 input complex data samples;

• Direction: Forward DFT;

• Data type: Fixed Point;

• Fixed point precision: 32-bit;

• Mode: unscaled

The variable parameters control the translation of the FFT algorithm into the kernel's Verilog file. Multiple setups were tested by changing these parameters, and their experimental results are presented further on. To test the FFT accelerator, the algorithm shown in Listing 5.2 was developed, which allows the core to interface with it. It can be decomposed into four main sections. First, it resets the accelerator at line 8; then it sends the assigned inputs to slv_reg1 and the last input to slv_reg2, at lines 11 and 15, respectively. It then waits for the computation to complete after sending the last input, at line 18. Finally, the results are fetched from slv_reg3 at line 21. The algorithm is completely independent of the hardware configuration of the accelerator.


Listing 5.2: FFT interface with accelerator C++ code

1  void main(){
2      volatile int *axi_lite_reg = (volatile int *)(ADDR0);
3      int aux[512];
4
5      reset_timer();
6      start_timer();
7
8      *axi_lite_reg = 0x01010101;             // reset doing a wr on slv_reg0
9      axi_lite_reg = (volatile int *)(ADDR1); // write on slv_reg1
10
11     for (int i = 0; i < 511; i++)
12         *axi_lite_reg = buf[i];
13
14     axi_lite_reg = (volatile int *)(ADDR2); // last write
15     *axi_lite_reg = buf[511];
16
17     axi_lite_reg = (volatile int *)(ADDR0);
18     while ((*axi_lite_reg) != 0xdeadbeef);  // wait for computation completion
19
20     axi_lite_reg = (volatile int *)(ADDR3); // fetch results from slv_reg3
21     for (int j = 0; j <= 511; j++)
22         aux[j] = *axi_lite_reg;
23
24     stop_timer();
25 }

Tests were performed with different parameters, by changing the architecture and the radix (see Table 3.3). When the radix is increased, the streaming width is automatically increased to match the number of input words it requires. The architecture may vary between the iterative and streaming versions (Section 3.3). Figure 5.2 shows the speedup achieved by adding an FFT accelerator. The software-only FFT algorithm executes in a total of 38126 clock cycles, which is the basis for the speedup calculations.

Figure 5.2 presents the ratio between the number of clock cycles required to compute the FFT algorithm on PULPino with and without hardware acceleration. It is noticeable, from the analysis of Figure 5.2, that the speedup of the stream version overcomes the iterative one, although it is achieved at an extra hardware cost. The analysis of extra hardware versus power consumption is presented further on, in Section 5.2.
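For reference, each bar follows directly from the cycle counts:

\[
\mathit{Speedup} = \frac{\mathit{ClkCycles}_{SW}}{\mathit{ClkCycles}_{HW}} = \frac{38126}{\mathit{ClkCycles}_{HW}}
\]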

The speedup results achieved with the FFT are less significant than those of the previous SHA-3 analysis. This is in part due to the larger share of compressed instructions used in the FFT algorithm, in comparison with the SHA-3 one. The FFT software implementation executes a total of 39458 instructions, of which 34179 are compressed, i.e. 87% compressed instructions. On the other hand, the SHA-3 software-only algorithm has only 26% compressed instructions, out of a total of 251677 instructions of which 64299 are compressed. RISC-V compressed instructions are claimed to increase not only performance but also energy efficiency, while reducing code size [57].
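These percentages follow directly from the reported instruction counts:

\[
\frac{34179}{39458} \approx 87\% \ \text{(FFT)}, \qquad \frac{64299}{251677} \approx 26\% \ \text{(SHA-3)}.
\]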


Figure 5.2: FFT computation speedup using hardware accelerator, for multiple radices implemented in iterative or stream mode.

5.2 Power Efficiency

This section addresses how the novel hardware accelerators developed under the scope of this Thesis influence the power efficiency of the whole system (PULPino + accelerators). It intends to show that, with the use of hardware accelerators, both performance and energy consumption can be improved. Having additional hardware implies an increase of fabric area on ASICs, or more logic resources on an FPGA. For certain applications that directly benefit from the acceleration, it may significantly reduce the computation time, allowing the processor to go to sleep at an earlier stage. Most IoT embedded systems, for which PULPino is fitted, may directly benefit from this improvement, since they usually employ a run-to-halt operation mode, in which all the required computation is done as fast as possible in order to enter sleep mode afterwards. For a hardware acceleration to be energy efficient, the power savings achieved by entering sleep mode sooner need to overcome the extra static energy cost the accelerators bring.

Measuring the real power consumption of the presented systems is not a simple task. Due to restrictions of the required development platform (ZedBoard), it is not possible to measure the real power consumption of the FPGA fabric alone; it can only be estimated using the available tools, explained further ahead. To overcome this issue, a development board that could power the FPGA chip with an external power supply would be needed. Additionally, real-time control over the FPGA's package temperature would also be required, due to its influence on power consumption measurements [58].

Given such hardware restrictions, the setup used to measure the system's power consumption was Xilinx Vivado working together with ModelSim as the simulation tool. ModelSim provides all the activity of the active signals of the application under analysis, which is compiled into a Switching Activity Interchange Format (SAIF) file. In Vivado, the hardware is synthesized and implemented, with the simulation running over the implemented hardware, corresponding to the hardware effectively mapped onto the board's FPGA. In this simulation, the SAIF file is loaded to improve the estimation accuracy, providing average switching activity information for the active signals. More information about the simulation tool, and the justification for choosing ModelSim as Vivado's default simulation tool, was previously detailed in Section 4.2.

The following sections present the results of the several power consumption tests performed on the hardware accelerators developed under the scope of this Thesis. The goal was to measure their power consumption in the attainable states of operation, such as different encrypted message sizes for SHA-3, or multiple radix/architecture configurations for the FFT, and to finally conclude which one is the most power efficient regarding computation time, static and dynamic power consumption, and overall computation power, by tracing energy-savings graphs that support such an analysis. Some of the graphs presented in the next sections do not contain all three frequency values (40MHz, 20MHz and 5MHz), because some of the curves are similar and would not add relevant information.

5.2.1 SHA-3

The energy efficiency of the SHA-3 accelerator was tested by performing multiple encryptions with different message sizes. They were performed on PULPino implementations with and without the hardware accelerator; the software-only version carries no additional hardware, in order to obtain the most accurate power measurements. The initial message length is 44 bytes, which corresponds to the size of the commonly used sentence referenced in Section 5.1.1. The subsequent test messages correspond to the initial message replicated and concatenated up to 10 times.

Figure 5.3: PULPino with SHA-3 accelerator: computation energy with and without hardware acceleration, combined with the achieved energy ratio (SW/HW) at 5MHz.

Figure 5.3 depicts the power measurement results obtained from the several computations described previously. It shows the computation energy, which corresponds to the total amount of energy required to perform the message encryption. Notice that the total amount of energy required by PULPino with the SHA-3 accelerator is multiplied by a factor of 10 in order to be perceptible in the graph; otherwise it would not be visible next to the software-only computation energy. Both were calculated according to equation 5.1:

Both were calculated accordingly with the following equation 5.1:

ComputationEnergy =1

Freq∗ ClkCycles ∗ Power (5.1)

Here, Freq corresponds to PULPino's main operating frequency, ClkCycles to the number of clock cycles required by the processor to conclude the computation, and Power to the on-chip power consumption estimated by Vivado. This power result is composed of two main parcels, dynamic and static power. The dynamic power originates from the logic switching activity, while the static power represents the power consumed by the FPGA logic when no signals are toggling.

Figure 5.4: PULPino with SHA-3 accelerator: energy saved at 40MHz, 20MHz and 5MHz of main clock frequency, with different encrypted message sizes.

Regarding PULPino with the SHA-3 accelerator, the dynamic and static on-chip power consumption values are 189mW and 123mW, respectively. On the stock version of PULPino, the dynamic and static values are lower, 47mW and 121mW respectively, due to the absence of the additional hardware the accelerator brings. Notice that the additional static power is very small: only 2mW, which translates into a 1.7% increase in static power. This is the amount of extra power the accelerator consumes when idle. Since the hardware does not vary with the increase of the message size, there is no need to collect new on-chip power consumption data for each computation (corresponding to each bar of Figure 5.3). Consequently, the power figures presented above are the same throughout the calculations for the hardware-accelerated and software-only results.
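As a worked example of equation 5.1, the sketch below computes the energy for both versions from these power figures; the cycle counts are placeholders for measured values, not results from the text.

#include <cstdio>

// energy in joules = (1 / f) * cycles * power
double computation_energy(double freq_hz, double clk_cycles, double power_w) {
    return (1.0 / freq_hz) * clk_cycles * power_w;
}

int main() {
    const double f         = 5e6;           // 5MHz main clock
    const double p_hw      = 0.189 + 0.123; // accelerated: dynamic + static (W)
    const double p_sw      = 0.047 + 0.121; // stock PULPino: dynamic + static (W)
    const double cycles_hw = 1e4;           // placeholder: accelerated cycle count
    const double cycles_sw = 1e6;           // placeholder: software-only cycle count

    std::printf("E_hw = %g J, E_sw = %g J\n",
                computation_energy(f, cycles_hw, p_hw),
                computation_energy(f, cycles_sw, p_sw));
    return 0;
}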

The right vertical axis of Figure 5.3 represents the energy ratio, depicted by the solid black line on the graph, which corresponds to the factor by which the energy required to compute was reduced by using the accelerator. In this 5MHz showcase, it is reduced by up to 160 times when the message length is 440 bytes. As the tendency line of the energy ratio indicates, the energy savings tend to increase with the length of the message, resulting in increasingly efficient usage of the SHA-3 accelerator as the message length grows. The same pattern is common to the results at all remaining frequencies, although the maximum achieved energy ratio varies: at 20MHz a maximum energy ratio of 114 times was achieved, while at 40MHz it only goes up to 100 times. All of these values correspond to an input message of 440 bytes.

Figure 5.4 depicts the energy savings achieved by using the SHA-3 accelerator, showing how much energy, in percentage, is saved in comparison with the stock version of PULPino, without hardware acceleration. From the obtained results, the lowest energy savings start at 98.23% for a 44-byte message at 40MHz, going up to 99.39% at 5MHz with a message size of 440 bytes. At lower frequencies the energy savings are higher, starting with a difference of 0.68% between the highest and lowest frequencies for a message size of 44 bytes. This delta tends to decrease, down to 0.39% for 440-byte messages, as the message size increases. Thus, the operating frequency tends to have less impact on the energy saved for longer messages, meaning that "long" messages can be computed faster, by increasing the main clock frequency, with little impact on the energy savings.
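These percentages are consistent with the energy ratios reported for Figure 5.3; for instance, at 5MHz with a 440-byte message:

\[
\mathit{EnergySaved} = \left(1 - \frac{E_{HW}}{E_{SW}}\right) \times 100\% \approx \left(1 - \frac{1}{160}\right) \times 100\% \approx 99.4\%.
\]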

5.2.2 FFT

For PULPino with the FFT accelerator, the power results were obtained in a manner similar to the SHA-3 one, with a slight difference: instead of varying the input data size as before, the same input was tested on multiple configurations of the FFT accelerator's architecture, more precisely radix 2 and radix 4, using both the iterative and stream architectures (more details in Section 3.3). Although the speedup results presented in Section 5.1.2 also include figures for a radix 16 architecture, it was not possible to successfully translate it into hardware on the FPGA, due to problems similar to those previously stated in Section 5.1.2 regarding the faulty Vivado multiplier unit. Nevertheless, the power results for radix 2 and 4 are sufficient to draw conclusions on the energy efficiency of PULPino with the attached FFT accelerator.

The power results for PULPino without hardware accelerator again mean that no extra hardware was added. Figure 5.5 shows the total on-chip power consumption in mW, divided into two parts: dynamic and static power. Each column of the graph represents a different hardware configuration computing the same input data with the FFT algorithm. The SW-only column corresponds to the version in which no accelerator is attached to PULPino; all the remaining columns result from the multiple architectures of the FFT accelerator attached to PULPino. As can be interpreted from the graph of Figure 5.5, the accelerated versions always consume more on-chip power than the SW-only version, which is explained by the additional hardware required by the accelerator. Stream architectures, being more resource hungry than the iterative ones, have a higher overall on-chip power consumption, and consequently achieve superior speedups, as shown in Section 5.1.2. This increase in total on-chip power is mainly due to dynamic power, since static power varies significantly less, as shown in Figure 5.6.

Figure 5.5: PULPino with FFT accelerator: dynamic and static on-chip power consumption at 40MHz.

Even though these variations are relatively small when compared with dynamic power, the increase of static power in the stream architectures over the iterative ones is noticeable.

The use of an FFT accelerator translates into a maximum increase of 5mW in static power, which corresponds to 4% of the total on-chip static power of PULPino without accelerators. This means that when the processor is idle, only a maximum increase of 4% in power consumption occurs, which is an acceptable figure for a system targeting IoT embedded applications with restricted power envelopes, usually operating in a run-to-halt mode and consequently spending a considerable amount of time in idle or sleep mode.

The purpose of attaching accelerators to PULPino is to enhance its power efficiency. Accordingly, Figure 5.7 presents the power saved, in percentage, by using the FFT accelerator when computing an FFT algorithm, achieving a maximum saving of 66% with a radix 2 iterative architecture configuration. Plotting the same architectures at multiple frequencies allows analyzing which one tends to save more power, combined with the computation time the algorithm takes to finish. The "optimal" mode of operation is achieved when the computation time is minimum and the energy saving is maximum; thus, if the simple ratio between both of these results is calculated for every column of the graph, it is possible to point out which one that is. The highest ratio among all,

\[
\frac{\mathit{EnergySaved}}{\mathit{ComputationTime}} = 1.75,
\]

is obtained when using radix 2 iterative at 40MHz. This architecture is, among all of them, the one requiring the fewest FPGA resources; thus, at the higher frequency of 40MHz, it seems to be the best configuration to balance energy savings and computation time. Even so, this might not be the best configuration for every embedded application, since each one has its own energy restrictions and computation time requirements. The graph of Figure 5.7 can be a useful guide for finding which architecture and frequency best fit a certain set of system requirements.


Figure 5.6: PULPino with FFT accelerator: static on-chip power consumption at 40MHz.

Figure 5.8 portrays a graph combining computation energy and energy ratio, calculated from the same results used in the previous graphs, with the same input data at a 40MHz main clock frequency. The computation energy figures are based on equation 5.1, presented in Section 5.2.1. These results also confirm that the FFT accelerator architecture which saves the most power is the radix 2 iterative version, as can be stated by analyzing the energy ratio line, which corresponds to the ratio between the energy consumption (on the same graph) of the software-only and hardware-accelerated versions.

5.3 Summary

In this chapter, the analysis of the experimental results was presented, starting with the speedup achieved through the hardware accelerators. All the results concern the SHA-3 and FFT accelerators implemented under the scope of this Thesis. Afterwards, a power efficiency analysis of these accelerators was presented, in which conclusions were drawn about their power consumption and overall impact on the energy efficiency of PULPino.


Figure 5.7: PULPino with FFT accelerator: energy saved vs computation time.

Figure 5.8: PULPino with FFT accelerator: computation energy vs energy ratio (SW/HW) at 40MHz.


Chapter 6

Conclusions and Future Work

In conclusion, the initial goal of boosting the energy efficiency of PULPino for applications in embedded IoT devices operating within restricted power envelopes was successfully accomplished in this Thesis. The improvements were achieved by attaching two different hardware accelerators: a cryptographic SHA-3 accelerator and a digital signal processing FFT accelerator.

In order to successfully attach the accelerators and deal with the heterogeneity between accelerator and processor, a custom low-power AXI-lite based interface was developed. It has the advantage of providing a simple, plug-and-play way for current and future accelerators to interface with the processor, encouraging the development of new attachable accelerators by the open-source community, since PULPino was released under an open-source license. This saves development time through the reutilization of hardware designs, paving the way for more modular embedded systems in which it is possible to add the most suitable accelerator for a certain final application.

Under the scope of this Thesis, two accelerators were attached and evaluated in terms of speedup and energy efficiency: SHA-3 and FFT. A speedup of 185 times was achieved for the SHA-3 algorithm and of 3 times for the FFT one, with higher speedup values attainable as the input data size increases, as shown by the presented tendency lines. As stated, the cryptographic algorithm presents itself as the most suitable for acceleration in comparison with the FFT. This can be explained by the low percentage of compressed instructions that the non-accelerated SHA-3 algorithm translates into: only 26% compressed instructions, against 87% for the non-accelerated FFT algorithm. Apart from that, the type of processing performed in SHA-3 is more suitable for hardware acceleration than the FFT's. RISC-V compressed instructions are claimed to reduce code size while enhancing energy efficiency and performance [57].

Regarding energy savings, the SHA-3 and FFT accelerators can save up to 99.39% and 66% of energy, respectively. For the FFT accelerator, an "optimal" point of operation was proposed, setting it to a radix 2 iterative configuration at the maximum attainable frequency of 40MHz; among the several tested configurations, it yielded the best ratio between energy savings and the time required to compute the FFT algorithm. Other modes of operation might be better suited, depending on the energy requirements of the target application.

Regarding future work, there is always room for improvement in the current AXI-lite interface, which connects the accelerator to the main AXI interconnect bus. Other kinds of accelerators might benefit from additional control signals or other custom features that could be added to this interface. Apart from AXI-lite, there are other kinds of buses that could improve data communication between the accelerator and the AXI interconnect bus, such as AXI-Stream, which has advantages for data streaming but lacks individual control registers. This could be overcome by implementing a more complex communication protocol over the data stream, achieving higher data throughput while still being able to control the accelerator without any external signals. In this Thesis, the processor is the one fetching the required data from the memories into the accelerator, over the AXI bus. Higher data interchange rates could be achieved by featuring the accelerator with Direct Memory Access (DMA) functionality, allowing it to access such data directly from the memory. Despite making it possible to achieve higher data throughput at a well-known bottleneck, it also brings additional hardware, which might increase the overall power consumption of a system targeting low-power applications.

Another possible improvement for future work is to enable the control signals received from the accelerator to trigger intrinsic PULPino interrupts, meaning that the processor would not have to operate in polling mode. With interrupts, when the external signal is received, a flag is raised and the proper interrupt service routine is executed. As PULPino already provides such a feature for its peripherals, a similar implementation could be developed for the new attachable accelerators, leaving further room for improvement of the overall energy efficiency.


References

[1] S. Davis, K. Holland, J. Yang, M. A. Fury, and L. Shon-Roy. The era of IoT advancing CMP consumables growth. In International Conference on Planarization/CMP Technology (ICPT), 2015.

[2] DFT/FFT IP Core Generator, 2017. URL http://www.spiral.net/hardware/dftgen.html.

[3] H. Hsing. SHA3 Core Specification, 2013.

[4] M. Alioto. Ultra Low Power Design Approaches for IoT. In HOTCHIPS, 2015.

[5] M. Gautschi, P. D. Schiavone, A. Traber, I. Loi, A. Pullini, D. Rossi, E. Flamand, F. K. Gurkaynak, and L. Benini. A near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2016.

[6] D. Rossi, F. Conti, A. Marongiu, A. Pullini, I. Loi, M. Gautschi, G. Tagliavini, P. Flatresse, and L. Benini. PULP: A parallel ultra-low-power platform for next generation IoT applications. In HOTCHIPS, 2015.

[7] F. Conti, D. Rossi, A. Pullini, I. Loi, and L. Benini. Energy-efficient vision on the PULP platform for ultra-low power parallel computing. In Proceedings of the 2014 IEEE Workshop on Signal Processing Systems, Piscataway, NJ, 2014. IEEE.

[8] F. Conti, D. Rossi, A. Pullini, I. Loi, and L. Benini. PULP: A ultra-low power parallel accelerator for energy-efficient and flexible embedded vision. Journal of Signal Processing Systems, 84(3):339–354, 2016. ISSN 1939-8115. doi: 10.1007/s11265-015-1070-9. URL http://dx.doi.org/10.1007/s11265-015-1070-9.

[9] M. Rusci, D. Rossi, M. Lecca, M. Gottardi, L. Benini, and E. Farella. Energy-efficient design of an always-on smart visual trigger. In 2016 IEEE International Smart Cities Conference (ISC2), pages 1–6, Sept 2016. doi: 10.1109/ISC2.2016.7580824.

[10] M. Gautschi, M. Schaffner, F. K. Gurkaynak, and L. Benini. 4.6 A 65nm CMOS 6.4-to-29.2pJ/FLOP@0.8V shared logarithmic floating point unit for acceleration of nonlinear function kernels in a tightly coupled processor cluster. In 2016 IEEE International Solid-State Circuits Conference (ISSCC), pages 82–83, Jan 2016. doi: 10.1109/ISSCC.2016.7417917.

[11] Y. Popoff, F. Scheidegger, M. Schaffner, M. Gautschi, F. K. Gurkaynak, and L. Benini. High-efficiency logarithmic number unit design based on an improved cotransformation scheme. In 2016 Design, Automation Test in Europe Conference Exhibition (DATE), pages 1387–1392, March 2016.

[12] F. Conti and L. Benini. A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters. In 2015 Design, Automation Test in Europe Conference Exhibition (DATE), pages 683–688, March 2015. doi: 10.7873/DATE.2015.0404.

[13] A. Pullini, F. Conti, D. Rossi, I. Loi, M. Gautschi, and L. Benini. A heterogeneous multi-core system-on-chip for energy efficient brain inspired vision. In 2016 IEEE International Symposium on Circuits and Systems (ISCAS), pages 2910–2910, May 2016. doi: 10.1109/ISCAS.2016.7539213.

[14] F. Conti, D. Palossi, A. Marongiu, D. Rossi, and L. Benini. Enabling the heterogeneous accelerator model on ultra-low power microcontroller platforms. In 2016 Design, Automation Test in Europe Conference Exhibition (DATE), pages 1201–1206, March 2016.

[15] PULP - An Open Parallel Ultra-Low-Power Processing-Platform, 2016. URL http://iis-projects.ee.ethz.ch/index.php/PULP.

[16] D. Rossi, A. Pullini, I. Loi, M. Gautschi, F. K. Gurkaynak, A. Bartolini, P. Flatresse, and L. Benini. A 60 GOPS/W, -1.8 V to 0.9 V body bias ULP cluster in 28 nm UTBB FD-SOI technology. Solid-State Electronics, 117:170–184, 2015.

[17] D. Rossi, A. Pullini, I. Loi, M. Gautschi, F. K. Gurkaynak, J. Constantin, A. Bartolini, I. Miro-Panades, E. Beigne, F. Clermidy, F. Abouzeid, P. Flatresse, and L. Benini. 193 MOPS/mW @ 162 MOPS, 0.32V to 1.15V voltage range multi-core accelerator for energy efficient parallel and sequential digital processing. Cool Chips XIX, pages 1–3, 2016.

[18] A. Pullini, F. Conti, D. Rossi, I. Loi, M. Gautschi, and L. Benini. A heterogeneous multi-core system-on-chip for energy efficient brain inspired vision. ISCAS, pages 2–4, 2016.

[19] D. Rossi, I. Loi, F. Conti, G. Tagliavini, A. Pullini, and A. Marongiu. Energy efficient parallel computing on the PULP platform with support for OpenMP. In 2014 IEEE 28th Convention of Electrical Electronics Engineers in Israel (IEEEI), pages 1–5, Dec 2014. doi: 10.1109/EEEI.2014.7005803.

[20] F. Conti, R. Schilling, P. D. Schiavone, A. Pullini, D. Rossi, F. K. Gurkaynak, M. Muehlberghuber, M. Gautschi, I. Loi, G. Haugou, S. Mangard, and L. Benini. An IoT endpoint system-on-chip for secure and energy-efficient near-sensor analytics. IEEE Transactions on Circuits and Systems I: Regular Papers, PP(99):1–14, 2017. ISSN 1549-8328. doi: 10.1109/TCSI.2017.2698019.

[21] F. Conti, D. Palossi, R. Andri, M. Magno, and L. Benini. Accelerated visual context classification on a low-power smartwatch. IEEE Transactions on Human-Machine Systems, 47(1):19–30, Feb 2017. ISSN 2168-2291. doi: 10.1109/THMS.2016.2623482.

[22] PULPino: A small single-core RISC-V SoC, 2016. URL iis-projects.ee.ethz.ch/images/d/d0/Pulpino_poster_riscv2015.pdf.

[23] A. Traber and M. Gautschi. RI5CY: User Manual, 2016.

[24] G.-R. Uh, Y. Wang, D. Whalley, S. Jinturkar, C. Burns, and V. Cao. Techniques for Effectively Exploiting a Zero Overhead Loop Buffer, pages 157–172. Springer Berlin Heidelberg, Berlin, Heidelberg, 2000. ISBN 978-3-540-46423-5. doi: 10.1007/3-540-46423-9_11. URL http://dx.doi.org/10.1007/3-540-46423-9_11.

[25] R. B. Lee. Subword parallelism with MAX-2, volume 16, pages 51–59. IEEE Micro, 1996.

[26] Pulp-Platform Documentation, 2017. URL http://www.pulp-platform.org/documentation/.

[27] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. The Morgan Kaufmann Series in Computer Architecture and Design. Elsevier Science, San Francisco, CA, USA, 5th edition, 2011.

[28] B. Benton. CCIX, Gen-Z, OpenCAPI: Overview & comparison. In OPENFABRICS ALLIANCE, 2017.

[29] Gen-Z Consortium. Gen-Z Overview, 2016.

[30] Y. Shao and D. Brooks. Research Infrastructures for Hardware Accelerators. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2015. ISBN 9781627058322. URL https://books.google.pt/books?id=uzEECwAAQBAJ.

[31] Research Infrastructures for Accelerator Centric Architectures, 2017. URL http://accelerator.eecs.harvard.edu/isca14tutorial/isca2014-tutorial-all.pdf.

[32] Xilinx. Vivado Design Suite User Guide - High-Level Synthesis, 2017.

[33] H. K. Rawat and P. Schaumont. SIMD instruction set extensions for Keccak with applications to SHA-3, Keyak and Ketje. In Proceedings of the Hardware and Architectural Support for Security and Privacy 2016, HASP 2016, pages 4:1–4:8, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4769-3. doi: 10.1145/2948618.2948622. URL http://doi.acm.org/10.1145/2948618.2948622.

[34] H. Rawat and P. Schaumont. Vector instruction set extensions for efficient computation of Keccak. IEEE Transactions on Computers, PP(99):1–1, 2017. ISSN 0018-9340. doi: 10.1109/TC.2017.2700795.

[35] C. Schmidt and A. Izraelevitz. A fast parameterized SHA3 accelerator. Technical Report UCB/EECS-2015-204, EECS Department, University of California, Berkeley, Oct 2015. URL http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-204.html.

[36] C. Liu, R. Duarte, O. Granados, J. Tang, and J. Andrian. Critical path based hardware acceleration for cryptosystems, 2012.

[37] P. Gaydecki and Institution of Electrical Engineers. Foundations of Digital Signal Processing: Theory, Algorithms and Hardware Design. IEE Circuits and Systems Series. Institution of Engineering and Technology, 2004. ISBN 9780852964316. URL https://books.google.pt/books?id=6Qo7NvX3vz4C.

[38] I. Kramberger. DSP acceleration using a reconfigurable FPGA. In Industrial Electronics, 1999. ISIE '99. Proceedings of the IEEE International Symposium on, volume 3, pages 1522–1525, 1999. doi: 10.1109/ISIE.1999.797022.

[39] Xilinx. Fast Fourier Transform v9.0 - LogiCORE IP Product Guide, 2015.

[40] Intel. FFT IP Core - User Guide, 2017.

[41] Xilinx. FIR Compiler v7.2 - LogiCORE IP Product Guide, 2015.

[42] Altera. FIR Compiler - User Guide, 2011.

[43] P. A. Milder, F. Franchetti, J. C. Hoe, and M. Puschel. Computer generation of hardware for linear digital signal processing transforms. ACM Transactions on Design Automation of Electronic Systems, 17(2), 2012.

[44] Xilinx. AXI Reference Guide, 2011.

[45] A. Traber and M. Gautschi. PULPino: Datasheet, 2016.

[46] I. Loi. AXI 4 NODE Application Note, 2014.

[47] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche. The Keccak reference, version 3, 2011.

[48] C. Van Loan. Computational Frameworks for the Fast Fourier Transform. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992. ISBN 0-89871-285-8.

[49] P. A. Milder, F. Franchetti, J. C. Hoe, and M. Puschel. Hardware implementation of the discrete Fourier transform with non-power-of-two problem size. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2010.

[50] M. Puschel, P. A. Milder, and J. C. Hoe. Permuting streaming data using RAMs. Journal of the ACM, 56(2):10:1–10:34, 2009.

[51] PULPino's GitHub online repository, 2016. URL https://github.com/pulp-platform/pulpino.

[52] Xilinx. UG585 - Zynq-7000 AP SoC Technical Reference Manual, 2016.

[53] Xilinx's Tutorial - Prepare Boot Medium, 2016. URL http://www.wiki.xilinx.com/Prepare+Boot+Medium.

[54] Zynq-7000 All Programmable SoC Software Developers Guide, 2015. URL https://www.xilinx.com/support/documentation/user_guides/ug821-zynq-7000-swdev.pdf.

[55] A baseline Keccak implementation, 2011. URL https://github.com/coruus/saarinen-keccak/tree/master/readable_keccak.

[56] A Simple and Efficient FFT Implementation in C++, 2017. URL http://www.drdobbs.com/cpp/a-simple-and-efficient-fft-implementatio/199500857?pgno=1.

[57] A. Waterman. Improving energy efficiency and reducing code size with RISC-V compressed. Master's thesis, EECS Department, University of California, Berkeley, May 2011. URL http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-63.html.

[58] R. P. Duarte and C.-S. Bouganis. ARC 2014: Over-clocking KLT designs on FPGAs under process, voltage, and temperature variation. ACM Trans. Reconfigurable Technol. Syst., 9(1):7:1–7:17, Nov. 2015. ISSN 1936-7406. doi: 10.1145/2818380. URL http://doi.acm.org/10.1145/2818380.


Appendix A

Software-only Algorithms

A.1 SHA-3
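The listing below reproduces the software-only SHA-3 (Keccak, 3rd round) implementation used in this work, based on the readable baseline implementation by M.-J. O. Saarinen [55]. For the 64-byte digest used in the tests, the sponge rate is rsiz = 200 - 2 * 64 = 72 bytes, so the input is absorbed in 72-byte blocks, each followed by the 24-round Keccak-f[1600] permutation.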


// keccak.c
// 19-Nov-11  Markku-Juhani O. Saarinen <[email protected]>
// A baseline Keccak (3rd round) implementation.

#include "common.h"

#define KECCAK_ROUNDS 24

#define ROTL64(x, y) (((x) << (y)) | ((x) >> (64 - (y))))

#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define __bswap_64(x) \
    ( (((x) & 0xff00000000000000ull) >> 56) \
    | (((x) & 0x00ff000000000000ull) >> 40) \
    | (((x) & 0x0000ff0000000000ull) >> 24) \
    | (((x) & 0x000000ff00000000ull) >> 8)  \
    | (((x) & 0x00000000ff000000ull) << 8)  \
    | (((x) & 0x0000000000ff0000ull) << 24) \
    | (((x) & 0x000000000000ff00ull) << 40) \
    | (((x) & 0x00000000000000ffull) << 56))
#elif __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define __bswap_64(x) (x)
#else
#error Unsupported endianness
#endif

const uint64_t keccakf_rndc[24] =
{
    0x0000000000000001, 0x0000000000008082, 0x800000000000808a,
    0x8000000080008000, 0x000000000000808b, 0x0000000080000001,
    0x8000000080008081, 0x8000000000008009, 0x000000000000008a,
    0x0000000000000088, 0x0000000080008009, 0x000000008000000a,
    0x000000008000808b, 0x800000000000008b, 0x8000000000008089,
    0x8000000000008003, 0x8000000000008002, 0x8000000000000080,
    0x000000000000800a, 0x800000008000000a, 0x8000000080008081,
    0x8000000000008080, 0x0000000080000001, 0x8000000080008008
};

const int keccakf_rotc[24] =
{
    1,  3,  6,  10, 15, 21, 28, 36, 45, 55, 2,  14,
    27, 41, 56, 8,  25, 43, 62, 18, 39, 61, 20, 44
};

const int keccakf_piln[24] =
{
    10, 7,  11, 17, 18, 3,  5,  16, 8,  21, 24, 4,
    15, 23, 19, 13, 12, 2,  20, 14, 22, 9,  6,  1
};

// a mod 5 without a divide: since 8q + r = 3q + r (mod 5), repeatedly
// folding the base-8 digits preserves the value modulo 5
static inline int mod5(int a) {
    while (a > 9) {
        int s = 0; /* accumulator for the sum of the digits */
        while (a != 0) {
            s = s + (a & 7);
            a = (a >> 3) * 3;
        }
        a = s;
    }
    /* note, at this point: a < 10 */
    if (a > 4) a = a - 5;
    return a;
}

// update the state with given number of rounds

void keccakf(uint64_t st[25], int rounds)
{
    int i, j, round;
    uint64_t t, bc[5];

    for (round = 0; round < rounds; round++) {

        // Theta
        for (i = 0; i < 5; i++)
            bc[i] = st[i] ^ st[i + 5] ^ st[i + 10] ^ st[i + 15] ^ st[i + 20];

        for (i = 0; i < 5; i++) {
            t = bc[mod5(i + 4)] ^ ROTL64(bc[mod5(i + 1)], 1);
            for (j = 0; j < 25; j += 5)
                st[j + i] ^= t;
        }

        // Rho Pi
        t = st[1];
        for (i = 0; i < 24; i++) {
            j = keccakf_piln[i];
            bc[0] = st[j];
            st[j] = ROTL64(t, keccakf_rotc[i]);
            t = bc[0];
        }

        // Chi
        for (j = 0; j < 25; j += 5) {
            for (i = 0; i < 5; i++)
                bc[i] = st[j + i];
            for (i = 0; i < 5; i++)
                st[j + i] ^= (~bc[mod5(i + 1)]) & bc[mod5(i + 2)];
        }

        // Iota
        st[0] ^= keccakf_rndc[round];
    }
}

// compute a keccak hash (md) of given byte length from "in"

int do_keccak(const uint8_t *in, int inlen, uint8_t *md, int mdlen)
{
    uint64_t st[25];
    uint8_t temp[144];
    int i, rsiz, rsizw;

    rsiz = 200 - 2 * mdlen;
    rsizw = rsiz / 8;

    memset(st, 0, sizeof(st));

    // absorb full rate-sized blocks
    for ( ; inlen >= rsiz; inlen -= rsiz, in += rsiz) {
        for (i = 0; i < rsizw; i++)
            st[i] ^= __bswap_64(((uint64_t *) in)[i]);
        keccakf(st, KECCAK_ROUNDS);
    }

    // last block and padding
    memcpy(temp, in, inlen);
    temp[inlen++] = 1;
    memset(temp + inlen, 0, rsiz - inlen);
    temp[rsiz - 1] |= 0x80;

    for (i = 0; i < rsizw; i++)
        st[i] ^= __bswap_64(((uint64_t *) temp)[i]);

    keccakf(st, KECCAK_ROUNDS);

#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__

    for (i = 0; i < mdlen / 8; i++)
        ((uint64_t *) md)[i] = __bswap_64(((uint64_t *) st)[i]);

    int remaining = mdlen % 8;
    for (i = 0; i < remaining; i++)
        ((uint8_t *) md)[mdlen - remaining + i] = ((uint8_t *) st)[mdlen + remaining - i - 1];
#else
    memcpy(md, st, mdlen);
#endif

    return 0;
}
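The benchmark harness below drives do_keccak() through the common test interface (test_setup(), test_clear(), test_run() and test_check()), hashing a fixed test string and comparing the resulting 64-byte digest against the precomputed reference digest.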

Page 84: Parallel Ultra Low Power Embedded System · de encorajar o desenvolvimento de novos aceleradores pela comunidade open-source. Para testar a viabilidade desta abordagem, dois tipos

#include "common.h"

typedef struct {
    int mdlen;
    char *msgstr;
    uint8_t md[64];
} test_triplet_t;

static const test_triplet_t testvec = {
    64, "The quick brown fox jumps over the lazy dog ", {
        0x07, 0xb8, 0x47, 0x18, 0xDC, 0xBA, 0x3C, 0x74,
        0x61, 0x9B, 0xA1, 0xFA, 0x7F, 0x57, 0xDF, 0xE7,
        0x76, 0x9D, 0x3F, 0x66, 0x98, 0xA8, 0xB3, 0x3F,
        0xA1, 0x01, 0x83, 0x89, 0x70, 0xA1, 0x31, 0xE6,
        0x21, 0xCC, 0xFD, 0x05, 0xFE, 0xFF, 0xBC, 0x11,
        0x80, 0xF2, 0x63, 0xC2, 0x7F, 0x1A, 0xDA, 0xB4,
        0x60, 0x95, 0xD6, 0xF1, 0x25, 0x33, 0x14, 0x72,
        0x4B, 0x5C, 0xBF, 0x78, 0x28, 0x65, 0x8E, 0x6A }
};

uint8_t md3[64] __sram;

uint8_t *md __sram = md3;

extern int do_keccak(const uint8_t *in, int, uint8_t *out, int);

void keccak_test() {
    do_keccak((uint8_t *) testvec.msgstr, strlen(testvec.msgstr), md, testvec.mdlen);
}

void test_setup() {
}

void test_clear() {
    memset(md, 0, testvec.mdlen);
}

void test_run() {
    keccak_test();
}

int test_check() {
    if (0 != memcmp(md, testvec.md, testvec.mdlen))
        return 0;
    return 1;
}


A.2 FFT
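The listing below reproduces the software-only FFT used in this work: an in-place, fixed-point, radix-2 decimation-in-frequency transform that operates on interleaved real/imaginary 32-bit samples and keeps its Q15 twiddle-factor tables in SRAM, together with its benchmark harness (a 256-point test input and the corresponding reference output).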


#include "common.h"

// Q15 twiddle-factor tables held in SRAM:
// wprBase[k] ~ 32768*cos(pi*k/128) and wpiBase[k] ~ 32768*sin(pi*k/128),
// clipped to the int16 maximum of 32767
static int wprBase[] __sram = {
    32767, 32758, 32729, 32679, 32610, 32522, 32413, 32286,
    32138, 31972, 31786, 31581, 31357, 31114, 30853, 30572,
    30274, 29957, 29622, 29269, 28899, 28511, 28106, 27684,
    27246, 26791, 26320, 25833, 25330, 24812, 24279, 23732,
    23170, 22595, 22006, 21403, 20788, 20160, 19520, 18868,
    18205, 17531, 16846, 16151, 15447, 14733, 14010, 13279,
    12540, 11793, 11039, 10279, 9512, 8740, 7962, 7180,
    6393, 5602, 4808, 4011, 3212, 2411, 1608, 804,
    0, -804, -1608, -2411, -3212, -4011, -4808, -5602,
    -6393, -7180, -7962, -8740, -9512, -10279, -11039, -11793,
    -12540, -13279, -14010, -14733, -15447, -16151, -16846, -17531,
    -18205, -18868, -19520, -20160, -20788, -21403, -22006, -22595,
    -23170, -23732, -24279, -24812, -25330, -25833, -26320, -26791,
    -27246, -27684, -28106, -28511, -28899, -29269, -29622, -29957,
    -30274, -30572, -30853, -31114, -31357, -31581, -31786, -31972,
    -32138, -32286, -32413, -32522, -32610, -32679, -32729, -32758,
};

static int wpiBase[] __sram = {
    0, 804, 1608, 2411, 3212, 4011, 4808, 5602,
    6393, 7180, 7962, 8740, 9512, 10279, 11039, 11793,
    12540, 13279, 14010, 14733, 15447, 16151, 16846, 17531,
    18205, 18868, 19520, 20160, 20788, 21403, 22006, 22595,
    23170, 23732, 24279, 24812, 25330, 25833, 26320, 26791,
    27246, 27684, 28106, 28511, 28899, 29269, 29622, 29957,
    30274, 30572, 30853, 31114, 31357, 31581, 31786, 31972,
    32138, 32286, 32413, 32522, 32610, 32679, 32729, 32758,
    32767, 32758, 32729, 32679, 32610, 32522, 32413, 32286,
    32138, 31972, 31786, 31581, 31357, 31114, 30853, 30572,
    30274, 29957, 29622, 29269, 28899, 28511, 28106, 27684,
    27246, 26791, 26320, 25833, 25330, 24812, 24279, 23732,
    23170, 22595, 22006, 21403, 20788, 20160, 19520, 18868,
    18205, 17531, 16846, 16151, 15447, 14733, 14010, 13279,
    12540, 11793, 11039, 10279, 9512, 8740, 7962, 7180,
    6393, 5602, 4808, 4011, 3212, 2411, 1608, 804,
};

void fft(int *data, int len) {

    int max = len;
    len <<= 1;          // len is now the number of int slots (2 per complex point)
    int wstep = 1;

    // decimation-in-frequency passes with Q15 twiddle multiplies
    while (max > 2) {
        int *wpr = wprBase;
        int *wpi = wpiBase;

        for (int m = 0; m < max; m += 2) {
            int wr = *wpr;
            int wi = *wpi;
            wpr += wstep;
            wpi += wstep;

            int step = max << 1;

            for (int i = m; i < len; i += step) {
                int j = i + max;

                int tr = data[i] - data[j];
                int ti = data[i+1] - data[j+1];

                data[i] += data[j];
                data[i+1] += data[j+1];

                int xr = ((wr * tr + wi * ti) << 1) + 0x8000;
                int xi = ((wr * ti - wi * tr) << 1) + 0x8000;

                data[j] = xr >> 16;
                data[j+1] = xi >> 16;
            }
        }
        max >>= 1;
        wstep <<= 1;
    }

    // final radix-2 pass: the twiddle factor is 1, so no multiply is needed
    {
        int step = max << 1;

        for (int i = 0; i < len; i += step) {
            int j = i + max;

            int tr = data[i] - data[j];
            int ti = data[i+1] - data[j+1];

            data[i] += data[j];
            data[i+1] += data[j+1];

            data[j] = tr;
            data[j+1] = ti;
        }
    }

#define SWAP(a, b) tmp=(a); (a)=(b); (b)=tmp

    // bit-reversal reordering; data is decremented so the indices run from 1
    data--;
    int j = 1;
    for (int i = 1; i < len; i += 2) {
        if (j > i) {
            int tmp;
            SWAP(data[j], data[i]);
            SWAP(data[j+1], data[i+1]);
        }
        int m = len >> 1;
        while (m >= 2 && j > m) {
            j -= m;
            m >>= 1;
        }
        j += m;
    }
}
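A note on the fixed-point arithmetic above: wr and wi are Q15 twiddle factors, so each butterfly output is rescaled as (((wr * tr + wi * ti) << 1) + 0x8000) >> 16, which is equivalent to rounding (wr * tr + wi * ti) / 2^15 to the nearest integer. The final radix-2 pass requires no multiplication because its twiddle factor is 1, and the closing loop restores natural order by swapping bit-reversed index pairs.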


#include "common.h"

#define NINPUTS 256

// interleaved (real, imag) working buffer; fft() operates on int samples
int buf[2*NINPUTS] __sram;

int dataR1[NINPUTS] = {
    /* inputs for test 1 */
    2, -4, -3, -8, -10, -11, -23, 11, 32, 10, 11, 8, 3, 3, -7, -5, 1, -4, -4, -4,
    -9, -5, -4, -8, -5, -2, 0, 0, -6, -7, -2, 3, 3, 8, 15, 10, 6, 6, 1, 4, -1,
    -10, -4, -2, -9, -5, -7, -8, -2, -5, -6, -2, -3, 1, -3, -8, -6, 0, 5, 4, 15,
    17, 6, 5, 2, 0, 2, -3, -5, 0, -5, -5, -4, -9, -6, -2, -4, -4, -3, -1, 1, -5,
    -7, -4, 3, 5, 6, 20, 16, 8, 7, 3, 7, 4, -5, -4, -3, -8, -6, -7, -7, -1, -2,
    -2, -2, -3, 1, 1, -5, -4, 2, 6, 7, 13, 17, 8, 7, 6, 2, 7, 4, 0, -3, -6, -2,
    -3, -7, -7, -4, -5, -4, -2, 1, 4, -2, -4, -1, 3, 5, 5, 18, 19, 9, 7, 2, 4, 2,
    -6, -5, 0, -1, -2, -5, -8, -2, -4, -7, -5, -4, 1, 0, -5, -4, 1, 3, 3, 7, 15,
    11, 6, 5, 2, 6, 3, -5, -4, -4, -7, -6, -9, -8, -3, -4, -5, -5, -4, 1, -3, -7,
    -5, 0, 4, 3, 12, 15, 7, 5, 4, 1, 1, -5, -7, -1, -2, -5, -4, -8, -7, -3, -6,
    -6, -5, -3, 0, -5, -6, -3, 1, 2, 3, 13, 14, 9, 6, 3, 4, 3, -4, -6, -3, -5,
    -5, -6, -9, -5, -3, -5, -5, -3, 0, 0, -5, -3, 1, 3, 3, 9, 16, 10, 6, 6, 6, 8,
    2, -2, -2,
};

int dataI1[NINPUTS] = {
    /* inputs for test 1 */
    1, -1, -1, -2, -2, -2, -3, -1, 0, 0, 1, 2, 2, 3, 3, 2, 2, 2, 2, 1, 0, 0, -1,
    -1, -2, -2, -2, -3, -3, -3, -3, -2, -2, -1, 0, 1, 1, 2, 3, 3, 3, 3, 2, 2, 1,
    1, 0, 0, -1, -1, -2, -2, -2, -2, -3, -3, -3, -2, -2, -1, 0, 1, 1, 2, 3, 3, 4,
    3, 3, 3, 2, 2, 1, 1, 0, 0, -1, -1, -2, -2, -3, -3, -3, -3, -2, -2, -1, 0, 1,
    2, 3, 3, 4, 4, 4, 3, 3, 2, 2, 1, 0, 0, -1, -1, -2, -2, -2, -3, -3, -3, -3,
    -2, -2, -1, 0, 1, 2, 2, 3, 3, 4, 3, 3, 3, 2, 2, 1, 0, -1, -1, -2, -2, -3, -3,
    -4, -3, -4, -3, -2, -2, 0, 1, 1, 2, 2, 3, 3, 3, 2, 2, 2, 2, 1, 0, 0, -1, -2,
    -2, -3, -3, -3, -4, -4, -3, -3, -2, -1, 0, 0, 1, 2, 2, 3, 3, 3, 3, 2, 2, 1,
    1, 0, 0, -1, -1, -2, -2, -2, -3, -3, -3, -3, -2, -2, -1, 0, 1, 2, 2, 3, 3, 3,
    2, 2, 2, 2, 1, 0, 0, 0, -1, -1, -2, -2, -3, -3, -3, -3, -3, -2, -2, 0, 1, 1,
    2, 3, 3, 4, 4, 3, 3, 3, 2, 1, 1, 0, 0, -1, -2, -2, -2, -3, -3, -3, -3, -2,
    -2, -1, 0, 1, 2, 2, 3, 4, 4, 4, 3,
};

int ref[2*NINPUTS] = {
    /* outputs for test 1 */
    6, 13, -47, -11, 91, 38, 44, 30, 48, 21, 48, 13,
    71, 26, 92, 41, 162, 76, 598, 139, -1002, -60, -284, 59,
    -210, 40, -154, 65, -100, 67, -23, 35, -98, 92, -18, 96,
    -125, 53, -414, 121, 135, 80, 78, 37, 62, 57, 29, 19,
    40, -16, 84, 115, 23, -6, 52, 8, -3, -52, 283, 126,
    123, -38, 50, -63, 28, -16, 31, -78, 31, -36, -2, -62,
    -12, -54, -33, -43, -1, 60, -12, -141, -44, -49, -21, -68,
    -62, -59, -32, -61, -39, -23, -48, -13, -70, -14, -28, -7,
    5, -98, -28, 10, -50, -10, -32, 2, -42, 11, -4, 36,
    -65, 37, -9, 19, -52, 28, -94, 3, 228, 74, 53, 73,
    69, 22, 52, 52, 56, 8, 21, 80, 55, 6, 41, 5,
    54, 21, 95, 83, 8, -75, -6, -27, 23, -32, 14, -27,
    20, -34, -2, -57, -2, -28, -7, -32, -11, -21, 19, -70,
    -20, -7, -16, -32, -25, -15, -27, -17, -21, -13, -25, 0,
    -10, -7, 17, -20, 48, -41, -154, 48, -59, 75, -44, 45,
    -19, 42, 1, 39, -18, 33, -2, 43, -1, 36, 30, 33,
    48, 150, 46, -33, 10, -11, 17, 5, 23, 9, 36, 2,
    29, 2, 22, -9, -1, -16, 11, -8, 47, -38, -1, -14,
    2, -20, 7, 4, 9, -25, -2, 7, 5, -30, -5, -1,
    8, 0, 14, -18, -7, 0, -6, 2, 10, -10, -4, 8,
    7, -3, 8, -3, 9, 7, -8, 10, 2, -3, 12, 8,
    19, -7, -1, -4, -2, -9, -3, 9, -3, 6, 18, -2,
    10, -1, 2, -1, -1, -6, 0, 5, -4, 10, 4, 1,
    0, -10, -6, 7, -4, 4, 8, 21, 8, -9, 3, 19,
    4, 32, 14, -6, -1, 29, 0, 13, 11, 22, 16, 9,
    56, 55, 5, -11, 2, 28, 22, 9, 25, 8, 22, 12,
    17, -2, 13, -6, 7, 10, 40, 31, 72, -156, 39, -36,
    5, -32, 12, -46, -16, -44, 17, -55, -21, -48, -22, -35,
    -50, -65, -141, -52, 40, 37, 12, 17, -21, -15, -16, -15,
    -28, 6, -10, 3, -21, 24, -14, 18, -20, 4, 8, 62,
    -12, 11, -6, 38, -9, 31, 1, 75, 24, 38, 12, 37,
    26, 38, -11, 31, 5, 80, 98, -86, 64, -30, 31, -18,
    61, -10, 21, -63, 50, -9, 55, -55, 66, -13, 48, -53,
    219, -56, -86, -15, -58, -36, -12, -19, -56, -42, 5, -34,
    -27, -27, -17, 4, -48, -3, -20, -6, -1, 83, -29, 3,
    -65, 7, -45, 10, -44, 29, -23, 62, -58, 49, -28, 52,
    -39, 49, -14, 147, 27, -68, -7, 45, -8, 41, 10, 65,
    31, 40, 32, 63, 36, 6, 56, 71, 122, 39, 289, -121,
    1, 45, 56, 4, 33, 6, 96, -99, 47, 10, 43, -16,
    67, -40, 81, -38, 126, -58, -303, -140, -63, -55, 8, -75,
    -45, -88, 3, -31, -38, -54, -66, -54, -79, -34, -112, -52,
    -396, -18, 216, -26, 69, -15, 42, -17, 25, -15, 34, -26,
    40, 1, 38, -10, 141, -27, -83, 30,
};

extern void fft(int *, int);

void test_clear() {
    for (int i = 0; i < NINPUTS; ++i) {
        buf[2 * i] = dataR1[i];
        buf[2 * i + 1] = dataI1[i];
    }
}

void test_run(int n) {
    fft(buf, NINPUTS);
}

int test_check() {
    for (int i = 0; i != 2 * NINPUTS; ++i)
        if (buf[i] != ref[i])
            return 0;

    return 1;
}
