GipherFS
A GPU Accelerated Ciphered File System
José Duarte Gomes Pires Lourenço
Thesis to obtain the Master of Science Degree in
Information Systems and Computer Engineering
Supervisors: Prof. Ricardo Jorge Fernandes Chaves, Prof. Aleksandar Ilic
Examination Committee
Chairperson: Prof. Alberto Manuel Rodrigues da Silva
Supervisor: Prof. Ricardo Jorge Fernandes Chaves
Member of the Committee: Prof. Miguel Ângelo Marques de Matos
June 2017
Acknowledgments
I would like to thank my supervisors, Professor Ricardo Chaves and Professor Aleksandar Ilic,
for their support and for the opportunity to finish this dissertation. Their knowledge and experience
provided me with the right guidance to complete this work.
I would also like to thank my co-workers, especially Ricardo Ramalho, for their support, availability
and precious advice throughout the course of this work.
Finally, I would like to thank my family and especially my girlfriend Sofia for her patience, support,
understanding and motivation. My most sincere thanks to all of them.
Abstract
Information security is a major concern in computer systems and its importance has only increased
in recent years due to the pervasive use of information technology in common activities. The unauthorized release of confidential data is one of the main threats to information security and is commonly
addressed by cryptographic mechanisms, through encryption. However, encryption algorithms are
typically computationally demanding and can place a heavy burden on a machine’s resources, espe-
cially those without accelerators, thus limiting and often degrading the overall performance in systems
with several users running concurrent applications.
On the other side, the Graphics Processing Unit (GPU) has evolved from a specialized computer
graphics accelerator to a general-purpose co-processor. The vast amount of computing power pro-
vided by the GPU, combined with its wide availability in commodity computer systems, presents a
great opportunity for delivering computationally expensive features without the penalty of degraded
system performance.
The work presented in this dissertation focuses on improving file system security by transparently integrating efficient encryption mechanisms into its operation. It proposes to offload the AES
encryption algorithm, used for protecting file data, to the GPU, using CUDA. For this purpose, a pro-
tection layer was added to the ext4 file system, using a Virtual File System based on FUSE with
built-in confidentiality mechanisms, using the GPU as a cryptographic co-processor.
The experimental results obtained show that relocating this computation to the GPU minimizes the CPU impact of introducing cryptographic mechanisms and improves throughput by a factor of 3.5.
Keywords
Confidentiality, File System, Cryptography, AES, CUDA, FUSE
Resumo
A segurança da informação é uma das principais preocupações em sistemas computacionais e a
sua importância tem vindo a aumentar nos últimos anos devido ao uso generalizado de tecnologias
da informação em atividades comuns. O acesso não autorizado a informação confidencial é uma
das principais ameaças à segurança da informação, e é normalmente evitado através de mecanis-
mos criptográficos. No entanto, estes mecanismos são tipicamente exigentes computacionalmente e
consumem muitos recursos, especialmente em sistemas sem aceleradores, limitando e degradando
o desempenho geral de sistemas utilizados por vários utilizadores simultaneamente.
Por outro lado, o GPU evoluiu de um acelerador especializado em computação gráfica para um
coprocessador de uso genérico. O vasto poder computacional disponibilizado pelo GPU, assim como
a sua ampla disponibilidade apresenta-se como uma óptima oportunidade para a implementação de
funcionalidades computacionalmente exigentes sem afectar o desempenho do sistema.
O trabalho apresentado nesta dissertação foca-se na melhoria da segurança do sistema de fichei-
ros através da integração de mecanismos criptográficos no seu funcionamento, de forma eficiente e
transparente. Propõe-se a realocação do algoritmo AES, usado para proteger a informação dos fi-
cheiros, para o GPU, utilizando CUDA. Para tal foi adicionada uma camada de proteção ao sistema
de ficheiros ext4, recorrendo a um sistema de ficheiros virtual, baseado em FUSE, que incorpora
mecanismos de confidencialidade, utilizando o GPU como coprocessador criptográfico.
Os resultados experimentais mostram que a realocação da computação para o GPU minimiza o
impacto da introdução de mecanismos criptográficos no CPU e melhora o seu throughput por um
factor de 3,5.
Palavras Chave
Confidencialidade, Sistema de Ficheiros, Criptografia, AES, CUDA, FUSE
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background and State of the Art 7
2.1 Cryptography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 Symmetric-key cryptography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Asymmetric-key cryptography . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Graphics Processing Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Evolution of the GPU architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.2 CUDA Programming Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 User Space File Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 State of the art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Proposed Solution 31
3.1 Automatic implicit encryption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Encryption mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.1 Block based implementation of read and write operations . . . . . . . . . . . . . 35
3.2.2 Padding: PKCS#7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4 Solution Implementation 39
4.1 plainfs: a non-ciphered file system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.1.1 Intercepting I/O operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.1.2 Performing I/O operations on underlying file system . . . . . . . . . . . . . . . . 42
4.2 cipherfs: a non-accelerated ciphered file system . . . . . . . . . . . . . . . . . . . . . . 45
4.2.1 Block based I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.2 AES functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3 gipherfs: a GPU-accelerated ciphered file system . . . . . . . . . . . . . . . . . . . . . . 49
4.3.1 GPU implementation of AES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.2 Maximizing payload length: FUSE big-writes mode . . . . . . . . . . . . . . . . . 51
4.3.3 Improving device memory management . . . . . . . . . . . . . . . . . . . . . . . 52
5 Solution Evaluation 55
5.1 Direct integration of GPU AES acceleration . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2 Enabling big-writes in the proposed solution . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 Improving device memory management . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6 Conclusions and Future Work 65
Bibliography 67
List of Figures
2.1 Private-key cipher [17] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 AES round operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 ECB encryption [26] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 ECB decryption [26] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5 CBC encryption [26] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 CBC decryption [26] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.7 CTR encryption [26] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.8 CTR decryption [26] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.9 GPU vs CPU performance evolution [30] . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.10 G80 architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.11 Fermi Architecture [38] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.12 Full Maxwell chip-block [40] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.13 CUDA automatic load balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.14 CUDA two-dimensional block grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.15 CUDA memory hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.16 FUSE architecture [41] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1 Virtual File System: read and write operations . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Straightforward read operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Block based read operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Padding of a partial block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Padding of a complete block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1 Virtual File System: read and write operations . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 User level file system I/O operations implementation . . . . . . . . . . . . . . . . . . . . 43
4.3 I/O operations redirection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4 Encryption transformations: adding padding and ciphering . . . . . . . . . . . . . . . . . 46
4.5 Decryption transformations: deciphering and removing padding . . . . . . . . . . . . . . 46
4.6 Handling a 4096 bytes payload, made of 256 AES blocks . . . . . . . . . . . . . . . . . 48
5.1 Naive gipherfs: read bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2 Naive gipherfs: write bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 AES encryption benchmark: average block processing time . . . . . . . . . . . . . . . . 59
5.4 AES encryption benchmark: average total processing time . . . . . . . . . . . . . . . . 60
5.5 AES encryption benchmark: average bandwidth . . . . . . . . . . . . . . . . . . . . . . 60
5.6 Increasing gipherfs write payload: write bandwidth . . . . . . . . . . . . . . . . . . . . . 61
5.7 GPU memory operations benchmark: relative cost . . . . . . . . . . . . . . . . . . . . . 62
5.8 Improved device memory management in gipherfs: read bandwidth . . . . . . . . . . . 63
5.9 Improved device memory management in gipherfs: write bandwidth . . . . . . . . . . . 63
5.10 gipherfs optimizations: increased read bandwidth . . . . . . . . . . . . . . . . . . . . . . 64
5.11 gipherfs optimizations: increased write bandwidth . . . . . . . . . . . . . . . . . . . . . 64
List of Tables
2.1 Maxwell’s key properties [40] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1 List of FUSE operations and ext4 counterparts . . . . . . . . . . . . . . . . . . . . . . . 45
Abbreviations
3DES Triple Data Encryption Standard
AES Advanced Encryption Standard
AES-NI AES New Instructions
API Application Programming Interface
CBC Cipher Block Chaining
CPU Central Processing Unit
CTR Counter
CUDA Compute Unified Device Architecture
DES Data Encryption Standard
DRAM Dynamic Random Access Memory
ECB Electronic Codebook
ECC Error-Correcting Code
ext4 Fourth Extended Filesystem
FIPS Federal Information Processing Standards
FUSE Filesystem in Userspace
GBPS Gigabytes per second
GPC Graphical Processing Cluster
GPU Graphics Processing Unit
GPGPU General Purpose GPU
NIST National Institute of Standards and Technology
OS Operating System
PGA Professional Graphics Controller
PC Personal Computer
PKCS Public Key Cryptography Standards
RSA Rivest Shamir Adleman
SIMT Single-Instruction Multiple-Thread
SM Streaming Multiprocessor
SMM Maxwell Streaming Multiprocessor
SP Streaming Processor
SPA Scalable Processor Array
VFS Virtual File System
XTS XEX-based tweaked-codebook mode with ciphertext stealing
1 Introduction
Contents
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1 Motivation
Security awareness has been growing among computer users and one of the major concerns is
the confidentiality of the users’ data, which is not trivial to achieve, especially in modern multi-user
computer systems. Several mechanisms [1] [2] [3] have been proposed in order to ensure confiden-
tiality but the most typical approach, and one of the most practical ones, is to encrypt data [4] [5],
particularly that which is stored on the hard drive. The basic principle of encryption is to transform
data into an unintelligible form to an adversary. This transformation, especially when dealing with large amounts of data, is done by symmetric encryption algorithms, such as the Advanced Encryption Standard (AES) [7]. These algorithms are based on a secret value called the secret key, which
must be known in order to encrypt and decrypt the data. While being a relatively efficient encryption
approach, when large amounts of data need to be accessed and protected, the cost of this process
can be significant, thus additionally overloading the Central Processing Unit (CPU) resources. Even
with multi-core CPUs and the introduction of specialized dedicated instructions at the hardware level,
the impact of cryptographic primitives on a system’s CPU load can still be quite noticeable. This is
especially the case when a system's CPU load is already very high and there are few computational
resources left available. In these situations, security mechanisms, such as data confidentiality pro-
vided by the AES encryption algorithm, tend to be overlooked and considered expendable in favour
of the system’s functional requirements.
On the other hand, Graphics Processing Units (GPUs) have become pervasive in most compu-
tational systems from servers and desktops, up to mobile phones [8], and even some embedded
devices [9]. In recent years the GPU computational power has increased significantly, mainly due to
the high requirements of graphical computer applications. However, this computing power is currently
not restricted to graphical computation. Thanks to the introduction of general purpose programming
environments [10][11], other types of applications can now leverage the GPU compute resources.
This has made it a very popular parallel computing platform, especially in the scientific community,
where it enabled tackling problems previously considered intractable. Motivated by this fact, the work
proposed in this dissertation particularly focuses on exploiting the computational power of the GPU to
enhance the file system with cryptographic protection mechanisms, with the goal of not only providing
faster encryption, but also offloading the CPU resources.
In brief, this dissertation proposes a solution to the problem of building a secure system that pro-
vides confidentiality of user file data, by using the AES algorithm, the standard for data encryption,
while minimizing the penalty induced to the overall system performance, caused by the additional load
on a machine’s CPU. The approach proposed in this dissertation focuses on offloading the compu-
tationally heavy operations required by AES from the CPU to the seldom used and computationally
powerful GPU. This relocation drastically reduces the impact of the AES cryptographic primitives on
the CPU, apart from the required execution control of GPU kernel launches and data transfers. At
the same time, it aims at making the most of the usually underused GPU capabilities to support this
widely adopted data encryption standard. Moreover, the protection is added as an additional layer
at the file system level, so as to make it as transparent to the end user as possible. A Virtual File
System is implemented on top of ext4, the Fourth Extended Filesystem [12], using the Filesystem in
Userspace (FUSE) framework [13] [14].
Other similar approaches have been proposed for solving this problem, in diverse settings [15]
[16]. Our approach goes beyond the existing state of the art by proposing a complete file system
design from scratch and not simply the modification of a pre-existing encrypted file system. This
approach makes it possible to tune specific aspects of the file system operation, making it a better
fit for the GPU computing paradigm, allowing full use of the GPU computing power and thus maximizing the integration benefits. Our experiments further support the validity of this approach, with measured read and write throughputs on the order of 580 MB/s and 95 MB/s, respectively, improving on previous state-of-the-art solutions by a factor of 2.
1.2 Goals
The importance of information security has long been recognised as a major concern in computer
science and data confidentiality is probably one of its most valued properties for the general public. No
longer is this an exclusive concern of expert users; it has also become one for less security-aware users, who can easily recognise the value of keeping their private data safe from unauthorised
access.
This dissertation aims at implementing a secure file system that provides confidentiality of its users' data automatically while minimising the impact on the overall system performance. Its main
goal is thus to enable widespread use of cryptographic encryption mechanisms ensuring data privacy
by making them as transparent and efficient as possible, so as to minimise any potential obstacles
hindering their use.
Confidentiality
The system must be able to keep user data private and only accessible to its legitimate owners.
Any unauthorised access attempt should be prevented by making data unintelligible to those not allowed
to access it.
Performance
The system mechanisms providing privacy of user data must be efficient and their impact on the
overall system performance must be kept to a minimum. Moreover, the solution should make full use
of all the GPU computing resources available in the system in order to maximize its efficiency.
Usability
Security mechanisms tend to be disabled by their users when they are not easy to use. This can
be due to several reasons, such as requiring repetitive user interaction. In order to prevent these
shortcomings, the system must operate in a transparent way to the end user. Ideally, the user should
not even need to be aware of the system's additional security features in order to benefit from them.
The main goal of this dissertation is to design a solution and implement a system that fulfils the
aforementioned properties and deliver a secure system that ensures the privacy of its users' data in an efficient and transparent way.
1.3 Requirements
A formal set of requirements can be specified, based on the goals of this dissertation. The system
proposed in this document must be able to fulfill the following requirements:
1. User file data should be stored encrypted on the hard disk;
2. The cryptographic mechanisms should be executed on the GPU, when it provides better perfor-
mance than the CPU;
3. Users without administrator rights should be able to use the system;
4. The cryptographic mechanisms implemented on a GPU with sufficient resources should outperform an equivalent version implemented on the CPU and provide a higher access bandwidth to encrypted files;
5. The system cryptographic mechanisms should incur negligible load on the system’s CPU;
6. File data privacy should be transparent and must not require additional user interaction.
1.4 Dissertation Outline
This document describes a secure file system which leverages state-of-the-art mechanisms for using the GPU as a cryptographic co-processor to ensure the privacy of its users' file data. The proposed
solution, implementation and experimental results are presented in the following Chapters, organised
as follows:
• Chapter 2 presents the background and state of the art analysis which determined the proposed
solution design. The background Section consists of the description of: cryptographic mecha-
nisms, concerned with providing security of information and more specifically data encryption
algorithms for data privacy; the use of the GPU as a general purpose processor, delivering a
large amount of computing power in diverse application fields; and user space file systems,
which can be used for implementing file systems outside kernel space. The state of the art
section presents previous attempts at using the GPU for: accelerating data encryption algo-
rithms; accelerating typical file system operations; and integrating both into GPU-accelerated
cryptographic file systems.
• Chapter 3 describes the proposed solution, a secure encrypted file system with transpar-
ent cryptographic mechanisms, efficiently implemented using the GPU as a cryptographic co-
processor.
• Chapter 4 discusses the implementation details of the proposed solution. It describes the im-
plementation of a FUSE file system, the AES encryption algorithm implementation on the GPU
and the integration of both into a file system with enhanced cryptographic mechanisms.
• Chapter 5 presents the experimental evaluation of the proposed solution. This chapter describes how the system's design and implementation achieve the goals and requirements of
the dissertation. For this purpose a comparison is made between different file systems en-
hanced with cryptographic mechanisms implemented on the CPU and the GPU.
• Chapter 6 concludes this document by summarising the work developed in the context of this
dissertation and introducing future work directions which can be followed based on the proposed
solution.
2 Background and State of the Art
Contents
2.1 Cryptography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Graphics Processing Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 User Space File Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 State of the art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
This section presents the relevant background on the main topics covered in this dissertation.
Section 2.1 provides a brief introduction to cryptography and how it can be used to provide data
confidentiality. Section 2.2 describes the GPU and its use as a general purpose co-processor used
to accelerate primitives typically implemented in the CPU. The development of file systems in user
space and the frameworks allowing it are presented in Section 2.3. Finally, Section 2.4 presents
previous state of the art solutions to the problem covered in this dissertation.
2.1 Cryptography
Cryptography [4] [17] [18] has been historically defined as the art of writing and solving codes.
Its origins date back to several centuries ago when it was first used for exchanging secret messages
between two parties in the presence of adversaries. Modern cryptography however is a much vaster
field and has become a science on its own, more than solely an art form. Its use is no longer re-
stricted to that of secret communication and it can be described nowadays as the science concerned
with problems that may arise in any distributed computation subject to internal or external attacks.
These include not only privacy but also message authentication, digital signatures and secret key
exchange and authentication protocols. Thus, modern cryptography can be more adequately de-
fined as the scientific study of techniques for securing digital information, transactions and distributed
computations.
Modern cryptography also differs from classical cryptography in respect to the entities using it.
Roughly until the 1980s, cryptography was mainly used by military and intelligence organisations.
Actually, the use of cryptography in World War II to decipher enemy communications was one of the
main drivers for the development of computer science. This is no longer the case, as cryptography is
an integral part of any modern computer system today. Even though mostly unknowingly, users rely
on cryptography for tasks such as securely accessing a website, ensuring the integrity of an e-mail
message or enforcing access control in multi-user computer systems.
Due to its wide range of applications, cryptography is currently a central topic in computer science
and its main purposes can be summarised as being the following:
• ensuring the confidentiality of private data exchanged between a sender and its intended recip-
ient;
• authenticating data so that its source is correctly identified;
• ensuring message integrity, so that its content upon retrieval by the intended recipient is the
same as when it was originally sent;
• preventing the sender of a message from refuting the claim of having sent it, also known as
non-repudiation;
• specifying and controlling access to resources;
• ensuring the availability of resources to authorised parties at all the times.
Figure 2.1: Private-key cipher [17]
Even though they are obviously related, these concerns do not overlap and each aims at provid-
ing distinct security properties which can then be used as the building blocks of a secure computer
system. The cryptographic mechanisms used to tackle each of these specific problems are based on
the following main concepts:
• plain text : the state in which information can be understood by the sender and recipient of a
message, but also by anyone who may intercept it;
• cipher text : obtained by transforming the original content of a message in such a way that it
becomes unintelligible to anyone but its sender and intended recipient;
• encryption: the process by which plain text is transformed into cipher text ;
• decryption: reverse process, which transforms cipher text back into its original legible form;
• key : secret shared by two communicating parties, used to encrypt and decrypt data.
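As a minimal, deliberately insecure sketch of these concepts, a single-byte XOR transformation illustrates how one shared key drives both encryption and decryption; the function name and key value below are hypothetical and chosen purely for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy XOR "cipher": the same routine turns plain text into cipher text
 * and cipher text back into plain text, given the shared secret key.
 * This is NOT a secure cipher; it only illustrates the symmetry of
 * encryption and decryption under a shared key. */
static void xor_cipher(uint8_t *buf, size_t len, uint8_t key)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= key;   /* XOR is its own inverse */
}
```

Applying `xor_cipher` twice with the same key recovers the original message; an adversary intercepting the buffer without the key sees only the transformed bytes.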
Despite its various potential uses, the remainder of this section will focus on the use of cryptography to ensure confidentiality of user data in computer systems, as that is the goal of this dissertation.
Cryptographic mechanisms for ensuring data confidentiality can be divided into two main classes,
described next in the context of private communication between two parties, in the presence of ad-
versaries.
2.1.1 Symmetric-key cryptography
Symmetric-key cryptographic mechanisms rely on the existence of a secret piece of information
shared exclusively among the communicating parties in advance. This information is called a private
key and it is used by the communication parties whenever they need to exchange messages. The
sender uses the private key to encrypt the message, transforming it into cipher text, unintelligible to any
adversary capable of intercepting the message. Symmetrically, the receiver uses the same private
key to decrypt the message back into its original plain text form. In this setting the possession of
the unique private key distinguishes the communicating parties from any adversaries. It is the sym-
metry of both the sender and the receiver using the same key that makes this class of cryptographic
mechanisms known as symmetric-key, as depicted in Figure 2.1.
Symmetric-key schemes assume that the communicating parties are able to securely exchange
the secret private key prior to the exchange of messages. This may not always be easy to achieve in
the modern distributed setting, and is in itself a major concern limiting the applicability of symmetric-
key mechanisms. There are however several applications for which this is not a great concern, of
which disk encryption is an example. In this case the key does not actually need to be distributed
among distinct parties, as it must only be known by the owner of the data in order to encrypt and
decrypt its files.
The Data Encryption Standard (DES) [22], the Triple Data Encryption Standard (3DES) [24], the
Advanced Encryption Standard (AES) [7] and Blowfish [25] are some of the most relevant and widely
used symmetric encryption algorithms of all time. DES and 3DES are mentioned for historical reasons, since their use is no longer recommended. AES and Blowfish are widely used today and AES
is actually considered as the de facto standard for data encryption.
AES
The Advanced Encryption Standard (AES), as defined by the Federal Information Processing
Standards (FIPS) Publication #197 in 2001 [7], is the standard symmetric encryption scheme rec-
ommended by the National Institute of Standards and Technology (NIST) of the United States. It
replaced the previous standard, the Data Encryption Standard (DES), and was the winner of a three-year-long competition, out of a total of 15 competing proposals. It is based on the Rijndael [19] algorithm, created by Joan Daemen and Vincent Rijmen. The choice of holding a competition to choose the new encryption standard has been regarded as an ingenious one, because it was the perfect motivation for competing groups, composed of the world's best cryptanalysts, to try to attack and exploit each other's proposals, therefore greatly boosting the confidence in the security of the winner.
Since its recognition by NIST, AES has been widely regarded as secure and efficient and has been
adopted by both the government and the industry worldwide. It is the de facto symmetric encryption
standard and almost any new protocol that requires symmetric encryption supports it.
AES is a block cipher that uses a block size of 128 bits and key sizes of 128, 192 or 256 bits. It is
essentially a substitution-permutation network [17] that operates on a 4x4 array of bytes, called state,
initialised with the cipher input, and an expanded key, derived from the master secret key.
It operates by repeatedly applying rounds of transformations to the state. The number of rounds
varies with the key size and amounts to 10, 12 and 14 for 128-, 192- and 256-bit keys, respectively.
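This dependence of the round count on the key size follows the FIPS-197 relation Nr = Nk + 6, where Nk is the key length in 32-bit words; a one-line helper (the function name is hypothetical) captures it:

```c
/* Number of AES rounds for a given key size, per FIPS-197:
 * Nr = Nk + 6, with Nk the key length in 32-bit words.
 * 128-bit key -> 10 rounds, 192-bit -> 12, 256-bit -> 14. */
static int aes_rounds(int key_bits)
{
    return key_bits / 32 + 6;
}
```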
The four operations illustrated in Figure 2.2 are performed in each round of the algorithm, except
for the final round where the MixColumns stage is replaced by an additional AddRoundKey. These
operations consist of:
• AddRoundKey : XORs a 16-byte round key, derived from the master key, with the state array;
(a) Add round key operation
(b) SBox Translation operation
(c) Shift rows operation
(d) Mix columns operation
Figure 2.2: AES round operations
• SubBytes: replaces each byte of the state array, according to a single constant substitution
table, called S-box ;
• ShiftRows: left shifts the bytes in each row of the state array as many times as the row number,
starting at zero;
• MixColumns: mixes each column via an invertible linear transformation in which each column is
interpreted as a polynomial. This step can also be viewed as a matrix multiplication.
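To make two of these steps concrete, the sketch below implements AddRoundKey and ShiftRows on a 4x4 state. For readability it assumes row-major storage of the state (FIPS-197 lays the state out column by column), so it is an illustration of the transformations rather than a drop-in AES component:

```c
#include <stdint.h>
#include <string.h>

/* AddRoundKey: XOR the 16-byte round key into the state. */
static void add_round_key(uint8_t state[4][4], const uint8_t rk[16])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            state[r][c] ^= rk[4 * r + c];
}

/* ShiftRows: left-rotate row r by r byte positions (row 0 unchanged). */
static void shift_rows(uint8_t state[4][4])
{
    for (int r = 1; r < 4; r++) {
        uint8_t tmp[4];
        for (int c = 0; c < 4; c++)
            tmp[c] = state[r][(c + r) % 4];
        memcpy(state[r], tmp, 4);
    }
}
```

Both transformations are trivially invertible (XOR with the same round key, and right rotation by the same amounts), which is what allows the decryption process to undo each round.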
Each round of the algorithm uses a different 128-bit round key. This key is derived from the original
private master key by applying a Key Expansion algorithm and is independent of the processed data,
meaning key expansion can be performed prior to the encryption/decryption phase.
The decryption process can be implemented in two equivalent ways: the inverse cipher and the
equivalent inverse cipher, which differ in the order of the round stages and in the way that the decryp-
tion round keys are defined.
AES has undergone intense scrutiny ever since the competition that led to its selection, and the
only cryptanalytic attacks found so far [20] are for reduced-round variants of it. Reduced-round vari-
ants are not standard compliant though. Side-channel attacks [21], a different class of attacks that
assume the adversary is able to make measurements such as the time taken or the power con-
sumed by the processor during encryption, have been able to extract secret keys in a reasonable
amount of time. However, these kinds of attacks make very strong assumptions about the power of
the adversary and are not considered feasible in the common scenario. Besides its security guarantees,
AES is free and standardised, being available in virtually any platform. It is considered an excellent
choice for almost any cryptographic implementation.
AES-NI
Since its adoption as a NIST standard, AES has become very popular and its importance is ex-
pected to continue to rise in the near future. Any possible improvements to its performance and
security are therefore regarded as highly beneficial. With this in mind, in 2010, Intel introduced
AES-NI (AES New Instructions), a new set of processor instructions aimed at providing hardware
acceleration for AES [6].
The architecture consists of six instructions: AESENC, AESENCLAST, AESDEC and AESDE-
CLAST, supporting the encryption and decryption processes; and AESIMC and AESKEYGENAS-
SIST, for expanding the AES key. The instructions support the 128-, 192- and 256-bit standard key
lengths with the 128-bit block size and are well suited to all common uses of AES: bulk encryption/de-
cryption using any of the ECB, CBC and CTR cipher modes; data authentication; random number
generation; and authenticated encryption.
AESENC and AESDEC are short for AES encrypt and decrypt round, while AESENCLAST and
AESDECLAST are short for AES encrypt and decrypt last round. These instructions operate on
both register-register and register-memory mode and perform the longest AES encryption/decryption
sequence possible, without introducing any branch variation.
AESKEYGENASSIST and AESIMC are short for AES Key Generation Assist and AES Inverse
Mix Columns. They are used for assisting the generation of the encryption round keys and converting
them to a form suitable for decryption using the Equivalent AES Inverse Cipher method, respectively.
Given the input block and the AES key, it is possible to generate the required expanded key and
implement any key-size variation of the AES algorithm with this set of instructions. Moreover, non-
AES-compliant versions of the original Rijndael algorithm, e.g. versions with a different block size,
can also be implemented using these instructions.
In addition to the obvious performance benefits provided by these instructions, when used properly
they can also boost security, because they have been designed to mitigate the previously mentioned
AES timing and cache side-channel attacks.
Modes of operation
A block cipher mode of operation determines how the blocks making up the input of the encryption
algorithm are combined during the encryption process. Several modes with distinct security properties
have been proposed over the years.
Figure 2.3: ECB encryption [26]
The Electronic Codebook (ECB) mode is the simplest of all modes of operation and encrypts all blocks
in the same way. It has no inter-block dependencies, making it easily parallelizable. Figures 2.3 and 2.4
show ECB's encryption and decryption processes, which consist of simply applying the block algorithm
to each data block. This means that two identical plain texts will result in two identical cipher texts,
which makes it rather insecure.
Figure 2.4: ECB decryption [26]
Figure 2.5: CBC encryption [26]
The Cipher Block Chaining (CBC) mode of operation was developed by IBM in 1976. It is based
on XORing each block of plain text with the output of ciphering the previous block, except for the first
block, which is XORed with an initialization vector, as Figure 2.5 shows. The decryption process is
similar, as shown in Figure 2.6. CBC is more secure than ECB, but the inter-dependency between
blocks prevents its parallelization.
Figure 2.6: CBC decryption [26]
Figure 2.7: CTR encryption [26]
The Counter (CTR) mode differs from the previous modes in that it introduces a nonce, which is
similar to an initialization vector but includes a counter element in order to introduce variability. CTR
operates by encrypting the nonce, incremented by one for each block, at every step and XORing the
result with the plain text block, as shown in Figure 2.7. The decryption process is equivalent, as
presented in Figure 2.8. The lack of dependency between blocks makes it possible to parallelize CTR,
while the nonce makes it a secure mode of operation.
Figure 2.8: CTR decryption [26]
2.1.2 Asymmetric-key cryptography
Asymmetric-key cryptography is the other major class of cryptographic mechanisms. Its primary goal,
in the context of private communication, is to solve the problem of securely exchanging the private key
among all communicating parties. It achieves this goal by eliminating the need to exchange a single
shared private key, using instead two distinct keys for encrypting and decrypting messages. It is
able to do so by ensuring that the security of its encryption scheme is preserved even in the presence
of an adversary in possession of the encryption key. As such, instead of sharing a single private key,
each communicating party has its own pair of keys: a public key and a private key. The public key is
used to encrypt messages and, as the name suggests, is made publicly available to everyone. The
private key, on the other hand, is kept private by its owner and is used to decrypt messages. In this
setting, for two entities to communicate privately, each must only obtain the other's public key, which
by definition is publicly available, and ensure that its own private key is kept secure. The sender then
encrypts the message with the receiver's public key and the receiver uses its private key to decrypt it.
It is this asymmetry between the keys used by the sender and the receiver that gives this class of
cryptographic mechanisms the name asymmetric-key.
The development of asymmetric-key mechanisms, and more specifically the way they solve the key
distribution problem, revolutionised the field of cryptography, allowing its widespread use, especially on
the Internet. Moreover, asymmetric-key mechanisms made it possible to address several of the cryp-
tographic purposes listed above, such as message authentication and non-repudiation. Asymmetric-
key mechanisms are not, however, without caveats, the most prominent being their susceptibility
to man-in-the-middle attacks. One such example is the case in which an adversary is able to provide
a false public key to a communicating party and is thereafter able to decipher all of its messages.
The Diffie-Hellman [27] key exchange protocol and the RSA [28] encryption algorithm are two of
the most notable and widely used asymmetric-key cryptographic mechanisms. Because its compu-
tational requirements are often 2 to 3 orders of magnitude higher than those of symmetric-key
encryption schemes, the RSA algorithm is in practice often used only for exchanging symmetric
keys, which are thereafter used to encrypt messages on the Internet with an algorithm such as AES.
As a final remark, any secure cryptographic mechanism should obey the famous Kerckhoffs' prin-
ciple [29]. This principle states that an encryption scheme's security should not rely on the secrecy
of its operation, but rather only on that of its key. Several arguments can be presented in favor of this
principle: it is easier for the involved parties to maintain the secrecy of a short, say 128-bit, key than
of a whole algorithm; in case of disclosure, it is much easier to replace a key than to replace the
encryption algorithm; and most of the cryptographic community agrees that intense public
scrutiny is the best way to ensure the security and reliability of encryption schemes.
A private-key encryption mechanism is characterised by three algorithms: one for generating pri-
vate keys; another for encrypting messages; and one for decrypting them. The basic correctness
property that must hold for these algorithms is that decrypting a ciphertext with the same key used
to encrypt it yields the original message. Moreover, according to Kerckhoffs' principle, these
algorithms are not required to be secret, and their knowledge by an adversary should not impact
the mechanism's security.
2.2 Graphics Processing Unit
The Graphics Processing Unit (GPU) is one of the most pervasive parallel processors to date, both
in commercial computer systems and in supercomputing platforms. A dedicated graphics processor
was first introduced by IBM back in 1984, but the term GPU was only coined by NVIDIA in 1999.
Since then the GPU has been constantly evolving. It is a piece of hardware specifically designed to
enable fast graphics rendering, especially important in computer games and image processing
applications. The market's constant demand for real-time, high-definition and ever more realistic 3D
graphics, driven mainly by the video game industry, has transformed it into a highly parallel, multi-
threaded, many-core processor with great computational power and memory bandwidth. In fact,
today's GPU arithmetic throughput and memory bandwidth far surpass, and have evolved more
quickly than, those of the CPU over the last decade, as Figure 2.9 shows. These characteristics make
the GPU an ideal processor for tackling a large number of data-parallel problems in all sorts of
scientific application fields.
Many attempts have been made to harness the computational power of the GPU outside the field
of graphics processing. As early as 2003, several data-parallel algorithms were ported to the GPU
using graphics-oriented APIs and shading languages such as DirectX, OpenGL [31] and Cg [32],
in application domains as diverse as protein folding, stock option pricing and SQL queries. Many
of these attempts succeeded in achieving remarkable performance speedups on the GPU. These
early efforts at using the GPU and its graphical APIs for general purpose computing have been
designated General Purpose GPU (GPGPU) [33] programs.
Despite the noticeable performance benefits, mapping general purpose algorithms onto
graphical APIs is not an easy endeavour: deep knowledge of the APIs and the GPU architecture is
required; problems must be expressed in terms of vertex coordinates, textures and shader programs;
and common basic programming features, such as random reads and writes to memory, are not sup-
ported. The increased complexity and lack of appropriate tools often rendered the GPU approach
infeasible.
NVIDIA promptly recognised the potential of the GPU as a general purpose processor and
introduced two key technologies that greatly boosted its adoption: the G80 unified graphics and compute
architecture and CUDA. CUDA stands for Compute Unified Device Architecture, and is a software and
hardware architecture for programming the GPU in a variety of high level programming languages.
With the introduction of these new technologies the GPU no longer needed to be programmed as
a dedicated graphics processing unit, using specialised graphics APIs. General purpose problems
could now be solved in the GPU, using the C programming language and CUDA extensions, and
leverage a massively parallel processor.
Figure 2.9: GPU vs CPU performance evolution [30]
2.2.1 Evolution of the GPU architecture
The GPU architecture was originally based on the concept of the graphics pipeline. The graphics
pipeline is an abstraction of the stages through which graphics data is sent and its implementation
was rather consistent across all major GPU manufacturers at the time, which helped the adoption of
the GPU technology. The graphics pipeline can be thought of as a set of geometry and rendering
operations that take 3D space coordinates as input from the programmer and transform them into
2D pixels on the computer screen. Early GPUs only implemented the rendering stages, relying on the
CPU to perform geometry transformations, but more stages were added to the GPU as the technology
evolved, freeing the CPU for other tasks. The following are some of the most relevant milestones in
the history of the GPU [34]:
• The IBM Professional Graphics Controller (PGA) [35], released in 1984, was one of the first
video cards and processed all video related tasks using an on-board Intel 8088 microprocessor.
Its price, around $5500, and lack of compatibility, prevented it from achieving mass-market
success. However, its approach of using a dedicated auxiliary processor marked an important
step in the evolution of the GPU;
• NVIDIA's GeForce 256 and ATI's Radeon 7500, both released in 1999, were the first GPUs
available at the consumer level to implement the entire graphics pipeline. All stages of the
pipeline were performed on the GPU by fixed-function units;
• The programmable pipeline was the next step in the evolution of the GPU. It addressed the
lack of flexibility of the fixed-function pipeline and was introduced by NVIDIA's GeForce 3
in 2001. NVIDIA's GeForce FX and ATI's Radeon 9700 followed in 2002 and were the first fully
programmable graphics cards on the market;
• The unification of the programmable portions of the pipeline was the last major step in the
evolution of the GPU and was introduced by NVIDIA’s G80 architecture, with the release of
the GeForce 8800 in 2006. The G80 architecture was the first to feature a fully programmable
unified processor, called Streaming Multiprocessor (SM) that handled all kinds of computation,
be it vertex, pixel or geometry computation. The unification of the GPU hardware turned the
graphics pipeline into a purely software abstraction;
• Since the introduction of the original G80 architecture, several modifications have been made to
it by NVIDIA with the release of new generations of its GPU architecture, namely Fermi in 2010,
Kepler in 2012, Maxwell in 2015 and Pascal in 2016. A new architecture, codenamed Volta has
already been announced and is set to be released in 2018.
As many-core GPUs' computing power continues to grow according to Moore's law [36], the biggest
challenge posed by this type of parallel system is to build application software that is able to trans-
parently scale its parallelism to leverage the increasing number of processor cores. CUDA, a general
purpose parallel computing architecture, is NVIDIA's answer to this challenge. CUDA has its
own parallel programming model and instruction set architecture and leverages NVIDIA GPUs’ par-
allel compute engine to solve many complex computational problems that may not fit well under the
CPU architecture. In general, its simple programming model is designed to allow the scaling of the ap-
plication parallelism with respect to the different number of processing cores among different NVIDIA
GPU architectures.
Evolution of NVIDIA’s GPU architectures
The G80 architecture was introduced by NVIDIA in November 2006, along with CUDA. Together,
these two key technologies enabled the use of the GPU as a general purpose massively parallel
processor. G80 introduced several key innovations, which made GPU computing possible:
• support for C-like programming constructs to ease the development of GPU code;
• unification of the vertex and pixel pipelines into a single processor capable of running vertex,
geometry, pixel and computing programs;
• use of a scalar thread processor, eliminating the need for manually managing vector registers;
• introduction of the Single-Instruction Multiple-Thread (SIMT) execution model in which a single
instruction is concurrently executed by multiple independent threads;
• support of shared memory and barrier synchronization for inter-thread communication.
Figure 2.10: G80 architecture
Figure 2.10 shows a diagram of the G80 architecture. It is based on a Scalable Processor Array
(SPA) composed of 128 Streaming Processor (SP) cores, organised as 16 Streaming Multiproces-
sors (SMs) of 8 SP cores each. Work flows from top to bottom, originating at the PCI-
Express host interface, and is handled by the SPA. An external Dynamic Random Access Memory
(DRAM) makes up the scalable memory system. Compute thread arrays are dispatched to the SPA
by the compute work distribution block, which handles multiple logical streams simultaneously. The
remaining components are dedicated to implementing the graphics pipeline and perform fixed-function
operations on graphics data before it is fed to the SPA.
GT200, introduced in 2008, was a major revision to G80 and featured a set of improvements,
namely:
• an increased number of streaming processors, thereafter called CUDA cores, from 128 to 240;
• a doubled register file size per processor, allowing the execution of a greater number of
threads at any given time;
• hardware memory access coalescing, enabling more efficient memory accesses;
• support for double precision floating point arithmetic, required by scientific and high-performance
computing applications.
Figure 2.11: Fermi Architecture [38]
In 2010, a new CUDA architecture generation, codenamed Fermi [38], was released based on
the groundwork of G80 and GT200. It employed a completely new approach to the GPU architecture
design and introduced improvements focused on the following key areas:
• improving the performance of double precision arithmetic, bringing it up to par with single
precision floating point performance;
• adding Error-Correcting Code (ECC) support in order to protect data-sensitive applications like
medical imaging and financial options pricing from memory errors;
• introducing an improved memory cache hierarchy;
• increasing the amount of shared memory to more than 16 KB per SM and providing faster
context switching.
These requirements were met by increasing the GPU's raw computing power through architectural
innovations which dramatically improved the GPU's programmability and compute efficiency. Figure
2.11 presents an overview of the Fermi architecture. It features 512 CUDA cores organised in 16 SMs
of 32 cores each; has a 384-bit memory interface, divided into six 64-bit memory partitions supporting a
total of 6 GB of GDDR5 DRAM memory; connects to the CPU via a PCI-Express host interface; and
uses the GigaThread global scheduler to distribute thread blocks among the SM thread schedulers.
Kepler was released in 2012 and offered even higher processing power than its predecessors by
providing new methods to optimize and increase parallel workload execution on the GPU. It provided
over 1 Tflop of double precision throughput, while also representing a huge leap in power efficiency,
delivering up to 3x the performance per watt of Fermi. Kepler enabled increased GPU utilisation and
simplified parallel program design. The key characteristics of the Kepler [39] architecture making
these features possible were a new SMX processor architecture, an enhanced memory subsystem
and hardware support throughout the design to enable new programming model capabilities. Besides
its new and improved computing capabilities, Kepler's improved power efficiency made GPU
computing in devices such as laptops a more affordable option.
Figure 2.12: Full Maxwell chip-block [40]
Maxwell was released in 2015 and was mainly driven by the leap in PC display technology at
the time. After years of little or no progress, the demand for higher screen resolutions increased
dramatically with the introduction of 4K and Virtual Reality technologies, both at an affordable cost.
This resulted in an even greater need for powerful and power-efficient GPUs. Maxwell aimed precisely
at delivering the performance required by these new technologies while still improving on the
already efficient power consumption of the previous architecture generation by a factor of 2.
Table 2.1: Maxwell’s key properties [40]
Figure 2.12 presents the diagram of Maxwell's [40] full-chip block. It consists of 4 Graphics
Processing Clusters (GPCs), 16 Maxwell SMs (SMMs) and four memory controllers. Each GPC has
4 SMMs and a dedicated raster engine. Each SMM is composed of 128 CUDA cores, a PolyMorph
Engine and eight texture units, for a total of 2048 CUDA cores and 128 texture units. Table 2.1
summarises some of Maxwell's key properties, most notably the high number of CUDA cores (2048)
and a memory bandwidth exceeding 200 GB/s.
2.2.2 CUDA Programming Model
The CUDA programming model's main design goal is to enable the development of GPU programs
that transparently scale their parallelism according to the number of processing cores available, while
maintaining a low learning curve for programmers already familiar with standard programming lan-
guages, such as C. It is based on three core concepts, exposed as a minimal set of language ex-
tensions: a hierarchy of thread groups, shared memories and barrier synchronisation. These provide
both fine- and coarse-grained data parallelism, at thread and task level respectively, allowing the pro-
grammer to partition the problem into coarse sub-problems, solved in parallel by independent groups
of threads, called thread blocks. Each coarse sub-problem can in turn be further refined into
finer sub-problems, solved cooperatively by all threads within the same block.
Figure 2.13: CUDA automatic load balancing
In this model, threads are able to work together in groups, solving problems cooperatively by shar-
ing data in low-latency on-chip memory and by synchronizing among themselves within the same Stream-
ing Multiprocessor. At the same time, groups of threads can leverage the high number of SMs to
scale program execution by running independently, without relying on any particular execution order.
The lack of dependence between independent groups of threads allows them to be scheduled by the CUDA
runtime on any of the available SMs without impacting program correctness. This model relieves the
programmer from knowing the physical characteristics of the GPU, which are handled automatically
by the CUDA runtime in order to optimize workload scheduling. This mechanism achieves automatic
load balancing, as depicted in Figure 2.13, by mapping work to the available GPU resources. In this
example, the same amount of work, 8 thread blocks, is assigned to 2 different GPUs: one with 2 SMs
and the other with 4 SMs. On the GPU with 2 SMs, each SM is expected to handle 4 blocks of threads,
while the one with 4 SMs is expected to only handle 2 blocks of threads per SM. The assignment of
thread blocks to the available SMs is done automatically by the CUDA runtime and enables CUDA
programs to run on a multitude of graphics cards, ranging from high-performance cards to those with
fewer resources, without modification of the source code or additional configuration.
Figure 2.14: CUDA two-dimensional block grid
The work performed by each CUDA thread is represented by a kernel in the CUDA C language
extension, which corresponds to a typical C function annotated with specific CUDA directives. Kernels
are concurrently executed by N different CUDA threads, each assigned a unique thread ID, used
to orchestrate work among threads in the same group. Each group of threads, composed of at most
1024 threads, is able to share memory and synchronise its work through low-latency shared memory
and synchronisation primitives. Multiple groups of threads can be executed by different SMs, resulting
in a much higher total number of threads executing simultaneously at any given time on the GPU.
Thread blocks are themselves organised into one-, two- or three-dimensional grids. Figure 2.14
shows a two-dimensional grid of thread blocks as an example.
Finally, Figure 2.15 depicts the CUDA memory hierarchy, which is composed of multiple memory
spaces with distinct sets of characteristics. Each individual thread has access to a private local
memory which can be accessed exclusively by it. Groups of threads are able to communicate through
two distinct types of memory: low-latency shared memory, accessible to all threads in the same
group; and global memory, a larger memory accessible to all threads running on the GPU. Global
memory is persistent across kernel launches, while private and shared memory spaces have the
same lifetime as their associated thread and thread block, respectively.
Figure 2.15: CUDA memory hierarchy
2.3 User Space File Systems
The development of Unix in-kernel file systems is not a trivial task, for several reasons. It requires a
great understanding of complicated kernel code and data structures, making new code prone to bugs
which are usually hard to trace and fix for the inexperienced kernel developer. Moreover, writing
kernel code is very different from writing application code, and the learning curve for the typical
programmer can be very steep. This is due to the absence, in kernel code development,
of commonly available features when developing user space applications. When developing kernel
code there are no memory protection mechanisms in place, careful use of synchronization primitives
is required and the use of a somewhat limited version of the C programming language (without the
standard C library support) is mandatory. Moreover, debugging kernel code can be a very tedious
task, typically requiring repeatedly recompiling and rebooting the kernel and the machine.
On top of the hindrances faced during development, a fully functional in-kernel file system still
has several drawbacks. Even though the use of standard file system interfaces, such as the Virtual
File System (VFS) layer, can make the task of porting an in-kernel file system to a different Unix-like
operating system easier, it can still require significant design and implementation changes. Also, in-
kernel file systems must be mounted with superuser privileges, which can severely limit their usability,
and even their feasibility, in centrally administered machines.
Most of the problems described above, faced in kernel development, are mitigated, or completely
eliminated, when developing file systems in user space. Thus research in the area of storage and
file systems has focused on the enhancement of basic file systems in user space, as opposed to
designing low-level file systems directly. When developing a feature in user space, the programmer
can leverage a wide range of familiar programming languages, third-party tools and libraries, other-
wise unavailable in the kernel. This allows the programmer to focus on the essential features that the
system must deliver without worrying about the intricacies posed by the kernel.
Regardless of its advantages, developing file systems in user space brings its own set of
challenges into practice. The more a file system implementation is coupled to a particular operating
system, the more effort will be required to port it to a different Unix flavour, and even more to a different
operating system. Furthermore, user space code typically cannot run as efficiently as a kernel
counterpart, by virtue of inherent user space overheads such as processor context switches. Yet
ever-increasing processor, memory and bus speeds have made performance less of a limiting factor
on current commodity machines.
FUSE [13] is a widely used framework for developing file systems in user space. It is available for
Unix-like operating systems and has been merged into the Linux kernel since version 2.6.14. Several
ports for other mainstream operating systems are also available, such as Mac OS’s OSXFUSE.
FUSE
FUSE, which stands for Filesystem in Userspace, is a widely used framework that allows the
development of fully functional file systems in user space. FUSE’s purpose is to solve the previously
mentioned issues faced when developing in-kernel file systems. It does so by exposing a simple API
to the developer, made up of a set of familiar file system operations, which the developer only needs
to implement in user space. File systems can be implemented in any of the programming language
bindings available for FUSE (e.g. Java, Python, Ruby) and can leverage the large number of libraries
available. The simplicity of both its development and its use, since it can be mounted by regular non-
privileged users, makes file system development accessible to a much larger number of programmers,
greatly broadening its application in various fields.
The FUSE framework is relatively simple and it is composed of two main components: a loadable
kernel module and a user space library, libfuse. The fusermount utility is also part of the framework
and enables non-privileged users to mount and unmount FUSE file systems. The remainder of this
section describes how a typical FUSE file system operation works, in order to present the framework's
internals.
Figure 2.16 depicts the steps involved in a FUSE operation. In this example, the userfs executable
implements the set of FUSE API callbacks and the file system is mounted in the /fuse directory using
fusermount. Once mounted, a fusefs file system is registered with VFS and all file system calls target-
ing the /fuse mount point are forwarded to the FUSE kernel module. The kernel module then invokes
the appropriate callback in the userfs executable, which handles the operation and returns the result
Figure 2.16: FUSE architecture [41]
to the kernel. The communication between the FUSE kernel module and the user level executable
is achieved by using the /dev/fuse character device to forward system calls and their arguments to the
libfuse library. Finally, the result is propagated back through the same device to the kernel and to the
application from which the initial request originated.
All FUSE file system I/O operations follow an identical mechanism. The read operation is pre-
sented next. When a call to read a file hits fusefs at the VFS layer, it is forwarded to the FUSE kernel
module. If the requested data is found on the page cache, it is immediately returned. Otherwise the
read callback registered in the user provided executable is invoked. The actual implementation of the
file system callback is completely up to the developer and the operations it performs may be of any
kind. In the case of the read operation, data may be obtained from a remote server, or some type
of pre-processing can be made on the requested data, leveraging other file systems for storage. The
result is then forwarded back to the kernel and to the originating application.
FUSE provides two different APIs to the file system developer: a high-level and a low-level API.
The low-level API forces the programmer to directly manage inodes, perform pathname translations
and manually fill the data structures used to reply to the kernel. This API is especially useful
for file systems developed from scratch. The high-level API allows the programmer to deal only with
pathnames, rather than inodes, mapping more directly to system calls. All the low-level operations,
like inode-to-pathname translations and the filling and sending of the reply data structures to the
kernel, are automatically handled by libfuse. Most FUSE-based file systems use the high-level API,
as it is sufficient for most purposes and far simpler and more accessible.
A wide variety of FUSE-based file systems are available, including traditional disk-based file systems,
but the possibilities are limitless, and several non-traditional uses have been proposed, as
discussed in the following state-of-the-art section.
2.4 State of the art
Using the GPU as a cryptographic accelerator is not a new endeavour. The first efforts to efficiently
implement the AES encryption algorithm in the GPU date back to the time when the GPU was still a
dedicated graphics accelerator, which could only be programmed using graphical APIs and was very
limited by the graphics pipeline. This was neither an easy task nor did it provide great performance
benefits. [49] demonstrated the feasibility of implementing the AES encryption algorithm on the GPU
for the first time. However, the implementation of AES in OpenGL also showed that the APIs available
for programming the GPU at the time did not properly suit this kind of problem. Despite the
poor results achieved for offloading the computation of AES to the GPU for the general use-case,
the approach nonetheless proved worthwhile for the encryption and decryption of graphical displays
and images, in order to avoid temporarily storing them in the system's main memory in plain text. These
results were some of the first proving the merit of using the GPU for offloading computation typically
done on the CPU for very specific requirements.
The advent of General Purpose GPU (GPGPU) [33] made it possible for a wider group of re-
searchers to exploit the Graphics Processing Unit as a cryptographic accelerator, by eliminating the
need for a deep knowledge of the GPU graphical computing pipeline details and specific graphical
programming languages. Early CUDA based implementations of AES presented evidence that the
support for general purpose programming on the GPU could be leveraged for using the GPU as a
cryptographic co-processor. The first attempts at porting the AES encryption algorithm to the GPU
using CUDA underperformed when compared with optimised implementations on standard CPUs at
the time [50]. However, they successfully demonstrated that relocating the computational load incurred
by cryptographic primitives from the CPU to the GPU increased overall system throughput, by freeing
the CPU for other tasks.
Eventually, GPU-accelerated implementations of AES presenting better performance than those
using only the CPU were proposed. [51] reported an AES CUDA implementation delivering a speed-up
of about 20 times compared with a modest CPU of the time, on par with solutions using
dedicated hardware. These results were the first to portray the GPU as an efficient cryptographic
accelerator, and speed-ups as high as 50, with bandwidths of 60 Gbps on contemporary state-of-the-art
GPUs, have been reported more recently [52].
The GPU has also been used in the context of computer file systems for diverse applications.
[53] proposed GPUfs, a POSIX-like API for GPU programs exploiting the concurrent capabilities of
the GPU for efficiency and extending the host CPU buffer cache into the GPU memory in order to
optimize GPU file access. The benefits of this type of mechanism were demonstrated in the form of
GPU programs for applications such as searching for a set of strings in the Linux kernel tree, running
over seven times faster than an equivalent eight-core CPU implementation.
Harnessing the GPU computational power to improve the performance, reliability and security of
distributed storage systems by offloading and accelerating computationally intensive hashing prim-
itives was proposed by [54]. The approach brought tangible performance gains for preserving data
integrity and performing online similarity detection in content-addressable storage systems, without
negatively impacting the performance of concurrently running applications. [55] further supported this
use case by designing StoreGPU, a library that accelerates on the GPU a number of hashing primitives
popular in distributed storage system implementations.
Deeper integration of the GPU into the operating system has also been explored in order to provide
a further generalization of its execution model. [56] presented Gdev, a GPU resource management
ecosystem that treats the GPU as a first-class computing resource: it allows the GPU memory to be
used for swapping, virtualises the GPU into multiple logical GPUs to enhance isolation among the
working sets of multi-tasking systems, and lets GPU programs leverage typical operating system
mechanisms, such as scheduling and memory management [57].
The use of GPU for accelerating file encryption mechanisms has also been proposed before, both
for replacing the typical CPU implementation of encryption algorithms found in widely used encryption
suites, such as TrueCrypt [58], and for accelerating the cryptographic primitives used by secure file
systems. Some of these efforts are based on extending the operating system mechanisms, in order
to better support the integration of the GPU as a cryptographic accelerator [57].
The possibility of developing file systems in user space greatly reduced their implementation com-
plexity and made it possible to explore the integration of novel functionality into their operation by
leveraging established user space libraries providing all kinds of functionality. Data processing fea-
tures can be implemented as additional layers on top of pre-existing, stable and widely used standard
file systems by leveraging them for storage. This type of approach led to the development of several
types of user level file systems: LZOlayer_fs, FuseCompress [46], fuse.gunzip [44] and fuse-zip [45]
abstract data compression away from the user and save disk space by storing files in a
compressed format; archived files can be directly accessed using rarfs [43]; sshfs [47] and DAVfs [48]
allow access to remote files through the network; and many more uses have been proposed [42]. The
integration of GPU-accelerated data processing features also becomes possible with user space file
systems.
Scientific studies are predominantly based on user level constructs to achieve encryption of file
data, typically through the use and extension of user level file systems. [59] proposes an approach
very similar to the one presented in this dissertation: offloading the execution of the AES encryption
algorithm to the GPU, in the context of a user space file system enhanced with cryptographic
mechanisms that ensure the privacy of its users' data. However, it mentions several shortcomings,
which the design of the solution proposed in this dissertation aims at avoiding from the ground up.
The solution is based on EncFS, a pre-existent FUSE-based file system implementing encryption ex-
tensions using OpenSSL. The decision to use a pre-existent file system, however, proved to be one
of its main limitations. Due to the relative complexity of the file system's source code, the opportunity
for adapting its operation to better fit the GPU computing paradigm was very limited.
Our approach differs substantially in this respect. A complete Virtual File System was developed
from scratch, making it possible to adapt its operation and to maximise the potential benefits
of integrating the GPU as a co-processor, as is described in the following sections. This allowed for a
better use of the GPU computing capabilities, resulting in a higher I/O throughput. While [59] reports
a 30% reduction in AES execution time, our solution further reduces it by up to about 70%.
3 Proposed Solution
Contents
3.1 Automatic implicit encryption
3.2 Encryption mechanism
The implementation of cryptographic mechanisms relies on operations that can place a noticeable
burden on a system’s CPU and can thus impact a system’s overall performance. Even with multi-core
CPUs and the introduction of specialised dedicated instructions at the hardware level, the impact of
cryptographic primitives on a system’s CPU load can still be quite noticeable. This is especially the
case when a system’s CPU load is already very high and there are little computational resources left
available, or even none at all. In these situations, security mechanisms, such as data confidentiality,
tend to be overlooked and considered expendable in favour of a system’s functional requirements, due
to their prohibitive computational impact. This can ultimately lead to cryptographic security mecha-
nisms being disabled in a system with limited computational resources, in order to improve its overall
performance.
This dissertation presents a pragmatic solution that minimises the impact on a system's CPU of the
cryptographic mechanisms required to protect the file system, and speeds up their execution
using a GPU. The solution avoids the degraded system performance caused by the penalty of
additional load on a machine's CPU by offloading the operations required by the encryption algorithm
from the heavily contended CPU to the often idle and computationally powerful GPU.
This relocation reduces the impact of the cryptographic primitives on the CPU to close to zero. At
the same time, it makes the most of the usually underused GPU capabilities. Additionally, it achieves
this as transparently as possible to the end user so as to maximise its adoption and minimise any
obstacles to its use.
The task of building a security system to provide confidentiality of user data can be broken down
into two major independent subtasks: determining the point at which to transform data into and from
an encrypted format, i.e. when to cipher and decipher the data; and the actual process of ciphering
and deciphering it. The proposed system is built on top of a fully functional physical file system by
incorporating the automatic and transparent encryption and decryption of the accessed files using the
GPU as a cryptographic engine. Rather than relying on low-level Operating System (OS) calls, this
solution uses FUSE, a user level framework allowing the creation of a Virtual File System (VFS) in
user space, enriched with CUDA support, allowing the integration of specific GPU cryptographic kernels,
also at the user level. FUSE was chosen for this task given its flexibility and availability in mainstream
Linux kernels since version 2.6.14.
This approach makes it possible to focus on the system's fundamental goal of adding cryptographic security
as an additional layer of data processing and leverage the functionality already provided by stable and
widely adopted general-purpose file systems. Moreover, the ability for this solution to be deployed
by non-privileged users is another advantage, along with the resulting transparent usage, making it
easier to adopt.
3.1 Automatic implicit encryption
Two approaches can be taken when addressing the first subtask of determining the point at which
to introduce the encryption mechanism. One is to make the process explicit to the user; the other
is to embed it into the normal operation of the system, making it implicit.
The explicit approach typically relies on the existence of a tool, either built into the operating
system or in the form of an external standalone application, used by the user for ciphering and de-
ciphering data. This has the advantage of giving the user total control over the encryption process,
allowing it to only be triggered when there is a real necessity and thus saving potentially valuable sys-
tem resources. It however has the disadvantage of requiring explicit action from the user, which can
make it cumbersome over time. Moreover, since it is neither automatic nor enforced by the system,
it can be overlooked or regarded as an unnecessary additional step by less security-conscious
users, sometimes rendering it ineffective.
An implicit approach usually leverages mechanisms built into the operating system and requires
no additional action from the user. In this case the user does not even need to be aware of the
mechanism's existence, since it is automatically enforced by the system. In contrast to the explicit
approach, this has the downside of giving the user less control over the mechanism's operation.
Systems using this approach typically only give the option of completely enabling or disabling the
mechanism, not giving the user a finer control over its operation. This may potentially result in the
consumption of more computational resources than would be actually required. However, since it is
automatically and constantly enforced, this approach ensures that the security mechanism is always
in place and data privacy is always guaranteed.
The solution presented in this section proposes the use of an implicit approach. This is the one
that best suits the system’s usability goal, presented in Section 1.2. For this purpose, a FUSE-
based Virtual File System with pre- and post- data processing extensions has been implemented.
These extensions present the perfect opportunity for adding an automatic extra layer of security for
implementing the required encryption and decryption mechanisms.
3.2 Encryption mechanism
The second task, concerning the actual encryption process, consists of picking, among the several
encryption algorithms available in the computer security community, the one that best suits the sys-
tem’s requirements. Given the dissertation’s goals of minimising the mechanism’s load on the CPU by
offloading the execution of the encryption algorithm to the GPU, the chosen algorithm should be
easily parallelizable, so as to make the most of the Graphics Processing Unit's capabilities. Also, given the
relatively high complexity of this type of mechanism, an already available open source implementation
of the algorithm greatly favours its adoption.
The solution presented in this dissertation uses the AES algorithm, since it is a widely accepted
de facto standard, for which several implementations are available. In regard to the encryption itself,
128-bit keys and the ECB (Electronic Codebook) cipher mode were selected. Besides the data
transformation functions, the encryption and decryption process requires an additional component for
setting up and initializing all the AES parameters, particularly the secret key. It is important to note that
only by providing the same secret key between invocations will it be possible to properly access previously
created files.
Figure 3.1: Virtual File System: read and write operations
The proposed system, codenamed gipherfs, consists of three processing steps, as illustrated
in Figure 3.1 for the Virtual File System’s read and write operations. In the case of a write, the
I/O operation, mapped to the virtual file system, is intercepted by FUSE in the first step. After this
interception step, the implementation transfers all the necessary data received from the call to the
GPU and calls the GPU encryption kernel. To conclude, the GPU encrypted data is transferred back
to the host (system main memory) and effectively written into the underlying physical file system by
performing the write system call to a physically mapped address. The corresponding physical address
is defined by the virtual to physical file system mapping. In the case of a read operation, a similar
three-step procedure is followed. The main difference lies in the second step, where the encrypted
data is retrieved from the physical file system and the GPU kernel is called for decrypting it before
passing it to the invoking application.
Although the interception by FUSE may impose some additional delays, the major limiting factor
for the overall system performance lies in the data transformation step, consisting of encrypting or
decrypting the data. These cryptographic operations can be performed either by the CPU or by a
cryptographic co-processor, in this case the GPU. The approach considered to speed up the
cryptographic operations relies on the GPU as a cryptographic co-processor, to which the encryption and
decryption procedures are offloaded after the data is intercepted by FUSE. As in any typical CUDA-
based GPU program, the implementation of AES on the GPU requires the following three steps:
1. Transfer data from the host (CPU) to the GPU
2. Process the data on the GPU
3. Transfer resulting data back from the GPU to the host
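These three steps map onto the CUDA host API as a cudaMemcpy to the device, a kernel launch, and a cudaMemcpy back to the host. The sketch below mocks this control flow in plain C so it stays self-contained: heap buffers and memcpy stand in for cudaMalloc/cudaMemcpy, and a byte-wise XOR is a deliberate placeholder for the AES kernel. It shows only the host-side structure, not real encryption or real CUDA calls.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder for the AES kernel: a byte-wise XOR, NOT real encryption.
 * In the real system this step is a CUDA kernel launch operating on
 * device memory. */
static void fake_gpu_kernel(unsigned char *dev_buf, size_t n) {
    for (size_t i = 0; i < n; i++) dev_buf[i] ^= 0xAA;
}

/* Host-side orchestration of one GPU encryption call.
 * malloc/memcpy stand in for cudaMalloc/cudaMemcpy here. */
static void gpu_encrypt_mock(const unsigned char *in, unsigned char *out, size_t n) {
    unsigned char *dev = malloc(n);   /* cudaMalloc(&dev, n)                    */
    memcpy(dev, in, n);               /* step 1: host -> device transfer        */
    fake_gpu_kernel(dev, n);          /* step 2: process the data on the GPU    */
    memcpy(out, dev, n);              /* step 3: device -> host transfer        */
    free(dev);                        /* cudaFree(dev)                          */
}
```

Because the placeholder transform is an XOR, applying it twice restores the original data, which makes the round trip easy to check.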
The data transfer steps are an obvious source of execution overhead, but are nevertheless required in
order to take advantage of the GPU's computational power. The trade-off between this additional cost
and the performance benefits obtained from implementing AES on the GPU is one of the main aspects
of a successful GPU implementation. These tradeoffs are evaluated in Chapter 5 in order to deter-
mine the optimal amount of data processed in each GPU kernel invocation and the possible memory
management optimisations. The GPU AES code used in the proposed system, which resembles the
implementation described in [52], is open-source and was obtained from [61].
To properly evaluate the performance gains that can be achieved with the proposed GPU-based
approach, a CPU-based version of the system was also implemented, in which the CPU performs
the cryptographic processing. This version, designated as cipherfs, is identical to the gipherfs
GPU-based version of the system but implemented using only the CPU. In order to optimize the use
of the CPU processing resources,
the CPU-based implementation of the AES cryptographic primitives was extracted from the wolfSSL
[62] cryptographic library implemented in the C programming language. This implementation takes
advantage of AES-NI (AES New Instructions), the x86 instruction set architecture extension for AES
acceleration [7], where the data blocks are sequentially processed using dedicated AES hardware
instructions.
3.2.1 Block based implementation of read and write operations
The block nature of AES requires the implementation of the read and write operations to be
adapted in order to enable its proper application. A block cipher must always process data in fixed
size block units, more specifically of 16 bytes (128 bits) each in the case of AES. Because of this,
instead of a more straightforward approach, the Virtual File System’s read and write operations also
read and write data from files in block units whose length is always a multiple of 16 bytes. Figures 3.2
and 3.3 show a comparison between a straightforward and a block based approach, respectively, for
reading data from a file. Both represent a request to read 4 bytes from a 16 byte file, starting at byte
3.
Figure 3.2: Straightforward read operation
In a straightforward approach, the file system reads data from the file exactly as requested. In
this example, it would read bytes 3 through 6 and return all of them to the requesting application.
This is depicted in Figure 3.2 in which bytes 3, 4, 5 and 6 are read from the file, and returned to the
application. Using this approach, only the data requested by the application is actually read by the file
system from the file.
A block based approach only reads data from the file in sizes multiple of a predefined block size.
In the example of Figure 3.3 the file is divided in four 4 byte blocks. When the same request to read
4 bytes, starting at byte 3, is made to a system following this approach, it must determine which
blocks contain the required data and read them entirely. In this case blocks 0 and 1 must be read,
which means all bytes from 0 through 7 are read by the file system from the file. However, since the
application only requested bytes 3 through 6, only those will actually be returned to the application.
Figure 3.3: Block based read operation
This may appear to be a waste of additional logic, processing and I/O load for achieving the same
end results, but it is a requirement for the proper application of the AES encryption algorithm on
the file system data. Using again the example of Figure 3.2, if the file data were ciphered, it would
not be possible to decipher the requested data by reading only bytes 3 through 6. In this setting,
only by entirely reading blocks 0 and 1, which contain the requested data, as depicted in Figure 3.3,
is it possible to decipher them, obtain the requested bytes, and return them to the
application.
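The block alignment logic above reduces to simple offset arithmetic. The following helper is a sketch with a hypothetical name, not the thesis code; it computes the block-aligned range that must actually be read so that whole blocks can be deciphered.

```c
#include <assert.h>
#include <stddef.h>

/* Given a requested byte range [offset, offset+size) and a block size,
 * compute the block-aligned range that must actually be read so that
 * whole blocks can be deciphered. Assumes size > 0. Illustrative only. */
static void aligned_range(size_t offset, size_t size, size_t block,
                          size_t *aligned_off, size_t *aligned_len) {
    size_t first_block = offset / block;                 /* block holding the first byte */
    size_t last_block  = (offset + size - 1) / block;    /* block holding the last byte  */
    *aligned_off = first_block * block;
    *aligned_len = (last_block - first_block + 1) * block;
}
```

For the figures' example (4 bytes requested at offset 3, 4-byte blocks), this yields an aligned read of blocks 0 and 1, i.e. offset 0 and length 8; the caller then returns only bytes 3 through 6 out of that buffer.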
3.2.2 Padding: PKCS#7
Another implication of the block nature of the AES encryption algorithm is that it can only handle full 16
byte data blocks and must first extend its input when it is composed of partial blocks. Padding is the
most commonly used technique to enforce this requirement and consists of extending a piece of data
by adding the necessary data units to it, until it has the required length. Assuming l and k are the
input data length and block size, in bytes, k − (l mod k) is the number of trailing bytes which must be
added to the input in order to make it a multiple of the block size.
The padding method implemented in the system presented in this dissertation is the one proposed
in PKCS#7. In PKCS#7 the value of each byte added to the original input data matches the number
of bytes added. Using again the l and k values defined above, where k = 16 in AES, PKCS#7 adds n
trailing bytes to the input data, where the value of each byte is n and n = 16 − (l mod 16). However,
for the mechanism to be reversible, it must always be possible to determine how many bytes were
padded. In order to achieve this, when the input data length is already a multiple of the block size, a
whole new full block is added with the value of all bytes equal to the block size. Without this exception,
it would not be possible to reverse the process. Consider a padded data chunk whose last byte
is 0x01: one would not be able to determine whether the original data length was already a multiple
of the block size and 0x01 was its original last byte, or whether the original chunk was padded with a
single 0x01-valued byte. With the aforementioned special case, the last byte of the padded data
always matches the number of bytes added, thus making it trivial to remove the pad bytes and reverse
the process.
Figures 3.4 and 3.5 depict the operation of the mechanism implemented in the system, for the
AES 16 byte block size. Figure 3.4 shows that a 13-byte long data chunk is padded by appending
16 − (13 mod 16) = 3 bytes with the value 0x03 to the original input. The reverse operation simply
consists of reading the padded block’s last byte, and removing as many trailing bytes from it as this
value, in this case 3. Figure 3.5 presents the case in which the input length is already 16 bytes. When
this happens, a complete new 16-byte block is appended after the original one, with every byte equal
to 16 (0x10). Again, removing as many bytes as the value of the padded block's last byte yields the original
block.
Figure 3.4: Padding of a partial block
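The padding scheme described above can be sketched as follows (with k = 16, as in AES). This is a minimal illustration of PKCS#7, not the system's actual code, and it deliberately omits validation of the pad bytes on removal.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 16

/* PKCS#7 pad: append n bytes of value n, where n = BLOCK - (len % BLOCK).
 * A full extra block (n = 16) is appended when len is already a multiple
 * of BLOCK, so that unpadding is always unambiguous.
 * out must have room for len + BLOCK bytes; returns the padded length. */
static size_t pkcs7_pad(const unsigned char *in, size_t len, unsigned char *out) {
    size_t n = BLOCK - (len % BLOCK);   /* 1..16 bytes of padding */
    memcpy(out, in, len);
    memset(out + len, (int)n, n);
    return len + n;
}

/* PKCS#7 unpad: the last byte gives the number of bytes to strip. */
static size_t pkcs7_unpad(const unsigned char *in, size_t len) {
    return len - in[len - 1];
}
```

The two cases shown in the figures follow directly: a 13-byte input gains three 0x03 bytes, and a 16-byte input gains a whole extra block of sixteen 0x10 bytes.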
Besides the already mentioned adaptation required by the read and write operations, padding of
file data also requires modification of all the operations reporting file size. This information is stored
as metadata about each file on the underlying physical file system on top of which the Virtual File
System described in this dissertation is built, and will therefore include padding, which is an artificial
artefact required by AES. In order to report the actual file size, the padding bytes must be subtracted
from the size reported by the underlying physical file system. This means it is no longer
possible to retrieve the size of a file without first reading and deciphering its last block. Using again
the padded file in the example of Figure 3.5, the size reported by the underlying physical file system
would be 32 bytes. By reading the file’s last byte, one determines that the last 16 bytes were added
as padding and that the actual file size is 32 − 16 = 16 bytes. This logic is implemented in the getattr
FUSE I/O operation. The operation implementation starts by using the underlying stat file system
operation to obtain the physical file’s actual size, in this case 32 bytes. It then reads and deciphers
the file’s last block in order to obtain the 0x16 pad value and subtracts it from the obtained file size in
order to report the file’s length correctly.
Figure 3.5: Padding of a complete block
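The size correction performed in getattr thus reduces to subtracting the pad value found in the deciphered last byte from the size reported by the underlying file system. A hypothetical helper mirroring that logic (the real implementation must first read and decipher the last block to obtain the pad value):

```c
#include <assert.h>
#include <stddef.h>

/* Logical (unpadded) file size, given the size reported by the physical
 * file system via stat and the value of the deciphered file's last byte.
 * Illustrative helper mirroring the getattr logic described above. */
static size_t logical_size(size_t physical_size, unsigned char last_byte) {
    return physical_size - last_byte;
}
```

For the example of Figure 3.5, a 32-byte physical file whose deciphered last byte is 16 has a logical size of 16 bytes.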
4 Solution Implementation
Contents
4.1 plainfs: a non-ciphered file system
4.2 cipherfs: a non-accelerated ciphered file system
4.3 gipherfs: a GPU-accelerated ciphered file system
The system developed in the context of this dissertation ensures confidentiality of user file data
through the use of cryptographic mechanisms. More specifically, a GPU version of the AES encryption
algorithm is used to cipher user files, in order to speed up execution and avoid additional load on
a machine’s CPU. This is achieved through a Virtual File System, implementing automatic ciphering
and deciphering of its files on the GPU. This section starts by presenting the high-level design of a
FUSE-based user level file system with enhanced cryptographic features. The implementation details
are presented in the following subsections, which describe the set of iterative versions that result
in the final system.
Figure 4.1 provides a high-level view of the file system’s basic operation, hereafter referred to as
gipherfs. The name gipherfs originates from the junction of the keywords cipher and fs, standing
for ciphered data and file system, respectively. The replacement of the letter c with a g denotes
the use of the GPU as a cryptographic accelerator. As evidenced in Figure 4.1, the cryptographic
primitives are introduced in the write and read operations. Every time a file is written, data is first
automatically ciphered by the virtual file system, before being effectively written to the underlying
physical device. Conversely, data is deciphered upon being read from the physical device and before
being returned to the invoking application on a file read operation.
Figure 4.1: Virtual File System: read and write operations
A level of indirection can be identified in Figure 4.1. Even though I/O operations are targeted at the
/mnt/private_fs directory, they are actually performed in /mnt/private_fs_back. In fact, /mnt/private_fs
is only a logical endpoint used to intercept calls to the FUSE-based file system. The user-level im-
plementation of the Virtual File System I/O operations then relies on the existence of an underlying
physical file system, such as ext4, and on the use of its mechanisms. This is achieved by redirecting
the target of the I/O operations from the Virtual File System mount point to another directory, in this
case /mnt/private_fs_back, which is mounted on a physical file system and uses its I/O API.
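This redirection is typically implemented by prefixing each path the FUSE callbacks receive (which is relative to the virtual mount point, e.g. "/file.txt") with the backing directory before issuing the real system call. A sketch with hypothetical names, where the backing directory matches the example above; this is an illustration, not the thesis code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical backing directory, as in the example above. */
#define BACKING_DIR "/mnt/private_fs_back"

/* Translate a path relative to the virtual mount point into the
 * corresponding path on the underlying physical file system, by
 * prepending the backing directory. Illustrative only. */
static void to_physical_path(const char *virtual_path, char *out, size_t out_len) {
    snprintf(out, out_len, "%s%s", BACKING_DIR, virtual_path);
}
```

Each I/O callback would then call the underlying file system's API (open, read, write, stat, ...) on the translated path rather than on the virtual one.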
By using this approach, one can focus on the system's fundamental goal, which is to add encryption
as an extra layer of data processing and leverage the functionality already provided by stable and
widely adopted general purpose file systems. Hence, the task of building such a file system can be
broken down into three independent subtasks:
Subtask 1: intercepting I/O operations targeting the Virtual File System’s mount point;
Subtask 2: applying data transformation (ciphering/deciphering) to the I/O operation payload,
whenever it is required;
Subtask 3: performing I/O operations on the underlying file system.
Besides making it easier to reason about the system’s design, by making it more modular, this
decomposition makes it possible to build the system incrementally. This approach was taken
both during the development of the system and in describing it in the remainder of this section.
The system was implemented as a set of incremental iterations, each version building on top of the
previous one by adding or replacing an implementation of one of the aforementioned subtasks.
Subtasks 1 and 3 are a requirement of any functional virtual file system and their implementation is
rather static, regardless of how the system transforms its data. As described in Section 2.3, subtask 1
is provided out of the box by FUSE, by means of its architecture and how it enables the implementation
of user level file systems. However, subtask 3 needs to be implemented in the Virtual File System.
The implementation details of both of these subtasks are the topic of Section 4.1 which describes
the system’s first iteration, codenamed plainfs. plainfs consists of a simple Virtual File System, with
no transformation applied to its data, which simply intercepts calls to a virtual mount point and maps
them to another directory mounted on a fully functional general purpose file system. The codename
plainfs originates in the fact that file data is stored with no transformation applied to it, i.e. in plain text,
as it is usually referred to in the cryptographic literature.
Even though it is not possible to build a system of this kind without both subtasks 1 and 3, subtask
2 is the one that mostly defines the system, by allowing the developer to specify the transformation to
be applied to the file system data. This transformation can be of any type, or even none at all, as in
the case of the plainfs iteration of the system.
Section 4.2 describes cipherfs, the system’s second iteration, in which an implementation of sub-
task 2 is introduced for the first time. In this version, the transformation consists of the encryption and
decryption functions of the AES encryption algorithm implemented in the CPU. As such, the second
iteration of the system already provides confidentiality of user data, even though it still implements it
on the CPU, as is usually the case in most security systems.
Lastly, the final iteration of the system, codenamed gipherfs, is presented in Section 4.3. Here
the CPU AES functions are replaced by their counterparts implemented on the GPU. By using the
GPU, the system is expected to provide the same security features as the previous iteration, with
improved throughput and without the additional penalty on CPU load.
4.1 plainfs: a non-ciphered file system
The system’s first iteration, designated as plainfs was built with two main purposes in mind. First
and foremost, is to gain a deeper knowledge about the operation of FUSE and understand how it can
be used to implement user-level file systems. By only requiring the implementation of the already
described subtasks 1 and 3, plainfs makes it possible to focus on FUSE’s mechanisms and not on
any specific kind of data transformation. Moreover, the modular approach followed in the system’s
design also makes it possible to use plainfs as a bootstrap, on top of which the remaining versions of
the system can be implemented by adding specific implementations of subtask 2, consisting of a data
transformation step.
Second, in spite of its lack of additional functionality compared to the underlying general purpose
file system it is built on top of, plainfs is actually very useful in the context of this dissertation in other
aspects. It makes it possible to determine the overhead of using the FUSE interface by measuring
and comparing the time it takes to complete the same I/O operations on plainfs and directly on the
underlying physical file system. Also, plainfs can be used for setting a baseline against which to
compare the overhead of encrypting data using both the CPU and GPU based implementations of
the AES encryption algorithm, in the next iterations of the system.
As mentioned earlier, in order to provide the functionality of plainfs one needs to be able to inter-
cept I/O operations targeting the Virtual File System’s mount point and to redirect and perform them
on another directory mounted on a fully operational file system.
4.1.1 Intercepting I/O operations
The interception of I/O operations targeting the Virtual File System is provided by the fuse kernel
module, which includes the fusefs file system. Whenever a FUSE based file system is mounted on
a system directory, e.g. /mnt/test_fuse, the fuse kernel module registers an association between the
mount point and the fusefs file system at the Linux VFS (Virtual File System) layer. This is done by
using the fusermount utility, which communicates with the fuse kernel module through the /dev/fuse
character device.
When this association is established, all I/O operations targeting the /mnt/test_fuse mount point
are handled by the fusefs file system. The main function of fusefs is to call the corresponding user-level
operation implementation, by communicating with the user-level libfuse library. This again is done
using the /dev/fuse character device. This mechanism is depicted in Figure 2.16 and achieves the
subtask 1 goal of intercepting the I/O operations targeting the virtual file system’s mount point.
4.1.2 Performing I/O operations on underlying file system
Subtask 3 is implemented by using FUSE’s mechanisms for building file systems in user space. A
FUSE file system is implemented by a user level standalone application, which is linked against the
libfuse library. The application must provide the implementation of a set of file system I/O operations,
which are to be invoked by libfuse in response to requests from the fuse kernel module.
FUSE's architecture resembles the client-server model. In this context, the fuse kernel module
is the client, and the user-defined application, together with libfuse, acts as the server. Whenever an I/O
operation targets the FUSE file system, it is redirected by the fuse kernel module to libfuse, which in
turn uses the callbacks defined in the user defined application to perform the operation and returns the
results back to the kernel module. The communication between kernel and user space is completely
handled by libfuse, allowing the file system developer to only focus on implementing the I/O callbacks
in the standalone application.
libfuse provides a set of function prototypes of the I/O operations that must be implemented as
callbacks by the standalone application linked against it. These are a variation of the typical I/O
system calls found on general purpose file system APIs, such as open, close, read and write. The
implementation of these callbacks is what defines the file system operation. Two different APIs are
provided by libfuse: the high- and low-level APIs. plainfs, as well as the subsequent versions of the
system, make use of the high-level API only, as it provides all the mechanisms required to implement
the systems’ functionality.
Figure 4.2 shows how a user level application is linked with libfuse, in order to implement a FUSE
file system. To simplify, the presented example only references the open, read, and write operations.
The figure’s blue box represents libfuse and the set of function prototypes it defines, in this case
composed of only open, read and write. These are implemented as a set of callbacks in the plainfs.c
C program, depicted by the green box. The bridge between libfuse and the standalone application is
the fuse_operations data structure, which pairs each function's specification with its implementation.
For clarity, the fuse_operations structure is depicted as a third external entity in Figure 4.2, but it is
typically also part of the plainfs.c program.
Figure 4.2: User level file system I/O operations implementation
The goal of subtask 3 is to redirect the I/O operations to an underlying file system; it is implemented
by applying the same principle to all I/O operations. In addition to
the file system mount point, plainfs is parameterised with the path to a backing directory, hereafter
denominated back_dir. This directory must be located under the logical mount point of the under-
lying physical file system which is to be leveraged and is where I/O operations will be redirected to.
The redirection thus consists of performing all I/O operations targeting the plainfs mount point on the
equivalent location having back_dir as root instead.
Figure 4.3: I/O operations redirection
Figure 4.3 shows how this mechanism works, by using the open operation as an example. Here,
plainfs is mounted in /mnt/plainfs and uses /mnt/plainfs_back as its back_dir. When a request to open
file /mnt/plainfs/a hits plainfs, the user-defined open implementation converts it into a call to the same
open operation but to a path where the plainfs mount point is replaced by back_dir. This results in a
call to open(/mnt/plainfs_back/a), using the underlying file system API. The same principle is used for
implementing the remaining I/O operations.
While most functions simply require calling their equivalent version on back_dir and filling the FUSE
data structures used to pass results back to the kernel, others are more complex. FUSE operations
are an abstraction of the actual I/O operations found on physical file systems, and their implemen-
tation on top of a physical file system must reflect that file system's specificities. Table 4.1 lists all
FUSE operations implemented in plainfs and the corresponding ext4 counterparts supporting their
implementation, ext4 being the physical file system chosen for the system.
FUSE operation   ext4 operation   Notes
getattr          lstat            must decipher last file block in order to compensate for padding
getxattr         getxattr
readdir          opendir
rmdir            rmdir
open             open
flush            dup and close    call to dup followed by close
fsync            fdatasync
read             read             must decipher data
write            write            must cipher data
release          close
chmod            chmod
utimes           utimes
chown            lchown
truncate         truncate
mkdir            mkdir
create           creat
unlink           unlink
rename           rename
access           access
Table 4.1: List of FUSE operations and ext4 counterparts
4.2 cipherfs: a non-accelerated ciphered file system
cipherfs introduces file data confidentiality into the system. It does so by leveraging the already
functional plainfs iteration of the virtual file system, described in the previous section, and introducing
an implementation for subtask 2. In short, subtask 2 consists of transforming the file system data
actually stored on the physical medium. In this case, data must be transformed into a form illegible
to an attacker, i.e. ciphered, before being stored, and transformed back into its original form, i.e.
deciphered, upon being retrieved from storage.
Among the three subtasks identified as composing the implementation of this type of system,
subtask 2 can be described as the most fundamental. While the other two are essential to implement
it correctly, subtask 2 is the one that defines what the system does, by determining how it transforms
the data. Without subtask 2, the Virtual File System is practically indistinguishable from its underlying
physical file system, as is the case of plainfs. In short, the transformation applied to the file system
data is what best defines it.
For the sake of consistency, plainfs can be described as implementing subtask 2, thus actually
including a data transformation step, which in this case does nothing. As such, cipherfs can be seen
as a modification of plainfs, which in subtask 2 performs the AES encryption or decryption, thus
maintaining most of the previous structure and implementation.
One of the main advantages of the modular approach followed in the system’s design is its de-
composition into well defined and self-contained functional blocks. This organisation makes it easy
to locate and modify each functional block, independently of each other, which also reflects on the
system’s source code. As a result, the replacement of the transformation applied to the system’s data
affects only a small portion of the system’s code, mainly the read and write file system operations.
Moreover, the fundamental structural modifications required to implement the AES-based data trans-
formation in cipherfs are the same as those required for gipherfs. The only difference between the two
resides in the AES implementation, which is CPU-based in cipherfs and GPU-based in gipherfs. This
means that most of the required structural modifications are implemented in this iteration, making it
possible to focus fully on the specificities of moving the execution of AES to the GPU in gipherfs.
The introduction of the AES-based data transformation into the system actually requires data to
be transformed twice. First, as already mentioned, AES can only handle data in block units, thus
requiring its input size to be a multiple of the block size. The padding technique described in
Section 3.2.2 can be used to enforce this requirement, applying a first transformation step
to the data processed by the system by extending it to the appropriate length. Second, the actual AES
encryption and decryption functions must be applied to the data. Both steps must be applied
together, in both directions of the transformation, in order for AES to be properly implemented.
Figure 4.4: Encryption transformations: adding padding and ciphering
The encryption direction thus consists of padding the original operation payload and applying the
AES encryption function to the resulting data. These steps, depicted in Figure 4.4, must be applied
whenever data is to be written to the underlying physical file system, and are implemented on the VFS
write operation. Conversely, Figure 4.5 shows the decryption direction, which consists of applying the
AES decryption operation to the ciphered data and removing the previously added padding. These
steps must be implemented in the VFS read operation, whenever data is retrieved from the physical storage
medium.
Figure 4.5: Decryption transformations: deciphering and removing padding
Besides the actual data transformation functions, the introduction of AES requires an additional
component. The function of this component is to set up and initialise all the required AES parameters,
namely the symmetric secret key used for ciphering and deciphering data. This only needs to be done
once, during the system's bootstrap. It is important to note that only by providing the same secret key
between system invocations will it be possible to gain access to previously created files.
The focus of the proposed system is to assess the feasibility and measure the benefits of offloading
the execution of the AES encryption algorithm from the CPU to the GPU, and not on key
management. As such, the AES secret key has been hardcoded into the system code in order to
simplify this management, and the same key is used in all executions of the system. A more secure
management of the AES secret key can be added to the system regardless of where the encryption is
computed. The functional decomposition of the system reduces the effort of implementing a complex
secret key management approach, given its isolation from the remaining functional blocks.
4.2.1 Block based I/O
Section 3.2.1 presented the block-based approach used for implementing the Virtual File System
read and write operations. The description illustrated the importance of handling I/O operations in
blocks when introducing the AES data transformation functions into the system. This is because
AES must always handle data in blocks of a minimum of 16 bytes. By already handling data in
aligned block units of at least 16 bytes each, it becomes trivial to handle AES blocks.
Had the read and write operations not been implemented in this manner, additional logic would be
required for correctly handling them. When reading, it would be possible to fetch partial AES blocks,
which would not be decipherable without further access to the underlying physical file system in order
to obtain the missing AES block bytes. The same would be true for writes, which would require
fetching additional data from the underlying physical file system in order to properly cipher the full
AES block. By taking the block nature of the encryption algorithm into account and always handling
data in a multiple of the AES block size, it becomes impossible to either retrieve or try to write partial
AES blocks, thus greatly simplifying the implementation of the read and write operations. When reading or
writing a file, the operation payload will always be made of aligned 16-byte blocks. It then becomes
trivial to split the payload into individual AES blocks, which are then fed sequentially, one at a time, to
the AES decryption and encryption functions. Figure 4.6 summarizes the implemented mechanism,
using a 4 KB data payload as an example, which is split into 256 individual AES blocks processed
sequentially, one at a time, by the CPU.
A possible modification to cipherfs would be to process several AES blocks at a time, taking ad-
vantage of current commodity multicore CPUs. This was not implemented mainly because it would
make the implementation of cipherfs more complex, which is not the focus of this dissertation. More-
over, one of the goals of the work proposed in this dissertation is to improve the encryption process's
performance under an already heavily loaded CPU. In this setting, the use of additional cores would
not necessarily increase the overall system performance, since they would all probably be under a
high load. For these reasons, a multicore version of this iteration of the system was not implemented.
Figure 4.6: Handling a 4096-byte payload, made of 256 AES blocks
4.2.2 AES functions
cipherfs's AES data transformation functions, i.e. its encryption and decryption functions, are based
on an optimised implementation of AES which takes advantage of AES-NI. From a purely functional
perspective, and to a certain degree, the performance factor would be irrelevant for the selection of
the CPU-based AES implementation. However, AES-NI has been available in commodity processors
for some years and is currently found in many CPUs. Therefore, an AES-NI-
enabled implementation of AES was chosen, in order to make a fair comparison between the CPU
and GPU versions of the system, without increasing its complexity.
It is important to mention once more that the focus of the solution presented in this document is
on building a system providing confidentiality of user file data with minimal impact on the CPU and
not on improving the performance of AES by proposing any modifications to it. Therefore, the imple-
mentations of the AES encryption algorithm used in both cipherfs and gipherfs are open source and
were made freely available by their developers to the community. More specifically, the AES implemen-
tation used in cipherfs was extracted from the wolfSSL cryptographic library for the C programming
language [62], which has an open source license for non-commercial purposes.
The wolfSSL implementation was chosen mainly due to its ease of use, especially when com-
pared to other cryptographic libraries. It provides comprehensive documentation and has seen
wide adoption in the cryptographic community. This library includes an example of how to use the
AES implementation, which was used as a prototype for the cipherfs data transformation functions, as
well as for the auxiliary initial setup operation used to generate the required AES parameters, including
the symmetric secret key.
One of the most important aspects about the operation of block ciphers, such as AES, is how
blocks are combined during the algorithm’s execution. This is called the algorithm’s mode of operation
and many different modes have been proposed over the years, each providing different types of
properties such as confidentiality, authentication and integrity.
Besides the obvious concern about security, the choice among the recommended modes for AES
in the context of this system was mainly driven by the mode’s suitability for parallelization. Due to the
ultimate goal of removing the computation of AES from the CPU, the most fundamental property of
any successful GPU implementation is the use of a mode of operation which does not rely on serial
execution. As a result, this also impacted the selection of the CPU-based implementation mode of
operation. Again, in order for the two versions to be as similar as possible and their comparison more
meaningful, the same mode was chosen for both.
Given the main goals of proving the system's feasibility and maximizing the opportunity to parallelize
the execution of the AES functions, the Electronic Codebook (ECB) mode was selected for both
cipherfs and gipherfs. The parallelization of ECB is straightforward, as the processing of each block is
completely independent of all the others. Moreover, it enables random access to ciphered data, due
to the lack of dependency between blocks.
The Counter (CTR) mode is also mentioned here for reference as an alternative to ECB. CTR
offers the same opportunities for parallelisation and random access as ECB, while providing a better
set of security guarantees, by combining each block with the encryption of a unique counter value
before applying the AES transformation.
4.3 gipherfs: a GPU-accelerated ciphered file system
gipherfs is the final iteration of the system described in this document and focuses on the task of
offloading the computation required by the AES cryptographic functions from the CPU to the GPU.
It builds on the structure of cipherfs and replaces its CPU-based data encryption extensions
with an equivalent implementation on the GPU, in order to speed it up and minimise its impact
on the CPU and on the system's overall performance. The integration of the GPU-accelerated implemen-
tation of the AES encryption and decryption mechanisms and its benefits are the core of the solution
proposed in this dissertation and also pose the greatest challenges. The modular approach followed
in the system design makes it possible to leverage an already fully functional Virtual File System
and to concentrate only on the specific challenges of using the GPU to accelerate the computation
intensive AES primitives.
Going back to the decomposition of the system into three independent subtasks, gipherfs is ob-
tained from cipherfs by replacing the implementation of subtask 2. In this case, the AES data trans-
formation functions are replaced by their GPU equivalent, performing exactly the same transformation
and using the GPU.
The description of the gipherfs implementation starts by describing the mapping of the encryption
and decryption functions of AES into the CUDA programming model. It then focuses on the main
aspects and tradeoffs to efficiently implement the required cryptographic mechanisms on the GPU
using CUDA and integrating it within the FUSE file system.
4.3.1 GPU implementation of AES
As described above, the CUDA implementation of AES used in gipherfs is open source and was
obtained from an online repository [61]. Like any GPU program, the implementation of AES on the
GPU requires the three following steps:
1. Transfer data from the host to the GPU
2. Process the data on the GPU
3. Transfer resulting data back from the GPU to the host
From a performance perspective, steps one and three consist of transferring data from the host
to the GPU, and vice versa, through the system I/O bus. It is thus a source of overhead, required in
order to take advantage of the GPU computational power. The trade-off between the additional mem-
ory transfer overheads and the performance benefits obtained from running the AES cryptographic
functions on the GPU is the first aspect to consider in an efficient GPU implementation and greatly
determines the merit of this type of approach. The impact of the memory transfer overheads is eval-
uated in Section 5.2, in order to determine the minimum amount of data processed in each GPU
invocation so that the integration of the GPU into the system becomes beneficial.
The other main aspect of an efficient implementation of AES on the GPU is the ability to prop-
erly map its operations into the CUDA programming model. Only by decomposing the processing
steps of AES into a set of both coarse- and fine-grained processing tasks, as described in Section
2.2, is it possible to make the most of the GPU's computing capabilities. Coarse-grained tasks must
be independent from each other, in order to take advantage of the high number of Streaming Multi-
processors and CUDA cores available in current GPUs. All data dependencies between tasks must
be handled in finer-grained tasks and kept to an absolute minimum in order to fully exploit the GPU
parallel processing capabilities.
gipherfs uses the ECB mode of operation in order to minimise the amount of data dependencies
in the AES encryption algorithm. In this way, there are no data dependencies between individual AES
blocks. Each block is processed independently of all the others, which maps perfectly to the CUDA
programming model and is highly beneficial for an efficient implementation on the GPU. As such, the
CUDA implementation of AES starts by decomposing the application of the encryption and decryption
functions to a data payload of n bytes into n/16 coarse tasks, consisting of processing each AES block
independently.
However, the processing of each AES block cannot be further refined into a set of parallel opera-
tions. The AES algorithm, as described in Section 2.1.1, applies rounds of successive transformations
to the input data block. The output of each round is used as the input of the next one, making
it impossible to parallelise the application of rounds. This is reflected in the CUDA implementation
by mapping CUDA threads to AES blocks, where each thread is responsible for processing a single
AES block. Therefore, a data payload of n bytes is split into n/16 AES blocks which are concurrently
processed by n/16 CUDA threads.
A further optimisation can be applied to the implementation of AES by taking advantage of the
CUDA thread block abstraction and by enforcing thread cooperation inside a thread block. In detail,
CUDA threads can be grouped together in CUDA thread blocks. Threads inside the same thread
block are guaranteed to execute on the same Streaming Multiprocessor and are able to share a
limited amount of data and synchronize their execution. Moreover, the CUDA memory hierarchy is
composed of several levels, each with its own characteristics. In particular, shared memory is a low-
latency, high-bandwidth memory space shared among all threads within a block, which can be highly
beneficial for any efficient GPU implementation.
The gipherfs CUDA implementation of AES also makes use of a fine-grained task decomposition
by having threads inside the same thread block cooperate among themselves. This is achieved
by having a single thread within the block copying the Substitution Box translation table and Key
Schedule, used by all threads in the rounds of AES, from the device main memory into the thread
block’s shared memory. The remaining threads must thus synchronise and wait for this transfer to
complete before proceeding. Still, the improved memory access is expected to make up for this
synchronisation point.
In order to maximize the benefits yielded by this mechanism, the number of threads per block
should be as high as possible. The theoretical maximum number of threads per block in the Maxwell
architecture is 1024. However, this value is limited by the amount of shared memory and the number
of registers used by the AES kernels, which limits the effective number of threads per block to 256 in
the AES CUDA implementation. When the payload length is higher than 4 KB, and the corresponding
number of processing threads exceeds 256, multiple thread blocks are launched at a time. A data
payload of 32 KB, for example, would result in launching 8 thread blocks simultaneously, which would
be assigned to 8 different Streaming Multiprocessors. The case when the number of scheduled thread
blocks exceeds the number of available SMs is automatically handled by the CUDA runtime, which
queues thread blocks for execution as soon as resources become available.
Considering a payload of 4 KB, it will be split into 4096/16 = 256 independent AES blocks. The
payload is then processed by a single CUDA thread block, made up of 256 CUDA threads, which
process it cooperatively and concurrently. Each AES block is assigned to a single CUDA thread, which
implements the AES block encryption or decryption operation. After being processed by the GPU, the
AES data blocks are merged together again to obtain the final ciphered/deciphered payload.
4.3.2 Maximizing payload length: FUSE big-writes mode
Compute-intensive tasks are the ones that benefit most from the GPU computing paradigm. The
high density of computing units found on the GPU greatly favours applications which spend most of
their time doing computation. In contrast, I/O-intensive tasks, which spend most of their time performing
I/O operations, do not fit well under the GPU computing model. This becomes rather evident from the
first aspect of an efficient GPU implementation mentioned previously, regarding the memory transfer
overheads imposed by the requirement of transferring data between the host and the GPU.
The operations making up the AES encryption algorithm can be categorized as compute-intensive
as they consist of the application of several rounds of data transformation functions to the same
piece of data, the AES block. However, the overall performance of the AES encryption algorithm will
only benefit from relocating its computation from the CPU to the GPU if it is able to
compensate for the penalty of transferring the input and output data to and from the GPU. This can
only be achieved by maximizing the amount of data processed by each GPU invocation.
The amount of data processed by gipherfs in each GPU invocation is determined by the length
of FUSE's read and write operations payload. The payload ultimately depends on how the client
application uses the file system API, but FUSE also plays a major role in it, by modifying the original
request and handling it in smaller data chunks in the case of the write operation. By default, FUSE
handles the payload of incoming write operations in data chunks of at most 4 KB. This maps to
a single CUDA thread block composed of 256 threads, and is not sufficient to compensate for the
memory transfer overhead, as is demonstrated in Section 5.2.
FUSE provides an alternative mode which makes it possible to overcome this limitation and in-
crease the maximum payload of write operations. The mode is called big-writes and increases the
maximum write operation payload by a factor of 16, i.e. from 4 KB to 64 KB. As verified in Section 5.2,
this amount of data is already sufficient to compensate the overheads incurred by the host-to-device
memory transfers, thus greatly boosting the performance of gipherfs.
4.3.3 Improving device memory management
Some of the most expensive operations in any GPU program implementation are those related to
memory management and transfer. Memory transfers to and from the GPU are perhaps the most
notable and introduce a significant amount of execution overhead. They can actually render the
approach of using the GPU to accelerate compute-intensive tasks completely ineffective, by incurring
more overhead than doing the computation directly on the CPU. Even though recent
GPU architectures already support overlapping memory transfers with kernel execution, it is still not
possible to avoid them completely.
There are however some operations whose cost can be minimized, or even completely eliminated,
by employing more efficient memory management mechanisms. This is the case of device memory
allocation. In more detail, a typical memory transfer operation between the host and the GPU, required
whenever a kernel is launched on the GPU, consists of the following steps, assuming data is already
available at the host:
1. Allocate device memory
2. Copy data from the host to the device
3. Copy data from the device to the host
4. Free device memory
Steps two and three cannot be avoided, since data must be transferred to the device for
processing and back to the host for collecting results. However, when a kernel is launched repeatedly,
the impact of steps one and four can be minimized by performing them only a single time. Memory
can be allocated once before launching kernels, and it can be freed only after all kernels finish, thus
reusing previously allocated memory on the device. However, this mechanism can only be applied to
systems where the required amount of allocated memory on the device can be estimated with a high
degree of confidence. Otherwise the program will either fail due to lack of memory or the reallocation
of device memory will render the mechanism ineffective.
This is the case for gipherfs, where the maximum amount of memory required on the
device is limited by the maximum amount of data FUSE tries to read or write, i.e. decipher or cipher,
simultaneously on the GPU. This amount can be determined by analyzing how FUSE works. With respect to
writes, and even though gipherfs uses FUSE in multithreaded mode, only a single write operation is
handled by FUSE at any given time. This is due to a lock on the fuse kernel module, which serializes
write operations. As a result, the maximum amount of memory required by a write operation will be
the maximum payload handled by a single write operation, which is 64 KB in FUSE big-writes mode
plus 16 bytes of potential padding, generated whenever write operations extend the previous file size.
On the other hand, it is possible to determine, analysing the operation of FUSE for different payloads,
that at most four read operations can execute concurrently, each capable of handling a maximum
payload of 128 KB. Padding need not be considered in this case, since it is not generated by read
operations.
gipherfs implements an efficient memory allocation mechanism by pre-allocating and reusing a
pool of device memory blocks, sized according to the previously mentioned observations. When the
system starts, 64 blocks of 128 KB are pre-allocated on the device. Even though at most five concurrent
operations may be executing on the device according to the above observations, blocks
are over-allocated in order to prevent starvation in case the system actually tries to allocate
more memory on the device. Whenever there is a need to allocate memory on the device, a pointer
to one of the available device memory blocks is obtained from the pool, instead of performing an
actual memory allocation. Conversely, after kernel execution, instead of freeing the block from device
memory, the no longer required pointer is returned to the pool. In the case
when more than 64 blocks are required at the same time (which is highly unlikely), the thread
requiring a block will wait until one is available. The performance improvements provided by this
mechanism are demonstrated in Section 5.3, where a comparison is made between the system with
and without this mechanism in place.
5 Solution Evaluation

Contents
5.1 Direct integration of GPU AES acceleration
5.2 Enabling big-writes in the proposed solution
5.3 Improving device memory management
This chapter presents the evaluation of gipherfs, a secure Virtual File System that provides effi-
cient data confidentiality for its files by offloading the computation of the AES encryption algorithm
from the CPU to the GPU, according to the solution proposed in Section 3. The system evaluation is
performed by comparing the performance of gipherfs against both of its intermediate iterations, cipherfs
and plainfs, in order to determine the advantages of offloading the AES computation to the GPU, as
well as the overhead of integrating GPU-accelerated cryptographic mechanisms into the operation
of a Virtual File System. The effectiveness of the optimizations enabling an efficient integration of
GPU computing mechanisms, proposed in Section 4.3, is analyzed in detail by considering different
modes of gipherfs, with and without them. This chapter also provides evidence that introducing
GPU-accelerated cryptographic mechanisms into the operation of a Virtual File System without any
security features reduces its performance by at most half in the worst-case scenario, while relocating
the computation required by these mechanisms from the CPU to the GPU can boost performance
by a factor of up to 3.5.
All experimental results presented in this chapter were obtained on a computing platform consisting
of an Intel Core i7-5960X CPU (3 GHz) with 32 GB of main memory (DRAM) and an NVIDIA GeForce
GTX 980 GPU, based on NVIDIA's Maxwell GPU computing architecture, with Compute Capability
5.2.
5.1 Direct integration of GPU AES acceleration
The evaluation starts with the performance of a naive implementation of gipherfs, corresponding
to the first system implementation without any of the optimizations described in Section 4.3. Even
though these could have been included in the system design from its inception, it was deemed
worthwhile to present the evaluation incrementally, adding each optimization one at a time. This
organisation provides an in-depth evaluation of the performance improvements introduced by each
optimization and makes it easier to highlight their overall impact.
Figures 5.1 and 5.2 present the bandwidth, in MB/s, obtained when copying a file into and out of a
directory mounted on three different Virtual File Systems. These file systems correspond to the
three incremental iterations described in Chapter 4, namely plainfs, cipherfs and gipherfs, where the
latter corresponds to a naive GPU implementation without any optimizations. Figure 5.1 presents
the results obtained when copying a file out of the file system, and can be used to determine the
file system's read bandwidth. Figure 5.2 provides the evaluation of the opposite operation, i.e.
copying a file into a directory mounted on the file system, and can thus be used to determine the file
system's write bandwidth. The results were obtained by measuring the time taken to perform a copy
operation, using the cp shell command and considering different file sizes, ranging from 64 KB
to 128 MB. The transfer time was averaged across 100 iterations of the copy operation in each
direction, in order to obtain a stable value, which was used to compute the average bandwidth.
Figure 5.1: Naive gipherfs: read bandwidth
The obtained results match the expected performance for each of the evaluated file systems.
plainfs is the fastest among them by a large margin in both directions, yielding about 6 and
20 times higher bandwidth for read and write operations, respectively. This is expected given the
absence of any data transformation: plainfs does not provide any security features and stores its file
data in plain text. The discrepancy becomes even more apparent for write operations, which FUSE
serializes and limits to at most 4 KB at a time in this mode.
Focusing on the performance of cipherfs and gipherfs, it can be noted that both systems attain
similar read bandwidth, with gipherfs providing slightly higher bandwidth for bigger files. This is ex-
plained by the benefits of processing bigger chunks of data in the GPU. As the file grows, the amount
of data processed by FUSE in each read operation also increases, up to a maximum of 128 KB per
operation. This amount of data is already enough to compensate for the GPU memory transfer over-
heads incurred in each GPU invocation. As such, offloading the computation of AES to the GPU
already pays off in this situation, even if only by a small margin. In contrast, the limitations imposed by
FUSE on write operations especially hurt the performance of gipherfs, which is almost 5 times slower
than cipherfs: processing at most 4 KB at a time in each GPU invocation is not enough to compensate
for the memory transfer overheads between the CPU and the GPU, incurred for each kernel
invocation.
Another interesting property that can be deduced from the presented evaluation is that, with respect
to read operations, only plainfs benefits from processing bigger files, as evidenced by the almost
linear increase in the attained bandwidth as the file size grows. The same observation does not
hold for gipherfs and cipherfs. This can be explained by plainfs not being limited by any data
transformation, in contrast to the other two systems: its data processing time is not proportional to
the total amount of data processed, so it benefits more from processing larger amounts of data.
gipherfs and cipherfs, however, are limited by the time taken to perform the AES transformations,
which also caps their maximum bandwidth.
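The saturation effect can be made concrete with a toy model: if each operation pays a fixed cost t_fixed and data is then processed at rate r, the attainable bandwidth for an n-byte file is n / (t_fixed + n/r), which grows with n but can never exceed r. The function below is purely illustrative; no constant in it is a measured value.

```c
/* Toy bandwidth model (bytes/s). For plainfs, `rate` is effectively the
 * raw I/O rate, so bandwidth keeps growing over the measured range; for
 * cipherfs and gipherfs, `rate` is capped by the AES transformation
 * rate, which bounds the maximum attainable bandwidth. */
double model_bw(double n_bytes, double t_fixed, double rate) {
    return n_bytes / (t_fixed + n_bytes / rate);
}
```

For example, with t_fixed = 1 ms and rate = 100 MB/s, the modelled bandwidth rises steeply with the file size for small files and then flattens out, approaching but never exceeding the 100 MB/s cap, which matches the qualitative shape observed for the AES-limited systems.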
Figure 5.2: Naive gipherfs: write bandwidth
5.2 Enabling big-writes in the proposed solution
The naive integration of the GPU accelerated version of AES in the proposed system yields rather
disappointing results when compared with a typical implementation that relies only on the CPU. Even
though it provides equivalent read performance, it attains a much lower write bandwidth. This can
be explained by how FUSE handles write operations by default, in which payloads are split into data
chunks of at most 4 KB, which are serially processed. This mode of operation does not fit well under
the GPU programming paradigm, which favours compute intensive tasks and is highly penalized by
I/O operations. The limited amount of data processed by each call to the GPU and an excessive
amount of relatively small data transfers between the host and the GPU, severely impact the write
bandwidth of gipherfs, as evidenced in Figure 5.2.
It seems plausible that increasing the amount of data processed by each write operation, and con-
sequently by each GPU invocation, would greatly benefit the write performance of gipherfs. In order
to determine the potential benefits of increasing the size of the FUSE write payload, as described in
Section 4.3.2, the bandwidth of the isolated AES encryption operation was benchmarked on both the
CPU and the GPU, measuring the impact of increasing the amount of data processed by each
individual call. This experimental evaluation consisted of processing an 8 MB file in data chunks of
increasing size, ranging from 2 to 512 KB, on both the CPU and the GPU, and comparing the
obtained bandwidth.
Figure 5.3: AES encryption benchmark: average block processing time
Figure 5.3 presents the average processing time of each data block across 100 iterations. It
shows that the processing time of each data block increases with its size in the CPU version, while
it remains relatively constant on the GPU. More specifically, as the block size doubles, so does the
CPU processing time. This can be explained by the sequential nature of the CPU AES encryption
implementation, in which each AES block is processed serially, one at a time. In contrast, the GPU
AES encryption implementation processes all AES blocks concurrently, thus incurring little overhead
when processing a larger amount of data. In fact, increasing the amount of data processed in a
single GPU call by a factor of 256, i.e. from 2 to 512 KB, only increases the processing time by a
factor of 2. Figure 5.3 also shows that processing at least 64 KB of data in each GPU call already
makes the GPU implementation of the AES encryption operation more efficient than the one that
strictly relies on the CPU.
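This behaviour is consistent with a simple cost model: a GPU call pays a large fixed launch-and-transfer overhead but processes AES blocks in parallel, while CPU time grows linearly with the data size. The constants below are illustrative placeholders, not the measured values; they are merely chosen so that the crossover falls near the 64 KB observed in the benchmark.

```c
/* Illustrative per-call cost model (seconds). The fixed 90 us term
 * stands for kernel launch and host<->device transfer setup; the linear
 * terms stand for the respective AES processing rates. */
double t_gpu(double bytes) { return 90e-6 + bytes / 2.0e9; }  /* parallel */
double t_cpu(double bytes) { return bytes / 0.5e9; }          /* serial   */
```

Under this model, growing a GPU call from 2 KB to 512 KB (256 times more data) raises its cost by under a factor of 4, while the CPU cost grows by the full factor of 256, and the GPU call becomes the cheaper of the two once the block exceeds roughly 64 KB, mirroring the crossover in Figure 5.3.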
Figure 5.4: AES encryption benchmark: average total processing time
Figure 5.5: AES encryption benchmark: average bandwidth
Figure 5.4 presents the total time taken by each implementation to process the complete 8 MB file.
The obtained results are consistent with the average block processing times presented in
Figure 5.3, and show that the GPU performance is highly penalized when the file is processed in small
data chunks. The CPU version, on the contrary, is unaffected by this variation, as its total processing
time remains constant. The GPU implementation is nevertheless able to outperform the CPU one when
the block size is at least 64 KB. Figure 5.5 presents an alternative perspective of the complete file
processing performance, based on the average bandwidth. These results are consistent with the ones
presented in Figure 5.2 and further explain the poor write performance of the naive
implementation of gipherfs. Recalling that FUSE splits write operations into chunks of 4 KB by default,
the low write bandwidth of gipherfs becomes evident by comparing the bandwidth of the CPU and
GPU versions of the AES encryption operation for 4 KB data chunks in Figure 5.5. The benchmark
results further support the hypothesis formulated in Section 4.3.2 that increasing the amount of data
processed in each FUSE write operation would benefit the performance of gipherfs. This can be
observed in Figure 5.4 for block sizes from 64 up to 512 KB, which enable the GPU execution to
outperform its CPU counterpart by a factor of up to 3.
Figure 5.6 presents the write performance of an optimized version of all Virtual File Systems,
using FUSE in big-writes mode. In this mode, each write operation is able to process at most 64
KB of data at a time, instead of the default maximum of 4 KB. By comparing these results with the
ones presented in Figure 5.2, it becomes clear that all iterations benefit from the introduction of big-
writes mode. The write bandwidth of plainfs and cipherfs increases by about 50% and by a factor of 2,
respectively. However, gipherfs is the one that benefits the most: its write bandwidth is boosted by a
factor of 10 for big files, bringing it on par with the performance of cipherfs for both read and write
operations, with writes actually being consistently more efficient in gipherfs.
Figure 5.6: Increasing gipherfs write payload: write bandwidth
5.3 Improving device memory management
The last GPU integration optimization proposed in Section 4.3.3 aims at minimizing the memory
management overheads required by GPU programs, more specifically, by eliminating the device
memory allocation and deallocation steps. As in the previous experimental evaluation, a specific
benchmark was developed to evaluate the potential benefits of this optimization. For this purpose,
all the individual function calls making up a complete call to the GPU were analyzed in detail, in order
to determine their relative cost. The analysis was again based on the encryption of an 8 MB file,
processed in block sizes ranging from 2 KB to 512 KB.
Figure 5.7: GPU memory operations benchmark: relative cost
Figure 5.7 presents the relative time spent allocating memory on the GPU (MemAlloc);
transferring data from the host to the GPU (MemH2D); executing the AES encryption kernel on the
GPU (ExecKernel); transferring the results from the GPU to the host (MemD2H); and releasing the
GPU memory (MemFree). The results show that most of the time is spent allocating memory on
the device, followed by releasing it, especially for smaller block sizes. As the block size increases,
memory transfers and kernel execution amortize the cost of these operations, but they always
represent at least 50% of the total GPU execution time. The percentages for the 64 and 128 KB block
sizes are especially relevant, as these are the maximum payloads handled by FUSE in big-writes mode
for write and read operations, respectively. In this setting, device memory allocation and deallocation
take about 80% of the total GPU execution time. These results support the expected benefits of
diminishing the impact of the memory allocation and deallocation steps in GPU invocations by
pre-allocating and reusing device memory buffers between GPU invocations, as described in Section
4.3.3.
Figures 5.8 and 5.9 present a final comparison between the three iterations of the system, using a
version of gipherfs implementing all the optimization steps described in Sections 4.3.2 and 4.3.3. Focus-
ing first on the benefits of avoiding device memory allocation, gipherfs performance is further
boosted by a factor of 2 for both read and write operations. This comes as a result of eliminating the
operations responsible for about 80% of the total GPU execution time in each kernel invocation, as
evidenced in Figure 5.7. Moreover, the final version of gipherfs outperforms cipherfs by factors of
3.5 and 3 for read and write operations, respectively. This provides evidence of the benefits of
offloading the computation of the AES algorithm from the CPU to the GPU, especially when large
amounts of data are processed. Finally, comparing the performance of gipherfs with that of plainfs
shows that the security guarantees offered by the proposed solution reduce the performance of a
default non-secure file system by at most half, in the worst-case scenario.
Figure 5.8: Improved device memory management in gipherfs: read bandwidth
Figure 5.9: Improved device memory management in gipherfs: write bandwidth
Figure 5.10: gipherfs optimizations: increased read bandwidth
Figures 5.10 and 5.11 summarize the benefits introduced by each optimization to gipherfs, in or-
der to better highlight their impact. Due to the concurrent nature of FUSE read operations, the read
bandwidth of gipherfs is consistently higher than its write bandwidth. Read operations do not ben-
efit from the introduction of big-writes mode, as it affects only write operations, but the improved
device memory management boosts their performance by a factor of almost 3. On the other hand,
gipherfs write operations benefit the most from the introduction of big-writes mode, which increases
their throughput by a factor of more than 10, from 4 MB/s to 46 MB/s in the best case. When the im-
proved device memory management is introduced, the write bandwidth is further improved by a factor
of approximately 2. These results demonstrate the relevance and the impact of the optimizations
introduced into the system, which provide different but balanced benefits for the different types of
I/O operations.
Figure 5.11: gipherfs optimizations: increased write bandwidth
6 Conclusions and Future Work
The work developed in this dissertation aimed at implementing a secure file system, using the
standard AES encryption algorithm to provide confidentiality to a file system and its data in an effi-
cient and transparent manner. To achieve the desired goals, a Virtual File System based on FUSE
was implemented and extended with cryptographic mechanisms accelerated by a Graphics Processing
Unit. Using the high-performance capabilities of the GPU to execute the compute-intensive
cryptographic primitives of the AES algorithm sped up its execution and increased the file
processing throughput. At the same time, using the GPU as a cryptographic co-processor
maximized its utility and minimized the impact of the additional security features on the CPU and
on the overall system performance. Moreover, the decision to transparently integrate
the encryption mechanism into the operation of a Virtual File System facilitates its adoption, since
no explicit user intervention is required once it is in place.
The implemented system was evaluated by comparing its performance with an equivalent alter-
native, using only the CPU, in order to determine the benefits of offloading the computation to the
GPU.
The experimental results were obtained by measuring the Virtual File System's read and write
bandwidth when performing typical file system operations, namely accessing files. The obtained
results, using state-of-the-art GPUs, confirm the advantages and merit of this approach by showing
a consistent increase in the Virtual File System's file processing throughput, especially for large files
of hundreds of megabytes. The system was able to increase write and read bandwidths by factors of
up to 3 and 3.5, respectively, when compared with an equivalent CPU-only alternative.
Moreover, the results also show that the introduction of encryption primitives accelerated by the GPU
into the Virtual File System reduces its throughput by at most half in the worst case.
Future work
Several directions could be followed in future work. In terms of the system's security, a better
key management mechanism and a more secure mode of operation for the AES encryption algorithm,
such as CTR [63] or XTS [64], could be adopted, increasing the system's security without compromising
its performance. Additionally, other security concerns such as data integrity and authentication could
also be introduced into the system, using a similar GPU-based approach. In terms of performance,
more advanced CUDA mechanisms, such as CUDA streams, could be leveraged to further
increase the overall system throughput. FUSE's limitations, namely the serialization imposed on
write operations, could also be addressed to further improve the system's write bandwidth.
Bibliography
[1] J. Barkley, “Comparing simple role based access control models and access control lists”, Pro-
ceedings of the second ACM workshop on Role-based access control, 1997, pages 127-132,
ACM Press
[2] Y. Cherdantseva and J. Hilton, "A Reference Model of Information Assurance & Security”, 2013
International Conference on Availability, Reliability and Security, Regensburg, 2013, pp. 546-555
[3] J. Saltzer, and M. Schroeder, “The protection of information in computer systems”, Proceedings
of the IEEE 63, 1975, pp. 1278-1308
[4] Ronald L. Rivest, J. Van Leeuwen, Cryptography, Handbook of Theoretical Computer Science,
Chapter 13, Elsevier, 1990, pp. 717-755
[5] Jawahar Thakur, Nagesh Kumar, DES, AES and Blowfish: Symmetric Key Cryptography Algo-
rithms Simulation Based Performance Analysis, International Journal of Emerging Technology
and Advanced Engineering, December 2011
[6] Shay Gueron, Intel® Advanced Encryption Standard (AES) New Instructions Set, Intel Architecture Group, Israel Development Center, Revision 3.01, 2012, https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
[7] FIPS-197: Advanced Encryption Standard (AES), Federal Information Processing Standards
Publication, National Institute of Standards and Technology (NIST), November, 2001, http:
//csrc.nist.gov/publications/fips/fips197/fips-197.pdf
[8] K. T. Cheng and Y. C. Wang, Using mobile GPU for general-purpose computing – a case study
of face recognition on smartphones, In Proceedings of 2011 International Symposium on VLSI
Design, Automation and Test, pp. 1-4, Hsinchu, 2011
[9] J. Leskela, J. Nikula, M. Salmela, OpenCL Embedded Profile Prototype in Mobile Device, IEEE
Workshop on Signal Processing Systems, pp. 279-284, October 2009
[10] NVIDIA. CUDA 4.0., http://developer.nvidia.com/cuda-toolkit-40, 2011
[11] Stone, John E., Gohara D., Shi G., OpenCL: A Parallel Programming Standard for Heteroge-
neous Computing Systems. Computing in science & engineering, 2010, pp. 66-72
[12] Christoph Mitasch, https://www.thomas-krenn.com/en/wiki/Ext4_Filesystem
[13] M. Szeredi, File system in user space, http://fuse.sourceforge.net
[14] FUSE, File system in user space, http://northstar-www.dartmouth.edu/~richard/
WhitePapers/FUSE.html
[15] Z. Wang, R. Murmuria and A. Stavrou, "Implementing and Optimizing an Encryption Filesystem
on Android," 2012 IEEE 13th International Conference on Mobile Data Management, Bengaluru,
Karnataka, 2012, pp. 52-62
[16] Weibin Sun, Robert Ricci, and Matthew L. Curry, “GPUstore: harnessing GPU computing for
storage systems in the OS kernel”, In Proceedings of the 5th Annual International Systems and
Storage Conference, SYSTOR ’12, ACM, New York, NY, USA, 2012, Article 9, 12 pages
[17] Katz J., Lindell Y., Introduction to Modern Cryptography, Second Edition, 2014, ISBN
9781466570269, 2nd edition, Chapman & Hall/CRC
[18] Mitali, Vijay Kumar, Arvind Sharma, “A Survey on Various Cryptography Techniques”, Interna-
tional Journal of Emerging Trends & Technology in Computer Science (IJETTCS) Volume 3, Issue
4, July-August 2014, ISSN 2278-6856.
[19] Daemen, Joan; Rijmen, Vincent, "AES Proposal: Rijndael”, National Institute of Standards and
Technology, March 9, 2003
[20] Gilbert, Henri and Peyrin, Thomas, Super-Sbox Cryptanalysis: Improved Attacks for AES-Like
Permutations, In Fast Software Encryption: 17th International Workshop, FSE 2010, Seoul, Korea,
February 7-10, 2010, pp. 365-383
[21] YongBin Zhou and DengGuo Feng, Side-Channel Attacks: Ten Years After Its Publication and
the Impacts on Cryptographic Module Security Testing, In IACR Cryptology ePrint Archive,
http://eprint.iacr.org/2005/388, 2005
[22] Tuchman, Walter, A Brief History of the Data Encryption Standard, 1998, pp. 275-280, ACM
Press/Addison-Wesley Publishing Co., New York, NY, USA
[23] FIPS Pub 46-3, Federal Information Processing Standards Publication: Data Encryption Stan-
dard (DES), October 1999
[24] Grabbe J., Data Encryption Standard: The Triple DES algorithm illustrated Laissez faire city time,
Volume: 2, No. 28, 2003
[25] B. Schneier, “Fast Software Encryption”, Cambridge Security Workshop Proceedings, December
1993, pp. 191-204
[26] Wikipedia, “Block cipher mode of operation”, https://en.wikipedia.org/wiki/Block_cipher_
mode_of_operation
[27] Diffie, W. and Hellman, M., New Directions in Cryptography, IEEE Trans. Inf. Theor., November
1976
[28] Rivest, R. L. and Shamir, A. and Adleman, L., “A Method for Obtaining Digital Signatures and
Public-key Cryptosystems”, Commun. ACM, Feb. 1978, pp. 120-126
[29] Auguste Kerckhoffs, "La cryptographie militaire" Journal des sciences militaires, vol. IX, pp. 5–83,
January 1883
[30] Michael Galloy, “CPU vs GPU performance”, http://michaelgalloy.com/2013/06/11/
cpu-vs-gpu-performance.html
[31] Mark Segal, Kurt Akeley, The OpenGL Graphics System: A Specification, Version 4.0, March 11,
2010
[32] Fernando, Randima and Kilgard, Mark J., The Cg Tutorial: The Definitive Guide to Programmable
Real-Time Graphics, 2003
[33] J. Owens, D. Luebke, M. Harris, J. Krüger, A. Lefohn, T. Purcell, A Survey of General-Purpose
Computation on Graphics Hardware, in Computer Graphics Forum, March 2007
[34] C. McClanahan, “History and Evolution of GPU Architecture,” Paper Survey, 2010
[35] IBM Personal Computer Professional Graphics Controller Technical Reference, August 1984
[36] Schaller, Robert R., Moore’s Law: Past, Present, and Future, IEEE Spectr., June 1997,
[37] Akeley, Kurt, Reality Engine Graphics, Proceedings of the 20th Annual Conference on Computer
Graphics and Interactive Techniques, SIGGRAPH ’93, 1993, pp. 109–116
[38] NVIDIA, NVIDIA’s Next Generation CUDA Compute Architecture: Fermi, White Pa-
per, http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_
architecture_whitepaper.pdf, 2009
[39] NVIDIA, NVIDIA’s Next Generation CUDA Compute Architecture: Ke-
pler GK110, White Paper, https://www.nvidia.com/content/PDF/kepler/
NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf, 2012
[40] NVIDIA, NVIDIA GeForce GTX 980, White Paper, https://international.download.nvidia.
com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF, 2014
[41] Rajgarhia, Aditya and Gehani, Ashish, Performance and Extension of User Space File Systems,
In Proceedings of the 2010 ACM Symposium on Applied Computing, Sierre, Switzerland, pp.
206–213, 2010
[42] List of FUSE file systems https://github.com/libfuse/libfuse/wiki/Filesystems
[43] rarfs, https://github.com/vadmium/rarfs
[44] gunzip, http://freecode.com/projects/fusegunzip
[45] fuse-zip, https://linux.die.net/man/1/fuse-zip
[46] fusecompress https://github.com/tex/fusecompress
[47] sshfs https://wiki.archlinux.org/index.php/SSHFS
[48] davfs https://wiki.archlinux.org/index.php/Davfs
[49] D. Cook and J. Ioannidis and A. Keromytis and J. Luck, CryptoGraphics: Secret Key Cryptogra-
phy Using Graphics Cards, In RSA Conference, Cryptographer’s Track (CT-RSA), February 2005
[50] Harrison, O. and Waldron, J., AES Encryption Implementation and Analysis on Commodity
Graphics Processing Units, in Cryptographic Hardware and Embedded Systems - CHES 2007:
9th International Workshop, pp. 209-226, Vienna, Austria, September 10-13, 2007
[51] S. Manavski et al., Cuda compatible gpu as an efficient hardware accelerator for aes cryptogra-
phy, IEEE Int. Conf. on Signal Processing & Communications, pp. 65-68, 2007
[52] Q. Li, C. Zhong, K. Zhao, X. Mei and X. Chu, Implementation and Analysis of AES Encryption
on GPU, 2012 IEEE 14th International Conference on High Performance Computing and Com-
munication & 2012 IEEE 9th International Conference on Embedded Software and Systems, pp.
843-848, Liverpool, 2012
[53] Silberstein, Mark and Ford, Bryan and Keidar, Idit and Witchel, Emmett, GPUfs: Integrating a
File System with GPUs, SIGARCH Comput. Archit. News, pp. 485-498, March 2013
[54] Gharaibeh, Abdullah and Al-Kiswany, Samer and Gopalakrishnan, Sathish and Ripeanu, Matei,
A GPU Accelerated Storage System, In Proceedings of the 19th ACM International Symposium
on High Performance Distributed Computing, pp. 167-178, Chicago, Illinois, 2010
[55] Al-Kiswany, Samer and Gharaibeh, Abdullah and Santos-Neto, Elizeu and Yuan, George and Ri-
peanu, Matei, StoreGPU: Exploiting Graphics Processing Units to Accelerate Distributed Storage
Systems, In Proceedings of the 17th International Symposium on High Performance Distributed
Computing, pp. 165-174, Boston, MA, USA, 2008
[56] Shinpei Kato and Michael McThrow and Carlos Maltzahn and Scott Brandt, Gdev: First-Class
GPU Resource Management in the Operating System, Presented as part of the 2012 USENIX
Annual Technical Conference (USENIX ATC 12), pp. 401-412, Boston, MA, 2012
[57] Rossbach, Christopher J. and Currey, Jon and Silberstein, Mark and Ray, Baishakhi and Witchel,
Emmett, PTask: Operating System Abstractions to Manage GPUs As Compute Devices, In Pro-
ceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, Cascais, Portu-
gal, 2011
[58] G. Agosta, A. Barenghi, F. D. Santis, A. D. Biagio and G. Pelosi, Fast Disk Encryption through
GPGPU Acceleration, 2009 International Conference on Parallel and Distributed Computing, Ap-
plications and Technologies, Higashi Hiroshima, pp. 102-109, 2009
[59] Eimot, Magne, Offloading an encrypted user space file system on Graphical Processing Units,
Master Thesis, University of Oslo, Department of Informatics, Oslo, Norway, 2009
[60] PKCS #7, Cryptographic Message Syntax Standard, RSA Laboratories, Version 1.5, Nov 1993
[61] https://github.com/nablahero/cuda_aes
[62] https://www.wolfssl.com
[63] W. Diffie and M. E. Hellman, Privacy and authentication: An introduction to cryptography, in
Proceedings of the IEEE, vol. 67, no. 3, pp. 397-427, March 1979
[64] Standard for Cryptographic Protection of Data on Block-Oriented Storage Devices, IEEE
P1619/D16, p. 34, 2007