## OpenMP 4.x and GPUs

OpenMP provides a directive-based language that allows programmers to mark parts of their code to take advantage of the features of the machine or system that will execute it, most notably thread-level parallelism. OpenMP 4.0 introduced, for the first time, directives to control offloading to a device other than the host, i.e. an accelerator, and OpenMP 4.5 refined several aspects to make it more user friendly.

A natural target for this offloading model is a heterogeneous system that consists of a host CPU, e.g. PowerPC, with GPUs attached. This requires specific compiler and runtime library support to handle the mapping of data from host to device and vice versa, and to implement the programming model on top of a native threading model that does not exactly match it. I have been investigating ways to accomplish that, and I am actively working on contributing a fully functional implementation to LLVM itself and two related projects: Clang and OpenMP.
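As a rough illustration of the offloading directives described above, the following C sketch offloads a simple vector update. The function name `saxpy_offload` and the choice of kernel are assumptions made for this example, not taken from the work described here. The `map` clauses express the host-to-device and device-to-host data movement that the compiler and runtime have to implement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for illustration: y[i] = a * x[i] + y[i].
 * The loop is offloaded when the compiler and runtime support a target
 * device; otherwise it runs on the host. */
void saxpy_offload(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);

    /* map(to: ...)     copies x from host to device before the region.
     * map(tofrom: ...) copies y to the device and back to the host after.
     * teams/distribute spread iterations over groups of device threads,
     * parallel for spreads them over the threads inside each group. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

If the code is compiled without OpenMP support, the pragma is ignored and the loop simply runs serially on the host, so the result is the same either way.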

#### Related Publications:

- Bercea, G-T., Bertolli, C., Antão, S., Jacob, A. C., Eichenberger, A. E., Chen, T., Sura, Z., Sung, H., Rokos, G., Appelhans, D., O’Brien, K. (2015). Performance Analysis of OpenMP on a GPU using a CORAL Proxy Application. Performance Modeling, Benchmarking and Simulation of HPC Workshop - International Conference for High Performance Computing, Networking, Storage, and Analysis – SC 2015 (pp. 2:1-2:11). Austin, TX, USA: ACM;

- Bertolli, C., Antão, S., Bercea, G-T., Jacob, A. C., Eichenberger, A. E., Chen, T., Sura, Z., Sung, H., Rokos, G., Appelhans, D., O’Brien, K. (2015). Integrating GPU Support for OpenMP Offloading Directives into Clang. LLVM Compiler Infrastructure in HPC Workshop - International Conference for High Performance Computing, Networking, Storage, and Analysis – SC 2015 (pp. 5:1-5:15). Austin, TX, USA: ACM;

- Bertolli, C., Antão, S., Eichenberger, A. E., O’Brien, K., Sura, Z., Jacob, A. C., Chen, T., & Sallenave, O. (2014). Coordinating GPU Threads for OpenMP 4.0 in LLVM. LLVM Compiler Infrastructure in HPC Workshop - International Conference for High Performance Computing, Networking, Storage, and Analysis – SC 2014 (pp. 12–21). New Orleans, LA, USA: ACM

## Cryptography and the RNS

The Residue Number System (RNS) is a powerful tool for deriving parallel versions of algorithms. RNS relies on an alternative representation of integers, which are the operands of many algorithms, including cryptographic ones. In this research thread I focus on algorithms related to Elliptic Curve (EC) cryptography, which I have been rewriting and optimizing to obtain efficient parallel versions. Specifically, I have been working on several key RNS algorithms, including the so-called Basis Extension methods. The advantage of such optimized parallel versions is that computing platforms are evolving towards a paradigm of more, yet simpler, processing cores, so these algorithms can take full advantage of them. Graphics Processing Units (GPUs) are one example of such platforms, but others, such as FPGAs, can also be targeted.
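To give an idea of where the parallelism comes from, here is a minimal C sketch of a multiplication carried out in RNS. The moduli `{7, 11, 13}` and the names `rns_from_int` and `rns_mul` are assumptions chosen for illustration only; real cryptographic operands need far larger bases.

```c
#include <assert.h>
#include <stdint.h>

#define RNS_CHANNELS 3

/* Illustrative basis of pairwise coprime moduli (an assumption).
 * It can represent integers in the range [0, 7 * 11 * 13) = [0, 1001). */
static const uint32_t RNS_MODULI[RNS_CHANNELS] = {7, 11, 13};

/* Compute the residue of x in every channel. */
void rns_from_int(uint32_t x, uint32_t r[RNS_CHANNELS])
{
    for (int i = 0; i < RNS_CHANNELS; ++i) {
        r[i] = x % RNS_MODULI[i];
    }
}

/* Channel-wise multiplication. No carry propagates between channels, so
 * every iteration is independent and could be assigned to its own thread
 * or processing core. */
void rns_mul(const uint32_t a[RNS_CHANNELS], const uint32_t b[RNS_CHANNELS],
             uint32_t c[RNS_CHANNELS])
{
    for (int i = 0; i < RNS_CHANNELS; ++i) {
        assert(a[i] < RNS_MODULI[i] && b[i] < RNS_MODULI[i]);
        c[i] = (uint32_t)(((uint64_t)a[i] * b[i]) % RNS_MODULI[i]);
    }
}
```

For instance, multiplying 25 by 30 in this basis yields the residues of 750 in each channel, without any channel needing data from another.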

#### Related Publications:

- Antão, S., Bajard, J.-C., & Sousa, L. (2011). RNS based Elliptic Curve Point Multiplication for Massive Parallel Architectures. The Computer Journal 2011 - Oxford Journals, 1-19. Oxford University Press. doi:10.1093/comjnl/BXR119

- Antão, S., Bajard, J.-C., & Sousa, L. (2010). Elliptic Curve point multiplication on GPUs. IEEE International Conference on Application-specific Systems Architectures and Processors - ASAP (pp. 192–199). Rennes: IEEE. doi:10.1109/ASAP.2010.5541000

## Cryptographic coprocessors

Cryptography is often built into systems to secure data during communication and to support other security-related procedures such as authentication. The algorithms that underlie these cryptographic procedures are complex and computationally demanding, so coprocessors are needed to accelerate these tasks. In this research thread, I focus on the design of new algorithms and architectures to support cryptographic coprocessors for different technologies, including ASIC and FPGA, taking full advantage of the characteristics of the latter, namely runtime configuration. Elliptic Curve (EC) cryptography is the main subject of my research. Nevertheless, other protocols and functionalities have also been addressed, including AES and random number generation.

#### Related Publications:

- Antão, S., Chaves, R., & Sousa, L. (2009). AES and ECC Cryptography Processor with Runtime Configuration. International Conference on Advanced Computing and Communications - ADCOM. Bangalore: IEEE.

- Antão, S., Chaves, R., & Sousa, L. (2009). Compact and Flexible Microcoded Elliptic Curve Processor for Reconfigurable Devices. IEEE Symposium on Field Programmable Custom Computing Machines - FCCM (pp. 193-200). Napa - CA: IEEE. doi:10.1109/FCCM.2009.18

- Antão, S., Chaves, R., & Sousa, L. (2008). Efficient FPGA Elliptic Curve Cryptographic Processor over GF(2^m). International Conference on Field-Programmable Technology - ICFPT (pp. 357-360). Taipei: IEEE. doi:10.1109/FPT.2008.4762417

## RNS Reverse Converters

The Residue Number System (RNS) is an alternative representation of integers that replaces the weighted binary representation with the residues of the integer modulo each element of a moduli set of pairwise coprime integers. Leaving the mathematical details aside, computation in RNS usually comprises three steps:

i) a conversion from binary to the RNS representation (forward conversion),

ii) the computation in parallel over the RNS representation, and

iii) a final conversion from the RNS representation to weighted binary (reverse conversion).

While the first two steps are often straightforward, the last one can be demanding: if not efficiently implemented, the reverse conversion can become the performance bottleneck of an RNS-based system. Therefore, in order to implement efficient computing circuits based on RNS, special attention must be paid to the design of the reverse converters. Several conversion methods exist, suiting a wide range of RNS moduli sets. My research focuses both on the design of converters for existing moduli sets and on the use of new moduli sets that can provide increased conversion efficiency.
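The three steps can be sketched in C as follows. This is a software illustration under stated assumptions, not one of the converter designs discussed here: the moduli set `{15, 16, 17}` (of the form {2^n - 1, 2^n, 2^n + 1} with n = 4) and the function names are chosen for the example, and the reverse conversion uses the textbook Chinese Remainder Theorem with a brute-force modular inverse, which is only reasonable for tiny moduli.

```c
#include <assert.h>
#include <stdint.h>

#define CONV_K 3

/* Assumed moduli set {2^n - 1, 2^n, 2^n + 1} with n = 4.
 * Dynamic range M = 15 * 16 * 17 = 4080. */
static const uint64_t CONV_MODULI[CONV_K] = {15, 16, 17};

/* Step i) forward conversion: binary to RNS. */
void forward_convert(uint64_t x, uint64_t r[CONV_K])
{
    for (int i = 0; i < CONV_K; ++i) {
        r[i] = x % CONV_MODULI[i];
    }
}

/* Step ii) computation over the RNS representation: each channel is
 * independent of the others. */
void channel_add(const uint64_t a[CONV_K], const uint64_t b[CONV_K],
                 uint64_t c[CONV_K])
{
    for (int i = 0; i < CONV_K; ++i) {
        c[i] = (a[i] + b[i]) % CONV_MODULI[i];
    }
}

/* Modular inverse by exhaustive search; returns 0 if none exists. */
static uint64_t mod_inverse(uint64_t a, uint64_t m)
{
    for (uint64_t t = 1; t < m; ++t) {
        if ((a * t) % m == 1) {
            return t;
        }
    }
    return 0;
}

/* Step iii) reverse conversion via the Chinese Remainder Theorem:
 * X = sum_i r_i * M_i * (M_i^-1 mod m_i)  (mod M), with M_i = M / m_i.
 * Unlike the previous steps, every channel contributes to one wide sum. */
uint64_t reverse_convert(const uint64_t r[CONV_K])
{
    uint64_t M = 1;
    for (int i = 0; i < CONV_K; ++i) {
        M *= CONV_MODULI[i];
    }

    uint64_t x = 0;
    for (int i = 0; i < CONV_K; ++i) {
        uint64_t Mi = M / CONV_MODULI[i];
        uint64_t inv = mod_inverse(Mi % CONV_MODULI[i], CONV_MODULI[i]);
        assert(inv != 0); /* holds when the moduli are pairwise coprime */
        x = (x + ((r[i] * Mi) % M) * inv) % M;
    }
    return x;
}
```

Converting two operands forward, adding them with `channel_add`, and calling `reverse_convert` on the result recovers their sum as long as it is below the dynamic range of 4080. The wide modular sum in the last step hints at why the reverse conversion tends to be the costly part.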