Video presentations

Demo of the world’s fastest inference engine for Arm Cortex-M

Arm AI Tech Talk March 11, 2022 Link to video

Demoing the world’s fastest inference engine for Arm Cortex-M

tinyML Talks January 4, 2022 Link to video

Handouts of talks

CLBlast: A Tuned BLAS Library

IWOCL '18, Oxford, UK May 16, 2018 Link to program and Handouts

CLBlast: A Tuned BLAS Library for Faster Deep Learning

GTC '17, San Jose, CA May 11, 2017 Link to program and Handouts

GPU Programming 101

C++ Meetup, Amsterdam, NL August 25, 2016 Link to program and Handouts

Better Than All the Rest: Finding Max-Performance GPU Kernels Using Auto-Tuning

GTC '16, San Jose, CA April 7, 2016 Link to program and Handouts

CLTune: A Generic Auto-Tuner for OpenCL Kernels

MCSoC '15, Torino, Italy September 24, 2015 Link to program and Handouts

A Study of the Potential of Locality-Aware Thread Scheduling for GPUs

MuCoCoS '14, Porto, Portugal August 26, 2014 Link to program and Handouts

A Detailed GPU Cache Model Based on Reuse Distance Theory

HPCA '14, Orlando, US February 2014 Link to program and Handouts

Algorithmic Species Revisited: A Program Code Classification Based on Array References

MuCoCoS '13, Edinburgh, UK September 7, 2013 Link to program and Handouts

Automatic Skeleton-Based Compilation through Integration with an Algorithm Classification

APPT '13, Stockholm, Sweden August 28, 2013 Link to program and Handouts

Auto-Tuning OpenCL Matrix-Multiplication: K40 versus K80

GTC '15, San Jose, CA March 16, 2015 Link to program and Poster PDF