Video presentations
Demo of the world’s fastest inference engine for Arm Cortex-M
Demoing the world’s fastest inference engine for Arm Cortex-M
Handouts of talks
CLBlast: A Tuned BLAS Library
CLBlast: A Tuned BLAS Library for Faster Deep Learning
GPU Programming 101
Better Than All the Rest: Finding Max-Performance GPU Kernels Using Auto-Tuning
CLTune: A Generic Auto-Tuner for OpenCL Kernels
A Study of the Potential of Locality-Aware Thread Scheduling for GPUs
A Detailed GPU Cache Model Based on Reuse Distance Theory
Algorithmic Species Revisited: A Program Code Classification Based on Array References
Automatic Skeleton-Based Compilation through Integration with an Algorithm Classification
Posters
Auto-Tuning OpenCL Matrix-Multiplication: K40 versus K80