Some thoughts on GHC's SIMD primitives
The use of parallelization is essential to exploit the performance of modern computers. In particular, we want to take advantage of SIMD, multi-core, and GPUs. This article will focus on SIMD — single instruction, multiple data.
GHC has provided primitives for SIMD for years, but they have not been effectively utilized.
In this article, I’d like to discuss what can be done to promote the utilization of GHC’s SIMD primitives.
Disclaimer: I’m not a high-performance computing expert — just an enthusiast who enjoys experimenting with hardware capabilities for fun.
Current Status
Paper
I must mention the following paper, which was written by the people who implemented SIMD primitives into GHC.
Geoffrey Mainland, Roman Leshchinskiy and Simon Peyton Jones, Exploiting Vector Instructions with Generalized Stream Fusion, ICFP 2013
In addition to GHC’s SIMD primitives, they seem to have considered integration with the vector
package and Data Parallel Haskell (DPH).
However, integration with the vector
package doesn’t seem to have happended, and DPH did not survive.
Design
When exposing SIMD features to programming languages, there is the question of how to abstract (or not to abstract) the architecture-dependent vector register width.
GHC provides 128-bit, 256-bit, and 512-bit wide types, respectively. They correspond directly to the corresponding CPU registers. In other words, on x86 systems, 512-bit wide vector types correspond to ZMM (AVX-512) registers and do not use two YMM registers or four XMM registers.
In contrast, LLVM’s abstraction allows vectors of arbitrary length to be written. Vectors exceeding the machine’s SIMD width are compiled into multiple instructions or registers.
Under GHC’s current design, it is up to libraries to abstract away the architecture-dependent SIMD width.
Library
Since GHC’s SIMD primitive types are unboxed, one would want boxed wrappers for daily use. With a Hackage search, I found two packages providing boxed wrappers:
The simd
package is based on type families:
module Data.SIMD.SIMD4 where
class SIMD4 a where
data X4 a
plusX4 :: X4 a -> X4 a -> X4 a
indexArrayAsX4 :: ByteArray -> Int -> X4 a
broadcastX4 :: a -> X4 a
-- ...
class SIMD4 a => SIMD4Float a where
divideX4 :: X4 a -> X4 a -> X4 a
instance (Show a, SIMD4 a) => Show (X4 a)
instance (Num a, SIMD4 a) => Num (X4 a)
instance (Fractional a, SIMD4Float a) => Fractional (X4 a)
instance SIMD4 Float
instance SIMD4 Double
-- ...
vectorizeUnboxedX4 :: (SIMD4 a, VU.Unbox a) => VU.Vector a -> VU.Vector (X4 a)
unVectorizeUnboxedX4 :: (SIMD4 a, VU.Unbox a) => VU.Vector (X4 a) -> VU.Vector a
vectorizeStorableX4 :: (SIMD4 a, Storable a, Storable (X4 a)) => VS.Vector a -> VS.Vector (X4 a)
unVectorizeStorableX4 :: (SIMD4 a, Storable a, Storable (X4 a)) => VS.Vector (X4 a) -> VS.Vector a
This package organizes SIMD types by the number of elements, such as X4
, X8
, and X16
.
One thing to note is that X4 Double
is a wrapper for DoubleX4#
and therefore requires AVX.
If you want to map over Vector Float
, the vector’s length must be a multiple of 4.
Let’s see the other package, primitive-simd
.
module Data.Primitive.SIMD where
class (Num v, Real (Elem v)) => SIMDVector v where
type Elem v
type ElemTuple v
broadcastVector :: Elem v -> v
mapVector :: (Elem v -> Elem v) -> v -> v
-- ...
data FloatX4
data FloatX8
data FloatX16
data DoubleX2
data DoubleX4
data DoubleX8
-- ...
It provides individual data types, not type families, for the vector types (it uses type families for obtaining the element type from vector types).
One interesting point is, DoubleX4
does not always map to DoubleX4#
and can be configured to use two DoubleX2#
.
It allows to use DoubleX4
on SSE-only machines, or NEON-only Arm machines.
I like the idea of using type families like X4
/X8
/X16
, and I like the idea of using two DoubleX2#
for X4 Double
/DoubleX4
.
I’ll come back to this point later.
Improvements to GHC
This section explores ways to make GHC’s SIMD primitives more accessible and practical for users.
Expanding Backend Support
Historically (in GHC < 9.8), using SIMD primitives required:
- The LLVM backend
- An x86(-64) machine
This situation was not great because
- you need to install LLVM toolchain in addition to GHC, and
- some of the modern computers are not x86-based (for instance, I’m writing this article with an Apple Silicon Mac).
With SIMD primitives limited to few backends, it’s no surprise that a package authors have been hesitant to adopt them. To encourange broader use, GHC should support SIMD primitives across more backends.
Recently, some progress is being made in this area. The following sections will detail these advancements.
x86(-64) NCG
Currently, the most popular architecture for PCs is x86-64. Therefore, it would be beneficial to make the x86(-64) NCG support SIMD primitives.
Now, GHC 9.12 added some support for SIMD with x86-64 NCG. The work originally started as a GSoC project in 2018 and was once merged in 2019, but later reverted because of incomplete handling of register spilling.
- GSoC final evaluation
- #7741: Add SIMD support to x86/x86_64 NCG · Issues · Glasgow Haskell Compiler / GHC · GitLab
- !1306: AVX/SSE4 SIMD support for the native code generator · Merge requests · Glasgow Haskell Compiler / GHC · GitLab
- !1379: Back out SIMD NCG support · Merge requests · Glasgow Haskell Compiler / GHC · GitLab
Last year (2024), @sheaf and @jaro continued the work and managed to get basic support of SIMD into x86(-64) NCG.
- !12860: Add SIMD support to the X86 NCG · Merge requests · Glasgow Haskell Compiler / GHC · GitLab
- !12857: Add Int64X2 SIMD operations · Merge requests · Glasgow Haskell Compiler / GHC · GitLab
The initial implementation was limited to FloatX4#
, DoubleX2#
(and packing/unpacking of Int64X2#
) and some operations required SSE4.1 instructions (by default, GHC refuses to use SSE instructions beyond SSE2).
Since then, I have worked to make NCG emit SSE1 instructions for basic FloatX4#
operations:
- #25441: x86 NCG SIMD: Primitives like insertFloatX4# should be available without -msse4 · Issues · Glasgow Haskell Compiler / GHC · GitLab
- !13542: x86 NCG SIMD: Lower packFloatX4#, insertFloatX4# and broadcastFloatX4# to SSE1 instructions · Merge requests · Glasgow Haskell Compiler / GHC · GitLab (backported to GHC 9.12)
and I implemented packing/unpacking of other integer vector types:
- #25487: x86 NCG SIMD: Implement pack/insert/broadcast/unpack for integer vectors · Issues · Glasgow Haskell Compiler / GHC · GitLab
- !13602: x86 NCG SIMD: Support pack/insert/broadcast/unpack of 128-bit integer vectors · Merge requests · Glasgow Haskell Compiler / GHC · GitLab (will be in GHC 9.14)
and now I’m working on to support arithmetic operations of integer vector types:
- #25643: x86 NCG SIMD: Implement arithmetic operations for integer vectors · Issues · Glasgow Haskell Compiler / GHC · GitLab
- !13810: Draft: x86 NCG SIMD: Implement 128-bit integer vector arithmetics · Merge requests · Glasgow Haskell Compiler / GHC · GitLab
Although there are many tasks to be done (256-bit and 512-bit support via AVX/AVX-512/AVX10, better code generation, etc1), the progress is promising.
AArch64 LLVM/NCG
The next most popular architecture for PCs is probably AArch64 (Arm64). Therefore, it should be supported as well.
AArch64 offers 128-bit wide ASIMD (NEON) instructions and optional SVE. NEON aligns well with GHC’s design, so it is the first target.
Supporting NEON via LLVM should be easy, and actually I did it in 2023:
- #23410: Support 128-bit SIMD on AArch64 via LLVM backend · Issues · Glasgow Haskell Compiler / GHC · GitLab
- !10456: Support 128-bit SIMD on AArch64 via LLVM backend · Merge requests · Glasgow Haskell Compiler / GHC · GitLab
So GHC 9.8 has support for SIMD on AArch64 via LLVM backend.
If you are using ghcup on macOS, you can enable LLVM backend (-fllvm
) via the following command:
$ brew install llvm@15
$ OPT=$(brew --prefix llvm@15)/bin/opt \
LLC=$(brew --prefix llvm@15)/bin/llc \
ghcup install ghc --force 9.8.4
(See also How to use GHC’s LLVM backend)
Support via the NCG backend is also desirable, but it would be a non-trivial task.
WebAssembly
WebAssembly has 128-bit wide SIMD, so it would be good to have support for them.
There is an issue: #22618: Add SIMD128 support to the wasm backend · Issues · Glasgow Haskell Compiler / GHC · GitLab
Other Targets
Some of the other targets have support for SIMD, but with some notes.
PowerPC has 128-bit AltiVec, but does not support double precision floating-points.
Should GHC emulate DoubleX2#
via scalar operations, or choose not to support it?
32-bit Arm has NEON as an extension, but the floating-point vector operations do not support subnormals (they always use flush-to-zero mode).
Is this acceptable?
Should GHC use floating-point vector instructions only if -ffast-math
is set?
RISC-V has vector extension, but I don’t know how close its design is to GHC’s design. There is some effort:
- #25331: RISC-V vector extension support (SIMD) · Issues · Glasgow Haskell Compiler / GHC · GitLab
- !13467: Draft: RISC-V vectors · Merge requests · Glasgow Haskell Compiler / GHC · GitLab
Other backends (like via-C or JS) has no SIMD support.
More Primitives
The number of operations implemented in current GHC is quite limited. Traditionally (GHC ≤ 9.10), the available operations were the following:
- Basic operations
pack
insert
unpack
broadcast
index
/read
/write
for arrays and pointers
plus
minus
times
divide
(for floating-points)quot
/rem
(for integers)- Interestingly, there is no SIMD instructions for integer division on x86.
negate
(for signed integers and floating-points)
In GHC 9.12, additional primitives such as min
, max
, FMA, and shuffle
were introduced.
Still, some operations that are frequently used are missing:
sqrt
abs
- Bitwise operations
In fact, the LLVM backend can generate these instructions without dedicated vector primitives. For example, LLVM’s optimizer would fuse the following code into a vector instruction:
# :: FloatX4# -> FloatX4#
sqrtFloatX4# v = case unpackFloatX4# v of
sqrtFloatX4# x0, x1, x2, x3 #) ->
(# (# sqrtFloat# x0
packFloatX4# x1
, sqrtFloat# x2
, sqrtFloat# x3 #) , sqrtFloat
Nevertheless, adding dedicated vector primitives for these operations would be desirable, because implementing such optimizations in the NCG would not be pleasant.
In addition to these, primitives related to conditionals (or masks) are currently missing.
Comparison operations, logical operations, conditional selection, and so on.
It is worth considering how to represent masks.
Should we introduce a generic type like BoolX4#
that is independent of the vector’s element type (like LLVM’s abstraction), or should we define type-specific masks such as FloatX4Mask#
for each vector type?
FFI
With the UnliftedFFITypes
extension enabled, vector types can be used in FFI.
However, on Windows, the default C calling convention requires all 128-bit vectors to be passed by reference.
To address this, it may be beneficial to support the vectorcall
calling convention, which allows vector arguments to be passed directly in registers.
Designing a SIMD Library
SIMD primitives are low-level features and not for end-users. As mentioned earlier, libraries should provide user-friendly wrappers. But what should such a library look like?
Boxed Wrappers
The use of type families to define wrappers based on the number of elements, as seen in the simd
package, is an elegant approach.
This design has the advantage of being isomorphic to fixed-length vector types, for example:
-- Pseudocode
data X4 a = X4 !a !a !a !a
Another appealing idea is the approach taken by the primitive-simd
package, which uses two DoubleX2#
values to represent a vector of four Double
s.
This allows for flexibility in supporting platforms with varying SIMD capabilities.
Combining these ideas, my proposed wrapper types would look like this:
data X4 a
data instance X4 Float = FloatX4 FloatX4#
#if MAX_VEC_LEN == 128
data instance X4 Double = DoubleX2X2 DoubleX2# DoubleX2#
#elif MAX_VEC_LEN >= 256
data instance X4 Double = DoubleX4 DoubleX4#
#endif
Type Classes
Let’s consider the type classes. I want to have instances like these:
instance (??? a) => Num (X4 a)
instance (??? a) => Fractional (X4 a)
What should the superclass constraints (??? a
) be?
One approach is to define specialized classes for X4
:
class BroadcastX4 a where
broadcastX4 :: a -> X4 a
instance BroadcastX4 Float where ...
instance BroadcastX4 Double where ...
class BroadcastX4 a => NumX4 a where
plusX4 :: X4 a -> X4 a -> X4 a
-- ...
instance NumX4 Float where ..
instance NumX4 Double where ..
instance (Num a, NumX4 a) => Num (X4 a) where
+) = plusX4
(-- ...
A more general approach would allow for other vector lengths, such as X8
or X16
:
class Broadcast f a where
broadcast :: a -> f a
instance Broadcast X4 Float where ...
instance Broadcast X4 Double where ...
-- instances for X8, X16 could be added
class Broadcast f a => NumF f a where
plusF :: f a -> f a -> f a
minusF :: f a -> f a -> f a
timesF :: f a -> f a -> f a
absF :: f a -> f a
signumF :: f a -> f a
negateF :: f a -> f a
instance NumF X4 Float where ...
instance NumF X4 Double where ...
-- instances for X8, X16 could be added
instance (NumF X4 a, Num a) => Num (X4 a) where
+) = plusF
(-- ...
Mapping a Long Vector
To process a (long) vector using SIMD data types, let’s define Storable
-like classes:
class Storable a => StorableF f a where
peekElemOffF :: Ptr a -> Int -> IO (f a) -- index in scalar elements
pokeElemOffF :: Ptr a -> Int -> f a -> IO () -- index in scalar elements
instance StorableF X4 Float where ...
instance StorableF X4 Double where ...
Using these, a map
function that supports SIMD could be implemented like this:
mapStorable :: (StorableF X4 a, StorableF X4 b)
=> (X4 a -> X4 b) -> (a -> b)
-> VS.Vector a -> VS.Vector b
!v = unsafePerformIO $ do
mapStorable fv f let !n = VS.length v
!result <- VSM.unsafeNew n
$ \ !inputPtr ->
VS.unsafeWith v $ \ !resultPtr -> do
VSM.unsafeWith result let loopVector !i | i + 4 > n = loopScalar i
| otherwise = do
!a <- peekElemOffF inputPtr i
pokeElemOffF resultPtr i (fv a)+ 4)
loopVector (i !i | i >= n = pure ()
loopScalar | otherwise = do
!a <- peekElemOff inputPtr i
pokeElemOff resultPtr i (f a)+ 1)
loopScalar (i 0
loopVector
VS.unsafeFreeze result{-# INLINE mapStorable #-}
For simplicity, fusion-like optimizations are not implemented.
A use of this function would look like:
-> (v + 1)^(10 :: Int))
mapStorable (\v -> (x + 1)^(10 :: Int))
(\x 0..10] :: VS.Vector Float) (VS.fromList [
Unifying the SIMD and Scalar Parts
The mapStorable
function that we just defined requires separate functions for SIMD (X4 a -> X4 b
) and scalar (a -> b
) operations, leading to code duplication.
Can we avoid this?
If the operation is f x = (x + 1)^10
, the function type could be generalized to Num a => a -> a
, and we could have:
mapStorableNum :: (StorableF X4 a, Num a, NumF X4 a)
=> (forall v. Num v => v -> v)
-> VS.Vector a -> VS.Vector a
= mapStorable f f mapStorableNum f
However, this approach requires separate map
functions for each constraints (Num
, Fractional
, Bits
, etc.), which is undesirable.
Let’s consider another approach.
The scalar function is isomorphic to Identity a -> Identity b
, so both the SIMD part and scalar part could be written as f a -> f b
.
I mean, if we have a type class SIMD
with X4
and Identity
as instances, then we could write:
mapStorable' :: (StorableF X4 a, Storable X4 b)
=> (forall f. SIMD f => f a -> f b)
-> VS.Vector a -> VS.Vector b
= mapStorable f (runIdentity . f . Identity) mapStorable' f
What should the SIMD f
constraint be?
f
should have instances like Num (f Float)
, Num (f Double)
.
Ideally, we would like to have Num (f a)
for any Num a
.
That is:
class (forall a. Num a => Num (f a)) => SIMD f
instance SIMD Identity
instance SIMD X4
…except that it’s impossible.
We defined X4 a
only for a ~ Float
and a ~ Double
.
We are not going to have X4 Integer
or X4 Rational
.
So, let’s add the additional constraint:
class (forall a. (Num a, NumF X4 a) => Num (f a)) => SIMD f
instance SIMD Identity
instance SIMD X4
If we are going to have types for other element-counts like X8
or X16
, we can add them:
class ( forall a. (Num a, NumF X4 a, NumF X8 a, NumF X16 a) => Num (f a)
=> SIMD f
) instance SIMD Identity
instance SIMD X4
instance SIMD X8
instance SIMD X16
If we are going to support other type classes, we can add them:
type NumElement a = (Num a, NumF X4 a, NumF X8 a, NumF X16 a)
type FractionalElement a = (Fractional a, FractionalF X4 a,
FractionalF X8 a, FractionalF X16 a)
type FloatingElement a = (Floating a, FloatingF X4 a,
FloatingF X8 a, FloatingF X16 a)
type MonoidElement a = (Monoid a, MonoidF X4 a, MonoidF X8 a, MonoidF X16 a)
class ( forall a. NumElement a => Num (f a)
forall a. FractionalElement a => Fractional (f a)
, forall a. FloatingElement a => Floating (f a)
, forall a. MonoidElement a => Monoid (f a)
, => SIMD f
) instance SIMD Identity
instance SIMD X4
instance SIMD X8
instance SIMD X16
The user can write as
f :: Num a => a -> a
= (x + 1)^(10 :: Int)
f x
0..10]) mapStorable' f (VS.fromList [
or
g :: (SIMD f, NumElement a) => f a -> f a
= (x + 1)^(10 :: Int)
g x
0..10]) mapStorable' g (VS.fromList [
A Working Example
I have a working example for SIMD at haskell-dsl-example/simd.
$ git clone https://github.com/minoki/haskell-dsl-example.git
$ cd haskell-dsl-example/simd
I use an Apple Silicon Mac, so I need to use GHC 9.8+ with LLVM backend enabled.
$ cabal configure -w ghc-9.8.4 --ghc-options=-fllvm
A successful run of cabal run
looks like:
$ cabal run
[1.0,1024.0,59049.0,1048576.0,9765625.0,6.0466176e7,2.8247526e8,1.0737418e9,3.4867843e9,1.0e10,2.5937424e10]
[1.0,1024.0,59049.0,1048576.0,9765625.0,6.0466176e7,2.8247526e8,1.0737418e9,3.4867843e9,1.0e10,2.5937424e10]
[1.0,1024.0,59049.0,1048576.0,9765625.0,6.0466176e7,2.8247526e8,1.0737418e9,3.4867843e9,1.0e10,2.5937424e10]
[1.0,1024.0,59049.0,1048576.0,9765625.0,6.0466176e7,2.8247526e8,1.0737418e9,3.4867843e9,1.0e10,2.5937424e10]
[1.0,1024.0,59049.0,1048576.0,9765625.0,6.0466176e7,2.82475249e8,1.073741824e9,3.486784401e9,1.0e10,2.5937424601e10]
[1.0,1024.0,59049.0,1048576.0,9765625.0,6.0466176e7,2.82475249e8,1.073741824e9,3.486784401e9,1.0e10,2.5937424601e10]
[1.0,1024.0,59049.0,1048576.0,9765625.0,6.0466176e7,2.82475249e8,1.073741824e9,3.486784401e9,1.0e10,2.5937424601e10]
[1.0,1024.0,59049.0,1048576.0,9765625.0,6.0466176e7,2.82475249e8,1.073741824e9,3.486784401e9,1.0e10,2.5937424601e10]
The benchmark is available at simd/benchmark/Main.hs. I defined two functions to benchmark:
f :: Num a => a -> a
= (x + 1)^(10 :: Int)
f x {-# INLINE f #-}
g :: Num a => a -> a
= let y = x + 1
g x = y * y
y2 = y2 * y2
y4 = y4 * y4
y8 in y8 * y2
{-# INLINE g #-}
Here is the benchmark result with Apple M4 Pro:
$ cabal bench -w ghc-9.8.4 --ghc-options=-fllvm -O2
benchmarking Float/f/scalar
time 3.206 μs (3.200 μs .. 3.211 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 3.207 μs (3.201 μs .. 3.215 μs)
std dev 22.70 ns (15.40 ns .. 37.09 ns)
benchmarking Float/f/vector
time 892.3 ns (889.6 ns .. 894.8 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 893.0 ns (890.9 ns .. 894.5 ns)
std dev 6.179 ns (4.750 ns .. 7.933 ns)
benchmarking Float/f/vector (unified)
time 897.8 ns (895.3 ns .. 900.2 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 898.7 ns (896.3 ns .. 902.2 ns)
std dev 9.800 ns (6.691 ns .. 16.75 ns)
benchmarking Float/g/scalar
time 3.215 μs (3.211 μs .. 3.219 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 3.213 μs (3.210 μs .. 3.220 μs)
std dev 15.61 ns (8.731 ns .. 28.17 ns)
benchmarking Float/g/vector
time 893.0 ns (890.7 ns .. 895.9 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 894.2 ns (892.1 ns .. 896.7 ns)
std dev 7.870 ns (6.011 ns .. 11.68 ns)
benchmarking Float/g/vector (unified)
time 901.2 ns (897.6 ns .. 904.0 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 898.0 ns (895.9 ns .. 899.9 ns)
std dev 6.522 ns (5.186 ns .. 8.885 ns)
benchmarking Double/f/scalar
time 3.185 μs (3.179 μs .. 3.193 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 3.188 μs (3.184 μs .. 3.199 μs)
std dev 21.43 ns (11.81 ns .. 36.86 ns)
benchmarking Double/f/vector
time 3.649 μs (3.621 μs .. 3.674 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 3.662 μs (3.640 μs .. 3.674 μs)
std dev 54.79 ns (39.52 ns .. 73.07 ns)
variance introduced by outliers: 13% (moderately inflated)
benchmarking Double/f/vector (unified)
time 3.671 μs (3.652 μs .. 3.688 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 3.672 μs (3.654 μs .. 3.683 μs)
std dev 45.43 ns (27.65 ns .. 83.19 ns)
benchmarking Double/g/scalar
time 3.221 μs (3.214 μs .. 3.228 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 3.220 μs (3.215 μs .. 3.225 μs)
std dev 17.02 ns (12.05 ns .. 28.72 ns)
benchmarking Double/g/vector
time 1.658 μs (1.656 μs .. 1.662 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.662 μs (1.659 μs .. 1.673 μs)
std dev 16.55 ns (4.710 ns .. 36.25 ns)
benchmarking Double/g/vector (unified)
time 1.661 μs (1.658 μs .. 1.664 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.658 μs (1.656 μs .. 1.662 μs)
std dev 9.381 ns (4.704 ns .. 17.62 ns)
The Float
code runs 3.206/0.8923≈3.59 times faster if vectorized.
Good.
For the Double
code, f
does not run faster.
On the other hand, g
runs 3.221/1.658≈1.94 times faster if vectorized.
Nice.
Here is results from another CPU, Ryzen 9 7940HS (Zen 4), run on Ubuntu on WSL2:
$ cabal bench -w ghc-9.8.4 --ghc-options=-fllvm -O2
benchmarking Float/f/scalar
time 4.190 μs (4.154 μs .. 4.231 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 4.201 μs (4.172 μs .. 4.261 μs)
std dev 139.4 ns (77.76 ns .. 238.0 ns)
variance introduced by outliers: 42% (moderately inflated)
benchmarking Float/f/vector
time 1.288 μs (1.276 μs .. 1.302 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 1.293 μs (1.282 μs .. 1.308 μs)
std dev 42.72 ns (30.33 ns .. 65.09 ns)
variance introduced by outliers: 45% (moderately inflated)
benchmarking Float/f/vector (unified)
time 1.290 μs (1.279 μs .. 1.301 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 1.291 μs (1.280 μs .. 1.304 μs)
std dev 42.05 ns (30.81 ns .. 60.24 ns)
variance introduced by outliers: 45% (moderately inflated)
benchmarking Float/g/scalar
time 4.187 μs (4.148 μs .. 4.235 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 4.200 μs (4.165 μs .. 4.257 μs)
std dev 149.5 ns (109.6 ns .. 213.1 ns)
variance introduced by outliers: 46% (moderately inflated)
benchmarking Float/g/vector
time 1.359 μs (1.350 μs .. 1.370 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 1.363 μs (1.355 μs .. 1.375 μs)
std dev 32.39 ns (21.39 ns .. 48.15 ns)
variance introduced by outliers: 30% (moderately inflated)
benchmarking Float/g/vector (unified)
time 1.366 μs (1.354 μs .. 1.378 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 1.365 μs (1.356 μs .. 1.378 μs)
std dev 35.55 ns (26.86 ns .. 46.75 ns)
variance introduced by outliers: 34% (moderately inflated)
benchmarking Double/f/scalar
time 4.214 μs (4.183 μs .. 4.246 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 4.225 μs (4.195 μs .. 4.280 μs)
std dev 131.8 ns (87.25 ns .. 222.4 ns)
variance introduced by outliers: 39% (moderately inflated)
benchmarking Double/f/vector
time 2.159 μs (2.140 μs .. 2.182 μs)
0.999 R² (0.998 R² .. 0.999 R²)
mean 2.208 μs (2.180 μs .. 2.253 μs)
std dev 116.2 ns (77.12 ns .. 170.5 ns)
variance introduced by outliers: 67% (severely inflated)
benchmarking Double/f/vector (unified)
time 2.146 μs (2.131 μs .. 2.165 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 2.153 μs (2.138 μs .. 2.172 μs)
std dev 57.85 ns (44.10 ns .. 86.56 ns)
variance introduced by outliers: 34% (moderately inflated)
benchmarking Double/g/scalar
time 4.192 μs (4.159 μs .. 4.225 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 4.195 μs (4.171 μs .. 4.234 μs)
std dev 100.5 ns (76.68 ns .. 130.0 ns)
variance introduced by outliers: 27% (moderately inflated)
benchmarking Double/g/vector
time 2.187 μs (2.171 μs .. 2.205 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 2.192 μs (2.177 μs .. 2.215 μs)
std dev 61.57 ns (44.70 ns .. 88.71 ns)
variance introduced by outliers: 36% (moderately inflated)
benchmarking Double/g/vector (unified)
time 2.181 μs (2.159 μs .. 2.207 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 2.193 μs (2.176 μs .. 2.225 μs)
std dev 75.01 ns (45.34 ns .. 123.4 ns)
variance introduced by outliers: 46% (moderately inflated)
The Float
code runs 4.190/1.288≈3.25 times faster if vectorized.
For the Double
code, f
does not run faster.
On the other hand, g
runs 4.192/2.181≈1.92 times faster if vectorized.
Let’s try NCG backend from GHC 9.12.
$ cabal bench -w ghc-9.12.1 --builddir=dist-ncg -O2 --allow-newer
benchmarking Float/f/scalar
time 38.18 μs (38.00 μs .. 38.37 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 38.00 μs (37.80 μs .. 38.19 μs)
std dev 670.4 ns (522.7 ns .. 874.4 ns)
variance introduced by outliers: 14% (moderately inflated)
benchmarking Float/f/vector
time 7.118 μs (7.049 μs .. 7.184 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 7.121 μs (7.080 μs .. 7.178 μs)
std dev 170.4 ns (140.2 ns .. 209.8 ns)
variance introduced by outliers: 26% (moderately inflated)
benchmarking Float/f/vector (unified)
time 7.151 μs (7.106 μs .. 7.195 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 7.131 μs (7.091 μs .. 7.177 μs)
std dev 147.8 ns (123.3 ns .. 186.5 ns)
variance introduced by outliers: 21% (moderately inflated)
benchmarking Float/g/scalar
time 13.76 μs (13.63 μs .. 13.89 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 13.39 μs (13.22 μs .. 13.54 μs)
std dev 539.9 ns (434.0 ns .. 699.9 ns)
variance introduced by outliers: 48% (moderately inflated)
benchmarking Float/g/vector
time 1.397 μs (1.384 μs .. 1.412 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 1.395 μs (1.384 μs .. 1.408 μs)
std dev 35.98 ns (29.38 ns .. 44.60 ns)
variance introduced by outliers: 33% (moderately inflated)
benchmarking Float/g/vector (unified)
time 1.199 μs (1.187 μs .. 1.210 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 1.195 μs (1.186 μs .. 1.205 μs)
std dev 30.76 ns (26.36 ns .. 35.65 ns)
variance introduced by outliers: 34% (moderately inflated)
benchmarking Double/f/scalar
time 38.59 μs (38.38 μs .. 38.83 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 38.65 μs (38.46 μs .. 38.91 μs)
std dev 699.1 ns (550.4 ns .. 890.0 ns)
variance introduced by outliers: 14% (moderately inflated)
benchmarking Double/f/vector
time 8.237 μs (8.185 μs .. 8.302 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 8.281 μs (8.229 μs .. 8.349 μs)
std dev 195.3 ns (145.6 ns .. 255.1 ns)
variance introduced by outliers: 26% (moderately inflated)
benchmarking Double/f/vector (unified)
time 8.267 μs (8.208 μs .. 8.332 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 8.291 μs (8.233 μs .. 8.363 μs)
std dev 205.6 ns (162.1 ns .. 272.7 ns)
variance introduced by outliers: 27% (moderately inflated)
benchmarking Double/g/scalar
time 13.91 μs (13.79 μs .. 14.03 μs)
0.999 R² (0.998 R² .. 0.999 R²)
mean 13.64 μs (13.47 μs .. 13.77 μs)
std dev 523.9 ns (437.4 ns .. 658.1 ns)
variance introduced by outliers: 46% (moderately inflated)
benchmarking Double/g/vector
time 2.667 μs (2.640 μs .. 2.701 μs)
0.999 R² (0.999 R² .. 0.999 R²)
mean 2.690 μs (2.671 μs .. 2.711 μs)
std dev 67.58 ns (53.11 ns .. 86.86 ns)
variance introduced by outliers: 31% (moderately inflated)
benchmarking Double/g/vector (unified)
time 2.267 μs (2.241 μs .. 2.298 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 2.264 μs (2.246 μs .. 2.288 μs)
std dev 68.17 ns (54.30 ns .. 86.66 ns)
variance introduced by outliers: 39% (moderately inflated)
Float/f
runs 38.18/7.118≈5.36 times faster if vectorized.
Float/g
runs 13.76/1.199≈11.5 times faster if vectorized.
Double/f
runs 38.59/8.237≈4.68 times faster if vectorized.
Double/g
runs 13.91/2.267≈6.14 times faster if vectorized.
With NCG backend, the degree of improvement is more than the number of parallelization (4 or 2). Maybe the scalar code is not optimized well.
Using CPU Features Conditionally
We have some SIMD support in x86 NCG backend, and we can write a library for vectorization. Is that all? No, we have another problem.
SSE2 is always available on x86-64, but other extensions like SSE4.1, AVX2, and AVX-512 are not available on all CPUs (or emulators).
Therefore, if a program wants to use AVX2, either it should be built exclusively for CPUs supporting AVX2, or conditionally use AVX2 by looking at CPUID.
Currently, GHC supports only the first option, via flags like -mavx2
.
This may be fine for programs that only need to run on a specific CPU, but not for libraries that need to run on many CPUs.
Therefore, it is desirable to have a mechanism to enable a CPU feature on a function (or execution path) basis rather than on a per-module basis, and a mechanism to switch the code to be executed by looking at CPUID.
As a prior art, GNU C has an attribute for selecting the target: __attribute__((target("avx2")))
.
Therefore, we could have something similar like:
{-# TARGET "avx2" #-}
foo :: X8 Int32 -> X8 Int32 -> X8 Int32
= 2 * x + y foo x y
There is a related issue: #15876: Function versioning instead of compilation flags… · Issues · Glasgow Haskell Compiler / GHC · GitLab
Now, I’m wondering if nullary type classes (or nullary type families) could be used to represent CPU features. I mean, something like
-- Using a nullary type class
class HasAVX2
-- -mavx2 would make the constraint `HasAVX2` available on that module
# :: HasAVX2 => Int32X8# -> Int32X8# -> Int32X8#
plusInt32X8
type Dict :: Constraint -> Type
data Dict c = c => Dict
-- modules that don't have -mavx2 enabled can use `detectAVX2` to conditionally enable AVX2
detectAVX2 :: Maybe (Dict HasAVX2)
= ... -- use CPUID to detect AVX2 detectAVX2
or
-- Using a nullary type family
type family AVX2 :: Bool
-- -mavx2 would make the equation `AVX2 ~ True` available on that module
# :: AVX2 ~ True => Int32X8# -> Int32X8# -> Int32X8#
plusInt32X8
data SBool (b :: Bool) where
STrue :: SBool True
SFalse :: SBool False
-- modules that don't have -mavx2 enabled can use `detectAVX2` to conditionally enable AVX2
detectAVX2 :: SBool AVX2
= ... -- use CPUID to detect AVX2 detectAVX2
This is just an idea (I’m no expert on GHC’s internals), and I don’t know if it works well when you want cross-platform support (x86, AArch64, RISC-V, …).
Conclusion
Although not complete, there is some progress on GHC’s SIMD support.
I plan to continue improving the SIMD support during my spare time. I’m also working on a library to wrap GHC’s SIMD primitives.
If you are interested, take a look at this ticket: #25030: Meta-issue for SIMD support in the native code generator · Issues · Glasgow Haskell Compiler / GHC · GitLab↩︎