Some thoughts on GHC's SIMD primitives

Posted on January 13, 2025

Parallelization is essential for exploiting the performance of modern computers. In particular, we want to take advantage of SIMD, multiple cores, and GPUs. This article focuses on SIMD: single instruction, multiple data.

GHC has provided primitives for SIMD for years, but they have not been effectively utilized.

In this article, I’d like to discuss what can be done to promote the utilization of GHC’s SIMD primitives.

Disclaimer: I’m not a high-performance computing expert — just an enthusiast who enjoys experimenting with hardware capabilities for fun.

Current Status

Paper

I must mention the following paper, written by the people who implemented the SIMD primitives in GHC.

Geoffrey Mainland, Roman Leshchinskiy, and Simon Peyton Jones. Exploiting Vector Instructions with Generalized Stream Fusion. ICFP 2013.

In addition to GHC’s SIMD primitives, they seem to have considered integration with the vector package and Data Parallel Haskell (DPH). However, the integration with the vector package does not seem to have happened, and DPH did not survive.

Design

When exposing SIMD features to programming languages, there is the question of how to abstract (or not to abstract) the architecture-dependent vector register width.

GHC provides vector types that are 128, 256, and 512 bits wide, and they correspond directly to CPU registers of the same width. In other words, on x86 systems, the 512-bit wide vector types correspond to ZMM (AVX-512) registers; they are not split into two YMM registers or four XMM registers.
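Concretely, the primitive types exported from GHC.Exts are named by element type and lane count, and the register width is the product of the two (a few examples; the names are as exported from GHC.Exts):

-- 128-bit: FloatX4#, DoubleX2#, Int32X4#, Int64X2#, ...
-- 256-bit: FloatX8#, DoubleX4#, Int32X8#, Int64X4#, ...
-- 512-bit: FloatX16#, DoubleX8#, Int32X16#, Int64X8#, ...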

In contrast, LLVM’s abstraction allows vectors of arbitrary length to be written. Vectors exceeding the machine’s SIMD width are compiled into multiple instructions or registers.

Under GHC’s current design, it is up to libraries to abstract away the architecture-dependent SIMD width.

Library

Since GHC’s SIMD primitive types are unboxed, one would want boxed wrappers for daily use. With a Hackage search, I found two packages providing boxed wrappers: simd and primitive-simd.

The simd package is based on type families:

module Data.SIMD.SIMD4 where

class SIMD4 a where
  data X4 a
  plusX4 :: X4 a -> X4 a -> X4 a
  indexArrayAsX4 :: ByteArray -> Int -> X4 a
  broadcastX4 :: a -> X4 a
  -- ...

class SIMD4 a => SIMD4Float a where
  divideX4 :: X4 a -> X4 a -> X4 a

instance (Show a, SIMD4 a) => Show (X4 a)
instance (Num a, SIMD4 a) => Num (X4 a)
instance (Fractional a, SIMD4Float a) => Fractional (X4 a)

instance SIMD4 Float
instance SIMD4 Double
-- ...

vectorizeUnboxedX4 :: (SIMD4 a, VU.Unbox a) => VU.Vector a -> VU.Vector (X4 a)
unVectorizeUnboxedX4 :: (SIMD4 a, VU.Unbox a) => VU.Vector (X4 a) -> VU.Vector a
vectorizeStorableX4 :: (SIMD4 a, Storable a, Storable (X4 a)) => VS.Vector a -> VS.Vector (X4 a)
unVectorizeStorableX4 :: (SIMD4 a, Storable a, Storable (X4 a)) => VS.Vector (X4 a) -> VS.Vector a

This package organizes SIMD types by the number of elements, such as X4, X8, and X16. One thing to note is that X4 Double is a wrapper for DoubleX4# and therefore requires AVX.

Note that if you want to map over a Vector Float with these helpers, the vector’s length must be a multiple of 4.
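As an illustration, a SIMD map over an unboxed vector might look like this with the simd package’s API (a sketch assuming the Unbox and Num instances the package provides for X4 Float; untested):

import qualified Data.Vector.Unboxed as VU
import Data.SIMD.SIMD4

-- Add 1 to every element, four lanes at a time.
addOne :: VU.Vector Float -> VU.Vector Float
addOne = unVectorizeUnboxedX4 . VU.map (+ broadcastX4 1) . vectorizeUnboxedX4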

Let’s look at the other package, primitive-simd.

module Data.Primitive.SIMD where

class (Num v, Real (Elem v)) => SIMDVector v where
  type Elem v
  type ElemTuple v
  broadcastVector :: Elem v -> v
  mapVector :: (Elem v -> Elem v) -> v -> v
  -- ...

data FloatX4
data FloatX8
data FloatX16
data DoubleX2
data DoubleX4
data DoubleX8
-- ...

It provides individual data types, not type families, for the vector types (it uses type families for obtaining the element type from vector types).

One interesting point is that DoubleX4 does not always map to DoubleX4#: it can be configured to use two DoubleX2# values instead. This allows DoubleX4 to be used on SSE-only x86 machines or on NEON-only Arm machines.
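A minimal sketch of this technique (not the package’s actual code, which selects the representation with CPP; plusDoubleX2# is the GHC primop):

{-# LANGUAGE MagicHash #-}
import GHC.Exts

-- Four Doubles stored as two 128-bit halves; each operation is
-- applied to both halves.
data DoubleX4' = DoubleX4' DoubleX2# DoubleX2#

plusDoubleX4' :: DoubleX4' -> DoubleX4' -> DoubleX4'
plusDoubleX4' (DoubleX4' a b) (DoubleX4' a' b') =
  DoubleX4' (plusDoubleX2# a a') (plusDoubleX2# b b')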

I like the idea of using type families like X4/X8/X16, and I like the idea of using two DoubleX2# for X4 Double/DoubleX4. I’ll come back to this point later.

Improvements to GHC

This section explores ways to make GHC’s SIMD primitives more accessible and practical for users.

Expanding Backend Support

Historically (in GHC < 9.8), using SIMD primitives required:

  1. The LLVM backend
  2. An x86(-64) machine

This situation was not great: with SIMD primitives limited to a few backends, it’s no surprise that package authors have been hesitant to adopt them. To encourage broader use, GHC should support SIMD primitives across more backends.

Recently, progress has been made in this area. The following sections detail these advancements.

x86(-64) NCG

Currently, the most popular architecture for PCs is x86-64. Therefore, it would be beneficial to make the x86(-64) NCG support SIMD primitives.

GHC 9.12 added some support for SIMD to the x86-64 NCG. The work originally started as a GSoC project in 2018 and was merged in 2019, but it was later reverted because of incomplete handling of register spilling.

Last year (2024), @sheaf and @jaro continued the work and managed to get basic SIMD support into the x86(-64) NCG.

The initial implementation was limited to FloatX4# and DoubleX2# (plus packing/unpacking of Int64X2#), and some operations required SSE4.1 instructions (by default, GHC refuses to use SSE instructions beyond SSE2). Since then, I have worked to make the NCG emit SSE1 instructions for basic FloatX4# operations, I have implemented packing/unpacking of the other integer vector types, and I am now working on arithmetic operations for the integer vector types.

Although many tasks remain (256-bit and 512-bit support via AVX/AVX-512/AVX10, better code generation, etc.1), the progress is promising.

AArch64 LLVM/NCG

The next most popular architecture for PCs is probably AArch64 (Arm64). Therefore, it should be supported as well.

AArch64 offers 128-bit wide ASIMD (NEON) instructions and, optionally, SVE. NEON aligns well with GHC’s design, so it is the natural first target.

Supporting NEON via the LLVM backend was comparatively easy, and in fact I did it in 2023.

So GHC 9.8 and later support SIMD on AArch64 via the LLVM backend. If you are using ghcup on macOS, you can set up GHC for the LLVM backend (-fllvm) with the following commands:

$ brew install llvm@15
$ OPT=$(brew --prefix llvm@15)/bin/opt \
    LLC=$(brew --prefix llvm@15)/bin/llc \
    ghcup install ghc --force 9.8.4

(See also How to use GHC’s LLVM backend)

Support via the NCG backend is also desirable, but it would be a non-trivial task.

WebAssembly

WebAssembly has 128-bit wide SIMD, so it would be good to support it as well.

There is an issue: #22618: Add SIMD128 support to the wasm backend · Issues · Glasgow Haskell Compiler / GHC · GitLab

Other Targets

Some of the other targets have SIMD support, but with caveats.

PowerPC has 128-bit AltiVec, but it does not support double-precision floating-point. Should GHC emulate DoubleX2# via scalar operations, or choose not to support it at all?

32-bit Arm has NEON as an extension, but its floating-point vector operations do not support subnormals (they always operate in flush-to-zero mode). Is this acceptable? Should GHC use floating-point vector instructions only when -ffast-math is set?

RISC-V has the vector extension, but I don’t know how closely its design matches GHC’s. There is some ongoing effort in this area.

Other backends (such as via-C or JavaScript) have no SIMD support.

More Primitives

The number of operations implemented in current GHC is quite limited. Traditionally (GHC ≤ 9.10), the available operations were broadcasting, packing/unpacking, element insertion, arithmetic (addition, subtraction, multiplication, negation, division, quot/rem), and reading/writing of arrays and addresses.

In GHC 9.12, additional primitives such as min, max, FMA, and shuffle were introduced.
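For example, a per-lane clamp can now be written directly (a sketch assuming the GHC 9.12 primop names minFloatX4# and maxFloatX4#, following the usual naming pattern; requires a backend with SIMD support):

{-# LANGUAGE MagicHash #-}
import GHC.Exts

-- Clamp every lane of v into the range [lo, hi].
clampFloatX4# :: FloatX4# -> FloatX4# -> FloatX4# -> FloatX4#
clampFloatX4# lo hi v = minFloatX4# hi (maxFloatX4# lo v)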

Still, some frequently used operations are missing; square root is one example.

In fact, the LLVM backend can generate these instructions without dedicated vector primitives. For example, LLVM’s optimizer would fuse the following code into a vector instruction:

{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
import GHC.Exts

-- Unpack the vector, take four scalar square roots, and repack.
sqrtFloatX4# :: FloatX4# -> FloatX4#
sqrtFloatX4# v = case unpackFloatX4# v of
  (# x0, x1, x2, x3 #) ->
    packFloatX4# (# sqrtFloat# x0
                  , sqrtFloat# x1
                  , sqrtFloat# x2
                  , sqrtFloat# x3 #)

Nevertheless, adding dedicated vector primitives for these operations would be desirable, because implementing such optimizations in the NCG would not be pleasant.

In addition to these, primitives related to conditionals (or masks) are currently missing: comparison operations, logical operations, conditional selection, and so on. It is worth considering how to represent masks. Should we introduce a generic type like BoolX4# that is independent of the vector’s element type (like LLVM’s abstraction), or should we define type-specific masks such as FloatX4Mask# for each vector type?
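To make the design question concrete, the two options would give comparison and selection primitives of roughly these shapes (hypothetical signatures; neither exists today):

-- (a) A generic mask type, independent of the element type:
--     ltFloatX4#     :: FloatX4# -> FloatX4# -> BoolX4#
--     selectFloatX4# :: BoolX4# -> FloatX4# -> FloatX4# -> FloatX4#

-- (b) An element-type-specific mask type:
--     ltFloatX4#     :: FloatX4# -> FloatX4# -> FloatX4Mask#
--     selectFloatX4# :: FloatX4Mask# -> FloatX4# -> FloatX4# -> FloatX4#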

FFI

With the UnliftedFFITypes extension enabled, vector types can be used in FFI.

However, on Windows, the default C calling convention requires all 128-bit vectors to be passed by reference. To address this, it may be beneficial to support the vectorcall calling convention, which allows vector arguments to be passed directly in registers.
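For illustration, importing a C routine that operates on vectors could look like this (a sketch; my_simd_routine is a hypothetical C function):

{-# LANGUAGE MagicHash, UnliftedFFITypes #-}
import GHC.Exts (FloatX4#)

-- Pass 128-bit vectors directly across the FFI.
foreign import ccall unsafe "my_simd_routine"
  c_mySimdRoutine :: FloatX4# -> FloatX4# -> FloatX4#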

Designing a SIMD Library

SIMD primitives are low-level features, not something for end users. As mentioned earlier, libraries should provide user-friendly wrappers. But what should such a library look like?

Boxed Wrappers

The use of type families to define wrappers indexed by the number of elements, as seen in the simd package, is an elegant approach. This design has the advantage that the wrapper is isomorphic to a fixed-length vector type, for example:

-- Pseudocode
data X4 a = X4 !a !a !a !a

Another appealing idea is the approach taken by the primitive-simd package, which uses two DoubleX2# values to represent a vector of four Doubles. This allows for flexibility in supporting platforms with varying SIMD capabilities.

Combining these ideas, my proposed wrapper types would look like this:

data X4 a
data instance X4 Float = FloatX4 FloatX4#
#if MAX_VEC_LEN == 128
data instance X4 Double = DoubleX2X2 DoubleX2# DoubleX2#
#elif MAX_VEC_LEN >= 256
data instance X4 Double = DoubleX4 DoubleX4#
#endif

Type Classes

Let’s consider the type classes. I want to have instances like these:

instance (??? a) => Num (X4 a)
instance (??? a) => Fractional (X4 a)

What should the superclass constraints (??? a) be? One approach is to define specialized classes for X4:

class BroadcastX4 a where
  broadcastX4 :: a -> X4 a

instance BroadcastX4 Float where ...
instance BroadcastX4 Double where ...

class BroadcastX4 a => NumX4 a where
  plusX4 :: X4 a -> X4 a -> X4 a
  -- ...

instance NumX4 Float where ...
instance NumX4 Double where ...

instance (Num a, NumX4 a) => Num (X4 a) where
  (+) = plusX4
  -- ...

A more general approach would allow for other vector lengths, such as X8 or X16:

class Broadcast f a where
  broadcast :: a -> f a

instance Broadcast X4 Float where ...
instance Broadcast X4 Double where ...
-- instances for X8, X16 could be added

class Broadcast f a => NumF f a where
  plusF :: f a -> f a -> f a
  minusF :: f a -> f a -> f a
  timesF :: f a -> f a -> f a
  absF :: f a -> f a
  signumF :: f a -> f a
  negateF :: f a -> f a

instance NumF X4 Float where ...
instance NumF X4 Double where ...
-- instances for X8, X16 could be added

instance (NumF X4 a, Num a) => Num (X4 a) where
  (+) = plusF
  -- ...

Mapping a Long Vector

To process a (long) vector using SIMD data types, let’s define Storable-like classes:

class Storable a => StorableF f a where
  peekElemOffF :: Ptr a -> Int -> IO (f a) -- index in scalar elements
  pokeElemOffF :: Ptr a -> Int -> f a -> IO () -- index in scalar elements

instance StorableF X4 Float where ...
instance StorableF X4 Double where ...

Using these, a map function that supports SIMD could be implemented like this:

import qualified Data.Vector.Storable as VS
import qualified Data.Vector.Storable.Mutable as VSM
import Foreign.Storable (peekElemOff, pokeElemOff)
import System.IO.Unsafe (unsafePerformIO)

mapStorable :: (StorableF X4 a, StorableF X4 b)
            => (X4 a -> X4 b) -> (a -> b)
            -> VS.Vector a -> VS.Vector b
mapStorable fv f !v = unsafePerformIO $ do
  let !n = VS.length v
  !result <- VSM.unsafeNew n
  VS.unsafeWith v $ \ !inputPtr ->
    VSM.unsafeWith result $ \ !resultPtr -> do
      let loopVector !i | i + 4 > n = loopScalar i  -- fewer than 4 elements left
                        | otherwise = do
                          !a <- peekElemOffF inputPtr i
                          pokeElemOffF resultPtr i (fv a)
                          loopVector (i + 4)
          -- process the remaining (n `mod` 4) elements one at a time
          loopScalar !i | i >= n = pure ()
                        | otherwise = do
                          !a <- peekElemOff inputPtr i
                          pokeElemOff resultPtr i (f a)
                          loopScalar (i + 1)
      loopVector 0
  VS.unsafeFreeze result
{-# INLINE mapStorable #-}

For simplicity, fusion-like optimizations are not implemented.

A use of this function would look like:

mapStorable (\v -> (v + 1)^(10 :: Int))
            (\x -> (x + 1)^(10 :: Int))
            (VS.fromList [0..10] :: VS.Vector Float)

Unifying the SIMD and Scalar Parts

The mapStorable function that we just defined requires separate functions for SIMD (X4 a -> X4 b) and scalar (a -> b) operations, leading to code duplication. Can we avoid this?

If the operation is f x = (x + 1)^10, the function type could be generalized to Num a => a -> a, and we could have:

mapStorableNum :: (StorableF X4 a, Num a, NumF X4 a)
               => (forall v. Num v => v -> v)
               -> VS.Vector a -> VS.Vector a
mapStorableNum f = mapStorable f f

However, this approach requires a separate map function for each constraint (Num, Fractional, Bits, etc.), which is undesirable.

Let’s consider another approach. A scalar function a -> b is isomorphic to Identity a -> Identity b, so both the SIMD part and the scalar part could be written in the form f a -> f b. That is, if we have a type class SIMD with X4 and Identity as instances, we could write:

import Data.Functor.Identity (Identity (..))

mapStorable' :: (StorableF X4 a, StorableF X4 b)
             => (forall f. SIMD f => f a -> f b)
             -> VS.Vector a -> VS.Vector b
mapStorable' f = mapStorable f (runIdentity . f . Identity)

What should the SIMD f constraint be? f should have instances like Num (f Float), Num (f Double). Ideally, we would like to have Num (f a) for any Num a. That is:

class (forall a. Num a => Num (f a)) => SIMD f
instance SIMD Identity
instance SIMD X4

…except that this is impossible: we defined X4 a only for a ~ Float and a ~ Double, and we are not going to have X4 Integer or X4 Rational. So let’s add an additional constraint:

class (forall a. (Num a, NumF X4 a) => Num (f a)) => SIMD f
instance SIMD Identity
instance SIMD X4

If we are going to have types for other element counts, like X8 or X16, we can add them:

class ( forall a. (Num a, NumF X4 a, NumF X8 a, NumF X16 a) => Num (f a)
      ) => SIMD f
instance SIMD Identity
instance SIMD X4
instance SIMD X8
instance SIMD X16

If we are going to support other type classes, we can add them:

type NumElement a = (Num a, NumF X4 a, NumF X8 a, NumF X16 a)
type FractionalElement a = (Fractional a, FractionalF X4 a,
                            FractionalF X8 a, FractionalF X16 a)
type FloatingElement a = (Floating a, FloatingF X4 a,
                          FloatingF X8 a, FloatingF X16 a)
type MonoidElement a = (Monoid a, MonoidF X4 a, MonoidF X8 a, MonoidF X16 a)
class ( forall a. NumElement a => Num (f a)
      , forall a. FractionalElement a => Fractional (f a)
      , forall a. FloatingElement a => Floating (f a)
      , forall a. MonoidElement a => Monoid (f a)
      ) => SIMD f
instance SIMD Identity
instance SIMD X4
instance SIMD X8
instance SIMD X16

The user can then write either

f :: Num a => a -> a
f x = (x + 1)^(10 :: Int)

mapStorable' f (VS.fromList [0..10] :: VS.Vector Float)

or

g :: (SIMD f, NumElement a) => f a -> f a
g x = (x + 1)^(10 :: Int)

mapStorable' g (VS.fromList [0..10] :: VS.Vector Float)

A Working Example

I have a working example for SIMD at haskell-dsl-example/simd.

$ git clone https://github.com/minoki/haskell-dsl-example.git
$ cd haskell-dsl-example/simd

I use an Apple Silicon Mac, so I need GHC 9.8 or later with the LLVM backend enabled.

$ cabal configure -w ghc-9.8.4 --ghc-options=-fllvm

A successful run of cabal run looks like:

$ cabal run
[1.0,1024.0,59049.0,1048576.0,9765625.0,6.0466176e7,2.8247526e8,1.0737418e9,3.4867843e9,1.0e10,2.5937424e10]
[1.0,1024.0,59049.0,1048576.0,9765625.0,6.0466176e7,2.8247526e8,1.0737418e9,3.4867843e9,1.0e10,2.5937424e10]
[1.0,1024.0,59049.0,1048576.0,9765625.0,6.0466176e7,2.8247526e8,1.0737418e9,3.4867843e9,1.0e10,2.5937424e10]
[1.0,1024.0,59049.0,1048576.0,9765625.0,6.0466176e7,2.8247526e8,1.0737418e9,3.4867843e9,1.0e10,2.5937424e10]
[1.0,1024.0,59049.0,1048576.0,9765625.0,6.0466176e7,2.82475249e8,1.073741824e9,3.486784401e9,1.0e10,2.5937424601e10]
[1.0,1024.0,59049.0,1048576.0,9765625.0,6.0466176e7,2.82475249e8,1.073741824e9,3.486784401e9,1.0e10,2.5937424601e10]
[1.0,1024.0,59049.0,1048576.0,9765625.0,6.0466176e7,2.82475249e8,1.073741824e9,3.486784401e9,1.0e10,2.5937424601e10]
[1.0,1024.0,59049.0,1048576.0,9765625.0,6.0466176e7,2.82475249e8,1.073741824e9,3.486784401e9,1.0e10,2.5937424601e10]

The benchmark is available at simd/benchmark/Main.hs. I defined two functions to benchmark:

f :: Num a => a -> a
f x = (x + 1)^(10 :: Int)
{-# INLINE f #-}

g :: Num a => a -> a
g x = let y = x + 1
          y2 = y * y
          y4 = y2 * y2
          y8 = y4 * y4
      in y8 * y2
{-# INLINE g #-}

Here are the benchmark results on an Apple M4 Pro:

$ cabal bench -w ghc-9.8.4 --ghc-options=-fllvm -O2
benchmarking Float/f/scalar
time                 3.206 μs   (3.200 μs .. 3.211 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.207 μs   (3.201 μs .. 3.215 μs)
std dev              22.70 ns   (15.40 ns .. 37.09 ns)

benchmarking Float/f/vector
time                 892.3 ns   (889.6 ns .. 894.8 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 893.0 ns   (890.9 ns .. 894.5 ns)
std dev              6.179 ns   (4.750 ns .. 7.933 ns)

benchmarking Float/f/vector (unified)
time                 897.8 ns   (895.3 ns .. 900.2 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 898.7 ns   (896.3 ns .. 902.2 ns)
std dev              9.800 ns   (6.691 ns .. 16.75 ns)

benchmarking Float/g/scalar
time                 3.215 μs   (3.211 μs .. 3.219 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.213 μs   (3.210 μs .. 3.220 μs)
std dev              15.61 ns   (8.731 ns .. 28.17 ns)

benchmarking Float/g/vector
time                 893.0 ns   (890.7 ns .. 895.9 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 894.2 ns   (892.1 ns .. 896.7 ns)
std dev              7.870 ns   (6.011 ns .. 11.68 ns)

benchmarking Float/g/vector (unified)
time                 901.2 ns   (897.6 ns .. 904.0 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 898.0 ns   (895.9 ns .. 899.9 ns)
std dev              6.522 ns   (5.186 ns .. 8.885 ns)

benchmarking Double/f/scalar
time                 3.185 μs   (3.179 μs .. 3.193 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.188 μs   (3.184 μs .. 3.199 μs)
std dev              21.43 ns   (11.81 ns .. 36.86 ns)

benchmarking Double/f/vector
time                 3.649 μs   (3.621 μs .. 3.674 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 3.662 μs   (3.640 μs .. 3.674 μs)
std dev              54.79 ns   (39.52 ns .. 73.07 ns)
variance introduced by outliers: 13% (moderately inflated)

benchmarking Double/f/vector (unified)
time                 3.671 μs   (3.652 μs .. 3.688 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.672 μs   (3.654 μs .. 3.683 μs)
std dev              45.43 ns   (27.65 ns .. 83.19 ns)

benchmarking Double/g/scalar
time                 3.221 μs   (3.214 μs .. 3.228 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.220 μs   (3.215 μs .. 3.225 μs)
std dev              17.02 ns   (12.05 ns .. 28.72 ns)

benchmarking Double/g/vector
time                 1.658 μs   (1.656 μs .. 1.662 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.662 μs   (1.659 μs .. 1.673 μs)
std dev              16.55 ns   (4.710 ns .. 36.25 ns)

benchmarking Double/g/vector (unified)
time                 1.661 μs   (1.658 μs .. 1.664 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.658 μs   (1.656 μs .. 1.662 μs)
std dev              9.381 ns   (4.704 ns .. 17.62 ns)

The Float code runs 3.206/0.8923≈3.59 times faster if vectorized. Good.

For the Double code, f does not run faster. On the other hand, g runs 3.221/1.658≈1.94 times faster if vectorized. Nice.

Here are the results from another CPU, a Ryzen 9 7940HS (Zen 4), run on Ubuntu on WSL2:

$ cabal bench -w ghc-9.8.4 --ghc-options=-fllvm -O2
benchmarking Float/f/scalar
time                 4.190 μs   (4.154 μs .. 4.231 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 4.201 μs   (4.172 μs .. 4.261 μs)
std dev              139.4 ns   (77.76 ns .. 238.0 ns)
variance introduced by outliers: 42% (moderately inflated)

benchmarking Float/f/vector
time                 1.288 μs   (1.276 μs .. 1.302 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 1.293 μs   (1.282 μs .. 1.308 μs)
std dev              42.72 ns   (30.33 ns .. 65.09 ns)
variance introduced by outliers: 45% (moderately inflated)

benchmarking Float/f/vector (unified)
time                 1.290 μs   (1.279 μs .. 1.301 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 1.291 μs   (1.280 μs .. 1.304 μs)
std dev              42.05 ns   (30.81 ns .. 60.24 ns)
variance introduced by outliers: 45% (moderately inflated)

benchmarking Float/g/scalar
time                 4.187 μs   (4.148 μs .. 4.235 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 4.200 μs   (4.165 μs .. 4.257 μs)
std dev              149.5 ns   (109.6 ns .. 213.1 ns)
variance introduced by outliers: 46% (moderately inflated)

benchmarking Float/g/vector
time                 1.359 μs   (1.350 μs .. 1.370 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 1.363 μs   (1.355 μs .. 1.375 μs)
std dev              32.39 ns   (21.39 ns .. 48.15 ns)
variance introduced by outliers: 30% (moderately inflated)

benchmarking Float/g/vector (unified)
time                 1.366 μs   (1.354 μs .. 1.378 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 1.365 μs   (1.356 μs .. 1.378 μs)
std dev              35.55 ns   (26.86 ns .. 46.75 ns)
variance introduced by outliers: 34% (moderately inflated)

benchmarking Double/f/scalar
time                 4.214 μs   (4.183 μs .. 4.246 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 4.225 μs   (4.195 μs .. 4.280 μs)
std dev              131.8 ns   (87.25 ns .. 222.4 ns)
variance introduced by outliers: 39% (moderately inflated)

benchmarking Double/f/vector
time                 2.159 μs   (2.140 μs .. 2.182 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 2.208 μs   (2.180 μs .. 2.253 μs)
std dev              116.2 ns   (77.12 ns .. 170.5 ns)
variance introduced by outliers: 67% (severely inflated)

benchmarking Double/f/vector (unified)
time                 2.146 μs   (2.131 μs .. 2.165 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 2.153 μs   (2.138 μs .. 2.172 μs)
std dev              57.85 ns   (44.10 ns .. 86.56 ns)
variance introduced by outliers: 34% (moderately inflated)

benchmarking Double/g/scalar
time                 4.192 μs   (4.159 μs .. 4.225 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 4.195 μs   (4.171 μs .. 4.234 μs)
std dev              100.5 ns   (76.68 ns .. 130.0 ns)
variance introduced by outliers: 27% (moderately inflated)

benchmarking Double/g/vector
time                 2.187 μs   (2.171 μs .. 2.205 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 2.192 μs   (2.177 μs .. 2.215 μs)
std dev              61.57 ns   (44.70 ns .. 88.71 ns)
variance introduced by outliers: 36% (moderately inflated)

benchmarking Double/g/vector (unified)
time                 2.181 μs   (2.159 μs .. 2.207 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 2.193 μs   (2.176 μs .. 2.225 μs)
std dev              75.01 ns   (45.34 ns .. 123.4 ns)
variance introduced by outliers: 46% (moderately inflated)

The Float code runs 4.190/1.288≈3.25 times faster if vectorized.

For the Double code, f does not run faster. On the other hand, g runs 4.192/2.181≈1.92 times faster if vectorized.

Let’s try the NCG backend of GHC 9.12.

$ cabal bench -w ghc-9.12.1 --builddir=dist-ncg -O2 --allow-newer
benchmarking Float/f/scalar
time                 38.18 μs   (38.00 μs .. 38.37 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 38.00 μs   (37.80 μs .. 38.19 μs)
std dev              670.4 ns   (522.7 ns .. 874.4 ns)
variance introduced by outliers: 14% (moderately inflated)

benchmarking Float/f/vector
time                 7.118 μs   (7.049 μs .. 7.184 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 7.121 μs   (7.080 μs .. 7.178 μs)
std dev              170.4 ns   (140.2 ns .. 209.8 ns)
variance introduced by outliers: 26% (moderately inflated)

benchmarking Float/f/vector (unified)
time                 7.151 μs   (7.106 μs .. 7.195 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 7.131 μs   (7.091 μs .. 7.177 μs)
std dev              147.8 ns   (123.3 ns .. 186.5 ns)
variance introduced by outliers: 21% (moderately inflated)

benchmarking Float/g/scalar
time                 13.76 μs   (13.63 μs .. 13.89 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 13.39 μs   (13.22 μs .. 13.54 μs)
std dev              539.9 ns   (434.0 ns .. 699.9 ns)
variance introduced by outliers: 48% (moderately inflated)

benchmarking Float/g/vector
time                 1.397 μs   (1.384 μs .. 1.412 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 1.395 μs   (1.384 μs .. 1.408 μs)
std dev              35.98 ns   (29.38 ns .. 44.60 ns)
variance introduced by outliers: 33% (moderately inflated)

benchmarking Float/g/vector (unified)
time                 1.199 μs   (1.187 μs .. 1.210 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 1.195 μs   (1.186 μs .. 1.205 μs)
std dev              30.76 ns   (26.36 ns .. 35.65 ns)
variance introduced by outliers: 34% (moderately inflated)

benchmarking Double/f/scalar
time                 38.59 μs   (38.38 μs .. 38.83 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 38.65 μs   (38.46 μs .. 38.91 μs)
std dev              699.1 ns   (550.4 ns .. 890.0 ns)
variance introduced by outliers: 14% (moderately inflated)

benchmarking Double/f/vector
time                 8.237 μs   (8.185 μs .. 8.302 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 8.281 μs   (8.229 μs .. 8.349 μs)
std dev              195.3 ns   (145.6 ns .. 255.1 ns)
variance introduced by outliers: 26% (moderately inflated)

benchmarking Double/f/vector (unified)
time                 8.267 μs   (8.208 μs .. 8.332 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 8.291 μs   (8.233 μs .. 8.363 μs)
std dev              205.6 ns   (162.1 ns .. 272.7 ns)
variance introduced by outliers: 27% (moderately inflated)

benchmarking Double/g/scalar
time                 13.91 μs   (13.79 μs .. 14.03 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 13.64 μs   (13.47 μs .. 13.77 μs)
std dev              523.9 ns   (437.4 ns .. 658.1 ns)
variance introduced by outliers: 46% (moderately inflated)

benchmarking Double/g/vector
time                 2.667 μs   (2.640 μs .. 2.701 μs)
                     0.999 R²   (0.999 R² .. 0.999 R²)
mean                 2.690 μs   (2.671 μs .. 2.711 μs)
std dev              67.58 ns   (53.11 ns .. 86.86 ns)
variance introduced by outliers: 31% (moderately inflated)

benchmarking Double/g/vector (unified)
time                 2.267 μs   (2.241 μs .. 2.298 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 2.264 μs   (2.246 μs .. 2.288 μs)
std dev              68.17 ns   (54.30 ns .. 86.66 ns)
variance introduced by outliers: 39% (moderately inflated)

Float/f runs 38.18/7.118≈5.36 times faster if vectorized. Float/g runs 13.76/1.199≈11.5 times faster if vectorized.

Double/f runs 38.59/8.237≈4.68 times faster if vectorized. Double/g runs 13.91/2.267≈6.14 times faster if vectorized.

With the NCG backend, the speedup exceeds the number of lanes (4 or 2); presumably the scalar code is not optimized well.

Using CPU Features Conditionally

We now have some SIMD support in the x86 NCG backend, and we can write a library for vectorization. Is that all? No, there is one more problem.

SSE2 is always available on x86-64, but other extensions like SSE4.1, AVX2, and AVX-512 are not available on every CPU (or emulator). Therefore, a program that wants to use AVX2 must either be built exclusively for CPUs that support AVX2, or detect AVX2 at run time via CPUID and use it conditionally. Currently, GHC supports only the first option, via flags like -mavx2.

This may be fine for programs that only need to run on a specific CPU, but not for libraries that need to run on many CPUs.

Therefore, it is desirable to have a mechanism to enable a CPU feature on a per-function (or per-execution-path) basis rather than per module, together with a mechanism to select the code to execute based on CPUID.

As prior art, GNU C has an attribute for selecting the target: __attribute__((target("avx2"))). GHC could have something similar:

{-# TARGET "avx2" #-}
foo :: X8 Int32 -> X8 Int32 -> X8 Int32
foo x y = 2 * x + y

There is a related issue: #15876: Function versioning instead of compilation flags… · Issues · Glasgow Haskell Compiler / GHC · GitLab

Now, I’m wondering whether nullary type classes (or nullary type families) could be used to represent CPU features. I mean something like this:

-- Using a nullary type class

class HasAVX2
-- -mavx2 would make the constraint `HasAVX2` available on that module

plusInt32X8# :: HasAVX2 => Int32X8# -> Int32X8# -> Int32X8#

type Dict :: Constraint -> Type
data Dict c = c => Dict

-- modules that don't have -mavx2 enabled can use `detectAVX2` to conditionally enable AVX2
detectAVX2 :: Maybe (Dict HasAVX2)
detectAVX2 = ... -- use CPUID to detect AVX2

or

-- Using a nullary type family

type family AVX2 :: Bool
-- -mavx2 would make the equation `AVX2 ~ True` available on that module

plusInt32X8# :: AVX2 ~ True => Int32X8# -> Int32X8# -> Int32X8#

data SBool (b :: Bool) where
  STrue :: SBool True
  SFalse :: SBool False

-- modules that don't have -mavx2 enabled can use `detectAVX2` to conditionally enable AVX2
detectAVX2 :: SBool AVX2
detectAVX2 = ... -- use CPUID to detect AVX2
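With either encoding, user code would branch once at run time and then run a statically specialized path. A hypothetical sketch using the class-based encoding (fastSum and slowSum are made-up helpers):

import Data.Int (Int32)
import qualified Data.Vector.Storable as VS

sumInt32s :: VS.Vector Int32 -> Int32
sumInt32s v = case detectAVX2 of
  Just Dict -> fastSum v -- fastSum :: HasAVX2 => VS.Vector Int32 -> Int32
  Nothing   -> slowSum v -- portable fallback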

This is just an idea (I’m no expert on GHC’s internals), and I don’t know if it works well when you want cross-platform support (x86, AArch64, RISC-V, …).

Conclusion

Although not complete, there is some progress on GHC’s SIMD support.

I plan to continue improving the SIMD support in my spare time. I’m also working on a library that wraps GHC’s SIMD primitives.


  1. If you are interested, take a look at this ticket: #25030: Meta-issue for SIMD support in the native code generator · Issues · Glasgow Haskell Compiler / GHC · GitLab↩︎