
Slim

Pure-Kotlin ARM64 NEON runtime for Android. Write SIMD instructions inline in Kotlin and have them executed by ART as if they were JIT-compiled Kotlin — no JNI per call, no NDK build, no separate .so.



Start here

The Guide walks you through writing your first NEON kernel — brightening a 16 MB float buffer ~7× faster than JIT-compiled Kotlin in 9 lines of inline DSL. Line by line, with the disassembled output, a runnable benchmark, and a "what just happened?" tour of the ART entry-point hijack. No prior assembly experience required.


What it looks like

val pixels = Floats(myFloatArray)

slim(pixels) {
    loadImm32(W4, java.lang.Float.floatToRawIntBits(0.5f))
    dup(V0, X4, S4)              // v0 = {0.5, 0.5, 0.5, 0.5} (broadcast)
    loadImm32(W3, pixels.size)
    mov(X1, X0)

    val loop = bindLabel()
    ld1(V2, X1, S4)              // v2 = pixels[i..i+3]
    fmul(V2, V2, V0, S4)         // v2 *= 0.5
    st1(V2, X1, S4)
    add(X1, X1, 16)
    sub(W3, W3, 4)
    cbnz(W3, loop)
}

That's the whole API. Two functions — Slim.initialize(context) once at startup, then slim(data) { ... } anywhere. Inside the block, raw ARM64 NEON: registers, instructions, vector arrangements, condition codes. The runtime handles JIT memory, ART internals, and dispatch.
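The cheapest way to validate a kernel like the one above is to run a plain scalar Kotlin version over the same input and compare outputs. A sketch of such a reference (like the NEON loop, it assumes the element count is a multiple of 4; `brightenScalar` is an illustrative name, not part of Slim's API):

```kotlin
// Scalar reference for the brighten kernel: pixels[i] *= 0.5f.
// Processes 4 elements per iteration to mirror the vector loop's shape.
fun brightenScalar(pixels: FloatArray) {
    require(pixels.size % 4 == 0) { "NEON loop assumes a multiple of 4" }
    var i = 0
    while (i < pixels.size) {
        pixels[i] *= 0.5f
        pixels[i + 1] *= 0.5f
        pixels[i + 2] *= 0.5f
        pixels[i + 3] *= 0.5f
        i += 4
    }
}
```

Keeping a scalar twin of every hand-written kernel makes correctness bugs show up as a simple array diff rather than as corrupted pixels.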


How fast?

Slim sits between scalar Kotlin (the floor — JIT-compiled, no SIMD) and hand-tuned native C++ with NEON intrinsics (the ceiling — what you'd otherwise ship in a .so). Two numbers worth knowing: how close to the native ceiling, and how much over the scalar floor.

vs hand-tuned native NEON — multi-device

How close does Slim's runtime-emitted code get to hand-written native NEON compiled with clang -O3? The :bench module answers on a level playing field: the same fused 8-stage NEON pipeline on both sides, the same direct ByteBuffer, the same thread, and byte-identical output enforced by a correctness gate; only the dispatch mechanism differs.
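A minimal sketch of what such a correctness gate looks like (illustrative, not the bench module's actual code):

```kotlin
// Both paths must produce byte-identical output before any timing counts;
// a silent miscompare would make every benchmark number meaningless.
fun correctnessGate(slimOut: ByteArray, jniOut: ByteArray) {
    require(slimOut.contentEquals(jniOut)) {
        "Outputs diverge: benchmark results are void"
    }
}
```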

Ran on 7 real devices via a cloud test farm:

Device               Android   1080p (2 MB)   4K (8 MB)
Pixel 10 Pro XL      17        11% slower     6% slower
Galaxy A54 5G        16        13% slower     7% slower
Oppo Reno13 F        15        12% slower     TIE
Galaxy A23 5G        14        9% slower      TIE
Galaxy Note20        13        TIE            TIE
Galaxy S20 FE 2022   12        13% slower     TIE
Oppo A94 5G          11        7% slower      TIE

TIE = Slim within 5% of JNI on that cell. Across 7 devices, three vendors (Google, Samsung, Oppo), and six Android versions (11 → 17), Slim never loses by more than 13% at 1080p, and matches JNI on 5 of 7 devices at 4K. Full screenshots and the bench module live in the bench/ directory.

vs scalar Kotlin

The reason a SIMD runtime exists at all. On a Samsung S24 (Cortex-X4, Android 16), a 16 MB SAXPY kernel:

Path                                     Time      Throughput   Speedup
Hot-path Kotlin scalar (JIT-compiled)    5.32 ms   3.0 GB/s     1.0×
Slim with FloatArray (eager copy)        2.22 ms   7.2 GB/s     2.4×
Slim with Floats (zero-copy)             0.76 ms   23.4 GB/s    6.95×
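For reference, SAXPY ("single-precision a·x plus y") is the loop below; a scalar Kotlin version like this is what the 1.0× baseline row measures (a sketch, not the benchmark's exact code):

```kotlin
// SAXPY baseline: y[i] = a * x[i] + y[i], plain JIT-compiled Kotlin.
fun saxpy(a: Float, x: FloatArray, y: FloatArray) {
    require(x.size == y.size)
    for (i in x.indices) y[i] = a * x[i] + y[i]
}
```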

Operational characteristics

Per-call dispatch overhead: ~3 µs
Concurrent dispatch: ~3 K calls/sec across 4 coroutines
Cold start: ~3 ms warm / ~10 ms uncached
Probe pool: up to 8 in-flight kernels before blocking
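Those numbers imply a break-even buffer size: below it, the ~3 µs dispatch overhead dominates and scalar Kotlin wins. A back-of-envelope calculation using the SAXPY throughputs above (illustrative arithmetic, not a published figure):

```kotlin
// Solve  bytes / scalarBps = overhead + bytes / simdBps  for bytes.
fun breakEvenBytes(overheadSec: Double, scalarBps: Double, simdBps: Double): Double =
    overheadSec / (1.0 / scalarBps - 1.0 / simdBps)

fun main() {
    // ~3 µs overhead, 3.0 GB/s scalar vs 23.4 GB/s zero-copy Slim
    val bytes = breakEvenBytes(3e-6, 3.0e9, 23.4e9)
    println("break-even ≈ %.0f KiB".format(bytes / 1024))  // roughly 10 KiB
}
```

In other words, kernels over buffers much smaller than ~10 KiB are dispatch-bound and not worth offloading.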


Install

// settings.gradle.kts
dependencyResolutionManagement {
    repositories {
        mavenCentral()
        maven(url = "https://jitpack.io")
    }
}

// app/build.gradle.kts
dependencies {
    implementation("com.github.iamjosephmj:Slim:0.1.2")
}

0.1.2 — V1 internal release

Public API surface (Slim / slim {}) is stable and won't change incompatibly. The underlying engine is still validating against new Android releases. See Production readiness for the kill-switch pattern + anti-tamper compatibility checklist before shipping.
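One possible shape for that kill-switch (an illustrative sketch; `KillSwitch` and `probe` are hypothetical names, with `probe` standing in for a guarded `Slim.initialize(context)` call):

```kotlin
// If the probe (e.g. Slim.initialize) fails on this device/OS combination,
// every subsequent call silently takes the scalar fallback path.
class KillSwitch(probe: () -> Unit) {
    val simdEnabled: Boolean = runCatching(probe).isSuccess

    fun <T> run(fast: () -> T, fallback: () -> T): T =
        if (simdEnabled) runCatching(fast).getOrElse { fallback() }
        else fallback()
}
```

The design point: the fast path is opt-in per call, so a single flag flip (or probe failure on an unvalidated Android release) degrades the whole app to scalar Kotlin instead of crashing it.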


Where to next

  • Guide
    Learn ARM64 NEON via Slim. Start from for loops and end with real kernels. No prior assembly experience required.

  • Cookbook
    Recipes for common kernels: SAXPY, dot product, color filters, blur, threshold, and debugging your own kernels.

  • Architecture
    How the runtime works: memfd dual-map, ART entry-point hijack, four-tier hidden-API bypass, encoder, label assembler.

  • Contributing
    Adding encoder helpers, the testing pattern, ART-internals work, and per-vendor bypass tweaks.