Since I have an OrangePi AIPro (20T version) on hand, I thought I’d try running llama.cpp on it. I encountered some errors during the process (though compilation succeeded). After checking the issues, I found someone had already raised [CANN] Feature Request: Support OrangeAIPRO 310b CANN, which mentioned that llama.cpp’s CANN backend currently doesn’t support Ascend 310B. So I decided to tinker with it and familiarize myself with AscendC operator development along the way.
Based on llama.cpp commit e86f3c22211d9b5c3842e2961a022aac9cdbacad. The CANN version used is the community edition 8.5.0.alpha002.
Q1: Misaligned Memory Allocation Start Address
The BackTrace when the error occurred:

The error came from the ggml_backend_cpu_buffer_from_ptr function in ggml/src/ggml-backend.cpp (line 2265, the last frame in the backtrace):
```cpp
ggml_backend_buffer_t ggml_backend_cpu_buffer_from_ptr(void * ptr, size_t size) {
    GGML_ASSERT((uintptr_t)ptr % TENSOR_ALIGNMENT == 0 && "buffer pointer must be aligned");
    return ggml_backend_buffer_init(ggml_backend_cpu_buffer_from_ptr_type(), ggml_backend_cpu_buffer_from_ptr_i, ptr, size);
}
```
This was caused by the allocated Host memory not being aligned. TENSOR_ALIGNMENT is defined in ggml/src/ggml-impl.h as 32.
Initial investigation revealed that the memory allocated by CANN itself was not aligned. In other words, the result from aclrtMallocHost (appearing in the ggml_cann_host_malloc function) was not aligned. This behavior is documented in aclrtMallocHost:
> In Ascend RC mode, this interface allocates Device memory. The memory on the Device is allocated by ordinary pages. If 64-byte alignment of the start address is required, users need to handle the alignment themselves.
The OrangePi AIPro’s NPU happens to be in Ascend RC mode, so manual start address alignment is necessary.
Solution
```cpp
/**
 * @brief Allocates a new CANN host buffer of the specified size.
 *
 * This function allocates a new CANN host buffer with the given size.
 * @param size Size in bytes of the host buffer to allocate.
 * @return Pointer to the allocated host buffer, or nullptr if allocation fails.
 */
static void * ggml_cann_host_malloc(size_t size) {
    if (getenv("GGML_CANN_NO_PINNED") != nullptr) {
        return nullptr;
    }

    const size_t alignment = 128;
    size = GGML_PAD(size, alignment);
    if (size == 0) {
        size = alignment;
    }

    void * hostPtr = nullptr;
    aclError err = aclrtMallocHost((void **) &hostPtr, size);
    if (err != ACL_SUCCESS) {
        GGML_LOG_WARN("%s: failed to allocate %.2f MiB of pinned memory: %s\n", __func__, size / 1024.0 / 1024.0,
                      aclGetRecentErrMsg());
        return nullptr;
    }
    return hostPtr;
}
```
The above is the original implementation. It only pads size, not the address, and doesn’t account for the Ascend RC scenario.
Therefore, we need to allocate alignment - 1 additional bytes, which guarantees that an aligned start address exists somewhere within the allocated block:
```cpp
#ifdef ASCEND_310B
    size = GGML_PAD(size, alignment) + alignment - 1;
#else
    size = GGML_PAD(size, alignment);
#endif
```
Then pad the address after obtaining hostPtr:
```cpp
#ifdef ASCEND_310B
    uintptr_t hostPtrAddr = (uintptr_t)hostPtr;
    uintptr_t hostPtrPadAddr = GGML_PAD(hostPtrAddr, alignment);
    hostPtr = (void *)hostPtrPadAddr;
    GGML_ASSERT((uintptr_t)hostPtr % alignment == 0 && "buffer pointer must be aligned");
#endif
```
This allows passing the validation in ggml_backend_cpu_buffer_from_ptr later.
Q2: Cannot Free the Aligned Pointer Directly
Q1 indeed solved the start address alignment issue, but freeing the aligned address directly would cause an error in the ggml_backend_cann_host_buffer_free function:
```cpp
/**
 * @brief Free resources associated with a CANN host buffer.
 *
 * This function frees the resources associated with a CANN host buffer, including
 * its context.
 *
 * @param buffer The CANN host buffer to free.
 */
static void ggml_backend_cann_host_buffer_free(ggml_backend_buffer_t buffer) {
    ACL_CHECK(aclrtFreeHost(buffer->context));
}
```
So we need to find a way to store the original, unaligned pointer.
Solution
After checking llama.cpp’s buffer-related code, there doesn’t seem to be much room for customization (it’s unclear whether ggml_backend_buffer_t can hold custom data; it seems unlikely).
Therefore, a somewhat hacky approach is used (I’m not sure if there’s a better implementation). Allocate extra space for hostPtr, and ensure this extra space is also aligned:
```cpp
#ifdef ASCEND_310B
    size = GGML_PAD(size, alignment) + alignment * 2 - 1;
#else
    size = GGML_PAD(size, alignment);
#endif
```
The final aligned pointer hostFinalPadAddr is obtained by padding hostPtr and then offsetting by the minimum alignment space:
```cpp
    uintptr_t hostPtrAddr = (uintptr_t)hostPtr;
    uintptr_t hostPtrPadAddr = GGML_PAD(hostPtrAddr, alignment);
    uintptr_t hostFinalPadAddr = hostPtrPadAddr + alignment;
```
The original pointer is stored in the slot just before hostFinalPadAddr (i.e. offset back by one pointer width, sizeof(void *)):
```cpp
    hostPtr = (void *)hostFinalPadAddr;
    void ** slot = (void **)hostFinalPadAddr;
    slot[-1] = (void *)hostPtrAddr;
```
Thus, the pointer to be freed is retrieved from the element before the aligned address:
```cpp
#ifdef ASCEND_310B
    void * rawPtr = ((void **)buffer->context)[-1];
    ACL_CHECK(aclrtFreeHost(rawPtr));
#else
    ACL_CHECK(aclrtFreeHost(buffer->context));
#endif
```
Q3: Ascend310B1 Does Not Support aclrtMemGetAllocationGranularity
aclrtMemGetAllocationGranularity appears in ggml_cann_init. The current implementation doesn't wrap this call in the ACL_CHECK macro, so its failure goes unnoticed until other errors surface later.
The error code on OrangePi AIPro is 207000, which is ACL_ERROR_RT_FEATURE_NOT_SUPPORT.
This behavior is also documented in aclrtMemGetAllocationGranularity:
> On Atlas 200I/500 A2 inference products, Ascend RC mode does not support calling this interface.
Solution
Currently unclear what the backend uses this information for, so it’s temporarily removed (or simply ignored).
Q4: AclOpKernelInit Failure
With the previous issues fixed, we finally reach the actual inference stage (everything before this appears to have been a dry run).
BackTrace:

Solution
It turned out I simply hadn't installed the operator kernel package Ascend-cann-kernels-310b_8.5.0.alpha002. Everything worked after installing it.
Q5: Ascend310B1 Does Not Support Most Operators in the Operator Library
BackTrace:

The CANN Community Edition documentation (Operator Library Interface → Introduction) explicitly states that Atlas 200I/500 A2 inference products do not support the fusion operator interfaces and only partially support the NN operator interfaces (though they fully support the operator sets of ONNX, TensorFlow, and Caffe). Since the chip in the OrangePi AIPro should be the same Ascend310B1 used in the Atlas 200I DK A2, it should likewise lack fusion operator support.
Currently known (and important) unsupported interfaces include:
- Unquantized models (fp32, fp16): aclnnMm, aclnnBatchMatMul, aclnnMatmul;
- Quantized models (int8 and lower precision): aclnnWeightQuantBatchMatmulV2.

(Well, looking through the NN operator interfaces, very few seem to be supported at all. At the very least, the matmul class of NN operators is completely unsupported; the only supported one I found is aclnnConvolution.)
Solution
To support this going forward, would we need to implement these operators? Considering most operators are unsupported on Ascend310B, this is clearly unrealistic. NN operator and fusion operator code is in ops-nn, but it depends on opbase. I tried compiling it on OrangePi AIPro without success, so I gave up.
However, llama.cpp actually has a fallback mechanism (I learned this from the ggml-blas backend, which essentially implements only the GGML_OP_MUL_MAT and GGML_OP_OUT_PROD operators). At some stage (I haven't pinned down exactly where), llama.cpp queries each backend for operator (op) support, and ops the backend doesn't support can fall back to the ggml-cpu backend (where exactly the fallback happens is also unclear to me).
The function that checks operator support is ggml_backend_cann_supports_op, while the actual inference uses ggml_cann_compute_forward.
So we can start by not supporting any operators, letting it run on the CPU first, then gradually add operators based on performance hotspots. However, we must at least add GGML_OP_NONE, GGML_OP_RESHAPE, and other default operators that don’t require backend support.
That is:
```cpp
static bool ggml_backend_cann_supports_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    switch (op->op) {
#ifdef ASCEND_310B
        case GGML_OP_NONE:
        case GGML_OP_RESHAPE:
        case GGML_OP_VIEW:
        case GGML_OP_PERMUTE:
        case GGML_OP_TRANSPOSE:
            return true;
        default:
            return false;
#else
        // ... other Ascend devices
#endif
    }

    GGML_UNUSED(dev);
}
```
Q6: Tensors Allocated via ggml-backend Memory Pool Cannot Fallback to ggml-cpu
BackTrace:

My guess is that it detected the allocated memory was on the CANN Device, deemed it unable to execute on the CPU backend, and thus errored out (though I haven’t investigated the detailed cause).
Solution
On other Ascend devices, there might indeed be no good solution besides improving operator support. However, on OrangePi AIPro, this is relatively easy to solve.
Because the Ascend NPU in OrangePi AIPro doesn’t actually have its own independent Global Memory; it’s somewhat like Apple M-Series chips with unified memory. So Device-side memory allocated via CANN is actually on the Host side (not sure if other boards using Ascend310B have this architecture; also feels like this might be related to Ascend RC scenario, but not certain).
Conveniently, the backend’s ggml_backend_buffer_type_i has an is_host field to determine if the Buffer is on the Host side. The original implementation was:
```cpp
static bool ggml_backend_cann_buffer_type_is_host(ggml_backend_buffer_type_t buft) {
    return false;

    GGML_UNUSED(buft);
}
```
We simply change it to return true for Ascend 310B:
```cpp
static bool ggml_backend_cann_buffer_type_is_host(ggml_backend_buffer_type_t buft) {
#ifdef ASCEND_310B
    return true;
#endif
    return false;

    GGML_UNUSED(buft);
}
```
With these changes, llama.cpp now runs on the Ascend 310B, though for now every operator falls back to the CPU for execution.
