Potential accounting bug: AI Gateway /logs.cost charges more for cache hits than cache misses on openai/gpt-5.4-mini (plus related Gemini and Workers AI gaps) #550

@pemontto

We've been evaluating the new REST API at api.cloudflare.com/.../ai/v1/* as a replacement for the deprecated /compat/chat/completions path. So far it doesn't feel production-ready for our use case (an MSSP routing all upstream model calls through CF AI Gateway), and the findings below are the reason. We're filing them together because they came out of the same cache-cost investigation. Happy to split them into separate issues with cross-links; just say the word.


A. openai/gpt-5.4-mini: hot calls cost more than cold calls in /logs.cost

We ran the same long-prefix prompt repeatedly through /ai/v1/chat/completions to exercise OpenAI's automatic prompt cache. Each call uses a unique user-message tail (so CF's gateway response cache can't short-circuit it) but an identical long system prefix (so OpenAI's prompt cache hits on the prefix).

Six calls, all HTTP 200, gateway authentication: true, Unified Billing:

Call      usage.prompt_tokens  usage.prompt_tokens_details.cached_tokens  usage.completion_tokens  /logs.cost
1 (cold)  2422                 0                                          4                        $0.0018345
2 (hot)   2422                 2304                                       4                        $0.0020073
3 (hot)   2422                 2304                                       5                        $0.0020118
4 (hot)   2422                 2304                                       4                        $0.0020073
5 (hot)   2422                 2304                                       4                        $0.0020073
6 (hot)   2422                 2304                                       5                        $0.0020118

Two things stand out:

  1. Hot calls cost more than the cold call, despite OpenAI reporting 2304 of the 2422 input tokens as cache hits.
  2. The hot-vs-cold cost delta is consistently $0.0001728, which is exactly 2304 × $0.075/M. So /logs.cost behaves as if cached tokens carry a per-token surcharge of $0.075/M on top of the regular input rate, instead of a discount.
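
The arithmetic behind both observations can be checked directly from the table. The sketch below assumes per-million-token rates of $0.75 (input), $0.075 (cached input) and $4.50 (output). These rates are inferred from the logged costs themselves, not taken from any rate card, but they reproduce every row of the table. The cost_discounted helper shows what a cache-aware formula would yield under the same assumed rates.

```shell
# Rates in USD per million tokens.
# ASSUMPTION: inferred from the /logs.cost values in the table above,
# not confirmed against OpenAI's or Cloudflare's published pricing.
IN_RATE=0.75; CACHED_RATE=0.075; OUT_RATE=4.50

# cost_as_logged PROMPT CACHED COMPLETION
# Formula that matches /logs.cost: every prompt token pays the full input
# rate, and cached tokens pay the cached rate on top of it.
cost_as_logged() {
  awk -v p="$1" -v c="$2" -v o="$3" \
      -v i="$IN_RATE" -v cr="$CACHED_RATE" -v r="$OUT_RATE" \
      'BEGIN { printf "%.7f", (p*i + c*cr + o*r) / 1e6 }'
}

# cost_discounted PROMPT CACHED COMPLETION
# Cache-aware formula: cached tokens pay the cached rate instead of the
# input rate (assumes cached tokens are a subset of prompt tokens).
cost_discounted() {
  awk -v p="$1" -v c="$2" -v o="$3" \
      -v i="$IN_RATE" -v cr="$CACHED_RATE" -v r="$OUT_RATE" \
      'BEGIN { printf "%.7f", ((p-c)*i + c*cr + o*r) / 1e6 }'
}

cold=$(cost_as_logged 2422 0 4)
hot=$(cost_as_logged 2422 2304 4)
hot5=$(cost_as_logged 2422 2304 5)
discounted=$(cost_discounted 2422 2304 4)
delta=$(awk -v h="$hot" -v c="$cold" 'BEGIN { printf "%.7f", h - c }')

echo "cold=$cold hot=$hot hot5=$hot5 delta=$delta discounted=$discounted"
```

This prints cold=0.0018345 hot=0.0020073 hot5=0.0020118 delta=0.0001728 discounted=0.0002793. If the assumed rates are right, a hot call should cost roughly $0.00028, about a seventh of what /logs.cost reports.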

Worst case: customers are being billed more for cache hits than cache misses.

Reproducer:

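# Replace the placeholder with your Cloudflare account ID.
# CF_API_TOKEN and CF_GATEWAY_ID must already be set in the environment.
ACCT="<account>"
BASE="https://api.cloudflare.com/client/v4/accounts/${ACCT}/ai/v1"
H_AUTH="Authorization: Bearer ${CF_API_TOKEN}"
H_GW="cf-aig-gateway-id: ${CF_GATEWAY_ID}"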

LONG=$(yes "Static instruction line for context. " | head -400 | tr -d '\n')
NONCE="run-$(date +%s)"

for i in 1 2 3 4 5 6; do
  jq -nc --arg s "$LONG" --arg u "reply OK ${NONCE} call ${i}" \
    '{model:"openai/gpt-5.4-mini",messages:[{role:"system",content:$s},{role:"user",content:$u}],max_completion_tokens:5}' \
    | curl -s -X POST "${BASE}/chat/completions" -H "$H_AUTH" -H "$H_GW" -H "content-type: application/json" -d @- \
    | jq -c "{call:$i, prompt:.usage.prompt_tokens, cached:.usage.prompt_tokens_details.cached_tokens, completion:.usage.completion_tokens}"
done

sleep 15  # give the gateway time to write the log entries before querying
curl -s "https://api.cloudflare.com/client/v4/accounts/${ACCT}/ai-gateway/gateways/${CF_GATEWAY_ID}/logs?per_page=10" -H "$H_AUTH" \
  | jq -r '.result[] | select((.model // "") | tostring | contains("gpt-5.4-mini")) | "in=\(.tokens_in) out=\(.tokens_out) cached=\(.cached) cost=\(.cost)"'

B. google/gemini-*: cache and reasoning token fields stripped by the OpenAI translation

Same harness on google/gemini-3.1-flash-lite through /ai/v1/chat/completions. Response usage:

{
  "prompt_tokens": 2418,
  "completion_tokens": 16,
  "total_tokens": 2434,
  "extra_properties": {"google": {"traffic_type": "ON_DEMAND"}}
}

No prompt_tokens_details. No completion_tokens_details. No cache token field. No reasoning token field. Six identical-prefix calls all logged at the same $0.0006285, with no way to tell from the response whether Gemini's prompt cache hit.

But Gemini natively reports both. Calling the same gateway through CF's native /google-ai-studio/v1beta provider passthrough (gateway.ai.cloudflare.com/v1/{ACCT}/{GW}/google-ai-studio/v1beta/models/gemini-2.5-flash:generateContent) returns usageMetadata:

{
  "promptTokenCount": 2421,
  "candidatesTokenCount": 15,
  "totalTokenCount": 2533,
  "cachedContentTokenCount": 2029,
  "cacheTokensDetails": [{"modality": "TEXT", "tokenCount": 2029}],
  "thoughtsTokenCount": 97,
  "serviceTier": "standard"
}

So the upstream IS reporting cachedContentTokenCount and thoughtsTokenCount. CF's REST translation layer is dropping both. Expected mapping into the OpenAI shape:

  • usageMetadata.cachedContentTokenCount → usage.prompt_tokens_details.cached_tokens
  • usageMetadata.thoughtsTokenCount → usage.completion_tokens_details.reasoning_tokens

Without these, clients have no visibility into Gemini cache or reasoning state on the new REST endpoint, even though the provider-native path through CF already exposes them.
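
The expected mapping can be sketched as a jq filter over the native usageMetadata shown above. This is only an illustration of the target shape, not Cloudflare's translation code. One assumption goes beyond the two mappings listed: completion_tokens is computed as candidatesTokenCount + thoughtsTokenCount, following the OpenAI convention that reasoning tokens count toward completion tokens. The native totalTokenCount (2421 + 15 + 97 = 2533) suggests Gemini reports thoughts separately from candidates.

```shell
# Sample usageMetadata, copied from the native Gemini response above.
gemini_usage='{"promptTokenCount":2421,"candidatesTokenCount":15,"totalTokenCount":2533,"cachedContentTokenCount":2029,"thoughtsTokenCount":97}'

# Hypothetical translation into the OpenAI usage shape.
# ASSUMPTION: reasoning tokens are folded into completion_tokens; missing
# cache or thoughts fields default to 0.
openai_usage=$(printf '%s' "$gemini_usage" | jq -c '{
  prompt_tokens: .promptTokenCount,
  completion_tokens: ((.candidatesTokenCount // 0) + (.thoughtsTokenCount // 0)),
  total_tokens: .totalTokenCount,
  prompt_tokens_details: {cached_tokens: (.cachedContentTokenCount // 0)},
  completion_tokens_details: {reasoning_tokens: (.thoughtsTokenCount // 0)}
}')

echo "$openai_usage"
```

With the sample above, cached_tokens comes out as 2029 and reasoning_tokens as 97, which is the visibility the translated response currently lacks.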


C. @cf/moonshotai/kimi-k2.6: response shows cache hits but call never appears in /logs

Workers AI model called through /ai/v1/chat/completions with cf-aig-gateway-id. Six calls, all HTTP 200, all expose OpenAI-shape prompt_tokens_details.cached_tokens:

// Call 1 (cold)
{"prompt_tokens":2424,"completion_tokens":5,"total_tokens":2429,"prompt_tokens_details":{"cached_tokens":0}}
// Calls 2-6 (hot)
{"prompt_tokens":2423,"completion_tokens":5,"total_tokens":2428,"prompt_tokens_details":{"cached_tokens":2368}}

But GET /ai-gateway/gateways/{GW}/logs records zero entries with model matching kimi from this run, while google/gemini-3.1-flash-lite and openai/gpt-5.4-mini calls in the same window all show up.

There is no way to retrieve per-request cost or token accounting for Workers AI calls routed through the gateway. The Workers AI catalogue at /ai/models/search does publish a rate for @cf/moonshotai/kimi-k2.6 ($0.95/M input, $4/M output, $0.16/M cached), so reconciliation against the catalogue is possible in principle, but only if the calls land in /logs.
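
For reference, this is roughly what reconciliation against the catalogue would look like if the calls did appear in /logs. The sketch uses the catalogue rates quoted above and the usage numbers from the responses. It assumes cached_tokens is a subset of prompt_tokens and is billed at the cached rate instead of the input rate; the actual billing rule is not documented here.

```shell
# Catalogue rates for @cf/moonshotai/kimi-k2.6 in USD per million tokens,
# as quoted above from /ai/models/search.
IN_RATE=0.95; OUT_RATE=4; CACHED_RATE=0.16

# expected_cost PROMPT CACHED COMPLETION
# ASSUMPTION: cached tokens are part of prompt_tokens and pay the cached
# rate in place of the input rate.
expected_cost() {
  awk -v p="$1" -v c="$2" -v o="$3" \
      -v i="$IN_RATE" -v cr="$CACHED_RATE" -v r="$OUT_RATE" \
      'BEGIN { printf "%.8f", ((p-c)*i + c*cr + o*r) / 1e6 }'
}

kimi_cold=$(expected_cost 2424 0 5)     # call 1
kimi_hot=$(expected_cost 2423 2368 5)   # calls 2-6

echo "cold=$kimi_cold hot=$kimi_hot"
```

Under those assumptions the cold call should cost about $0.0023228 and each hot call about $0.0004511, but there is currently no /logs entry to compare these figures against.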


Asks

  1. Is finding A a /logs display bug, or are customers actually being billed this on Unified Billing? If the invoice deducts the displayed cost verbatim, this is a real overcharge on every cache-hit request.

  2. Document what /logs.cost means. Actual deduction or estimate? Pass-through or marked up? Cache discount honoured?

  3. Surface realistic per-request cost in the response body the way OpenRouter does, so clients can attribute spend in real time without polling /logs and can reconcile against rate cards themselves. Target shape:

    {
      "usage": {
        "...": "existing fields",
        "cost": 0.000326,
        "cost_details": {
          "upstream_inference_cost": 0.000326,
          "upstream_inference_prompt_cost": 0.000307,
          "upstream_inference_completions_cost": 0.000019
        },
        "prompt_tokens_details": {
          "cached_tokens": 4096,
          "cache_write_tokens": 0
        }
      }
    }

  4. Fix finding B so Gemini's cachedContentTokenCount and thoughtsTokenCount land as usage.prompt_tokens_details.cached_tokens and usage.completion_tokens_details.reasoning_tokens in the translated response.

  5. Fix finding C so Workers AI calls routed through /ai/v1/chat/completions with cf-aig-gateway-id appear in the gateway log stream.

Splits

If splitting helps triage, the natural break is:

Happy to file separately if preferred.
