Potential accounting bug: AI Gateway /logs.cost charges more for cache hits than cache misses on openai/gpt-5.4-mini (plus related Gemini and Workers AI gaps)

We've been evaluating the new REST API at `api.cloudflare.com/.../ai/v1/*` as a replacement for the deprecated `/compat/chat/completions` path. So far it doesn't feel production-ready for our use case (an MSSP routing all upstream model calls through CF AI Gateway), and the findings below are the reason. Filing them together because they came out of the same cache-cost investigation. Happy to split into separate issues, just say the word and I'll file individually with cross-links.

---

## A. `openai/gpt-5.4-mini`: hot calls cost more than cold calls in `/logs.cost`

Running the same long-prefix prompt repeatedly through `/ai/v1/chat/completions` to exercise OpenAI's automatic prompt cache. Each call uses a unique user-message tail (so CF's gateway response cache can't short-circuit) but an identical long system prefix (so OpenAI's prompt cache hits on the prefix).

Six calls, all HTTP 200, gateway `authentication: true`, Unified Billing:

| Call | `usage.prompt_tokens` | `usage.prompt_tokens_details.cached_tokens` | `usage.completion_tokens` | `/logs.cost` |
|---|---|---|---|---|
| 1 (cold) | 2422 | **0** | 4 | **$0.0018345** |
| 2 (hot) | 2422 | **2304** | 4 | **$0.0020073** |
| 3 (hot) | 2422 | **2304** | 5 | **$0.0020118** |
| 4 (hot) | 2422 | **2304** | 4 | **$0.0020073** |
| 5 (hot) | 2422 | **2304** | 4 | **$0.0020073** |
| 6 (hot) | 2422 | **2304** | 5 | **$0.0020118** |

Two things stand out:

1. **Hot calls cost more than the cold call**, despite OpenAI reporting 2304 of the 2422 input tokens as cache hits.
2. **The hot-vs-cold cost delta is consistently $0.0001728**, which is exactly `2304 × $0.075/M`. So `/logs.cost` behaves as if cached tokens carry a per-token surcharge of $0.075/M on top of the regular input rate, instead of a discount.

Worst case: customers are being billed more for cache hits than cache misses.

**Reproducer:**

```bash
ACCT=<account>
BASE="https://api.cloudflare.com/client/v4/accounts/${ACCT}/ai/v1"
H_AUTH="Authorization: Bearer ${CF_API_TOKEN}"
H_GW="cf-aig-gateway-id: ${CF_GATEWAY_ID}"

LONG=$(yes "Static instruction line for context. " | head -400 | tr -d '\n')
NONCE="run-$(date +%s)"

for i in 1 2 3 4 5 6; do
  jq -nc --arg s "$LONG" --arg u "reply OK ${NONCE} call ${i}" \
    '{model:"openai/gpt-5.4-mini",messages:[{role:"system",content:$s},{role:"user",content:$u}],max_completion_tokens:5}' \
    | curl -s -X POST "${BASE}/chat/completions" -H "$H_AUTH" -H "$H_GW" -H "content-type: application/json" -d @- \
    | jq -c "{call:$i, prompt:.usage.prompt_tokens, cached:.usage.prompt_tokens_details.cached_tokens, completion:.usage.completion_tokens}"
done

sleep 15
curl -s "https://api.cloudflare.com/client/v4/accounts/${ACCT}/ai-gateway/gateways/${CF_GATEWAY_ID}/logs?per_page=10" -H "$H_AUTH" \
  | jq -r '.result[] | select((.model // "") | tostring | contains("gpt-5.4-mini")) | "in=\(.tokens_in) out=\(.tokens_out) cached=\(.cached) cost=\(.cost)"'
```

---

## B. `google/gemini-*`: cache and reasoning token fields stripped by the OpenAI translation

Same harness on `google/gemini-3.1-flash-lite` through `/ai/v1/chat/completions`. Response `usage`:

```json
{
  "prompt_tokens": 2418,
  "completion_tokens": 16,
  "total_tokens": 2434,
  "extra_properties": {"google": {"traffic_type": "ON_DEMAND"}}
}
```

No `prompt_tokens_details`. No `completion_tokens_details`. No cache token field. No reasoning token field. Six identical-prefix calls all logged at the same $0.0006285, with no way to tell from the response whether Gemini's prompt cache hit.

But Gemini natively reports both. Calling the **same gateway** through CF's native `/google-ai-studio/v1beta` provider passthrough (`gateway.ai.cloudflare.com/v1/{ACCT}/{GW}/google-ai-studio/v1beta/models/gemini-2.5-flash:generateContent`) returns `usageMetadata`:

```json
{
  "promptTokenCount": 2421,
  "candidatesTokenCount": 15,
  "totalTokenCount": 2533,
  "cachedContentTokenCount": 2029,
  "cacheTokensDetails": [{"modality": "TEXT", "tokenCount": 2029}],
  "thoughtsTokenCount": 97,
  "serviceTier": "standard"
}
```

So the upstream IS reporting `cachedContentTokenCount` and `thoughtsTokenCount`. CF's REST translation layer is dropping both. Expected mapping into the OpenAI shape:

- `usageMetadata.cachedContentTokenCount` → `usage.prompt_tokens_details.cached_tokens`
- `usageMetadata.thoughtsTokenCount` → `usage.completion_tokens_details.reasoning_tokens`

Without these, clients have no visibility into Gemini cache or reasoning state on the new REST endpoint, even though the provider-native path through CF already exposes them.

---

## C. `@cf/moonshotai/kimi-k2.6`: response shows cache hits but call never appears in `/logs`

Workers AI model called through `/ai/v1/chat/completions` with `cf-aig-gateway-id`. Six calls, all HTTP 200, all expose OpenAI-shape `prompt_tokens_details.cached_tokens`:

```json
// Call 1 (cold)
{"prompt_tokens":2424,"completion_tokens":5,"total_tokens":2429,"prompt_tokens_details":{"cached_tokens":0}}
// Calls 2-6 (hot)
{"prompt_tokens":2423,"completion_tokens":5,"total_tokens":2428,"prompt_tokens_details":{"cached_tokens":2368}}
```

But `GET /ai-gateway/gateways/{GW}/logs` records zero entries with `model` matching `kimi` from this run, while `google/gemini-3.1-flash-lite` and `openai/gpt-5.4-mini` calls in the same window all show up.

There is no way to retrieve per-request cost or token accounting for Workers AI calls routed through the gateway. The Workers AI catalogue at `/ai/models/search` does publish a rate for `@cf/moonshotai/kimi-k2.6` ($0.95/M input, $4/M output, $0.16/M cached), so reconciliation against the catalogue is possible in principle, but only if the calls land in `/logs`.

---

## Asks

1. **Is finding A a `/logs` display bug, or are customers actually being billed this on Unified Billing?** If the invoice deducts the displayed `cost` verbatim, this is a real overcharge on every cache-hit request.
2. **Document what `/logs.cost` means.** Actual deduction or estimate? Pass-through or marked up? Cache discount honoured?
3. **Surface realistic per-request cost in the response body** the way OpenRouter does, so clients can attribute spend in real time without polling `/logs` and can reconcile against rate cards themselves. Target shape:

   ```json
   {
     "usage": {
       "...": "existing fields",
       "cost": 0.000326,
       "cost_details": {
         "upstream_inference_cost": 0.000326,
         "upstream_inference_prompt_cost": 0.000307,
         "upstream_inference_completions_cost": 0.000019
       },
       "prompt_tokens_details": {
         "cached_tokens": 4096,
         "cache_write_tokens": 0
       }
     }
   }
   ```

4. **Fix finding B** so Gemini's `cachedContentTokenCount` and `thoughtsTokenCount` land as `usage.prompt_tokens_details.cached_tokens` and `usage.completion_tokens_details.reasoning_tokens` in the translated response.
5. **Fix finding C** so Workers AI calls routed through `/ai/v1/chat/completions` with `cf-aig-gateway-id` appear in the gateway log stream.

## Splits

If splitting helps triage, the natural break is:
- A → billing/accounting
- B → provider-translation parity (sits alongside #491, #470)
- C → gateway logging coverage

Happy to file separately if preferred.

## Related

- #491 (Bedrock Converse usage in camelCase not parsed) and #470 (streaming Azure cost not tracked). Broader pattern of per-provider cost-tracking gaps.
- #549 (no `/ai/v1/models` endpoint). No CF-published catalogue rates for third-party models means clients can't even reconcile against an authoritative price locally.


Call	`usage.prompt_tokens`	`usage.prompt_tokens_details.cached_tokens`	`usage.completion_tokens`	`/logs.cost`
1 (cold)	2422	0	4	$0.0018345
2 (hot)	2422	2304	4	$0.0020073
3 (hot)	2422	2304	5	$0.0020118
4 (hot)	2422	2304	4	$0.0020073
5 (hot)	2422	2304	4	$0.0020073
6 (hot)	2422	2304	5	$0.0020118

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Potential accounting bug: AI Gateway /logs.cost charges more for cache hits than cache misses on openai/gpt-5.4-mini (plus related Gemini and Workers AI gaps) #550

A. `openai/gpt-5.4-mini`: hot calls cost more than cold calls in `/logs.cost`

B. `google/gemini-*`: cache and reasoning token fields stripped by the OpenAI translation

C. `@cf/moonshotai/kimi-k2.6`: response shows cache hits but call never appears in `/logs`

Asks

Splits

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Potential accounting bug: AI Gateway /logs.cost charges more for cache hits than cache misses on openai/gpt-5.4-mini (plus related Gemini and Workers AI gaps) #550

Description

A. openai/gpt-5.4-mini: hot calls cost more than cold calls in /logs.cost

B. google/gemini-*: cache and reasoning token fields stripped by the OpenAI translation

C. @cf/moonshotai/kimi-k2.6: response shows cache hits but call never appears in /logs

Asks

Splits

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

A. `openai/gpt-5.4-mini`: hot calls cost more than cold calls in `/logs.cost`

B. `google/gemini-*`: cache and reasoning token fields stripped by the OpenAI translation

C. `@cf/moonshotai/kimi-k2.6`: response shows cache hits but call never appears in `/logs`