DirQ ("Direct Query") is an agent-based platform for querying and managing large Windows/Linux fleets. Agents form a peer-to-peer relay mesh and report data back to a central server. The server acts as an Ansible Automation Platform (AAP) inventory source, exposes collected data as structured facts, and can route Ansible execution through the mesh as an alternative to SSH/WinRM connectivity.
The key idea is simple:
- Query the fleet like a dataset instead of logging into hosts one by one
- Keep managed hosts outbound-only instead of opening SSH/WinRM inbound
- Reuse Ansible while replacing the transport underneath
- Build Ansible inventories from live DirQ query results instead of static host lists
- Scale with a relay tree so the server does not need a direct session to every node
- Scan for CVEs in real time — identify every affected host in seconds, not hours
- Run ad-hoc commands across the fleet — parallel exec with streaming results
One of the most practical workflows in DirQ is:
- Query the fleet for exactly the hosts you care about
- Turn those results into an Ansible inventory
- Run a playbook only against that live, data-driven target set
Examples:
- Find only hosts with disks over 90%, turn that into an inventory, then run a cleanup or expansion playbook.
- Query for hosts with vulnerable OpenSSL package versions, build an inventory from the result, and patch only those systems.
- When a new CVE drops, run `dirq cve CVE-2024-6345` and instantly see which hosts are vulnerable and which are already patched, across the entire fleet.
- Query for hosts where `sshd` or another critical service is stopped, generate an inventory, and run a remediation playbook immediately.
- Quick ad-hoc check: run `dirq exec WHERE tag.env = 'prod' -- uptime` to see every prod host's uptime without setting up a playbook.
DirQ is useful when traditional fleet access patterns start breaking down:
- Large locked-down environments — managed hosts cannot accept inbound SSH or WinRM.
- Segmented enterprise networks — a single control plane across data centers, edge sites, or heavily firewalled zones.
- Query-driven Ansible targeting — inventories based on live fleet state, not stale static groups.
- Ansible without transport pain — keep your playbooks, drop the SSH/WinRM dependency.
- Real-time CVE response — a vulnerability drops and you need to know which hosts are affected now, not after the next scheduled scan.
- Real-time fleet troubleshooting — answer "which prod hosts have disks over 90%?" and act on it immediately.
- Very large estates — server connection count stays bounded while the fleet grows.
What makes DirQ different:
- Mesh-first architecture: agents relay for each other, so the fleet becomes its own transport.
- Structured query model: modules return normalized data instead of raw command output.
- Ansible compatibility: DirQ acts as query engine, inventory source, and execution transport — existing playbooks work without modification.
- Inventory and execution in one system: the same platform that knows the fleet can also target it.
- Agent-side policy enforcement (OPA/Rego): each host can locally allow or deny exec/file/deploy operations with a Rego policy — defense in depth even for validly-authorized requests. Express segregation of duties, break-glass, and per-AAP-user authorization for regulated fleets.
- Architecture — how the mesh works, scaling
- Quick Start — run locally in 5 minutes
- Production Deployment — running a real multi-host fleet
- Query DSL — the fleet query language
- Ansible Integration — inventory, groups, facts, query-based targeting
- Execution Transport — run Ansible through the mesh
- Fleet Exec — ad-hoc parallel command, script, and grep execution
- Topology Graph — visualize the agent mesh tree
- Fleet-Scale Emulation — one agent process hosting N virtual hosts
- Debug & Diagnostics: `dirq debug` subcommands for in-flight sessions and mesh reachability
- Observability: Prometheus `/metrics` and Grafana
- Security: TLS, authentication, exec safety, agent-side OPA/Rego policy
- High Availability — multi-pod deployment on OpenShift/Kubernetes
- Multi-Datacenter Deployment — isolated meshes, per-DC routing
- AAP Integration — collection, EE, credentials, setup checklist
- MCP Integration — use dirq as an AI tool via Model Context Protocol
- Configuration Reference — all environment variables
- REST API — endpoint reference
- Building — compile, cross-compile, container images
- Project Structure
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Agent │ │ Agent │ │ Agent │ │ Agent │
│ (leaf) │ │ (leaf) │ │ (leaf) │ │ (leaf) │
└────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │ │
▼ ▼ ▼ ▼
┌───────────────────────┐ ┌───────────────────────┐
│ Agent (relay peer) │ │ Agent (relay peer) │
└───────────┬───────────┘ └───────────┬───────────┘
│ │
▼ ▼
┌──────────────────────────────────────┐
│ Agent (zone leader) │
└──────────────────┬───────────────────┘
│
══════════════╪══════════════
│ (OpenShift Route)
▼
┌──────────────────────────────────────┐
│ DirQ Server (Go) │
│ REST API · gRPC · Query Engine │
└──────────────────┬───────────────────┘
│
▼
┌─────────────────────┐
│ SQLite / PostgreSQL │
└─────────────────────┘
All links are gRPC over TLS. Agents connect outbound — no inbound ports required on managed hosts. Only a bounded number of zone leaders connect directly to the server.
| Component | Language | Description |
|---|---|---|
| `dirq-server` | Go | Central server: gRPC, REST API, query engine, Ansible inventory. SQLite by default; PostgreSQL optional. |
| `dirq-agent` | Go | Endpoint agent: collects data, relays queries, optionally executes commands. Single static binary. |
| `dirq` | Go | CLI: submit queries, manage hosts/tags/tokens, run ad-hoc commands, generate and rotate certificates. |
| `atgreen.dirq` | Python | Ansible collection: inventory plugin + connection plugin for AAP. |
The server holds a fixed number of zone leader connections (default 5). All other agents fill a tree below those zone leaders, growing as deep as needed (BFS fill order).
| Fleet size | Tree depth | Server connections |
|---|---|---|
| 250 | 2 | 5 |
| 12,500 | 3 | 5 |
| 625,000 | 4 | 5 |
The server always holds exactly DIRQ_MAX_ZONE_LEADERS connections regardless of fleet size. The tree deepens — it never widens at the server.
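The table can be reproduced with a small calculation. This is a minimal sketch, not DirQ code: the function name is hypothetical, and the per-node fan-out of 50 is an assumption inferred from the table rows (the source does not state it).

```python
def tree_depth(fleet_size: int, zone_leaders: int = 5, fan_out: int = 50) -> int:
    """Depth of a BFS-filled relay tree that holds fleet_size agents.

    zone_leaders matches the documented default of 5.
    fan_out=50 is an ASSUMPTION inferred from the table, not a documented value.
    """
    if fleet_size <= 0:
        return 0
    depth = 1
    level_capacity = zone_leaders   # agents that fit on the current level
    total_capacity = zone_leaders   # agents that fit in the whole tree so far
    while total_capacity < fleet_size:
        level_capacity *= fan_out   # each node on the level above takes fan_out children
        total_capacity += level_capacity
        depth += 1
    return depth


for size in (250, 12_500, 625_000):
    print(size, tree_depth(size))
```

Under these assumptions the fleet grows by a factor of 50 per extra level while the number of server connections stays at the zone-leader count.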
The live mesh shape is held in memory by the server (MeshTopology, RWMutex-protected maps for nodes, ZLs, parent/child links, depth cache). Registration, fan-out, and dispatch all read this directly — no DB round-trips on hot paths. agents.role and agents.parent_id are best-effort snapshots persisted every 30 s for operator visibility and rehydrated on restart. The CLI overlays the in-memory view onto DB records before serializing, so dirq hosts list always reflects live truth.
Registration arrivals flow through a burst-aware batcher (default 200 ms window, 200 max batch). On flush, the assigner prefers one zone leader per distinct source IP — so a thundering herd from a single subnet can't fill all ZL slots from one host. There is no proactive rebalancer; reactive recovery (reassignOrphans on stream close, fallback parents + orphan promotion via RequestPeers) handles every churn case the old proactive paths used to.
Broadcast dispatchers (query, exec, deploy) use per-target accounting instead of an idle timeout. A session's loop runs until every target is accounted — either by a real response or a synthetic disconnect failure synthesized from one of four mesh-state signals (zone-leader stream close, PeerDisconnected from a relay, the periodic reaper, or a fanout-buffer-full at dispatch time). All four paths funnel through one first-terminal-wins gate (ClaimAgent) so no agent is counted twice.
The hard timeout is command_timeout + 30 s and is a true safety net — rarely the actual completion driver. Practical consequence: dirq exec --timeout 3600 -- yum upgrade -y doesn't get cut off at 30 s of silence between fast and slow responders. When a dispatcher can't account for everyone, the CLI prints Status: incomplete | Targets: N | Received: M | Missing: K instead of claiming completion.
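The accounting model above can be sketched in a few lines. This is an illustration under stated assumptions, not the server's implementation: the class and method names are hypothetical (the real gate is referred to as ClaimAgent), and the outcome strings are placeholders.

```python
import threading


class BroadcastSession:
    """Hypothetical sketch of per-target accounting with a first-terminal-wins gate."""

    def __init__(self, targets, command_timeout_s):
        self._lock = threading.Lock()
        self._pending = set(targets)   # targets not yet accounted for
        self.outcomes = {}             # agent_id -> first terminal outcome
        # Safety net only: command timeout plus 30 s, as described above.
        self.hard_timeout_s = command_timeout_s + 30

    def claim(self, agent_id, outcome):
        """Record the first terminal outcome for an agent.

        Returns False if the agent is unknown or was already claimed,
        so a late response cannot be counted after a synthetic disconnect
        (or the other way round).
        """
        with self._lock:
            if agent_id not in self._pending:
                return False
            self._pending.remove(agent_id)
            self.outcomes[agent_id] = outcome
            return True

    def done(self):
        """True once every target has a response or a synthetic failure."""
        with self._lock:
            return not self._pending

    def summary(self):
        """Return (targets, accounted, missing) for a status line."""
        with self._lock:
            accounted = len(self.outcomes)
            missing = len(self._pending)
            return accounted + missing, accounted, missing
```

Because every signal path goes through `claim`, the session loop only needs to check `done()` and fall back to the hard timeout if it never becomes true.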
Query results aggregate in-mesh, not at the server. Each relay buffers results
from its children for 2 seconds, then flushes one AggregatedQueryResult
upstream. Zone leaders do the same. The server receives ~5 messages (one per
zone leader) instead of 100k individual responses.
Each non-zone-leader agent receives 2 fallback parent addresses during registration, chosen from different branches of the tree. On parent failure:
- Try fallback parent 0 (different branch, sub-second)
- Try fallback parent 1 (another branch)
- Ask the server for a new parent assignment via the `RequestPeers` RPC
Agents never fall back to direct server connections — they always ask the server where to go. The server marks the dead parent offline and assigns a healthy replacement. When a zone leader goes offline, the server immediately reassigns its orphaned children to other healthy nodes.
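The failover order can be sketched as follows. This is a minimal illustration, not the agent's code: `is_reachable` and `request_peers` are hypothetical callables standing in for a connection attempt and the RequestPeers RPC.

```python
def next_parent(fallbacks, is_reachable, request_peers):
    """Pick a new relay parent after the current one fails.

    fallbacks:      the fallback parent addresses handed out at registration
    is_reachable:   hypothetical callable(addr) -> bool, e.g. a dial attempt
    request_peers:  hypothetical callable() -> addr, stands in for RequestPeers
    """
    # Try each pre-assigned fallback in order (fallback 0, then fallback 1).
    for addr in fallbacks:
        if is_reachable(addr):
            return addr
    # Never dial the server as a parent; ask it where to attach instead.
    return request_peers()
```

Keeping the server out of the candidate list is what preserves the bounded connection count during churn.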
| Module | Data collected |
|---|---|
| `cpu` | Physical/logical cores, model name, vendor |
| `memory` | Total, available, used bytes; percent used; swap |
| `disk` | Per-partition: device, mount point, fs type, total/used/free bytes, percent used |
| `os_info` | Hostname, OS, version, arch, uptime, kernel version, distro, distro_version, distro_family |
| `packages` | Installed packages: name, version, arch, source (rpm/dpkg/registry) |
| `network` | Interfaces: name, MAC, MTU, flags, IP addresses (loopback filtered) |
| `services` | Services: name, display name, state, start type (systemd/Windows Services) |
| `hotfixes` | Windows hotfixes: kb_id, description, installed_on (Get-HotFix) |
This is a single-host development setup; do not use it as-is for a multi-host fleet. The `podman-compose` server publishes its gRPC port through podman's NAT, so the server sees every agent's source IP as a podman-bridge address (`10.89.0.x`) instead of the agent's real host IP. It then advertises those unroutable addresses to other agents as relay parents, and the mesh fails to connect across hosts (`dial tcp 10.89.0.x:50052: i/o timeout`, agents stuck re-registering, `dirq debug ping` timing out). It works on one machine only because every container shares one bridge. For a real fleet, see Production Deployment.
- Go 1.26+
- Podman and podman-compose
podman-compose up -d

The server auto-generates TLS certs, runs DB migrations, and creates a bootstrap API token. The token is written to a file (not logged) for security:
# The server log shows the token file path:
podman logs dirq_dirq-server_1 2>&1 | grep "bootstrap"
# Read the token:
cat /var/lib/dirq/bootstrap-token

The server writes ready-to-copy config files on startup:
- `/var/lib/dirq/agent.conf`: agent config with server address, registration secret, and inline TLS certs (base64-encoded). Copy to `/etc/dirq/agent.conf` on each agent host.
- `/var/lib/dirq/client.conf`: CLI config with server URL and bootstrap token. Copy to `/etc/dirq/client.conf` or `~/.config/dirq/client.conf` on any workstation.
# On the server, copy the generated agent config to a remote host:
scp /var/lib/dirq/agent.conf agent-host:/etc/dirq/agent.conf
# On the agent host:
sudo systemctl enable --now dirq-agent

For local dev, build and run the agent directly:
go build -o bin/dirq-agent ./cmd/dirq-agent
./bin/dirq-agent

The agent auto-generates TLS certs into the same directory as the server (`/var/lib/dirq/tls`). When both run on the same machine, they share the auto-generated CA and verify each other automatically.

go build -o bin/dirq ./cmd/dirq

The CLI reads config from `~/.config/dirq/client.conf` (user-local) or `/etc/dirq/client.conf` (system-wide). Copy the server-generated `client.conf`:
# Copy from server to your workstation:
scp server:/var/lib/dirq/client.conf ~/.config/dirq/client.conf
# Now just use dirq — no env vars needed:
dirq doctor
dirq hosts list
dirq select hostname, cpu.logical_cores, memory.pct_used

Or set env vars directly:
export DIRQ_SERVER_URL=https://dirq-server:8080
export DIRQ_TOKEN=<bootstrap-token>
export DIRQ_TLS_INSECURE=true # for self-signed certs

cd test-playbook
DIRQ_SERVER_URL=http://localhost:8090 DIRQ_TOKEN=$DIRQ_TOKEN ansible-playbook test.yml -v

GOOS=windows GOARCH=amd64 go build -o bin/dirq-agent.exe ./cmd/dirq-agent
# Run in foreground
.\bin\dirq-agent.exe
# Or install as a Windows Service (runs as SYSTEM)
.\bin\dirq-agent.exe install
sc start DirQAgent

The podman quick start is a single-host laptop convenience. A real multi-host fleet has two hard requirements it does not meet; getting either wrong leaves agents stuck re-registering with the mesh unable to route between hosts.
During registration the server records the source IP of the agent's gRPC
connection and advertises it to the rest of the mesh as that agent's relay
address (so other agents know where to attach). If the server runs behind
NAT — most commonly a podman/docker container with published ports
(-p 50051:50051) — it sees a bridge address (10.89.0.x) instead of the
agent's host IP and hands that unroutable address to everyone. The symptom is
dial tcp 10.89.0.x:50052: i/o timeout in agent logs and dirq debug ping
timing out even though dirq hosts list shows the agent "online" (it
registered, but never actually attached to its parent — a ghost-online
node).
Run the server so it sees real client IPs:
- Native (recommended). Install the `dirq-server` package and run it as a systemd service on a host with a routable address. This is what the RPM/DEB packaging targets.
- Containerized. Give the container host networking (`network_mode: host` in compose, or `--network=host`) so it shares the host's network namespace. Do not publish the gRPC port with `-p`; that is what masks the source IP. With host networking, point `DIRQ_DB_URL` at the host (`@127.0.0.1:5432`, not a compose service name) and bind the HTTP/gRPC listeners directly (`DIRQ_HTTP_ADDR`, `DIRQ_GRPC_ADDR`).
Either way, open 50052/tcp host-to-host between agents (so they can reach
their relay parents) and 50051/tcp from agents to the server.
The server's Ed25519 signing key, CA, and bootstrap token live in
/var/lib/dirq. If that directory is ephemeral (a container with no volume),
recreating or rebuilding the server regenerates the signing key, and every
already-registered agent then rejects the server's signed messages until you
re-distribute the new agent.conf. Mount /var/lib/dirq on a persistent
volume, and persist the Postgres data directory too. After any signing-key
change, re-copy the freshly generated agent.conf to the agents.
Install the dirq-agent package (native systemd service) and drop in the
server-generated config — see Deploy agents. Prefer the
packaged unit over a hand-rolled one so config paths, the data directory, and
restart behavior match the docs.
The quick start sets DIRQ_TLS_DISABLED=true and DIRQ_AUTH_DISABLED=true
for convenience. In production both must be enabled — with them off, API
tokens and remote-exec payloads cross the network in cleartext, and any host
that can reach the gRPC port can register or run commands. Distribute the
server's CA to agents (via agent.conf) so TLS verifies, and set a
registration secret. See Security.
A SQL-like language for ad-hoc fleet queries. Queries are parsed on the server, pushed through the relay mesh, filtered agent-side, and aggregated server-side.
SELECT <fields | *>
[WHERE <expression>]
[GROUP BY <field>, ...]
[ORDER BY <field> [ASC|DESC], ...]
[LIMIT <n>]
Every clause except SELECT is optional. Queries always target all online hosts;
use tag.* conditions in WHERE to narrow the target (see below).
Keywords are case-insensitive (select, SELECT, and Select all work).
Fields use dotted notation: module.field. See Built-in Query Modules for available modules.
Each disk partition contains: device, mount_point, fs_type, total_bytes, used_bytes, free_bytes, pct_used.
Each package contains: name, version, arch, source.
Each network interface contains: name, mac, mtu, flags, addresses (array of {addr, family}).
Each service contains: name, display_name, state, start_type.
Conditions support AND, OR, NOT, and parenthesized grouping with proper precedence (AND binds tighter than OR). Simple AND-only filters are pushed to agents; complex expressions (OR, NOT) are evaluated server-side.
WHERE disk.pct_used > 80
WHERE cpu.logical_cores >= 8 AND memory.pct_used > 50
WHERE os_info.os = 'linux' OR os_info.os = 'freebsd'
WHERE (os_info.os = 'linux' OR os_info.os = 'freebsd') AND cpu.logical_cores > 4
WHERE NOT os_info.os = 'windows'
WHERE os_info.kernel_version LIKE '7.0%'
WHERE os_info.kernel_version NOT LIKE '%debug%'
WHERE packages.name IN ('openssl', 'nginx', 'curl')
WHERE packages.name NOT IN ('telnet', 'rsh')
WHERE services.name = 'sshd' AND services.state = 'stopped'
WHERE cpu.model IS NOT NULL

Operators: =, !=, >, <, >=, <=, LIKE, NOT LIKE, IN, NOT IN, IS NULL, IS NOT NULL
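The push-down rule described above (AND-only filters go to agents, anything containing OR or NOT stays on the server) can be sketched as a walk over a parsed expression. This is an illustration only: the tuple-based tree shape and the function name are hypothetical and say nothing about how the real parser represents conditions.

```python
def pushable(expr):
    """Return True if a WHERE expression could be evaluated agent-side.

    expr is a HYPOTHETICAL parsed form:
      ("cmp", field, operator, value)   a single comparison
      ("and", left, right)              conjunction
      ("or", left, right)               disjunction
      ("not", operand)                  negation
    """
    kind = expr[0]
    if kind == "cmp":
        return True
    if kind == "and":
        # An AND is pushable only if every branch is pushable.
        return all(pushable(child) for child in expr[1:])
    # "or" and "not" make the expression complex: evaluate server-side.
    return False


full_disk = ("cmp", "disk.pct_used", ">", 80)
is_linux = ("cmp", "os_info.os", "=", "linux")
print(pushable(("and", full_disk, is_linux)))
print(pushable(("or", full_disk, is_linux)))
```

The first call is an AND-only filter and qualifies for agent-side evaluation; the second contains OR and does not.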
Agent tags are available as tag.* fields in WHERE conditions. The server evaluates tag conditions before dispatching — only matching agents receive the query.
-- Only prod hosts
WHERE tag.env = 'prod' AND disk.pct_used > 80
-- Multiple environments
WHERE tag.env IN ('prod', 'staging')
-- Group targeting
WHERE tag.group = 'webservers'
-- Complex targeting
WHERE (tag.env = 'prod' OR tag.env = 'staging') AND tag.group = 'webservers'

Tag conditions can be freely mixed with data conditions using AND/OR.
When a WHERE condition references a field inside an array module (packages, services, disk, network), the agent filters the array and returns only matching entries:
-- Returns only 3 packages, not all 2000 installed
WHERE packages.name IN ('openssl', 'nginx', 'curl')
-- Returns only partitions over 80% full
WHERE disk.pct_used > 80

SELECT os_info.os, COUNT(os_info.hostname), AVG(memory.total_bytes)
GROUP BY os_info.os
ORDER BY disk.pct_used DESC
ORDER BY os_info.os ASC, os_info.hostname DESC
LIMIT 10

Aggregation functions: COUNT, AVG, SUM, MIN, MAX
Aggregates work with or without GROUP BY:
-- Fleet-wide total (bare aggregate)
SELECT COUNT(hostname) WHERE os_info.os = 'linux'
-- Per-group breakdown
SELECT os_info.os, COUNT(hostname) GROUP BY os_info.os

-- Hosts with full disks in prod (only matching partitions returned)
SELECT os_info.hostname, disk.mount_point, disk.pct_used
WHERE tag.env = 'prod' AND disk.pct_used > 80 ORDER BY disk.pct_used DESC
-- Check specific package versions
SELECT os_info.hostname, packages.name, packages.version
WHERE packages.name IN ('openssl', 'nginx', 'curl')
-- Find hosts where sshd is stopped
SELECT os_info.hostname, services.name, services.state
WHERE services.name = 'sshd' AND services.state = 'stopped'
-- Count hosts by OS
SELECT os_info.os, COUNT(os_info.hostname), AVG(memory.total_bytes)
GROUP BY os_info.os
-- Find beefy hosts
SELECT os_info.hostname, cpu.logical_cores, memory.total_bytes
WHERE cpu.logical_cores >= 16
-- Packages matching a pattern
SELECT os_info.hostname, packages.name, packages.version
WHERE packages.name LIKE 'openssl%'
-- OR and parentheses
SELECT os_info.hostname, os_info.os
WHERE (os_info.os = 'linux' OR os_info.os = 'freebsd') AND cpu.logical_cores > 4
-- Exclude specific packages, limit results
SELECT os_info.hostname, packages.name
WHERE packages.name NOT IN ('telnet', 'rsh') LIMIT 50
-- Everything about all hosts
SELECT *

# Natural syntax: no quoting needed for simple queries
dirq select os_info.hostname, cpu.logical_cores
dirq select os_info.hostname, disk.pct_used WHERE disk.pct_used = 80
# Quoted form — avoids shell interpretation of > < etc.
dirq "select os_info.hostname, disk.pct_used where disk.pct_used > 80"
# Flags
dirq "select os_info.os, COUNT(os_info.hostname) group by os_info.os" --json
dirq "select * where tag.env = 'prod'" --timeout 30

Ask questions in plain English: an LLM uses DirQ's fleet tools to gather data and compose an answer. The LLM can call multiple tools and iterate until it has enough information.
dirq ask "which prod hosts have full disks?"
dirq ask "how many hosts are running linux?"
dirq ask "what versions of openssl are installed?"
dirq ask "are any hosts vulnerable to CVE-2024-6345?"

Tool calls are shown as the LLM works:
$ dirq ask "how many linux servers do I have?"
[dirq_query] SELECT COUNT(hostname) WHERE os_info.os = 'linux'
You have 4 Linux servers, all running RHEL 8.10.
The LLM is read-only — it can query and inspect but cannot execute commands or modify hosts. If you ask it to make changes, it will suggest the dirq exec command to run.
Configuration: Uses DIRQ_LLM_URL + DIRQ_LLM_API_KEY + DIRQ_LLM_MODEL, or falls back to ANTHROPIC_API_KEY. Supports both Anthropic's native API and any OpenAI-compatible endpoint.
# Anthropic (direct)
export ANTHROPIC_API_KEY=sk-ant-...
# OpenAI-compatible (any provider)
export DIRQ_LLM_URL=https://api.openai.com/v1
export DIRQ_LLM_API_KEY=sk-...
export DIRQ_LLM_MODEL=gpt-4o

Use `--model` to override the model for a single query:
dirq ask "disk usage in prod" --model claude-sonnet-4-20250514

Generate an AI-readable reference for the query language:
dirq skill # print to stdout
dirq skill | pbcopy # copy to clipboard (macOS)

Query the fleet and run Ansible against the results in one step:
# Run a playbook against hosts matching a WHERE clause
dirq run cleanup-disks.yml WHERE disk.pct_used = 90
# Quoted form
dirq "run deploy.yml where tag.env = 'prod'"
# Ad-hoc command
dirq run --command "yum update -y openssl" WHERE packages.name = 'openssl'
# Ansible module
dirq run --module ping WHERE os_info.os = 'linux'
# All online hosts (no WHERE clause)
dirq run deploy.yml

Deploy RPM, DEB, or MSI packages across the fleet through the relay mesh. Designed primarily for non-disruptive self-updates of the `dirq-agent` package itself: the depth-first rolling strategy updates deepest nodes first, working up the tree so a parent is never updated while its children are mid-install. This keeps the relay mesh intact throughout the upgrade.
# Deploy to all agents (rolling wave)
dirq deploy ./patch-2026-05.rpm
# Deploy to specific hosts
dirq deploy ./patch.rpm WHERE tag.env = 'prod'
# Windows packages
dirq deploy ./agent-0.3.0.msi WHERE os_info.os = 'windows'
# Override rolling deployment — install everywhere at once
dirq deploy ./monitoring.rpm --parallel

Package type is detected from the file extension:
- `.rpm` → `rpm -U`
- `.deb` → `dpkg -i`
- `.msi` → `msiexec /i ... /qn`
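The extension-to-installer mapping can be sketched as a lookup. This is a hedged illustration rather than the agent's code: the function name is hypothetical, and placing the package path where the `...` appears in the `msiexec` command is an assumption.

```python
from pathlib import PurePath

# Installer argument prefixes taken from the mapping above.
INSTALLERS = {
    ".rpm": ["rpm", "-U"],
    ".deb": ["dpkg", "-i"],
    ".msi": ["msiexec", "/i"],
}


def install_command(package_path):
    """Build the install command for a package, based on its file extension."""
    suffix = PurePath(package_path).suffix
    if suffix not in INSTALLERS:
        raise ValueError(f"unsupported package type: {suffix or package_path}")
    command = INSTALLERS[suffix] + [package_path]
    if suffix == ".msi":
        # ASSUMPTION: the package path fills the "..." and /qn (quiet) follows it.
        command.append("/qn")
    return command


print(install_command("./patch-2026-05.rpm"))
print(install_command("./agent-0.3.0.msi"))
```

Rejecting unknown extensions up front avoids dispatching a deploy that no agent could act on.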
Scan RHEL systems for known vulnerabilities. DirQ fetches affected package data from the Red Hat Security Data API, then queries the fleet to find hosts running vulnerable versions.
# Scan all RHEL hosts
dirq cve CVE-2024-6345
# Scan only production
dirq cve CVE-2024-6345 WHERE tag.env = 'prod'
# Machine-readable output
dirq cve CVE-2024-6345 --json

Output shows each host's status:
CVE-2024-6345: pypa/setuptools: Remote code execution via download functions...
Severity: Important
web1.prod python-setuptools 39.2.0-7.el8 VULNERABLE (fixed in 39.2.0-8.el8_10)
web2.prod python-setuptools 39.2.0-8.el8_10 patched
db1.prod python-setuptools 39.2.0-7.el8 VULNERABLE (fixed in 39.2.0-8.el8_10)
2 vulnerable, 1 patched
Visualize the agent mesh tree:
dirq hosts graph

dirq-server
├── ● dirq-agent-01 [ZL]
│ ├── ● dirq-agent-06
│ └── ● dirq-agent-08
├── ● dirq-agent-02 [ZL]
│ └── ● dirq-agent-07
└── ● dirq-agent-03 [ZL]
└── ● dirq-agent-09
● = online, ○ = offline, [ZL] = zone leader.
Export to Graphviz DOT format for rendering (left-to-right layout fits large fleet trees on screen):
dirq hosts graph --dot | dot -Tpng -o topology.png

Check the health of your DirQ deployment with `dirq doctor`:
dirq doctor
DIRQ_SERVER_URL ok https://dirq.example.com:8080
API token valid ok authenticated
TLS certificate ok valid
Database ok postgres connected
Agents online ok 1247/1250
Agent version skew !! 3 agents on v0.21.x (server is v0.22.3)
Relay tree ok depth 4, 5 zone leader(s)
Ansible installed ok ansible-playbook [core 2.20.5]
Connection plugin ok /usr/local/ansible/connection_plugins
9 passed, 1 warnings, 0 failed
Quoted arguments that start with SELECT are automatically split into individual
args before parsing. This lets you write queries as a single quoted string:
dirq "select hostname where tag.env = 'prod'" # same as: dirq select hostname where ...

Other commands are not flattened. For `dirq exec`, the remote command goes after `--` so flags and special characters pass through without conflict:
dirq exec WHERE tag.env = 'prod' -- ls -l # everything after -- is the remote command

For testing mesh behavior at fleet scale without provisioning one VM per host, a single `dirq-agent` process can host N virtual hosts in-process. Each VH presents itself to the server as an independent agent with its own ID, session token, mTLS client cert, upstream gRPC connection, and downstream relay listen port.
DIRQ_VIRTUAL_HOSTS=25 \
DIRQ_HOSTNAME_PREFIX=dirq-test-linux-1 \
DIRQ_REGISTRATION_JITTER_SECONDS=30 \
./bin/dirq-agent

Synthesized hostnames are `<prefix>-NNNNN`. Per-instance mTLS material lives under `$DATA_DIR/tls/instances/<hostname>/` so siblings can't clobber each other. The relay listener binds synchronously in `Run()` before registration, so port collisions surface as a startup error instead of silently failing later.
The AWS test fleet (make aws) exposes this via DIRQ_REPLICAS_PER_VM:
LINUX_COUNT=50 DIRQ_REPLICAS_PER_VM=1000 make aws # 50,000 emulated hosts on 50 VMs

The userdata script auto-widens the SG relay port range to 50052..50051+N, reserves the ephemeral-port block via `net.ipv4.ip_local_reserved_ports` so concurrent `dnf install` doesn't collide with VH listen sockets, and picks a sensible registration-jitter default (N/4 s, clamped to 5–60 s) when running with >1 VH.
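The two derived values can be written out as a short calculation. This is a sketch under assumptions, not the userdata script: the function names are hypothetical, integer division for N/4 is an assumption, and the single-VH case returns `None` because the source only describes the default for more than one VH.

```python
RELAY_BASE_PORT = 50052  # first relay listen port, from the range quoted above


def relay_port_range(virtual_hosts):
    """Inclusive relay port range for N virtual hosts: 50052..50051+N."""
    return RELAY_BASE_PORT, RELAY_BASE_PORT - 1 + virtual_hosts


def default_jitter_seconds(virtual_hosts):
    """Registration jitter default: N/4 seconds, clamped to the 5..60 range.

    Returns None for a single host (ASSUMPTION: no default is derived then).
    Integer division is also an assumption.
    """
    if virtual_hosts <= 1:
        return None
    return min(60, max(5, virtual_hosts // 4))


for n in (8, 25, 1000):
    print(n, relay_port_range(n), default_jitter_seconds(n))
```

With 1000 replicas per VM the jitter saturates at the 60 s ceiling, which spreads registrations out instead of sending them in one burst.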
Multi-VH is Linux-only (Windows VMs stay single-tenant).
Per-VM density caveat: every emulated VH runs its own gRPC stream + state, but they all share the host kernel, CPU, and memory. Running heavy workloads (a real dnf install, large package syncs) at 25 VHs/VM on a t3.small saturates the CPU enough that gRPC heartbeats time out and dirq honestly reports VHs as peer disconnected. That's a property of the emulation density, not the mesh — production deployments with 1 agent per real host don't have it. For heavy-workload emulation, prefer CPU-rich instance types (c6i.large+) or drop density to ~10 VHs/VM.
dirq debug covers diagnostic tools used when something looks wrong in the mesh. All endpoints are admin-scoped.
| Command | Purpose |
|---|---|
| `dirq debug inflight` | List every exec / query / deploy session the server is currently coordinating, with the still-missing agent set, arrivals-in-the-last-1/5/30 s, and a per-zone-leader breakdown (subtree, pending, send_buf). Marks the chokepoint ZL with `← bottleneck (send_buf full)` when its stream-send buffer is at capacity. |
| `dirq debug path <hostname>` | Walk the agent's mesh parent chain from the DB snapshot. Flags broken links. Fastest, DB-only. |
| `dirq debug stream <hostname>` | Show the server's in-memory view of how it would currently reach this agent (directly connected vs. routed through a zone leader). |
| `dirq debug ping <hostname>` | Send a no-op exec through the mesh and report round-trip timing. Slowest of the three lookup tools but the only one that proves a message actually reaches the agent right now. |
The three lookup tools form a hierarchy of trust — path (DB), then stream (live process state), then ping (end-to-end proof).
| Symptom | Likely cause | Fix |
|---|---|---|
| Agents show online in `dirq hosts list` but `dirq debug ping` times out; agent logs loop on `dial tcp 10.89.0.x:50052: i/o timeout` | The server is advertising an unroutable relay address: it observed a NAT/bridge source IP at registration (typically the server running in a container with published ports). The agent registered but never attached to its parent (ghost-online). | Run the server so it sees real agent IPs (native or host networking, no `-p` on the gRPC port) and open 50052/tcp host-to-host. See Production Deployment. |
| Exec / query / ping to agents start timing out after a server restart; agent logs show `rejected unsigned or invalid server message` | The server's signing key changed: an ephemeral `/var/lib/dirq` regenerated it on restart while agents still trust the old key. | Persist `/var/lib/dirq`; re-distribute the regenerated `agent.conf` and restart the affected agents. See Production Deployment. |
| Registration never succeeds: `tls: first record does not look like a TLS handshake` | TLS mode mismatch: one side speaks TLS, the other plaintext. `DIRQ_TLS_INSECURE` skips cert verification but still uses TLS; `DIRQ_TLS_DISABLED` turns TLS off entirely. | Make the mode identical on the server and every agent. |
The server exposes a Prometheus scrape endpoint at /metrics (unauthenticated; restrict at the network layer if needed). Two families:
dirq self-health — counts, durations, gauges describing the server's own behavior:
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| `dirq_build_info` | gauge=1 | `version`, `backend`, `go_version` | Pin running build |
| `dirq_agents_total` / `dirq_agents_online` | gauge | - | Fleet size |
| `dirq_zone_leaders` | gauge | - | Direct server connections |
| `dirq_tree_depth_max` | gauge | - | Deepest path in the relay tree |
| `dirq_subtree_size` | gauge | `zone_leader` (hostname) | Per-ZL fan-out; spot imbalance |
| `dirq_inflight_sessions` | gauge | `kind` (query/exec/deploy) | Active broadcasts |
| `dirq_inflight_pending_targets` | gauge | `kind` | Sum of unaccounted targets (is anything stuck?) |
| `dirq_broadcast_total` | counter | `kind`, `outcome` (complete/incomplete/hard_timeout/canceled) | Activity + reliability |
| `dirq_broadcast_duration_seconds` | histogram | `kind` | Latency |
| `dirq_broadcast_missing_total` | counter | `kind` | Sum of did-not-reply across completions |
| `dirq_register_total` | counter | `outcome` (ok/rejected_secret/rejected_other) | Registration activity |
| `dirq_register_duration_seconds` | histogram | - | Register RPC latency |
| `dirq_peer_disconnect_total` / `dirq_peer_connect_total` | counter | - | Mesh churn |
| `dirq_fact_stage_depth` | gauge | - | Fact-cache backpressure |
| `dirq_fact_flush_total` | counter | `backend`, `outcome` | Postgres/SQLite write activity |
| `dirq_fact_flush_duration_seconds` | histogram | `backend` | SQLite writer-lock watch |
| `dirq_server_cert_expiry_seconds` | gauge | - | Server TLS cert countdown (alert if < 7d) |
Plus all free Go runtime metrics (go_goroutines, go_memstats_*, go_gc_duration_seconds, etc.).
Fleet composition — aggregated views of the managed fleet, sliced by collected facts. One combined gauge with bounded-cardinality labels:
dirq_fleet_count{os,distro,distro_version,arch,cores_bucket,memory_gb_bucket,exec_enabled,online}
Major distro version only (8 not 8.10) to bound cardinality; minor versions remain queryable via the Postgres data source (below). Recomputed every 30 s (refreshFleetMetricsInterval) so /metrics scrapes stay cheap.
# Fleet count by distro + major version, stacked area
sum by (distro, distro_version) (dirq_fleet_count{online="true"})
# Online percentage trend
dirq_agents_online / dirq_agents_total
# Did-not-reply rate, last 5 min
rate(dirq_broadcast_missing_total[5m])
/ rate(dirq_broadcast_total{outcome=~"complete|incomplete"}[5m])
# 95p exec duration
histogram_quantile(0.95, rate(dirq_broadcast_duration_seconds_bucket{kind="exec"}[5m]))
# Cert expiry alert
dirq_server_cert_expiry_seconds < 7 * 86400
scrape_configs:
  - job_name: dirq
    metrics_path: /metrics
    scheme: https # drop to http if TLS is disabled
    tls_config:
      insecure_skip_verify: true # if using self-signed certs
    static_configs:
      - targets: ['dirq-server:8080']

Default retention (15 d) is enough for week-over-week trends; bump `--storage.tsdb.retention.time=90d` for quarterly views.
For queries the Prometheus metrics don't cover (per-host kernel versions, specific package presence, disk usage above N%), point Grafana at the dirq database directly with a read-only role:
CREATE ROLE grafana_readonly LOGIN PASSWORD '...';
GRANT CONNECT ON DATABASE dirq TO grafana_readonly;
GRANT USAGE ON SCHEMA public TO grafana_readonly;
GRANT SELECT ON agents, agent_facts, exec_log, queries TO grafana_readonly;

Then panels are SQL against the `agents` and `agent_facts` tables:
-- Top 20 hosts by disk usage in prod
SELECT a.hostname,
p->>'mount_point' AS mount,
(p->>'pct_used')::float AS pct_used
FROM agents a
JOIN agent_facts f ON f.agent_id = a.id AND f.module = 'disk'
CROSS JOIN LATERAL jsonb_array_elements(f.data->'partitions') AS p
WHERE a.tags->>'env' = 'prod'
AND (p->>'pct_used')::float > 85
ORDER BY pct_used DESC
LIMIT 20;
-- Distinct kernel versions present today
SELECT data->>'kernel_version' AS kernel, COUNT(*) AS hosts
FROM agent_facts
WHERE module = 'os_info'
GROUP BY 1
ORDER BY 2 DESC;

Postgres queries return current state only; for time-series trends, use the Prometheus metrics. For retention beyond what Prometheus holds, an external snapshot table is the standard option but isn't bundled.
The inventory plugin creates a nested group hierarchy from agent metadata and tags:
@all
├── @os_linux / @os_windows
├── @arch_amd64 / @arch_arm64
├── @exec_enabled
├── @tag_env
│ ├── @tag_env_prod
│ └── @tag_env_dev
├── @tag_role
│ ├── @tag_role_webserver
│ └── @tag_role_database
└── @tag_dc
├── @tag_dc_us_east
└── @tag_dc_eu_west
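As a rough illustration of how the group names in this tree could be derived from agent metadata, here is a minimal Python sketch. The function name `group_names` and the sanitizing rule are assumptions: the tree only shows that `dc=us-east` ends up as `tag_dc_us_east`, so replacing every non-alphanumeric character with an underscore is a guess that happens to reproduce that example.

```python
import re


def _safe(name):
    # Assumption: anything outside [A-Za-z0-9_] becomes "_" (us-east -> us_east).
    return re.sub(r"[^A-Za-z0-9_]", "_", name)


def group_names(os_name, arch, exec_enabled, tags):
    """Return the inventory groups one host would belong to."""
    groups = [f"os_{_safe(os_name)}", f"arch_{_safe(arch)}"]
    if exec_enabled:
        groups.append("exec_enabled")
    for key, value in sorted(tags.items()):
        groups.append(f"tag_{_safe(key)}")                  # parent group, e.g. tag_env
        groups.append(f"tag_{_safe(key)}_{_safe(value)}")   # child group, e.g. tag_env_prod
    return groups


if __name__ == "__main__":
    print(group_names("linux", "amd64", True, {"env": "prod", "dc": "us-east"}))
```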
Target hosts with standard Ansible patterns:
hosts: os_linux
hosts: tag_env_prod
hosts: tag_role_webserver:&os_linux # intersection
hosts: exec_enabled

All collected data is exposed as dirq_* hostvars:
dirq_agent_id: "abc-123"
dirq_os: "linux"
dirq_cpu: { physical_cores: 8, logical_cores: 16, ... }
dirq_memory: { total_bytes: 34359738368, pct_used: 34.4, ... }
dirq_disk: { partitions: [{ mount_point: "/", pct_used: 67.3, ... }] }
dirq_tag_env: "prod"
dirq_exec_enabled: true

The inventory plugin accepts an optional query parameter. Only hosts matching the query appear in the inventory:
# inventories/vulnerable-openssl.yml
plugin: atgreen.dirq.dirq
server_url: http://dirq-server:8080
query: "SELECT os_info.hostname WHERE packages.name = 'openssl' AND packages.version LIKE '1.%'"
# inventories/disks-full.yml
plugin: atgreen.dirq.dirq
server_url: http://dirq-server:8080
query: "SELECT os_info.hostname WHERE disk.pct_used > 90"

In AAP, each file becomes an Inventory Source. Job templates pair each inventory with a remediation playbook:
| Job Template | Inventory Source | Playbook | Targets |
|---|---|---|---|
| Patch OpenSSL | vulnerable-openssl.yml | update-openssl.yml | Hosts with OpenSSL 1.x |
| Fix Full Disks | disks-full.yml | cleanup-disks.yml | Hosts over 90% disk |
The query runs in real time during inventory sync — the host list is always current.
Standalone:
DIRQ_QUERY="SELECT os_info.hostname WHERE disk.pct_used > 90" \
  ansible-playbook -i ansible/dirq_inventory.py cleanup-disks.yml

# Tag a single host by ID
dirq hosts tag <agent-id> env=prod role=webserver dc=us-east
# Tag multiple hosts with a WHERE clause
dirq hosts tag env=prod WHERE os_info.os = 'linux'
dirq hosts tag role=webserver WHERE tag.dc = 'us-east'
# Untag by ID or query
dirq hosts untag <agent-id> role dc
dirq hosts untag env WHERE tag.env = 'staging'

Tags flow into inventory groups automatically.
The relay mesh doubles as an Ansible connection transport. The inventory plugin
automatically sets ansible_connection for exec-enabled hosts, so existing
playbooks work without modification — no need to add connection: dirq or
gather_facts: false.
# This just works; no connection: dirq needed.
# The inventory plugin handles it.
- hosts: tag_env_prod
  tasks:
    - command: uptime
    - copy:
        src: app.conf
        dest: /etc/myapp/app.conf
    - fetch:
        src: /var/log/status.log
        dest: /tmp/status.log
        flat: yes

The inventory plugin also maps DirQ facts to standard Ansible variables
(ansible_os_family, ansible_distribution, ansible_architecture,
ansible_processor_vcpus, ansible_memtotal_mb, etc.) and sets OS-specific
shell and interpreter settings (ansible_shell_type, ansible_python_interpreter
for Linux, powershell for Windows). Most existing roles work without changes.
- AAP launches a job template; the inventory has already set ansible_connection
- The connection plugin routes exec_command/put_file/fetch_file to the DirQ server REST API
- The server pushes through the relay mesh to the target agent
- The agent executes locally and returns results back through the mesh
- AAP records the job result normally
Exec is disabled by default — opt in per agent:
DIRQ_EXEC_ENABLED=true ./bin/dirq-agent

Default exec timeout is 300 seconds (5 minutes), configurable via dirq_exec_timeout
in the connection plugin. Long-running tasks like yum update work without special
handling — the broadcast dispatcher has no idle timeout, so --timeout 3600 against
a slow fleet behaves as written rather than getting cut off after the first burst of
fast responders. Exec responses are forwarded immediately through the relay chain —
they are not batched by the result aggregator.
Every operation is logged in PostgreSQL with AAP job attribution:
curl "$DIRQ_SERVER_URL/api/v1/exec_log?aap_job_id=42"

For quick ad-hoc tasks that don't need a full Ansible playbook, dirq exec runs a command or script across matching hosts in parallel and streams results back in real time.
dirq exec -- uptime
dirq exec WHERE tag.env = 'prod' -- openssl version
dirq exec --become WHERE tag.role = 'webserver' -- systemctl restart nginx
dirq exec -- hostname -f
dirq exec --json -- df -h /

Upload and execute a local script file with --script. Linux scripts honor their shebang. Windows .ps1 files run with PowerShell.
dirq exec WHERE tag.env = 'prod' --script ./health-check.sh
dirq exec WHERE os_info.os = 'windows' --script ./audit.ps1
dirq exec WHERE tag.role = 'webserver' --become --script ./patch.sh

With --script, no -- separator is needed since the script path is a dirq flag, not a remote command.
Search log files across the fleet without a centralized logging stack. Uses grep on Linux and Select-String on Windows.
dirq grep "Out of memory" /var/log/messages
dirq grep -i "error|timeout" /var/log/nginx/error.log WHERE tag.env = 'prod'
dirq grep "FATAL" /var/log/app.log --tail 1000
dirq grep "Failed password" /var/log/secure --become

Results are formatted as a table with matches grouped by host:
HOST LINE MATCH
web-prod-01 4821 Jan 15 03:22:41 kernel: Out of memory: Killed process 1234 (java)
web-prod-01 6103 Jan 15 08:14:02 kernel: Out of memory: Killed process 5678 (python3)
db-prod-02 11042 Jan 14 22:01:18 kernel: Out of memory: Killed process 891 (mysqld)
3 matches across 2 hosts (15 hosts searched)
Use --tail N to search only the last N lines of a file (avoids scanning multi-GB logs). Use --become for files that require root access (e.g. /var/log/secure).
Results stream back as each host responds — fastest hosts appear first:
Targets: 3
── web-01 rc=0 ──
14:23:01 up 42 days, 3:17, 0 users, load average: 0.12, 0.08, 0.05
── db-01 rc=0 ──
14:23:01 up 91 days, 12:44, 0 users, load average: 0.45, 0.38, 0.31
── web-02 rc=0 ──
14:23:02 up 13 days, 7:02, 0 users, load average: 0.03, 0.05, 0.01
3/3 completed
With --json, output is NDJSON (one JSON object per line), suitable for piping.
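Because NDJSON is one JSON object per line, the stream can be consumed line by line as hosts respond. The sketch below shows one way to collect hosts with a non-zero return code. The field names `host` and `rc` are assumptions made for illustration (the text above only shows the human-readable `rc=0` format), so check the real output before relying on them.

```python
import json


def failed_hosts(ndjson_text, host_key="host", rc_key="rc"):
    """Return hosts whose record carries a non-zero return code.

    host_key and rc_key are hypothetical field names.
    """
    failed = []
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record.get(rc_key, 0) != 0:
            failed.append(record.get(host_key))
    return failed


if __name__ == "__main__":
    sample = '{"host": "web-01", "rc": 0}\n{"host": "db-01", "rc": 1}\n'
    print(failed_hosts(sample))
```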
TLS is enabled by default on all gRPC and REST API connections. If no certificates are configured, self-signed certs are auto-generated at startup.
| TLS vars set | Behavior |
|---|---|
| Nothing | Auto-generate self-signed + mTLS cert issuance per agent |
| CERT + KEY | TLS with user certs, no mTLS |
| CERT + KEY + CA + CA_KEY | Full mTLS with user-supplied CA |
| DIRQ_TLS_DISABLED=true | Explicitly insecure (must opt in) |
When the server has access to the CA private key (auto-generated or via DIRQ_TLS_CA_KEY), it issues a unique TLS client certificate to each agent during registration. The certificate's CN is the agent ID, binding the TLS identity to the application identity.
After registration:
- All gRPC connections (AgentStream, RequestPeers, relay) require a valid client cert signed by the server's CA
- The server and relay agents verify that the cert CN matches the claimed agent ID
- The registration secret becomes a one-time bootstrap token — a leaked secret can register an agent once, but the cert it receives is bound to that specific agent ID
This activates automatically when the CA key is available. On auto-generated certs, it's always on. For user-supplied certs, set DIRQ_TLS_CA_KEY.
Agents persist their issued cert to disk and reuse it across restarts. Certs are valid for 1 year; agents renew automatically when within 30 days of expiry (no restart needed).
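The renewal rule above amounts to a simple date comparison. The following Python sketch is illustrative only (the agent itself is written in Go): `needs_renewal` and `RENEWAL_WINDOW` are hypothetical names, and treating "1 year" as 365 days is an assumption.

```python
from datetime import datetime, timedelta, timezone

RENEWAL_WINDOW = timedelta(days=30)  # "within 30 days of expiry"


def needs_renewal(not_after, now, window=RENEWAL_WINDOW):
    """True once the certificate's expiry is inside the renewal window."""
    return now >= not_after - window


if __name__ == "__main__":
    issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
    not_after = issued + timedelta(days=365)  # assumption: 1 year = 365 days
    print(needs_renewal(not_after, issued + timedelta(days=340)))
```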
Generate certs:
# Self-signed CA (quick start)
dirq cert generate --dir ./certs
# Use your own CA
dirq cert generate --ca ./my-ca.crt --ca-key ./my-ca.key --dir ./certs

Both generate server.crt, server.key, agent.crt, agent.key, and a copy of ca.crt in the output directory.
Full mTLS with user-supplied CA:
# Server (needs CA key to issue per-agent certs)
DIRQ_TLS_CA=./certs/ca.crt DIRQ_TLS_CA_KEY=./certs/ca.key \
DIRQ_TLS_CERT=./certs/server.crt DIRQ_TLS_KEY=./certs/server.key dirq-server
# Agent (only needs CA cert — gets its own cert during registration)
DIRQ_TLS_CA=./certs/ca.crt dirq-agent

Rotate certificates across the fleet without downtime:
dirq cert rotate agent_cert --stagger 3600 # renew all agent certs over 1 hour
dirq cert rotate ca --stagger 3600 # distribute a new CA
dirq cert rotate signing_key                # roll the message signing key

The --stagger flag spreads renewals over time to avoid overloading the server. See SECURITY.md for the full rotation procedure including CA and signing key rotation.
API authentication is required by default. On first startup, a bootstrap token is auto-generated and printed to the server log. Save it.
dirq token create ops-team --scope admin
dirq token create monitoring --scope readonly
export DIRQ_TOKEN=<token>

Token scopes are enforced per-endpoint:
- readonly: queries, host listing, facts, inventory, query history, exec log
- admin: all of the above, plus tag management, token management, exec, put_file, fetch_file, deploy
Set DIRQ_AUTH_DISABLED=true to disable (not recommended).
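The two token scopes can be pictured as a pair of operation sets, with admin being a superset of readonly. This is a minimal sketch of that idea, not the server's implementation; the operation labels (`query`, `hosts_list`, and so on) are hypothetical shorthand for the endpoint groups listed above.

```python
# Hypothetical labels for the endpoint groups named in the scope list.
READONLY_OPS = frozenset(
    {"query", "hosts_list", "facts", "inventory", "query_history", "exec_log"}
)
ADMIN_ONLY_OPS = frozenset(
    {"tags", "tokens", "exec", "put_file", "fetch_file", "deploy"}
)


def is_allowed(scope, operation):
    """Check an operation against a token scope; unknown scopes are denied."""
    if scope == "admin":
        return operation in READONLY_OPS or operation in ADMIN_ONLY_OPS
    if scope == "readonly":
        return operation in READONLY_OPS
    return False


if __name__ == "__main__":
    print(is_allowed("readonly", "exec"), is_allowed("admin", "exec"))
```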
Every control message the server sends through the relay mesh — queries, exec requests, file transfers, rebalancer commands — is signed with Ed25519 before dispatch. Each agent verifies the signature before processing.
This is critical because queries and exec requests flow through relay agents. Without signing, a compromised relay could inject fake commands to downstream agents. With signing:
- Only the server can originate commands. Relay agents forward signed messages but cannot forge them.
- Signatures include an expiry window (5 minutes), preventing replay attacks.
- The server's public key is distributed to agents during registration over the TLS-protected gRPC stream.
The signing key pair is auto-generated on first startup and persisted. To use a pre-generated key, set DIRQ_SIGNING_KEY.
By default, any client that can reach the server's gRPC port can register as an agent. For production deployments, set a registration secret — a pre-shared key that agents must present during registration:
# Server
DIRQ_REGISTRATION_SECRET=my-fleet-secret dirq-server
# Agent
DIRQ_REGISTRATION_SECRET=my-fleet-secret dirq-agent

Or in config files:
# /etc/dirq/server.conf
registration_secret: my-fleet-secret
# /etc/dirq/agent.conf
registration_secret: my-fleet-secret
When configured, the server rejects Register calls that don't present the matching secret. This prevents unauthorized hosts from joining the mesh.
Session tokens issued during registration are Ed25519-signed and time-stamped. They expire after 24 hours, at which point the agent re-registers automatically to obtain a fresh token. Relay peers verify session tokens cryptographically using the server's signing public key — no shared state between relays and the server is needed.
- Server-originated only: exec requests must come from the server and carry a valid Ed25519 signature. Relay agents forward but cannot forge exec requests.
- Opt-in per agent: exec_enabled defaults to false.
- Full audit trail: every operation logged with AAP job ID, user, command, exit status.
- AAP retains authority: DirQ is the data plane; AAP controls RBAC, credentials, approvals.
- File transfer limits: 100 MB default.
- Windows: agent runs as SYSTEM (Windows Service). Become uses PowerShell scheduled tasks.
- Linux: become uses sudo -n (non-interactive, NOPASSWD required).
The Production Deployment fundamentals apply to every server below: each must observe agents' real source IPs and persist /var/lib/dirq.
Run one DirQ server per datacenter. Meshes never span DC boundaries.
DC us-east DC eu-west
┌──────────────────────┐ ┌──────────────────────┐
│ Agents ──► DirQ │ │ Agents ──► DirQ │
│ Server │ │ Server │
│ + PG │ │ + PG │
└──────────┬───────────┘ └──────────┬───────────┘
│ │
▼ ▼
┌──────────────────────────────────────────────────────────┐
│ AAP Controller │
│ Inventory Source per DC → all merge into one inventory │
│ Each host carries dirq_server_url from its DC │
└──────────────────────────────────────────────────────────┘
The inventory plugin sets dirq_server_url per host. The connection plugin reads it automatically — a host from us-east routes through dirq-us-east, a host from eu-west routes through dirq-eu-west, even in the same play.
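Conceptually, per-host routing means the host list of a single play is partitioned by each host's `dirq_server_url`. The Python sketch below illustrates that partitioning only; `group_by_server` is a hypothetical helper and the URLs in the example are placeholders, not real endpoints.

```python
from collections import defaultdict


def group_by_server(hostvars):
    """Partition hosts by the DirQ server their hostvars point at."""
    routes = defaultdict(list)
    for host, variables in hostvars.items():
        routes[variables["dirq_server_url"]].append(host)
    return {url: sorted(hosts) for url, hosts in routes.items()}


if __name__ == "__main__":
    # Placeholder hostnames and URLs for illustration.
    inventory = {
        "web-01": {"dirq_server_url": "https://dirq-us-east:8080"},
        "web-02": {"dirq_server_url": "https://dirq-eu-west:8080"},
        "db-01": {"dirq_server_url": "https://dirq-us-east:8080"},
    }
    print(group_by_server(inventory))
```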
- hosts: tag_env_prod            # spans all DCs
  connection: atgreen.dirq.dirq
  tasks:
    - command: uptime            # routed through correct DC per host

cd collection/atgreen/dirq
ansible-galaxy collection build
ansible-galaxy collection install atgreen-dirq-1.0.0.tar.gz

Includes: atgreen.dirq.dirq inventory plugin + connection plugin.
# execution-environment.yml
version: 3
dependencies:
  galaxy:
    collections:
      - name: atgreen.dirq

ansible-builder build -t dirq-ee:latest

Import the DirQ credential type from collection/atgreen/dirq/docs/aap-credential-type.yml or create it manually. It injects DIRQ_SERVER_URL and DIRQ_TOKEN as environment variables.
- Build and publish the atgreen.dirq collection
- Build a custom EE and push to your registry
- Import the DirQ credential type in AAP
- Create DirQ credentials (one per DC if multi-DC)
- Add inventory sources using atgreen.dirq.dirq plugin
- Create job templates with connection: atgreen.dirq.dirq
- Attach DirQ credentials to job templates
DirQ includes a built-in Model Context Protocol (MCP) server, allowing LLMs like Claude to manage your fleet directly as a tool.
Start the MCP server:
dirq mcp

This runs an MCP stdio server that exposes fleet management tools over JSON-RPC 2.0.
Add to claude_desktop_config.json:
{
"mcpServers": {
"dirq": {
"command": "dirq",
"args": ["mcp"],
"env": {
"DIRQ_SERVER_URL": "https://your-server:8080",
"DIRQ_TOKEN": "your-token"
}
}
}
}

Add to your project's .mcp.json:
{
"mcpServers": {
"dirq": {
"command": "dirq",
"args": ["mcp"],
"env": {
"DIRQ_SERVER_URL": "https://your-server:8080",
"DIRQ_TOKEN": "your-token"
}
}
}
}

| Tool | Description |
|---|---|
| dirq_hosts_list | List all registered hosts, optionally filtered by WHERE clause |
| dirq_hosts_show | Show detailed info for a specific host |
| dirq_hosts_facts | Get real-time system facts (CPU, memory, disk, packages, etc.) |
| dirq_hosts_tag | Add or update tags on hosts |
| dirq_query | Run DirQ SELECT queries across the fleet |
| dirq_exec | Execute shell commands on targeted hosts |
| dirq_cve_scan | Scan RHEL hosts for a specific CVE vulnerability |
| dirq_errata_check | Check fleet against a Red Hat advisory |
| dirq_kb_check | Check Windows hosts for installed hotfixes |
| dirq_graph | Show the fleet mesh topology |
With the MCP server configured, you can ask Claude things like:
- "Which hosts in prod have more than 80% disk usage?"
- "Are any of our RHEL hosts vulnerable to CVE-2024-6345?"
- "Tag all Windows hosts with role=iis"
- "Run uptime on all Linux hosts in staging"
- "Show me the fleet topology"
Both the server and agent support configuration via config files, environment variables, or both. Environment variables always override config file values, which override defaults.
Config files use a simple key: value format with optional indented tags: block. Comments start with #.
Agent config — /etc/dirq/agent.conf (Linux) or C:\ProgramData\dirq\agent.conf (Windows):
# DirQ agent configuration
server: grpc.example.com:50051
listen: 0.0.0.0:50052
exec_enabled: true
tags:
  env: prod
  dc: us-east
  role: webserver
Server config — /etc/dirq/server.conf (Linux) or C:\ProgramData\dirq\server.conf (Windows):
# DirQ server configuration
grpc_addr: :50051
http_addr: :8080
db_url: postgres://dirq:dirq@db.internal:5432/dirq?sslmode=require
max_zone_leaders: 10
max_children: 50
registration_secret: my-fleet-secret
tls_ca: /etc/dirq/certs/ca.crt
tls_cert: /etc/dirq/certs/server.crt
tls_key: /etc/dirq/certs/server.key
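To make the file format concrete, here is a small Python sketch of a reader for the `key: value` layout with an indented `tags:` block and `#` comments. It is an approximation under stated assumptions, not the parser DirQ ships: it splits on the first colon only (so values such as `:50051` or `grpc.example.com:50051` survive intact), keeps every value as a string, and does not handle inline comments.

```python
def parse_config(text):
    """Parse 'key: value' lines plus an optional indented 'tags:' block."""
    config = {}
    tags = {}
    in_tags = False
    for raw in text.splitlines():
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        indented = raw[0] in " \t"
        key, sep, value = raw.strip().partition(":")  # split on the first colon only
        if not sep:
            continue  # not a key: value line; a stricter parser might reject it
        key, value = key.strip(), value.strip()
        if indented and in_tags:
            tags[key] = value
            continue
        in_tags = key == "tags" and value == ""
        if not in_tags:
            config[key] = value
    if tags:
        config["tags"] = tags
    return config


if __name__ == "__main__":
    sample = "server: grpc.example.com:50051\ntags:\n  env: prod\n"
    print(parse_config(sample))
```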
Override the config file path with DIRQ_CONFIG:
DIRQ_CONFIG=/opt/dirq/custom.conf dirq-agent

If the config file doesn't exist, it is silently ignored; all values fall back to environment variables or defaults.
Priority: environment variable > config file > default.
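The priority order can be expressed as a three-step lookup. The sketch below is a Python illustration with a hypothetical `resolve` helper; it assumes the environment variable name is `DIRQ_` plus the upper-cased config key, which matches the keys listed in the tables in this section, and it treats a variable that is set but empty as set.

```python
def resolve(key, env, file_config, defaults):
    """Look up one setting: environment variable > config file > default."""
    env_name = "DIRQ_" + key.upper()  # e.g. max_children -> DIRQ_MAX_CHILDREN
    if env_name in env:
        return env[env_name]
    if key in file_config:
        return file_config[key]
    return defaults.get(key)


if __name__ == "__main__":
    import os

    print(resolve("http_addr", os.environ, {}, {"http_addr": ":8080"}))
```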
| Config key | Environment variable | Default | Description |
|---|---|---|---|
| grpc_addr | DIRQ_GRPC_ADDR | :50051 | gRPC listen address |
| http_addr | DIRQ_HTTP_ADDR | :8080 | REST API listen address |
| db_url | DIRQ_DB_URL | sqlite:///var/lib/dirq/dirq.db | Database URL (SQLite or postgres://...) |
| pod_id | DIRQ_POD_ID | hostname | Unique pod identifier |
| max_zone_leaders | DIRQ_MAX_ZONE_LEADERS | 5 | Max direct server connections |
| max_children | DIRQ_MAX_CHILDREN | 50 | Max children per node (fan-out) |
| auth_disabled | DIRQ_AUTH_DISABLED | false | Disable API auth (not recommended) |
| require_aap_binding | DIRQ_REQUIRE_AAP_BINDING | false | When true, reject write ops whose aap_user the token isn't bound to, and forbid unbound tokens from write ops (see Security) |
| registration_secret | DIRQ_REGISTRATION_SECRET | | Pre-shared secret for agent registration (see Security) |
| leader_election | DIRQ_LEADER_ELECTION | false | Enable Postgres advisory-lock leader election for multi-pod HA (see HA.md) |
| fact_flush_interval | DIRQ_FACT_FLUSH_INTERVAL | 250ms | Fact-cache batch flush interval |
| fact_flush_size | DIRQ_FACT_FLUSH_SIZE | 5000 | Distinct (agent_id, module) keys per flush |
| fact_stage_cap | DIRQ_FACT_STAGE_CAP | 20000 | Hard cap on staged distinct keys (drops only new keys on saturation) |
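One way to read the three fact-cache settings is as a staging map keyed by (agent_id, module): updates to a key that is already staged always succeed, brand-new keys are dropped once the cap is reached, and each flush drains at most `fact_flush_size` keys. The Python sketch below models that reading. It is an interpretation of the table descriptions, not the server's Go implementation, and the class and method names are hypothetical.

```python
class FactStage:
    """Toy model of a capped staging buffer for fact updates."""

    def __init__(self, flush_size=5000, stage_cap=20000):
        self.flush_size = flush_size
        self.stage_cap = stage_cap
        self.staged = {}   # (agent_id, module) -> latest data
        self.dropped = 0   # count of new keys refused at saturation

    def put(self, agent_id, module, data):
        key = (agent_id, module)
        if key not in self.staged and len(self.staged) >= self.stage_cap:
            self.dropped += 1  # saturated: only new keys are dropped
            return False
        self.staged[key] = data  # newest value wins for an existing key
        return True

    def flush(self):
        """Drain up to flush_size keys, oldest staged first."""
        keys = list(self.staged)[: self.flush_size]
        return {key: self.staged.pop(key) for key in keys}


if __name__ == "__main__":
    stage = FactStage(flush_size=2, stage_cap=3)
    for agent in ("a", "b", "c", "d"):
        stage.put(agent, "disk", {"pct_used": 10})
    print(len(stage.staged), stage.dropped)
```

In the real server a timer (`fact_flush_interval`) would presumably trigger the flush; the sketch leaves scheduling out.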
| Config key | Environment variable | Default | Description |
|---|---|---|---|
| server | DIRQ_SERVER | localhost:50051 | DirQ server gRPC address |
| listen | DIRQ_LISTEN | :50052 | Relay listener (always enabled) |
| exec_enabled | DIRQ_EXEC_ENABLED | false | Enable remote execution |
| registration_secret | DIRQ_REGISTRATION_SECRET | | Must match server's registration secret |
| tags: block | DIRQ_TAGS | | Tags: env=prod,dc=us-east |
| hostname | DIRQ_HOSTNAME | (autodetected) | Override the hostname the agent reports |
| virtual_hosts | DIRQ_VIRTUAL_HOSTS | 0 | Spawn N in-process virtual hosts for fleet emulation (Linux only) |
| hostname_prefix | DIRQ_HOSTNAME_PREFIX | | Prefix for synthesized virtual-host names (<prefix>-NNNNN) |
| registration_jitter_seconds | DIRQ_REGISTRATION_JITTER_SECONDS | (auto for multi-VH) | Cap on random startup delay before first Register; smooths thundering-herd boot |
| policy_file | DIRQ_POLICY_FILE | | Path to a local OPA/Rego policy evaluated before exec/file/deploy side effects (see Agent-side policy) |
| policy_fail_closed | DIRQ_POLICY_FAIL_CLOSED | true when policy_file is set | Deny if the policy fails to load or evaluate |
| policy_query | DIRQ_POLICY_QUERY | data.dirq.agent.allow | Rego decision query |
Tags can be set in the config file as an indented block under tags:, or via the DIRQ_TAGS environment variable as comma-separated key=value pairs. Both sources are merged, with environment variables taking precedence for duplicate keys.
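The merge rule can be sketched in a few lines of Python. `parse_env_tags` and `merge_tags` are hypothetical helper names; the behavior for malformed pairs (silently skipped here) is an assumption, since the text only specifies the comma-separated key=value format and the precedence.

```python
def parse_env_tags(value):
    """Parse a DIRQ_TAGS-style string such as 'env=prod,dc=us-east'."""
    tags = {}
    for pair in value.split(","):
        key, sep, val = pair.strip().partition("=")
        if sep and key:
            tags[key.strip()] = val.strip()
        # Assumption: pairs without '=' or without a key are skipped.
    return tags


def merge_tags(file_tags, env_value):
    """Merge both sources; environment values win on duplicate keys."""
    merged = dict(file_tags)
    if env_value:
        merged.update(parse_env_tags(env_value))
    return merged


if __name__ == "__main__":
    print(merge_tags({"env": "dev", "role": "webserver"}, "env=prod,dc=us-east"))
```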
An optional Rego policy lets each agent refuse local operations even when the
server validly authorized them — defense in depth, not a replacement for
server-side authorization. Set policy_file and the agent compiles the policy
at startup and evaluates it before every exec, put_file, fetch_file, and
deploy side effect. Denied operations return a terminal policy denied: …
error and run nothing locally.
exec_enabled: true
policy_file: /etc/dirq/policy.rego
policy_fail_closed: true
The policy queries data.dirq.agent.allow (boolean) and an optional
data.dirq.agent.reason (string). Input is a stable, documented JSON document
per operation — never raw file content, script bodies, or environment values
(those are reduced to sizes, SHA-256 hashes, and key names). For example:
package dirq.agent
default allow := false
default reason := "denied by default"
# Prod hosts: only an approved AAP template may restart nginx.
allow if {
input.operation == "exec"
input.tags.env == "prod"
input.aap_job_template == "restart-nginx"
input.command == "systemctl restart nginx"
}
# Writes limited to one app's config directory.
allow if {
input.operation == "put_file"
startswith(input.dest_path, "/etc/myapp/")
input.content_size <= 1048576
}

Ready-to-adapt examples (minimal allowlist, production AAP-only, file-path
restrictions) ship under examples/policy/. With no
policy_file configured, agent behavior is unchanged. See
SECURITY.md for the full model.
| Config key | Environment variable | Default | Description |
|---|---|---|---|
| tls_ca | DIRQ_TLS_CA | | CA certificate path |
| tls_ca_key | DIRQ_TLS_CA_KEY | | CA private key path (server only; enables per-agent mTLS cert issuance) |
| tls_cert | DIRQ_TLS_CERT | | This process's certificate path |
| tls_key | DIRQ_TLS_KEY | | This process's private key path |
| tls_insecure | DIRQ_TLS_INSECURE | false | Skip cert verification (agent only) |
| tls_disabled | DIRQ_TLS_DISABLED | false | Disable TLS entirely (not recommended) |
Example agent config with TLS and registration secret:
server: grpc.example.com:50051
exec_enabled: true
registration_secret: my-fleet-secret
tls_ca: /etc/dirq/certs/ca.crt
tls_cert: /etc/dirq/certs/agent.crt
tls_key: /etc/dirq/certs/agent.key
tags:
  env: prod
| Config key | Environment variable | Default | Description |
|---|---|---|---|
| signing_key | DIRQ_SIGNING_KEY | | Ed25519 private key file |
| signing_pub | DIRQ_SIGNING_PUB | | Ed25519 public key file |
Config files support inline base64-encoded PEM certs, so a single file contains everything an agent needs. The server generates these automatically in /var/lib/dirq/agent.conf.
| Config key | Environment variable | Description |
|---|---|---|
| tls_ca_data | DIRQ_TLS_CA_DATA | Base64-encoded CA certificate PEM |
| tls_cert_data | DIRQ_TLS_CERT_DATA | Base64-encoded agent certificate PEM |
| tls_key_data | DIRQ_TLS_KEY_DATA | Base64-encoded agent private key PEM |
When tls_ca_data/tls_cert_data/tls_key_data are set and no file paths are given, the agent materializes them to /var/lib/dirq/tls/ on startup.
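Materializing inline certificates boils down to base64-decoding each value and writing the PEM bytes to a file. The Python sketch below shows the general shape under several assumptions: the output file names (`ca.crt`, `agent.crt`, `agent.key`) and the restrictive permissions are illustrative choices, not documented agent behavior, and `materialize` is a hypothetical helper.

```python
import base64
import os

# Hypothetical mapping from config key to output file name.
FILENAMES = {
    "tls_ca_data": "ca.crt",
    "tls_cert_data": "agent.crt",
    "tls_key_data": "agent.key",
}


def materialize(config, directory):
    """Decode inline base64 PEM values and write them under directory."""
    os.makedirs(directory, mode=0o700, exist_ok=True)
    written = {}
    for key, filename in FILENAMES.items():
        if key not in config:
            continue
        pem = base64.b64decode(config[key])
        path = os.path.join(directory, filename)
        # Assumption: create files owner-readable only, since one holds a private key.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "wb") as handle:
            handle.write(pem)
        written[key] = path
    return written
```

For example, calling `materialize(config, "/var/lib/dirq/tls")` would mirror the directory named above, given a parsed config dictionary.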
Config file: ~/.config/dirq/client.conf (user-local, checked first) or /etc/dirq/client.conf (system-wide). On Windows: %APPDATA%\dirq\client.conf or C:\ProgramData\dirq\client.conf. The server generates a ready-to-copy client.conf at /var/lib/dirq/client.conf.
# ~/.config/dirq/client.conf
server_url: https://dirq-server:8080
token: <your-api-token>
tls_insecure: true
| Config key | Variable / Flag | Default | Description |
|---|---|---|---|
| server_url | DIRQ_SERVER_URL / --server | (required) | Server REST URL |
| token | DIRQ_TOKEN / --token | | API token |
| tls_insecure | DIRQ_TLS_INSECURE / --tls-insecure | false | Skip TLS verification |
| llm_url | DIRQ_LLM_URL | | LLM API base URL (Anthropic or OpenAI-compatible) |
| llm_api_key | DIRQ_LLM_API_KEY | | LLM API key |
| llm_model | DIRQ_LLM_MODEL | claude-sonnet-4-20250514 | LLM model name |
| | --json | false | Raw JSON output |
For dirq ask, if DIRQ_LLM_* is not configured, the CLI falls back to ANTHROPIC_API_KEY with Anthropic's native API.
| Method | Path | Description |
|---|---|---|
| POST | /api/v1/query | Submit a DirQ query |
| GET | /api/v1/hosts | List hosts |
| GET | /api/v1/hosts/{id} | Host details |
| GET | /api/v1/hosts/{id}/facts | Cached facts |
| PUT | /api/v1/hosts/{id}/tags | Replace tags |
| PATCH | /api/v1/hosts/{id}/tags | Merge tags |
| DELETE | /api/v1/hosts/{id}/tags/{key} | Remove tag |
| GET | /api/v1/queries | Recent queries |
| POST | /api/v1/tokens | Create token |
| GET | /api/v1/tokens | List tokens |
| DELETE | /api/v1/tokens/{name} | Delete token |
| GET | /api/v1/inventory | Ansible inventory |
| POST | /api/v1/exec | Execute command (single agent) |
| POST | /api/v1/exec_multi | Execute command/script across fleet (streaming NDJSON) |
| POST | /api/v1/put_file | Write file |
| POST | /api/v1/fetch_file | Read file |
| GET | /api/v1/exec_log | Exec audit log |
| GET | /api/v1/debug/inflight | In-flight broadcast sessions with per-ZL breakdown (admin) |
| GET | /api/v1/status | Fleet status (agent counts, ZLs, tree depth, database kind) |
| GET | /healthz | Liveness: process is up |
| GET | /readyz | Readiness: this pod is the active leader (200) or a standby (503); always 200 when leader election is disabled |
| GET | /metrics | Prometheus scrape (unauth; see Observability) |
# All binaries
go build -o bin/dirq-server ./cmd/dirq-server
go build -o bin/dirq-agent ./cmd/dirq-agent
go build -o bin/dirq ./cmd/dirq
# Windows agent
GOOS=windows GOARCH=amd64 go build -o bin/dirq-agent.exe ./cmd/dirq-agent
# Tests
go test ./...
# Container images
podman build --target server -t dirq-server .
podman build --target agent -t dirq-agent .

cmd/
  dirq-server/               Server entrypoint
  dirq-agent/                Agent entrypoint (Windows Service support)
  dirq/                      CLI entrypoint
proto/dirq/v1/               Protobuf definitions
internal/
  server/                    gRPC, REST API, query dispatch, exec routing
  agent/                     Registration, relay mesh, query execution, exec
  query/                     DirQ DSL parser and evaluator
  modules/                   System data collectors (7 modules)
  db/                        SQLite + PostgreSQL backends and data access
  tlsutil/                   TLS configuration, cert generation
  signutil/                  Message signing (Ed25519)
collection/atgreen/dirq/     Ansible collection for AAP
  plugins/connection/        connection: atgreen.dirq.dirq
  plugins/inventory/         inventory: atgreen.dirq.dirq
ansible/                     Standalone plugins for CLI Ansible
Containerfile                Multi-stage build
podman-compose.yml           Dev environment
execution-environment.yml    EE definition for ansible-builder
MIT License. Copyright (c) 2026 Anthony Green. See LICENSE for details.