Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[0.5.0] - 2026-03-23

Changed

Fix Grafana billing panels: query both billing and ccloud_billing tables

CCloud billing data lives in ccloud_billing (migrated by ddebea2fe0a8), not the core billing table. The three billing stat panels were querying an empty table. Use UNION ALL to cover both tables so the dashboard works for any ecosystem. (d768a78)

Fix Grafana dashboards: broken pie charts, missing panels, wrong column names

Overview dashboard (full rewrite):
Fix pie charts showing 100% single-color by adding reduceOptions.values
Add unit:currencyUSD, displayLabels:name, legend placement, sort:desc
Reorganize to 4 pie charts: Environment, Resource, Product Category, Product Type
Replace identity pie chart with gauge panel (arc viz, top 30 principals)
Add collapsible billing detail rows (per-environment, per-resource, per-product-type)
Add Object Roster row with resource/identity count stats
Add "Cost breakdown summary" row separator

Details dashboard (column fixes):
billing panel: b.total -> b.total_cost (correct column name)
resources panel: r.updated_at -> r.last_seen_at (r.updated_at doesn't exist)
identities panel: i.updated_at -> i.last_seen_at (i.updated_at doesn't exist) (4337fa5)
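
As a minimal sketch of the UNION ALL approach described for the billing panels, the snippet below totals cost across both tables in an in-memory SQLite database. Only the table names billing and ccloud_billing and the total_cost column come from the entries above; the ecosystem column, the sample rows, and the helper names are assumptions for illustration, and the real panel queries may differ.

```python
import sqlite3

# Hypothetical, simplified schema: the real tables have more columns.
SCHEMA = """
CREATE TABLE billing (ecosystem TEXT, total_cost REAL);
CREATE TABLE ccloud_billing (ecosystem TEXT, total_cost REAL);
"""

# UNION ALL keeps every row from both tables (no de-duplication), so the
# total is correct whichever table a given ecosystem writes to.
TOTAL_COST_SQL = """
SELECT COALESCE(SUM(total_cost), 0) FROM (
    SELECT total_cost FROM billing
    UNION ALL
    SELECT total_cost FROM ccloud_billing
)
"""


def total_billed_cost(conn: sqlite3.Connection) -> float:
    """Sum total_cost over both billing tables."""
    return conn.execute(TOTAL_COST_SQL).fetchone()[0]


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with illustrative rows in both tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO ccloud_billing VALUES (?, ?)",
        [("confluent_cloud", 12.5), ("confluent_cloud", 7.5)],
    )
    conn.executemany(
        "INSERT INTO billing VALUES (?, ?)",
        [("self_managed_kafka", 5.0)],
    )
    return conn


if __name__ == "__main__":
    print(total_billed_cost(build_demo_db()))
```

Querying only billing in this demo would miss the two ccloud_billing rows, which mirrors the empty-panel symptom described above.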

Fixed

Fix: task-147 — Fix plugin loader to actually use plugins_path for external plugin discovery

discover_plugins() hardcoded f"plugins.{entry.name}" for imports, so external plugin directories (configured via plugins_path) were scanned correctly but failed on import. Extract an _import_plugin_module() helper that routes built-in plugins through importlib.import_module (when the parent is on sys.path) and external plugins through spec_from_file_location (file-based import, no sys.path mutation). A missing __init__.py now raises ImportError with an actionable message. (c5fd57d)

Fix: task-148 — Decouple emitters from pipeline loop, make them independent DB readers

Remove EmitPhase, _EmitterEntry, EmitResult, _load_emitters, _aggregate_rows, _Bucket, and _GRANULARITY_ORDER from orchestrator. Pipeline loop is now gather → calculate → commit only.

Add EmitterRunner as independent post-pipeline component that reads chargebacks from DB, drives configured emitters via PerDateDriver/LifecycleDriver, and persists per-tenant/emitter/date emission state (EmissionRecord table).

Add --emit-once CLI flag for standalone re-emission without pipeline run. Add EmitterSpec.name and lookback_days config fields. Add Alembic migration 010. Wire EmitterRunner in WorkflowRunner._run_tenant as post-pipeline hook. (841963c)

Fix: task-146 — Remove ecosystem_name from generic_metrics_only plugin, hardcode ecosystem like CCloud/SMK

The generic_metrics_only plugin exposed a configurable ecosystem_name field used as the data partition key in billing lines. This broke the orchestrator contract: ecosystem should be the hardcoded plugin selector, not user-configured. The mismatch caused find_by_date() to return zero rows — silent data loss.

Removed ecosystem_name from GenericMetricsOnlyConfig and hardcoded "generic_metrics_only" in all 5 emission sites (plugin property, build_shared_context, CoreBillingLineItem, handler identity construction, log message). Added ECOSYSTEM module constant in cost_input.py matching peer plugin pattern. Updated docs and changelog. 95% test coverage with full data flow integration test. (1b32756)

Fix: task-145 — Reject pipeline trigger in API-only mode, prevent double PipelineRun records

In API-only mode (workflow_runner=None), trigger_pipeline now returns 400 immediately without creating a PipelineRun record. In 'both' mode, the endpoint no longer creates its own PipelineRun; the lifecycle is fully owned by WorkflowRunner._run_tenant() via PipelineRunTracker, eliminating duplicate records. _run_pipeline is simplified to a thin async wrapper with logging only. (e819e59)

Fix: task-144 — Convert StorageConfig.connection_string to SecretStr and mask secrets in --show-config

Prevent two secret leak paths: (1) connection_string with embedded DB passwords now uses Pydantic SecretStr, masked as **** in all serialization; (2) --show-config excludes plugin_settings to prevent raw API key dumps. Engine log output stripped of credentials via urlparse masking. (b70a9e5)

Fix: task-142 — Fix env_id in API response schema and allocation issue reporting

Add env_id field to ChargebackDimensionResponse and AllocationIssueResponse API schemas, pass env_id through _build_dimension_response and list_allocation_issues route, and include env_id in find_allocation_issues GROUP BY to prevent cross-environment aggregation conflation. (9c4f11e)

Fix: task-141 — Fix env_id propagation gaps in 3 code paths

Three env_id propagation gaps discovered post task-140:

_allocate_to_unallocated() now accepts metadata param and passes dimension_metadata so UNALLOCATED rows carry env_id
chargeback_to_domain() reconstitutes env_id from dimension table into ChargebackRow.metadata on read-back
ChargebackDimensionInfo gains env_id field, populated at both get_dimension() and get_dimensions_batch() construction sites (2ec86b2)
Fix: task-140 — Add env_id to chargeback dimensions via plugin-extensible chargeback repository

env_id from CCloud billing API was dropped during chargeback calculation, causing environment_id aggregation to fail for 94% of resources (broken resource table join). Now stored directly on chargeback_dimensions via plugin-extensible ChargebackRepository pattern.

StorageModule protocol gains create_chargeback_repository()
UnitOfWork delegates chargebacks repo to plugin StorageModule
AllocationContext.dimension_metadata propagates env_id from billing line
CCloudChargebackRepository writes env_id to dimension table
aggregate() uses native env_id column (resource join removed)
Migration 009 adds env_id column and backfills from ccloud_billing (006e8b5)
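
For the task-147 entry above, the file-based import path can be sketched with the standard library as follows. This is an illustration under stated assumptions, not the project's _import_plugin_module(): the helper name import_external_plugin, the external_plugin_ module-name prefix, and the demo plugin contents are all hypothetical.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path
from types import ModuleType


def import_external_plugin(plugin_dir: Path) -> ModuleType:
    """Import a plugin package from an arbitrary directory without touching sys.path."""
    init_file = plugin_dir / "__init__.py"
    if not init_file.is_file():
        raise ImportError(
            f"Plugin directory {plugin_dir} has no __init__.py; "
            "add one so it can be imported as a package."
        )
    # Hypothetical naming scheme, chosen to avoid clashing with built-in plugins.
    module_name = f"external_plugin_{plugin_dir.name}"
    spec = importlib.util.spec_from_file_location(
        module_name,
        init_file,
        submodule_search_locations=[str(plugin_dir)],
    )
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot build an import spec for {init_file}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing so relative imports inside the package resolve.
    sys.modules[module_name] = module
    try:
        spec.loader.exec_module(module)
    except Exception:
        sys.modules.pop(module_name, None)
        raise
    return module


def demo() -> str:
    """Write a throwaway plugin to a temp directory and import it by file path."""
    with tempfile.TemporaryDirectory() as tmp:
        plugin_dir = Path(tmp) / "my_plugin"
        plugin_dir.mkdir()
        (plugin_dir / "__init__.py").write_text("ECOSYSTEM = 'my_plugin'\n")
        return import_external_plugin(plugin_dir).ECOSYSTEM


if __name__ == "__main__":
    print(demo())
```

Built-in plugins whose parent package is already importable can keep using importlib.import_module; the file-based route is only needed for directories outside sys.path.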

[0.4.2] - 2026-03-21

Changed

Build multi-arch Docker images (amd64 + arm64) via QEMU (22dda6f)
Remove local build sections from example Docker Compose files

Images are published to GHCR — no need for local build context. Updated stale comments accordingly. (a9d4651)

Documentation

Docs: Make Docker Compose the primary quickstart path

Rewrite quickstart to lead with Docker Compose instead of Python/uv. The full walkthrough now goes: create CCloud service account → get API key → docker compose up. Python/uv is documented as an alternative at the bottom.

Remove all GitHub URL links from docs — use relative links to stay within the documentation site
Inline Docker Compose instructions in deployment.md instead of linking out
Update root README quick start to show docker compose first
Update prerequisites to list Docker as primary runtime
Update getting-started index description (356a867)

Fixed

Fix: Add latest tag to chitragupt-ui Docker image on release (2af79bd)
Fix: Update changelog test to allow git-cliff in docs workflow

The docs.yml workflow now legitimately uses git-cliff to generate CHANGELOG.md before building docs. The test should only assert that GitHub Release creation stays out of docs.yml. (592ff12)

Fix: Use GHCR image names in example Docker Compose files

Point all example compose files to ghcr.io/waliaabhishek/chitragupt and ghcr.io/waliaabhishek/chitragupt-ui so users pull the published images by default. (13ca2ee)

Fix: Generate changelog for both dev and versioned doc deploys

Tag deploys get full changelog (all releases). Main deploys get only unreleased changes since last tag. (9dadb03)

Fix: Generate CHANGELOG.md in CI so docs changelog page has release entries

docs.yml now runs git-cliff before versioned deploys so the changelog page includes all releases. release.yml pushes the generated CHANGELOG.md back to main so the repo stays in sync. (25a8fb3)

Fix: Include all config parameters in example configs

Example config.yaml files were missing most optional parameters — only showing the bare minimum. Now every parameter from the config schema is documented (commented out for optional ones) with explanations of what it does, valid ranges, and defaults. Covers: logging (format, per_module_levels), features (max_parallel_tenants), API (request_timeout_seconds), tenant tuning (retention_days, allocation_retry_limit, zero_gather_deletion_threshold, gather_failure_threshold, tenant_execution_timeout_seconds, metrics_prefetch_workers), and plugin settings (billing_api, Prometheus auth options, flink multi-region, allocator_params, allocator_overrides, identity_resolution_overrides, metrics_step_seconds, chargeback_granularity, min_refresh_gap_seconds, granularity_durations, emitters). (b7ecb90)

[0.4.1] - 2026-03-20

Changed

Cleanup: Remove stale configs/ and deployables/, consolidate into examples/
Delete configs/examples/ (3 stale config files from chunk 1.3, unreferenced)
Delete deployables/README.md and QUICKSTART.md (redirect stubs, docs already point to examples/)
Move collector.sh from deployables/assets/ to examples/shared/scripts/
Update all references to collector.sh path in docs, tests, and examples
Update .dockerignore to exclude examples/ instead of deployables/ and configs/
deployables/ directory is now fully removed from the repository (f9136cf)

Fixed

Fix: Update docs to reference examples/ instead of deployables/QUICKSTART.md

Three documentation files still pointed to deployables/QUICKSTART.md after the TASK-139 restructuring. Updated to link directly to examples/ directory. (0240f48)

Fix: TASK-139 — restructure deployables into self-contained example directories

Replace the monolithic deployables/ layout with three self-contained examples under examples/, each runnable with a single docker compose up:

ccloud-grafana: CCloud worker mode + Grafana (no API, no frontend)
ccloud-full: CCloud full stack (API + worker + Grafana + UI)
self-managed-full: Self-managed Kafka full stack

Shared Grafana provisioning assets moved to examples/shared/. Stale example configs in deployables/config/examples/ removed (18 files). Makefile updated with per-example targets and legacy aliases pointing to ccloud-full. (5163bb1)

[0.4.0] - 2026-03-20

Fixed

Fix: TASK-138 — fail hard on missing CRN organization segment in Flink gathering

Replace silent tenant_id fallback with explicit ValueError when parse_ccloud_crn() yields no organization key, preventing wrong org_id from reaching the Flink Statements API. (d78cf11)

Fix: Clarify tenant_id is an internal partition key, not a CCloud org ID

tenant_id was misleadingly documented as the Confluent Cloud Organization ID across configs, examples, .env files, and docs. It is actually an arbitrary string used solely as a DB partition key — CCloud APIs are scoped by credentials, not by org ID. Renamed env vars from CCLOUD_ORG_ID to CCLOUD_TENANT_ID and updated all documentation and examples to prevent user confusion. (3827afd)

[0.3.3] - 2026-03-20

Documentation

Docs: Add configuration guide, cost model explainer, and fix doc build

New pages:
Configuration Guide: narrative walkthrough of building a config with decision points, tradeoffs, and worked examples for all 3 ecosystems
How Costs Work: complete cost lifecycle from billing data through allocation with exact math, rounding guarantees, and the fallback chain

Enhanced existing references:
Added "why" columns to product type and allocation strategy tables
Added decision guidance callouts to all config reference pages
Expanded identity discovery and fallback behavior documentation
Added "when to change" guidance to tuning parameters

Fixes:
Converted 4 broken relative links (deployables/, CHANGELOG.md) to absolute GitHub URLs so mkdocs strict build passes
Added api-reference.md and operations/upgrading.md to nav (were orphaned)
Replaced deprecated pymdownx.slugs.uslugify with configurable slugify (32a6098)

Fixed

Fix: Relax changelog test to allow CI-only releases with no section headers (139b9b0)
Fix: TASK-132 — --validate flag now validates plugin-specific configs

--validate previously only checked top-level AppSettings, missing plugin-specific validators (e.g. CKU ratio sum, required API credentials). Now discovers plugins via existing registry, calls validate_plugin_settings() on each tenant's plugin to exercise Pydantic validators without creating live connections. Extracts _build_registry() helper shared by _create_runner() and _validate_plugin_configs(). (52bc6c3)

Fix: TASK-135 — Distinguish metrics prefetch failure from empty data in allocation pipeline

Metrics prefetch failures (Prometheus unreachable/timeout/error) now produce a distinct METRICS_FETCH_FAILED allocation detail instead of being silently conflated with empty data (NO_USAGE_FOR_ACTIVE_IDENTITIES). Chargeback rows produced during Prometheus outages are identifiable and filterable in the DB.

Changes: _prefetch_metrics returns failed_keys set alongside prefetched data; AllocationContext gains metrics_fetch_failed bool; UsageRatioModel, _kafka_usage_allocation, and allocate_by_usage_ratio guard on the flag; new AllocationDetail.METRICS_FETCH_FAILED enum value persisted to DB. (74f39b5)
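
The distinction drawn in TASK-135 can be illustrated with a small sketch. The two reason codes and the failed_keys idea come from the entry above; the function names, signatures, and the fetcher callables are assumptions made for this example and do not reproduce the project's _prefetch_metrics.

```python
from enum import Enum
from typing import Callable, Dict, Optional, Set, Tuple

Usage = Dict[str, float]


class AllocationDetail(str, Enum):
    # Only the two values named in the entry above; the real enum has more.
    METRICS_FETCH_FAILED = "METRICS_FETCH_FAILED"
    NO_USAGE_FOR_ACTIVE_IDENTITIES = "NO_USAGE_FOR_ACTIVE_IDENTITIES"


def prefetch_metrics(
    fetchers: Dict[str, Callable[[], Usage]],
) -> Tuple[Dict[str, Usage], Set[str]]:
    """Return (prefetched, failed_keys); a failed fetch is recorded, not swallowed."""
    prefetched: Dict[str, Usage] = {}
    failed_keys: Set[str] = set()
    for key, fetch in fetchers.items():
        try:
            prefetched[key] = fetch()
        except Exception:  # e.g. metrics backend unreachable or timed out
            failed_keys.add(key)
    return prefetched, failed_keys


def fallback_detail(
    key: str, prefetched: Dict[str, Usage], failed_keys: Set[str]
) -> Optional[AllocationDetail]:
    """Pick the reason code when usage-based allocation cannot proceed."""
    if key in failed_keys:
        return AllocationDetail.METRICS_FETCH_FAILED
    if not prefetched.get(key):
        return AllocationDetail.NO_USAGE_FOR_ACTIVE_IDENTITIES
    return None  # usage data present: normal ratio allocation applies
```

Checking failed_keys before checking for empty data is what keeps an outage from being recorded as "no usage".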

[0.3.0] - 2026-03-19

Fixed

Fix: Add CLI entry point and git-tag-based dynamic versioning
Add [project.scripts] entry so uv run chitragupt works without PYTHONPATH hacks
Replace hardcoded version with hatch-vcs (derives from git tags)
Fix hatch build config: packages=["src"] → explicit package list so editable install puts src/ on sys.path
Dockerfile: accept APP_VERSION build arg for release builds, fallback 0.0.0.dev0
Release workflow: pass git tag as APP_VERSION to Docker build (15d9241)

[0.2.2] - 2026-03-19

Documentation

Docs: Add status badges to README and Codecov integration to CI

Add CI, coverage, Python, Ruff, mypy, and uv badges to README. Add Codecov upload step to CI workflow for coverage reporting. (f9aabac)

[0.2.1] - 2026-03-18

Documentation

Docs: Comprehensive documentation audit and fixes across all doc areas
Create API reference (docs/api-reference.md) covering all 25 REST endpoints
Fix architecture docs: correct storage table names, add phase objects, deletion detection step, pipeline_state tracking, StorageModule protocol
Fix config docs: correct Flink allocation strategy, add missing connector product types, self-managed product types, SASL credentials, undocumented fields, cross-field validation constraints, tuning parameters
Fix ops docs: remove non-functional LOG_LEVEL env var, add Grafana multi-tenant warning, fix health/pipeline response schemas, add readiness endpoint, fix Dockerfile example
Fix getting-started: add .env auto-discovery, env var limitation warning, clarify emitter availability (a66fe1d)
Docs: Merge CCloud prerequisites into quickstart for single-page setup guide

Fold service account creation, permissions, and API key setup into the quickstart so users don't need to hop between pages. Add architecture overview section to README. (01f300f)

Fixed

Fix: TASK-134 — Implement Prometheus/OpenMetrics emitter for chargeback and resource presence metrics

Add PrometheusEmitter that exposes chargeback, billing, resource presence, and identity presence as timestamped Prometheus gauge metrics. Includes storage injection via needs_storage_backend factory attribute, collector script for promtool TSDB backfill, and example config. (b1c7ff8)

[0.2.0] - 2026-03-18

Changed

Maintenance: TASK-118 — Upgrade React to v19 with dependency cascade

Major version upgrades: react/react-dom 18→19, @refinedev/core 4→5, @refinedev/antd 5→6, react-router-dom 6 → react-router 7, @types/react 18→19. Removes forwardRef from ChargebackGrid (React 19 deprecation), migrates JSX.Element → React.JSX.Element (global JSX namespace removed), updates Refine pagination API (current → currentPage), removes react-router v7 future flags from test MemoryRouter usage.

All 258 tests pass, 92.77% coverage, typecheck and build clean. (69c5969)

Maintenance: TASK-117 — Upgrade jsdom to v28

Upgrade jsdom 25→28 (3 major versions). All 258 tests pass, coverage thresholds met, build and typecheck clean.

eslint 10 upgrade deferred — eslint-plugin-react-hooks has no stable release declaring eslint 10 peer dep support. (4206b80)

Maintenance: TASK-116 — Upgrade vite/vitest/plugin-react to latest majors

Upgrade vite 5→8 (Rolldown), vitest 2→4, @vitejs/plugin-react 4→6, @vitest/coverage-v8 2→4. Migrate custom vitest environment to v4 API (vitest/runtime imports, viteEnvironment). Fix pre-existing TS errors in dashboard and TenantContext test mocks. (5200739)

Maintenance: TASK-113 — Upgrade backend Python dependencies

fastapi 0.133.0→0.135.1, cachetools 7.0.3→7.0.5, ruff 0.15.2→0.15.6, sqlalchemy 2.0.46→2.0.48, plus transitive dependency updates. All 2364 tests pass, ruff and mypy clean. (910d535)

Cleanup: TASK-112 — Remove duplicate LOGGER definitions and dead helper functions

Remove uppercase LOGGER duplicates from 3 files (crn.py, connector_identity.py, connections.py), replacing call sites with the standard lowercase logger convention. Delete unused allocate_to_owner and allocate_to_resource from helpers.py (superseded by DirectOwnerModel/TerminalModel). Update test fixtures accordingly. (0968c34)

Documentation

Docs: TASK-109 — Add PostgreSQL connection string examples to config and docs

Add commented-out PostgreSQL examples to ccloud-complete.yaml and self-managed-complete.yaml. Expand deployment.md storage section with driver requirements, connection string format, one-database-per-tenant rule, env var usage, and SQLite-vs-PostgreSQL comparison table. (15c9c88)

Docs: Add CCloud RBAC permissions and service account setup to prerequisites (cb5c371)

Fixed

Fix: TASK-133 — Patch double table creation in API test fixtures and use rolling timestamps

Suppress cleanup_orphaned_runs_for_all_tenants during TestClient lifespan to prevent second create_tables() call on shared temp DB. Replace hardcoded 2026-02 timestamps with rolling dates so tests don't age outside the default 30-day query window. (bea1a31)

Fix: TASK-130 — Replace hardcoded system identity exclusion with SENTINEL_IDENTITY_TYPES constant

Define SENTINEL_IDENTITY_TYPES in identity.py to canonically identify synthetic fallback identities (e.g. UNALLOCATED). Replace the hardcoded != "system" check in orchestrator._build_tenant_period_cache with not-in-constant, matching the OWNER_IDENTITY_TYPES pattern. (8f8c090)

Fix: TASK-125 — Validate CKU ratio sum at config parse time

Move kafka_cku_usage_ratio + kafka_cku_shared_ratio sum check from allocation time to CCloudPluginConfig.validate_allocator_params so bad config is caught at startup before expensive gather operations. (24dd9b2)

Fix: TASK-123 — Guard masked API key check against empty strings

all() on an empty iterable returns True (vacuous truth), causing empty API key IDs to be incorrectly classified as masked. Replace is not None with truthiness check so empty strings fall through to CREDENTIALS_UNKNOWN. (e115328)

Fix: TASK-121 — Stamp last_seen_at on all CCloud gathered entities

All 12 gather functions in confluent_cloud/gathering.py omitted last_seen_at, leaving it permanently None. Add datetime.now(UTC) to every CoreResource and CoreIdentity constructor, matching the pattern used by self_managed_kafka and generic_metrics_only plugins. (e8e46c8)

Fix: TASK-120 — Add safety guard to remainder distribution loops

Extract _distribute_remainder() helper to DRY up identical loops in split_amount_evenly() and allocate_by_usage_ratio(). Replace unbounded while loop with for/else bounded by len(amounts)*2 iterations, raising RuntimeError on non-convergence. (eff6621)

Fix: TASK-115 — Upgrade ag-grid to v35

Upgrade ag-grid-community and ag-grid-react from v33.3.2 to v35.1.0. Migrate deprecated string-based rowSelection to object-based API, remove deprecated checkboxSelection/headerCheckboxSelection from column defs, remove suppressRowClickSelection. Clean up pre-existing quality issues (empty catch, dead code, invalid lint comment, unmemoized handlers). All 258 frontend tests pass. (9afb3d9)

Fix: Sync package-lock.json with package.json for CI docker build

npm ci was failing because openapi-typescript and related dependencies were in package.json but missing from the lock file. (a0121e1)
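
The bounded for/else loop described under TASK-120 in this section can be sketched as below. This is an illustration only: the function names, the one-cent step, and the round-down split are assumptions and are not taken from the project's _distribute_remainder() or split_amount_evenly().

```python
from decimal import ROUND_DOWN, Decimal

CENT = Decimal("0.01")  # assumed rounding step


def distribute_remainder(amounts, total, step=CENT):
    """Nudge amounts by `step` until they sum to `total`, within a hard bound."""
    amounts = list(amounts)
    remainder = total - sum(amounts)
    max_iterations = len(amounts) * 2
    for i in range(max_iterations):
        if remainder == 0:
            break
        # Move by one step, or by whatever smaller sliver is left.
        delta = min(step, abs(remainder)).copy_sign(remainder)
        amounts[i % len(amounts)] += delta
        remainder -= delta
    else:
        # The loop used up its iterations without reaching `break`.
        if remainder != 0:
            raise RuntimeError(f"remainder {remainder} did not converge")
    return amounts


def split_evenly(total, parts):
    """Split `total` into `parts` shares that sum exactly to `total`."""
    share = (total / parts).quantize(CENT, rounding=ROUND_DOWN)
    return distribute_remainder([share] * parts, total)


if __name__ == "__main__":
    print(split_evenly(Decimal("10.00"), 3))
    # [Decimal('3.34'), Decimal('3.33'), Decimal('3.33')]
```

The else branch of a for loop runs only when the loop finishes without break, so a remainder that never reaches zero surfaces as an error instead of an endless loop.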

Security

Security: TASK-111 — Add .env exclusion, Dependabot, and dependency auditing
Add .env* glob to .gitignore to prevent accidental secret commits
Add .github/dependabot.yml for weekly pip ecosystem CVE scanning
Add uv audit --frozen step to CI workflow (non-blocking initially) (8ff2845)

[0.1.2] - 2026-03-15

Fixed

Fix: Use --latest instead of --unreleased in git-cliff integration test

When CI runs on a tagged commit, --unreleased produces no commits since the tag points at HEAD. Using --latest validates against the most recent tagged release which always has content. (c50c02b)

[0.1.1] - 2026-03-15

Documentation

Docs: TASK-108 — Add orchestration override examples to config examples

Add commented-out examples of allocator_overrides, identity_resolution_overrides, and allocator_params to self-managed-complete.yaml, generic-postgres.yaml, and generic-redis.yaml so users can discover these customization hooks without reading source code. (af63bee)

Fixed

Fix: Set mike default version in docs workflow

Ensures root GitHub Pages URL redirects to latest version instead of returning 404. (ab0af71)

[0.1.0] - 2026-03-15

Added

Feat(ccloud): complete plugin with OrgWide and Default handlers (chunk 2.6)
Add OrgWideCostHandler for AUDIT_LOG_READ, SUPPORT (tenant_period split)
Add DefaultHandler for TABLEFLOW_*, CLUSTER_LINKING_* (UNALLOCATED)
Add org_wide_allocator (SHARED), default_allocator, cluster_linking_allocator (USAGE)
Wire all 7 handlers into plugin with documented ordering
Rename _make_row → make_row (public API for plugins)
Add full pipeline e2e integration tests

Phase 2 complete. CCloud plugin fully operational. (7e4b051)

Feat(ccloud): add FlinkHandler with metrics-driven identity resolution (chunk 2.5)

FlinkHandler: service handler for FLINK_NUM_CFU/FLINK_NUM_CFUS product types
flink_identity.py: two-step identity resolution (metrics → statement owner lookup)
flink_allocators.py: CFU usage-ratio allocation via stmt_owner_cfu context
create_flink_sentinel() in _identity_helpers.py for DRY compliance
Handler wired as 5th in plugin chain
37 new tests, 97% coverage (ee0c8c6)
Feat(ccloud): add ConnectorHandler and KsqldbHandler (chunk 2.4)

Implements ServiceHandler protocol for Connector and ksqlDB services:

ConnectorHandler:
service_type: "connector"
Handles: CONNECT_CAPACITY, CONNECT_NUM_TASKS, CONNECT_THROUGHPUT, CUSTOM_CONNECT_PLUGIN
Identity resolution via connector auth mode (SERVICE_ACCOUNT, KAFKA_API_KEY, UNKNOWN)
Allocators: capacity (SHARED), tasks/throughput (USAGE)

KsqldbHandler:
service_type: "ksqldb"
Handles: KSQL_NUM_CSU, KSQL_NUM_CSUS
Identity resolution via direct owner_id
Allocator: ksqldb_csu_allocator (USAGE)

Plugin wiring:
Handler order: kafka, schema_registry, connector, ksqldb
All handlers tested and wired into get_service_handlers()

32 new tests, 680 total passing, 99% coverage on CCloud plugin. (ffad716)

Feat(ccloud): add ksqlDB identity resolution helper (ebeeb91)
Feat(ccloud): add connector identity resolution helper (chunk 2.4)

Implements resolve_connector_identity() for resolving connector owners based on authentication mode (SERVICE_ACCOUNT, KAFKA_API_KEY, UNKNOWN). Uses sentinel identities for unknown/masked credential cases. (6fcaa95)

Feat(ccloud): implement ksqlDB CSU allocator (chunk 2.4)

Add ksqldb_csu_allocator for even split across active identities with USAGE cost type. Follows same pattern as connector_allocators.

Fallback chain: merged_active -> tenant_period -> UNALLOCATED (f2016d4)

Feat(ccloud): add Connect allocators for chunk 2.4

Implement three connector cost allocators:
connect_capacity_allocator: even split, SHARED cost type
connect_tasks_allocator: even split, USAGE cost type
connect_throughput_allocator: delegates to tasks allocator

All use merged_active -> tenant_period -> UNALLOCATED fallback chain. 100% test coverage with 9 unit tests. (340015f)

Feat(ccloud): implement Kafka + Schema Registry handlers (chunk 2.3)

KafkaHandler: 7 product types, Prometheus bytes_in/bytes_out queries
SchemaRegistryHandler: 3 product types, no metrics needed
Identity resolution with temporal filtering (billing window)
Allocators: hybrid 70/30, pure usage-based, even split
Plugin wiring: get_service_handlers(), get_metrics_source()
74 new tests, 97% coverage on handlers/allocators (e8976d5)
Feat(ccloud): implement resource + identity gathering (chunk 2.2)
Add CRN parser for Confluent Resource Names
Add CCloudConnection.get_raw() for non-envelope APIs
Add proactive throttling with request_interval_seconds
Implement CCloudBillingCostInput with date windowing
Implement all resource gatherers: environments, kafka clusters, connectors, schema registries, ksqlDB, flink pools/statements
Implement identity gatherers: service accounts, users, API keys, identity providers/pools
Add connection caching for Flink regional connections
Fix TD-020: ensure plugin.initialize() before get_metrics_source()
Fix TD-017: wire get_cost_input() to return CCloudBillingCostInput

520 tests passing, 97% coverage (9848a72)

Feat: GAP-002+003+005+010+015+017 workflow runner + plugin metrics

GAP-015+017: EcosystemPlugin protocol now owns get_metrics_source(); CCloud plugin returns None; workflow_runner uses plugin.get_metrics_source() instead of standalone _create_metrics_source()
GAP-002: wait()-based global timeout replaces as_completed per-future timeout
GAP-003: storage.create_tables() called before orchestrator runs
GAP-005: enable_periodic_refresh=False runs single cycle then returns
GAP-010: max_parallel_tenants (default=4) bounds ThreadPoolExecutor size
FeaturesConfig gains max_parallel_tenants field with validation (f14a1f9)
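
The GAP-002 and GAP-010 items above (one wait()-based deadline for the whole batch, and a pool bounded by max_parallel_tenants) can be sketched with concurrent.futures. The function name, its signature, and the return shape are assumptions for illustration, not the workflow runner's actual code.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def run_all_tenants(tenants, run_one, max_parallel_tenants=4, timeout_seconds=60.0):
    """Run `run_one(tenant)` for every tenant under a single shared deadline."""
    pool = ThreadPoolExecutor(max_workers=max_parallel_tenants)
    futures = {pool.submit(run_one, tenant): tenant for tenant in tenants}
    # One deadline for the whole batch, rather than a timeout per future.
    done, not_done = wait(futures, timeout=timeout_seconds)
    for future in not_done:
        future.cancel()  # only cancels work that has not started yet
    pool.shutdown(wait=False)  # do not block on tasks that are still running
    # Tenants whose run raised are left out of `results` in this sketch.
    results = {futures[f]: f.result() for f in done if f.exception() is None}
    timed_out = sorted(futures[f] for f in not_done)
    return results, timed_out


if __name__ == "__main__":
    def fake_pipeline(tenant):
        time.sleep(0.6 if tenant == "slow" else 0.0)
        return tenant.upper()

    print(run_all_tenants(["fast", "slow"], fake_pipeline, 2, 0.2))
```

A thread that is already running cannot be interrupted from outside, so a real runner still needs its own cooperative stop signal for tenants that overrun the deadline.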
Feat(ccloud): add ConfluentCloudPlugin stub implementing EcosystemPlugin (d92605e)
Feat(ccloud): add typed view models for Flink and Connectors (633d9f4)
Feat(ccloud): add CCloudPluginConfig with validation (d3cef84)
Feat(ccloud): add CCloudConnection.post() method (76dc209)
Feat(ccloud): add retry logic with rate limit handling (65e420a)
Feat(ccloud): implement CCloudConnection.get() with pagination (d556b96)
Feat(ccloud): add CCloudConnection dataclass structure (1516302)
Feat(ccloud): add CCloudApiError and CCloudConnectionError exceptions (9157bd4)

Changed

Add details about Log Format changing capability (0531fbd)
Other stuff (b674808)
Fix slow shutdown: drain() now signals shutdown event before waiting

In 'both' mode, uvicorn's lifespan teardown called drain() which passively waited for running tenants to finish. But shutdown_event was only set after run_api() returned — creating a circular dependency where drain waited for pipelines that didn't know they should stop. (93001be)

Fix pipeline run persistence and test expectations

_run_pipeline now always persists workflow_runner results to the PipelineRun DB record (status, dates_gathered, dates_calculated, rows_written, error_message) and sets status=skipped when already_running. trigger_pipeline always creates a PipelineRun regardless of whether a workflow_runner is present, so every trigger has a trackable record. Corrected test expectations: no-runner case is status=failed (not completed), and the capture_run_api mock now accepts the mode kwarg. (4a8e691)

Add application lifecycle layer with pipeline status tracking and readiness endpoint

The backend was built as a headless batch processor with API and frontend bolted on afterward. Neither side understood the application's lifecycle state — on first startup the UI showed empty charts identical to a broken system, and during pipeline runs the UI allowed writes that could conflict.

This adds:
Explicit pipeline stage tracking (stage + current_date fields on PipelineRun)
PipelineRunTracker class managing run lifecycle (create, progress, finalize, fail)
Orphaned run cleanup on startup (handles process crash mid-pipeline)
GET /api/v1/readiness endpoint with per-tenant status
Frontend readiness-first initialization with polling
Persistent pipeline status banner showing live stage and date
Read-only mode during pipeline activity (disables write operations)
Alembic migration 007 for new columns (bc22c01)

Style: apply ruff formatting and remove unused import (e9ae42a)
Set explicit image names in docker-compose

Removes 'deployables-' prefix from built images (58aec77)

Add docker-push target for multi-arch builds

Builds linux/amd64 + linux/arm64 images
Pushes to registry (default docker.io, override with REGISTRY=)
Local docker-build unchanged (single arch, no push) (93d44eb)
Fix frontend TypeScript errors
Remove invalid style prop from AgGridReact (goes on wrapper div)
Fix htmlType type in test mock to match button type union
Remove unused setSearchParams destructure
Add missing resource_id to TagWithDimensionResponse mocks (b9ffd03)
Fix Dockerfile: copy uv from official image, fix PYTHONPATH
Use ghcr.io/astral-sh/uv image instead of install script (slim has no curl)
Fix PYTHONPATH to /app/src where modules actually live
Fix entrypoint to 'python -m main' (not src.main) (d5223a8)
Add Docker make commands for local development
docker-build: Force rebuild all images (--no-cache)
docker-up: Start backend + grafana
docker-dev: Start all services including frontend
docker-dev-ui: Backend + frontend only (skip grafana)
docker-down: Stop all services
docker-logs: Tail logs (203a73b)
Perf: Cut test suite runtime from 235s to 18s

SMK plugin tests called plugin.initialize() with identity_source defaulting to "prometheus", triggering _validate_principal_label() which hit a real Prometheus endpoint with 4-retry exponential backoff (~15s per test, ~140s total). Fixed by setting identity_source=static in base_settings fixtures — tests that specifically exercise the prometheus validation path already override this explicitly.

CCloud connection retry tests slept through real backoff due to a 1.0s floor guard in _get_rate_limit_wait plus additive random jitter in _calculate_backoff (~6-7s for max-retries test, ~1.2s for rate-limit header tests). Added autouse fixture in conftest patching time.sleep in the connections module. Replaced wall-clock elapsed assertions with mock call_count checks for the throttling tests. (05291b8)

Fix dev targets: add PYTHONPATH and use npx vite (9cc5d41)
Add dev targets for running backend and frontend together

make dev: API + worker + frontend
make dev-api: API + worker only
make dev-ui: API only + frontend (for UI development) (701cd22)
Add per-date progress logging during chargeback calculation

Logs start/end of each billing date with row count and elapsed time, matching the reference codebase behavior for tracking calculation progress. (c237b1a)

Fix default plugins path to point to src/plugins

_DEFAULT_PLUGINS_PATH was pointing to /plugins/ which doesn't exist. Plugins are in src/plugins/. Removes need for explicit plugins_path config in every YAML file. (20bf1fc)

Fix logging disabled by alembic and CCloud API page_size error

env.py: Add disable_existing_loggers=False to the fileConfig() call. Alembic's logging config was silently disabling all existing loggers, so tenant errors were not displayed after migrations ran.
connections.py: Reduce DEFAULT_PAGE_SIZE from 500 to 99. The CCloud API requires page_size < 100, so endpoints without an explicit page_size override were failing with a 400 error. (35a6dd9)
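
The logging half of this fix comes down to a documented default of logging.config.fileConfig(): disable_existing_loggers is True unless overridden, which disables loggers created before the call. A small standalone demonstration follows; the ini contents, the logger name, and the helper function are illustrative assumptions rather than the project's env.py.

```python
import io
import logging
import logging.config

# Minimal ini-style logging config standing in for the logging sections of
# an alembic.ini (contents are illustrative).
LOGGING_INI = """
[loggers]
keys=root

[handlers]
keys=console

[formatters]
keys=plain

[logger_root]
level=INFO
handlers=console

[handler_console]
class=StreamHandler
args=(sys.stderr,)
formatter=plain

[formatter_plain]
format=%(levelname)s %(name)s %(message)s
"""


def logger_disabled_after_config(disable_existing_loggers: bool) -> bool:
    """Create a logger first, apply fileConfig, and report whether it got disabled."""
    app_logger = logging.getLogger("demo.tenant_errors")  # hypothetical logger name
    logging.config.fileConfig(
        io.StringIO(LOGGING_INI),
        disable_existing_loggers=disable_existing_loggers,
    )
    return app_logger.disabled


if __name__ == "__main__":
    print(logger_disabled_after_config(True))   # True: existing logger silenced
    print(logger_disabled_after_config(False))  # False: existing logger kept
```

Loggers that are named in the config file, or are children of named loggers, are not disabled either way; the flag only affects loggers the file does not mention.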
Fix broken example configs to match actual plugin schemas

README quickstart referenced non-existent config.yaml. Example configs used wrong field names (flat api_key vs nested ccloud_api.key) and non-existent fields for self-managed plugin. (ace66f3)

Rename project to Chitragupt, optimize Dockerfile, fix lint issues

Rename package from chargeback-engine to chitragupt
Update all references in docs, configs, and code
Optimize Dockerfile: use uv standalone installer, remove uv from runtime
Fix 45 ruff lint issues in test files (unused imports, long lines, import order) (2c565ac)
Add Makefile with common dev commands (f5e6295)
Ignore MkDocs build output (site/) (aa1d1f2)
5.1: Tech Debt Phase 2

Resolves TD-019, TD-021, TD-031, TD-034, TD-035, TD-037.

TD-031/034/035: Add AllocationDetail StrEnum with standardized reason codes for allocation decisions. Update helpers and all allocators to use enum.
TD-019: Migrate from requests to httpx for thread-safe HTTP clients. Update CCloudConnection and PrometheusMetricsSource.
TD-037: Add SQL-level pagination to find_active_at/find_by_period. Return (list, total_count) tuples with filter params and LIMIT/OFFSET.
TD-021: Add TenantRuntime caching in WorkflowRunner with health checks, config change detection, and proper lifecycle management.

1123 tests, 98.17% coverage. (f92e1e7)
5.0: Tech Debt Cleanup

Resolves 12 tech debt items:
TD-008: Document step_seconds fallback
TD-010: HTTP connection pooling in Prometheus
TD-016: Data retention cleanup in WorkflowRunner
TD-018/TD-024: Session lifecycle cleanup (plugin.close())
TD-023: Orchestrator test invariants
TD-029: Handler gather_identities tests
TD-032: FlinkContextDict TypedDict
TD-033: Flink statement name collision fix
TD-036: Flink region skip logging
TD-038: Pipeline run state persistence (PipelineRunTable)
TD-039: Single-tenant pipeline trigger (run_tenant)
TD-040: OpenAPI TypeScript generation setup
TD-041: .nvmrc for Node 22 (493d5b6)
4.4: Tag Management + Export

Backend:
GET/PATCH/DELETE /tags endpoints with search, pagination
POST /tags/bulk for bulk tagging by dimension IDs
POST /tags/bulk-by-filter for bulk tagging by filter criteria
Migration 003: add display_name column + UNIQUE(dimension_id, tag_key)
Tag model: tag_key (immutable), tag_value (auto-UUID), display_name (mutable)
find_by_filters() now overlays custom tag display_names onto ChargebackRow.tags

Frontend:
TagManagementPage (/tags) with search, inline edit, delete w/ Popconfirm
BulkTagModal for bulk tagging (by IDs or by filter)
ExportButton for CSV export
SelectionToolbar shows when rows selected
ChargebackGrid with row selection checkboxes
TagEditor simplified to 2-field form (Key + Display Name)

Tests: 1077 backend (97.9% cov), 135 frontend (86.9% func cov). QA rounds: 3. (541eee3)
4.3: Cost Dashboards with polish (error/retry + chart toggle)

Backend:
Aggregation endpoint filter params (identity_id, product_type, resource_id, cost_type)
Repository aggregate() with optional filter WHERE clauses

Frontend:
4 ECharts components (CostTrendChart, CostByIdentityChart, CostByProductChart, CostByResourceChart)
useAggregation hook with refetch
ChartCard with error/onRetry props
ProductChartTypeToggle (Segmented pie/treemap)
Dashboard page with 2x2 grid, filter sync, time bucket selector
DRY optimization: trendData reused for identity chart (2b6671f)
4.2: Chargeback Explorer with AG Grid + tag editing

Backend:
Add dimension_id to ChargebackRow and ChargebackResponse
Add GET /chargebacks/{dimension_id} endpoint
Fix route order (aggregate before dynamic path)
Fix date.today() -> datetime.now(UTC).date()

Frontend:
AG Grid with infinite scroll (100k+ rows)
FilterPanel with URL-synced state
ChargebackDetailDrawer with tag editing
TagEditor component
59 tests, 95% coverage (6876280)
4.1: Frontend scaffold with Refine.dev + Ant Design

Custom tenant-scoped data provider for FastAPI backend
TenantContext with localStorage persistence and retry
Ant Design Layout with Sider/Header/Content
Disabled menu items when no tenant selected
5 placeholder pages with tenant checks
MSW test infrastructure (20 tests, 98% coverage)
Vite dev proxy to backend (ff7d794)
3.5: Docker deployment + CLI polish
Dockerfile: Python 3.14, uv 0.10.6, multi-stage, non-root user
docker-compose.yml: engine + grafana with healthcheck + depends_on
Docker-ready configs: config-ccloud.yaml, config-self-managed.yaml
.env.example with credential templates
Updated datasource.yml for directory mount
Comprehensive README.md deployment guide

Phase 3 complete. (e128108)
3.4: Self-Managed Kafka Plugin

Implements metrics-only chargeback paradigm where costs are constructed from YAML pricing model × Prometheus usage metrics rather than fetched from a billing API.

SelfManagedKafkaPlugin with dependency injection pattern
ConstructedCostInput generates BillingLineItems from infra costs
SelfManagedKafkaHandler handles all 4 product types
Per-product-type allocators (COMPUTE/STORAGE even, NETWORK usage-ratio)
Prometheus + Admin API resource/identity discovery
kafka-python as optional dependency
132 tests, 100% coverage on new module (7ed3030)
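A rough sketch of the metrics-only idea in this entry, under stated assumptions: the pricing keys, rates, team names, and the helpers construct_costs(), split_evenly(), and split_by_usage() are all hypothetical, and the plugin's real ConstructedCostInput and allocators may differ. It shows a cost built as unit price × usage, then split evenly (as for COMPUTE/STORAGE) or by usage ratio (as for NETWORK). Letting the last identity absorb the rounding remainder is an illustrative choice.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def construct_costs(pricing: dict[str, Decimal], usage: dict[str, Decimal]) -> dict[str, Decimal]:
    """Multiply each unit price by the matching usage quantity (0 if no usage)."""
    return {product: pricing[product] * usage.get(product, Decimal("0")) for product in pricing}


def split_evenly(amount: Decimal, identities: list[str]) -> dict[str, Decimal]:
    """Give every identity the same rounded share; the last one absorbs the remainder."""
    share = (amount / len(identities)).quantize(CENT, rounding=ROUND_HALF_UP)
    result = {identity: share for identity in identities}
    result[identities[-1]] = amount - share * (len(identities) - 1)
    return result


def split_by_usage(amount: Decimal, usage: dict[str, Decimal]) -> dict[str, Decimal]:
    """Split proportionally to usage; the last identity absorbs the remainder."""
    total = sum(usage.values(), Decimal("0"))
    names = list(usage)
    result: dict[str, Decimal] = {}
    allocated = Decimal("0")
    for name in names[:-1]:
        share = (amount * usage[name] / total).quantize(CENT, rounding=ROUND_HALF_UP)
        result[name] = share
        allocated += share
    result[names[-1]] = amount - allocated
    return result


# Hypothetical pricing model: rate per broker-hour and per GiB transferred.
pricing = {"COMPUTE": Decimal("2.40"), "NETWORK": Decimal("0.05")}
usage = {"COMPUTE": Decimal("24"), "NETWORK": Decimal("300")}
costs = construct_costs(pricing, usage)

compute_shares = split_evenly(costs["COMPUTE"], ["team-a", "team-b", "team-c"])
network_shares = split_by_usage(
    costs["NETWORK"], {"team-a": Decimal("100"), "team-b": Decimal("200")}
)
```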
3.3: Grafana dashboards + Docker Compose
Docker Compose with Grafana + SQLite datasource plugin
chargeback_overview.json: 10 panels (stats, pie, time series)
chargeback_details.json: 6 table panels with pagination
Template variables with cascade queries (:sqlstring for IN clauses)
README.md with setup instructions (713bc4f)
3.2: FastAPI write endpoints + aggregation
PATCH /chargebacks/{dimension_id} for tag management (replace/add/remove)
Tags CRUD routes (GET/POST/DELETE)
Pipeline trigger/status with WorkflowRunner.run_once() integration
Server-side aggregation (multi-dimension GROUP BY, time bucketing)
CSV export with filters and streaming
Repository extensions: get_dimension, aggregate, get_tag, find_tags_for_tenant
Schemas: Tag, Pipeline, Aggregation, Export models
91 new tests (895 total), 98% coverage (1c7d417)
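An illustrative sketch of server-side aggregation with daily time bucketing, using only the standard library's sqlite3 module. The table name facts, its columns, and aggregate_by_day() are assumptions made for this example; the real repository layer uses SQLModel and stores amounts differently.

```python
import sqlite3


def aggregate_by_day(rows: list[tuple[str, str, float]]) -> list[tuple[str, str, float]]:
    """Sum amounts per (day, product_type) inside SQLite rather than in Python."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE facts (ts TEXT, product_type TEXT, amount REAL)")
        conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            """
            SELECT date(ts) AS bucket, product_type, SUM(amount)
            FROM facts
            GROUP BY bucket, product_type
            ORDER BY bucket, product_type
            """
        )
        return cursor.fetchall()
    finally:
        conn.close()


daily = aggregate_by_day(
    [
        ("2026-03-01 01:00:00", "KAFKA", 1.5),
        ("2026-03-01 13:00:00", "KAFKA", 2.5),
        ("2026-03-01 09:00:00", "FLINK", 1.0),
        ("2026-03-02 00:30:00", "KAFKA", 2.0),
    ]
)
```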
3.1: FastAPI core + read endpoints
FastAPI factory with lifespan (shared backend caching, disposal)
6 read endpoints: tenants, billing, chargebacks, resources, identities, health
Temporal query support (active_at vs period_start/period_end)
Database-level pagination with LIMIT/OFFSET
Datetime validation (reject naive datetimes)
ApiConfig extensions (enable_cors, cors_origins, request_timeout_seconds)
main.py --mode api|worker|both
63 API tests, 98% coverage (0b803d6)
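A small sketch of the "reject naive datetimes" rule, assuming a hypothetical helper require_aware(); the actual API validation may be wired differently (for example through request schemas).

```python
from datetime import datetime, timezone


def require_aware(value: datetime) -> datetime:
    """Return the value converted to UTC, or raise if it carries no UTC offset."""
    if value.tzinfo is None or value.utcoffset() is None:
        raise ValueError("naive datetime rejected: include a UTC offset")
    return value.astimezone(timezone.utc)


# An offset-aware input is accepted and normalised to UTC.
accepted = require_aware(datetime.fromisoformat("2026-01-01T05:00:00+05:00"))

# A naive input is rejected.
try:
    require_aware(datetime(2026, 1, 1))
    rejected = False
except ValueError:
    rejected = True
```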
Style: ruff format (2232aaa)
2.1: post-implementation hardening
Add connection pooling via requests.Session
Add close() method for session cleanup
Fix rate limit headers per Confluent Cloud API docs: use rateLimit-reset (relative seconds) not X-RateLimit-Reset
Add test coverage for RateLimit-Reset header variant
Module-level imports in test_connections.py
Remove dead code (unreachable raise)
Improve test_connection_close with mock assertion (d092874)
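A hedged sketch of reading a relative-seconds reset header. The function name rate_limit_wait() is hypothetical, and the 1.0s default floor is borrowed from the floor guard mentioned elsewhere in this changelog. Header names are compared case-insensitively, which covers both the rateLimit-reset and RateLimit-Reset spellings.

```python
from collections.abc import Mapping


def rate_limit_wait(headers: Mapping[str, str], floor: float = 1.0) -> float:
    """Seconds to wait before retrying, taken from a relative-seconds reset header."""
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("ratelimit-reset")
    if raw is None:
        return floor
    try:
        seconds = float(raw)
    except ValueError:
        # Unparseable header: fall back to the floor instead of failing the request.
        return floor
    return max(seconds, floor)
```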
Added memory folder to ignored (abe6bcc)
1.7: Pipeline orchestrator + workflow runner
ChargebackOrchestrator: gather→calculate pipeline with UTC validation, UNALLOCATED identity fallback, zero-gather protection, allocation retry
WorkflowRunner: concurrent tenant execution with per-tenant timeout
main.py: CLI entry point with --config-file, --env-file, --run-once
Storage: mark_resources_gathered, mark_needs_recalculation, increment_allocation_attempts, allocation_attempts column
Config: TenantConfig fields (allocation_retry_limit, max_dates_per_run, zero_gather_deletion_threshold, tenant_execution_timeout_seconds), LoggingConfig.per_module_levels
381 tests, 96% coverage (62 new tests)

Phase 1 Core Framework complete. (d0df416)
1.6: Metrics layer — MetricsSource protocol, PrometheusMetricsSource

Thread-safe TTL cache, retry with backoff + jitter, parallel query execution via ThreadPoolExecutor, basic/digest/bearer auth, bounded cache eviction. 44 tests, 100% coverage on metrics modules. (bd0ab91)
1.5: Allocation engine — helpers, registry with overrides, dynamic loader

AllocationContext/AllocationResult dataclasses, AllocatorRegistry with two-tier override support, 6 allocation helpers (usage_ratio, evenly, hybrid, to_owner, to_resource, active_fraction), split_amount_evenly, load_protocol_callable with signature validation for customer extensibility. Resolves TD-001/TD-002. (1d71fd8)
1.4: Storage layer — schema, repositories, mappers, migrations

SQLModel tables (7), per-entity repository protocols (6), stateless domain↔ORM mappers, UnitOfWork protocol, SQLModelBackend with engine cache, Alembic baseline migration. Temporal queries (find_active_at, find_by_period), star-schema chargebacks, Decimal-as-string for SQLite.

PipelineState + CustomTag domain models. TD-003 resolved (UnitOfWork import). TD-004/005 resolved (Alembic warning, ResourceWarning).

89 storage tests, 210 total, 95% coverage, 3 QA rounds. (9ce3382)
1.3: Configuration system — YAML loader, Pydantic models, env substitution (838f121)
1.2: Plugin protocols, registry, and loader

4 runtime_checkable protocols (CostAllocator, CostInput, ServiceHandler, EcosystemPlugin), factory-based PluginRegistry, EcosystemBundle with post-init product_type indexing, and discover_plugins() loader with structural validation. 26 tests, 100% coverage. (4bc2e66)
Flatten project structure to repo root

Move pyproject.toml, src/, tests/ from chargeback-engine/ subdirectory to repo root. Standard Python project layout. Update .gitignore for Python artifacts. Add uv.lock. (2a0b1c4)
1.1: Project scaffold + domain models

Core domain models: Resource, ResourceStatus, Identity, IdentitySet, IdentityResolution, BillingLineItem, ChargebackRow, CostType, MetricQuery, MetricRow. Pure dataclasses, frozen where immutability matters. 49 tests, 100% coverage. (8beaf76)
Initial project setup: .gitignore

Git exclusion rules for chargeback-engine. Excludes backlog/, .claude/, ccloud-chargeback-helper-reference/, CLAUDE.md from git. (1d145b7)

Documentation¶

Docs: TASK-105 — Add upgrade and migration guide to operations documentation

Covers backup procedures (SQLite/PostgreSQL), upgrade steps for Docker and source-based deployments, auto-migration behavior, rollback, and breaking changes policy. Linked from deployment doc and operations index. (5b5b7a7)
Docs: TASK-103 — Add CHANGELOG and release notes mechanism

Add git-cliff-powered changelog generation with Keep a Changelog format. Combined release + docs workflow replaces docs.yml: tag push triggers changelog-based GitHub Release creation then versioned MkDocs deployment. Includes CONTRIBUTING.md with commit conventions and release process. (e1af479)
Docs: Add project name origin and extended description to README (1ba7ec1)
Docs: TASK-102 — Add Docker-based quickstart guide to deployables/

Step-by-step guide for running the full stack with Docker Compose, covering config selection, credentials, UI profile, smoke tests, and teardown. Linked from root README. (554bf3b)
Docs: Open external links in new tab (00e64da)
Docs: Add pipeline flow diagram to data-flow.md (afacb06)
Docs: Fix ccloud-reference.md to match actual implementation (fcfc7e7)
Docs: Add README.md and fix mkdocs anchor slugify

Add root README with quick start, features, and doc links
Add slugify setting to mkdocs.yml for consistent anchor generation (9dfe76e)
Docs: TASK-029 — Comprehensive user documentation infrastructure

Add complete MkDocs-based documentation:
mkdocs.yml with Material theme, mermaid, versioning via mike
.github/workflows/docs.yml for tag-triggered deployment
20 markdown files covering getting-started, configuration, architecture, and operations
pyproject.toml docs dependency group (mkdocs-material, mike) (da4809f)

Fixed¶

Fix: Use uv run for mike in docs workflow

mike is installed in the docs dependency group via uv, not globally. Must invoke through uv run in CI where the venv isn't on PATH. (5498d35)
Fix: TASK-106 — Add CLI experience flags: --version, --validate, --show-config

Replace hardcoded API_VERSION with dynamic get_version() via importlib.metadata. Add --version (argparse built-in), --validate (config pre-flight check), and --show-config (resolved config with SecretStr masking) flags to CLI entry point. All three exit immediately without starting the engine or API server.

Also fix all pre-existing mypy strict errors (38 across 14 files) and ruff lint errors (51 across ~30 files) to pass newly added pre-commit hooks. (a359136) - Fix: TASK-101 — Fix UI auto-refresh cascade, filters instability, and missing product_category dimension

Split TenantContext into stable (tenant selection) and volatile (readiness polling) contexts to prevent 11 of 12 consumers from re-rendering every 5s during pipeline runs. Memoize filters object and add queryParams value in useChargebackFilters to eliminate ChargebackGrid and AllocationIssues cascade re-fetches. Fix product_sub_type → product_category in dashboard aggregation. (b4cdaa2) - Fix: TASK-100 — Fix date persistence, stop auto-refresh cascade, and add Refresh Data button

Three UX bugs fixed: (1) date range now persists across page navigation and reload via localStorage fallback (URL > localStorage > defaults), (2) readiness poll no longer cascades into data re-fetches — tenantsLoaded converted to useRef, setReadiness guarded by JSON fingerprint, context value memoized with useMemo, restartKey counter for error recovery, (3) Refresh Data button added to FilterPanel with dashboard key-remount and AG Grid cache refresh wiring.

Also fixes pre-existing jsdom/AbortController incompatibility that caused 45 test failures by adding a custom Vitest environment that restores native AbortController. (bcc4006) - Fix: TASK-099 — Add missing database indexes to eliminate full table scans on UI/Grafana polls

Add Alembic migration 008 with two composite indexes:
ix_chargeback_facts_dimension_timestamp(dimension_id, timestamp) on chargeback_facts
ix_chargeback_dimensions_eco_tenant(ecosystem, tenant_id) on chargeback_dimensions

Update table model __table_args__ to match migration. Fix ruff formatting in readiness tests. (73ad19b)
Fix: TASK-098 — Add AbortController to all fetch hooks and backend backpressure to prevent UI Connecting state

Frontend: Replace cancelled-flag pattern with AbortController in all data hooks (useInventorySummary, useDataAvailability, useAllocationIssues, useAggregation, useFilterOptions, TenantContext, ChargebackGrid, ChargebackDetailDrawer, TagManagementPage). In-flight requests now abort on unmount/dep change, preventing request stampede from overwhelming the backend.

Backend: Add uvicorn concurrency config (limit_concurrency=100, timeout_keep_alive=10), readiness endpoint TTL cache (2s via time.monotonic), and RequestTimeoutMiddleware (504 after request_timeout_seconds) to provide backpressure under load. (f0edc91) - Fix: TASK-097 — Add dates_pending_calculation to PipelineRunResult for log disambiguation

Add pending count from find_needing_calculation() to pipeline run summary logs so operators can distinguish caught-up (pending=0, calculated=0) from partial failure (pending=3, calculated=0) without cross-referencing error lines. (26c5c57) - Fix: TASK-096 — Implement read/write connection pool separation to fix frontend disconnects during pipeline execution

Separate read-only and read-write SQLite connection pools so API read endpoints never contend with the pipeline writer. Read-only engine uses PRAGMA query_only=1 to prevent lock escalation, eliminating SQLITE_BUSY errors and threadpool exhaustion during pipeline runs in --mode both.

Key changes:
Add ReadOnlyUnitOfWork protocol (ISP-compliant: no commit/rollback)
Add get_or_create_read_only_engine() with shared _create_cached_engine helper (DRY)
Add ReadOnlySQLModelUnitOfWork subclass with commit() guard
Split API dependencies: get_unit_of_work (read-only) / get_write_unit_of_work
Fix session leak: dependencies now own UoW context manager
Update all 15 route files: read routes use ReadOnlyUnitOfWork, write routes use UnitOfWork
Readiness/pipeline-status/tenants-list use read-only pool directly (eed262a)
Fix: TASK-095 — Fix SQL parameter explosion and N+1 query patterns in repository layer

Replace materialized dimension ID list in delete_before() with scalar subquery to avoid SQLite's 32K parameter limit. Rewrite _run_bulk_tag() to batch-fetch dimensions and tags (2 queries per 500-item chunk instead of 2N individual queries). Add chunking guards to _overlay_tags() and get_dimensions_batch(). Add find_tags_by_dimensions_and_key() batch method to TagRepository protocol. (5580802) - Fix: task-094 — Fix readiness endpoint and UI pipeline status across all startup modes

Six bugs fixed:
1. API-only mode no longer reports orphaned DB "running" records as active pipeline
2. Frontend shows mode-appropriate message in no_data state (API-only vs both)
3. Orphan cleanup extracted to shared function, called at API-only startup too
4. Frontend polls at 5s during active pipeline, 15s when idle
5. Dead app.state.pipeline_runs dict removed
6. Per-tenant permanent failure now visible in UI even when other tenants healthy (fd7932a)
Fix: task-093 — Add allocation issues diagnostic table

Add dedicated endpoint, repository method, and dashboard table for surfacing failed cost allocations grouped by dimension + error code, ordered by total_cost DESC. Filters exclude success codes (usage_ratio_allocation, even_split_allocation) and NULL allocation_detail. (28405d4) - Fix: task-092 — Add object inventory counters panel to dashboard

Add collapsible InventoryCounters panel showing resource and identity counts. New useInventorySummary hook fetches from /inventory/summary endpoint. Integration test verifies full wiring from page to component. (7d8a4c2) - Fix: task-091 — Add data availability timeline panel to dashboard

Add visual timeline panel showing dots for each date with chargeback data. Users can now immediately see data freshness and gaps in the dashboard.

New useDataAvailability hook fetches from /chargebacks/dates endpoint
New DataAvailabilityTimeline ECharts scatter chart with date filtering
Integrated into dashboard between stat cards and cost trend chart
16 new tests covering hook and component behavior (2c22d15)
Fix: task-090 — Convert filter inputs to dynamic dropdowns

Replace free-text Input components with Select dropdowns in FilterPanel. New useFilterOptions hook fetches identities, resources, and product types from backend APIs with Promise.all, deduplication, and error handling. Split into two effects to avoid refetching identities/resources on date change. Both call sites updated to pass tenantName prop.

176 tests, 96% coverage. (9521a4a)
Fix: task-089 — Add pie charts for environment and product sub-type

Add 4 pie charts to dashboard in responsive row: Environment, Resource, Product Type, Product Sub-Type. Create DimensionPieChart component with topNWithOther utility for top-10 + "Other" bucketing. Refactor CostByProductChart and CostByResourceChart to delegate to new component. Add environment_id and product_sub_type aggregation hooks.

Add topNWithOther() to aggregation.ts
Create DimensionPieChart component
Refactor CostByProductChart pie mode to use DimensionPieChart
Convert CostByResourceChart from table to pie
Add 2 new useAggregation hooks (environment_id, product_sub_type)
Update dashboard layout: 4 pies at xs=24 sm=12 lg=6
155 tests passing, 95.8% coverage (befe480)
Fix: task-088 — Add summary stat cards to dashboard

Add SummaryStatCards component showing Total Cost, Usage Cost, and Shared Cost at the top of CostDashboardPage. Update AggregationBucket and AggregationResponse types to include usage_amount and shared_amount fields matching the backend schema from task-084. Update existing test fixtures and MSW handlers to include the new fields. (df2bea5) - Fix: task-087 — Add object inventory counts endpoint

Add GET /api/v1/tenants/{tenant_name}/inventory/summary endpoint that returns counts of resources and identities grouped by type. Implements count_by_type() on both ResourceRepository and IdentityRepository protocols with GROUP BY queries. (5d5c575) - Fix: task-086 — Add data availability endpoint

Add GET /tenants/{tenant_name}/chargebacks/dates returning distinct dates with chargeback facts for a tenant. Adds get_distinct_dates to ChargebackRepository protocol and SQLModel implementation, using a lightweight DISTINCT date(timestamp) query with tenant-scoped subquery. (0e5ff1a) - Fix: task-085 — Add environment_id as groupable dimension in aggregate endpoint

Add environment_id to _VALID_GROUP_BY in aggregation route
Handle environment_id specially in aggregate() — maps to ResourceTable.parent_id
Use conditional LEFT OUTER JOIN on ResourceTable only when environment_id requested
5 new integration tests covering environment grouping, org-wide costs, multi-dimension (c492455)
Fix: task-084 — Split usage_amount and shared_amount in aggregate endpoint

Add usage_amount and shared_amount fields to aggregate endpoint response, allowing callers to distinguish usage-driven vs shared/infrastructure costs without separate filtered requests. Uses SQL CASE WHEN for single-pass aggregation. Backward compatible — total_amount unchanged. (8ea4073) - Fix: task-083 — Add shutdown_check to orchestrator for clean signal propagation

ChargebackOrchestrator now accepts optional shutdown_check callback. When set, the run() loop checks it before each billing date iteration and breaks cleanly if shutdown is requested. WorkflowRunner wires _is_shutdown_requested as the callback, enabling single Ctrl+C shutdown. (b1f89b2) - Fix: date range picker resets on change due to batched setSearchParams

Two sequential setFilter calls for start_date and end_date race under React Router's batched setSearchParams — the second overwrites the first. Add setFilters() batch setter and use it in FilterPanel's date picker. (331a055) - Fix: task-082 — Batch chargeback fact writes with session.add_all()

Replace per-row session.merge() with batched session.add_all() for chargeback facts. Adds upsert_batch() to ChargebackRepository protocol and implementation. Renames _process_billing_line to _collect_billing_line_rows, accumulates rows in CalculatePhase.run() and calls upsert_batch() once per date.

Performance: ~41K rows/day now written in single add_all() vs 41K merges. (4f9fc3f) - Fix: task-081 — Remove max_dates_per_run cap during backfill

Remove artificial date-processing limit that caused 90-day backfill to take ~3 hours instead of ~30 minutes. The cap provided no benefit since tenants run in parallel ThreadPoolExecutor threads.

Remove max_dates_per_run field from TenantConfig
Remove _max_dates_per_run and cap slice from ChargebackOrchestrator
Update example configs and docs
Add backward-compat tests for configs with extra field (0a0f5a2)
Fix: task-077 — ChainModel construction-time validation

Add __post_init__ to ChainModel that enforces:
Non-empty models sequence (ValueError if empty)
Last model must be TerminalModel (ValueError if not)

Updated 13 existing tests to comply with new validation. (afde827)
Fix: task-076 — Remove deprecated allocation helpers

Remove allocate_evenly_with_fallback from helpers.py now that the migration to SMK_INFRA_MODEL is complete. Delete stale tests, update assertions to use ChainModel-based allocators. No behavioral changes; cleanup only. (88a540d)
Fix: task-079 — SMK infrastructure allocation models (COMPUTE/STORAGE)

Migrate SELF_KAFKA_COMPUTE and SELF_KAFKA_STORAGE from allocate_evenly_with_fallback to composable ChainModel. Adds SMK_INFRA_MODEL with 3-tier chain:
Tier 0: EvenSplit over metrics_derived (CostType.USAGE)
Tier 1: EvenSplit over resource_active (NO_ACTIVE_IDENTITIES_LOCATED)
Tier 2: Terminal to UNALLOCATED (NO_IDENTITIES_LOCATED)

Fixes behavioral gap where static identities in resource_active were bypassed because allocate_evenly_with_fallback used tenant_period (always empty in SMK). (d898c8a)
Fix: task-080 — Generic Metrics-Only plugin composable models

Migrates GenericMetricsOnlyHandler from imperative allocation helpers (allocate_evenly_with_fallback, _make_usage_ratio_allocator closure) to declarative ChainModel composition.

Add make_model_from_config() factory for even_split (2-tier) and usage_ratio (3-tier) ChainModels with proper AllocationDetail codes
Replace _allocator_map with _model_map: dict[str, ChainModel]
Fix TestCircularImports sys.modules pollution that broke isinstance checks across module reloads
Add integration test through plugin.initialize() entry point (b131a3f)
Fix: task-075 — SMK allocation models migration to composable ChainModel
Created allocation_models.py with SMK_INGRESS_MODEL and SMK_EGRESS_MODEL
3-tier ChainModel: UsageRatio → EvenSplit(resource_active) → Terminal
Updated kafka.py to use models directly in _ALLOCATOR_MAP
Removed kafka_allocators.py (imperative logic now in models)
Updated all tests for new imports and behavioral delta (resource_active fallback) (71369b2)
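A simplified sketch of the 3-tier fallback idea (UsageRatio → EvenSplit(resource_active) → Terminal). It is not the project's ChainModel: the Chain class, ChainExhausted, the plain-dict context, and the tier functions are hypothetical stand-ins, and sending the terminal tier to "UNALLOCATED" is an assumption for this example. A tier returns None to hand over to the next one; the terminal tier always returns a result.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, Optional

Result = dict[str, Decimal]
Model = Callable[[Decimal, dict], Optional[Result]]


class ChainExhausted(Exception):
    """Raised when every tier declines (stand-in for an allocation error)."""


def usage_ratio(amount: Decimal, ctx: dict) -> Optional[Result]:
    """Tier 0: split proportionally to usage; decline when there is no usage."""
    usage = ctx.get("usage", {})
    total = sum(usage.values(), Decimal("0"))
    if total == 0:
        return None
    return {who: amount * used / total for who, used in usage.items()}


def even_split_active(amount: Decimal, ctx: dict) -> Optional[Result]:
    """Tier 1: split evenly across active identities; decline when there are none."""
    active = ctx.get("resource_active", [])
    if not active:
        return None
    return {who: amount / len(active) for who in active}


def terminal(amount: Decimal, ctx: dict) -> Optional[Result]:
    """Tier 2: never declines, so the chain always terminates."""
    return {"UNALLOCATED": amount}


@dataclass(frozen=True)
class Chain:
    models: tuple[Model, ...]

    def __post_init__(self) -> None:
        if not self.models:
            raise ValueError("chain needs at least one model")

    def allocate(self, amount: Decimal, ctx: dict) -> tuple[int, Result]:
        """Try each tier in order; return the winning tier index and its result."""
        for tier, model in enumerate(self.models):
            result = model(amount, ctx)
            if result is not None:
                return tier, result
        raise ChainExhausted("every tier declined")


chain = Chain((usage_ratio, even_split_active, terminal))
by_usage = chain.allocate(Decimal("90"), {"usage": {"a": Decimal("1"), "b": Decimal("2")}})
by_active = chain.allocate(Decimal("90"), {"resource_active": ["a", "b"]})
fallback = chain.allocate(Decimal("90"), {})
```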
Fix: task-074 — CCloud fallback allocator for unknown product types

Add get_fallback_allocator() to EcosystemPlugin protocol and wire through EcosystemBundle. CCloud returns unknown_allocator (allocates to resource_id with SHARED cost type), SMK/Generic return None. Orchestrator now dispatches to fallback_allocator instead of inline UNALLOCATED allocation, preserving cost lineage for unrecognized product types per reference UnknownAllocator. (aef7724) - Fix: task-073 — CCloud Org-wide model with UNALLOCATED terminal

Add ORG_WIDE_MODEL ChainModel with explicit UNALLOCATED terminal for org-wide costs (AUDIT_LOG_READ, SUPPORT). Fixes ALLOC-02 gap where org-wide costs were terminating to resource_id instead of UNALLOCATED system identity.

_ORG_WIDE_OWNER_TYPES excludes "principal" — only durable identity types
EvenSplit tier 0 across tenant_period owners (SA, user, pool)
TerminalModel tier 1 to UNALLOCATED with NO_IDENTITIES_LOCATED detail
org_wide_allocator delegates to ORG_WIDE_MODEL
28 tests covering all verification scenarios
Fixed pre-existing kafka_handler test import (kafka_num_cku_allocator) (1771b50)
Fix: task-072 — CCloud Kafka CKU composition model

Migrate kafka_num_cku_allocator to composable DynamicCompositionModel:
Add _extract_combined_usage helper (DRY: delegates to _extract_usage)
Add CKU_USAGE_CHAIN (4-tier: usage ratio → merged_active → tenant_period → terminal)
Add CKU_SHARED_CHAIN (3-tier: merged_active → tenant_period → terminal)
Add make_dynamic_cku_model() and _CKU_DYNAMIC_MODEL singleton
Add kafka_cku_allocator (single-line delegation to model)
Remove kafka_num_cku_allocator
Update kafka.py wiring for KAFKA_NUM_CKU/CKUS

27 new tests in test_cku_allocators.py. 100% coverage. (e427a23)
Fix: task-071 — KAFKA network models migration to composable ChainModel

Add _extract_usage helper for single-metric-key usage extraction
Add make_network_model factory producing 4-tier ChainModel
Add BYTES_IN_MODEL, BYTES_OUT_MODEL, PARTITION_MODEL constants
Add kafka_network_read_allocator, kafka_network_write_allocator, kafka_partition_allocator
Update _KAFKA_ALLOCATORS dict with direction-specific allocators
Fix metric direction blending: READ uses bytes_out, WRITE uses bytes_in (cf5fa79)
Fix: task-070 — FLINK_MODEL migration to composable allocation models

Migrates Flink allocators to use the composable ChainModel system:
Define FLINK_MODEL as 4-tier ChainModel (UsageRatio → merged_active → tenant_period → terminal)
Replace imperative flink_cfu_allocator with direct alias to FLINK_MODEL
Fix terminal tier: resource_id instead of "UNALLOCATED", SHARED instead of USAGE
Add tenant_period fallback tier (missing in original) (b88bb79)
Fix: task-069 — KSQLDB_MODEL migration to composable allocation models

Migrated ksqlDB allocator to use ChainModel with 3-tier fallback:
Tier 0: EvenSplit over merged_active (USAGE)
Tier 1: EvenSplit over tenant_period owners (SHARED, NO_ACTIVE_IDENTITIES_LOCATED)
Tier 2: Terminal to resource_id (SHARED, NO_IDENTITIES_LOCATED)

Also: helpers.py allocate_evenly now allows None allocation_detail for happy-path Tier 0 (user-approved design decision matching reference behavior). (16d9ac2)
Fix: task-068 — CONNECTOR_MODEL migration to composable allocation models

Add CONNECTOR_TASKS_MODEL (USAGE) and CONNECTOR_CAPACITY_MODEL (SHARED)
Replace imperative connector allocators with model aliases
Add AllocationDetail.NO_IDENTITIES_LOCATED on terminal tier
Remove fragile post-processing loop for cost_type override
Add 9 new tests including handler→allocator integration test (d6f509d)
Fix: task-067 — SR_MODEL migration for Schema Registry allocator

Migrate schema_registry_allocator to composable allocation model:
Create SR_MODEL ChainModel with 3 tiers (USAGE → SHARED → Terminal)
Tier 0: EvenSplit over merged_active (CostType.USAGE)
Tier 1: EvenSplit over tenant_period (CostType.SHARED + NO_ACTIVE_IDENTITIES_LOCATED)
Tier 2: Terminal to resource_id (CostType.SHARED + NO_IDENTITIES_LOCATED)
Fixes behavioral parity: Tier 0 now uses USAGE, Tier 2 uses resource_id not UNALLOCATED (a3394ba)
Fix: task-065 — CompositionModel and DynamicCompositionModel

Add composition models for splitting costs across multiple strategies:
CompositionModel: fixed ratios validated at construction, last component absorbs rounding remainder
DynamicCompositionModel: runtime-determined ratios via callable
Both inject composition_index and composition_ratio metadata
Both implement __call__ for CostAllocator compatibility (590db30)
Fix: task-064 — ChainModel meta-model

Add AllocationError exception and ChainModel dataclass for composable allocation fallback chains. ChainModel tries models in sequence, injects chain_tier metadata, supports debug logging, raises AllocationError on exhaustion. Includes 13 unit tests (100% coverage). (c88c5c3)
Fix: task-063 — TerminalModel and DirectOwnerModel primitives

Add two composable allocation models:
TerminalModel: always returns result, never None (chain termination)
DirectOwnerModel: returns None when owner unresolved (fallback trigger)

Both implement allocate() for AllocationModel and __call__ for CostAllocator. (d3d1def)
Fix: task-062 — EvenSplitModel and UsageRatioModel primitives

Add composable allocation model primitives for the CAM system:
EvenSplitModel: splits cost evenly across identities from source callable
UsageRatioModel: splits cost proportionally by usage values
Both implement allocate() for chain composition (returns None for fallback)
Both implement __call__() for CostAllocator compatibility (never returns None)
Extended allocate_evenly() with allocation_detail and cost_type params
Extended allocate_by_usage_ratio() with allocation_detail param (0edf175)
Fix: task-061 — AllocationModel protocol and AllocationContext dataclass

Add foundational protocol for composable allocation model system:
Create AllocationModel protocol with allocate() -> AllocationResult | None
Add metadata field to AllocationResult for chain execution diagnostics
10 tests covering protocol compliance, dataclass behavior, no circular imports (2ac0bca)
Fix: task-060 — Add frontend Docker container with compose profile

Multi-stage Dockerfile (node:22-alpine builder + nginx:1.27-alpine runtime) with nginx proxy to backend API. Compose profile ui enables opt-in frontend on port 8081. Includes security headers (X-Frame-Options, X-Content-Type-Options, X-XSS-Protection). (ed6fbc8) - Fix: task-059 — Filter tenant_period fallback to OWNER_IDENTITY_TYPES

allocate_evenly_with_fallback() and ksqldb_csu_allocator() now filter tenant_period identities to OWNER_IDENTITY_TYPES, excluding api_key and system identities from cost allocation. Added "principal" to OWNER_IDENTITY_TYPES for self-managed Kafka support.

4 new tests verify filter behavior and terminal fallback paths. (a9ca9fb)
Fix: task-058 — Implement graceful shutdown handling

Add _shutdown_event field and set_shutdown_event() to WorkflowRunner
Replace blocking wait() with 1-second polling loop in run_once()
Use executor.shutdown(wait=False, cancel_futures=True) for immediate exit
Add signal handlers for run-once mode (standalone and both modes)
Single Ctrl+C now exits within 2 seconds in all modes
Clean log message on shutdown (no ugly stacktrace) (18e065f)
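
The polling approach from task-058 can be sketched roughly as follows. This is a minimal illustration, not the project's WorkflowRunner: the `run_once` signature, the pool size, and the boolean return value are assumptions made for the example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait


def run_once(tasks, shutdown_event, poll_seconds=1.0):
    """Run tasks in a pool, polling so a shutdown request is noticed quickly.

    Returns True if every task finished, False if shutdown was requested first.
    """
    executor = ThreadPoolExecutor(max_workers=4)  # pool size is an assumption
    pending = {executor.submit(task) for task in tasks}
    try:
        while pending:
            if shutdown_event.is_set():
                return False
            # Wake up at least once per poll interval instead of blocking forever.
            _done, pending = wait(pending, timeout=poll_seconds)
        return True
    finally:
        # Do not wait for running work; drop anything still queued (Python 3.9+).
        executor.shutdown(wait=False, cancel_futures=True)
```

A signal handler would only need to call `shutdown_event.set()`; the loop then returns within one poll interval.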
Fix: task-057 — Fix alembic logging interference — restore root logger after migrations

Alembic's fileConfig() in env.py was overwriting the root logger configuration when migrations ran, silencing INFO/DEBUG logs for the rest of the process.

Added save/restore pattern around command.upgrade() to preserve root logger level and handlers across migration runs. (d60e23f)
Fix: task-056 — Fix Frontend UI: Date Filter Refresh + Dark Mode Default

Add useEffect in ChargebackGrid to purge AG Grid cache on filter change
Create useTheme hook with localStorage persistence, dark mode default
Integrate useTheme into App.tsx with ConfigProvider theme algorithm
Add theme toggle button in Layout header
Add comprehensive tests for all new functionality (142 tests pass) (f36aaba)
Fix: task-055 — Fix API key identity resolution in metrics_derived path

API keys now correctly resolve to their owners in three places:
1. metrics_derived path in identity_resolution.py
2. _kafka_usage_allocation via context["api_key_to_owner"] remapping
3. tenant_period fallbacks now filter to OWNER_IDENTITY_TYPES

Also fixes pre-existing logger declaration in prometheus.py. (f15b557) - Fix: task-054 — Use range query mode for Flink CFU metrics

Change query_mode from "instant" to "range" for _FLINK_METRICS_PRIMARY and _FLINK_METRICS_FALLBACK. Same fix as task-051 (Kafka): instant queries capture only one scrape interval, undercounting CFU usage when billing windows exceed scrape frequency. (f8c7fbc) - Fix: task-053 — Add per-endpoint semaphore to PrometheusMetricsSource

Adds max_concurrent_requests config (default 20) and BoundedSemaphore to limit total in-flight HTTP requests per Prometheus endpoint. Prevents connection storms when parallel orchestrator query() calls compound. (20ff550) - Fix: task-052 — Simplify _aggregate_rows dual-dict pattern

Replace two parallel dicts (aggregated + templates) with single _Bucket dataclass holding both total and template row. Eliminates implicit coupling and redundant dict lookup in output loop. (6980a77) - Fix: task-051 — Use range query mode for CCloud Kafka metrics

Changed query_mode from "instant" to "range" for _KAFKA_READ_METRICS and _KAFKA_WRITE_METRICS. Instant queries capture only one scrape interval's worth of data; range queries sum all intervals across the billing window for accurate principal allocation ratios. (4459d46) - Fix: task-050 — Make discovery window configurable

Add discovery_window_hours config option to SelfManagedKafkaConfig (default=1, gt=0). Pass through to run_combined_discovery() in both _validate_principal_label and build_shared_context call sites. Allows operators to extend lookback window for low-traffic clusters. (0be237e) - Fix: task-049 — Cache validation query for first gather cycle

Add _cached_discovery field to SelfManagedKafkaPlugin. Validation query result is stored and consumed on first build_shared_context() call, eliminating duplicate Prometheus round-trip per pipeline run. (3c98edd) - Fix: task-048 — Add TTLCache for identity/resource repository lookups

Add repository-scoped TTLCache (cachetools) to SQLModelIdentityRepository and SQLModelResourceRepository, eliminating redundant DB round-trips for repeated get() calls within a single UoW session. Cache invalidation on upsert/mark_deleted.

pyproject.toml: add cachetools>=5.0, types-cachetools>=5.0
repositories.py: TTLCache with configurable maxsize/ttl, cache-check-first get()
20 new tests covering cache hits, invalidation, TTL expiry, LRU eviction (2be0763)
Fix: task-047 — Add metadata filtering to find_by_period for Flink

Add metadata_filter parameter to ResourceRepository.find_by_period for DB-side JSON filtering. Flink handlers now pass compute_pool_id to filter statements at the SQL layer instead of loading all statements and filtering in Python. (1da2405) - Fix: task-046 — Use correlated subquery for chargeback delete

Replace two-phase Python-mediated DELETE with single atomic correlated subquery. Eliminates memory overhead from dimension ID list materialization and removes race condition window between SELECT and DELETE. (66767f9) - Fix: task-045 — Cache billing_window() computation per line

Pre-compute billing windows once per billing line in CalculatePhase.run() and pass cache to _compute_billing_windows, _prefetch_metrics, and _process_billing_line. Reduces billing_window() calls from 3N to N. (8315536) - Fix: task-044 — Consolidate three Prometheus discovery queries into one

Replaces three separate MetricQuery objects (_BROKERS_QUERY, _TOPICS_QUERY, PRINCIPALS_QUERY) with a single _COMBINED_DISCOVERY_QUERY that groups by all three labels. Eliminates 2 redundant Prometheus round-trips per gather cycle.

Changes:
- prometheus.py: Added run_combined_discovery() + converter functions
- shared_context.py: Added discovered_brokers/topics/principals fields
- plugin.py: build_shared_context() now runs combined query once
- plugin.py: _validate_principal_label() uses run_combined_discovery()
- handlers/kafka.py: Uses cached discovery results from shared_ctx
- metrics.py: MetricQuery.resource_label now accepts str | None

Query count reduction:
- prometheus+prometheus: 3/cycle + 1 startup → 1/cycle + 1 startup
- prometheus+static: 2/cycle → 1/cycle

Tests: 1859 passed, coverage 98% (ea1da2e)
Fix: task-040 — Use bulk UPDATE for pipeline state mark_* methods

Replace SELECT-then-UPDATE pattern with direct UPDATE statements in all 4 mark_* methods. Eliminates 90+ redundant queries per calculate cycle for 30 billing dates. Add test_mark_resources_gathered for complete test coverage. (4e20093) - Fix: task-037 — Pass cached identity/resource data to handlers

Eliminates redundant find_by_period calls in handlers by passing pre-built caches via ResolveContext parameter. Handlers now use cached_identities/cached_resources when available, falling back to DB queries only when context is None.

Add ResolveContext TypedDict to protocols.py
Orchestrator builds and passes context to handler.resolve_identities
Kafka/SR handlers use cached_identities to skip identity queries
Flink handlers use cached_resources with _get_flink_statement_resources helper
All other handlers accept context parameter (signature-only change) (dacee22)
Fix: task-043 — Add count=False parameter to skip COUNT query

Add count: bool = True parameter to ResourceRepository and IdentityRepository find_active_at/find_by_period methods. When count=False, skips SELECT COUNT(*) and returns 0 for total. Updates 6 internal callers that discard the count to pass count=False, eliminating unnecessary database round-trips per billing cycle. (7c296f3) - Fix: task-042 — Add indexes on temporal columns

Add indexes to created_at and deleted_at columns on resources and identities tables for O(log N) temporal queries instead of full scans.

base_tables.py: Add index=True to ResourceTable and IdentityTable temporal column declarations
Migration 006: Creates 4 indexes (ix_resources_created_at, ix_resources_deleted_at, ix_identities_created_at, ix_identities_deleted_at)
Tests: 4 tests covering index presence, migration upgrade/downgrade, and query plan verification
Also fixes pre-existing logger declaration in cost_input.py (4a6172c)
Fix: task-041 — Compute billing_windows once per calculate cycle

Compute _compute_billing_windows() once in run() and pass result to both _build_tenant_period_cache() and _build_resource_cache(), eliminating duplicate O(N) iteration over billing lines. (31d3fe1) - Fix: task-036 — Cache dimension lookups in ChargebackRepository

Add in-memory dimension cache to SQLModelChargebackRepository to eliminate N+1 SELECT queries. Cache is scoped to repository instance (UoW lifetime). Remove redundant session.get() call from upsert(). (9a1fcae) - Fix: task-039 — Batch Prometheus queries in ConstructedCostInput

Single range query for full [start, end) window instead of N per-day calls. Fallback to per-day queries on MetricsQueryError preserves partial billing. (811c57f) - Fix: task-038 — Parallelize Prometheus metrics prefetch loop

Parallelizes CalculatePhase._prefetch_metrics() using ThreadPoolExecutor to reduce serial network wait time. Adds configurable metrics_prefetch_workers to TenantConfig (default=4, range 1-20). Includes partial-failure handling that logs warnings and returns empty dict for failing groups instead of aborting the entire calculation. (547d855) - Fix: task-035 — Remove per-entity flush from repository upserts

Removed 5 unnecessary session.flush() calls from upsert methods (Resource, Identity, Billing, Chargeback, PipelineState). The UoW.commit() already flushes all pending changes atomically at transaction end.

Preserved flush in _get_or_create_dimension() where auto-generated dimension_id is needed as FK before fact row creation. (416b980) - Fix: Accumulate billing costs for duplicate PKs from CCloud API

CCloud Billing API can return multiple rows with same PK containing partial costs. Changed upsert() to sum costs instead of overwriting. Added explicit PrimaryKeyConstraint for deterministic session.get() order. (bc3a763) - Fix: Wire plugin storage modules into storage backend creation

The CCloud billing infrastructure (CCloudBillingLineItem, CCloudBillingRepository, CCloudStorageModule) was built but never connected to the runtime. create_storage_backend() always used CoreStorageModule, ignoring the plugin's storage module.

Changes:
- create_storage_backend() now accepts optional storage_module parameter
- workflow_runner passes plugin.get_storage_module() when creating storage
- API dependencies use get_storage_module_for_ecosystem() for correct repo
- Move ecosystem→storage_module mapping to plugins/storage_modules.py (DIP compliance)

This ensures CCloud billing uses 7-field PK (with env_id) preventing cross-environment billing collisions. (14c56e2) - Fix: task-033 — Wire up CCloudBillingLineItem in cost_input and add migration

The previous commit created the infrastructure but didn't wire it up:
- cost_input.py now uses CCloudBillingLineItem with env_id as direct field
- Added migration 006 to create ccloud_billing table with 7-field PK
- Migration includes data migration from billing to ccloud_billing for CCloud rows
- Fixed tests to expect env_id as direct field, not in metadata
- Excluded alembic migrations from logging coverage test (9b405d6)
Fix: task-033 — Plugin-Owned Storage Architecture

Move billing/resource/identity storage from core to plugins, fixing CCloud billing collision where env_id was missing from PK.

Key changes:
- BillingLineItem, Resource, Identity converted from dataclasses to Protocols
- StorageModule protocol added; EcosystemPlugin gains get_storage_module()
- CCloudBillingLineItem with env_id (7-field composite PK)
- Each plugin owns storage package (tables, repositories, module)
- SMK/GMO inherit CoreStorageModule for shared core tables
- SQLModelUnitOfWork/Backend now require StorageModule param
- env.py imports plugin tables for Alembic discovery

4 review rounds, 98.09% coverage, 1815 tests passed. (65f5153) - Fix: task-032 — Billing table PK missing product_category

Add product_category to billing table primary key to prevent row collisions when CCloud API returns billing lines with same (resource_id, product_type, timestamp) but different product_category values.

Changes:
- BillingTable.product_category promoted to primary_key=True
- Added _billing_pk() helper for 6-field PK tuple extraction
- Changed increment_allocation_attempts() to accept BillingLineItem
- Updated RetryChecker/RetryManager signatures accordingly
- Added migration 005 to alter billing table PK
- Added 6 verification tests for PK and signature changes (e5bbbf4)
Fix: TASK-031 — Comprehensive logging to all Python modules

Added logging infrastructure to 91 Python files:
- import logging + logger = logging.getLogger(__name__) boilerplate
- Debug logs at method entry with context params
- Info logs for significant events (counts, lifecycle)
- Warning logs for fallback decisions
- logger.exception() in all except blocks

Coverage: 98.22%, 1772 tests pass, 3 review rounds. (7dade96) - Fix: TASK-030 — Annotated example YAML configs for all ecosystems

Create 8 annotated example configurations with corresponding .env.example files in deployables/config/examples/:
- ccloud-minimal, ccloud-complete, ccloud-multi-tenant, ccloud-with-flink
- self-managed-minimal, self-managed-complete
- generic-postgres, generic-redis

Each config includes inline [Required|Optional] comments explaining every field, env var placeholders with ${VAR:-default} syntax, and realistic example values. All configs validated via load_config() in test suite. (7fdf5fd) - Fix: TASK-028 — Identity resolution full-table scans

Replace O(N) scans with targeted lookups:
- connector/ksqldb: find_by_period + loop → uow.resources.get()
- connector/ksqldb: identity dict → direct uow.identities.get()
- flink: add resource_type="flink_statement" filter to find_by_period
- flink: identity dict → per-owner get() with resolved cache (b6c8b2b)
Fix: TASK-027 — Granularity extensibility via PluginSettingsBase

Add granularity_durations field to PluginSettingsBase allowing plugins to define custom billing cadences (e.g., weekly: 168 hours) without modifying core engine code. Resolves OCP violation in GRANULARITY_DURATION.

Add granularity_durations: dict[str, int] with validator (min 1 hour)
Rename GRANULARITY_DURATION to _DEFAULT_GRANULARITY_DURATION
billing_window() accepts pre-merged durations from caller
CalculatePhase pre-merges durations at init, passes to all call sites (757d490)
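
As a rough illustration of the pre-merge step in TASK-027, the sketch below overlays plugin-defined cadences on core defaults and hands the merged mapping to a window helper. The default values, the `merge_durations` name, and the `billing_window` signature are assumptions for the example and may not match the project's code.

```python
from datetime import datetime, timedelta, timezone

# Assumed core defaults, in hours; the real _DEFAULT_GRANULARITY_DURATION may differ.
_DEFAULT_GRANULARITY_DURATION = {"hourly": 1, "daily": 24}


def merge_durations(plugin_durations):
    """Overlay plugin-defined cadences (hours, minimum 1) on the core defaults."""
    for name, hours in plugin_durations.items():
        if hours < 1:
            raise ValueError(f"granularity {name!r} must be at least 1 hour")
    return {**_DEFAULT_GRANULARITY_DURATION, **plugin_durations}


def billing_window(start, granularity, durations):
    """Return the [start, end) window using the caller's pre-merged durations."""
    return start, start + timedelta(hours=durations[granularity])


durations = merge_durations({"weekly": 168})
start = datetime(2026, 1, 1, tzinfo=timezone.utc)
window = billing_window(start, "weekly", durations)
```

Merging once and passing the result down keeps the core helper unaware of any particular plugin.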
Fix: TASK-026 — Export streaming with iter_by_filters
Add iter_by_filters to ChargebackRepository Protocol for batched streaming
Add _build_chargeback_where and _overlay_tags helpers to eliminate duplication
Refactor find_by_filters and find_dimension_ids_by_filters to use helpers
Replace find_by_filters(limit=100000) with iter_by_filters in export route
No more silent 100K row truncation; memory bounded to batch_size rows (aceb940)
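
The batched streaming idea in TASK-026 can be sketched as a generator that pulls one page at a time. Here `fetch_page` is a hypothetical callable standing in for the repository query, and the default batch size is an assumption.

```python
from typing import Callable, Iterator, Sequence


def iter_by_filters(
    fetch_page: Callable[[int, int], Sequence[dict]],
    batch_size: int = 1000,
) -> Iterator[dict]:
    """Yield rows page by page so at most batch_size rows are held at once."""
    offset = 0
    while True:
        batch = fetch_page(offset, batch_size)
        yield from batch
        if len(batch) < batch_size:
            return  # a short (or empty) page means the result set is exhausted
        offset += len(batch)


# Usage with an in-memory stand-in for the database.
rows = [{"id": i} for i in range(2500)]
streamed = list(iter_by_filters(lambda off, lim: rows[off:off + lim], 1000))
```

Because the generator has no row cap, an export route built on it does not truncate large result sets.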
Fix: TASK-024 — allocate_evenly_with_fallback helper for DRY allocator chain

Add core helper encoding standard fallback: merged_active → tenant_period → UNALLOCATED. Delete 4 duplicate implementations across SMK and generic plugins. Update SR allocator to use helper. 14 new tests, 1707 total passing. (3acb8d4) - Fix: TASK-023 — BaseServiceHandler convenience class for DRY handler boilerplate

Introduces opt-in BaseServiceHandler[ConnT, CfgT] base class that eliminates ~80 lines of duplicated scaffolding across 5 CCloud handlers:
- Standard 3-field __init__ (connection, config, ecosystem)
- Dict-lookup get_allocator() via class-level _ALLOCATOR_MAP
- Empty gather_identities() returning iter(())

Handlers adopting BaseServiceHandler:
- SchemaRegistryHandler, ConnectorHandler, KsqldbHandler: full adoption
- FlinkHandler: partial (keeps custom __init__ for _flink_regions)
- KafkaHandler: partial (keeps gather_identities() override)

OrgWide and Default handlers not migrated (different constructor signatures). (8447efa) - Fix: TASK-021 — resolve_date_range helper for DRY date conversion

Extract duplicated date→datetime conversion logic from 5 API routes into a single resolve_date_range() helper in dependencies.py.

Add resolve_date_range(start_date, end_date) -> tuple[datetime, datetime]
Replace inline date logic in chargebacks, billing, aggregation, tags, export routes
Fix bug: tags.py bulk_add_tags_by_filter now validates start_date <= end_date
Clean up unused imports per file
Add 6 tests covering defaults, explicit dates, ordering guard, edge cases (02721c1)
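
A helper along the lines of TASK-021 might look like the sketch below. The signature follows the entry above, but the 30-day default window, the inclusive end-of-day bound, and the ValueError are assumptions; the real helper may choose different defaults and error types.

```python
from datetime import date, datetime, time, timedelta, timezone


def resolve_date_range(start_date, end_date, default_days=30):
    """Convert optional dates into a validated pair of UTC datetimes."""
    today = datetime.now(timezone.utc).date()
    end_date = end_date or today
    start_date = start_date or end_date - timedelta(days=default_days)
    if start_date > end_date:
        # Ordering guard shared by every route that uses the helper.
        raise ValueError("start_date must be <= end_date")
    start = datetime.combine(start_date, time.min, tzinfo=timezone.utc)
    end = datetime.combine(end_date, time.max, tzinfo=timezone.utc)
    return start, end


start, end = resolve_date_range(date(2026, 1, 1), date(2026, 1, 31))
```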
Fix: TASK-020 — Retention cleanup reuses cached TenantRuntime storage

_cleanup_retention() now iterates _tenant_runtimes instead of _settings.tenants, using runtime.storage instead of creating a fresh backend. Eliminates redundant database engine creation and avoids SQLite single-writer conflicts. (144e82b) - Fix: TASK-009 — GenericMetricsOnlyPlugin for YAML-only ecosystems

Adds generic_metrics_only plugin enabling new metrics-only ecosystems via pure YAML config. No Python code required for new ecosystems.

CostQuantityConfig discriminated union: fixed, storage_gib, network_gib
CostTypeConfig with allocation_strategy: even_split or usage_ratio
GenericIdentitySourceConfig: prometheus, static, or both
Handler builds allocators and metrics queries from config at init
CostInput constructs billing lines from YAML rates + Prometheus data

Self-managed Kafka expressible as generic plugin YAML config. (6538840) - Fix: TASK-025 — Global exception handler for FastAPI

Add global exception handler that catches unhandled exceptions, logs full traceback server-side, and returns structured JSON error response with UUID for correlation. HTTPException passthrough remains unaffected. (7f57e0b) - Fix: TASK-002 — Add Emitter protocol and CSV implementation

Implements pluggable output stage for chargeback results:
- Emitter protocol in core/plugin/protocols.py
- EmitterRegistry with name-based registration
- EmitPhase runs after calculate, supports per-emitter aggregation
- CsvEmitter implementation with idempotent overwrites
- EmitterSpec config model with aggregation validation
- import_attr helper extracted from load_protocol_callable

67 new tests, 1608 total passing, 98.71% coverage. (7f7abce) - Fix: TASK-022 — Extract temporal query validation helper

Extract duplicated temporal validation logic from resources.py and identities.py into shared validate_temporal_params() function in dependencies.py. Adds TemporalParams frozen dataclass to carry validated, UTC-normalized values. (4acb1ff) - Fix: TASK-008 — Two-phase handler gather (LSP/DIP fix)

Eliminates implicit handler ordering via UoW side effects. Handlers now receive pre-gathered shared context from plugin's build_shared_context(), making them independently testable and substitutable.

Key changes:
- Add CCloudSharedContext and SMKSharedContext frozen dataclasses
- Add build_shared_context(tenant_id) to EcosystemPlugin protocol
- Add shared_ctx param to ServiceHandler.gather_resources
- GatherPhase calls build_shared_context once, threads to all handlers
- Handlers use shared_ctx instead of querying UoW for prior handler output
- SchemaRegistryHandler no longer calls gather_environments() (TD-028 resolved)
- Handler ordering no longer affects correctness (TD-027 properly resolved)

49 new tests, 1533 total passing, 98.57% coverage. (b4b62fd) - Fix: TASK-006 — Decompose orchestrator god class into phases

Decompose ChargebackOrchestrator (590+ lines, 10+ responsibilities) into:
- GatherPhase: resource/identity/billing gather, deletion detection, throttle
- CalculatePhase: metrics prefetch, cache building, per-line allocation
- RetryManager: retry counter persistence with RetryChecker protocol
- ChargebackOrchestrator: thin coordinator (~150 lines with compat wrappers)

Module-level: _load_overrides (5-tuple), _ensure_unallocated_identity

30 new tests, 1484 total pass, 98.64% coverage, orchestrator.py 100%. (f602b79) - Fix: TASK-014 — Plugin path configurable via AppSettings

Add plugins_path field to AppSettings with configurable override. Default computed from __file__ for CWD-independence. (42485f2)
Fix: TASK-013 — Centralize metrics step configuration

Add metrics_step_seconds to PluginSettingsBase with default 3600s. Eliminates 6 hardcoded timedelta(hours=1) call sites across orchestrator and self_managed_kafka plugin. Step now flows from YAML config through orchestrator, handlers, gathering functions, CostInput, and plugin validation. (2d6b1a4) - Fix: TASK-007 — Storage Backend Registry

Introduces StorageBackendRegistry pattern (mirrors PluginRegistry) to eliminate hardcoded SQLModelBackend instantiation in workflow_runner.py and dependencies.py. API now respects StorageConfig.backend instead of hardcoding.

src/core/storage/registry.py: new file with StorageBackendRegistry, create_storage_backend()
workflow_runner.py: removed _create_storage_backend, uses registry factory
dependencies.py: get_or_create_backend accepts StorageConfig, uses registry
tenants.py, pipeline.py: pass StorageConfig instead of connection_string
12 new tests for registry, 1427 total passing, 98.60% coverage (633c4ab)
Fix: TASK-019 — Eliminate DRY violation in _load_identity_resolver

Add IdentityResolver protocol to protocols.py and refactor _load_identity_resolver to delegate to load_protocol_callable, eliminating 25 lines of duplicated loading logic. Gains four safety checks: ImportError wrapping, AttributeError wrapping, class rejection, and protocol isinstance validation. (1d8cec9) - Fix: TASK-018 — Extract _detect_entity_deletions to eliminate DRY violation

Refactored _detect_deletions to use a single _detect_entity_deletions helper, eliminating ~50 lines of duplicated zero-gather-protected deletion logic. Replaced two separate counter attributes with _zero_gather_counters dict. Added _EntityRepo Protocol for structural typing (TYPE_CHECKING only). (9ddd121) - Fix: TASK-017 — Remove unused connection/config params from OrgWideCostHandler and DefaultHandler

OrgWideCostHandler.__init__ now takes only ecosystem param
DefaultHandler.__init__ same
Removed dead TYPE_CHECKING imports (CCloudConnection, CCloudPluginConfig) from both handlers
Updated plugin.py construction calls to pass only ecosystem
Removed misleading docstring sentence from DefaultHandler
Added constructor tests verifying new signature and rejecting old kwargs (519d894)
Fix: TASK-016 — Extract shared MetricsConnectionConfig and create_metrics_source factory

Consolidates duplicate Prometheus config models and factory logic from both plugins into core.metrics.config. CCloudMetricsConfig and MetricsConfig were byte-for-byte identical; inline factory blocks performed same unwrap/build sequence. (c5a1649) - Fix: TASK-015 — Standardize API routes on utc_today() instead of date.today()

Extract utc_today() helper in dependencies.py, replace date.today() in billing, aggregation, tags, export routes. Update chargebacks to use shared helper. Removes inline import in tags.py. All routes now use UTC-based date for default windows, preventing timezone drift. (407eb97) - Fix: TASK-012 — Add close() to EcosystemPlugin/MetricsSource protocols, remove hasattr duck-typing

Add close() lifecycle method to both protocols, retype TenantRuntime.plugin from object to EcosystemPlugin under TYPE_CHECKING, remove all hasattr guards in TenantRuntime.close(), and add _metrics_source cleanup to both plugin close() methods to prevent HTTP connection leaks. (632e60a) - Fix: TASK-011 — Move CCloud-specific enums out of core AllocationDetail

Remove NO_FLINK_STMT_NAME_TO_OWNER_MAP, FAILED_TO_LOCATE_FLINK_STATEMENT_OWNER, and CLUSTER_LINKING_COST from core AllocationDetail enum (DIP violation). Create plugins/confluent_cloud/constants.py with plain string constants. Delete FAILED_TO_LOCATE_FLINK_STATEMENT_OWNER (dead code, no consumers). (fd24249) - Fix: TASK-010 — Validate plugin_settings at config load via PluginSettingsBase

Add PluginSettingsBase Pydantic model with orchestrator-consumed fields (allocator_params, allocator_overrides, identity_resolution_overrides, min_refresh_gap_seconds). TenantConfig.plugin_settings now validates at load time instead of silently accepting any dict. Plugin configs inherit PluginSettingsBase. Orchestrator uses typed attribute access. (121ccf5) - Fix: TASK-004 — Replace sys.exit(1) with tenant suspension in WorkflowRunner

sys.exit(1) in a worker thread only raises SystemExit in that thread, which CPython silently swallows, so the process never exits. This fix:
- Removes sys.exit(1) from _run_tenant, lets GatherFailureThresholdError propagate
- Adds _failed_tenants dict with lock for thread-safe tenant suspension tracking
- Adds _build_cached_fatal_result() and _mark_tenant_permanently_failed() helpers
- Updates run_once() and run_tenant() to skip/handle permanently failed tenants
- Adds get_failed_tenants() API for visibility into suspended tenants
- Adds all-tenants-failed CRITICAL alert in run_loop()
- Adds fatal: bool field to PipelineRunResult

9 new tests verify threshold breach handling, tenant skip logic, thread safety. (7431224) - Fix: TASK-003 — Plugin discovery contract now works for both plugins

Both plugins now expose register() -> tuple[str, Callable] matching the loader contract. Previously self_managed_kafka had wrong signature (took registry arg) and confluent_cloud had no register() at all.

self_managed_kafka/__init__.py: register() returns tuple, no args
confluent_cloud/__init__.py: added register() returning tuple
16 new tests covering registration and discovery
Removed stale tests calling old broken signature (68da3db)
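
The loader contract from TASK-003 can be pictured with the following sketch. `ExamplePlugin`, the ecosystem name, and `load_plugin` are hypothetical stand-ins; only the shape of `register()` (no arguments, returns a name and a callable) comes from the entry above.

```python
from __future__ import annotations

from typing import Callable


class ExamplePlugin:
    """Hypothetical stand-in for a real ecosystem plugin class."""


def register() -> tuple[str, Callable[[], object]]:
    """Take no arguments and return (ecosystem name, plugin factory)."""
    return "example_ecosystem", ExamplePlugin


def load_plugin(register_fn: Callable[[], tuple[str, Callable[[], object]]], registry: dict) -> None:
    """Loader side: call register() and store the factory under its name."""
    name, factory = register_fn()
    registry[name] = factory


registry: dict = {}
load_plugin(register, registry)
```

A register() that expects a registry argument, as the old self_managed_kafka version did, would fail at the `register_fn()` call in this kind of loader.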
Fix: TASK-005 — Single shared WorkflowRunner in both mode + per-tenant run guard
Extract _create_runner() factory in main.py
run_worker() accepts injected runner + shutdown_event for both mode
Add _running_tenants set + lock to WorkflowRunner for per-tenant concurrency guard
Add is_tenant_running() method for API pre-check
Add drain(timeout) for graceful shutdown (waits for running tenants before close)
Add already_running field to PipelineRunResult
Add "skipped" status to PipelineRun for already-running cases
trigger_pipeline returns 200 with status="already_running" when tenant is running
app.py lifespan calls drain() via asyncio.to_thread before disposing backends (a333499)
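
A per-tenant concurrency guard of the kind described in TASK-005 could be sketched like this. The `TenantRunGuard` class and its `try_start`/`finish` methods are illustrative names; in the project the set and lock live on WorkflowRunner itself.

```python
import threading


class TenantRunGuard:
    """Track which tenants have a pipeline run in flight."""

    def __init__(self):
        self._running_tenants = set()
        self._lock = threading.Lock()

    def try_start(self, tenant_id):
        """Claim the tenant; return False if a run is already in progress."""
        with self._lock:
            if tenant_id in self._running_tenants:
                return False
            self._running_tenants.add(tenant_id)
            return True

    def finish(self, tenant_id):
        """Release the tenant once its run ends (call from a finally block)."""
        with self._lock:
            self._running_tenants.discard(tenant_id)

    def is_tenant_running(self, tenant_id):
        with self._lock:
            return tenant_id in self._running_tenants


guard = TenantRunGuard()
first = guard.try_start("tenant-a")
second = guard.try_start("tenant-a")  # rejected while the first run is active
```

Checking and claiming under one lock acquisition avoids the gap that a separate is_tenant_running() check followed by an add would leave.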
Fix: TASK-001 — Flink fallback statement filter now uses is_stopped boolean

The _fallback_from_running_statements filter checked metadata["status"] which gathering never writes. Replaced with metadata["is_stopped"] (bool) to match what gathering actually populates, aligning with reference code behavior. Updated existing test fixtures and added 4 targeted tests. (a895d3f) - Fix: GAP-25 — Add X-RateLimit-Reset header support for rate limit handling

CCloudConnection now parses the legacy X-RateLimit-Reset header (Unix timestamp) when handling 429 responses. The server-provided reset time is used instead of falling back to generic exponential backoff. (d3c4761) - Fix: GAP-24 — Connector fallback now uses resource-local instead of tenant-period

When no active identities are found for a connector, cost is now assigned to the connector resource itself (identity_id=resource_id) instead of being spread across all tenant identities. This matches legacy behavior where unresolved connectors stay resource-local, not tenant-smeared. (be33aa8) - Fix: GAP-23 — Filter system identities from tenant-period cache

UNALLOCATED (identity_type="system") was leaking into tenant_period cache, causing allocators to split costs N+1 ways instead of N ways. Fixed by filtering system identities at cache population time in orchestrator. (bcef628) - Fix: GAP-22 — Decimal split helpers can crash on non-0.0001 amounts

Add quantize-before-split and modulo wraparound to split_amount_evenly and allocate_by_usage_ratio remainder loops. Prevents IndexError when input amounts have precision beyond the 0.0001 quantization step. (f0230c7) - Fix: GAP-21 — run_tenant() bootstrap flag now properly latched

Replace inline bootstrap in run_tenant() with delegation to bootstrap_storage(). Ensures all tenants are bootstrapped and _bootstrapped flag is set, preventing redundant DDL on subsequent calls. (9aca32b) - Fix: GAP-20 — Allocation retry attempts now persist across transaction rollback

When an allocator fails, the retry counter increment now uses a separate UoW that commits independently, ensuring the attempt count survives the main transaction's rollback. This enables proper escalation to UNALLOCATED after the configured retry limit is reached. (d915eef) - Fix: GAP-19 — Flink query uses shared resource-filter injection contract

Replace hardcoded {resource_id=~"lfcp-.+"} selector with {} placeholder in both _FLINK_METRICS_PRIMARY and _FLINK_METRICS_FALLBACK. This enables the shared _inject_resource_filter() mechanism to inject per-pool filters consistently with other handlers. (540999b) - Fix: GAP-18 — Metrics query instant vs range mode support

Add query_mode field to MetricQuery with "instant" and "range" options. CCloud handlers now use instant queries for parity with reference code.

MetricQuery.query_mode defaults to "range" (preserves existing behavior)
PrometheusMetricsSource routes by query_mode to /api/v1/query or /api/v1/query_range
Extracted _execute_cached_post for DRY cache handling
CCloud Kafka and Flink metrics use query_mode="instant"
13 new tests for instant mode routing, caching, and error handling (53029c5)
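
The routing in GAP-18 comes down to picking one of the two Prometheus HTTP API endpoints. The sketch below uses a simplified MetricQuery with a hypothetical `expression` field; the project's dataclass has more fields, and `endpoint_for` is an illustrative name.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class MetricQuery:
    expression: str  # hypothetical field name for the PromQL text
    query_mode: Literal["instant", "range"] = "range"


def endpoint_for(query: MetricQuery) -> str:
    """Pick the Prometheus HTTP API path that matches the query mode."""
    if query.query_mode == "instant":
        return "/api/v1/query"
    if query.query_mode == "range":
        return "/api/v1/query_range"
    raise ValueError(f"unsupported query_mode: {query.query_mode!r}")


default_path = endpoint_for(MetricQuery("up"))
instant_path = endpoint_for(MetricQuery("up", query_mode="instant"))
```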
Fix: GAP-17 — Malformed billing handling can drop or collide costs
Add row_index parameter to _map_billing_item for unique fallback IDs
Add _map_malformed_item to preserve hard-failure rows with metadata flag
Update _fetch_window to use enumerate and yield malformed items instead of dropping
Malformed rows get resource_id=malformed_billing_{idx} and metadata["malformed"]=True (6897b78)
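
The keep-instead-of-drop behaviour from GAP-17 might be sketched as follows. The raw item shape and the `total_cost` field are assumptions for the example; only the `malformed_billing_{idx}` ID and the `metadata["malformed"]` flag come from the entry above.

```python
from typing import Any, Iterable, Iterator


def map_billing_items(raw_items: Iterable[Any]) -> Iterator[dict]:
    """Yield one row per raw item, keeping malformed rows instead of dropping them."""
    for idx, item in enumerate(raw_items):
        try:
            row = {
                "resource_id": item["resource_id"],
                "total_cost": item["total_cost"],
                "metadata": {},
            }
        except (KeyError, TypeError):
            # The row index makes each fallback ID unique, so two malformed
            # rows in the same window cannot collide on the same key.
            row = {
                "resource_id": f"malformed_billing_{idx}",
                "total_cost": None,
                "metadata": {"malformed": True},
            }
        yield row


mapped = list(map_billing_items([
    {"resource_id": "lkc-1", "total_cost": "1.5"},
    {"total_cost": "2.0"},  # missing resource_id
    None,                   # not a mapping at all
]))
```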
Fix: GAP-16 — Flink CFU metric name fallback support

Add dual-metric query support for Flink CFU metrics. Primary metric (confluent_flink_num_cfu) takes precedence; falls back to legacy metric (confluent_flink_statement_utilization_cfu_minutes_consumed) for tenants still exporting the old metric name. Pre-filters metrics_data to prevent double-counting when both metrics present. (dd5589b) - Fix: GAP-15 — ksqlDB owner resolution reads wrong field

Read owner_id from top-level Resource field instead of metadata dict, with metadata fallback for legacy resources. Also moved sentinel helper to shared _identity_helpers module and applied Pythonic improvements. (d4ea7d5) - Fix: GAP-14 — Kafka network direction-specific allocation

Handler returns direction-specific metrics: READ → bytes_out only, WRITE → bytes_in only, CKU → both. Prevents cross-attribution of asymmetric producer/consumer traffic. (f36090e) - Fix: GAP-13 — Self-managed Kafka correctness issues

Issue 1: Split self_kafka_network_allocator into ingress/egress variants to correctly attribute directional network costs (was conflating bytes_in and bytes_out, giving 50/50 split regardless of actual usage direction)
Issue 2: Rename BYTES_PER_GB to _BYTES_PER_GIB and config fields storage_per_gb_hourly, network*_per_gb to _per_gib (breaking change: matches actual 2^30 math being performed)
Issue 3: Add principal label validation in plugin.initialize() with static fallback when Prometheus lacks principal labels or is unreachable (22860f0)
Fix: GAP-12 — Connector auth mode fallback probing + masked key detection
Add auth mode fallback probing in gather_connectors(): when kafka.auth.mode is absent, probe for kafka.api.key or kafka.service.account.id to infer mode
Add masked API key detection: keys that are all asterisks or empty string now return connector_api_key_masked sentinel instead of shared unknown
Add distinct connector_api_key_not_found sentinel for keys not in DB
UNKNOWN auth mode now uses connector_id as identity for per-connector attribution instead of shared connector_credentials_unknown sentinel (edafc7f)
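
The masked-key and not-found sentinels from GAP-12 could be distinguished roughly as below. The function name and the `known_owners` mapping are hypothetical; the two sentinel strings are taken from the entry above.

```python
def classify_connector_api_key(api_key: str, known_owners: dict) -> str:
    """Resolve a connector API key to its owner, or to a specific sentinel."""
    # An empty key or one made only of asterisks was masked by the API.
    if api_key == "" or set(api_key) == {"*"}:
        return "connector_api_key_masked"
    owner = known_owners.get(api_key)
    if owner is None:
        return "connector_api_key_not_found"
    return owner


owners = {"ABC123": "sa-1"}  # illustrative key-to-owner mapping
masked = classify_connector_api_key("********", owners)
missing = classify_connector_api_key("ZZZ999", owners)
resolved = classify_connector_api_key("ABC123", owners)
```

Separate sentinels keep masked keys and unknown keys apart in reports, rather than merging both into one shared bucket.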
Fix: GAP-11 — ksqlDB allocator fallback cost type semantics

ksqlDB CSU allocator now uses correct cost types for each fallback tier:
- merged_active identities → USAGE (attributed consumption)
- tenant_period fallback → SHARED (can't attribute specifically)
- No identities → assign to resource_id with SHARED via allocate_to_resource

Previously all paths were forced to USAGE, losing the semantic distinction between attributed and unattributed costs. (7cf3364) - Fix: GAP-10 — Identity pool gathering and org-wide allocator deduplication

Add gather_identity_providers + gather_identity_pools calls to KafkaHandler
Add IdentitySet.ids_by_type() for filtering by identity type
Filter org_wide_allocator to owner types (SA, user, pool) — excludes API keys
Filter connector_allocators tenant_period fallback to owner types
Add OWNER_IDENTITY_TYPES constant to eliminate DRY violation (d50c451)
Fix: GAP-09 — Flink no-metrics fallback for cost attribution

Add secondary fallback path when Prometheus metrics are unavailable:
Identity resolution queries running Flink statements from resource DB
Allocator falls back to even split across merged_active identities
Both paths preserve USAGE cost type for Flink consumption costs (d10aca6)
Fix: GAP-08 — Add missing CUSTOM_CONNECT product types

Add CUSTOM_CONNECT_NUM_TASKS and CUSTOM_CONNECT_THROUGHPUT to ConnectorHandler. Both map to connect_tasks_allocator (USAGE cost type), matching reference code behavior. (b1169c0)
Fix: GAP-07 — Per-date resource lookup cache

Pre-fetch resources per billing window and pass them as a dict to _process_billing_line, eliminating redundant uow.resources.get() calls. Ten billing lines sharing the same resource_id now trigger one find_by_period call instead of ten get() calls. (a952858)
Fix: GAP-06 — Prometheus cache TTL alignment with LRU eviction

Change cache_ttl_seconds default from 300s to 3600s (survives 30-min run cycle)
Support cache_ttl_seconds=None for lifetime caching (matches reference lru_cache)
Replace dict cache with OrderedDict for proper LRU semantics
Add move_to_end() on cache hit for LRU promotion
Add popitem(last=False) when full for LRU eviction (replaces skip-caching)
Add race-condition guard in cache-store to handle concurrent fetches (88dbfe8)
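A minimal sketch of the cache behavior listed above, using only the standard library. The OrderedDict calls (move_to_end, popitem(last=False)) and the None-means-lifetime TTL follow the entry; the class name TTLLRUCache, the get_or_fetch method, and max_size are hypothetical and not the project's actual API.

```python
import threading
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class TTLLRUCache:
    """Hypothetical LRU cache with an optional TTL (None means never expire)."""

    def __init__(self, max_size: int = 128, ttl_seconds: Optional[float] = 3600.0) -> None:
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._entries: OrderedDict = OrderedDict()  # key -> (stored_at, value)
        self._lock = threading.Lock()

    def _fresh(self, stored_at: float) -> bool:
        return self._ttl is None or (time.monotonic() - stored_at) < self._ttl

    def get_or_fetch(self, key: Hashable, fetch: Callable[[], Any]) -> Any:
        with self._lock:
            hit = self._entries.get(key)
            if hit is not None and self._fresh(hit[0]):
                self._entries.move_to_end(key)  # LRU promotion on hit
                return hit[1]
        value = fetch()  # run the slow fetch outside the lock
        with self._lock:
            existing = self._entries.get(key)
            if existing is not None and self._fresh(existing[0]):
                # Another caller stored a fresh value while we were fetching.
                self._entries.move_to_end(key)
                return existing[1]
            self._entries[key] = (time.monotonic(), value)
            self._entries.move_to_end(key)
            while len(self._entries) > self._max_size:
                self._entries.popitem(last=False)  # evict least recently used
            return value
```

Evicting the oldest entry keeps the cache useful once it is full, whereas skipping the store would freeze its contents at whatever arrived first.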
Fix: GAP-05 — Per-endpoint page size tuning for CCloud API

Add endpoint-specific page_size overrides to prevent timeouts on complex Flink endpoints and align with reference code behavior:
Flink compute pools: 50 (complex nested objects)
Flink statements: 50 (complex nested objects)
ksqlDB clusters: 100 (moderate complexity)
API keys: 100 (moderate complexity)
Schema Registry: 50 (moderate complexity) (0f87905)
Fix: GAP-03 test logging pollution from alembic

Alembic's command.upgrade/downgrade calls logging.config.fileConfig() which sets disable_existing_loggers=True, disabling the core.storage.backends.sqlmodel.repositories logger. This broke the billing revision test when run after migration tests.
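One way to undo that side effect is to snapshot each logger's disabled flag and put it back afterwards. The sketch below uses only the standard library; preserve_logger_state is a hypothetical name, and the project's actual fixture may look different.

```python
import contextlib
import logging
import logging.config


@contextlib.contextmanager
def preserve_logger_state():
    """Hypothetical helper: restore every logger's `disabled` flag on exit."""
    manager = logging.Logger.manager
    snapshot = {
        name: logger.disabled
        for name, logger in manager.loggerDict.items()
        if isinstance(logger, logging.Logger)  # skip PlaceHolder entries
    }
    try:
        yield
    finally:
        for name, was_disabled in snapshot.items():
            logger = manager.loggerDict.get(name)
            if isinstance(logger, logging.Logger):
                logger.disabled = was_disabled
```

In a pytest suite the same body could be wrapped in a fixture so that it runs around each migration test.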

Add fixture to restore logger state after each migration test. (1a2338a)
Fix: GAP-04 — API object refresh throttle with failure escalation

Add 30-minute (configurable) throttle to prevent excessive CCloud API calls during orchestrator gather phase. Skip resource/identity/billing gather if last successful gather was within min_refresh_gap.

Additionally, add gather failure escalation: after N consecutive gather failures (default 5, configurable via gather_failure_threshold), raise GatherFailureThresholdError causing program exit for operator attention.

Add _last_resource_gather_at and _min_refresh_gap to orchestrator
Add gather_failure_threshold to TenantConfig
Add GatherFailureThresholdError exception with sys.exit(1) handler
10 new tests covering throttle and escalation behavior (85dcff1)
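The throttle and escalation rules can be sketched together as below. GatherFailureThresholdError, the 30-minute gap, and the default threshold of 5 come from this entry; the GatherThrottle class and its maybe_gather method are hypothetical, and raising on exactly the Nth consecutive failure is an assumption about the boundary.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


class GatherFailureThresholdError(RuntimeError):
    """Raised when gathers have failed too many times in a row."""


class GatherThrottle:
    """Hypothetical wrapper around a gather callable."""

    def __init__(self, min_refresh_gap: timedelta = timedelta(minutes=30),
                 failure_threshold: int = 5) -> None:
        self._min_refresh_gap = min_refresh_gap
        self._failure_threshold = failure_threshold
        self._last_success_at: Optional[datetime] = None
        self._consecutive_failures = 0

    def maybe_gather(self, gather: Callable[[], None],
                     now: Optional[datetime] = None) -> bool:
        """Run gather() unless throttled; return True only if it ran and succeeded."""
        now = now or datetime.now(timezone.utc)
        if (self._last_success_at is not None
                and now - self._last_success_at < self._min_refresh_gap):
            return False  # last successful gather is recent enough, skip
        try:
            gather()
        except Exception as exc:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self._failure_threshold:
                raise GatherFailureThresholdError(
                    f"{self._consecutive_failures} consecutive gather failures"
                ) from exc
            return False
        self._consecutive_failures = 0
        self._last_success_at = now
        return True
```

A top-level handler could catch GatherFailureThresholdError and call sys.exit(1), as the entry describes.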
Fix: GAP-03 — Billing revision detection with logging

Add change detection to SQLModelBillingRepository.upsert():
Check for existing record before merge
Log warning when total_cost differs (billing revision detected)
Allow overwrite per approved design divergence
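The change-detection steps above can be sketched with an in-memory stand-in. InMemoryBillingRepository, BillingLine, and the single-string key are hypothetical simplifications; the real SQLModelBillingRepository works against a database session.

```python
import logging
from dataclasses import dataclass
from decimal import Decimal

logger = logging.getLogger("billing_repository_sketch")


@dataclass(frozen=True)
class BillingLine:
    key: str              # stand-in for the record's real identifying columns
    total_cost: Decimal


class InMemoryBillingRepository:
    """Hypothetical stand-in that mimics upsert with revision detection."""

    def __init__(self) -> None:
        self._rows: dict[str, BillingLine] = {}

    def upsert(self, line: BillingLine) -> BillingLine:
        existing = self._rows.get(line.key)  # check before overwriting
        if existing is not None and existing.total_cost != line.total_cost:
            logger.warning(
                "Billing revision detected for %s: %s -> %s",
                line.key, existing.total_cost, line.total_cost,
            )
        self._rows[line.key] = line  # overwrite is still allowed
        return line
```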

Tests: 3 new tests covering revision detection, no-op for same cost, and no warning for new records. (b9ba1d5)
Fix: GAP-02 — Default/cluster-linking allocators assign to resource_id

Allocators now assign costs to billing_line.resource_id instead of UNALLOCATED. Preserves resource lineage for Tableflow and cluster-linking costs that can't be attributed to specific identities. (cface12)
Fix: GAP-01 — Network allocator tiered fallback for audit granularity

Adds 7-branch tiered fallback to Kafka network allocators matching reference code behavior. Each fallback path now writes a distinct AllocationDetail code so auditors can determine why a particular allocation path was taken.

Key changes:
Add 3 AllocationDetail enum values for Tier 2/3 fallback branches
Replace _kafka_shared_allocation with _fallback_no_metrics + _fallback_zero_usage
Add _even_split_with_detail and _to_resource_with_detail helpers
Terminal fallback assigns to resource_id (not UNALLOCATED)
12 new tests covering all 7 reachable branches + edge cases
100% coverage on kafka_allocators.py (dfbd2d1)
Fix(ccloud): extract shared sentinel helper, fix connector identity nits (433c0c6)
Fix(ksqldb): address nits from quality review

Replace redundant comment on cost type override with one that explains WHY allocate_evenly's SHARED default is overridden (compute consumption semantics require USAGE)
Move ksqldb_csu_allocator import to module level in test file; remove six repeated local imports inside test methods (fce1733)
Fix(kafka): correct PromQL query templates for resource filtering

Bug: Query templates used invalid {{kafka_id="{resource_id}"}} syntax.
Double braces don't escape in regular strings (only f-strings/.format())
{resource_id} and {step} placeholders were never substituted
Result was invalid PromQL sent to Prometheus

Fix:
Use proper {} placeholder pattern matching reference code
_inject_resource_filter replaces {} with {kafka_id="lkc-xxx"}
Corrected metric names: request_bytes/response_bytes (per reference)
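The placeholder substitution described above can be sketched as follows. The function here is a hypothetical stand-in for _inject_resource_filter, and the query template (metric name, aggregation, and [1h] window) is an assumption for illustration only.

```python
# Illustrative template only: the metric name and the [1h] window are assumptions.
QUERY_TEMPLATE = (
    "sum by (principal_id) "
    "(increase(confluent_kafka_server_request_bytes{}[1h]))"
)


def inject_resource_filter(template: str, resource_id: str) -> str:
    """Replace the empty {} selector with a kafka_id label matcher."""
    # Escape characters that would break out of the quoted label value.
    escaped = resource_id.replace("\\", "\\\\").replace('"', '\\"')
    return template.replace("{}", '{kafka_id="' + escaped + '"}')


query = inject_resource_filter(QUERY_TEMPLATE, "lkc-abc123")

# The old pattern only works if something calls .format() on it. As a plain
# string, the doubled braces reach Prometheus verbatim.
OLD_TEMPLATE = '{{kafka_id="{resource_id}"}}'
assert OLD_TEMPLATE != '{kafka_id="lkc-abc123"}'
assert OLD_TEMPLATE.format(resource_id="lkc-abc123") == '{kafka_id="lkc-abc123"}'
```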

This ensures that metrics are filtered to the correct cluster and that the queries use valid PromQL syntax that Prometheus can parse. (9bbe80d)
Fix: address review issues for chunk 2.2

MAJORS:
Rename gather() params to match CostInput protocol (start/end)
Fix uow type annotation (UnitOfWork, not Optional)

MINORS:
Remove dead Z replace in _parse_iso_datetime (Python 3.14)
Add comment for connector created_at omission
Use json.dumps for deterministic hash in flink sentinel
Replace global _tenant_counter with uuid

NITS:
Add ECOSYSTEM constant comment
Move _last_request_time update after request
Add Flink cache key assumption comment
Remove ValueError from _safe_decimal except
Add floor guard to _get_rate_limit_wait
Move cast to module-level import
Remove redundant continue (daec279)
Fix(config): reject shared connection strings across tenants

Tenant isolation is convention-enforced (query-level), not structural. Until full isolation is implemented (TD-025), prevent two tenants from sharing a storage connection_string at config validation time. (bd22efc)
Fix: post-review hardening round 2

Move create_tables() to bootstrap_storage() — called once at startup, not per-tenant per-cycle. run_once auto-bootstraps if not already done.
Parenthesized except + fmt:skip for Python 3.12/3.13 compatibility
Single-cycle path now logs results via shared _log_results()
DRY: extract _log_results() used by both loop and single-cycle paths
4 new bootstrap tests, 1 new single-cycle logging test (53106ee)
Fix: post-review hardening (items 1,3,7)
loading.py: parenthesized except + fmt:skip for 3.12/3.13 compat
workflow_runner.py: single-cycle path now logs results (parity with loop) (d9aa30e)
Fix(runner): move timeout=0 comment to effective_timeout line (AR-002) (12852d1)
Fix(orchestrator): GAP-001+004+006 — resources_gathered per billing date, UTC reassignment, error propagation

GAP-001: Mark resources_gathered per billing date, not just today.
GAP-004: Reassign _ensure_utc() return values using dataclasses.replace().
GAP-006: _gather() returns errors list, propagated to PipelineRunResult. (a9e00ba)
Fix(mappers): reject naive datetimes on write path (GAP-014)

Split ensure_utc into permissive (read) and strict (write). Write-path _to_table() functions now raise on naive datetimes. Read-path _to_domain() functions remain permissive for DB compat. (b06b855)
Fix(helpers): use UNALLOCATED fallback in allocators (GAP-013) (5f74e9e)
Fix(loading): correct Python 2 except syntax (GAP-012)

except ValueError, TypeError → except (ValueError, TypeError) (03ac8cd)
Fix(ccloud): type safety and lint fixes for chunk 2.1

Add cast() for resp.json() return (mypy no-any-return)
Add explicit type annotations in _calculate_backoff
Add validation for auth_type='none' with credentials
Add compare=False to _auth field
Constrain allocator_params to primitive types with validator
Add responses library for HTTP mocking in tests (b5806d9)