Agent Operating Procedures

skills

Installable coding-agent skills that turn recurring engineering judgment into reusable Markdown workflows. Includes premortem, postmortem, and 4x4 branch-tournament procedures for reducing agent slop.

tags
  • #agent-skills
  • #developer-tools
  • #operating-procedures
Agent Runtime / Legal Workflow

murdock

Provenance-first legal workflow system for turning messy document work into traceable agent operations. Built around typed contracts, preservation artifacts, and backend-owned evidence paths instead of opaque chat output.

tags
  • #agent-runtime
  • #provenance
  • #workflow-systems
Agent Runtime / 3D Systems

mobisim

Browser-native 3D vehicle inspection workspace with a voice-first AI agent. Built semantic app-layer hooks that ground tool calls in GLB asset structure and act as least-privilege permission boundaries for reversible, deterministic mutations.

tags
  • #agent-runtime
  • #context-engineering
  • #typescript
Robotics / Data Flywheel

droid_loop

Multimodal failure-mining harness for DROID robot episodes. Uses SigLIP embeddings, IsolationForest, HDBSCAN, temporal jump detection, and VLM review to turn rare manipulation failures into validated training-data slices.

tags
  • #physical-ai-robotics
  • #dataset-curation
  • #vlm
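One way to picture the temporal jump detection step in droid_loop: compare each frame embedding with the previous one and flag frames where the cosine distance spikes. This is a minimal sketch, not the project's implementation. The function names, the plain-list embeddings, and the 0.5 threshold are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a degenerate (all-zero) embedding as maximally different.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def find_temporal_jumps(
    frames: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[int]:
    """Return indices of frames whose embedding jumps away from the previous frame."""
    return [
        i
        for i in range(1, len(frames))
        if cosine_distance(frames[i - 1], frames[i]) > threshold
    ]


# Hypothetical 2-D embeddings standing in for per-frame SigLIP vectors.
episode = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
jumps = find_temporal_jumps(episode)
```

In a pipeline like the one described, flagged indices would be candidates for the later VLM review rather than confirmed failures.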
Training / Verifier Systems

sympy-rlvr

GRPO training system built from scratch on Qwen2.5-1.5B with symbolic verification rewards. Improved GSM8K accuracy from 22% to 36% without GSM8K supervision using staged SFT + RLVR, LoRA adapters, curriculum synthesis, and MLflow tracking.

tags
  • #rlvr
  • #verifier-rewards
  • #math-reasoning
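The core of GRPO is that each sampled completion is scored relative to the other completions for the same prompt, so no learned value function is needed. The sketch below shows only that group-relative advantage step. It assumes scalar verifier rewards (for example 1.0 for a symbolically verified answer and 0.0 otherwise) and uses the population standard deviation. The function name and the `eps` value are illustrative, not taken from the project.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical verifier outcomes for four completions of one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal, which is one reason curriculum difficulty matters in this kind of setup.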
Benchmark Harness / Measurement

bench_lab

Benchmarking workspace for making model and system comparisons repeatable instead of notebook-local. Focused on explicit run setup, evaluation loops, result inspection, and reusable measurement contracts.

tags
  • #benchmark-harness
  • #evaluation
  • #python
Model Observability / Tracing

trace_lm

Tracing layer for LLM systems that turns prompts, tool calls, and execution paths into inspectable structured runs. Built around the idea that agent behavior should be debugged as runtime state, not reconstructed from opaque logs.

tags
  • #observability
  • #agent-runtime
  • #tracing
Context Engineering / Retrieval

ragops

Retrieval evaluation toolkit for measuring how context assembly fails. Covers ingestion, indexing, hybrid retrieval experiments, Recall@k, MRR, and generation-quality checks so RAG changes can be tuned with evidence.

tags
  • #context-engineering
  • #retrieval-eval
  • #rag
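For reference, the two retrieval metrics named above can be computed as follows. This is a minimal sketch using the standard definitions of Recall@k and MRR. The function names and the document-ID representation are assumptions, not the toolkit's API.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked results, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


# Hypothetical query results: document IDs in ranked order plus the gold set.
example_recall = recall_at_k(["a", "b", "c"], {"b", "d"}, k=2)
example_mrr = mean_reciprocal_rank([(["a", "b"], {"a"}), (["a", "b"], {"b"})])
```

Recall@k shows whether the right context made it into the window at all, while MRR shows how early the first useful document appears.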
Eval Harness / Agent QA

llm-evals-lab

Evaluation harness for frontier-model natural-language tasks using Chain-of-Verification agents, automated rubrics, LangSmith tracing, DuckDB run storage, and dbt-modeled quality metrics. Async batching improved eval throughput by about 53%.

tags
  • #eval-harness
  • #rubrics
  • #langsmith
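The async batching mentioned above can be sketched as bounded-concurrency fan-out with `asyncio`. This is an illustration under assumptions, not the harness's code: `run_evals`, `fake_evaluate`, and the concurrency limit are hypothetical, and the stand-in evaluator just sleeps where a real one would call a model.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def run_evals(
    items: Sequence[str],
    evaluate: Callable[[str], Awaitable[int]],
    max_concurrency: int = 8,
) -> List[int]:
    """Evaluate all items concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: str) -> int:
        async with semaphore:
            return await evaluate(item)

    # gather returns results in the same order as the input items.
    return list(await asyncio.gather(*(guarded(item) for item in items)))


async def fake_evaluate(prompt: str) -> int:
    """Stand-in for a model call; the score here is just the prompt length."""
    await asyncio.sleep(0.01)
    return len(prompt)


scores = asyncio.run(run_evals(["a", "bb", "ccc"], fake_evaluate, max_concurrency=2))
```

The semaphore caps concurrent requests so throughput improves without exceeding provider rate limits.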

World-model research for stochastic cellular automata using entropy-aware patch selection and autoregressive transformers. Improved BPC to 1.7x baseline and converged nearly 2x faster while preserving close train/eval alignment.

tags
  • #embeddings
  • #research-project
  • #stochastic-processes