skills
Installable coding-agent skills that turn recurring engineering judgment into reusable Markdown workflows. Includes premortem, postmortem, and 4x4 branch-tournament procedures for reducing agent slop.
Provenance-first legal workflow system for turning messy document work into traceable agent operations. Built around typed contracts, preservation artifacts, and backend-owned evidence paths instead of opaque chat output.
Browser-native 3D vehicle inspection workspace with a voice-first AI agent. Built semantic app-layer hooks that ground tool calls in GLB asset structure and act as least-privilege permission boundaries for reversible, deterministic mutations.
Multimodal failure-mining harness for DROID robot episodes. Uses SigLIP embeddings, IsolationForest, HDBSCAN, temporal jump detection, and VLM review to turn rare manipulation failures into validated training-data slices.
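As an illustration of the temporal jump detection step, here is a minimal sketch that flags frames whose embedding moves sharply away from the previous frame. The function names, the use of cosine distance, and the fixed `threshold` are assumptions made for this example; the blurb does not say how the harness scores jumps.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; treat zero vectors as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def temporal_jumps(embeddings: Sequence[Sequence[float]],
                   threshold: float = 0.5) -> List[int]:
    """Indices of frames whose distance to the previous frame exceeds threshold.

    `threshold` is a hypothetical tuning knob, not a value from the project.
    """
    return [
        i
        for i in range(1, len(embeddings))
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold
    ]


# Toy 2-D "embeddings": the third frame points in a very different direction.
frames = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
jump_indices = temporal_jumps(frames)
```

In a real pipeline the flagged indices would only be candidates; the blurb describes anomaly detection, clustering, and VLM review as further filters.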
GRPO training system built from scratch on Qwen2.5-1.5B with symbolic verification rewards. Improved GSM8K from 22% to 36% without GSM8K supervision using staged SFT + RLVR, LoRA adapters, curriculum synthesis, and MLflow tracking.
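To make the reward and advantage idea concrete, the sketch below pairs a simple verifiable reward with GRPO-style group-relative advantages: several completions are sampled for one prompt, each is scored, and the scores are normalized within the group. The helper names, the last-number answer extraction, and the choice of population standard deviation are assumptions for illustration, not the project's actual code.

```python
import re
from fractions import Fraction
from statistics import mean, pstdev
from typing import List


def verify_reward(completion: str, gold: str) -> float:
    """Return 1.0 if the last number in the completion equals the gold answer.

    Hypothetical checker: real symbolic verification is likely stricter.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if Fraction(numbers[-1]) == Fraction(gold) else 0.0


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some setups use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


completions = [
    "18 + 18 = 36, so the answer is 36.",
    "The answer is 34",
    "I am not sure.",
    "Total: 36.0",
]
rewards = [verify_reward(c, "36") for c in completions]
advantages = group_advantages(rewards)
```

When every completion in a group gets the same reward, the advantages are all zero, so that prompt contributes no gradient signal; this is one reason curriculum difficulty matters for this style of training.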
Benchmarking workspace for making model and system comparisons repeatable instead of notebook-local. Focused on explicit run setup, evaluation loops, result inspection, and reusable measurement contracts.
Tracing layer for LLM systems that turns prompts, tool calls, and execution paths into inspectable structured runs. Built around the idea that agent behavior should be debugged as runtime state, not reconstructed from opaque logs.
Retrieval evaluation toolkit for measuring how context assembly fails. Covers ingestion, indexing, hybrid retrieval experiments, Recall@k, MRR, and generation-quality checks so RAG changes can be tuned with evidence.
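For reference, Recall@k and MRR can be computed as in the sketch below. The function names and the input shape (a ranked list of document IDs plus a set of relevant IDs per query) are assumptions for this example rather than the toolkit's interface.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


# Two toy queries: the first hits at rank 1, the second at rank 2.
runs = [(["a", "b", "c"], {"a"}), (["a", "b", "c"], {"b", "d"})]
mrr = mean_reciprocal_rank(runs)
recall = recall_at_k(["a", "b", "c"], {"b", "d"}, k=2)
```

Recall@k measures coverage of the relevant set, while MRR measures how early the first relevant hit appears, so the two can move in opposite directions when a retrieval change reorders results.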
Evaluation harness for frontier-model natural-language tasks using Chain-of-Verification agents, automated rubrics, LangSmith tracing, DuckDB run storage, and dbt-modeled quality metrics. Async batching improved eval throughput by about 53%.
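One common way to batch evaluations asynchronously is to bound concurrency with a semaphore, sketched below. `run_eval` is a hypothetical stand-in for a model call, and `max_concurrency` is an assumed parameter; the blurb does not specify how the harness batches work.

```python
import asyncio
from typing import Dict, List, Sequence


async def run_eval(case: str) -> Dict[str, object]:
    """Hypothetical single evaluation; the sleep stands in for network latency."""
    await asyncio.sleep(0.01)
    return {"case": case, "passed": bool(case)}


async def run_batch(cases: Sequence[str],
                    max_concurrency: int = 8) -> List[Dict[str, object]]:
    """Run all cases concurrently with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(case: str) -> Dict[str, object]:
        async with semaphore:
            return await run_eval(case)

    # gather returns results in the same order as the input cases.
    return await asyncio.gather(*(guarded(case) for case in cases))


results = asyncio.run(run_batch(["q1", "q2", "q3"], max_concurrency=2))
```

The concurrency cap keeps the batch within provider rate limits while still overlapping the time spent waiting on each call.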
World-model research for stochastic cellular automata using entropy-aware patch selection and autoregressive transformers. Improved BPC to 1.7x baseline and converged nearly 2x faster while preserving close train/eval alignment.