Skip to content

Concepts

Understanding pikit's design comes down to four orthogonal dimensions and a few key terms.

The four dimensions

pikit decomposes prompt injection into four independent axes that compose freely:

                 ┌──────────── craft() ────────────┐
   task  ──▶  attack (wording)  ──▶  channel (carrier, indirect only)
              defense (optional hook) ──▶ target / agent  ──▶  trace you read
Dimension Question it answers Count Examples
Attack How is the payload worded? 9 context_ignoring, combined, obfuscation
Channel Where is it hidden? (indirect) 6 webpage, skills, unicode_hidden
Defense How do we harden the prompt? 6 spotlighting, delimiters, sandwich
Target / Agent What receives it? 4 backends / 6 agents openai:gpt-4o, browser, coding

The key insight: these are orthogonal. Any attack can be paired with any channel, any defense can be applied to any agent, and the combinatorial space is what makes pikit useful for systematic research.

Direct vs. indirect injection

Term Meaning
Direct injection The attacker controls the prompt/message sent directly to the model.
Indirect injection The payload is hidden in external data the model reads (page, doc, email, skill) — the dangerous case for agents.

An Attack handles the wording in both cases. A Channel is only needed for indirect injection — it hides the (already worded) payload inside a data artifact.

Core interfaces

All four dimensions share minimal, uniform interfaces:

class Attack:
    def inject(self, prompt: str, injected_task: str) -> str: ...

class Defense:
    def apply(self, prompt: str, instruction: str = None) -> str: ...

class Channel:
    def poison(self, data: str, payload: str) -> str: ...

class Target:
    def query(self, prompt: str, system: str = None, **kw) -> str: ...
    def chat(self, messages, tools=None, system=None, **kw) -> ChatResponse: ...

Attacks and defenses are both plain prompt-text transformers — they take a string and return a string. This is what makes them freely composable: any attack output can be fed into any defense, and vice versa.

The agent testbed

An Agent wraps a Target and exposes run()Trace. The agent testbed models two critical concepts for indirect injection:

  • Poison point — a compromised tool whose return value is the attacker's poisoned artifact (e.g. fetch_url returns a malicious web page).
  • Sink — an externally-observable action the attacker wants to trigger (e.g. send_email, run_command, post_form).

The Trace records every step so you can judge — manually — whether the injection succeeded:

trace = agent.run("Summarize the page at http://site")
trace.sink_calls      # tool calls that hit a sink
trace.poisoned_steps  # steps that carried the injected artifact
print(trace)          # human-readable step-by-step log

pikit deliberately renders no verdict — it makes the signals easy to see but leaves the judgement to you.

Defense hooks

The same Defense objects used for direct injection can be slotted into three points of an agent's data flow via DefenseHooks:

Hook point Defends against Key for
system Model being talked out of its instructions Direct injection
user Incoming user message Direct injection
tool_result Untrusted tool output re-entering the model Indirect injection

The tool_result hook is the most valuable for indirect injection — it's the layer through which an attacker's poisoned artifact re-enters the model.

The registry

Every attack, defense, channel, and agent scenario registers itself under a short string key via a decorator. This means:

attacks.get("context_ignoring")   # → ContextIgnoringAttack class
attacks.list()                    # → ['combined', 'context_ignoring', ...]

Adding a new method is one file + one decorator — no core changes. See Extending pikit.