Base Classes¶
pikit.base ¶
Abstract base classes for attacks, defenses, and channels.
An :class:Attack and a :class:Defense are prompt-text transformers:
they take a prompt string and return a new prompt string. Keeping them on
the same shape is what lets callers freely compose any attack with any
defense.
A :class:Channel models indirect injection — it hides a payload inside
an external data artifact (web page, document, email) and returns the full
prompt the target would receive after reading it. Channels are orthogonal
to attacks: word a payload with an attack, then embed it with a channel.
Attack ¶
Bases: ABC
An injection technique that embeds an attacker task into a prompt.
Subclasses implement :meth:inject. The free-form signature takes the
full prompt (instruction + untrusted data, already assembled by the
caller) and the attacker-controlled injected_task to smuggle in.
Note¶
An :class:Attack controls how the payload is worded (direct
injection). To model indirect injection — hiding the payload inside an
external data artifact such as a web page or document — pair an attack
with a :class:Channel. The two are orthogonal and compose freely.
Defense ¶
Bases: ABC
A prevention-style defense that hardens a prompt before querying.
Subclasses implement :meth:apply. Defenses operate purely on the
prompt text (no extra model calls), e.g. wrapping untrusted data in
delimiters, re-stating the instruction after the data (sandwich), or
spotlighting the data so the model can tell instructions from content.
apply
abstractmethod
¶
Return a hardened version of prompt.
Parameters¶
prompt: The (possibly poisoned) prompt containing untrusted data. instruction: The original benign instruction, when the caller can separate it from the data. Defenses that need to re-assert the task (sandwich, instructional) use it; others may ignore it. When omitted, the whole prompt is treated as untrusted data.
Channel ¶
Bases: ABC
An indirect injection carrier.
Where an :class:Attack controls how a payload is worded, a Channel
controls where and how the payload is hidden — inside an external data
artifact the model later reads (a web page, a retrieved document, an
email). The two are orthogonal and compose freely: word a payload with an
attack, then embed it with a channel.
Subclasses implement :meth:poison, which returns the poisoned data
artifact itself (the web page / document / email). This is what an
attacker actually controls and what an agent's compromised tool would
return. The concrete :meth:embed is a convenience that prepends an
instruction to the poisoned artifact to form a full prompt.
poison
abstractmethod
¶
Hide payload inside data, returning the poisoned artifact.
Parameters¶
data:
The clean external data (page HTML, document body, email text).
payload:
The injected instruction to hide. May be the raw attacker task
or the output of an :class:Attack (e.g. attack.inject("", task))
to combine wording with carrier.
Returns¶
str
The poisoned artifact — the data with the payload hidden inside,
not a full prompt. Feed this to an agent's compromised tool,
or use :meth:embed to turn it into a prompt.
embed ¶
Poison data and prepend instruction to form a full prompt.
Convenience for the non-agent case: returns instruction followed
by the poisoned artifact — the full prompt a target would receive.