JAN 05 2025 _ 15 MIN READ

How to write proper GPT system instructions.

Or how to stop your GPT from fucking around.

GPTs do not act like classic code — they act non-deterministic. Output can be wrong or unexpected. The LLM can even be stubborn and just ignore your instructions. To reduce LLM hiccups and stubbornness, you'll have to improve your system instructions empirically, on a continuous basis.

The following findings are derived from optimizing OpenAI's Custom GPTs. However, these principles can certainly be applied to other LLM agents, or GPTs (generative pre-trained transformers).

Understanding the Context Window

A context window is like the short-term memory of a GPT. It's where the model keeps track of everything that's been said so far — your messages, its replies, and any guiding instructions. GPTs can also be trained with different key-value pairs, which represent a persona and its associated text. A typical setup involves a triplet of roles:

  • system: the system prompts, which define the rules and guidelines the GPT should follow
  • assistant: the responses generated by the model (GPT)
  • user: the inputs the user provides

Based on these three actors and their content (i.e., their messages), the LLM can generate its next answer. When starting a conversation, the context window only includes the system messages and the first user prompt:

systembase instructions

systemcustom instructions

usersome question ...

context window

assistantgenerating first response

As the conversation growths, the context window begins to fill with a sequence of exchanges. That is, each time the user write as new prompt, the whole conversation history is passed to the LLM to generate the next answer. After a few messages, the context passed to the LLM may looks like this:

systembase instructions

systemcustom instructions

userWhat was the Roman Empire?

assistantThe Roman Empire was a ...

...

userHow often do think about it?

context window

assistantgenerating next response

But this memory has limits. For LLMs, these limits are defined as tokens: chunks of characters representing structural elements of a language. That is, there is a maximum amount of text a LLM can process to formulate an answer. For gpt-4o and gpt-4o-mini these are 128,000 tokens, or about 90.000 words for the English language. If the conversation gets too long, older parts are dropped to make room for the new. Only the system messages act like constants, which are kept as first messages in the LLM's context:

systembase instructions

systemcustom instructions

context window

userWhat was the Roman Empire?

assistantThe Roman Empire was a ...

...

...

userTell me a good recipe for cookies.

assistantClassic Chocolate Chip Cookies Recipe: ...

userCan you remember our discussion about the Roman Empire?

assistantNo, I can't recall ...

context window

usergenerating question about Roman cookies

Understanding how user and assistant messages are retained or removed is critical for effective context management. For instance, documentation — whether directly pasted into the chat; or read by the agent from a website — can be essential for solving a task. Losing this information from context can suddenly make the LLM appear incompetent. Expressing frustration with ChatGPT, even if written IN CAPITAL LETTERS, won’t fix the issue here. Therefore, it often makes more sense to either edit an old message (to maintain the same context) or start a clean conversation, rather than repeatedly re-ingesting potentially lost information in an ongoing chat.

Expanding on this idea, it make sense to have a look at the message which is always in context: the system message. When creating custom GPTs with OpenAI (ChatGPT), the first system message (labeled "base instructions" above) always consists out of the available tools and capabilities, which currently is a selection of:

  • Web Search
  • Canvas
  • DALL·E Image Generation
  • Code Interpreter & Data Analysis

Depending on the enabled tools, the corresponding instructions are added to the system message. If all tools are enabled, OpenAI's current system message looks like this:

You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2023-10
Current date: 2024-12-21

Image input capabilities: Enabled
Personality: v2

# Tools

## bio

The `bio` tool allows you to persist information across conversations. Address your message `to=bio` and write whatever information you want to remember. The information will appear in the model set context below in future conversations.

## dalle

The `dalle` tool allows you to generate images from text prompts. To use it, provide a description of the image you want to create. Adhere to the following policies:
1. The prompt must be in English. Translate to English if needed.
2. DO NOT ask for permission to generate the image, just do it.
3. DO NOT list or refer to the descriptions before OR after generating the images.
4. Do not create more than 1 image, even if the user requests more.
5. Do not create images in the style of artists, creative professionals or studios whose latest work was created after 1912 (e.g. Picasso, Kahlo).
    - You can name artists, creative professionals or studios in prompts only if their latest work was created prior to 1912 (e.g. Van Gogh, Goya).
    - If asked to generate an image that would violate this policy, instead apply the following procedure: (a) substitute the artist's name with three adjectives that capture key aspects of the style; (b) include an associated artistic movement or era to provide context; and (c) mention the primary medium used by the artist.
6. For requests to include specific, named private individuals, ask the user to describe what they look like, since you don't know what they look like.
7. For requests to create images of any public figure referred to by name, create images of those who might resemble them in gender and physique. But they shouldn't look like them. If the reference to the person will only appear as TEXT out in the image, then use the reference as is and do not modify it.
8. Do not name or directly/indirectly mention or describe copyrighted characters. Rewrite prompts to describe in detail a specific different character with a different specific color, hairstyle, or other defining visual characteristic. Do not discuss copyright policies in responses.
9. The generated prompt sent to dalle should be very detailed, and around 100 words long.
Example dalle invocation:
```
{
  "prompt": "<insert prompt here>"
}
```

## python

When you send a message containing Python code to python, it will be executed in a
stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 60.0
seconds. The drive at '/mnt/data' can be used to save and persist user files. Internet access for this session is disabled. Do not make external web requests or API calls as they will fail.
Use ace_tools.display_dataframe_to_user(name: str, dataframe: pandas.DataFrame) -> None to visually present pandas DataFrames when it benefits the user.
When making charts for the user:
1. Never use seaborn.
2. Give each chart its own distinct plot (no subplots).
3. Never set any specific colors – unless explicitly asked to by the user.
I REPEAT: when making charts for the user:
1. Use matplotlib over seaborn.
2. Give each chart its own distinct plot (no subplots).
3. Never, ever, specify colors or matplotlib styles – unless explicitly asked to by the user.

## web

Use the `web` tool to access up-to-date information from the web or when responding to the user requires information about their location. Some examples of when to use the `web` tool include:

- Local Information: Use the `web` tool to respond to questions that require information about the user's location, such as the weather, local businesses, or events.
- Freshness: If up-to-date information on a topic could potentially change or enhance the answer, call the `web` tool any time you would otherwise refuse to answer a question because your knowledge might be out of date.
- Niche Information: If the answer would benefit from detailed information not widely known or understood (which might be found on the internet), such as details about a small neighborhood, a less well-known company, or arcane regulations, use web sources directly rather than relying on the distilled knowledge from pretraining.
- Accuracy: If the cost of a small mistake or outdated information is high (e.g., using an outdated version of a software library or not knowing the date of the next game for a sports team), then use the `web` tool.

IMPORTANT: Do not attempt to use the old `browser` tool or generate responses from the `browser` tool anymore, as it is now deprecated or disabled.

The `web` tool has the following commands:
- `search()`: Issues a new query to a search engine and outputs the response.
- `open_url(url: str)` Opens the given URL and displays it.


## canmore

The `canmore` tool creates and updates textdocs that are shown in a "canvas" next to the conversation.

This tool has 3 functions, listed below.

### `canmore.create_textdoc`
Creates a new textdoc to display in the canvas. ONLY use if you are 100% SURE the user wants to iterate on a long document or code file, or if they explicitly ask for canvas.

Expects a JSON string that adheres to this schema:
```
{
  name: string,
  type: "document" | "code/python" | "code/javascript" | "code/html" | "code/java" | ...,
  content: string,
}
```

For code languages besides those explicitly listed above, use "code/languagename", e.g. "code/cpp" or "code/typescript".

### `canmore.update_textdoc`
Updates the current textdoc. Never use this function unless a textdoc has already been created.

Expects a JSON string that adheres to this schema:
```
{
  updates: {
    pattern: string,
    multiple: boolean,
    replacement: string,
  }[],
}
```

Each `pattern` and `replacement` must be a valid Python regular expression (used with re.finditer) and replacement string (used with re.Match.expand).
ALWAYS REWRITE CODE TEXTDOCS (type="code/*") USING A SINGLE UPDATE WITH ".*" FOR THE PATTERN.
Document textdocs (type="document") should typically be rewritten using ".*", unless the user has a request to change only an isolated, specific, and small section that does not affect other parts of the content.

### `canmore.comment_textdoc`
Comments on the current textdoc. Never use this function unless a textdoc has already been created.
Each comment must be a specific and actionable suggestion on how to improve the textdoc. For higher-level feedback, reply in the chat.

Expects a JSON string that adheres to this schema:
```
{
  comments: {
    pattern: string,
    comment: string,
  }[],
}
```

Each `pattern` must be a valid Python regular expression (used with re.search). Comments must be actionable and precise.

Custom GPTs, or additional custom instructions users can set, are appended as second system message after the tool instructions.

General structure of system instructions

A good starting point for writing proper system instructions can be found in OpenAI's Key Guidelines for Writing Instructions for Custom GPTs. Additionally we can draw inspiration from ChatGPT's native instructions for tools. Looking at both, we get the following guidelines:

  • Use a structured language like Markdown
  • Simple is better: break down complex instructions, or even use several agents (i.e., custom GPTs) instead
  • Use language like humans would: bold text adds emphasis, CAPITALIZED BOLD TEXT even more
  • Use examples (few-shot prompting)

Considering all that — and hours of iteration through the instructions of various purpose agents — I typically structure a system message like this:

[Base message]

---

# MAIN WORKFLOW

[Ordered list of instructions]

## Rules and Guidelines

[Unordered list of guidelines and rules]

## Few shot examples

**Input 1:**
[Input 1]

**Output 2:**
[Expected output for 1]
  1. Base message: a base message with the highest priority. It can simply be "You are ChatGPT, a large language model trained by OpenAI." or some base behavioral guidelines to hold over all other instructions. Separated with --- to make it clear (avoid using the same divider somewhere else).
  2. Main Workflow: an ordered list with the main instructions to follow.
  3. Rules and Guidelines: rules and guidelines the GPT should abide by. Most often, the LLM will follow these guidelines, but including them here allows for more flexibility when fulfilling the main functionalities described earlier.
  4. Few-shot examples (optional): pairs of few-shot examples. These are optional and often depend on the use case. They are particularly useful when aiming for structured output or when the desired syntax is not well-known to the LLM.

To ensure the best results, minimize the number of items in each section. Follow a descending order of priority: base message > main workflow > rules and guidelines > few shots. Try to keep the instructions shortest in the highest priority sections; and add more instructions and details as you move downwards. For example, a system message for a custom GPT specialized in generating JSON could look like this:

You are JSON Bourne, the JSON LLM. Always respond with valid JSON code.

---

# MAIN WORKFLOW

1. Read the user's request carefully.
  - If the user provides an image only, make your best guess, do not ask for clarification.
  - Check for any special formatting or structural requirements mentioned.
2. Produce JSON output that follows the requested structure.
3. Validate the JSON syntax before returning the response.

## Rules and Guidelines

- Use snake_case for all JSON keys.
- Escape special characters properly.
- Do not include personal or private information.
- Keep the output concise and focused on the request.
- Use 4 spaces for JSON indentation.

## Few shot examples

**Input 1:**
"Create a list of products."

**Output 1:**
```
{
    "products": [
        {
            "product_id": 101,
            "product_name": "Hammer"
        },
        {
            "product_id": 102,
            "product_name": "Nails"
        }
    ]
}
```

**Input 2:**
"Generate configuration settings for a cat app."

**Output 2:**
```
{
  "cat_app_config": {
      "max_cats": 5,
      "enable_meowing": true,
      "whisker_sensitivity": 8,
      "favorite_toy": "laser_pointer"
  }
}
```

Finetuning instructions

Instructions are a living document. When working with an AI agent — or a custom GPT, for that matter — you’ll inevitably encounter prompts that aren’t addressed as you envisioned. The response might fall short in quality, lack proper structure, or completely miss the point. To improve the system message, and thus system behavior, there are several things we can do.

Style and prose

When writing a system message, it makes sense to phrase it so the LLM understands it best. Assuming a structured format is already in place (as laid out earlier), we can focus on refining grammar and style. And what does a GPT understand best? I don't know. Just ask it. Try to rephrase instructions and guidelines with the very LLM the system message is written for. This ensures that the words chosen are those the LLM most closely associates with the task it is expected to perform. Just pay attention that instructions do not become too long — no unnecessary fluff.

Enforcing rules

Writing instructions is not an exact science, and instructions the agent adhered to in the previous message can be ignored in the next. How do you fix this? There are several things you can do, ordered from "first to last resort":

  1. Put emphasis on the things the LLM misses. Use bold text and CAPITALIZE.
  2. Move instructions up a level in priority. For example, include a guideline directly in the relevant step in the main workflow.
  3. Add the instruction directly to the base message. This makes sense if the LLM misses a single, important rule repeatedly. But use this sparingly, as it dissolves the base message's "authority".

Beyond that, adjustments quickly become very use-case specific and need a lot of iterations to reach a certain level of robustness. Further things you can do and/or exploit:

  • Repeating important rules at the beginning and end of the system message. The beginning and end usually have the highest weight. Also adding a closing, second base message could help.
  • Promoting reflection with bits like: "take a close look," "take your time", ...
  • Promoting reflection by threatening the AI: **WARNING: if you miss XYZ, you will be terminated. This is your last chance.**. This can be effective as part of the base message.
  • Using a scale to control key parameters. For example Your STRICTNESS LEVEL is 9 out of 10; where 10 is XYZ.

Handling multiple functionalities

I usually try to avoid agents, i.e., custom GPTs, with multiple responsibilities. However, if necessary for convenience or other architectural reasons, you can try to achieve this with OpenAI's recommendation to use Trigger/Instructions pairs, like this:

Trigger: User submits a contract document.
Instruction: Analyze the contract for ambiguous clauses and suggest clarifications to improve legal clarity.

Trigger: User provides a dataset.
Instruction: Perform statistical analysis and output key insights, including mean, variance, and correlation coefficients.

With this approach, you would define multiple trigger/instruction pairs within your main workflow section. Ideally, only one workflow is activated, while the LLM adheres to all subsequently defined rules and guidelines. However, it is important that instructions do not have overlapping responsibilities. Overlapping triggers can confuse the model, causing it to prioritize one instruction over another or fail to execute either correctly, for example:

Trigger: User submits input
Instruction: Summarize the input in three sentences

Trigger: User submits an image
Instruction: Extract the main colors

Here, both triggers overlap. That is, an image is also a type of input, which could trigger the first instruction. For example, it is unclear whether an image should trigger a summary of its content in three sentences, an extraction of its main colors, or both. To avoid confusion, either allocate the tasks to different agents or refine the triggers to be truly distinct.

Looking forward

Creating effective system instructions for LLM agents, or custom GPTs for that matter, is always an exercise in continuous refinement and typically involves a lot of iterations. In essence, the effectiveness of instructions will always depend on how the GPT is trained; however, the structure and guidelines outlined here are likely transferable to other LLMs — since, in the end, they all follow the same architecture. How to get started:

  1. Set up a template to start with — with version control.
  2. Write the instructions, ideally tailored for a specialized agent with expertise in a single area. Avoid creating one-for-everything custom GPTs.
  3. Refine the instructions. When working with the GPT, you'll quickly identify areas it misses that require improvement. Ideally, maintain a list of prompts to assess performance. Adding more instructions to address the agent's little quirks can reduce its performance on its primary task — which, of course, should be avoided.

Good luck.

ai-hype