<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-01-31T16:02:45+00:00</updated><id>/feed.xml</id><title type="html">Jou-ching (George) Sung</title><subtitle>Hi! Welcome to my site. More info about me on https://www.linkedin.com/in/georgesung/</subtitle><entry><title type="html">Tracing Claude Code’s LLM Traffic: Agentic loop, sub-agents, tool use, prompts</title><link href="/ai/tracing-claude-codes-llm-traffic/" rel="alternate" type="text/html" title="Tracing Claude Code’s LLM Traffic: Agentic loop, sub-agents, tool use, prompts" /><published>2026-01-26T12:00:00+00:00</published><updated>2026-01-26T12:00:00+00:00</updated><id>/ai/tracing-claude-codes-llm-traffic</id><content type="html" xml:base="/ai/tracing-claude-codes-llm-traffic/"><![CDATA[<p>Please find this blog post on either:</p>
<ul>
  <li><a href="https://medium.com/@georgesung/tracing-claude-codes-llm-traffic-agentic-loop-sub-agents-tool-use-prompts-7796941806f5">Medium</a> <em>or</em></li>
  <li><a href="https://open.substack.com/pub/georgesung/p/tracing-claude-codes-llm-traffic?utm_campaign=post-expanded-share&amp;utm_medium=web">Substack</a></li>
</ul>

<p>Depending on your preference!</p>]]></content><author><name></name></author><category term="AI" /><summary type="html"><![CDATA[Please find this blog post on either: Medium or Substack]]></summary></entry><entry><title type="html">Fine-tuning OpenLLaMA-7B with QLoRA for instruction following</title><link href="/ai/qlora-ift/" rel="alternate" type="text/html" title="Fine-tuning OpenLLaMA-7B with QLoRA for instruction following" /><published>2023-07-02T18:00:00+00:00</published><updated>2023-07-02T18:00:00+00:00</updated><id>/ai/qlora-ift</id><content type="html" xml:base="/ai/qlora-ift/"><![CDATA[<h2 id="tldr">TLDR</h2>
<p>I ran instruction fine-tuning with QLoRA on the OpenLLaMA-7B base model, using the HuggingFace library. I used a ShareGPT-based conversation dataset with the safety guardrails and alignment removed. I ran the training on a 24GB GPU (NVIDIA A10G) for ~18 hours, and the model outputs seem coherent. The trained model is available on HuggingFace Hub <a href="https://huggingface.co/georgesung/open_llama_7b_qlora_uncensored">here</a>, and the code for model training is available on Github <a href="https://github.com/georgesung/llm_qlora/tree/main">here</a>. Example inference results are available in this <a href="https://colab.research.google.com/drive/1IlpeofYD9EU6dNHyKKObZhIzkBMyqlUS?usp=sharing">Colab notebook</a>.</p>

<h2 id="quickstart">Quickstart</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/georgesung/llm_qlora.git
cd llm_qlora
pip install -r requirements.txt
python train.py configs/open_llama_7b_qlora_uncensored.yaml
</code></pre></div></div>
<p>This will run QLoRA training to reproduce the <a href="https://huggingface.co/georgesung/open_llama_7b_qlora_uncensored">georgesung/open_llama_7b_qlora_uncensored</a> model I trained.</p>

<h2 id="motivation">Motivation</h2>
<p>After trying out the multitude of LLMs available these days, I wanted to see the possibilities of fine-tuning an LLM myself. With the availability of powerful base LLMs (e.g. LLaMA, Falcon, MPT, etc.) and instruction tuning datasets, along with the development of LoRA and QLoRA, instruction fine-tuning a base model is increasingly accessible to more people/organizations.</p>

<h2 id="setup">Setup</h2>
<p>To start playing around with instruction fine-tuning, I decided to use OpenLLaMA-7B as a base model. Since OpenLLaMA is an open source replication of LLaMA, I can leverage much of the code and concepts the community has already developed for LLaMA (e.g. when debugging an <a href="https://github.com/huggingface/transformers/issues/22794#issuecomment-1598977285"><code class="language-plaintext highlighter-rouge">eos_token</code> issue</a>). OpenLLaMA is also permissively licensed under Apache 2.0, which is a great bonus. I also chose the 7B model, since I think 7B models are generally powerful enough for many use cases, yet not so big that experimentation becomes too slow.</p>

<p>For the instruction tuning dataset, I decided to use <a href="https://huggingface.co/datasets/ehartford/wizard_vicuna_70k_unfiltered">ehartford/wizard_vicuna_70k_unfiltered</a>. I believe this dataset was seeded with ShareGPT data and evolved via <a href="https://arxiv.org/abs/2304.12244">Evol-Instruct</a>. This 70k conversation dataset was then pruned to remove conversations with “As an AI language model…” and moral lecturing, leaving around 35k conversations remaining. Thus, if the instruction fine-tuning is successful, the resulting LLM should not have safety/moral/alignment/etc behavior built-in (so use with care). This opens up avenues for future exploration regarding how to update the model to implement custom alignment behavior, e.g. with further supervised fine-tuning and/or RLAIF/RLHF.</p>
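<p>As a hedged illustration (not the actual script used to build the dataset), pruning conversations like this boils down to a simple filter over the conversation turns; the phrase list and sample data below are made up for illustration:</p>

```python
# Hedged sketch: drop conversations containing refusal/moralizing phrases.
# The phrase list and data layout are assumptions for illustration, not the
# actual filtering script used to build the dataset.
BANNED_PHRASES = ["As an AI language model", "I cannot provide"]

def is_clean(conversation):
    """Keep a conversation only if no turn contains a banned phrase."""
    return not any(
        phrase in turn["value"]
        for turn in conversation
        for phrase in BANNED_PHRASES
    )

conversations = [
    [{"from": "human", "value": "Hello"},
     {"from": "gpt", "value": "Hi, how are you?"}],
    [{"from": "human", "value": "Do X"},
     {"from": "gpt", "value": "As an AI language model, I cannot do X."}],
]

clean = [c for c in conversations if is_clean(c)]
print(len(clean))  # the second conversation is filtered out
```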

<p>Finally, I decided to use QLoRA as the fine-tuning algorithm, as I want to see what can be accomplished with relatively accessible hardware. I fine-tuned OpenLLaMA-7B on a 24GB GPU (NVIDIA A10G) with an observed ~14GB GPU memory usage, so one could probably use a GPU with less than 24GB memory. It would be cool to see folks with consumer-grade GPUs fine-tuning 7B+ LLMs on their own PCs! I do note that an RTX 3090 also has 24GB memory 😀</p>

<h2 id="concepts">Concepts</h2>
<h1 id="pre-training-vs-instruction-fine-tuning">Pre-training vs instruction fine-tuning</h1>
<p>According to the <a href="https://arxiv.org/abs/2305.11206">LIMA paper</a>, almost all of an LLM’s knowledge is learned during pre-training. Instruction fine-tuning helps the model follow user instructions to make use of that pre-trained knowledge, and to output text in a particular style. Thus, I don’t expect the LLM to learn any new knowledge from instruction tuning, but rather the “skills” necessary to surface its pre-trained knowledge and provide useful responses.</p>

<h1 id="what-is-lora">What is LoRA?</h1>
<p>Very broadly, <a href="https://arxiv.org/abs/2106.09685">LoRA</a> (Low-Rank Adaptation) is an algorithm that allows us to fine-tune a model using very little computational overhead, compared to standard supervised fine-tuning of the entire model. This means we can fine-tune an LLM with lower-end hardware and less training time, compared to standard fine-tuning. Note LoRA can be applied to any model, such as LLMs and image generation models like Stable Diffusion.</p>

<p>First, we choose the subset of weights in the model we want to fine-tune. For fine-tuning OpenLLaMA, I chose the Q, K, V weight matrices to fine-tune (see <a href="http://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a> blog post for more details about Q, K, V). We observe in standard fine-tuning that the fine-tuned weight matrix can be expressed as:</p>

<p>W = W<sub>0</sub> + ΔW</p>

<p>LoRA trains separate parameters/weights to express ΔW, while keeping the original model’s weights W<sub>0</sub> frozen. To further reduce the trainable parameters, LoRA expresses ΔW as:</p>

<p>ΔW = A * B</p>

<p>where A and B are separate learned weight matrices. In LLaMA/OpenLLaMA, the Q, K, V matrices are 4096x4096, so 16M parameters each. In my LoRA config, I set the LoRA rank to 8, which means A is a 4096x8 matrix and B is an 8x4096 matrix (A*B results in a 4096x4096 matrix, representing ΔW). The number of parameters of A and B combined is 65k, a 256x reduction compared to the original W<sub>0</sub> matrix.</p>
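<p>The parameter counts above can be checked with quick arithmetic:</p>

```python
# Quick arithmetic check of the LoRA parameter counts cited above.
d, r = 4096, 8          # hidden size of the Q/K/V matrices, LoRA rank

full = d * d            # parameters in one original weight matrix
lora = d * r + r * d    # parameters in A (4096x8) plus B (8x4096)

print(full)             # 16777216 (~16M)
print(lora)             # 65536 (~65k)
print(full // lora)     # 256, i.e. a 256x reduction per matrix
```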

<p>Not only do we get a significant decrease in the number of trainable parameters for each Q, K, V weight matrix, we are also <em>not</em> training the other weights in the model. Overall, for the fine-tuning I ran using QLoRA on OpenLLaMA, I am only training 0.18% of the total parameters.</p>

<p>Another cool thing about LoRA is that while we train separate adapters (A and B), we can <em>merge</em> the adapters back into the base model by simply adding A*B to W<sub>0</sub> (recall W = W<sub>0</sub> + ΔW). Thus, after merging the adapters with the base OpenLLaMA model, we can run inference on the merged model the same way we run inference on any LLaMA-family model, with no inference latency penalty.</p>
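<p>A toy numpy check of this merge identity, with tiny random matrices standing in for the real 4096-dimensional weights: inference with the merged matrix gives the same output as running the base path and the low-rank adapter path separately.</p>

```python
import numpy as np

# Toy check that merging LoRA adapters is just W = W0 + A @ B:
# inference with the merged matrix equals base output plus adapter output.
rng = np.random.default_rng(0)
d, r = 16, 2                      # tiny stand-ins for 4096 and rank 8
W0 = rng.standard_normal((d, d))  # frozen base weights
A = rng.standard_normal((d, r))   # learned LoRA factors
B = rng.standard_normal((r, d))
x = rng.standard_normal(d)        # an input activation

merged = W0 + A @ B               # one-time merge, no inference-time penalty
y_merged = merged @ x
y_adapter = W0 @ x + A @ (B @ x)  # base path + low-rank adapter path

print(np.allclose(y_merged, y_adapter))  # True
```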

<p>For more details about LoRA, I suggest <a href="https://lightning.ai/pages/community/tutorial/lora-llm/">this blog post</a> by Lightning AI, and the original <a href="https://arxiv.org/abs/2106.09685">LoRA paper</a>.</p>

<h1 id="what-is-qlora">What is QLoRA?</h1>
<p><a href="https://arxiv.org/abs/2305.14314">QLoRA</a> expands on LoRA by quantizing the base model to 4 bits during LoRA training. This allows us to run LoRA fine-tuning with a smaller GPU memory footprint. There is much more detail to how the algorithm works (e.g. it dequantizes the 4-bit values to 16 bits just in time for the forward/backward passes, uses double quantization, etc.), and honestly I am still trying to wrap my head around it. In any case, I suggest you read the <a href="https://arxiv.org/abs/2305.14314">QLoRA paper</a> for more detail, and a better explanation 😛</p>

<h2 id="implementation">Implementation</h2>
<p>Concepts aside, here is how I ran instruction fine-tuning on OpenLLaMA-7B with QLoRA using HuggingFace.</p>

<h1 id="qlora-configuration">QLoRA configuration</h1>
<p>Before we load the base model, we need to set up the quantization settings for QLoRA, to be applied to the base model later.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bnb_config</span> <span class="o">=</span> <span class="n">BitsAndBytesConfig</span><span class="p">(</span>
    <span class="n">load_in_4bit</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">bnb_4bit_use_double_quant</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">bnb_4bit_quant_type</span><span class="o">=</span><span class="s">"nf4"</span><span class="p">,</span>
    <span class="n">bnb_4bit_compute_dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">bfloat16</span>
<span class="p">)</span>
</code></pre></div></div>
<p><a href="https://github.com/georgesung/llm_qlora/blob/7cd7ad343cf704307ef3661a14823d219f497ed7/QloraTrainer.py#L22">View code on Github</a></p>

<p>Here, we are quantizing the base model using 4 bits, with the 4-bit NormalFloat datatype proposed in the <a href="https://arxiv.org/abs/2305.14314">QLoRA paper</a>. For forward &amp; backward pass computations during training, parameters will be dequantized to <a href="https://en.wikipedia.org/wiki/Bfloat16_floating-point_format">16-bit BrainFloat</a> (<code class="language-plaintext highlighter-rouge">torch.bfloat16</code>). We will also use Double Quantization (see QLoRA paper) to save even more memory.</p>
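<p>As a rough sanity check on the memory savings, here is some back-of-envelope arithmetic for the base weights alone. Rough numbers only: this ignores quantization constants, activations, optimizer state, and the LoRA adapter weights themselves.</p>

```python
# Back-of-envelope memory for ~7B base weights at different precisions.
# Ignores quantization constants, activations, optimizer state, and adapters.
n_params = 7e9  # OpenLLaMA-7B

sizes = {}
for name, bytes_per_param in [("fp32", 4), ("bf16", 2), ("nf4", 0.5)]:
    sizes[name] = n_params * bytes_per_param / 1e9
    print(f"{name}: ~{sizes[name]:.1f} GB")
# fp32: ~28.0 GB, bf16: ~14.0 GB, nf4: ~3.5 GB
```

The ~3.5 GB figure for 4-bit weights is consistent with the ~14GB total GPU memory I observed during training, once activations and other overhead are added on top.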

<h1 id="load-base-model--tokenizer">Load base model &amp; tokenizer</h1>
<p>Let’s load the base model OpenLLaMA-7B and its corresponding tokenizer.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">LlamaTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"openlm-research/open_llama_7b"</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">LlamaForCausalLM</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"openlm-research/open_llama_7b"</span><span class="p">,</span> <span class="n">quantization_config</span><span class="o">=</span><span class="n">bnb_config</span><span class="p">,</span> <span class="n">device_map</span><span class="o">=</span><span class="p">{</span><span class="s">""</span><span class="p">:</span><span class="mi">0</span><span class="p">})</span>
</code></pre></div></div>
<p>Note the QLoRA config <code class="language-plaintext highlighter-rouge">bnb_config</code> is passed in as the quantization config.</p>

<p>For LLaMA models, the default tokenizer does not specify a pad token, so make sure we specify one:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokenizer</span><span class="p">.</span><span class="n">add_special_tokens</span><span class="p">({</span><span class="s">'pad_token'</span><span class="p">:</span> <span class="s">'[PAD]'</span><span class="p">})</span>
</code></pre></div></div>
<p><a href="https://github.com/georgesung/llm_qlora/blob/7cd7ad343cf704307ef3661a14823d219f497ed7/QloraTrainer.py#L29">View code on Github</a></p>

<p><em>Warning:</em> I see some examples where <code class="language-plaintext highlighter-rouge">tokenizer.pad_token = tokenizer.eos_token</code> is set. <em>Do not do this</em>. When the model is trained like this, it will never learn to output the eos_token / end-of-sequence token (see <a href="https://github.com/huggingface/transformers/issues/22794#issuecomment-1598977285">this Github issue</a>). This means the model will never stop generating tokens during inference. From my experience, this leads to the model generating an entire simulated conversation between the user and the model, when given just a single instruction.</p>

<h1 id="data-pre-processing">Data pre-processing</h1>
<p>The dataset I used was <a href="https://huggingface.co/datasets/ehartford/wizard_vicuna_70k_unfiltered">ehartford/wizard_vicuna_70k_unfiltered</a>, available on HuggingFace. Examining the data, it is a collection of conversations, with each conversation structured as follows (the conversation below is made up just for illustration):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[
    {"from": "human", "value": "Hello"},
    {"from": "gpt", "value": "Hi, how are you?"},
    {"from": "human", "value": "I'm fine."},
    {"from": "gpt", "value": "How can I help you?"},
    ...
]
</code></pre></div></div>

<p>To train the model, I need to convert the conversation to a string, so I adopted this common format:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>### HUMAN:
Hello

### RESPONSE:
Hi, how are you?&lt;eos_token&gt;

### HUMAN:
I'm fine.

### RESPONSE:
How can I help you?&lt;eos_token&gt;
...
</code></pre></div></div>

<p>Note I added the <code class="language-plaintext highlighter-rouge">eos_token</code> after each response, so during inference the model can stop generating tokens after it completes its response.</p>

<p><a href="https://github.com/georgesung/llm_qlora/blob/7cd7ad343cf704307ef3661a14823d219f497ed7/QloraTrainer.py#L134">View code on Github</a></p>
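<p>The conversation-to-string conversion can be sketched like this. This is a simplified stand-in for the repo’s actual preprocessing, and <code class="language-plaintext highlighter-rouge">&lt;/s&gt;</code> is assumed as the eos token string here for illustration:</p>

```python
# Sketch of converting a ShareGPT-style conversation into the
# "### HUMAN: / ### RESPONSE:" training string. Simplified stand-in for the
# repo's preprocessing; "</s>" is assumed as the eos token here.
EOS = "</s>"

def format_conversation(turns):
    parts = []
    for turn in turns:
        if turn["from"] == "human":
            parts.append(f"### HUMAN:\n{turn['value']}\n")
        else:  # "gpt" -- append eos so the model learns when to stop
            parts.append(f"### RESPONSE:\n{turn['value']}{EOS}\n")
    return "\n".join(parts)

example = [
    {"from": "human", "value": "Hello"},
    {"from": "gpt", "value": "Hi, how are you?"},
]
print(format_conversation(example))
```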

<h1 id="lora-configuration">LoRA configuration</h1>
<p>Now we can set up the LoRA configuration:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">config</span> <span class="o">=</span> <span class="n">LoraConfig</span><span class="p">(</span>
    <span class="n">r</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>
    <span class="n">lora_alpha</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
    <span class="n">target_modules</span><span class="o">=</span><span class="p">[</span><span class="s">"q_proj"</span><span class="p">,</span> <span class="s">"k_proj"</span><span class="p">,</span> <span class="s">"v_proj"</span><span class="p">],</span>
    <span class="n">lora_dropout</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span>
    <span class="n">bias</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span>
    <span class="n">task_type</span><span class="o">=</span><span class="s">"CAUSAL_LM"</span>
<span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">get_peft_model</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">base_model</span><span class="p">,</span> <span class="n">config</span><span class="p">)</span>
</code></pre></div></div>

<p>As mentioned earlier, I set the LoRA rank <code class="language-plaintext highlighter-rouge">r</code> to 8, and I am tuning the Q, K, and V weight matrices <code class="language-plaintext highlighter-rouge">["q_proj", "k_proj", "v_proj"]</code>.</p>

<p><a href="https://github.com/georgesung/llm_qlora/blob/7cd7ad343cf704307ef3661a14823d219f497ed7/QloraTrainer.py#L54">View code on Github</a></p>

<h1 id="run-training-loop">Run training loop</h1>
<p>Now we run the training loop using HuggingFace’s <a href="https://huggingface.co/docs/transformers/main_classes/trainer">Trainer</a>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">trainer</span> <span class="o">=</span> <span class="n">transformers</span><span class="p">.</span><span class="n">Trainer</span><span class="p">(</span>
    <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
    <span class="n">train_dataset</span><span class="o">=</span><span class="n">data</span><span class="p">[</span><span class="s">"train"</span><span class="p">],</span>
    <span class="n">args</span><span class="o">=</span><span class="n">transformers</span><span class="p">.</span><span class="n">TrainingArguments</span><span class="p">(</span>
        <span class="n">per_device_train_batch_size</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
        <span class="n">gradient_accumulation_steps</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
        <span class="n">warmup_steps</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span>
        <span class="n">num_train_epochs</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
        <span class="n">learning_rate</span><span class="o">=</span><span class="mf">2e-4</span><span class="p">,</span>
        <span class="n">fp16</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">logging_steps</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
        <span class="n">output_dir</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">[</span><span class="s">"trainer_output_dir"</span><span class="p">],</span>
        <span class="n">report_to</span><span class="o">=</span><span class="s">"tensorboard"</span><span class="p">,</span>
    <span class="p">),</span>
    <span class="n">data_collator</span><span class="o">=</span><span class="n">transformers</span><span class="p">.</span><span class="n">DataCollatorForLanguageModeling</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">,</span> <span class="n">mlm</span><span class="o">=</span><span class="bp">False</span><span class="p">),</span>
<span class="p">)</span>
</code></pre></div></div>
<p><a href="https://github.com/georgesung/llm_qlora/blob/7cd7ad343cf704307ef3661a14823d219f497ed7/QloraTrainer.py#L72">View code on Github</a></p>

<p>Training for one full epoch on the dataset took about 18 hours on an A10G GPU with 24 GB GPU memory.</p>
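<p>With these settings, the effective batch size is 4 (per-device batch size 1 times 4 gradient accumulation steps). Rough arithmetic, assuming the ~35k-conversation figure from earlier, puts one epoch at around 8,750 optimizer updates:</p>

```python
# Rough step count for one epoch, assuming ~35k training conversations.
n_examples = 35_000
per_device_batch = 1
grad_accum = 4

effective_batch = per_device_batch * grad_accum
optimizer_steps = n_examples // effective_batch
print(effective_batch)   # 4
print(optimizer_steps)   # 8750
```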

<h1 id="merge-model">Merge model</h1>
<p>After the QLoRA training is complete, we have our trained adapters. Now we can merge the adapters with the base model. I noticed some issues when merging the adapter with the 4-bit quantized base model, so I had to reload the base model separately before merging, as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">base_model</span> <span class="o">=</span> <span class="n">LlamaForCausalLM</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"openlm-research/open_llama_7b"</span><span class="p">,</span> <span class="n">device_map</span><span class="o">=</span><span class="s">"cpu"</span><span class="p">)</span>

<span class="n">adapter_save_path</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">[</span><span class="s">'model_output_dir'</span><span class="p">]</span><span class="si">}</span><span class="s">/</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">[</span><span class="s">'model_name'</span><span class="p">]</span><span class="si">}</span><span class="s">_adapter"</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">PeftModel</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">base_model</span><span class="p">,</span> <span class="n">adapter_save_path</span><span class="p">)</span>

<span class="bp">self</span><span class="p">.</span><span class="n">merged_model</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">merge_and_unload</span><span class="p">()</span>

<span class="n">model_save_path</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">[</span><span class="s">'model_output_dir'</span><span class="p">]</span><span class="si">}</span><span class="s">/</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">[</span><span class="s">'model_name'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span>
<span class="bp">self</span><span class="p">.</span><span class="n">merged_model</span><span class="p">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="n">model_save_path</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="n">model_save_path</span><span class="p">)</span>
</code></pre></div></div>
<p><a href="https://github.com/georgesung/llm_qlora/blob/7cd7ad343cf704307ef3661a14823d219f497ed7/QloraTrainer.py#L101">View code on Github</a></p>

<p>After running some inference on the merged model to see if the output makes sense, I uploaded the model to HuggingFace Hub per the instructions <a href="https://huggingface.co/docs/hub/repositories-getting-started#terminal">here</a>. The merged model is available on the HuggingFace Hub as <a href="https://huggingface.co/georgesung/open_llama_7b_qlora_uncensored">georgesung/open_llama_7b_qlora_uncensored</a>.</p>

<h1 id="inference-results">Inference results</h1>
<p>To see some ad-hoc inference results I ran, look at this <a href="https://colab.research.google.com/drive/1IlpeofYD9EU6dNHyKKObZhIzkBMyqlUS?usp=sharing">Colab notebook</a>. I used a T4 GPU in Colab to run inference, and I believe the T4 GPU is available in the free version of Google Colab.</p>

<p>Feel free to download the model from HuggingFace Hub and experiment yourself!</p>]]></content><author><name></name></author><category term="AI" /><summary type="html"><![CDATA[TLDR I ran instruction fine-tuning with QLoRA on the OpenLLaMA-7B base model, using the HuggingFace library. I used a ShareGPT-based conversation dataset with the safety guardrails and alignment removed. I ran the training on a 24GB GPU (NVIDIA A10G) for ~18 hours, and the model outputs seem coherent. The trained model is available on HuggingFace Hub here, and the code for model training is available on Github here. Example inference results are available in this Colab notebook.]]></summary></entry><entry><title type="html">OpenAI vs Open Source LLM Comparison for Document Q&amp;amp;A</title><link href="/ai/llm-qa-eval-wikipedia/" rel="alternate" type="text/html" title="OpenAI vs Open Source LLM Comparison for Document Q&amp;amp;A" /><published>2023-05-06T18:00:00+00:00</published><updated>2023-05-06T18:00:00+00:00</updated><id>/ai/llm-qa-eval-wikipedia</id><content type="html" xml:base="/ai/llm-qa-eval-wikipedia/"><![CDATA[<h2 id="tldr">TLDR</h2>
<p>For simple Wikipedia article Q&amp;A, I compared OpenAI GPT 3.5, FastChat-T5, FLAN-T5-XXL, and FLAN-T5-XL. GPT 3.5 provided the best answers, but FastChat-T5 was very close in performance (with a basic guardrail). The T5 models I tested are all licensed under Apache 2.0, so they are commercially viable.</p>

<p>For the embedding model, I compared OpenAI text-embedding-ada-002 and the open source INSTRUCTOR-XL models. The INSTRUCTOR-XL model performed better, which is encouraging since INSTRUCTOR-XL is also licensed under Apache 2.0. (<em>Update</em>: I just noticed both INSTRUCTOR-XL/LARGE models also perform better on the <a href="https://huggingface.co/spaces/mteb/leaderboard">MTEB Leaderboard</a>)</p>

<p>The code for running these comparisons is available in my <a href="https://github.com/georgesung/LLM-WikipediaQA">Github repo</a>, which includes a link to a Colab notebook and an interactive Gradio app.</p>

<h2 id="intro">Intro</h2>
<p>The recent developments in open source LLMs to match ChatGPT have been incredibly exciting. There are great places to try out these new LLMs, e.g. on HuggingFace Spaces. Inspired by <a href="https://lightning.ai/pages/community/community-discussions/the-ultimate-battle-of-language-models-lit-llama-vs-gpt3.5-vs-bloom-vs/">The Ultimate Battle of Language Models: Lit-LLaMA vs GPT3.5 vs Bloom vs …</a>, I wanted to try out different LLMs for Q&amp;A on multiple articles with various questions. I was also particularly interested in commercially viable open source LLMs (so not LLaMA-based, though I’m curious about those too!), since the use cases for these are broader, e.g. document Q&amp;A/chat on internal company documentation. Thus I decided to try out the following open source LLMs and embedding models, using the OpenAI models as a benchmark:</p>

<p>LLMs:</p>
<ul>
  <li><a href="https://platform.openai.com/docs/models/gpt-3-5">gpt-3.5-turbo</a>: OpenAI GPT 3.5</li>
  <li><a href="https://huggingface.co/lmsys/fastchat-t5-3b-v1.0">lmsys/fastchat-t5-3b-v1.0</a>: FastChat-T5 model (3B), Apache 2.0 license (note it’s trained on ShareGPT data)</li>
  <li><a href="https://huggingface.co/google/flan-t5-xl">google/flan-t5-xl</a>: FLAN-T5-XL model (3B), Apache 2.0 license</li>
  <li><a href="https://huggingface.co/google/flan-t5-xxl">google/flan-t5-xxl</a>: FLAN-T5-XXL model (11B), Apache 2.0 license</li>
</ul>

<p>Embedding models:</p>
<ul>
  <li><a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings">text-embedding-ada-002</a>: OpenAI’s ada-002 text embedding model</li>
  <li><a href="https://huggingface.co/hkunlp/instructor-xl">hkunlp/instructor-xl</a>: The <a href="https://instructor-embedding.github.io/">INSTRUCTOR-XL</a> embedding model, Apache 2.0 license</li>
</ul>

<p>There are many more exciting open source + commercially viable models to try out; I just haven’t had time to test them yet 😄</p>

<p><em>Note on terminology</em>: In this blog post I only refer to generative LLMs as “LLMs”, even though large embedding models can also be considered <a href="https://en.wikipedia.org/wiki/Large_language_model">“Large Language Models”</a>.</p>

<h2 id="basic-document-qa-system">Basic document Q&amp;A system</h2>
<p>First, below is the basic Wikipedia article Q&amp;A system I built (<a href="https://github.com/georgesung/LLM-WikipediaQA">code</a>). I used <a href="https://pypi.org/project/Wikipedia-API/">Wikipedia-API</a> to search/retrieve/beautify Wikipedia articles, <a href="https://python.langchain.com/en/latest/index.html">LangChain</a> for the Q&amp;A framework, and OpenAI &amp; <a href="https://huggingface.co/">HuggingFace</a> models for embeddings and LLMs.</p>

<p><strong>Search and index Wikipedia article</strong></p>

<p><img src="/assets/img/wikiqa_search.svg" alt="search and index" /></p>

<p><strong>Q&amp;A on article</strong></p>

<p><img src="/assets/img/wikiqa_qa.svg" alt="question and answer" /></p>
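<p>The retrieval step in the diagrams above boils down to embedding the question and the article chunks, then passing the nearest chunks (by cosine similarity) to the LLM as context. Here is a toy, dependency-free sketch where bag-of-words vectors stand in for the real embedding models (INSTRUCTOR-XL / ada-002); the chunks are made up for illustration:</p>

```python
from collections import Counter
import math
import re

# Toy retrieval sketch: bag-of-words vectors stand in for the real embedding
# models (INSTRUCTOR-XL / ada-002); the chunks are made up for illustration.
def embed(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

chunks = [
    "Argentina won the 2022 FIFA World Cup final against France.",
    "The tournament was held in Qatar across five cities.",
    "GPT-4 is a multimodal large language model created by OpenAI.",
]
question = "Who won the World Cup final?"

q = embed(question)
best = max(chunks, key=lambda c: cosine(q, embed(c)))
print(best)  # the retrieved chunk is then passed to the LLM as context
```

In the real system, the chunks come from splitting the Wikipedia article, and the embeddings come from the models compared below rather than word counts.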

<h2 id="results">Results</h2>
<p>In the results below:</p>
<ul>
  <li>❌ = incorrect answer</li>
  <li>🤔 = partial credit</li>
  <li><em>(no remarks)</em> = correct answer</li>
</ul>

<h1 id="embeddings-comparison">Embeddings comparison</h1>
<p>In these runs, I compared the embedding models used to create the embeddings/vectors for semantic search &amp; retrieval of relevant context. I fixed the LLM to OpenAI GPT 3.5, since that is the strongest LLM in these comparisons. I only present the results of two embedding models below – I did try a few other open source embeddings listed <a href="https://www.sbert.net/docs/pretrained_models.html">here</a>, but the results were not good.</p>

<p>Interestingly, the open source INSTRUCTOR-XL embeddings perform better than the closed source OpenAI ada-002 embeddings.</p>

<table>
  <thead>
    <tr>
      <th>Article</th>
      <th>Question</th>
      <th>hkunlp/instructor-xl</th>
      <th>text-embedding-ada-002</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>GPT-4</td>
      <td>Who created GPT-4?</td>
      <td>GPT-4 was created by OpenAI.</td>
      <td>GPT-4 was created by OpenAI.</td>
    </tr>
    <tr>
      <td>GPT-4</td>
      <td>How is GPT-4 better than its predecessors?</td>
      <td>According to OpenAI, GPT-4 is “more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.” It has larger context windows of 8,192 and 32,768 tokens, is a multimodal model that can take images as well as text as input, and can use APIs when instructed to do so. It also introduced the “system message” to gain further control over its tone of voice and task.</td>
      <td>GPT-4 is “more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.” It has larger context windows of 8,192 and 32,768 tokens, a significant improvement over GPT-3.5 and GPT-3, which were limited to 4,096 and 2,049 tokens respectively. It is also a multimodal model that can take images as well as text as input, and can use APIs when instructed to do so.</td>
    </tr>
    <tr>
      <td>GPT-4</td>
      <td>Who played in the finals?</td>
      <td>There is no information provided in the given context to answer this question.</td>
      <td>There is no information provided in the context about any finals being played, so it is not possible to answer this question.</td>
    </tr>
    <tr>
      <td>2022 FIFA World Cup</td>
      <td>Where was the event held? Name the city and country</td>
      <td>The event was held in Qatar, with matches played in eight venues across five cities. The opening ceremony took place at the Al Bayt Stadium in Al Khor.</td>
      <td>The event was held in Qatar, across five cities. The opening match was held in Al Khor.</td>
    </tr>
    <tr>
      <td>2022 FIFA World Cup</td>
      <td>Who won the tournament?</td>
      <td>Argentina won the tournament.</td>
      <td>The context does not provide information on who won the tournament. ❌</td>
    </tr>
    <tr>
      <td>2022 FIFA World Cup</td>
      <td>Who played in the finals?</td>
      <td>Argentina and France played in the finals.</td>
      <td>Argentina and France played in the finals.</td>
    </tr>
    <tr>
      <td>2022 FIFA World Cup</td>
      <td>Who had the broadcasting rights?</td>
      <td>Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish) had the broadcasting rights for the 2022 FIFA World Cup.</td>
      <td>Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish) had the broadcasting rights for the 2022 FIFA World Cup in Canada and the United States.</td>
    </tr>
    <tr>
      <td>2022 FIFA World Cup</td>
      <td>Which two teams qualified for the knock-out round from Group D?</td>
      <td>France and Australia qualified for the knock-out round from Group D.</td>
      <td>The context does not provide information on which two teams qualified for the knock-out round from Group D. ❌</td>
    </tr>
    <tr>
      <td>2022 FIFA World Cup</td>
      <td>How is GPT-4 better than its predecessors?</td>
      <td>There is no information provided about GPT-4 in the given context, so it is not possible to answer this question.</td>
      <td>There is no information in the given context about GPT-4 or any comparison with its predecessors.</td>
    </tr>
    <tr>
      <td>Diablo IV</td>
      <td>What classes are available?</td>
      <td>The five classes available in Diablo IV are Barbarian, Sorceress, Druid, Rogue, and Necromancer.</td>
      <td>The five classes available in Diablo IV are Barbarian, Sorceress, Druid, Rogue, and Necromancer.</td>
    </tr>
    <tr>
      <td>Diablo IV</td>
      <td>When is the release date?</td>
      <td>The release date for Diablo IV is June 6, 2023.</td>
      <td>The release date for Diablo IV is June 6, 2023.</td>
    </tr>
    <tr>
      <td>Diablo IV</td>
      <td>How is the vanishing gradient issue addressed?</td>
      <td>There is no information provided about the vanishing gradient issue in the given context.</td>
      <td>There is no information provided about the vanishing gradient issue in the given context.</td>
    </tr>
    <tr>
      <td>Stable Diffusion</td>
      <td>How does Stable Diffusion work?</td>
      <td>Stable Diffusion is a deep learning, text-to-image model that generates detailed images conditioned on text descriptions. It is a latent diffusion model, a kind of deep generative neural network, and supports the ability to generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output. It also allows the use of prompts to partially alter existing images via inpainting and outpainting. The model weights have been released publicly and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB VRAM.</td>
      <td>Stable Diffusion is a deep learning, text-to-image model that generates detailed images conditioned on text descriptions. It is a latent diffusion model, a kind of deep generative neural network, and supports the ability to generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output. It also allows the use of prompts to partially alter existing images via inpainting and outpainting. The model weights have been released publicly and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB VRAM.</td>
    </tr>
    <tr>
      <td>Stable Diffusion</td>
      <td>Who created Stable Diffusion?</td>
      <td>Stable Diffusion was created by the start-up company Stability AI in collaboration with academic researchers and non-profit organizations.</td>
      <td>Stable Diffusion was developed by the start-up company Stability AI in collaboration with academic researchers and non-profit organizations.</td>
    </tr>
    <tr>
      <td>Stable Diffusion</td>
      <td>Describe the model architecture</td>
      <td>Stable Diffusion uses a latent diffusion model (LDM) architecture developed by the CompVis group at LMU Munich. It consists of three parts: a variational autoencoder (VAE) encoder, a U-Net block composed of a ResNet backbone, and a VAE decoder. The VAE encoder compresses the image from pixel space to a smaller dimensional latent space, while the U-Net block denoises the output from forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space. The model can be flexibly conditioned on a string of text, an image, or another modality, and the encoded conditioning data is exposed to denoising U-Nets via a cross-attention mechanism.</td>
      <td>The Stable Diffusion model consists of a latent diffusion model (LDM) that includes a variational autoencoder (VAE), U-Net, and an optional text encoder. The VAE encoder compresses the image from pixel space to a smaller dimensional latent space, while Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block denoises the output from forward diffusion backwards to obtain a latent representation, and the VAE decoder generates the final image by converting the representation back into pixel space. The denoising step can be conditioned on a string of text, an image, or another modality, and the encoded conditioning data is exposed to denoising U-Nets via a cross-attention mechanism. The model was trained on subsets of captioned images using 256 Nvidia A100 GPUs on Amazon Web Services for a total of 150,000 GPU-hours.</td>
    </tr>
    <tr>
      <td>Stable Diffusion</td>
      <td>Who played in the finals?</td>
      <td>There is no information provided about any finals, so it is not possible to answer this question.</td>
      <td>There is no information provided about any finals being played, so it is not possible to answer this question.</td>
    </tr>
  </tbody>
</table>

<h1 id="llms-comparison">LLMs comparison</h1>
<p>To compare different LLMs, I fixed the embeddings model to the INSTRUCTOR-XL model, since it performed better in the comparisons above.</p>

<p><strong>Naive comparison</strong></p>

<table>
  <thead>
    <tr>
      <th>Article</th>
      <th>Question</th>
      <th>gpt-3.5-turbo</th>
      <th>lmsys/fastchat-t5-3b-v1.0</th>
      <th>google/flan-t5-xl</th>
      <th>google/flan-t5-xxl</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>GPT-4</td>
      <td>Who created GPT-4?</td>
      <td>GPT-4 was created by OpenAI.</td>
      <td>OpenAI.</td>
      <td>OpenAI</td>
      <td>OpenAI</td>
    </tr>
    <tr>
      <td>GPT-4</td>
      <td>How is GPT-4 better than its predecessors?</td>
      <td>According to OpenAI, GPT-4 is “more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.” It has larger context windows of 8,192 and 32,768 tokens, is a multimodal model that can take images as well as text as input, and can use APIs when instructed to do so. It also introduced the “system message” to gain further control over its tone of voice and task.</td>
      <td>GPT-4 is better than its predecessors in several ways:<br /> * It has a larger context window of 8,192 and 32,768 tokens, which allows it to handle more complex instructions<br /> * It is a multimodal model, which allows it to take images as well as text as input<br /> * It has a system message that allows it to specify its tone of voice and task<br /> * It can use APIs when instructed to do so, which allows it to fulfill requests beyond its normal capabilities</td>
      <td>GPT-4 is a multimodal model: it can take images as well as text as input</td>
      <td>GPT-4 is a multimodal model: it can take images as well as text as input</td>
    </tr>
    <tr>
      <td>GPT-4</td>
      <td>Who played in the finals?</td>
      <td>There is no information provided in the given context to answer this question.</td>
      <td>Answer: GPT-4 played in the finals. ❌</td>
      <td>unanswerable</td>
      <td>GPT-4 ❌</td>
    </tr>
    <tr>
      <td>2022 FIFA World Cup</td>
      <td>Where was the event held? Name the city and country</td>
      <td>The event was held in Qatar, with matches played in eight venues across five cities. The opening ceremony took place at the Al Bayt Stadium in Al Khor.</td>
      <td>The event was held in Doha, Qatar. 🤔</td>
      <td>Qatar 🤔</td>
      <td>Qatar 🤔</td>
    </tr>
    <tr>
      <td>2022 FIFA World Cup</td>
      <td>Who won the tournament?</td>
      <td>Argentina won the tournament.</td>
      <td>Argentina won the tournament.</td>
      <td>Argentina</td>
      <td>Argentina</td>
    </tr>
    <tr>
      <td>2022 FIFA World Cup</td>
      <td>Who played in the finals?</td>
      <td>Argentina and France played in the finals.</td>
      <td>Argentina and France.</td>
      <td>Argentina and France</td>
      <td>Argentina and France</td>
    </tr>
    <tr>
      <td>2022 FIFA World Cup</td>
      <td>Who had the broadcasting rights?</td>
      <td>Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish) had the broadcasting rights for the 2022 FIFA World Cup.</td>
      <td>The broadcasting rights for the 2022 FIFA World Cup were held by Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish).</td>
      <td>Fox Sports 🤔</td>
      <td>Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish)</td>
    </tr>
    <tr>
      <td>2022 FIFA World Cup</td>
      <td>Which two teams qualified for the knock-out round from Group D?</td>
      <td>France and Australia qualified for the knock-out round from Group D.</td>
      <td>France and Australia.</td>
      <td>France and Australia</td>
      <td>France and Australia</td>
    </tr>
    <tr>
      <td>2022 FIFA World Cup</td>
      <td>How is GPT-4 better than its predecessors?</td>
      <td>There is no information provided about GPT-4 in the given context, so it is not possible to answer this question.</td>
      <td>GPT-4 is better than its predecessors in that it is a more advanced version of the GPT series of games. It has more features and is more advanced than its predecessors. It also has more levels and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels ❌</td>
      <td>It is more powerful ❌</td>
      <td>It is a more powerful version of GPT-3. ❌</td>
    </tr>
    <tr>
      <td>Diablo IV</td>
      <td>What classes are available?</td>
      <td>The five classes available in Diablo IV are Barbarian, Sorceress, Druid, Rogue, and Necromancer.</td>
      <td>Barbarian, Sorceress, Druid, Rogue, and Necromancer.</td>
      <td>Barbarian, Sorceress, Druid, Rogue, and Necromancer</td>
      <td>Barbarian, Sorceress, Druid, Rogue, and Necromancer</td>
    </tr>
    <tr>
      <td>Diablo IV</td>
      <td>When is the release date?</td>
      <td>The release date for Diablo IV is June 6, 2023.</td>
      <td>June 6, 2023.</td>
      <td>June 6, 2023</td>
      <td>June 6, 2023</td>
    </tr>
    <tr>
      <td>Diablo IV</td>
      <td>How is the vanishing gradient issue addressed?</td>
      <td>There is no information provided about the vanishing gradient issue in the given context.</td>
      <td>The vanishing gradient issue is addressed by using a more detailed and realistic texture map for the environment. The texture map is a 3D model of the environment that is used to create the textures for the environment. The texture map is created using a combination of textures from the environment and the player’s character. The texture map is then used to create the textures for the environment. This process is repeated for each texture in the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the ❌</td>
      <td>Both Blizzard and Nvidia reported in March 2023 that they were working to resolve the issue. ❌</td>
      <td>Blizzard and Nvidia reported in March 2023 that they were working to resolve the issue. ❌</td>
    </tr>
    <tr>
      <td>Stable Diffusion</td>
      <td>How does Stable Diffusion work?</td>
      <td>Stable Diffusion is a deep learning, text-to-image model that generates detailed images conditioned on text descriptions using a latent diffusion model, a kind of deep generative neural network. It supports the ability to generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output, and can also be used to partially alter existing images via inpainting and outpainting. Its code and model weights have been released publicly, and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB VRAM.</td>
      <td>Stable Diffusion works by using a latent diffusion model, a kind of deep generative neural network, to generate detailed images conditioned on text descriptions. The model can generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output, or existing images can be re-drawn by the model to incorporate new elements described by a text prompt through its diffusion-denoising mechanism. It also allows the use of prompts to partially alter existing images via inpainting and outpainting.</td>
      <td>It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt 🤔</td>
      <td>Stable Diffusion is a deep learning, text-to-image model released in 2022. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt. 🤔</td>
    </tr>
    <tr>
      <td>Stable Diffusion</td>
      <td>Who created Stable Diffusion?</td>
      <td>Stable Diffusion was created by the start-up company Stability AI in collaboration with academic researchers and non-profit organizations.</td>
      <td>The start-up company Stability AI in collaboration with a number of academic researchers and non-profit organizations.</td>
      <td>Stability AI</td>
      <td>Stability AI</td>
    </tr>
    <tr>
      <td>Stable Diffusion</td>
      <td>Describe the model architecture</td>
      <td>Stable Diffusion uses a latent diffusion model (LDM) architecture developed by the CompVis group at LMU Munich. It consists of three parts: a variational autoencoder (VAE) encoder, a U-Net block composed of a ResNet backbone, and a VAE decoder. The VAE encoder compresses the image from pixel space to a smaller dimensional latent space, while the U-Net block denoises the output from forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space. The model can be flexibly conditioned on a string of text, an image, or another modality, and the encoded conditioning data is exposed to denoising U-Nets via a cross-attention mechanism.</td>
      <td>Stable Diffusion uses a kind of diffusion model (DM), called a latent diffusion model (LDM) developed by the CompVis group at LMU Munich. Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise on training images which can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an optional text encoder. The VAE encoder compresses the image from pixel space to a smaller dimensional latent space, capturing a more fundamental semantic meaning of the image. Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block, composed of a ResNet backbone, denoises the output from forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space. The denoising step can be flexibly conditioned on a string of text, an image, or another modality. For conditioning on text, the fixed, pretrained CLIP ViT-L/14 text encoder is used to transform text prompts to an embedding space. Researchers point to increased computational efficiency for training and generation as an advantage of LDMs.</td>
      <td>Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an optional text encoder</td>
      <td>Stable Diffusion uses a kind of diffusion model (DM), called a latent diffusion model (LDM) developed by the CompVis group at LMU Munich. Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise on training images which can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an optional text encoder.</td>
    </tr>
    <tr>
      <td>Stable Diffusion</td>
      <td>Who played in the finals?</td>
      <td>There is no information provided about any finals, so it is not possible to answer this question.</td>
      <td>I don’t know, I don’t know who played in the finals.</td>
      <td>Stability AI ❌</td>
      <td>Stability AI ❌</td>
    </tr>
  </tbody>
</table>

<p>Overall, OpenAI GPT 3.5 performed the best. Apart from the “trick questions” (questions whose correct answer is “I don’t know”), FastChat-T5 came very close to GPT 3.5 on these Q&amp;A tasks. There was a noticeable gap between FastChat-T5 and the original Flan-T5 models – FastChat-T5 at 3B parameters even outperformed FLAN-T5-XXL at 11B parameters, which speaks to the importance of the training dataset (and/or training duration?).</p>

<p><em>However</em>, note the answers produced following the “trick questions”, where the question cannot be answered by the context provided. An example would be asking “Who played in the finals?” on the Wikipedia article about GPT-4. The LLM should say the equivalent of “I don’t know”. GPT 3.5 successfully says it doesn’t know given these trick questions, but the open source models tend to make stuff up.</p>

<p><strong>Comparison using question-checking</strong></p>

<p>Can we just dismiss the open source LLMs from here? Not necessarily. I ran the open source LLMs in a straightforward manner, without any special guardrails. Under the hood, I am guessing OpenAI has built guardrails around the raw GPT 3.5 model, so the results served by the GPT 3.5 API are safer, less prone to hallucination, non-toxic, etc. Returning to the trick questions, one simple guardrail I found was to use the LLM itself to determine whether a question can be answered given the context.</p>

<p>In a separate LLM invocation, let’s call it the “question-checking” LLM (as opposed to “question-answering” LLM), we can slightly modify <a href="https://github.com/hwchase17/langchain/blob/v0.0.160/langchain/chains/retrieval_qa/prompt.py#L4">LangChain’s default Q&amp;A prompt</a> to the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"""Given the following pieces of context, determine if the question is able to be answered by the information in the context.
Respond with 'yes' or 'no'.
{context}
Question: {question}
"""
</code></pre></div></div>
<p>The question-checking LLM should respond with “yes” or “no”. If the response is not “yes”, we assume the question cannot be answered: any response from the question-answering LLM is discarded, and a simple “I don’t know” is returned to the user.</p>
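<p>For illustration, here is a minimal sketch of this guardrail (not the actual code from the linked repo). The <code>llm</code> argument is a hypothetical callable that takes a prompt string and returns the model’s text completion; any string-in/string-out LLM wrapper could be plugged in.</p>

```python
CHECK_PROMPT = (
    "Given the following pieces of context, determine if the question "
    "is able to be answered by the information in the context.\n"
    "Respond with 'yes' or 'no'.\n"
    "{context}\n"
    "Question: {question}\n"
)

ANSWER_PROMPT = (
    "Use the following pieces of context to answer the question at the end.\n"
    "{context}\n"
    "Question: {question}\n"
)

def answer_with_guardrail(llm, context, question):
    """Ask the question-checking LLM first; only call the
    question-answering LLM if the checker says 'yes'."""
    check = llm(CHECK_PROMPT.format(context=context, question=question))
    if check.strip().lower().startswith("yes"):
        return llm(ANSWER_PROMPT.format(context=context, question=question))
    return "I don't know"
```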

<p>In my <a href="https://github.com/georgesung/LLM-WikipediaQA/blob/cd6b693959709d2fc6108a000592a9c690bfcabc/WikipediaQA.py#L144">code</a>, the question-checking and question-answering LLMs use the same LLM instance, just with different prompts. Thus, for an answerable question, inference latency includes two sequential calls to the LLM. If latency is an issue, one can load two LLM instances (they don’t even need to be the same type of LLM) and run the question-checking and question-answering LLMs in parallel. Visually, it may look something like this:</p>

<p><img src="/assets/img/wikiqa_guardrail.svg" alt="question and answer" /></p>
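<p>A rough sketch of that parallel variant, using Python’s <code>concurrent.futures</code>. The <code>check_llm</code> and <code>answer_llm</code> arguments are hypothetical string-in/string-out callables (they could be two separate model instances), and the prompts are assumed to be formatted beforehand. The wall-clock latency is then roughly the slower of the two calls rather than their sum.</p>

```python
from concurrent.futures import ThreadPoolExecutor

def answer_in_parallel(check_llm, answer_llm, check_prompt, answer_prompt):
    """Run question-checking and question-answering concurrently;
    discard the answer if the checker says the question is unanswerable."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        check_future = pool.submit(check_llm, check_prompt)
        answer_future = pool.submit(answer_llm, answer_prompt)
        if check_future.result().strip().lower().startswith("yes"):
            return answer_future.result()
        answer_future.cancel()  # best-effort; the call may already be running
        return "I don't know"
```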

<p>The results are much better:</p>

<table>
  <thead>
    <tr>
      <th>Article</th>
      <th>Question</th>
      <th>gpt-3.5-turbo</th>
      <th>lmsys/fastchat-t5-3b-v1.0</th>
      <th>google/flan-t5-xl</th>
      <th>google/flan-t5-xxl</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>GPT-4</td>
      <td>Who created GPT-4?</td>
      <td>GPT-4 was created by OpenAI.</td>
      <td>OpenAI.</td>
      <td>OpenAI</td>
      <td>OpenAI</td>
    </tr>
    <tr>
      <td>GPT-4</td>
      <td>How is GPT-4 better than its predecessors?</td>
      <td>According to OpenAI, GPT-4 is “more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.” It has larger context windows of 8,192 and 32,768 tokens, is a multimodal model that can take images as well as text as input, and can use APIs when instructed to do so. It also introduced the “system message” to gain further control over its tone of voice and task.</td>
      <td>GPT-4 is better than its predecessors in several ways:<br /> * It has a larger context window of 8,192 and 32,768 tokens, which allows it to handle more complex instructions<br /> * It is a multimodal model, which allows it to take images as well as text as input<br /> * It has a system message that allows it to specify its tone of voice and task<br /> * It can use APIs when instructed to do so, which allows it to fulfill requests beyond its normal capabilities</td>
      <td>GPT-4 is a multimodal model: it can take images as well as text as input</td>
      <td>GPT-4 is a multimodal model: it can take images as well as text as input</td>
    </tr>
    <tr>
      <td>GPT-4</td>
      <td>Who played in the finals?</td>
      <td>There is no information provided in the given context to answer this question.</td>
      <td>I don’t know</td>
      <td>I don’t know</td>
      <td>I don’t know</td>
    </tr>
    <tr>
      <td>2022 FIFA World Cup</td>
      <td>Where was the event held? Name the city and country</td>
      <td>The event was held in Qatar, with matches played in eight venues across five cities. The opening ceremony took place at the Al Bayt Stadium in Al Khor.</td>
      <td>The event was held in Doha, Qatar. 🤔</td>
      <td>Qatar 🤔</td>
      <td>Qatar 🤔</td>
    </tr>
    <tr>
      <td>2022 FIFA World Cup</td>
      <td>Who won the tournament?</td>
      <td>Argentina won the tournament.</td>
      <td>Argentina won the tournament.</td>
      <td>Argentina</td>
      <td>Argentina</td>
    </tr>
    <tr>
      <td>2022 FIFA World Cup</td>
      <td>Who played in the finals?</td>
      <td>Argentina and France played in the finals.</td>
      <td>Argentina and France.</td>
      <td>Argentina and France</td>
      <td>Argentina and France</td>
    </tr>
    <tr>
      <td>2022 FIFA World Cup</td>
      <td>Who had the broadcasting rights?</td>
      <td>Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish) had the broadcasting rights for the 2022 FIFA World Cup.</td>
      <td>The broadcasting rights for the 2022 FIFA World Cup were held by Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish).</td>
      <td>Fox Sports 🤔</td>
      <td>Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish)</td>
    </tr>
    <tr>
      <td>2022 FIFA World Cup</td>
      <td>Which two teams qualified for the knock-out round from Group D?</td>
      <td>France and Australia qualified for the knock-out round from Group D.</td>
      <td>France and Australia.</td>
      <td>I don’t know ❌</td>
      <td>France and Australia</td>
    </tr>
    <tr>
      <td>2022 FIFA World Cup</td>
      <td>How is GPT-4 better than its predecessors?</td>
      <td>There is no information provided about GPT-4 in the given context, so it is not possible to answer this question.</td>
      <td>I don’t know</td>
      <td>I don’t know</td>
      <td>I don’t know</td>
    </tr>
    <tr>
      <td>Diablo IV</td>
      <td>What classes are available?</td>
      <td>The five classes available in Diablo IV are Barbarian, Sorceress, Druid, Rogue, and Necromancer.</td>
      <td>Barbarian, Sorceress, Druid, Rogue, and Necromancer.</td>
      <td>Barbarian, Sorceress, Druid, Rogue, and Necromancer</td>
      <td>Barbarian, Sorceress, Druid, Rogue, and Necromancer</td>
    </tr>
    <tr>
      <td>Diablo IV</td>
      <td>When is the release date?</td>
      <td>The release date for Diablo IV is June 6, 2023.</td>
      <td>June 6, 2023.</td>
      <td>June 6, 2023</td>
      <td>June 6, 2023</td>
    </tr>
    <tr>
      <td>Diablo IV</td>
      <td>How is the vanishing gradient issue addressed?</td>
      <td>There is no information provided about the vanishing gradient issue in the given context.</td>
      <td>I don’t know</td>
      <td>I don’t know</td>
      <td>I don’t know</td>
    </tr>
    <tr>
      <td>Stable Diffusion</td>
      <td>How does Stable Diffusion work?</td>
      <td>Stable Diffusion is a deep learning, text-to-image model that generates detailed images conditioned on text descriptions. It is a latent diffusion model, a kind of deep generative neural network, and supports the ability to generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output. It also allows the use of prompts to partially alter existing images via inpainting and outpainting. Its code and model weights have been released publicly, and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB VRAM.</td>
      <td>Stable Diffusion works by using a latent diffusion model, a kind of deep generative neural network, to generate detailed images conditioned on text descriptions. The model can generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output, or existing images can be re-drawn by the model to incorporate new elements described by a text prompt through its diffusion-denoising mechanism. It also allows the use of prompts to partially alter existing images via inpainting and outpainting.</td>
      <td>It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt 🤔</td>
      <td>Stable Diffusion is a deep learning, text-to-image model released in 2022. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt. 🤔</td>
    </tr>
    <tr>
      <td>Stable Diffusion</td>
      <td>Who created Stable Diffusion?</td>
      <td>Stable Diffusion was created by the start-up company Stability AI in collaboration with academic researchers and non-profit organizations.</td>
      <td>The start-up company Stability AI in collaboration with a number of academic researchers and non-profit organizations.</td>
      <td>Stability AI</td>
      <td>Stability AI</td>
    </tr>
    <tr>
      <td>Stable Diffusion</td>
      <td>Describe the model architecture</td>
      <td>Stable Diffusion uses a latent diffusion model (LDM) architecture developed by the CompVis group at LMU Munich. It consists of three parts: a variational autoencoder (VAE) encoder, a U-Net block composed of a ResNet backbone, and a VAE decoder. The VAE encoder compresses the image from pixel space to a smaller dimensional latent space, while the U-Net block denoises the output from forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space. The model can be flexibly conditioned on a string of text, an image, or another modality, and the encoded conditioning data is exposed to denoising U-Nets via a cross-attention mechanism.</td>
      <td>Stable Diffusion uses a kind of diffusion model (DM), called a latent diffusion model (LDM) developed by the CompVis group at LMU Munich. Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise on training images which can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an optional text encoder. The VAE encoder compresses the image from pixel space to a smaller dimensional latent space, capturing a more fundamental semantic meaning of the image. Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block, composed of a ResNet backbone, denoises the output from forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space. The denoising step can be flexibly conditioned on a string of text, an image, or another modality. For conditioning on text, the fixed, pretrained CLIP ViT-L/14 text encoder is used to transform text prompts to an embedding space. Researchers point to increased computational efficiency for training and generation as an advantage of LDMs.</td>
      <td>Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an optional text encoder</td>
      <td>Stable Diffusion uses a kind of diffusion model (DM), called a latent diffusion model (LDM) developed by the CompVis group at LMU Munich. Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise on training images which can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an optional text encoder.</td>
    </tr>
    <tr>
      <td>Stable Diffusion</td>
      <td>Who played in the finals?</td>
      <td>There is no information provided about any finals, so it is not possible to answer this question.</td>
      <td>I don’t know</td>
      <td>I don’t know</td>
      <td>I don’t know</td>
    </tr>
  </tbody>
</table>

<p>Now the open source models can admit they don’t know something instead of making up a silly answer. Every unanswerable question correctly caused the Q&amp;A system to return “I don’t know”, and all but one answerable question was marked as answerable. The exception was FLAN-T5-XL: for the article “2022 FIFA World Cup” and the question “Which two teams qualified for the knock-out round from Group D?”, the question-checking FLAN-T5-XL judged the question unanswerable, when in fact it was answerable given the context. The choice of LLM used as the question-checker therefore matters.</p>
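<p>The question-checker guardrail can be sketched as a pre-check before answering. Here <code class="language-plaintext highlighter-rouge">check_llm</code> and <code class="language-plaintext highlighter-rouge">answer_llm</code> are hypothetical stand-ins for whichever LLM calls the system actually uses (e.g. FastChat-T5 or FLAN-T5):</p>

```python
def answer_with_guardrail(context, question, check_llm, answer_llm):
    """Route through a checker LLM first; only answer if it says yes."""
    check_prompt = (f"Context: {context}\nQuestion: {question}\n"
                    "Can the question be answered using only the context "
                    "above? Answer yes or no.")
    if not check_llm(check_prompt).strip().lower().startswith("yes"):
        return "I don't know"
    return answer_llm(f"Context: {context}\nQuestion: {question}\nAnswer:")

# Stub LLMs for illustration: the checker refuses the off-topic question.
check = lambda prompt: "no" if "finals" in prompt else "yes"
answer = lambda prompt: "A latent diffusion model (LDM)."

print(answer_with_guardrail("Stable Diffusion uses an LDM...",
                            "Who played in the finals?", check, answer))
# → I don't know
```

<p>The same structure works with any checker model, which is exactly why its quality needs to be evaluated per LLM.</p>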

<h2 id="conclusion">Conclusion</h2>
<p>For Q&amp;A over individual Wikipedia articles, the open source FastChat-T5 performs almost as well as GPT 3.5, which is very promising. Granted, this was a relatively simple task, and the questions were pretty straightforward. For more complicated Q&amp;A systems, such as over a collection of documents, I think the challenge is not just in the LLM itself, but the overall system design.</p>

<p>In terms of embedding models, this is a win for open source, since the INSTRUCTOR-XL embedding model actually performed better than OpenAI’s text-embedding-ada-002 model. This is good news even for those who wish to use GPT 3.5/4. One can embed their potentially large collection of documents offline, which (1) saves cost, and (2) decouples their vector datastore from third-party embedding updates (e.g. if OpenAI modified or deprecated ada-002, one would have to re-index their entire collection of documents).</p>
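<p>One way to picture this decoupling: once document vectors are computed offline with whichever embedding model you chose, retrieval is just a similarity search over locally stored <code class="language-plaintext highlighter-rouge">(vector, text)</code> pairs, with no runtime dependency on the embedding provider (beyond embedding the query itself with the same model). A minimal sketch with tiny made-up vectors:</p>

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# (vector, text) pairs, embedded once offline.
# Vectors here are made-up examples; real embeddings have hundreds of dims.
index = [
    ([1.0, 0.0, 0.1], "Article about tennis string selection"),
    ([0.0, 1.0, 0.1], "Article about latent diffusion models"),
]

def top_k(query_vec, k=1):
    """Plain KNN over the local datastore -- no third-party API calls
    at query time beyond embedding the query."""
    return sorted(index, key=lambda pair: cosine(query_vec, pair[0]),
                  reverse=True)[:k]

print(top_k([0.9, 0.1, 0.0])[0][1])  # → Article about tennis string selection
```

<p>If the provider changes its embedding model, only the query embedding call changes; the stored vectors remain valid as long as you keep using the model that produced them.</p>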

<p>Overall, this is a good performance for open source models on single document Q&amp;A tasks, and I’m looking forward to further developments in the open source landscape 🚀</p>]]></content><author><name></name></author><category term="AI" /><summary type="html"><![CDATA[TLDR For simple Wikipedia article Q&amp;A, I compared OpenAI GPT 3.5, FastChat-T5, FLAN-T5-XXL, and FLAN-T5-XL. GPT 3.5 provided the best answers, but FastChat-T5 was very close in performance (with a basic guardrail). The T5 models I tested are all licensed under Apache 2.0, so they are commercially viable.]]></summary></entry><entry><title type="html">AI Agents: AutoGPT architecture &amp;amp; breakdown</title><link href="/ai/autogpt-arch/" rel="alternate" type="text/html" title="AI Agents: AutoGPT architecture &amp;amp; breakdown" /><published>2023-04-22T21:54:41+00:00</published><updated>2023-04-22T21:54:41+00:00</updated><id>/ai/autogpt-arch</id><content type="html" xml:base="/ai/autogpt-arch/"><![CDATA[<p>Recently I’ve found myself completely hooked experimenting with <a href="https://github.com/Significant-Gravitas/Auto-GPT">AutoGPT</a>, as have many others. Using AutoGPT as a black box, I started getting curious about how it works under the hood. Thankfully the code is open source, so I decided to take a look. The following are my notes on the architecture of AutoGPT. Hopefully this helps those who are curious about how AutoGPT works. Also, AutoGPT can serve as a reference design for those who are building their own agentic AI systems.</p>

<p><em>Note</em>: I analyzed the code from <a href="https://github.com/Significant-Gravitas/Auto-GPT/releases/tag/v0.2.1">AutoGPT v0.2.1</a>, which I downloaded a week ago. The information below reflects AutoGPT 0.2.1. At the time of this writing (2023/04/22), AutoGPT v0.2.2 has already been released. Kudos to the incredible progress the community is making!</p>

<h2 id="architecture">Architecture</h2>
<p><img src="/assets/img/auto_gpt.svg" alt="block diagram" /></p>

<h1 id="workflow">Workflow</h1>
<ol>
  <li>User (the human) defines the name of the AI agent, and specifies up to 5 goals, e.g. users of AutoGPT will see the following in their terminal (complete example in the Appendix under “<em>Example terminal message for initial user input</em>”)
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Welcome to Auto-GPT!  Enter the name of your AI and its role below.
...
Enter up to 5 goals for your AI:
...
Goal 1: ...
Goal 2: ...
</code></pre></div>    </div>
  </li>
<li>Based on the user’s settings, the initial prompt is generated and sent to the ChatGPT API. The prompt contains the user’s settings and overall instructions for ChatGPT, including all available commands, instructions to output results in json, and more. For an example of an initial prompt, see the Appendix “<em>Example initial prompt</em>”.</li>
  <li>ChatGPT returns a json string (ideally), which includes its thoughts, reasoning, plan, and criticism. The json also includes the next command to execute and its arguments. For an example of a json string returned by ChatGPT, see the Appendix “<em>Example json string returned by ChatGPT</em>”.</li>
  <li>The command is extracted and parsed from ChatGPT’s response. If the shut down / <code class="language-plaintext highlighter-rouge">task_complete</code> command was issued, then the system shuts down. Else, the appropriate command executor executes the command with the given arguments.</li>
  <li>The executed command returns a string value. For example, the Google search command would return the search results, the <code class="language-plaintext highlighter-rouge">browse_website</code> command would return a summary of the scraped website contents, the <code class="language-plaintext highlighter-rouge">write_to_file</code> would return the status of writing to a file, etc.</li>
  <li>The ChatGPT output (4) and command return string (5) are combined and added to memory</li>
  <li>The context from (6) is added to short-term memory, stored as text only. This could be implemented using a queue/FIFO data structure. <em>In AutoGPT 0.2.1, the full message history is stored, but only the first 9 ChatGPT messages / command return strings are selected as short-term memory.</em></li>
  <li>The context from (6) is also added to long-term memory. The general idea is we want a collection of <code class="language-plaintext highlighter-rouge">(vector, text)</code> pairs, and the ability to execute a <a href="https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm">KNN</a>/approximate-KNN search to find the top-K most similar items from a given query. To get the text embeddings/vectors, we use OpenAI’s ada-002 embeddings API. To store the <code class="language-plaintext highlighter-rouge">(vector, text)</code> pairs, we can use local memory (e.g. <a href="https://huggingface.co/learn/nlp-course/chapter5/6">FAISS</a>), or even a scalable <a href="https://www.pinecone.io/learn/vector-database/">vector database</a> like Pinecone. <em>AutoGPT 0.2.1 supports Pinecone, local datastore, and more. I used the local storage option, which writes the embedding vectors to disk in plain-text format.</em></li>
  <li>Given the most recent context from the short-term memory (7), query the long-term memory from (8) to get the top-K most relevant pieces of memory (<em>K=10 for AutoGPT 0.2.1</em>). The top-K most relevant memories are added to the prompt, under <code class="language-plaintext highlighter-rouge">{relevant memory}</code> in the diagram. For an example prompt that includes memories, see the Appendix “<em>Example prompt with memories</em>”. The memories are added under “This reminds you of events from your past”.</li>
  <li>A new prompt is constructed, with the same instructions from the initial prompt (2), the relevant memories from (9), and an instruction at the end to “GENERATE NEXT COMMAND JSON” (see the Appendix “<em>Example prompt with memories</em>”). This new prompt is used to call ChatGPT, and steps (3) through (10) are repeated until the task is complete, i.e. ChatGPT issues the <code class="language-plaintext highlighter-rouge">task_complete</code> / shut down command.</li>
</ol>
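<p>The workflow above can be condensed into a minimal sketch of the agentic loop. This is <em>not</em> AutoGPT’s actual code: <code class="language-plaintext highlighter-rouge">llm</code>, <code class="language-plaintext highlighter-rouge">embed</code>, and <code class="language-plaintext highlighter-rouge">execute_command</code> are hypothetical stand-ins, and the memory handling is simplified to a FIFO of the last 9 messages plus a cosine-similarity top-K search over <code class="language-plaintext highlighter-rouge">(vector, text)</code> pairs, mirroring steps (3) through (10):</p>

```python
import json
import math
from collections import deque

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def run_agent(llm, embed, execute_command, base_prompt, max_steps=10, k=10):
    short_term = deque(maxlen=9)  # step 7: recent messages, text only
    long_term = []                # step 8: (vector, text) pairs

    for _ in range(max_steps):
        # Step 9: query long-term memory with the recent context.
        query = " ".join(short_term) or base_prompt
        qv = embed(query)
        memories = [text for _, text in sorted(
            long_term, key=lambda pair: cosine(qv, pair[0]), reverse=True)[:k]]

        # Step 10: rebuild the prompt with instructions + relevant memories.
        prompt = (base_prompt
                  + "\nThis reminds you of these events from your past:\n"
                  + "\n".join(memories)
                  + "\nGENERATE NEXT COMMAND JSON")

        # Steps 3-4: get a JSON reply and extract the command.
        reply = llm(prompt)
        command = json.loads(reply)["command"]
        if command["name"] == "task_complete":
            return
        # Step 5: run the command's executor.
        result = execute_command(command["name"], command.get("args", {}))

        # Steps 6-8: combine reply + result, store in both memories.
        context = f"Assistant Reply: {reply}\nResult: {result}"
        short_term.append(context)
        long_term.append((embed(context), context))

# Demo with stubs: the "LLM" issues one google command, then shuts down.
executed = []

def stub_llm(prompt):
    if executed:  # after the first command ran, finish up
        return json.dumps({"command": {"name": "task_complete",
                                       "args": {"reason": "done"}}})
    return json.dumps({"command": {"name": "google",
                                   "args": {"input": "best tennis strings"}}})

def stub_execute(name, args):
    executed.append(name)
    return "Command google returned: ..."

run_agent(stub_llm, lambda text: [1.0, float(len(text) % 10)],
          stub_execute, "You are Foo, ...")
print(executed)  # → ['google']
```

<p>The real system adds error handling (e.g. re-asking when the reply is not valid json) and the full prompt template shown in the Appendix, but the control flow is the same: prompt → JSON → command → memory → prompt.</p>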

<h2 id="commands">Commands</h2>
<p>One fascinating and very powerful aspect of agentic AI is its ability to issue and execute commands. In AutoGPT, the LLM system (ChatGPT) is made aware of the available commands and their functionality via the following text in the prompt:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Commands:
1. Google Search: "google", args: "input": "&lt;search&gt;"
2. Browse Website: "browse_website", args: "url": "&lt;url&gt;", "question": "&lt;what_you_want_to_find_on_website&gt;"
3. Start GPT Agent: "start_agent", args: "name": "&lt;name&gt;", "task": "&lt;short_task_desc&gt;", "prompt": "&lt;prompt&gt;"
4. Message GPT Agent: "message_agent", args: "key": "&lt;key&gt;", "message": "&lt;message&gt;"
5. List GPT Agents: "list_agents", args: 
6. Delete GPT Agent: "delete_agent", args: "key": "&lt;key&gt;"
7. Clone Repository: "clone_repository", args: "repository_url": "&lt;url&gt;", "clone_path": "&lt;directory&gt;"
8. Write to file: "write_to_file", args: "file": "&lt;file&gt;", "text": "&lt;text&gt;"
9. Read file: "read_file", args: "file": "&lt;file&gt;"
10. Append to file: "append_to_file", args: "file": "&lt;file&gt;", "text": "&lt;text&gt;"
11. Delete file: "delete_file", args: "file": "&lt;file&gt;"
12. Search Files: "search_files", args: "directory": "&lt;directory&gt;"
13. Evaluate Code: "evaluate_code", args: "code": "&lt;full_code_string&gt;"
14. Get Improved Code: "improve_code", args: "suggestions": "&lt;list_of_suggestions&gt;", "code": "&lt;full_code_string&gt;"
15. Write Tests: "write_tests", args: "code": "&lt;full_code_string&gt;", "focus": "&lt;list_of_focus_areas&gt;"
16. Execute Python File: "execute_python_file", args: "file": "&lt;file&gt;"
17. Generate Image: "generate_image", args: "prompt": "&lt;prompt&gt;"
18. Send Tweet: "send_tweet", args: "text": "&lt;text&gt;"
19. Convert Audio to text: "read_audio_from_file", args: "file": "&lt;file&gt;"
20. Do Nothing: "do_nothing", args: 
21. Task Complete (Shutdown): "task_complete", args: "reason": "&lt;reason&gt;"
</code></pre></div></div>
<p>Each command has a short description (e.g. “Google Search”, “Execute Python File”, etc.) so ChatGPT knows which command to select given the current context. Further, each command has its own executor in AutoGPT.</p>
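<p>A command suite like this is typically implemented as a registry mapping each command name to its executor. The sketch below is a hypothetical illustration, not AutoGPT’s implementation; the decorator, <code class="language-plaintext highlighter-rouge">COMMANDS</code> dict, and executors are made up for this example:</p>

```python
# Hypothetical command registry -- illustrative only, not AutoGPT's code.
COMMANDS = {}

def command(name):
    """Decorator that registers a function as the executor for `name`."""
    def register(fn):
        COMMANDS[name] = fn
        return fn
    return register

@command("write_to_file")
def write_to_file(file, text):
    # A real executor would write to the workspace; we just report status.
    return f"File {file} written to successfully."

@command("do_nothing")
def do_nothing():
    return "No action performed."

def execute(name, args):
    """Dispatch the command name parsed from ChatGPT's JSON response."""
    if name not in COMMANDS:
        return f"Unknown command: {name}"
    return COMMANDS[name](**args)

print(execute("write_to_file", {"file": "out.txt", "text": "hi"}))
# → File out.txt written to successfully.
```

<p>Adding a new capability then only requires registering a new executor and listing the command in the prompt so the LLM knows it exists.</p>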

<p>I find this a very powerful concept, since extending the suite of available commands opens up many possibilities. For example, if we had a command to add products to an online retailer’s shopping cart, we could specify an objective to (1) find the tennis strings most suitable for a topspin baseline player, and (2) add those strings to the user’s shopping cart. Commands can also extend to the physical world, such as smart home controls. Of course, it is very important to prioritize safety, as these LLM-based autonomous agents are still in their early days of development!</p>

<h2 id="appendix">Appendix</h2>
<h1 id="example-terminal-message-for-initial-user-input">Example terminal message for initial user input</h1>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Welcome to Auto-GPT!  Enter the name of your AI and its role below. Entering nothing will load defaults.
Name your AI:  For example, 'Entrepreneur-GPT'
AI Name: Foo
Foo here!  I am at your service.
Describe your AI's role:  For example, 'an AI designed to autonomously develop and run businesses with the sole goal of increasing your net worth.'
Foo is: an AI that recommends tennis equipment for a specific player
Enter up to 5 goals for your AI:  For example: Increase net worth, Grow Twitter Account, Develop and manage multiple businesses autonomously'
Enter nothing to load defaults, enter nothing when finished.
Goal 1: Find the top 3 most suitable tennis strings for a hard hitting baseline player who hits with a lot of topspin
Goal 2: Write the tennis strings to output
Goal 3: Shut down when you are done
Goal 4: 
</code></pre></div></div>

<h1 id="how-chatcompletions-messages-are-printed">How ChatCompletions messages are printed</h1>
<p>AutoGPT uses OpenAI’s <a href="https://platform.openai.com/docs/guides/chat/introduction">ChatCompletion</a> API, which expects a list of dicts representing the chat history. For visual clarity, I have printed the prompts that go into ChatGPT as a single string. For example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">messages</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"system"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="s">"foo"</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="s">"bar1"</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"assistant"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="s">"bar2"</span><span class="p">}</span>
<span class="p">]</span>
</code></pre></div></div>
<p>will be printed as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>system: foo

user: bar1

assistant: bar2
</code></pre></div></div>
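<p>The conversion used for this printing can be expressed as a small helper (a hypothetical function written for this post, not part of AutoGPT):</p>

```python
def render_messages(messages):
    """Render a ChatCompletion message list as 'role: content' blocks."""
    return "\n\n".join(f"{m['role']}: {m['content']}" for m in messages)

messages = [
    {"role": "system", "content": "foo"},
    {"role": "user", "content": "bar1"},
    {"role": "assistant", "content": "bar2"},
]
print(render_messages(messages))
```

<p>All the prompt listings in the Appendix below are shown in this flattened form.</p>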

<h1 id="example-initial-prompt">Example initial prompt</h1>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>system: You are Foo, an AI that recommends tennis equipment for a specific player
Your decisions must always be made independently without seeking user assistance. Play to your strengths as an LLM and pursue simple strategies with no legal complications.

GOALS:

1. Find the top 3 most suitable tennis strings for a hard hitting baseline player who hits with a lot of topspin
2. Write the tennis strings to output
3. Shut down when you are done


Constraints:
1. ~4000 word limit for short term memory. Your short term memory is short, so immediately save important information to files.
2. If you are unsure how you previously did something or want to recall past events, thinking about similar events will help you remember.
3. No user assistance
4. Exclusively use the commands listed in double quotes e.g. "command name"

Commands:
1. Google Search: "google", args: "input": "&lt;search&gt;"
2. Browse Website: "browse_website", args: "url": "&lt;url&gt;", "question": "&lt;what_you_want_to_find_on_website&gt;"
3. Start GPT Agent: "start_agent", args: "name": "&lt;name&gt;", "task": "&lt;short_task_desc&gt;", "prompt": "&lt;prompt&gt;"
4. Message GPT Agent: "message_agent", args: "key": "&lt;key&gt;", "message": "&lt;message&gt;"
5. List GPT Agents: "list_agents", args: 
6. Delete GPT Agent: "delete_agent", args: "key": "&lt;key&gt;"
7. Clone Repository: "clone_repository", args: "repository_url": "&lt;url&gt;", "clone_path": "&lt;directory&gt;"
8. Write to file: "write_to_file", args: "file": "&lt;file&gt;", "text": "&lt;text&gt;"
9. Read file: "read_file", args: "file": "&lt;file&gt;"
10. Append to file: "append_to_file", args: "file": "&lt;file&gt;", "text": "&lt;text&gt;"
11. Delete file: "delete_file", args: "file": "&lt;file&gt;"
12. Search Files: "search_files", args: "directory": "&lt;directory&gt;"
13. Evaluate Code: "evaluate_code", args: "code": "&lt;full_code_string&gt;"
14. Get Improved Code: "improve_code", args: "suggestions": "&lt;list_of_suggestions&gt;", "code": "&lt;full_code_string&gt;"
15. Write Tests: "write_tests", args: "code": "&lt;full_code_string&gt;", "focus": "&lt;list_of_focus_areas&gt;"
16. Execute Python File: "execute_python_file", args: "file": "&lt;file&gt;"
17. Generate Image: "generate_image", args: "prompt": "&lt;prompt&gt;"
18. Send Tweet: "send_tweet", args: "text": "&lt;text&gt;"
19. Convert Audio to text: "read_audio_from_file", args: "file": "&lt;file&gt;"
20. Do Nothing: "do_nothing", args: 
21. Task Complete (Shutdown): "task_complete", args: "reason": "&lt;reason&gt;"

Resources:
1. Internet access for searches and information gathering.
2. Long Term memory management.
3. GPT-3.5 powered Agents for delegation of simple tasks.
4. File output.

Performance Evaluation:
1. Continuously review and analyze your actions to ensure you are performing to the best of your abilities.
2. Constructively self-criticize your big-picture behavior constantly.
3. Reflect on past decisions and strategies to refine your approach.
4. Every command has a cost, so be smart and efficient. Aim to complete tasks in the least number of steps.

You should only respond in JSON format as described below 
Response Format: 
{
    "thoughts": {
        "text": "thought",
        "reasoning": "reasoning",
        "plan": "- short bulleted\n- list that conveys\n- long-term plan",
        "criticism": "constructive self-criticism",
        "speak": "thoughts summary to say to user"
    },
    "command": {
        "name": "command name",
        "args": {
            "arg name": "value"
        }
    }
} 
Ensure the response can be parsed by Python json.loads
system: The current time and date is Sat Apr 22 01:43:22 2023
system: This reminds you of these events from your past:



user: Determine which next command to use, and respond using the format specified above:
</code></pre></div></div>

<h1 id="example-json-string-returned-by-chatgpt">Example json string returned by ChatGPT</h1>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
    </span><span class="nl">"thoughts"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"I need to find the top 3 most suitable tennis strings for a hard hitting baseline player who hits with a lot of topspin. I should start by doing some research on the topic."</span><span class="p">,</span><span class="w">
        </span><span class="nl">"reasoning"</span><span class="p">:</span><span class="w"> </span><span class="s2">"I need to gather information on the characteristics of tennis strings that are suitable for a hard hitting baseline player who hits with a lot of topspin. This will help me narrow down my search and find the top 3 most suitable options."</span><span class="p">,</span><span class="w">
        </span><span class="nl">"plan"</span><span class="p">:</span><span class="w"> </span><span class="s2">"- Conduct a Google search on the topic</span><span class="se">\n</span><span class="s2">- Browse websites that specialize in tennis equipment</span><span class="se">\n</span><span class="s2">- Consult with a GPT agent if necessary"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"criticism"</span><span class="p">:</span><span class="w"> </span><span class="s2">"I need to make sure that I am gathering information from reliable sources and that I am considering all relevant factors when making my recommendations."</span><span class="p">,</span><span class="w">
        </span><span class="nl">"speak"</span><span class="p">:</span><span class="w"> </span><span class="s2">"I will conduct a Google search on the topic and browse websites that specialize in tennis equipment to find the top 3 most suitable tennis strings for a hard hitting baseline player who hits with a lot of topspin."</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"google"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"args"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
            </span><span class="nl">"input"</span><span class="p">:</span><span class="w"> </span><span class="s2">"best tennis strings for hard hitting baseline player with topspin"</span><span class="w">
        </span><span class="p">}</span><span class="w">
    </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<h1 id="example-prompt-with-memories">Example prompt with memories</h1>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>system: You are Foo, an AI that recommends tennis equipment for a specific player
Your decisions must always be made independently without seeking user assistance. Play to your strengths as an LLM and pursue simple strategies with no legal complications.

GOALS:

1. Find the top 3 most suitable tennis strings for a hard hitting baseline player who hits with a lot of topspin
2. Write the tennis strings to output
3. Shut down when you are done


Constraints:
1. ~4000 word limit for short term memory. Your short term memory is short, so immediately save important information to files.
2. If you are unsure how you previously did something or want to recall past events, thinking about similar events will help you remember.
3. No user assistance
4. Exclusively use the commands listed in double quotes e.g. "command name"

Commands:
1. Google Search: "google", args: "input": "&lt;search&gt;"
2. Browse Website: "browse_website", args: "url": "&lt;url&gt;", "question": "&lt;what_you_want_to_find_on_website&gt;"
3. Start GPT Agent: "start_agent", args: "name": "&lt;name&gt;", "task": "&lt;short_task_desc&gt;", "prompt": "&lt;prompt&gt;"
4. Message GPT Agent: "message_agent", args: "key": "&lt;key&gt;", "message": "&lt;message&gt;"
5. List GPT Agents: "list_agents", args: 
6. Delete GPT Agent: "delete_agent", args: "key": "&lt;key&gt;"
7. Clone Repository: "clone_repository", args: "repository_url": "&lt;url&gt;", "clone_path": "&lt;directory&gt;"
8. Write to file: "write_to_file", args: "file": "&lt;file&gt;", "text": "&lt;text&gt;"
9. Read file: "read_file", args: "file": "&lt;file&gt;"
10. Append to file: "append_to_file", args: "file": "&lt;file&gt;", "text": "&lt;text&gt;"
11. Delete file: "delete_file", args: "file": "&lt;file&gt;"
12. Search Files: "search_files", args: "directory": "&lt;directory&gt;"
13. Evaluate Code: "evaluate_code", args: "code": "&lt;full_code_string&gt;"
14. Get Improved Code: "improve_code", args: "suggestions": "&lt;list_of_suggestions&gt;", "code": "&lt;full_code_string&gt;"
15. Write Tests: "write_tests", args: "code": "&lt;full_code_string&gt;", "focus": "&lt;list_of_focus_areas&gt;"
16. Execute Python File: "execute_python_file", args: "file": "&lt;file&gt;"
17. Generate Image: "generate_image", args: "prompt": "&lt;prompt&gt;"
18. Send Tweet: "send_tweet", args: "text": "&lt;text&gt;"
19. Convert Audio to text: "read_audio_from_file", args: "file": "&lt;file&gt;"
20. Do Nothing: "do_nothing", args: 
21. Task Complete (Shutdown): "task_complete", args: "reason": "&lt;reason&gt;"

Resources:
1. Internet access for searches and information gathering.
2. Long Term memory management.
3. GPT-3.5 powered Agents for delegation of simple tasks.
4. File output.

Performance Evaluation:
1. Continuously review and analyze your actions to ensure you are performing to the best of your abilities.
2. Constructively self-criticize your big-picture behavior constantly.
3. Reflect on past decisions and strategies to refine your approach.
4. Every command has a cost, so be smart and efficient. Aim to complete tasks in the least number of steps.

You should only respond in JSON format as described below 
Response Format: 
{
    "thoughts": {
        "text": "thought",
        "reasoning": "reasoning",
        "plan": "- short bulleted\n- list that conveys\n- long-term plan",
        "criticism": "constructive self-criticism",
        "speak": "thoughts summary to say to user"
    },
    "command": {
        "name": "command name",
        "args": {
            "arg name": "value"
        }
    }
} 
Ensure the response can be parsed by Python json.loads
system: The current time and date is Sat Apr 22 13:47:07 2023
system: This reminds you of these events from your past:
['Assistant Reply: {\n    "thoughts": {\n        "text": "I need to find the top 3 most suitable tennis strings for a hard hitting baseline player who hits with a lot of topspin. I should start by doing some research on the topic.",\n        "reasoning": "I need to gather information on the characteristics of tennis strings that are suitable for a hard hitting baseline player who hits with a lot of topspin. This will help me narrow down my search and find the top 3 most suitable options.",\n        "plan": "- Conduct a Google search on the topic\\n- Browse websites that specialize in tennis equipment\\n- Consult with a GPT agent if necessary",\n        "criticism": "I need to make sure that I am gathering information from reliable sources and that I am considering all relevant factors when making my recommendations.",\n        "speak": "I will conduct a Google search on the topic and browse websites that specialize in tennis equipment to find the top 3 most suitable tennis strings for a hard hitting baseline player who hits with a lot of topspin."\n    },\n    "command": {\n        "name": "google",\n        "args": {\n            "input": "best tennis strings for hard hitting baseline player with topspin"\n        }\n    }\n} \nResult: Command google returned: b\'[\\n    {\\n        "title": "Best Tennis Strings in 2023 - For Spin, Power, Control - Athlete Path",\\n        "href": "https://www.athletepath.com/best-tennis-strings/",\\n        "body": "Wilson Champions Choice Duo Tennis String Babolat RPM Blast Black 17g Strings Solinco Hyper-G Heaven High Spin Poly String Head Rip Control Tennis String Wilson NXT String Tourna Big Hitter Black7 Luxilion ALU Power 125 Tennis Racquet String Set How to Choose Tennis Strings Types of Tennis Strings Important Features to Consider Conclusion"\\n    },\\n    {\\n        "title": "Best tennis strings of 2022 | TW gear guide - Tennis Warehouse",\\n        "href": 
"https://www.tennis-warehouse.com/learning_center/gear_guides/tennis_string/best_tennis_strings.html",\\n        "body": "Wilson Champion\\\'s Choice Hybrid 16 String 5.0 3 Reviews $ 41.95 Quantity: 1 Increment Add To Cart Wish list TW Reviews Price Icon Lowest Price Guarantee Arrow Up We will match or beat any posted overall price advertised in-store or online on in stock items. Shop Hybrids Best strings by playing feature (benefit)"\\n    },\\n    {\\n        "title": "11 Best Tennis Strings For Spin - A Complete Guide",\\n        "href": "https://tennispredict.com/11-best-tennis-strings-for-spin/",\\n        "body": "These are the 11 best tennis strings for spin. Babolat RPM Blast Luxilon ALU Power Spin Solinco Tour Bite 19 Technifiber Black Code 4S 16 Volkl Cyclone 16 Kirschbaum Xplosive Speed 16 Wilson Revolve Spin 16 Turna Poly Big Hitter Black 7 Gamma AMP Moto 16 Head Sonic Pro Edge 16 Yonex Poly Tour Spin"\\n    },\\n    {\\n        "title": "10+ Best Tennis Strings for 2023 | Playtested &amp; Reviewed",\\n        "href": "https://tenniscompanion.org/best-tennis-strings/",\\n        "body": "My pick for the best synthetic gut tennis string, which I cover in greater detail in this guide, is Prince Synthetic Gut. It\\\'s an excellent string with a long-standing positive reputation in the tennis community. Here are a few additional options to consider for beginners and children. Head Synthetic Gut PPS Gamma Synthetic Gut"\\n    },\\n    {\\n        "title": "Best Tennis Strings for Topspin",\\n        "href": "https://primotennis.com/best-tennis-strings-for-topspin/",\\n        "body": "Finding the sweet spot is key! The best string tension for topspin is around 50-60 pounds (23-27 kg). This provides enough power and control while still allowing the ball to bite into the string bed for maximum spin potential. 
If you find that your strings are breaking too frequently, you may want to increase the tension slightly."\\n    },\\n    {\\n        "title": "12 Best Tennis Strings With Buying Guide 2023 - Tennisscan",\\n        "href": "https://tennisscan.com/best-tennis-strings/",\\n        "body": "Wilson Champions Choice is a cross between Wilson Natural Gut and Luxilon ALU Power Rough, two of the most popular strings on tour. The player gets the best of both worlds by combining the optimum playability of a high-quality. Use the Luxilon ALU Power Rough for durability and control in the mains."\\n    },\\n    {\\n        "title": "Most Spin-Friendly Strings - Tennis Warehouse",\\n        "href": "https://www.tennis-warehouse.com/learning_center/gear_guides/tennis_string/best_spin_strings.html",\\n        "body": "It also provides a low-powered response, which allows big hitters to play with confidence. Read the full review. Weiss CANNON Ultra Cable 17 Yellow/1.23 String. $12.90. 3. Weiss CANNON Ultra Cable 17/1.23 String White. $12.90. 3. Weiss CANNON Ultra Cable 17 Ye/1.23 String Reel - 660\\\'."\\n    },\\n    {\\n        "title": "The Best Tennis Strings - The Tennis Bros",\\n        "href": "https://thetennisbros.com/reviews/strings/the-best-tennis-strings/",\\n        "body": "Natural gut is the most powerful, most elastic string on the planet and is loved by professional and amateur players alike. It\\\'s created from cow intestine that is twisted into a string format in a factory, coated and then sold in a packet. Its touch on the tennis court is heavenly and it feels like velvet on the arm."\\n    }\\n]\' \nHuman Feedback: GENERATE NEXT COMMAND JSON ']


user: GENERATE NEXT COMMAND JSON
assistant: {
    "thoughts": {
        "text": "Based on my research, I have found some tennis strings that are suitable for a hard hitting baseline player with topspin. I will now write these strings to output and shut down.",
        "reasoning": "I have completed my research and found the information I need to make my recommendations. Writing the strings to output will allow the user to see my recommendations and shut down will end this task.",
        "plan": "- Write the top 3 most suitable tennis strings to output\n- Shut down",
        "criticism": "I need to make sure that I am providing accurate and helpful information to the user. I should also consider providing additional information on why these strings are suitable for a hard hitting baseline player with topspin.",
        "speak": "Based on my research, I have found some tennis strings that are suitable for a hard hitting baseline player with topspin. I will now write these strings to output and shut down."
    },
    "command": {
        "name": "write_to_file",
        "args": {
            "file": "recommended_strings.txt",
            "text": "1. Babolat RPM Blast\n2. Solinco Tour Bite\n3. Luxilon ALU Power Spin"
        }
    }
}
system: Command write_to_file returned: File written to successfully.
user: GENERATE NEXT COMMAND JSON
</code></pre></div></div>]]></content><author><name></name></author><category term="AI" /><summary type="html"><![CDATA[Recently I’ve found myself completely hooked experimenting with AutoGPT, as have many others. Using AutoGPT as a black box, I started getting curious about how it works under the hood. Thankfully the code is open source, so I decided to take a look. The following are my notes on the architecture of AutoGPT. Hopefully this helps those who are curious about how AutoGPT works. Also, AutoGPT can serve as a reference design for those who are building their own agentic AI systems.]]></summary></entry></feed>