<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://poonai.xyz/feed.xml" rel="self" type="application/atom+xml" /><link href="https://poonai.xyz/" rel="alternate" type="text/html" /><updated>2026-03-02T02:00:53+00:00</updated><id>https://poonai.xyz/feed.xml</id><title type="html">Poonai’s space</title><subtitle>Place to express my learnings</subtitle><entry><title type="html">brute force attempt to understand RNN</title><link href="https://poonai.xyz/interpretability/2026/03/01/brute-force-attempt-to-understand-RNN.html" rel="alternate" type="text/html" title="brute force attempt to understand RNN" /><published>2026-03-01T00:00:00+00:00</published><updated>2026-03-01T00:00:00+00:00</updated><id>https://poonai.xyz/interpretability/2026/03/01/brute-force-attempt-to-understand-RNN</id><content type="html" xml:base="https://poonai.xyz/interpretability/2026/03/01/brute-force-attempt-to-understand-RNN.html"><![CDATA[<p>As of today, it’s not feasible to fully understand LLMs. Therefore, scientists hypothesize that studying a toy model will help us understand the big models. I came across such a <a href="https://www.lesswrong.com/posts/x8BbjZqooS4LFXS8Z/algzoo-uninterpreted-models-with-fewer-than-1-500-parameters">toy model</a> released by ARC, and I tried to understand their understanding.</p>

<p>As a Mechanistic Interpretability enthusiast, I was curious to study the model myself. This post is an attempt to explain the understanding in my own words, and I also think it’s beneficial to have multiple explanations of the same thing.</p>

<h2 id="problem-setup">Problem Setup</h2>
<p>ARC released multiple toy models, each trained on a different algorithmic task. I chose the model named <code class="language-plaintext highlighter-rouge">argmax2</code>, which is trained to predict the position of the second highest number in the input sequence. For example, <code class="language-plaintext highlighter-rouge">3</code> is the second largest number in the input sequence <code class="language-plaintext highlighter-rouge">[3,4]</code>. It is an RNN-based model released in several parameter sizes; to start with, we chose the one with hidden size 2 and an input sequence of length 2.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">init_state</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="n">batch_size</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="k">if</span> <span class="n">init_state</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="c1"># initial hidden state set to zero
</span>        <span class="n">h</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="nf">new_zeros</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">hidden_size</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">h</span> <span class="o">=</span> <span class="n">init_state</span>
    <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]):</span>
        <span class="n">xt</span> <span class="o">=</span> <span class="n">x</span><span class="p">[:,</span> <span class="n">t</span> <span class="p">:</span> <span class="n">t</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span>
        <span class="c1"># shape of i2h and h2h are: (2,1) and (2,2)
</span>        <span class="n">h</span> <span class="o">=</span> <span class="n">th</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">functional</span><span class="p">.</span><span class="nf">relu</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="nf">i2h</span><span class="p">(</span><span class="n">xt</span><span class="p">)</span> <span class="o">+</span> <span class="n">self</span><span class="p">.</span><span class="nf">h2h</span><span class="p">(</span><span class="n">h</span><span class="p">))</span>
    <span class="c1"># shape of output: (2,2)
</span>    <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="nf">output</span><span class="p">(</span><span class="n">h</span><span class="p">)</span>
</code></pre></div></div>

<p>The chosen <code class="language-plaintext highlighter-rouge">argmax2</code> model returns two neurons, and the neuron with the highest value marks the position of the second highest number among the inputs. The example below takes <code class="language-plaintext highlighter-rouge">[3,4]</code> as input and returns <code class="language-plaintext highlighter-rouge">[0.4,0.1]</code>, since 3 is the second highest number.</p>

\[\text{Input} =
\begin{bmatrix}
x_0 \\
x_1
\end{bmatrix} = 
\begin{bmatrix}
3 \\
4
\end{bmatrix}
\qquad
\text{Output} =
\begin{bmatrix}
0.4 \\
0.1
\end{bmatrix}\]
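
<p>To make the target concrete, here is a small sketch (my own helper, not ARC’s code) of the label that <code class="language-plaintext highlighter-rouge">argmax2</code> is trained to predict:</p>

```python
# Sketch of the argmax2 target (my own helper, not ARC's code):
# the label is the position of the second-highest entry.
def argmax2_label(xs):
    # sort positions by value, descending; pick the runner-up's position
    order = sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)
    return order[1]

print(argmax2_label([3, 4]))  # 0, since 3 is the second highest
```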

<p>We asked ourselves a question: what makes \(output_{00}&gt; output_{10}\)? It’s obvious that answering this question would tell us <strong>how the model works</strong>.</p>
<h2 id="sheer-brute-force">Sheer Brute Force</h2>

<p>To answer the question, we must know what’s going on inside the model. An RNN calculates a hidden state for each time step. The hidden state is calculated from the current input entry and the previous hidden state, with the number of time steps determined by the length of the sequence.</p>

<p><img src="/assets/images/rnn.png" alt="rnn" /></p>

<p>In our model, there are two entries in the input sequence, so we calculate two hidden states to uncover the working of the model.</p>

<h4 id="hidden-state-1">Hidden state 1:</h4>

\[h_1 = ReLU(W_{hh}.h_0 + W_{hi}.{x_0}); W_{hi} = \begin{bmatrix} hi_{00} \\ -hi_{01} \end{bmatrix}\]

<p>Here \(h_0 = 0\), since there is no previous hidden state for the current step.</p>

\[h_1 = ReLU(\begin{bmatrix} 
x_0.hi_{00} \\
-x_0.hi_{01}
\end{bmatrix}); \text{ where } ReLU(x) = max(0,x)\]

\[h_1 = \begin{cases}
\begin{bmatrix} x_0.hi_{00} \\ 0 \end{bmatrix} \text{ ; } x_0&gt; 0 \\
\begin{bmatrix} 0 \\ -x_0.hi_{01} \end{bmatrix} \text{ ; } x_0&lt; 0
\end{cases}\]

<p>The first neuron is activated when \(x_0 &gt;0\) and the second neuron is activated when \(x_0&lt;0\), while the other neuron in each case is zero. \(h_1\) itself does not give us any valuable information. However, we can see that <code class="language-plaintext highlighter-rouge">ReLU</code> turns the computation into a piecewise function based on the sign of the input.</p>
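
<p>As a sanity check, the piecewise behavior of \(h_1\) can be verified numerically. The weights below are made up (the trained model’s values differ); only their signs matter for the argument:</p>

```python
# Numeric check of the h1 piecewise behavior. hi00 and hi01 are made-up
# positive weights (the real model's values differ); only the signs matter.
def relu(v):
    return max(0.0, v)

hi00, hi01 = 1.3, 0.9

def h1(x0):
    # W_hi = [hi00, -hi01]^T applied to the scalar input x0, then ReLU
    return [relu(hi00 * x0), relu(-hi01 * x0)]

print(h1(3.0))   # first neuron active, second clamped to zero
print(h1(-3.0))  # second neuron active, first clamped to zero
```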
<h4 id="hidden-state-2">Hidden state 2:</h4>
<p>We jump directly to the result of \(h_2\) without showing the intermediate steps; including them would make this post unnecessarily long without adding any substance. The detailed steps are in <a href="https://drive.google.com/drive/folders/13ELZoA2F1h_YAj41v1xN5IpkRlHiMrSt?usp=sharing">my notes</a>, please check them if you are interested. The two cases of \(h_1\) branch further into six cases when deriving \(h_2\). It is sufficient to discuss only two of those six cases to explain the entire model, since the other cases express similar behavior.</p>

\[h_2 = \begin{cases} 
\begin{bmatrix} 
0 \\
	-1.48.x_1 + 2.09.x_0 
\end{bmatrix} \text{ ; } x_0 &gt; 0 \text{ ; } 0.7.x_1 &lt; 2.x_1 &lt; x_0 \\
\begin{bmatrix} 
1.61.x_1 - 0.58.x_0 \\
	-1.48.x_1 + 2.09.x_0 
\end{bmatrix} \text{ ; } x_0 &gt; 0 \text{ ; } 0.7.x_1 &lt; x_0 &lt; 2.x_1 \\
\text{ ... } \\ 
\text{ ... } \\
\text{ ... }
\end{cases}\]

<h4 id="first-case">First case:</h4>
<p>I plotted the first case to analyze it visually, and it unfolded a series of insights:</p>

<p style="text-align: center;"><img src="/assets/images/rnn_piecewise_case_1.png" alt="piecewise case 1" /></p>

<ul>
  <li>the first case of \(h_2\) carves out a region in the first quadrant plus the entire fourth quadrant.</li>
  <li>\(x_0\) is always greater than \(x_1\) for every coordinate belonging to that region. Pick a coordinate in the shaded region and see for yourself.</li>
  <li>in \(h_2\), the second neuron is active, while the first neuron is inactive.</li>
</ul>

<p>We can conclude that the model is learning regions where it can tell whether \(x_0\) is greater or lesser than \(x_1\). But this technique only works when one neuron is active and the other is inactive. Therefore, we’ll analyze the second case, where both neurons are active.</p>

<h4 id="second-case">Second case:</h4>
<p>The second case of the piecewise function spans a region in the first quadrant, around the line \(x_0=x_1\).</p>

<p style="text-align: center;"><img src="/assets/images/rnn_piecewise_case_2.png" alt="piecewise case 2" /></p>

<p>It is also worth noting that \(x_0\) is always lesser than \(x_1\) at any coordinate above the line \(x_0=x_1\). Conversely, on the other side of the line, \(x_0\) is always greater than \(x_1\). So, let’s plug in coordinates from both sides to gain insight into this case:</p>

<ul>
  <li>coordinate above the line \(x_0=x_1\):</li>
</ul>

\[\text{Input} =
\begin{bmatrix}
x_0 \\
x_1
\end{bmatrix} = 
\begin{bmatrix}
1.5 \\
2
\end{bmatrix}
\qquad
h_2 =
\begin{bmatrix}
4.09 \\
0.175
\end{bmatrix}\]

<ul>
  <li>coordinate below the line \(x_0=x_1\):</li>
</ul>

\[\text{Input} =
\begin{bmatrix}
x_0 \\
x_1
\end{bmatrix} = 
\begin{bmatrix}
5 \\
3
\end{bmatrix}
\qquad
h_2 =
\begin{bmatrix}
1.93 \\
6.01
\end{bmatrix}\]

<p>If the outputs at the different coordinates sparked any insight in you, then you are on the right track. In the second case of \(h_2\), the first neuron is always greater than the second neuron when \(x_0&lt;x_1\), and the inverse holds when \(x_0&gt;x_1\).</p>
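
<p>The ordering claim can be checked numerically. The sketch below reuses the coefficients from the second branch of the piecewise function above; bias terms are dropped, so the magnitudes may differ slightly from the quoted outputs, but the neuron ordering is preserved:</p>

```python
# Second branch of h2, using the coefficients from the piecewise formula
# above (rounded; bias terms are dropped, so magnitudes may differ slightly
# from the quoted outputs, but the neuron ordering still holds).
def h2_case2(x0, x1):
    return [1.61 * x1 - 0.58 * x0, -1.48 * x1 + 2.09 * x0]

above = h2_case2(1.5, 2.0)  # x0 < x1 -> first neuron larger
below = h2_case2(5.0, 3.0)  # x0 > x1 -> second neuron larger
print(above, below)
```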

<h2 id="conclusion">Conclusion</h2>
<p>From the discussed cases of \(h_2\), we can say that the model divides the space into regions where it can tell which entry of the input sequence is greater or smaller. I did this experiment to deeply understand the method described in ARC’s post. As I invested more time in it, I was intrigued by several questions, each of which is itself a separate research direction. I’m glad I did this experiment, and I’m thankful to the ARC team for releasing the toy model; otherwise, I wouldn’t have had the opportunity to work on it. The current brute force method will not work for even slightly bigger models. Next, I’m off to learn other mathematical tools, like the torus, to study bigger models!!</p>]]></content><author><name></name></author><category term="interpretability" /><summary type="html"><![CDATA[As of today, it’s not feasible to fully understand LLM. Therefore, scientists are hypothesizing that studying a toy model would help us to understand the big model. I came across such toy model released by ARC and I tried to understand their understanding.]]></summary></entry><entry><title type="html">I competed against Anthropic to know my self-worth</title><link href="https://poonai.xyz/ai/optimization/2026/01/31/i-competed-against-anthropic-to-know-my-self-worth.html" rel="alternate" type="text/html" title="I competed against Anthropic to know my self-worth" /><published>2026-01-31T00:00:00+00:00</published><updated>2026-01-31T00:00:00+00:00</updated><id>https://poonai.xyz/ai/optimization/2026/01/31/i-competed-against-anthropic-to-know-my-self-worth</id><content type="html" xml:base="https://poonai.xyz/ai/optimization/2026/01/31/i-competed-against-anthropic-to-know-my-self-worth.html"><![CDATA[<p>The whole ego competition started when I received a text message saying, <em>“Anyone here is trying this?”</em> with a blog title: <strong>Designing AI-resistant technical evaluation</strong>. 
As I read the blog, I was excited to see how far AI agents have caught up with coding tasks, and it also created a subtle fear in me of not being valuable anymore. I bet some of you have already felt that way at some point. Anthropic’s open challenge to beat Claude seemed like an invitation to test my self-worth.</p>

<p>After my initial glance at the assignment, I was genuinely impressed by the way they designed it. It contains a toy accelerator and a kernel program that needs to be optimized. Interestingly, the toy accelerator comes with tracing and debug functionality. Such a level of detail energized me to solve the problem.</p>

<h3 id="before-we-optimize-the-kernel-let-us-understand-what-the-kernel-does">Before we optimize the kernel, let us understand what the kernel does:</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">reference_kernel</span><span class="p">(</span><span class="n">t</span><span class="p">:</span> <span class="n">Tree</span><span class="p">,</span> <span class="n">inp</span><span class="p">:</span> <span class="n">Input</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">
    Reference implementation of the kernel.

    A parallel tree traversal where at each node we set
    cur_inp_val = myhash(cur_inp_val ^ node_val)
    and then choose the left branch if cur_inp_val is even.
    If we reach the bottom of the tree we wrap around to the top.
    </span><span class="sh">"""</span>
    <span class="k">for</span> <span class="n">h</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">inp</span><span class="p">.</span><span class="n">rounds</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="nf">len</span><span class="p">(</span><span class="n">inp</span><span class="p">.</span><span class="n">indices</span><span class="p">)):</span>
            <span class="n">idx</span> <span class="o">=</span> <span class="n">inp</span><span class="p">.</span><span class="n">indices</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
            <span class="n">val</span> <span class="o">=</span> <span class="n">inp</span><span class="p">.</span><span class="n">values</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
            <span class="n">val</span> <span class="o">=</span> <span class="nf">myhash</span><span class="p">(</span><span class="n">val</span> <span class="o">^</span> <span class="n">t</span><span class="p">.</span><span class="n">values</span><span class="p">[</span><span class="n">idx</span><span class="p">])</span>
            <span class="n">idx</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">idx</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span> <span class="k">if</span> <span class="n">val</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span> <span class="mi">2</span><span class="p">)</span>
            <span class="n">idx</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">if</span> <span class="n">idx</span> <span class="o">&gt;=</span> <span class="nf">len</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">values</span><span class="p">)</span> <span class="k">else</span> <span class="n">idx</span>
            <span class="n">inp</span><span class="p">.</span><span class="n">values</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">val</span>
            <span class="n">inp</span><span class="p">.</span><span class="n">indices</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">idx</span>
</code></pre></div></div>
<p>It is a good old binary tree traversal with a twist. At each node, the current value is hashed together with the node’s value. The evenness of the hash determines the path of the traversal: if the hash is even, the program traverses left; otherwise, it traverses right. The same traversal logic is replicated for a list of values. Please stare at the Python code above to get a cleaner intuition; it is easier to understand the code than a paragraph.</p>
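
<p>In other words, the index update is heap-style child addressing with a wrap-around at the bottom of the tree. A minimal sketch of just that step (the helper name is mine):</p>

```python
# The index update implements heap-style child addressing (my paraphrase
# of the reference kernel): children of node idx live at 2*idx+1 and
# 2*idx+2, and we wrap to the root when we fall off the bottom of the tree.
def next_index(idx, val, n_nodes):
    idx = 2 * idx + (1 if val % 2 == 0 else 2)
    return 0 if idx >= n_nodes else idx

print(next_index(0, 4, 7))  # even hash -> left child, index 1
print(next_index(0, 5, 7))  # odd hash  -> right child, index 2
print(next_index(3, 4, 7))  # 2*3+1 = 7 >= 7 -> wrap back to the root, 0
```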

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="nb">round</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">rounds</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">batch_size</span><span class="p">):</span>
        <span class="n">i_const</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">scratch_const</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
        <span class="c1"># idx = mem[inp_indices_p + i]
</span>        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">alu</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">+</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_addr</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">scratch</span><span class="p">[</span><span class="sh">"</span><span class="s">inp_indices_p</span><span class="sh">"</span><span class="p">],</span> <span class="n">i_const</span><span class="p">)))</span>
        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">load</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">load</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_idx</span><span class="p">,</span> <span class="n">tmp_addr</span><span class="p">)))</span>
        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">debug</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">compare</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_idx</span><span class="p">,</span> <span class="p">(</span><span class="nb">round</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="sh">"</span><span class="s">idx</span><span class="sh">"</span><span class="p">))))</span>
        <span class="c1"># val = mem[inp_values_p + i]
</span>        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">alu</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">+</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_addr</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">scratch</span><span class="p">[</span><span class="sh">"</span><span class="s">inp_values_p</span><span class="sh">"</span><span class="p">],</span> <span class="n">i_const</span><span class="p">)))</span>
        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">load</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">load</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_val</span><span class="p">,</span> <span class="n">tmp_addr</span><span class="p">)))</span>
        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">debug</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">compare</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_val</span><span class="p">,</span> <span class="p">(</span><span class="nb">round</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="sh">"</span><span class="s">val</span><span class="sh">"</span><span class="p">))))</span>
        <span class="p">..</span><span class="bp">...</span>
        <span class="p">..</span><span class="bp">...</span>
        <span class="c1"># mem[inp_indices_p + i] = idx
</span>        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">alu</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">+</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_addr</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">scratch</span><span class="p">[</span><span class="sh">"</span><span class="s">inp_indices_p</span><span class="sh">"</span><span class="p">],</span> <span class="n">i_const</span><span class="p">)))</span>
        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">store</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">store</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_addr</span><span class="p">,</span> <span class="n">tmp_idx</span><span class="p">)))</span>
        <span class="c1"># mem[inp_values_p + i] = val
</span>        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">alu</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">+</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_addr</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">scratch</span><span class="p">[</span><span class="sh">"</span><span class="s">inp_values_p</span><span class="sh">"</span><span class="p">],</span> <span class="n">i_const</span><span class="p">)))</span>
        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">store</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">store</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_addr</span><span class="p">,</span> <span class="n">tmp_val</span><span class="p">)))</span>
</code></pre></div></div>
<p>The provided kernel is an exact reimplementation of the Python code without any vectorization. At this point, I was deluding myself into thinking that vectorization alone would solve the problem. I discovered my disappointment only after vectorizing the program: it did not even cut the compute cycles in half.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="nf">int</span><span class="p">(</span><span class="n">batch_size</span> <span class="o">/</span> <span class="n">VLEN</span><span class="p">)):</span>
    <span class="n">i_const</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">scratch_const</span><span class="p">(</span><span class="n">i</span> <span class="o">*</span> <span class="n">VLEN</span><span class="p">)</span>
    <span class="nf">batch_slots</span><span class="p">(</span>
        <span class="sh">"</span><span class="s">alu</span><span class="sh">"</span><span class="p">,</span>
        <span class="p">(</span><span class="sh">"</span><span class="s">+</span><span class="sh">"</span><span class="p">,</span> <span class="n">indices_index</span> <span class="o">+</span> <span class="n">i</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">scratch</span><span class="p">[</span><span class="sh">"</span><span class="s">inp_indices_p</span><span class="sh">"</span><span class="p">],</span> <span class="n">i_const</span><span class="p">),</span>
    <span class="p">)</span>
    <span class="nf">batch_slots</span><span class="p">(</span>
        <span class="sh">"</span><span class="s">alu</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">+</span><span class="sh">"</span><span class="p">,</span> <span class="n">value_index</span> <span class="o">+</span> <span class="n">i</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">scratch</span><span class="p">[</span><span class="sh">"</span><span class="s">inp_values_p</span><span class="sh">"</span><span class="p">],</span> <span class="n">i_const</span><span class="p">)</span>
    <span class="p">)</span>
<span class="nf">flush_slots</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Before
<span class="o">{</span><span class="s1">'valu'</span>: <span class="o">[(</span><span class="s1">'+'</span>, 600, 16, 24<span class="o">)]}</span>
<span class="o">{</span><span class="s1">'valu'</span>: <span class="o">[(</span><span class="s1">'+'</span>, 640, 16, 64<span class="o">)]}</span>

// After
<span class="o">{</span><span class="s1">'valu'</span>: <span class="o">[(</span><span class="s1">'+'</span>, 600, 16, 24<span class="o">)</span>, <span class="o">(</span><span class="s1">'+'</span>, 608, 16, 32<span class="o">)</span>,<span class="o">(</span><span class="s1">'+'</span>, 616, 16, 40<span class="o">)</span>, <span class="o">(</span><span class="s1">'+'</span>, 624, 16, 48<span class="o">)</span>,
<span class="o">(</span><span class="s1">'+'</span>, 632, 16, 56<span class="o">)</span>, <span class="o">(</span><span class="s1">'+'</span>, 640, 16, 64<span class="o">)]}</span>
</code></pre></div></div>

<p>The next obvious optimization is to pack multiple independent vectorized operations into one VLIW bundle. VLIW operations are energy efficient and are also a key reason behind Google’s TPU adoption. Essentially, we can batch six operations into one cycle. There were multiple independent operations that could be batched together. One of them is the <code class="language-plaintext highlighter-rouge">indices</code> and <code class="language-plaintext highlighter-rouge">values</code> offset calculation.</p>
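
<p>As a toy illustration of the idea (my own sketch, not the contest’s scheduler API), packing independent operations into fixed-width bundles looks like this:</p>

```python
# Toy VLIW packing sketch (my own illustration, not the contest's API):
# greedily pack up to `width` independent ops into one bundle per cycle.
def pack_bundles(ops, width=6):
    return [ops[i:i + width] for i in range(0, len(ops), width)]

# 14 independent ops fit in 3 cycles instead of 14: bundles of 6, 6, and 2.
bundles = pack_bundles(list(range(14)), width=6)
print(len(bundles))
```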

<p>A tiny detour to say something else that is also useful here. I was quietly studying
undergrad math for 8 months without any sort of connection to the tech world. I stayed away from
GitHub, Reddit, and anything else that would pull me back into the tech echo chamber. During
that time, I noticed something most mathematicians agree on: <strong>many problems are
solved by rewriting the problem into a different form;</strong> <code class="language-plaintext highlighter-rouge">a^2-1</code> can be written as <code class="language-plaintext highlighter-rouge">(a+1)(a-1)</code>.
That is your key to optimizing further, and you can even beat Claude in this game by
understanding the structure of the tree.</p>

<h3 id="starting-line">Starting Line</h3>
<p>Interviewing candidates at Anthropic start at 18,532 cycles, because most LLMs can figure out vectorization on the first pass; hence, Anthropic adjusted the baseline to reflect that.</p>

<p>Now, our compute cycle count would also be around 19,000. This marks the official starting line of the game; whatever cool things we did so far were just a warmup.</p>

<p>Revealing the solution would defeat the purpose of the assignment. However, the purpose has already been defeated: some bad actors have released parts of, or even the full, solution. You can check that if you want. The best humans have gone beyond Claude’s performance. So, decide whether you want to be a mere spectator or the adrenaline-rushed player in the game.</p>

<p>The released version’s baseline is 147,734 cycles. My kernel version took 3,465 cycles, and Claude’s best version is at 1,487 cycles, which is 1,978 cycles fewer than mine. That makes Claude’s version about 2.3x faster; this revelation made the protective part of my ego think otherwise.</p>

<p>Before you all close the tab, I want you to watch this <a href="https://www.youtube.com/shorts/1MOzJuDF1cc">YouTube Short</a>, which is closely related to this blog’s context.</p>]]></content><author><name></name></author><category term="AI" /><category term="optimization" /><summary type="html"><![CDATA[The whole ego competition started when I received a text message saying, “Anyone here is trying this?” with a blog title: Designing AI-resistant technical tech evaluation. As I read the blog, I was excited to see how far AI agents have caught up with coding tasks, and it also created a subtle fear in me of not being valuable anymore. I bet some of you have already felt that way at some point. Anthropic’s open challenge to beat Claude seemed like an invitation to test my self-worth.]]></summary></entry><entry><title type="html">Building a Visual Language Model from scratch</title><link href="https://poonai.xyz/ai/image/llm/natural-language-processing/2025/10/02/building-visual-language-model.html" rel="alternate" type="text/html" title="Building a Visual Language Model from scratch" /><published>2025-10-02T11:12:35+00:00</published><updated>2025-10-02T11:12:35+00:00</updated><id>https://poonai.xyz/ai/image/llm/natural-language-processing/2025/10/02/building-visual-language-model</id><content type="html" xml:base="https://poonai.xyz/ai/image/llm/natural-language-processing/2025/10/02/building-visual-language-model.html"><![CDATA[<p>My initial aim was to build a document processing model, but the idea was far-fetched for my skill at the time. So, I settled for building a toy version of a visual language model to better understand VLMs. I’m documenting my intuition for the benefit of myself and others. My model will receive an image as input and return its caption as output. Luckily, I found a <a href="https://huggingface.co/datasets/visual-layer/imagenet-1k-vl-enriched">dataset</a> with images and their captions to train on.</p>

<p>Instead of training all the components from scratch, I’m going to use pre-existing models and customize them to our needs. You could ask why I mentioned <code class="language-plaintext highlighter-rouge">scratch</code> in the title and say the title is misleading. In general, this is how VLMs are built. Correct me if I’m wrong.</p>

<p>The off-the-shelf models are:</p>
<ul>
  <li><a href="https://huggingface.co/timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k">ViT - Vision Transformer Image Classification</a></li>
  <li><a href="https://huggingface.co/openai-community/gpt2">GPT2 - A Text Generation Model</a></li>
</ul>

<h2 id="vit">ViT</h2>
<p>This model is an image embedding model: it returns an embedding tensor representation of an image. Similar images sit closer together in this embedding space. For example, the vector distance between a lion and a cat will be shorter than the distance between a lion and a car, since lions and cats are both animals.</p>

<p><img src="/assets/images/image_embedding.png" alt="image embedding" /></p>
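<p>To make the distance intuition concrete, here is a toy sketch using made-up 3-dimensional vectors (real ViT embeddings are 512-dimensional); the numbers are illustrative, not actual model outputs.</p>

```python
import numpy as np

# Hypothetical embeddings; in practice these come from the ViT model.
lion = np.array([0.9, 0.8, 0.1])
cat = np.array([0.85, 0.75, 0.15])  # another animal: close to lion
car = np.array([0.1, 0.2, 0.95])    # not an animal: far from lion

def distance(a, b):
    # Euclidean distance between two embedding vectors
    return float(np.linalg.norm(a - b))

print(distance(lion, cat))  # small
print(distance(lion, car))  # large
```

<p>The absolute numbers don’t matter; only the relative distances do.</p>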

<h2 id="gpt2">GPT2</h2>
<p>This model is a text generation model. Given a sequence of words, it predicts the words that follow.</p>

<table>
  <thead>
    <tr>
      <th>Input</th>
      <th>Output</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">The sky is</code></td>
      <td><code class="language-plaintext highlighter-rouge">The sky is blue</code></td>
    </tr>
  </tbody>
</table>
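<p>GPT2 itself is far too large to show inline, but its input/output behavior can be mimicked with a toy greedy predictor. The lookup table below is made up for illustration; real GPT2 predicts a probability distribution over roughly 50k tokens at each step.</p>

```python
# Toy stand-in for GPT2: a lookup table of most likely next words.
next_word = {
    ("the", "sky", "is"): "blue",
}

def generate(words, steps=1):
    # Greedily append the most likely next word, like greedy decoding in GPT2.
    words = list(words)
    for _ in range(steps):
        key = tuple(words[-3:])
        if key not in next_word:
            break
        words.append(next_word[key])
    return " ".join(words)

print(generate(["the", "sky", "is"]))  # the sky is blue
```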

<h2 id="projection-layer">Projection layer</h2>

<p><img src="/assets/images/vlm_architecture.png" alt="vlm architecture" /></p>

<p>This is the key component of our VLM. This layer transforms the image embedding output into GPT2’s textual embedding space. In other words, we convert the image into an intermediate representation from which GPT2 can produce meaningful output.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Projection</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">in_features</span><span class="p">,</span> <span class="n">out_features</span><span class="p">):</span>
        <span class="nf">super</span><span class="p">().</span><span class="nf">__init__</span><span class="p">()</span>
        <span class="n">self</span><span class="p">.</span><span class="n">network</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="nc">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="nc">Linear</span><span class="p">(</span><span class="n">in_features</span><span class="p">,</span> <span class="n">in_features</span> <span class="o">*</span> <span class="mi">3</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="nc">GELU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="nc">Linear</span><span class="p">(</span><span class="n">in_features</span> <span class="o">*</span> <span class="mi">3</span><span class="p">,</span> <span class="n">out_features</span><span class="p">),</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="nb">input</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="nf">network</span><span class="p">(</span><span class="nb">input</span><span class="p">)</span>
</code></pre></div></div>

<p>The output tensor size of the ViT model is <code class="language-plaintext highlighter-rouge">512</code> and the input tensor size of the GPT2 model is <code class="language-plaintext highlighter-rouge">768</code>. The projection layer converts the ViT output tensor size into the GPT2 input tensor size. With training, image tensors are transformed into tensors that GPT2 understands.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">self</span><span class="p">.</span><span class="n">projection</span> <span class="o">=</span> <span class="nc">Projection</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span><span class="mi">768</span><span class="p">)</span>
</code></pre></div></div>
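<p>As a quick sanity check of the shapes, here is a numpy stand-in for the projection layer with random, untrained weights; it only demonstrates the 512 → 768 conversion, not the learned mapping.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x):
    # tanh approximation of GELU, as used in GPT2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Random weights standing in for the trained Projection(512, 768)
w1 = rng.standard_normal((512, 512 * 3)) * 0.02
w2 = rng.standard_normal((512 * 3, 768)) * 0.02

def projection(x):
    # Linear(512 -> 1536), GELU, Linear(1536 -> 768)
    return gelu(x @ w1) @ w2

vit_output = rng.standard_normal((1, 512))  # one image embedding from ViT
gpt2_input = projection(vit_output)
print(gpt2_input.shape)  # (1, 768)
```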

<h2 id="conclusion">Conclusion</h2>

<p>I’ve put all the components together and trained the model with the dataset. Here is the result:</p>

<p>Input Image:</p>

<p><img src="/assets/images/boy_holding_fish.png" alt="A boy holding a fish" /></p>

<p>Output Text:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>a boy holding a fish in the woods
</code></pre></div></div>

<p>This project gave me a solid intuition for building a VLM. In the upcoming days I’m aspiring to build a document processing VLM that beats state-of-the-art (SOTA) benchmarks. You can see
the entire code <a href="https://github.com/poonai/imagenet-caption">here</a></p>]]></content><author><name></name></author><category term="ai" /><category term="image" /><category term="llm" /><category term="natural-language-processing" /><summary type="html"><![CDATA[My initial aim was to build a document processing model. But the idea was far fetched for my skill at the time. So, I settled down for building a toy version of visual language model for better understanding of VLM. I’m documenting my intuition for the benefit of myself and others. My model will receive image as an input and return its caption as an output. Luckily, I found a dataset with image and it’s caption to train.]]></summary></entry><entry><title type="html">Context Pruning in conversational agents</title><link href="https://poonai.xyz/ai/context-engineering/llm/context-pruning/2025/09/08/context-pruning-in-conversational-agents.html" rel="alternate" type="text/html" title="Context Pruning in conversational agents" /><published>2025-09-08T11:12:35+00:00</published><updated>2025-09-08T11:12:35+00:00</updated><id>https://poonai.xyz/ai/context-engineering/llm/context-pruning/2025/09/08/context-pruning-in-conversational-agents</id><content type="html" xml:base="https://poonai.xyz/ai/context-engineering/llm/context-pruning/2025/09/08/context-pruning-in-conversational-agents.html"><![CDATA[<h1 id="my-journey-to-building-agentic-apps">My Journey to Building Agentic Apps</h1>

<p>My childhood dream of having a personal <strong>J.A.R.V.I.S</strong> has come true. The recent advancements in <strong>LLMs (Large Language Models)</strong> made me, like everyone else, take a fresh look at that dream. As an industry, we are all figuring out how to build agentic apps <em>“the right way.”</em></p>

<p>The good news is there is no <em>“the right way”</em> yet, and the bad news is the same. Since there is no right way, we are seeing a huge influx of frameworks showing up every day, and I’m skeptical about the practical usage of such frameworks. So, my action plan is to use a minimal framework and borrow concepts from those frameworks to build AI apps.</p>

<hr />

<h2 id="why-baml">Why BAML?</h2>

<p>The minimal framework is <strong>BAML</strong>. I’ve been using BAML for the past couple of months and am delighted with the developer experience it offers.</p>

<p>It offers <strong>function-style LLM calling</strong>. All you have to do is define your prompt, input, and output format in BAML language. Then BAML generates type-safe functions that take typed input and return typed output. This is like any other function you use every day.</p>

<p>As you read the blog, you’ll understand how easy BAML is — or you can check their <a href="https://docs.boundaryml.com/guide/introduction/what-is-baml">docs</a> as well.</p>

<hr />

<h2 id="context-engineering">Context Engineering</h2>

<p><strong>Context engineering</strong> is an important skill in terms of agentic application building. It is essentially providing the right context to steer the LLM in the right direction. There are several ways to do context engineering.</p>

<p>However, I want to demonstrate <strong>context pruning</strong> in conversational agents. An agent that makes a lot of tool calls is ideal for demonstrating the technique. So, I built an agent that solves mathematical equations by stitching together multiple tool results. The source code can be seen <a href="https://github.com/poonai/context-trimming-sample">here</a>.</p>

<p><strong>Example:</strong><br />
Solve the quadratic equation: <code class="language-plaintext highlighter-rouge">x^2 + 5*x + 6 = 0</code> and find its derivative.</p>

<p>To answer this question:</p>
<ul>
  <li>it has to solve an equation with a tool call</li>
  <li>it has to find a derivative with another tool call</li>
</ul>

<p>Basically, the agent has to call two tools sequentially to answer the user query. The agent can be implemented with a single prompt. Here is the respective BAML code and prompt:</p>

<pre><code class="language-baml">class AgentResponse {
    tools QuadraticSolver | QuadraticDerivative | QuadraticEvaluator | MessageToUser
}

function MathChat(message_history: string[]) -&gt; AgentResponse {
    client OpenRouter
    prompt #"
        You're a helpful Math AI assistant. You have tools to solve equations related to quadratic 
        equations. The user query could be simple, or complex that requires you to take multiple turns
        between you and tools to resolve the user query.

        Use message_to_user tools to reply or ask clarification questions to the user.

        &lt;CONVERSATION HISTORY&gt;
        
        &lt;/CONVERSATION HISTORY&gt;
       
        
    "#
}
</code></pre>

<p>BAML will generate a type-safe Python function where you can pass <code class="language-plaintext highlighter-rouge">message_history</code> as input and <code class="language-plaintext highlighter-rouge">AgentResponse</code> as output.</p>

<p><strong>Example:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">baml_client</span> <span class="kn">import</span> <span class="n">b</span>

<span class="c1"># Example of using the BAML-generated MathChat function
</span><span class="n">message_history</span> <span class="o">=</span> <span class="p">[</span>
    <span class="sh">"</span><span class="s">{</span><span class="sh">'</span><span class="s">role</span><span class="sh">'</span><span class="s">: </span><span class="sh">'</span><span class="s">user</span><span class="sh">'</span><span class="s">, </span><span class="sh">'</span><span class="s">msg</span><span class="sh">'</span><span class="s">: </span><span class="sh">'</span><span class="s">Solve the quadratic equation: x^2 + 5*x + 6 = 0</span><span class="sh">'</span><span class="s">}</span><span class="sh">"</span>
<span class="p">]</span>

<span class="c1"># Call the BAML-generated function
</span><span class="n">agent_response</span> <span class="o">=</span> <span class="n">b</span><span class="p">.</span><span class="nc">MathChat</span><span class="p">(</span><span class="n">message_history</span><span class="p">)</span>

<span class="c1"># The agent_response will contain a tool call, which could be:
# - QuadraticSolver: To solve the equation
# - QuadraticDerivative: To find the derivative
# - QuadraticEvaluator: To evaluate the equation at a specific value
# - MessageToUser: To respond directly to the user
</span></code></pre></div></div>

<p>The AI response could be a tool call or a response to the user. The AI responds based on the feedback obtained from the external system. All the interactions are stored in <code class="language-plaintext highlighter-rouge">message_history</code> and used as context to decide the flow of the conversation.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">chat</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">user_message</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="c1"># Add user message to history
</span>    <span class="n">self</span><span class="p">.</span><span class="n">message_history</span><span class="p">.</span><span class="nf">append</span><span class="p">({</span>
        <span class="sh">'</span><span class="s">role</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">user</span><span class="sh">'</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">msg</span><span class="sh">'</span><span class="p">:</span> <span class="n">user_message</span>
    <span class="p">})</span>
    
    <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
        <span class="c1"># Get response from BAML MathChat
</span>        <span class="n">agent_response</span> <span class="o">=</span> <span class="n">b</span><span class="p">.</span><span class="nc">MathChat</span><span class="p">([</span><span class="nf">str</span><span class="p">(</span><span class="n">message</span><span class="p">)</span> <span class="k">for</span> <span class="n">message</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">message_history</span><span class="p">])</span>
        
        <span class="c1"># Use the tool and get the response
</span>        <span class="n">tool_response</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">use_tool</span><span class="p">(</span><span class="n">agent_response</span><span class="p">)</span>
               
        <span class="c1"># If it's a direct message to the user, return it
</span>        <span class="k">if</span> <span class="n">tool_response</span><span class="p">[</span><span class="sh">'</span><span class="s">role</span><span class="sh">'</span><span class="p">]</span> <span class="o">==</span> <span class="sh">'</span><span class="s">assistant</span><span class="sh">'</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">tool_response</span><span class="p">[</span><span class="sh">'</span><span class="s">msg</span><span class="sh">'</span><span class="p">]</span>
</code></pre></div></div>

<hr />

<h2 id="example-message-history">Example Message History</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>User: Solve the quadratic equation: x² + 5x + 6 = 0  
Assistant (PROCESSING): I need to solve this quadratic equation. [Calling `quadratic_solver`]  
Tool: Solved equation: x² + 5x + 6 = 0. Roots: [-3, -2]  
Assistant (COMPLETED): The roots of the quadratic equation x² + 5x + 6 = 0 are x = -3 and x = -2.  

User: Find the derivative of: 2x² + 4x − 6  
Assistant (PROCESSING): I need to find the derivative of this quadratic equation. [Calling `quadratic_derivative`]  
Tool: Derivative of equation: 2x² + 4x − 6. Result: 4x + 4  
Assistant (COMPLETED): The derivative of the quadratic equation 2x² + 4x − 6 is 4x + 4.  
</code></pre></div></div>
<hr />

<h2 id="pruning-the-context">Pruning the Context</h2>

<p>The interaction between the tool and the AI is not always needed to answer future user questions. Since the AI’s final reply already encapsulates the relevant context, tool-related messages can be removed once they are no longer relevant.</p>

<p>However, we can’t always prune immediately. The AI might still be clarifying missing information before solving. Intermediate tool-related messages should be removed <strong>only when they are no longer relevant</strong>.</p>
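<p>The crudest form of pruning can be sketched as a filter that drops tool calls and tool results once a task is completed. The message format follows the history shown earlier; the function name is my own, not from the sample repo.</p>

```python
def prune_tool_messages(history):
    """Drop tool calls and tool results, keeping only user messages and
    final assistant replies (the ones sent via message_to_user)."""
    pruned = []
    for msg in history:
        if msg["role"] == "tool":
            continue  # raw tool output: already summarized in the final reply
        if msg["role"] == "assistant" and msg.get("tool_name") not in (None, "message_to_user"):
            continue  # intermediate tool-call decision
        pruned.append(msg)
    return pruned

history = [
    {"role": "user", "msg": "solve x^2 + 5*x + 6 = 0"},
    {"role": "assistant", "msg": "solving", "tool_name": "quadratic_solver"},
    {"role": "tool", "msg": "Roots: [-3, -2]"},
    {"role": "assistant", "msg": "The roots are -3 and -2.", "tool_name": "message_to_user"},
]
print(prune_tool_messages(history))  # keeps only the first and last message
```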

<h2 id="clarifying-conversation-example">Clarifying Conversation Example</h2>

<p>Sometimes the AI must clarify before solving. Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>User: Integrate this equation  
Assistant (PROCESSING): Sure! Could you provide the equation you want me to integrate?  

User: x² + 3x  
Assistant (PROCESSING): I need to integrate this equation. [Calling `integral_solver`]  
Tool: Equation received: x² + 3x. Integration requires clarification: definite or indefinite?  
Assistant (PROCESSING): Do you want me to compute the definite integral (with limits) or the indefinite integral of x² + 3x?  

User: Indefinite integral  
Assistant (PROCESSING): I need to compute the indefinite integral of x² + 3x. [Calling `integral_solver`]  
Tool: Result of ∫(x² + 3x) dx = (x³)/3 + (3x²)/2 + C  
Assistant (COMPLETED): The indefinite integral of x² + 3x is (x³)/3 + (3x²)/2 + C. 
</code></pre></div></div>
<hr />

<h2 id="task-status">Task Status</h2>

<p>When to prune the context is decided based on the current phase of the agent: it is either in the <strong>PROCESSING</strong> phase or the <strong>COMPLETED</strong> phase. We’ll let the LLM itself tell us by adding a <code class="language-plaintext highlighter-rouge">task_status</code> flag to <code class="language-plaintext highlighter-rouge">message_to_user</code> tool calls.</p>

<pre><code class="language-baml">enum TaskStatus {
    COMPLETED 
    PROCESSING
}

class MessageToUser {
    type "message_to_user" @description(#"
        DESCRIPTION: MessageToUser is used to respond to the user
    "#)

    response string @description(#"
        The assistant response to the user
    "#)

    task_status TaskStatus
}
</code></pre>

<p><strong>Example pruning logic:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">self</span><span class="p">.</span><span class="n">context_trimming</span> <span class="ow">and</span> <span class="n">tools</span><span class="p">.</span><span class="n">task_status</span> <span class="o">==</span> <span class="sh">"</span><span class="s">COMPLETED</span><span class="sh">"</span><span class="p">:</span>
                
    <span class="c1"># Convert message history to string list for BAML function
</span>    <span class="n">message_history_str</span> <span class="o">=</span> <span class="p">[</span><span class="nf">str</span><span class="p">(</span><span class="n">message</span><span class="p">)</span> <span class="k">for</span> <span class="n">message</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">message_history</span><span class="p">]</span>
                
    <span class="c1"># Call SummarizeContext BAML function
</span>    <span class="n">summarized_context</span> <span class="o">=</span> <span class="n">b</span><span class="p">.</span><span class="nc">SummarizeContext</span><span class="p">(</span><span class="n">message_history_str</span><span class="p">)</span>
                
    <span class="c1"># Replace message history with summarized context
</span>    <span class="n">self</span><span class="p">.</span><span class="n">message_history</span> <span class="o">=</span> <span class="n">summarized_context</span>
</code></pre></div></div>

<hr />

<h2 id="summarization">Summarization</h2>

<p>The pruning logic could be as simple as removing all tool calls, or we can use an LLM to <strong>summarize the context</strong> for us.</p>

<pre><code class="language-baml">function SummarizeContext(message_history: string[]) -&gt; string []{
    client OpenRouter
    prompt #"
         Your job is to summarize the user conversation to reduce the token length.
         Ideas to summarize:
         - remove intermediate tool calls and tool responses 
         - summarize the user question and response in concise form

         &lt;EXAMPLE INPUT&gt;
           {'role': 'user', 'msg': 'solve x^2 - 5*x + 6 =0 and find derivative'}  
           {'role': 'assistant', 'msg': 'I need to solve this quadratic equation', 'tool_name': 'quadratic_solver', 'tool_args': {'equation': 'x^2 - 5*x + 6 = 0'}}  
           {'role': 'tool', 'msg': 'Solved equation: x^2 - 5*x + 6 = 0. Roots: [2, 3]', 'tool_name': 'quadratic_solver', 'metadata': {'equation': 'x^2 - 5*x + 6 = 0', 'result': '[2, 3]'}}  
           {'role': 'assistant', 'msg': 'I need to find the derivative of this quadratic equation', 'tool_name': 'quadratic_derivative', 'tool_args': {'equation': 'x^2 - 5*x + 6 = 0'}}  
           {'role': 'tool', 'msg': 'Derivative of equation: x^2 - 5*x + 6 = 0. Result: 2*x - 5', 'tool_name': 'quadratic_derivative', 'metadata': {'equation': 'x^2 - 5*x + 6 = 0', 'result': '2*x - 5'}}
           {'role': 'assistant', 'msg': 'The roots of the equation x^2 - 5*x + 6 = 0 are 2 and 3. The derivative of the equation is 2*x - 5. If you want me to evaluate the equation or the derivative at a specific x value, please let me know.', 'tool_name': 'message_to_user'}    
         &lt;/EXAMPLE INPUT&gt;

        &lt;EXAMPLE OUTPUT&gt;
           {'role': 'user', 'msg': 'solve x^2 - 5*x + 6 =0 and find derivative'}  
           {'role': 'assistant', 'msg': 'The roots of the equation x^2 - 5*x + 6 = 0 are 2 and 3. The derivative of the equation is 2*x - 5.', 'tool_name': 'message_to_user'}    
        &lt;/EXAMPLE OUTPUT&gt;

        &lt;BAD OUTPUT&gt;
          "User asked to solve x^2 - 5*x + 6 = 0, find its derivative, and the squares of the roots; assistant provided roots (2, 3), derivative (2*x - 5), and squares (4, 9)."
        &lt;/BAD OUTPUT&gt;

        &lt;GOOD OUTPUT&gt;
        {'role': 'user', 'msg': 'solve x^2 - 5*x + 6 =0 and find derivative'}
        {'role': 'assistant', 'msg': 'roots (2, 3), derivative (2*x - 5)', 'tool_name': 'message_to_user'} 
        &lt;/GOOD OUTPUT&gt;
        
        &lt;INPUT&gt;
        
        &lt;/INPUT&gt;
        
    "#
} 
</code></pre>

<hr />

<h2 id="token-usage-comparison">Token Usage Comparison</h2>

<p>Let’s <a href="https://github.com/poonai/context-trimming-sample/blob/main/benchmark.py">measure</a> the token usage with context pruning and without context pruning.</p>

<table>
  <thead>
    <tr>
      <th>Agent Type</th>
      <th>Input Tokens</th>
      <th>Output Tokens</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Smart Agent (with context pruning)</strong></td>
      <td>2333</td>
      <td>309</td>
    </tr>
    <tr>
      <td><strong>Agent (without context pruning)</strong></td>
      <td>4883</td>
      <td>368</td>
    </tr>
  </tbody>
</table>

<p>Lower is better. By pruning unnecessary context, we reduced total token usage by roughly 50% (from 5,251 to 2,642 tokens) while getting the same output.</p>

<p>Fewer tokens bring the following advantages:</p>
<ul>
  <li>less hallucination</li>
  <li>less cost</li>
  <li>less response time</li>
</ul>]]></content><author><name></name></author><category term="ai" /><category term="context-engineering" /><category term="llm" /><category term="context-pruning" /><summary type="html"><![CDATA[My Journey to Building Agentic Apps]]></summary></entry><entry><title type="html">Simplest backpropagation explainer without chain rule</title><link href="https://poonai.xyz/2025/04/27/simplest-backpropagation-explainer-without-chain-rule.html" rel="alternate" type="text/html" title="Simplest backpropagation explainer without chain rule" /><published>2025-04-27T00:00:00+00:00</published><updated>2025-04-27T00:00:00+00:00</updated><id>https://poonai.xyz/2025/04/27/simplest-backpropagation-explainer-without-chain-rule</id><content type="html" xml:base="https://poonai.xyz/2025/04/27/simplest-backpropagation-explainer-without-chain-rule.html"><![CDATA[<p>Neural networks learn to predict through backpropagation. This article aims to help you build a solid intuition about the concept using a simple example. The ideas we learn here can be extended to bigger neural
networks. I assume that you already know how a feed-forward neural network works.</p>

<p>Before reading further, take a pen and paper. The calculations in this article could be done in your head, but I still want you to do them by hand.</p>

<blockquote>
  <p>“Mathematics is not a spectator sport.” — George Pólya</p>
</blockquote>

<h2 id="calculus-the-art-of-change-">Calculus: The Art of Change</h2>

<p>Differentiation is used throughout backpropagation, so it’s crucial for us to revise calculus before reaching our goal. As the title suggests, the derivative tells us how a change in the value of a variable affects the result. In the context of neural networks: how a change in the weights affects the network’s output.</p>

<p>Let’s look at a simple equation:</p>

<p>[
y = x^3
]</p>

<p>If we plug in (x = 2), we get:</p>

<p>[
y = 2^3 = 8
]</p>

<p>Now, what happens if we slightly increase (x) by (0.01)? Instead of calculating everything again, we can use the derivative.</p>

<p>The derivative of (y) is:</p>

<p>[
\frac{dy}{dx} = 3x^2
]</p>

<p>[
dy = 3x^2 \times dx
]</p>

<p>Substituting (x = 2) and (dx = 0.01):</p>

<p>[
dy = 3(2)^2 \times 0.01 = 12 \times 0.01 = 0.12
]</p>

<p>So, if (x) increases by (0.01), (y) should increase by about (0.12), giving approximately (8.12).</p>

<p>Let’s check it:</p>

<ul>
  <li>At (x = 2), (y = 8).</li>
  <li>At (x = 2.01), plugging into the original equation:</li>
</ul>

<p>[
y = (2.01)^3 = 8.120601
]</p>

<p>The actual value is (8.120601), which is very close to our estimate of (8.12).</p>

<p><strong>Note:</strong> The derivative is a good approximation for small changes, but it does not work well for bigger ones. Curious? Plug in (dx = 0.5) and see for yourself.</p>
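<p>The arithmetic above, and the breakdown for larger steps, can be checked with a few lines of Python:</p>

```python
def y(x):
    return x ** 3

def dy(x, dx):
    # linear estimate of the change in y, using the derivative 3x^2
    return 3 * x ** 2 * dx

# small step: the estimate is very close to the true change
print(y(2.01) - y(2))  # ~0.1206
print(dy(2, 0.01))     # 0.12

# bigger step: the estimate drifts noticeably
print(y(2.5) - y(2))   # 7.625
print(dy(2, 0.5))      # 6.0
```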

<h2 id="no-hidden-layer">No hidden layer</h2>

<p>It took me almost two days to understand backpropagation clearly.
The idea finally clicked when I removed the hidden layer and made it a simple one-to-one network.
We’ll take the same route to build up intuition, and later we can stack hidden layers to play with multiple weights.</p>

<p><img src="/assets/images/one-one.png" alt="one-one-network" /></p>

<p>For this simple network, we’ll consider the following parameters:</p>

<ul>
  <li>
    <p>Input ( x = 2 )</p>
  </li>
  <li>
    <p>Weight ( w = 4 )</p>
  </li>
  <li>
    <p>Target output ( y = 10 )</p>
  </li>
</ul>

<p>The prediction formula is:</p>

<p>[
\hat{y} = x \times w
]
Substituting the values:</p>

<p>[
\hat{y} = 2 \times 4 = 8
]
Let’s define a cost function to determine the error rate:</p>

<p>[
\text{Cost} = \hat{y} - y = (x \times w) - 10
]
[
\text{Cost} = (2 \times 4) - 10 = 8 - 10 = -2
]
When the cost reaches zero, the predicted output matches the target output.
But in our case, we are off by 2 units.</p>

<h2 id="how-do-we-decrease-the-cost">How do we decrease the cost?</h2>

<p>To reduce the cost, we need to tweak the weight parameter. However, randomly adjusting weights won’t help — it would be like searching for a needle in a haystack. Instead, we use the derivative to understand how the weight affects the cost.</p>

<p>[
\frac{dC}{dw} = x = 2
]</p>

<p>The derivative tells us that any change in the weight will change the cost by twice that amount. In other words, if we increase the weight by 1 unit, the cost will change by 2 units.</p>

<p>Since our current cost is negative, it signals that the weight should be increased. (If the cost were positive, we would need to decrease the weight.) Thus, we increase the weight to ( w = 5 ) to move the cost to zero.</p>
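<p>The reasoning above can be written as a one-step update. The correction is exact here only because the cost is linear in the weight; this is a sketch of the idea, not the general algorithm.</p>

```python
x, target = 2, 10
w = 4

def cost(w):
    return x * w - target  # signed cost, as defined above

dC_dw = x  # derivative of the cost with respect to w

print(cost(w))  # -2: the prediction is 2 units too low

# negative cost means the prediction is too low, so increase w
w = w - cost(w) / dC_dw
print(w, cost(w))  # 5.0 0.0
```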

<h2 id="with-a-hidden-layer">With a Hidden Layer</h2>

<p>Let’s add a hidden layer to the same simple network:</p>

<ul>
  <li>Input ( x = 2 )</li>
  <li>Weight ( w_1 = 4 )</li>
  <li>Weight ( w_2 = 3 )</li>
  <li>Target output ( y_{\text{target}} = 10 )</li>
</ul>

<p><img src="/assets/images/with_hidden.png" alt="hidden-network" /></p>

<p>The prediction is given by:</p>

<p>[
\hat{y} = (x \cdot w_1) \cdot w_2
]</p>

<p>Substituting the values:</p>

<p>[
\hat{y} = (2 \times 4) \times 3 = 24
]</p>

<p>The cost is the difference between the prediction and the target:</p>

<p>[
\text{Cost} = \hat{y} - y_{\text{target}} = (x \cdot w_1) \cdot w_2 - 10 = 24 - 10 = 14
]</p>

<p>Now, let’s compute the derivatives:</p>

<p>[
\frac{dC}{dw_1} = x \cdot w_2 = 2 \times 3 = 6
]
[
\frac{dC}{dw_2} = x \cdot w_1 = 2 \times 4 = 8
]</p>

<p>The derivatives tell us that ( w_2 ) influences the network more than ( w_1 ).</p>

<p>Now, I want you to <strong>pause reading</strong> and try this quick exercise:</p>
<ul>
  <li>Increase ( w_1 ) by 0.1 and observe how much ( \hat{y} ) changes.</li>
  <li>Increase ( w_2 ) by 0.1 and observe how much ( \hat{y} ) changes.</li>
  <li>Verify that changing ( w_2 ) causes a bigger change in the output than changing ( w_1 ).</li>
</ul>
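<p>If you want to check your pen-and-paper answers, the exercise can be scripted:</p>

```python
x, w1, w2 = 2, 4, 3

def predict(w1, w2):
    return (x * w1) * w2

base = predict(w1, w2)                    # 24
change_w1 = predict(w1 + 0.1, w2) - base  # ~0.6, i.e. dC/dw1 * 0.1
change_w2 = predict(w1, w2 + 0.1) - base  # ~0.8, i.e. dC/dw2 * 0.1
print(change_w1, change_w2)
```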

<h2 id="how-do-computers-adjust-weights">How Do Computers Adjust Weights?</h2>

<p>In our first simple network, we manually found the correct weight using our own intelligence.<br />
Computers, however, work in a much more rudimentary way: they adjust the weights using the corresponding derivatives.</p>

<p>The idea is simple:</p>
<ul>
  <li><strong>Weights with higher influence</strong> (higher derivative) are adjusted more.</li>
  <li><strong>Weights with lower influence</strong> are adjusted less.</li>
</ul>

<p>But here’s the catch:<br />
If the derivative values are large, the weights can change abruptly — causing the cost to fluctuate wildly.<br />
This phenomenon is known as the <strong>exploding gradient problem</strong>.</p>

<p>To prevent this, we multiply the derivative by a small number called the <strong>learning rate</strong> (e.g., ( 0.01 )) to ensure smoother learning:</p>

<p>[
w_1 = w_1 - \text{learning_rate} \times \frac{dC}{dw_1}
]
[
w_2 = w_2 - \text{learning_rate} \times \frac{dC}{dw_2}
]</p>

<p>By training the model over a large number of samples, the weights are <strong>gradually nudged</strong> toward their optimal values, leading to better predictions.</p>
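<p>To watch the update rule converge on the hidden-layer example, here is a small sketch. One caveat: I swap the article’s signed cost for the squared cost ((\hat{y} - y)^2), so the gradient carries the error’s sign and shrinks to zero as the prediction improves.</p>

```python
x, target = 2.0, 10.0
w1, w2 = 4.0, 3.0
lr = 0.01  # learning rate

for _ in range(500):
    y_hat = (x * w1) * w2
    error = y_hat - target
    # gradients of the squared cost (error**2) with respect to each weight
    dw1 = 2 * error * x * w2
    dw2 = 2 * error * x * w1
    w1 -= lr * dw1
    w2 -= lr * dw2

print(round((x * w1) * w2, 4))  # close to 10.0
```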

<h2 id="last-words">Last Words</h2>

<p>I’ve intentionally avoided the chain rule so we can wrap our heads around the core idea. There are a lot of examples out in the wild that use the chain rule. Here is one of my <a href="https://www.youtube.com/watch?v=sIX_9n-1UbM">personal favorites</a>.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Neural networks learn to predict through backpropagation. This article aims to help you build a solid intuition about the concept using a simple example. The ideas we learn here can be extended to bigger neural networks. I assume that you already know how a feed-forward neural network works.]]></summary></entry><entry><title type="html">Part 1: DIY debugger in Golang</title><link href="https://poonai.xyz/2021/09/02/part-1-diy-debugger-in-golang.html" rel="alternate" type="text/html" title="Part 1: DIY debugger in Golang" /><published>2021-09-02T00:00:00+00:00</published><updated>2021-09-02T00:00:00+00:00</updated><id>https://poonai.xyz/2021/09/02/part-1-diy-debugger-in-golang</id><content type="html" xml:base="https://poonai.xyz/2021/09/02/part-1-diy-debugger-in-golang.html"><![CDATA[<p>The first thing I do when I create a project is to create the debugger launch config in the <code class="language-plaintext highlighter-rouge">.vscode</code> folder. Debuggers help me avoid adding print statements and building the program again. I always wondered how a debugger can stop the program on the line number I want and let me inspect variables. The inner workings of debuggers have always been dark magic to me. At last, I managed to learn the dark art by reading several articles and grokking the source code of <a href="https://github.com/go-delve">delve</a>.</p>

<p>In this post, I’ll talk about my learnings while demystifying the dark art of debuggers.</p>

<h2 id="problem-statement">Problem statement</h2>
<p>Let’s define the problem statement before coding. I have a sample Golang program that prints a random int every second. The goal I want to achieve is that our debugger program should print <code class="language-plaintext highlighter-rouge">breakpoint hit</code> before the sample program prints the random integer.</p>

<p>Here is the sample program, which prints a random int every second.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="m">1.</span> <span class="k">package</span> <span class="n">main</span>
<span class="m">2.</span> 
<span class="m">3.</span> <span class="k">import</span> <span class="p">(</span>
<span class="m">4.</span>  <span class="s">"fmt"</span>
<span class="m">5.</span>  <span class="s">"math/rand"</span>
<span class="m">6.</span>  <span class="s">"time"</span>
<span class="m">7.</span> <span class="p">)</span>
<span class="m">8.</span> 
<span class="m">9.</span> <span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="m">10.</span>     <span class="k">for</span> <span class="p">{</span>
<span class="m">11.</span>         <span class="n">variableToTrace</span> <span class="o">:=</span> <span class="n">rand</span><span class="o">.</span><span class="n">Int</span><span class="p">()</span>
<span class="m">12.</span>         <span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="n">variableToTrace</span><span class="p">)</span>
<span class="m">13.</span>         <span class="n">time</span><span class="o">.</span><span class="n">Sleep</span><span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">Second</span><span class="p">)</span>
<span class="m">14.</span>     <span class="p">}</span>
<span class="m">15.</span> <span class="p">}</span>
<span class="m">16.</span> 
</code></pre></div></div>
<h2 id="solution">Solution</h2>
<p>Now that we know what we want to achieve. Let’s go step by step and solve the problem statement.</p>

<p>The first step is to pause the sample program before it prints the random int. That means we have to set the breakpoint at line number 11.</p>

<p>To set the breakpoint at line number 11, we must find the address of the instruction corresponding to that line.</p>

<p>Some of us know from high school that every high-level language is eventually compiled down to assembly. So, how do we find the address of the instruction for a given source line?</p>

<p><img src="/assets/images/cathow.jpg" alt="cathow" /></p>

<p>Luckily, compilers emit debug information alongside the optimized assembly instructions in the output binary. This debug information contains, among other things, the mapping between assembly code and high-level source lines.
For Linux binaries, debug information is usually encoded in the DWARF format.</p>

<blockquote>
  <p>DWARF is a debugging file format used by many compilers and debuggers to support source level debugging. It addresses the requirements of a number of procedural languages, such as C, C++, and Fortran, and is designed to be extensible to other languages. DWARF is architecture independent and applicable to any processor or operating system. It is widely used on Unix, Linux and other operating systems, as well as in stand-alone environments. source: http://www.dwarfstd.org/</p>
</blockquote>

<p>The DWARF data can be parsed using the objdump tool.</p>

<p>The command below outputs every instruction address along with its mapping to a line number and file name.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>objdump <span class="nt">--dwarf</span><span class="o">=</span>decodedline ./sample
</code></pre></div></div>

<p>The objdump command produces output similar to this:</p>
<pre><code class="language-txt">File name                            Line number    Starting address    View    Stmt

/home/poonai/debugger-example/sample.go:
sample.go                                      9            0x498200               x
sample.go                                      9            0x498213               x
sample.go                                     10            0x498221               x
sample.go                                     11            0x498223               x
sample.go                                     11            0x498225        
sample.go                                     12            0x498233               x
sample.go                                     12            0x498236        
sample.go                                     13            0x4982be               x
sample.go                                     13            0x4982cb        
sample.go                                     11            0x4982cd               x
sample.go                                     12            0x4982d2        
sample.go                                      9            0x4982d9               x
sample.go                                      9            0x4982de        
sample.go                                      9            0x4982e0               x
sample.go                                      9            0x4982e5               x
</code></pre>
<p>The output clearly states that <code class="language-plaintext highlighter-rouge">0x498223</code> is the starting address of line number 11 in the sample.go file.</p>
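<p>If you’d rather not shell out to objdump, the same lookup can be done in Go with the standard <code class="language-plaintext highlighter-rouge">debug/elf</code> and <code class="language-plaintext highlighter-rouge">debug/dwarf</code> packages. A minimal sketch (the binary path, file name, and target line are assumptions matching our example; real debuggers like delve do a more careful version of this):</p>

```go
package main

import (
	"debug/dwarf"
	"debug/elf"
	"fmt"
	"os"
	"strings"
)

// findLineAddr walks the DWARF line tables of an ELF binary and returns
// the first statement address mapped to the given file and line.
func findLineAddr(binPath, file string, line int) (uint64, error) {
	f, err := elf.Open(binPath)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	dw, err := f.DWARF()
	if err != nil {
		return 0, err
	}
	rdr := dw.Reader()
	for {
		cu, err := rdr.Next()
		if err != nil || cu == nil {
			break
		}
		// line tables hang off compile-unit entries
		if cu.Tag != dwarf.TagCompileUnit {
			continue
		}
		lr, err := dw.LineReader(cu)
		if err != nil || lr == nil {
			continue
		}
		var entry dwarf.LineEntry
		for lr.Next(&entry) == nil {
			if entry.IsStmt && entry.Line == line &&
				entry.File != nil && strings.HasSuffix(entry.File.Name, file) {
				return entry.Address, nil
			}
		}
	}
	return 0, fmt.Errorf("no address found for %s:%d", file, line)
}

func main() {
	addr, err := findLineAddr("./sample", "sample.go", 11)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("0x%x\n", addr)
}
```

<p>Running this against our sample binary should print the same address that objdump reported.</p>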

<p>The next step is to pause the program at the address <code class="language-plaintext highlighter-rouge">0x498223</code>.</p>

<h2 id="trick-to-pause-the-program-execution">Trick to pause the program execution</h2>
<p>The CPU will interrupt the program whenever it executes the int 3 instruction, whose opcode is 0xCC. So, we just have to overwrite the byte at the address <code class="language-plaintext highlighter-rouge">0x498223</code> with 0xCC to pause the program.</p>
<blockquote>
  <p>In computing and operating systems, a trap, also known as an exception or a fault, is typically a type of synchronous interrupt caused by an exceptional condition (e.g., breakpoint, division by zero, invalid memory access). source: wikipedia</p>
</blockquote>

<p>Does that mean we have to rewrite the binary on disk? No, we can patch the running process’s memory at <code class="language-plaintext highlighter-rouge">0x498223</code> using ptrace.</p>

<h2 id="ptrace-to-rescue">Ptrace to rescue</h2>
<blockquote>
  <p>ptrace is a system call found in Unix and several Unix-like operating systems. By using ptrace (the name is an abbreviation of “process trace”) one process can control another, enabling the controller to inspect and manipulate the internal state of its target. ptrace is used by debuggers and other code-analysis tools, mostly as aids to software development. source:wikipedia</p>
</blockquote>

<p>ptrace is a syscall that allows us to read and write the traced process’s registers and its memory at a given address.</p>

<p>Now we know which address to pause at, how to map source lines to instruction addresses, and how to manipulate the sample program’s memory. So, let’s put all this knowledge into action.</p>

<p>First, exec the sample program with the Ptrace flag set to true, so that we can use ptrace on the spawned process.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">process</span> <span class="o">:=</span> <span class="n">exec</span><span class="o">.</span><span class="n">Command</span><span class="p">(</span><span class="s">"./sample"</span><span class="p">)</span>
<span class="n">process</span><span class="o">.</span><span class="n">SysProcAttr</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">syscall</span><span class="o">.</span><span class="n">SysProcAttr</span><span class="p">{</span><span class="n">Ptrace</span><span class="o">:</span> <span class="no">true</span><span class="p">,</span> <span class="n">Setpgid</span><span class="o">:</span> <span class="no">true</span><span class="p">,</span>    
<span class="n">Foreground</span><span class="o">:</span> <span class="no">false</span><span class="p">}</span>
<span class="n">process</span><span class="o">.</span><span class="n">Stdout</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">Stdout</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">process</span><span class="o">.</span><span class="n">Start</span><span class="p">();</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
    <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The breakpoint can be set at <code class="language-plaintext highlighter-rouge">0x498223</code> by replacing the original byte with the int 3 opcode (0xCC). This can be done with <code class="language-plaintext highlighter-rouge">PtracePokeData</code>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">setBreakpoint</span><span class="p">(</span><span class="n">pid</span> <span class="kt">int</span><span class="p">,</span> <span class="n">addr</span> <span class="kt">uintptr</span><span class="p">)</span> <span class="p">[]</span><span class="kt">byte</span> <span class="p">{</span>
    <span class="n">data</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">([]</span><span class="kt">byte</span><span class="p">,</span> <span class="m">1</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtracePeekData</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
    <span class="p">}</span>
    <span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtracePokeData</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="p">[]</span><span class="kt">byte</span><span class="p">{</span><span class="m">0xCC</span><span class="p">});</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">data</span>
<span class="p">}</span>
</code></pre></div></div>
<p>You might be wondering why <code class="language-plaintext highlighter-rouge">PtracePeekData</code> is used in addition to <code class="language-plaintext highlighter-rouge">PtracePokeData</code>. <code class="language-plaintext highlighter-rouge">PtracePeekData</code> allows us to read the memory at a given address. I’ll explain later why I’m reading the data at the address <code class="language-plaintext highlighter-rouge">0x498223</code> before overwriting it.</p>

<p>Now that the breakpoint is set, we’ll continue the program and wait for the interrupt to happen. This can be done with <code class="language-plaintext highlighter-rouge">PtraceCont</code> and <code class="language-plaintext highlighter-rouge">Wait4</code>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtraceCont</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="m">0</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
     <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="o">.</span><span class="n">Error</span><span class="p">())</span>
 <span class="p">}</span>
 <span class="c">// wait for the interrupt to come.</span>
 <span class="k">var</span> <span class="n">status</span> <span class="n">unix</span><span class="o">.</span><span class="n">WaitStatus</span>
 <span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">Wait4</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">status</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="no">nil</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
     <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="o">.</span><span class="n">Error</span><span class="p">())</span>
 <span class="p">}</span>
 <span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"breakpoint hit"</span><span class="p">)</span>
</code></pre></div></div>
<p>After the breakpoint hits, we want the program to continue as usual. But since we already modified the data at <code class="language-plaintext highlighter-rouge">0x498223</code>, it can’t. So we need to replace the int 3 with the original data.</p>

<p>Remember, we captured the original data at <code class="language-plaintext highlighter-rouge">0x498223</code> using <code class="language-plaintext highlighter-rouge">PtracePeekData</code> while setting the breakpoint. Let’s just revert to the original data at <code class="language-plaintext highlighter-rouge">0x498223</code>.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtracePokeData</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="o">.</span><span class="n">Error</span><span class="p">())</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Just reverting to the original data isn’t enough, because by the time the breakpoint hits, the CPU has already executed the int 3 at <code class="language-plaintext highlighter-rouge">0x498223</code> and moved past it.
So, we need to tell the CPU to execute the instruction at <code class="language-plaintext highlighter-rouge">0x498223</code> again.</p>

<p><img src="/assets/images/registersintro.png" alt="registers" /></p>

<p>The CPU executes the instruction that the instruction pointer points to. If you studied microprocessors at university, you might remember this.</p>

<p><img src="/assets/images/dejavu.jfif" alt="dejavu" />
So, that means if we set the instruction pointer to <code class="language-plaintext highlighter-rouge">0x498223</code> then the CPU will execute the instruction at <code class="language-plaintext highlighter-rouge">0x498223</code> again. CPU registers can be manipulated using <code class="language-plaintext highlighter-rouge">PtraceGetRegs</code> and <code class="language-plaintext highlighter-rouge">PtraceSetRegs</code>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">regs</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">unix</span><span class="o">.</span><span class="n">PtraceRegs</span><span class="p">{}</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtraceGetRegs</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">regs</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
   <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">regs</span><span class="o">.</span><span class="n">Rip</span> <span class="o">=</span> <span class="kt">uint64</span><span class="p">(</span><span class="n">addr</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtraceSetRegs</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">regs</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
      <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
 <span class="p">}</span>
</code></pre></div></div>
<p>Now that we’ve modified the register, continuing the program would resume the normal flow. But we want to hit the breakpoint again, so we’ll tell ptrace to execute only the next instruction and then set the breakpoint again. <code class="language-plaintext highlighter-rouge">PtraceSingleStep</code> allows us to execute exactly one instruction.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">resetBreakpoint</span><span class="p">(</span><span class="n">pid</span> <span class="kt">int</span><span class="p">,</span> <span class="n">addr</span> <span class="kt">uintptr</span><span class="p">,</span> <span class="n">originaldata</span> <span class="p">[]</span><span class="kt">byte</span><span class="p">)</span> <span class="p">{</span>
   <span class="c">// revert back to original data</span>
    <span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtracePokeData</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="n">originaldata</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="o">.</span><span class="n">Error</span><span class="p">())</span>
    <span class="p">}</span>
    <span class="c">// set the instruction pointer to execute the instruction again</span>
    <span class="n">regs</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">unix</span><span class="o">.</span><span class="n">PtraceRegs</span><span class="p">{}</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtraceGetRegs</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">regs</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
    <span class="p">}</span>
    <span class="n">regs</span><span class="o">.</span><span class="n">Rip</span> <span class="o">=</span> <span class="kt">uint64</span><span class="p">(</span><span class="n">addr</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtraceSetRegs</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">regs</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
    <span class="p">}</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtraceSingleStep</span><span class="p">(</span><span class="n">pid</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
    <span class="p">}</span>
    <span class="c">// wait for it's execution and set the breakpoint again</span>
    <span class="k">var</span> <span class="n">status</span> <span class="n">unix</span><span class="o">.</span><span class="n">WaitStatus</span>
    <span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">Wait4</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">status</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="no">nil</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="o">.</span><span class="n">Error</span><span class="p">())</span>
    <span class="p">}</span>
    <span class="n">setBreakpoint</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">addr</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
<p>So far we have learned how to manipulate registers and set breakpoints. Let’s put all of this into a for loop and drive the program.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pid</span> <span class="o">:=</span> <span class="n">process</span><span class="o">.</span><span class="n">Process</span><span class="o">.</span><span class="n">Pid</span>
<span class="n">data</span> <span class="o">:=</span> <span class="n">setBreakpoint</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="m">0x498223</span><span class="p">)</span>
<span class="k">for</span> <span class="p">{</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtraceCont</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="m">0</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="o">.</span><span class="n">Error</span><span class="p">())</span>
    <span class="p">}</span>
    <span class="c">// wait for the interrupt to come.</span>
    <span class="k">var</span> <span class="n">status</span> <span class="n">unix</span><span class="o">.</span><span class="n">WaitStatus</span>
    <span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">Wait4</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">status</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="no">nil</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="o">.</span><span class="n">Error</span><span class="p">())</span>
    <span class="p">}</span>
    <span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"breakpoint hit"</span><span class="p">)</span>
    <span class="c">// reset the breakpoint</span>
    <span class="n">resetBreakpoint</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="m">0x498223</span><span class="p">,</span> <span class="n">data</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Phew! Finally, we are able to print <code class="language-plaintext highlighter-rouge">breakpoint hit</code> before our sample program prints each random int.</p>

<pre><code class="language-txt">breakpoint hit
6129484611666145821
breakpoint hit
4037200794235010051
breakpoint hit
3916589616287113937
breakpoint hit
6334824724549167320
breakpoint hit
605394647632969758
breakpoint hit
1443635317331776148
breakpoint hit
894385949183117216
</code></pre>

<p>You can find the full source code at <a href="https://github.com/poonai/debugger-example">https://github.com/poonai/debugger-example</a>.</p>

<p>That’s all for now. Hope you folks learned something new. In the next post, I’ll cover how to extract the values of variables by reading DWARF info. You can follow me on <a href="https://twitter.com/poonai_">Twitter</a> to get notified about part 2.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[The first thing I do when I create a project is to create the debugger launch config at the .vscode folder. Debuggers help me to avoid putting print statements and building the program again. I always wondered how a debugger can stop the program on the line number I want and be able to inspect variables. Debugger workings have always been dark magic for me. At last, I managed to learn the dark art by reading several articles and grokking the source code of delve.]]></summary></entry><entry><title type="html">Can differential privacy protect our privacy?</title><link href="https://poonai.xyz/2021/03/29/can-differential-privacy-protect-our-privacy.html" rel="alternate" type="text/html" title="Can differential privacy protect our privacy?" /><published>2021-03-29T00:00:00+00:00</published><updated>2021-03-29T00:00:00+00:00</updated><id>https://poonai.xyz/2021/03/29/can-differential-privacy-protect-our-privacy</id><content type="html" xml:base="https://poonai.xyz/2021/03/29/can-differential-privacy-protect-our-privacy.html"><![CDATA[<p>I’m a mediocre engineer who does systems work and never had experience in the typical user-facing software space. I’ve contributed to software that scales but never really had a chance to experience the vibe of serving millions of users.</p>

<p>Recently, one of my friends explained to me the kind of events they track at their startup. I felt sick after hearing about them, since they are often very personal to the user. Companies collect data ranging from the user’s geolocation to the names of installed apps (maybe to hike prices if a competitor’s app is not present on the user’s phone).</p>

<p>I decided to dig deeper and see whether any privacy-friendly tracking solution exists, and that took me to <a href="https://docs-assets.developer.apple.com/ml-research/papers/learning-with-privacy-at-scale.pdf">Apple’s paper</a>. The paper explains how Apple leveraged a count-min sketch with added noise to infer aggregate user behaviour without compromising user privacy.</p>
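<p>For the curious, the count-min sketch at the heart of the paper is simple to build. Below is a minimal, noise-free version in Go (the paper’s private variant additionally randomizes each client’s contribution before upload; that step is omitted here, and the row/column sizes are arbitrary choices for illustration):</p>

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Sketch is a minimal count-min sketch: a matrix of counters,
// one row per (salted) hash function.
type Sketch struct {
	rows, cols int
	counts     [][]uint64
}

func NewSketch(rows, cols int) *Sketch {
	c := make([][]uint64, rows)
	for i := range c {
		c[i] = make([]uint64, cols)
	}
	return &Sketch{rows: rows, cols: cols, counts: c}
}

// index hashes the item with a per-row salt and maps it to a column.
func (s *Sketch) index(row int, item string) int {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d:%s", row, item)
	return int(h.Sum64() % uint64(s.cols))
}

// Add increments one counter per row for the item.
func (s *Sketch) Add(item string) {
	for r := 0; r < s.rows; r++ {
		s.counts[r][s.index(r, item)]++
	}
}

// Estimate returns the minimum counter across rows: an upper bound on
// the true count that is tight when hash collisions are rare.
func (s *Sketch) Estimate(item string) uint64 {
	min := s.counts[0][s.index(0, item)]
	for r := 1; r < s.rows; r++ {
		if c := s.counts[r][s.index(r, item)]; c < min {
			min = c
		}
	}
	return min
}

func main() {
	sk := NewSketch(4, 1024)
	for i := 0; i < 9000; i++ {
		sk.Add("board")
	}
	for i := 0; i < 6000; i++ {
		sk.Add("list")
	}
	fmt.Println(sk.Estimate("board"), sk.Estimate("list"))
}
```

<p>The appeal is that the server stores only this small counter matrix, never a per-user event log.</p>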

<p>Researchers often bake in assumptions that may not hold everywhere, so to validate the paper I ran a small <a href="https://github.com/poonai/diffrential_privacy/blob/master/cms_test.go#L30">experiment</a>: estimating the most popular view in a project management app.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c">// we are tracking what view users are using in their project management app.</span>
    <span class="c">// 6k users are using list view.</span>
    <span class="n">track</span><span class="p">(</span><span class="m">6000</span><span class="p">,</span> <span class="s">"list"</span><span class="p">)</span>
    <span class="c">// 9k user using board view.</span>
    <span class="n">track</span><span class="p">(</span><span class="m">9000</span><span class="p">,</span> <span class="s">"board"</span><span class="p">)</span>
    <span class="c">// 2k user using calendar view.</span>
    <span class="n">track</span><span class="p">(</span><span class="m">2000</span><span class="p">,</span> <span class="s">"calendar"</span><span class="p">)</span>
</code></pre></div></div>
<p>The gap between the actual counts and the estimates computed via differential privacy is not that large.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"estimate for list"</span><span class="p">,</span> <span class="n">server</span><span class="o">.</span><span class="n">Estimate</span><span class="p">([]</span><span class="kt">byte</span><span class="p">(</span><span class="s">"list"</span><span class="p">)))</span>
    <span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"estimate for board"</span><span class="p">,</span> <span class="n">server</span><span class="o">.</span><span class="n">Estimate</span><span class="p">([]</span><span class="kt">byte</span><span class="p">(</span><span class="s">"board"</span><span class="p">)))</span>
    <span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"estimate for calendar"</span><span class="p">,</span> <span class="n">server</span><span class="o">.</span><span class="n">Estimate</span><span class="p">([]</span><span class="kt">byte</span><span class="p">(</span><span class="s">"calendar"</span><span class="p">)))</span>
    <span class="c">// output</span>
    <span class="c">// estimate for list 6572.1029024055715</span>
    <span class="c">// estimate for board 9154.186791339975</span>
    <span class="c">// estimate for calendar 1157.8019026490715</span>
</code></pre></div></div>
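<p>The core trick behind this approach is that each device adds noise before anything leaves it, and the server statistically inverts the noise in aggregate. It can be illustrated with classic randomized response; here is a toy sketch (the 30% true rate, user count, and seed are made up for illustration):</p>

```go
package main

import (
	"fmt"
	"math/rand"
)

// randomizedResponse flips a fair coin: on heads it reports the truth,
// on tails it reports a second independent coin flip. Any single
// reported answer is therefore plausibly deniable.
func randomizedResponse(truth bool, r *rand.Rand) bool {
	if r.Intn(2) == 0 {
		return truth
	}
	return r.Intn(2) == 0
}

// estimateTrueRate simulates n users whose true "yes" rate is trueRate,
// collects their noisy answers, and inverts the noise:
// P(reported yes) = 0.5*trueRate + 0.25, so trueRate is about 2p - 0.5.
func estimateTrueRate(n int, trueRate float64, seed int64) float64 {
	r := rand.New(rand.NewSource(seed))
	observed := 0
	for i := 0; i < n; i++ {
		truth := r.Float64() < trueRate
		if randomizedResponse(truth, r) {
			observed++
		}
	}
	p := float64(observed) / float64(n)
	return 2*p - 0.5
}

func main() {
	est := estimateTrueRate(100000, 0.30, 42)
	fmt.Printf("estimated true rate: %.3f\n", est)
}
```

<p>No individual answer can be trusted, yet the aggregate estimate lands close to the true 30%: the same trade-off the count-min-sketch experiment above exploits.</p>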

<p>I presented it to my friend and asked him whether his company would consider using such tech.</p>

<p>Unfortunately, aggregate inference of users’ behaviour alone is not enough; companies also want to send notifications based on certain events.</p>

<p>For example, if the user dropped off on a certain app screen without performing an action, the tracking system should be able to send a notification that nudges the user to complete the action. For that, tracking services have to track events at the <a href="https://docs.moengage.com/docs/tracking-user-attributes#default-user-attributes">user level</a>.</p>

<p>Companies don’t just want analytics; they also want to target the user, which reminds me of a quote from the movie <code class="language-plaintext highlighter-rouge">The Social Dilemma</code>:</p>
<blockquote>
  <p>We want to psychologically figure out how to manipulate you as fast as possible</p>
</blockquote>

<h3 id="closing-thoughts">Closing Thoughts:</h3>
<p>Differential privacy looks good on paper, but on its own it is not enough to cater to big companies’ needs. Still, there are some use cases where I think it can be deployed:</p>
<ul>
<li>tracking customers’ sensitive data in aggregate.</li>
  <li>privacy-focused applications.</li>
</ul>

<p>I would love to hear where else differential privacy can be plugged in. Here’s <a href="https://twitter.com/poonai_">my profile</a> if anyone wants to reach out.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[I’m a mediocre engineer who does systems work and never had experience in the typical user-facing software space. I’ve contributed to software that scales but never really had a chance to experience the vibe of serving millions of users.]]></summary></entry></feed>