<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://poonai.xyz/feed.xml" rel="self" type="application/atom+xml" /><link href="https://poonai.xyz/" rel="alternate" type="text/html" /><updated>2026-03-02T02:00:53+00:00</updated><id>https://poonai.xyz/feed.xml</id><title type="html">Poonai’s space</title><subtitle>Place to express my learnings</subtitle><entry><title type="html">brute force attempt to understand RNN</title><link href="https://poonai.xyz/interpretability/2026/03/01/brute-force-attempt-to-understand-RNN.html" rel="alternate" type="text/html" title="brute force attempt to understand RNN" /><published>2026-03-01T00:00:00+00:00</published><updated>2026-03-01T00:00:00+00:00</updated><id>https://poonai.xyz/interpretability/2026/03/01/brute-force-attempt-to-understand-RNN</id><content type="html" xml:base="https://poonai.xyz/interpretability/2026/03/01/brute-force-attempt-to-understand-RNN.html"><![CDATA[<p>As of today, it’s not feasible to fully understand LLMs. Therefore, scientists hypothesize that studying a toy model will help us understand the big models. I came across such a <a href="https://www.lesswrong.com/posts/x8BbjZqooS4LFXS8Z/algzoo-uninterpreted-models-with-fewer-than-1-500-parameters">toy model</a> released by ARC, and I tried to understand their understanding.</p>

<p>As a Mechanistic Interpretability enthusiast, I was curious to study the model myself. This post is an attempt to explain the understanding in my own words, and I also think it’s beneficial to have multiple explanations of the same thing.</p>

<h2 id="problem-setup">Problem Setup</h2>
<p>ARC released multiple toy models, each trained on a different algorithmic task. I chose the model named <code class="language-plaintext highlighter-rouge">argmax2</code>, which is trained to predict the position of the second highest number in the input sequence. For example, <code class="language-plaintext highlighter-rouge">3</code> is the second largest number in the input sequence <code class="language-plaintext highlighter-rouge">[3,4]</code>. It is an RNN-based model released in several parameter sizes; to start with, we chose the one with hidden size 2 and an input sequence of length 2.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">init_state</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="n">batch_size</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="k">if</span> <span class="n">init_state</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="c1"># initial hidden state set to zero
</span>        <span class="n">h</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="nf">new_zeros</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">hidden_size</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">h</span> <span class="o">=</span> <span class="n">init_state</span>
    <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]):</span>
        <span class="n">xt</span> <span class="o">=</span> <span class="n">x</span><span class="p">[:,</span> <span class="n">t</span> <span class="p">:</span> <span class="n">t</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span>
        <span class="c1"># shape of i2h and h2h are: (2,1) and (2,2)
</span>        <span class="n">h</span> <span class="o">=</span> <span class="n">th</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">functional</span><span class="p">.</span><span class="nf">relu</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="nf">i2h</span><span class="p">(</span><span class="n">xt</span><span class="p">)</span> <span class="o">+</span> <span class="n">self</span><span class="p">.</span><span class="nf">h2h</span><span class="p">(</span><span class="n">h</span><span class="p">))</span>
    <span class="c1"># shape of output: (2,2)
</span>    <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="nf">output</span><span class="p">(</span><span class="n">h</span><span class="p">)</span>
</code></pre></div></div>

<p>The chosen <code class="language-plaintext highlighter-rouge">argmax2</code> model returns two neurons, and the neuron with the highest value marks the position of the second highest number among the inputs. The example below takes <code class="language-plaintext highlighter-rouge">[3,4]</code> as input and returns <code class="language-plaintext highlighter-rouge">[0.4,0.1]</code>, since 3 is the second highest number.</p>

\[\text{Input} =
\begin{bmatrix}
x_0 \\
x_1
\end{bmatrix} = 
\begin{bmatrix}
3 \\
4
\end{bmatrix}
\qquad
\text{Output} =
\begin{bmatrix}
0.4 \\
0.1
\end{bmatrix}\]
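
<p>To make the target concrete, here is a small sketch (my own helper, not ARC’s code) of the label that <code class="language-plaintext highlighter-rouge">argmax2</code> is trained to predict:</p>

```python
# Sketch of the argmax2 target (my own helper, not ARC's code):
# the label is the position of the second-highest entry.
def argmax2_label(xs):
    # sort positions by value, descending; pick the runner-up's position
    order = sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)
    return order[1]

print(argmax2_label([3, 4]))  # 0, since 3 is the second highest
```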

<p>We asked ourselves a question: what makes \(output_{00}&gt; output_{10}\)? It’s obvious that answering this question would tell us <strong>how the model works</strong>.</p>
<h2 id="sheer-brute-force">Sheer Brute Force</h2>

<p>To answer the question, we must know what’s going on inside the model. An RNN calculates a hidden state for each time step. The hidden state is calculated from the current input entry and the previous hidden state, with the number of time steps determined by the length of the sequence.</p>

<p><img src="/assets/images/rnn.png" alt="rnn" /></p>

<p>In our model, there are two entries in the input sequence, so we calculate two hidden states to uncover the working of the model.</p>

<h4 id="hidden-state-1">Hidden state 1:</h4>

\[h_1 = ReLU(W_{hh}.h_0 + W_{hi}.{x_0}); W_{hi} = \begin{bmatrix} hi_{00} \\ -hi_{01} \end{bmatrix}\]

<p>Here \(h_0 = 0\), since there is no previous hidden state for the current step.</p>

\[h_1 = ReLU(\begin{bmatrix} 
x_0.hi_{00} \\
-x_0.hi_{01}
\end{bmatrix}); \text{ where } ReLU(x) = max(0,x)\]

\[h_1 = \begin{cases}
\begin{bmatrix} x_0.hi_{00} \\ 0 \end{bmatrix} \text{ ; } x_0&gt; 0 \\
\begin{bmatrix} 0 \\ -x_0.hi_{01} \end{bmatrix} \text{ ; } x_0&lt; 0
\end{cases}\]

<p>The first neuron is activated when \(x_0 &gt;0\) and the second neuron is activated when \(x_0&lt;0\), while the other neuron in each case is zero. \(h_1\) itself does not give us any valuable information. However, we can see that <code class="language-plaintext highlighter-rouge">ReLU</code> turns the computation into a piecewise function based on the sign of the input.</p>
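
<p>As a sanity check, the piecewise behavior of \(h_1\) can be verified numerically. The weights below are made up (the trained model’s values differ); only their signs matter for the argument:</p>

```python
# Numeric check of the h1 piecewise behavior. hi00 and hi01 are made-up
# positive weights (the real model's values differ); only the signs matter.
def relu(v):
    return max(0.0, v)

hi00, hi01 = 1.3, 0.9

def h1(x0):
    # W_hi = [hi00, -hi01]^T applied to the scalar input x0, then ReLU
    return [relu(hi00 * x0), relu(-hi01 * x0)]

print(h1(3.0))   # first neuron active, second clamped to zero
print(h1(-3.0))  # second neuron active, first clamped to zero
```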
<h4 id="hidden-state-2">Hidden state 2:</h4>
<p>We jump directly to the result of \(h_2\) without showing the intermediate steps; including them would make this post unnecessarily long without adding any substance. The detailed steps are in <a href="https://drive.google.com/drive/folders/13ELZoA2F1h_YAj41v1xN5IpkRlHiMrSt?usp=sharing">my notes</a>, please check them if you are interested. The two cases of \(h_1\) branch further into six cases when deriving \(h_2\). It is sufficient to discuss only two of those six cases to explain the entire model, since the other cases express similar behavior.</p>

\[h_2 = \begin{cases} 
\begin{bmatrix} 
0 \\
	-1.48.x_1 + 2.09.x_0 
\end{bmatrix} \text{ ; } x_0 &gt; 0 \text{ ; } 0.7.x_1 &lt; 2.x_1 &lt; x_0 \\
\begin{bmatrix} 
1.61.x_1 - 0.58.x_0 \\
	-1.48.x_1 + 2.09.x_0 
\end{bmatrix} \text{ ; } x_0 &gt; 0 \text{ ; } 0.7.x_1 &lt; x_0 &lt; 2.x_1 \\
\text{ ... } \\ 
\text{ ... } \\
\text{ ... }
\end{cases}\]

<h4 id="first-case">First case:</h4>
<p>I plotted the first case to analyze it visually, and it unfolded a series of insights:</p>

<p style="text-align: center;"><img src="/assets/images/rnn_piecewise_case_1.png" alt="piecewise case 1" /></p>

<ul>
  <li>the first case of \(h_2\) carves out a region in the first quadrant plus the entire fourth quadrant.</li>
  <li>\(x_0\) is always greater than \(x_1\) for every coordinate belonging to that region. Pick a coordinate in the shaded region and see for yourself.</li>
  <li>in \(h_2\), the second neuron is active, while the first neuron is inactive.</li>
</ul>

<p>We can conclude that the model is learning regions where it can tell whether \(x_0\) is greater or lesser than \(x_1\). But this technique only works when one neuron is active and the other is inactive. Therefore, we’ll analyze the second case, where both neurons are active.</p>

<h4 id="second-case">Second case:</h4>
<p>The second case of the piecewise function spans a region in the first quadrant, around the line \(x_0=x_1\).</p>

<p style="text-align: center;"><img src="/assets/images/rnn_piecewise_case_2.png" alt="piecewise case 2" /></p>

<p>It is also worth noting that \(x_0\) is always lesser than \(x_1\) at any coordinate above the line \(x_0=x_1\). Conversely, on the other side of the line, \(x_0\) is always greater than \(x_1\). So, let’s plug in coordinates from both sides to gain insight into this case:</p>

<ul>
  <li>coordinate above the line \(x_0=x_1\):</li>
</ul>

\[\text{Input} =
\begin{bmatrix}
x_0 \\
x_1
\end{bmatrix} = 
\begin{bmatrix}
1.5 \\
2
\end{bmatrix}
\qquad
h_2 =
\begin{bmatrix}
4.09 \\
0.175
\end{bmatrix}\]

<ul>
  <li>coordinate below the line \(x_0=x_1\):</li>
</ul>

\[\text{Input} =
\begin{bmatrix}
x_0 \\
x_1
\end{bmatrix} = 
\begin{bmatrix}
5 \\
3
\end{bmatrix}
\qquad
h_2 =
\begin{bmatrix}
1.93 \\
6.01
\end{bmatrix}\]

<p>If the outputs at the different coordinates sparked any insight in you, then you are on the right track. In the second case of \(h_2\), the first neuron is always greater than the second neuron when \(x_0&lt;x_1\), and the inverse holds when \(x_0&gt;x_1\).</p>
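
<p>The ordering claim can be checked numerically. The sketch below reuses the coefficients from the second branch of the piecewise function above; bias terms are dropped, so the magnitudes may differ slightly from the quoted outputs, but the neuron ordering is preserved:</p>

```python
# Second branch of h2, using the coefficients from the piecewise formula
# above (rounded; bias terms are dropped, so magnitudes may differ slightly
# from the quoted outputs, but the neuron ordering still holds).
def h2_case2(x0, x1):
    return [1.61 * x1 - 0.58 * x0, -1.48 * x1 + 2.09 * x0]

above = h2_case2(1.5, 2.0)  # x0 < x1 -> first neuron larger
below = h2_case2(5.0, 3.0)  # x0 > x1 -> second neuron larger
print(above, below)
```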

<h2 id="conclusion">Conclusion</h2>
<p>From the discussed cases of \(h_2\), we can say that the model divides the space into regions where it can tell which entry of the input sequence is greater or smaller. I did this experiment to deeply understand the method described in ARC’s post. As I invested more time in it, I was intrigued by several questions, each of which is itself a separate research direction. I’m glad I did this experiment, and I’m thankful to the ARC team for releasing the toy model; otherwise, I wouldn’t have had the opportunity to work on it. The current brute force method will not work for even slightly bigger models. Next, I’m off to learn other mathematical tools, like the torus, to study bigger models!!</p>]]></content><author><name></name></author><category term="interpretability" /><summary type="html"><![CDATA[As of today, it’s not feasible to fully understand LLM. Therefore, scientists are hypothesizing that studying a toy model would help us to understand the big model. I came across such toy model released by ARC and I tried to understand their understanding.]]></summary></entry><entry><title type="html">I competed against Anthropic to know my self-worth</title><link href="https://poonai.xyz/ai/optimization/2026/01/31/i-competed-against-anthropic-to-know-my-self-worth.html" rel="alternate" type="text/html" title="I competed against Anthropic to know my self-worth" /><published>2026-01-31T00:00:00+00:00</published><updated>2026-01-31T00:00:00+00:00</updated><id>https://poonai.xyz/ai/optimization/2026/01/31/i-competed-against-anthropic-to-know-my-self-worth</id><content type="html" xml:base="https://poonai.xyz/ai/optimization/2026/01/31/i-competed-against-anthropic-to-know-my-self-worth.html"><![CDATA[<p>The whole ego competition started when I received a text message saying, <em>“Anyone here is trying this?”</em> with a blog title: <strong>Designing AI-resistant technical evaluation</strong>. 
As I read the blog, I was excited to see how far AI agents have caught up with coding tasks, and it also created a subtle fear in me of not being valuable anymore. I bet some of you have already felt that way at some point. Anthropic’s open challenge to beat Claude seemed like an invitation to test my self-worth.</p>

<p>After my initial glance at the assignment, I was genuinely impressed by the way they designed it. It contains a toy accelerator and a kernel program that needs to be optimized. Interestingly, the toy accelerator comes with tracing and debug functionality. Such a level of detail energized me to solve the problem.</p>

<h3 id="before-we-optimize-the-kernel-let-us-understand-what-the-kernel-does">Before we optimize the kernel, let us understand what the kernel does:</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">reference_kernel</span><span class="p">(</span><span class="n">t</span><span class="p">:</span> <span class="n">Tree</span><span class="p">,</span> <span class="n">inp</span><span class="p">:</span> <span class="n">Input</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">
    Reference implementation of the kernel.

    A parallel tree traversal where at each node we set
    cur_inp_val = myhash(cur_inp_val ^ node_val)
    and then choose the left branch if cur_inp_val is even.
    If we reach the bottom of the tree we wrap around to the top.
    </span><span class="sh">"""</span>
    <span class="k">for</span> <span class="n">h</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">inp</span><span class="p">.</span><span class="n">rounds</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="nf">len</span><span class="p">(</span><span class="n">inp</span><span class="p">.</span><span class="n">indices</span><span class="p">)):</span>
            <span class="n">idx</span> <span class="o">=</span> <span class="n">inp</span><span class="p">.</span><span class="n">indices</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
            <span class="n">val</span> <span class="o">=</span> <span class="n">inp</span><span class="p">.</span><span class="n">values</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
            <span class="n">val</span> <span class="o">=</span> <span class="nf">myhash</span><span class="p">(</span><span class="n">val</span> <span class="o">^</span> <span class="n">t</span><span class="p">.</span><span class="n">values</span><span class="p">[</span><span class="n">idx</span><span class="p">])</span>
            <span class="n">idx</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">idx</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span> <span class="k">if</span> <span class="n">val</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span> <span class="mi">2</span><span class="p">)</span>
            <span class="n">idx</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">if</span> <span class="n">idx</span> <span class="o">&gt;=</span> <span class="nf">len</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">values</span><span class="p">)</span> <span class="k">else</span> <span class="n">idx</span>
            <span class="n">inp</span><span class="p">.</span><span class="n">values</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">val</span>
            <span class="n">inp</span><span class="p">.</span><span class="n">indices</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">idx</span>
</code></pre></div></div>
<p>It is a good old binary tree traversal with a twist. At each node, the current value is hashed together with the node’s value. The evenness of the hash determines the path of the traversal: if the hash is even, the program traverses left; otherwise, it traverses right. The same traversal logic is replicated for a list of values. Please stare at the Python code above to get a cleaner intuition; it is easier to understand the code than a paragraph.</p>
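
<p>In other words, the index update is heap-style child addressing with a wrap-around at the bottom of the tree. A minimal sketch of just that step (the helper name is mine):</p>

```python
# The index update implements heap-style child addressing (my paraphrase
# of the reference kernel): children of node idx live at 2*idx+1 and
# 2*idx+2, and we wrap to the root when we fall off the bottom of the tree.
def next_index(idx, val, n_nodes):
    idx = 2 * idx + (1 if val % 2 == 0 else 2)
    return 0 if idx >= n_nodes else idx

print(next_index(0, 4, 7))  # even hash -> left child, index 1
print(next_index(0, 5, 7))  # odd hash  -> right child, index 2
print(next_index(3, 4, 7))  # 2*3+1 = 7 >= 7 -> wrap back to the root, 0
```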

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="nb">round</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">rounds</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">batch_size</span><span class="p">):</span>
        <span class="n">i_const</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">scratch_const</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
        <span class="c1"># idx = mem[inp_indices_p + i]
</span>        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">alu</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">+</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_addr</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">scratch</span><span class="p">[</span><span class="sh">"</span><span class="s">inp_indices_p</span><span class="sh">"</span><span class="p">],</span> <span class="n">i_const</span><span class="p">)))</span>
        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">load</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">load</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_idx</span><span class="p">,</span> <span class="n">tmp_addr</span><span class="p">)))</span>
        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">debug</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">compare</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_idx</span><span class="p">,</span> <span class="p">(</span><span class="nb">round</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="sh">"</span><span class="s">idx</span><span class="sh">"</span><span class="p">))))</span>
        <span class="c1"># val = mem[inp_values_p + i]
</span>        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">alu</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">+</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_addr</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">scratch</span><span class="p">[</span><span class="sh">"</span><span class="s">inp_values_p</span><span class="sh">"</span><span class="p">],</span> <span class="n">i_const</span><span class="p">)))</span>
        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">load</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">load</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_val</span><span class="p">,</span> <span class="n">tmp_addr</span><span class="p">)))</span>
        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">debug</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">compare</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_val</span><span class="p">,</span> <span class="p">(</span><span class="nb">round</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="sh">"</span><span class="s">val</span><span class="sh">"</span><span class="p">))))</span>
        <span class="p">..</span><span class="bp">...</span>
        <span class="p">..</span><span class="bp">...</span>
        <span class="c1"># mem[inp_indices_p + i] = idx
</span>        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">alu</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">+</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_addr</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">scratch</span><span class="p">[</span><span class="sh">"</span><span class="s">inp_indices_p</span><span class="sh">"</span><span class="p">],</span> <span class="n">i_const</span><span class="p">)))</span>
        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">store</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">store</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_addr</span><span class="p">,</span> <span class="n">tmp_idx</span><span class="p">)))</span>
        <span class="c1"># mem[inp_values_p + i] = val
</span>        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">alu</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">+</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_addr</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">scratch</span><span class="p">[</span><span class="sh">"</span><span class="s">inp_values_p</span><span class="sh">"</span><span class="p">],</span> <span class="n">i_const</span><span class="p">)))</span>
        <span class="n">body</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="sh">"</span><span class="s">store</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">store</span><span class="sh">"</span><span class="p">,</span> <span class="n">tmp_addr</span><span class="p">,</span> <span class="n">tmp_val</span><span class="p">)))</span>
</code></pre></div></div>
<p>The provided kernel is an exact reimplementation of the Python code without any vectorization. At this point, I was deluding myself into thinking that vectorization alone would solve the problem. I discovered my disappointment only after vectorizing the program: it did not even cut the compute cycles in half.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="nf">int</span><span class="p">(</span><span class="n">batch_size</span> <span class="o">/</span> <span class="n">VLEN</span><span class="p">)):</span>
    <span class="n">i_const</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">scratch_const</span><span class="p">(</span><span class="n">i</span> <span class="o">*</span> <span class="n">VLEN</span><span class="p">)</span>
    <span class="nf">batch_slots</span><span class="p">(</span>
        <span class="sh">"</span><span class="s">alu</span><span class="sh">"</span><span class="p">,</span>
        <span class="p">(</span><span class="sh">"</span><span class="s">+</span><span class="sh">"</span><span class="p">,</span> <span class="n">indices_index</span> <span class="o">+</span> <span class="n">i</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">scratch</span><span class="p">[</span><span class="sh">"</span><span class="s">inp_indices_p</span><span class="sh">"</span><span class="p">],</span> <span class="n">i_const</span><span class="p">),</span>
    <span class="p">)</span>
    <span class="nf">batch_slots</span><span class="p">(</span>
        <span class="sh">"</span><span class="s">alu</span><span class="sh">"</span><span class="p">,</span> <span class="p">(</span><span class="sh">"</span><span class="s">+</span><span class="sh">"</span><span class="p">,</span> <span class="n">value_index</span> <span class="o">+</span> <span class="n">i</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">scratch</span><span class="p">[</span><span class="sh">"</span><span class="s">inp_values_p</span><span class="sh">"</span><span class="p">],</span> <span class="n">i_const</span><span class="p">)</span>
    <span class="p">)</span>
<span class="nf">flush_slots</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Before
<span class="o">{</span><span class="s1">'valu'</span>: <span class="o">[(</span><span class="s1">'+'</span>, 600, 16, 24<span class="o">)]}</span>
<span class="o">{</span><span class="s1">'valu'</span>: <span class="o">[(</span><span class="s1">'+'</span>, 640, 16, 64<span class="o">)]}</span>

// After
<span class="o">{</span><span class="s1">'valu'</span>: <span class="o">[(</span><span class="s1">'+'</span>, 600, 16, 24<span class="o">)</span>, <span class="o">(</span><span class="s1">'+'</span>, 608, 16, 32<span class="o">)</span>,<span class="o">(</span><span class="s1">'+'</span>, 616, 16, 40<span class="o">)</span>, <span class="o">(</span><span class="s1">'+'</span>, 624, 16, 48<span class="o">)</span>,
<span class="o">(</span><span class="s1">'+'</span>, 632, 16, 56<span class="o">)</span>, <span class="o">(</span><span class="s1">'+'</span>, 640, 16, 64<span class="o">)]}</span>
</code></pre></div></div>

<p>The next obvious optimization is to pack multiple independent vectorized operations into one VLIW bundle. VLIW operations are energy efficient and are also a key reason behind Google’s TPU adoption. Essentially, we can batch six operations into one cycle. There were multiple independent operations that could be batched together. One of them is the <code class="language-plaintext highlighter-rouge">indices</code> and <code class="language-plaintext highlighter-rouge">values</code> offset calculation.</p>
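
<p>As a toy illustration of the idea (my own sketch, not the contest’s scheduler API), packing independent operations into fixed-width bundles looks like this:</p>

```python
# Toy VLIW packing sketch (my own illustration, not the contest's API):
# greedily pack up to `width` independent ops into one bundle per cycle.
def pack_bundles(ops, width=6):
    return [ops[i:i + width] for i in range(0, len(ops), width)]

# 14 independent ops fit in 3 cycles instead of 14: bundles of 6, 6, and 2.
bundles = pack_bundles(list(range(14)), width=6)
print(len(bundles))
```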

<p>A tiny detour to say something else that is also useful here. I was quietly studying
undergrad math for 8 months without any sort of connection to the tech world. I stayed away from
GitHub, Reddit, and anything else that would pull me back into the tech echo chamber. During
that time, I noticed something most mathematicians agree on: <strong>many problems are
solved by rewriting the problem into a different form;</strong> <code class="language-plaintext highlighter-rouge">a^2-1</code> can be written as <code class="language-plaintext highlighter-rouge">(a+1)(a-1)</code>.
That is your key to optimizing further, and you can even beat Claude in this game by
understanding the structure of the tree.</p>

<h3 id="starting-line">Starting Line</h3>
<p>Interviewing candidates at Anthropic start at 18,532 cycles, because most LLMs can figure out vectorization on the first pass; hence, Anthropic adjusted the baseline to reflect that.</p>

<p>Now, our compute cycle count would also be around 19,000. This marks the official starting line of the game; whatever cool things we did so far were just a warmup.</p>

<p>Revealing the solution would defeat the purpose of the assignment. However, the purpose has already been defeated: some bad actors have released parts of, or even the full, solution. You can check that if you want. The best humans have gone beyond Claude’s performance. So, decide whether you want to be a mere spectator or the adrenaline-rushed player in the game.</p>

<p>The released version’s baseline is 147,734 cycles. My kernel version took 3,465 cycles, and Claude’s best version is at 1,487 cycles, which is 1,978 cycles fewer than mine. That makes Claude’s version about 2.3x faster; this revelation made the protective part of my ego think otherwise.</p>

<p>Before you all close the tab, I want you to watch this <a href="https://www.youtube.com/shorts/1MOzJuDF1cc">YouTube Short</a>, which is closely related to this blog’s context.</p>]]></content><author><name></name></author><category term="AI" /><category term="optimization" /><summary type="html"><![CDATA[The whole ego competition started when I received a text message saying, “Anyone here is trying this?” with a blog title: Designing AI-resistant technical tech evaluation. As I read the blog, I was excited to see how far AI agents have caught up with coding tasks, and it also created a subtle fear in me of not being valuable anymore. I bet some of you have already felt that way at some point. Anthropic’s open challenge to beat Claude seemed like an invitation to test my self-worth.]]></summary></entry><entry><title type="html">Building a Visual Language Model from scratch</title><link href="https://poonai.xyz/ai/image/llm/natural-language-processing/2025/10/02/building-visual-language-model.html" rel="alternate" type="text/html" title="Building a Visual Language Model from scratch" /><published>2025-10-02T11:12:35+00:00</published><updated>2025-10-02T11:12:35+00:00</updated><id>https://poonai.xyz/ai/image/llm/natural-language-processing/2025/10/02/building-visual-language-model</id><content type="html" xml:base="https://poonai.xyz/ai/image/llm/natural-language-processing/2025/10/02/building-visual-language-model.html"><![CDATA[<p>My initial aim was to build a document processing model, but the idea was far-fetched for my skill at the time. So, I settled for building a toy version of a visual language model to better understand VLMs. I’m documenting my intuition for the benefit of myself and others. My model will receive an image as input and return its caption as output. Luckily, I found a <a href="https://huggingface.co/datasets/visual-layer/imagenet-1k-vl-enriched">dataset</a> with images and their captions to train on.</p>

<p>Instead of training all the components from scratch, I’m going to use pre-existing models and customize them to our needs. You could ask why I mentioned <code class="language-plaintext highlighter-rouge">scratch</code> in the title and say the title is misleading. In general, this is how VLMs are built. Correct me if I’m wrong.</p>

<p>The off-the-shelf models are:</p>
<ul>
  <li><a href="https://huggingface.co/timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k">ViT - Vision Transformer Image Classification</a></li>
  <li><a href="https://huggingface.co/openai-community/gpt2">GPT2 - A Text Generation Model</a></li>
</ul>

<h2 id="vit">ViT</h2>
<p>This model is an image embedding model: it returns an embedding tensor representation of an image. Similar images sit closer together in this embedding space. For example, the vector distance between a lion and a cat will be shorter than the distance between a lion and a car, since lions and cats are both animals.</p>

<p><img src="/assets/images/image_embedding.png" alt="image embedding" /></p>
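<p>To make the distance intuition concrete, here is a toy sketch using made-up 3-dimensional vectors (real ViT embeddings are 512-dimensional); the numbers are illustrative, not actual model outputs.</p>

```python
import numpy as np

# Hypothetical embeddings; in practice these come from the ViT model.
lion = np.array([0.9, 0.8, 0.1])
cat = np.array([0.85, 0.75, 0.15])  # another animal: close to lion
car = np.array([0.1, 0.2, 0.95])    # not an animal: far from lion

def distance(a, b):
    # Euclidean distance between two embedding vectors
    return float(np.linalg.norm(a - b))

print(distance(lion, cat))  # small
print(distance(lion, car))  # large
```

<p>The absolute numbers don’t matter; only the relative distances do.</p>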

<h2 id="gpt2">GPT2</h2>
<p>This model is a text generation model. Given a sequence of words, it predicts the words that follow.</p>

<table>
  <thead>
    <tr>
      <th>Input</th>
      <th>Output</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">The sky is</code></td>
      <td><code class="language-plaintext highlighter-rouge">The sky is blue</code></td>
    </tr>
  </tbody>
</table>
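<p>GPT2 itself is far too large to show inline, but its input/output behavior can be mimicked with a toy greedy predictor. The lookup table below is made up for illustration; real GPT2 predicts a probability distribution over roughly 50k tokens at each step.</p>

```python
# Toy stand-in for GPT2: a lookup table of most likely next words.
next_word = {
    ("the", "sky", "is"): "blue",
}

def generate(words, steps=1):
    # Greedily append the most likely next word, like greedy decoding in GPT2.
    words = list(words)
    for _ in range(steps):
        key = tuple(words[-3:])
        if key not in next_word:
            break
        words.append(next_word[key])
    return " ".join(words)

print(generate(["the", "sky", "is"]))  # the sky is blue
```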

<h2 id="projection-layer">Projection layer</h2>

<p><img src="/assets/images/vlm_architecture.png" alt="vlm architecture" /></p>

<p>This is the key component of our VLM. This layer transforms the image embedding output into GPT2’s textual embedding space. In other words, we convert the image into an intermediate representation from which GPT2 can produce meaningful output.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Projection</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">in_features</span><span class="p">,</span> <span class="n">out_features</span><span class="p">):</span>
        <span class="nf">super</span><span class="p">().</span><span class="nf">__init__</span><span class="p">()</span>
        <span class="n">self</span><span class="p">.</span><span class="n">network</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="nc">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="nc">Linear</span><span class="p">(</span><span class="n">in_features</span><span class="p">,</span> <span class="n">in_features</span> <span class="o">*</span> <span class="mi">3</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="nc">GELU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="nc">Linear</span><span class="p">(</span><span class="n">in_features</span> <span class="o">*</span> <span class="mi">3</span><span class="p">,</span> <span class="n">out_features</span><span class="p">),</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="nb">input</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="nf">network</span><span class="p">(</span><span class="nb">input</span><span class="p">)</span>
</code></pre></div></div>

<p>The output tensor size of the ViT model is <code class="language-plaintext highlighter-rouge">512</code> and the input tensor size of the GPT2 model is <code class="language-plaintext highlighter-rouge">768</code>. The projection layer converts the ViT output tensor size into the GPT2 input tensor size. With training, image tensors are transformed into tensors that GPT2 understands.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">self</span><span class="p">.</span><span class="n">projection</span> <span class="o">=</span> <span class="nc">Projection</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span><span class="mi">768</span><span class="p">)</span>
</code></pre></div></div>
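<p>As a quick sanity check of the shapes, here is a numpy stand-in for the projection layer with random, untrained weights; it only demonstrates the 512 → 768 conversion, not the learned mapping.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x):
    # tanh approximation of GELU, as used in GPT2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Random weights standing in for the trained Projection(512, 768)
w1 = rng.standard_normal((512, 512 * 3)) * 0.02
w2 = rng.standard_normal((512 * 3, 768)) * 0.02

def projection(x):
    # Linear(512 -> 1536), GELU, Linear(1536 -> 768)
    return gelu(x @ w1) @ w2

vit_output = rng.standard_normal((1, 512))  # one image embedding from ViT
gpt2_input = projection(vit_output)
print(gpt2_input.shape)  # (1, 768)
```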

<h2 id="conclusion">Conclusion</h2>

<p>I’ve put all the components together and trained the model with the dataset. Here is the result:</p>

<p>Input Image:</p>

<p><img src="/assets/images/boy_holding_fish.png" alt="A boy holding a fish" /></p>

<p>Output Text:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>a boy holding a fish in the woods
</code></pre></div></div>

<p>This project gave me a solid intuition for building a VLM. In the upcoming days I’m aspiring to build a document processing VLM that beats state-of-the-art (SOTA) benchmarks. You can see
the entire code <a href="https://github.com/poonai/imagenet-caption">here</a></p>]]></content><author><name></name></author><category term="ai" /><category term="image" /><category term="llm" /><category term="natural-language-processing" /><summary type="html"><![CDATA[My initial aim was to build a document processing model. But the idea was far fetched for my skill at the time. So, I settled down for building a toy version of visual language model for better understanding of VLM. I’m documenting my intuition for the benefit of myself and others. My model will receive image as an input and return its caption as an output. Luckily, I found a dataset with image and it’s caption to train.]]></summary></entry><entry><title type="html">Context Pruning in conversational agents</title><link href="https://poonai.xyz/ai/context-engineering/llm/context-pruning/2025/09/08/context-pruning-in-conversational-agents.html" rel="alternate" type="text/html" title="Context Pruning in conversational agents" /><published>2025-09-08T11:12:35+00:00</published><updated>2025-09-08T11:12:35+00:00</updated><id>https://poonai.xyz/ai/context-engineering/llm/context-pruning/2025/09/08/context-pruning-in-conversational-agents</id><content type="html" xml:base="https://poonai.xyz/ai/context-engineering/llm/context-pruning/2025/09/08/context-pruning-in-conversational-agents.html"><![CDATA[<h1 id="my-journey-to-building-agentic-apps">My Journey to Building Agentic Apps</h1>

<p>My childhood dream of having a personal <strong>J.A.R.V.I.S</strong> has come true. The recent advancements in <strong>LLMs (Large Language Models)</strong> made me, like everyone else, take a fresh look at that dream. As an industry, we are all figuring out how to build agentic apps <em>“the right way.”</em></p>

<p>The good news is there is no <em>“the right way”</em> yet, and the bad news is the same. Since there is no right way, we are seeing a huge influx of frameworks showing up every day, and I’m skeptical about the practical usage of such frameworks. So, my action plan is to use a minimal framework and borrow concepts from those frameworks to build AI apps.</p>

<hr />

<h2 id="why-baml">Why BAML?</h2>

<p>The minimal framework is <strong>BAML</strong>. I’ve been using BAML for the past couple of months and am delighted with the developer experience it offers.</p>

<p>It offers <strong>function-style LLM calling</strong>. All you have to do is define your prompt, input, and output format in BAML language. Then BAML generates type-safe functions that take typed input and return typed output. This is like any other function you use every day.</p>

<p>As you read the blog, you’ll understand how easy BAML is — or you can check their <a href="https://docs.boundaryml.com/guide/introduction/what-is-baml">docs</a> as well.</p>

<hr />

<h2 id="context-engineering">Context Engineering</h2>

<p><strong>Context engineering</strong> is an important skill in terms of agentic application building. It is essentially providing the right context to steer the LLM in the right direction. There are several ways to do context engineering.</p>

<p>However, I want to demonstrate <strong>context pruning</strong> in conversational agents. An agent that makes a lot of tool calls is ideal for demonstrating the technique. So, I built an agent that solves mathematical equations by stitching together multiple tool results. The source code can be seen <a href="https://github.com/poonai/context-trimming-sample">here</a>.</p>

<p><strong>Example:</strong><br />
Solve the quadratic equation: <code class="language-plaintext highlighter-rouge">x^2 + 5*x + 6 = 0</code> and find its derivative.</p>

<p>To answer this question:</p>
<ul>
  <li>it has to solve an equation with a tool call</li>
  <li>it has to find a derivative with another tool call</li>
</ul>

<p>Basically, the agent has to call two tools sequentially to answer the user query. The agent can be implemented with a single prompt. Here is the respective BAML code and prompt:</p>

<pre><code class="language-baml">class AgentResponse {
    tools QuadraticSolver | QuadraticDerivative | QuadraticEvaluator | MessageToUser
}

function MathChat(message_history: string[]) -&gt; AgentResponse {
    client OpenRouter
    prompt #"
        You're a helpful Math AI assistant. You have tools to solve equations related to quadratic 
        equations. The user query could be simple, or complex that requires you to take multiple turns
        between you and tools to resolve the user query.

        Use message_to_user tools to reply or ask clarification questions to the user.

        &lt;CONVERSATION HISTORY&gt;
        
        &lt;/CONVERSATION HISTORY&gt;
       
        
    "#
}
</code></pre>

<p>BAML will generate a type-safe Python function where you can pass <code class="language-plaintext highlighter-rouge">message_history</code> as input and <code class="language-plaintext highlighter-rouge">AgentResponse</code> as output.</p>

<p><strong>Example:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">baml_client</span> <span class="kn">import</span> <span class="n">b</span>

<span class="c1"># Example of using the BAML-generated MathChat function
</span><span class="n">message_history</span> <span class="o">=</span> <span class="p">[</span>
    <span class="sh">"</span><span class="s">{</span><span class="sh">'</span><span class="s">role</span><span class="sh">'</span><span class="s">: </span><span class="sh">'</span><span class="s">user</span><span class="sh">'</span><span class="s">, </span><span class="sh">'</span><span class="s">msg</span><span class="sh">'</span><span class="s">: </span><span class="sh">'</span><span class="s">Solve the quadratic equation: x^2 + 5*x + 6 = 0</span><span class="sh">'</span><span class="s">}</span><span class="sh">"</span>
<span class="p">]</span>

<span class="c1"># Call the BAML-generated function
</span><span class="n">agent_response</span> <span class="o">=</span> <span class="n">b</span><span class="p">.</span><span class="nc">MathChat</span><span class="p">(</span><span class="n">message_history</span><span class="p">)</span>

<span class="c1"># The agent_response will contain a tool call, which could be:
# - QuadraticSolver: To solve the equation
# - QuadraticDerivative: To find the derivative
# - QuadraticEvaluator: To evaluate the equation at a specific value
# - MessageToUser: To respond directly to the user
</span></code></pre></div></div>

<p>The AI response could be a tool call or a response to the user. The AI responds based on the feedback obtained from the external system. All the interactions are stored in <code class="language-plaintext highlighter-rouge">message_history</code> and used as context to decide the flow of the conversation.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">chat</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">user_message</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="c1"># Add user message to history
</span>    <span class="n">self</span><span class="p">.</span><span class="n">message_history</span><span class="p">.</span><span class="nf">append</span><span class="p">({</span>
        <span class="sh">'</span><span class="s">role</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">user</span><span class="sh">'</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">msg</span><span class="sh">'</span><span class="p">:</span> <span class="n">user_message</span>
    <span class="p">})</span>
    
    <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
        <span class="c1"># Get response from BAML MathChat
</span>        <span class="n">agent_response</span> <span class="o">=</span> <span class="n">b</span><span class="p">.</span><span class="nc">MathChat</span><span class="p">([</span><span class="nf">str</span><span class="p">(</span><span class="n">message</span><span class="p">)</span> <span class="k">for</span> <span class="n">message</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">message_history</span><span class="p">])</span>
        
        <span class="c1"># Use the tool and get the response
</span>        <span class="n">tool_response</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">use_tool</span><span class="p">(</span><span class="n">agent_response</span><span class="p">)</span>
               
        <span class="c1"># If it's a direct message to the user, return it
</span>        <span class="k">if</span> <span class="n">tool_response</span><span class="p">[</span><span class="sh">'</span><span class="s">role</span><span class="sh">'</span><span class="p">]</span> <span class="o">==</span> <span class="sh">'</span><span class="s">assistant</span><span class="sh">'</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">tool_response</span><span class="p">[</span><span class="sh">'</span><span class="s">msg</span><span class="sh">'</span><span class="p">]</span>
</code></pre></div></div>

<hr />

<h2 id="example-message-history">Example Message History</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>User: Solve the quadratic equation: x² + 5x + 6 = 0  
Assistant (PROCESSING): I need to solve this quadratic equation. [Calling `quadratic_solver`]  
Tool: Solved equation: x² + 5x + 6 = 0. Roots: [-3, -2]  
Assistant (COMPLETED): The roots of the quadratic equation x² + 5x + 6 = 0 are x = -3 and x = -2.  

User: Find the derivative of: 2x² + 4x − 6  
Assistant (PROCESSING): I need to find the derivative of this quadratic equation. [Calling `quadratic_derivative`]  
Tool: Derivative of equation: 2x² + 4x − 6. Result: 4x + 4  
Assistant (COMPLETED): The derivative of the quadratic equation 2x² + 4x − 6 is 4x + 4.  
</code></pre></div></div>
<hr />

<h2 id="pruning-the-context">Pruning the Context</h2>

<p>The interaction between the tool and the AI is not always needed to answer future user questions. Since the AI’s final reply already encapsulates the relevant context, tool-related messages can be removed once they are no longer relevant.</p>

<p>However, we can’t always prune immediately. The AI might still be clarifying missing information before solving. Intermediate tool-related messages should be removed <strong>only when they are no longer relevant</strong>.</p>
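<p>The crudest form of pruning can be sketched as a filter that drops tool calls and tool results once a task is completed. The message format follows the history shown earlier; the function name is my own, not from the sample repo.</p>

```python
def prune_tool_messages(history):
    """Drop tool calls and tool results, keeping only user messages and
    final assistant replies (the ones sent via message_to_user)."""
    pruned = []
    for msg in history:
        if msg["role"] == "tool":
            continue  # raw tool output: already summarized in the final reply
        if msg["role"] == "assistant" and msg.get("tool_name") not in (None, "message_to_user"):
            continue  # intermediate tool-call decision
        pruned.append(msg)
    return pruned

history = [
    {"role": "user", "msg": "solve x^2 + 5*x + 6 = 0"},
    {"role": "assistant", "msg": "solving", "tool_name": "quadratic_solver"},
    {"role": "tool", "msg": "Roots: [-3, -2]"},
    {"role": "assistant", "msg": "The roots are -3 and -2.", "tool_name": "message_to_user"},
]
print(prune_tool_messages(history))  # keeps only the first and last message
```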

<h2 id="clarifying-conversation-example">Clarifying Conversation Example</h2>

<p>Sometimes the AI must clarify before solving. Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>User: Integrate this equation  
Assistant (PROCESSING): Sure! Could you provide the equation you want me to integrate?  

User: x² + 3x  
Assistant (PROCESSING): I need to integrate this equation. [Calling `integral_solver`]  
Tool: Equation received: x² + 3x. Integration requires clarification: definite or indefinite?  
Assistant (PROCESSING): Do you want me to compute the definite integral (with limits) or the indefinite integral of x² + 3x?  

User: Indefinite integral  
Assistant (PROCESSING): I need to compute the indefinite integral of x² + 3x. [Calling `integral_solver`]  
Tool: Result of ∫(x² + 3x) dx = (x³)/3 + (3x²)/2 + C  
Assistant (COMPLETED): The indefinite integral of x² + 3x is (x³)/3 + (3x²)/2 + C. 
</code></pre></div></div>
<hr />

<h2 id="task-status">Task Status</h2>

<p>When to prune the context is decided based on the current phase of the agent: it is either in the <strong>PROCESSING</strong> phase or the <strong>COMPLETED</strong> phase. We’ll let the LLM itself tell us by adding a <code class="language-plaintext highlighter-rouge">task_status</code> flag to <code class="language-plaintext highlighter-rouge">message_to_user</code> tool calls.</p>

<pre><code class="language-baml">enum TaskStatus {
    COMPLETED 
    PROCESSING
}

class MessageToUser {
    type "message_to_user" @description(#"
        DESCRIPTION: MessageToUser is used to respond to the user
    "#)

    response string @description(#"
        The assistant response to the user
    "#)

    task_status TaskStatus
}
</code></pre>

<p><strong>Example pruning logic:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">self</span><span class="p">.</span><span class="n">context_trimming</span> <span class="ow">and</span> <span class="n">tools</span><span class="p">.</span><span class="n">task_status</span> <span class="o">==</span> <span class="sh">"</span><span class="s">COMPLETED</span><span class="sh">"</span><span class="p">:</span>
                
    <span class="c1"># Convert message history to string list for BAML function
</span>    <span class="n">message_history_str</span> <span class="o">=</span> <span class="p">[</span><span class="nf">str</span><span class="p">(</span><span class="n">message</span><span class="p">)</span> <span class="k">for</span> <span class="n">message</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">message_history</span><span class="p">]</span>
                
    <span class="c1"># Call SummarizeContext BAML function
</span>    <span class="n">summarized_context</span> <span class="o">=</span> <span class="n">b</span><span class="p">.</span><span class="nc">SummarizeContext</span><span class="p">(</span><span class="n">message_history_str</span><span class="p">)</span>
                
    <span class="c1"># Replace message history with summarized context
</span>    <span class="n">self</span><span class="p">.</span><span class="n">message_history</span> <span class="o">=</span> <span class="n">summarized_context</span>
</code></pre></div></div>

<hr />

<h2 id="summarization">Summarization</h2>

<p>The pruning logic could be as simple as removing all tool calls, or we can use an LLM to <strong>summarize the context</strong> for us.</p>

<pre><code class="language-baml">function SummarizeContext(message_history: string[]) -&gt; string []{
    client OpenRouter
    prompt #"
         Your job is to summarize the user conversation to reduce the token length.
         Ideas to summarize:
         - remove intermediate tool calls and tool responses 
         - summarize the user question and response in concise form

         &lt;EXAMPLE INPUT&gt;
           {'role': 'user', 'msg': 'solve x^2 - 5*x + 6 =0 and find derivative'}  
           {'role': 'assistant', 'msg': 'I need to solve this quadratic equation', 'tool_name': 'quadratic_solver', 'tool_args': {'equation': 'x^2 - 5*x + 6 = 0'}}  
           {'role': 'tool', 'msg': 'Solved equation: x^2 - 5*x + 6 = 0. Roots: [2, 3]', 'tool_name': 'quadratic_solver', 'metadata': {'equation': 'x^2 - 5*x + 6 = 0', 'result': '[2, 3]'}}  
           {'role': 'assistant', 'msg': 'I need to find the derivative of this quadratic equation', 'tool_name': 'quadratic_derivative', 'tool_args': {'equation': 'x^2 - 5*x + 6 = 0'}}  
           {'role': 'tool', 'msg': 'Derivative of equation: x^2 - 5*x + 6 = 0. Result: 2*x - 5', 'tool_name': 'quadratic_derivative', 'metadata': {'equation': 'x^2 - 5*x + 6 = 0', 'result': '2*x - 5'}}
           {'role': 'assistant', 'msg': 'The roots of the equation x^2 - 5*x + 6 = 0 are 2 and 3. The derivative of the equation is 2*x - 5. If you want me to evaluate the equation or the derivative at a specific x value, please let me know.', 'tool_name': 'message_to_user'}    
         &lt;/EXAMPLE INPUT&gt;

        &lt;EXAMPLE OUTPUT&gt;
           {'role': 'user', 'msg': 'solve x^2 - 5*x + 6 =0 and find derivative'}  
           {'role': 'assistant', 'msg': 'The roots of the equation x^2 - 5*x + 6 = 0 are 2 and 3. The derivative of the equation is 2*x - 5.', 'tool_name': 'message_to_user'}    
        &lt;/EXAMPLE OUTPUT&gt;

        &lt;BAD OUTPUT&gt;
          "User asked to solve x^2 - 5*x + 6 = 0, find its derivative, and the squares of the roots; assistant provided roots (2, 3), derivative (2*x - 5), and squares (4, 9)."
        &lt;/BAD OUTPUT&gt;

        &lt;GOOD OUTPUT&gt;
        {'role': 'user', 'msg': 'solve x^2 - 5*x + 6 =0 and find derivative'}
        {'role': 'assistant', 'msg': 'roots (2, 3), derivative (2*x - 5)', 'tool_name': 'message_to_user'} 
        &lt;/GOOD OUTPUT&gt;
        
        &lt;INPUT&gt;
        
        &lt;/INPUT&gt;
        
    "#
} 
</code></pre>

<hr />

<h2 id="token-usage-comparison">Token Usage Comparison</h2>

<p>Let’s <a href="https://github.com/poonai/context-trimming-sample/blob/main/benchmark.py">measure</a> the token usage with context pruning and without context pruning.</p>

<table>
  <thead>
    <tr>
      <th>Agent Type</th>
      <th>Input Tokens</th>
      <th>Output Tokens</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Smart Agent (with context pruning)</strong></td>
      <td>2333</td>
      <td>309</td>
    </tr>
    <tr>
      <td><strong>Agent (without context pruning)</strong></td>
      <td>4883</td>
      <td>368</td>
    </tr>
  </tbody>
</table>

<p>Lower is better. By pruning unnecessary context, we reduced total token usage by roughly 50% (from 5,251 to 2,642 tokens) while getting the same output.</p>

<p>Fewer tokens bring the following advantages:</p>
<ul>
  <li>less hallucination</li>
  <li>less cost</li>
  <li>less response time</li>
</ul>]]></content><author><name></name></author><category term="ai" /><category term="context-engineering" /><category term="llm" /><category term="context-pruning" /><summary type="html"><![CDATA[My Journey to Building Agentic Apps]]></summary></entry><entry><title type="html">Simplest backpropagation explainer without chain rule</title><link href="https://poonai.xyz/2025/04/27/simplest-backpropagation-explainer-without-chain-rule.html" rel="alternate" type="text/html" title="Simplest backpropagation explainer without chain rule" /><published>2025-04-27T00:00:00+00:00</published><updated>2025-04-27T00:00:00+00:00</updated><id>https://poonai.xyz/2025/04/27/simplest-backpropagation-explainer-without-chain-rule</id><content type="html" xml:base="https://poonai.xyz/2025/04/27/simplest-backpropagation-explainer-without-chain-rule.html"><![CDATA[<p>Neural networks learn to predict through backpropagation. This article aims to help you build a solid intuition about the concept using a simple example. The ideas we learn here can be extended to bigger neural
networks. I assume that you already know how a feed-forward neural network works.</p>

<p>Before reading further, take a pen and paper. The calculations in this article could be done in your head, but I still want you to do them by hand.</p>

<blockquote>
  <p>“Mathematics is not a spectator sport.” — George Pólya</p>
</blockquote>

<h2 id="calculus-the-art-of-change-">Calculus: The Art of Change</h2>

<p>Differentiation is used throughout backpropagation, so it’s crucial for us to revise calculus before reaching our goal. As the title suggests, the derivative tells us how a change in the value of a variable affects the result. In the context of neural networks: how a change in the weights affects the network’s output.</p>

<p>Let’s look at a simple equation:</p>

<p>[
y = x^3
]</p>

<p>If we plug in (x = 2), we get:</p>

<p>[
y = 2^3 = 8
]</p>

<p>Now, what happens if we slightly increase (x) by (0.01)? Instead of calculating everything again, we can use the derivative.</p>

<p>The derivative of (y) is:</p>

<p>[
\frac{dy}{dx} = 3x^2
]</p>

<p>[
dy = 3x^2 \times dx
]</p>

<p>Substituting (x = 2) and (dx = 0.01):</p>

<p>[
dy = 3(2)^2 \times 0.01 = 12 \times 0.01 = 0.12
]</p>

<p>So, if (x) increases by (0.01), (y) should increase by about (0.12), giving approximately (8.12).</p>

<p>Let’s check it:</p>

<ul>
  <li>At (x = 2), (y = 8).</li>
  <li>At (x = 2.01), plugging into the original equation:</li>
</ul>

<p>[
y = (2.01)^3 = 8.120601
]</p>

<p>The actual value is (8.120601), which is very close to our estimate of (8.12).</p>

<p><strong>Note:</strong> The derivative is a good approximation for small changes, but it does not work well for bigger ones. Curious? Plug in (dx = 0.5) and see for yourself.</p>
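<p>The arithmetic above, and the breakdown for larger steps, can be checked with a few lines of Python:</p>

```python
def y(x):
    return x ** 3

def dy(x, dx):
    # linear estimate of the change in y, using the derivative 3x^2
    return 3 * x ** 2 * dx

# small step: the estimate is very close to the true change
print(y(2.01) - y(2))  # ~0.1206
print(dy(2, 0.01))     # 0.12

# bigger step: the estimate drifts noticeably
print(y(2.5) - y(2))   # 7.625
print(dy(2, 0.5))      # 6.0
```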

<h2 id="no-hidden-layer">No hidden layer</h2>

<p>It took me almost two days to understand backpropagation clearly.
The idea finally clicked when I removed the hidden layer and made it a simple one-to-one network.
We’ll take the same route to build up intuition, and later we can stack hidden layers to play with multiple weights.</p>

<p><img src="/assets/images/one-one.png" alt="one-one-network" /></p>

<p>For this simple network, we’ll consider the following parameters:</p>

<ul>
  <li>
    <p>Input ( x = 2 )</p>
  </li>
  <li>
    <p>Weight ( w = 4 )</p>
  </li>
  <li>
    <p>Target output ( y = 10 )</p>
  </li>
</ul>

<p>The prediction formula is:</p>

<p>[
\hat{y} = x \times w
]
Substituting the values:</p>

<p>[
\hat{y} = 2 \times 4 = 8
]
Let’s define a cost function to determine the error rate:</p>

<p>[
\text{Cost} = \hat{y} - y = (x \times w) - 10
]
[
\text{Cost} = (2 \times 4) - 10 = 8 - 10 = -2
]
When the cost reaches zero, the predicted output matches the target output.
But in our case, we are off by 2 units.</p>

<h2 id="how-do-we-decrease-the-cost">How do we decrease the cost?</h2>

<p>To reduce the cost, we need to tweak the weight parameter. However, randomly adjusting weights won’t help — it would be like searching for a needle in a haystack. Instead, we use the derivative to understand how the weight affects the cost.</p>

<p>[
\frac{dC}{dw} = x = 2
]</p>

<p>The derivative tells us that any change in the weight will change the cost by twice that amount. In other words, if we increase the weight by 1 unit, the cost will change by 2 units.</p>

<p>Since our current cost is negative, it signals that the weight should be increased. (If the cost were positive, we would need to decrease the weight.) Thus, we increase the weight to ( w = 5 ) to move the cost to zero.</p>
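<p>The reasoning above can be written as a one-step update. The correction is exact here only because the cost is linear in the weight; this is a sketch of the idea, not the general algorithm.</p>

```python
x, target = 2, 10
w = 4

def cost(w):
    return x * w - target  # signed cost, as defined above

dC_dw = x  # derivative of the cost with respect to w

print(cost(w))  # -2: the prediction is 2 units too low

# negative cost means the prediction is too low, so increase w
w = w - cost(w) / dC_dw
print(w, cost(w))  # 5.0 0.0
```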

<h2 id="with-a-hidden-layer">With a Hidden Layer</h2>

<p>Let’s add a hidden layer to the same simple network:</p>

<ul>
  <li>Input ( x = 2 )</li>
  <li>Weight ( w_1 = 4 )</li>
  <li>Weight ( w_2 = 3 )</li>
  <li>Target output ( y_{\text{target}} = 10 )</li>
</ul>

<p><img src="/assets/images/with_hidden.png" alt="hidden-network" /></p>

<p>The prediction is given by:</p>

<p>[
\hat{y} = (x \cdot w_1) \cdot w_2
]</p>

<p>Substituting the values:</p>

<p>[
\hat{y} = (2 \times 4) \times 3 = 24
]</p>

<p>The cost is the difference between the prediction and the target:</p>

<p>[
\text{Cost} = \hat{y} - y_{\text{target}} = (x \cdot w_1) \cdot w_2 - 10 = 24 - 10 = 14
]</p>

<p>Now, let’s compute the derivatives:</p>

<p>[
\frac{dC}{dw_1} = x \cdot w_2 = 2 \times 3 = 6
]
[
\frac{dC}{dw_2} = x \cdot w_1 = 2 \times 4 = 8
]</p>

<p>The derivatives tell us that ( w_2 ) influences the network more than ( w_1 ).</p>

<p>Now, I want you to <strong>pause reading</strong> and try this quick exercise:</p>
<ul>
  <li>Increase ( w_1 ) by 0.1 and observe how much ( \hat{y} ) changes.</li>
  <li>Increase ( w_2 ) by 0.1 and observe how much ( \hat{y} ) changes.</li>
  <li>Verify that changing ( w_2 ) causes a bigger change in the output than changing ( w_1 ).</li>
</ul>
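<p>If you want to check your pen-and-paper answers, the exercise can be scripted:</p>

```python
x, w1, w2 = 2, 4, 3

def predict(w1, w2):
    return (x * w1) * w2

base = predict(w1, w2)                    # 24
change_w1 = predict(w1 + 0.1, w2) - base  # ~0.6, i.e. dC/dw1 * 0.1
change_w2 = predict(w1, w2 + 0.1) - base  # ~0.8, i.e. dC/dw2 * 0.1
print(change_w1, change_w2)
```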

<h2 id="how-do-computers-adjust-weights">How Do Computers Adjust Weights?</h2>

<p>In our first simple network, we manually found the correct weight using our own intelligence.<br />
Computers, however, work in a much more rudimentary way: they adjust the weights using the corresponding derivatives.</p>

<p>The idea is simple:</p>
<ul>
  <li><strong>Weights with higher influence</strong> (higher derivative) are adjusted more.</li>
  <li><strong>Weights with lower influence</strong> are adjusted less.</li>
</ul>

<p>But here’s the catch:<br />
If the derivative values are large, the weights can change abruptly — causing the cost to fluctuate wildly.<br />
This phenomenon is known as the <strong>exploding gradient problem</strong>.</p>

<p>To prevent this, we multiply the derivative by a small number called the <strong>learning rate</strong> (e.g., ( 0.01 )) to ensure smoother learning:</p>

<p>[
w_1 = w_1 - \text{learning_rate} \times \frac{dC}{dw_1}
]
[
w_2 = w_2 - \text{learning_rate} \times \frac{dC}{dw_2}
]</p>

<p>By training the model over a large number of samples, the weights are <strong>gradually nudged</strong> toward their optimal values, leading to better predictions.</p>
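<p>To watch the update rule converge on the hidden-layer example, here is a small sketch. One caveat: I swap the article’s signed cost for the squared cost ((\hat{y} - y)^2), so the gradient carries the error’s sign and shrinks to zero as the prediction improves.</p>

```python
x, target = 2.0, 10.0
w1, w2 = 4.0, 3.0
lr = 0.01  # learning rate

for _ in range(500):
    y_hat = (x * w1) * w2
    error = y_hat - target
    # gradients of the squared cost (error**2) with respect to each weight
    dw1 = 2 * error * x * w2
    dw2 = 2 * error * x * w1
    w1 -= lr * dw1
    w2 -= lr * dw2

print(round((x * w1) * w2, 4))  # close to 10.0
```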

<h2 id="last-words">Last Words</h2>

<p>I’ve intentionally avoided the chain rule so we can wrap our heads around the core idea. There are a lot of examples out in the wild that use the chain rule. Here is one of my <a href="https://www.youtube.com/watch?v=sIX_9n-1UbM">personal favorites</a>.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Neural networks learn to predict through backpropagation. This article aims to help you build a solid intuition about the concept using a simple example. The ideas we learn here can be extended to bigger neural networks. I assume that you already know how a feed-forward neural network works.]]></summary></entry><entry><title type="html">Part 1: DIY debugger in Golang</title><link href="https://poonai.xyz/2021/09/02/part-1-diy-debugger-in-golang.html" rel="alternate" type="text/html" title="Part 1: DIY debugger in Golang" /><published>2021-09-02T00:00:00+00:00</published><updated>2021-09-02T00:00:00+00:00</updated><id>https://poonai.xyz/2021/09/02/part-1-diy-debugger-in-golang</id><content type="html" xml:base="https://poonai.xyz/2021/09/02/part-1-diy-debugger-in-golang.html"><![CDATA[<p>The first thing I do when I create a project is to create the debugger launch config in the <code class="language-plaintext highlighter-rouge">.vscode</code> folder. Debuggers help me avoid adding print statements and building the program again. I always wondered how a debugger can stop the program on the line number I want and let me inspect variables. The inner workings of debuggers have always been dark magic to me. At last, I managed to learn the dark art by reading several articles and grokking the source code of <a href="https://github.com/go-delve">delve</a>.</p>

<p>In this post, I’ll talk about my learnings while demystifying the dark art of debuggers.</p>

<h2 id="problem-statement">Problem statement</h2>
<p>Let’s define the problem statement before coding. I have a sample Golang program that prints a random int every second. The goal I want to achieve is that our debugger program should print <code class="language-plaintext highlighter-rouge">breakpoint hit</code> before the sample program prints the random integer.</p>

<p>Here is the sample program, which prints a random int every second.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="m">1.</span> <span class="k">package</span> <span class="n">main</span>
<span class="m">2.</span> 
<span class="m">3.</span> <span class="k">import</span> <span class="p">(</span>
<span class="m">4.</span>  <span class="s">"fmt"</span>
<span class="m">5.</span>  <span class="s">"math/rand"</span>
<span class="m">6.</span>  <span class="s">"time"</span>
<span class="m">7.</span> <span class="p">)</span>
<span class="m">8.</span> 
<span class="m">9.</span> <span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="m">10.</span>     <span class="k">for</span> <span class="p">{</span>
<span class="m">11.</span>         <span class="n">variableToTrace</span> <span class="o">:=</span> <span class="n">rand</span><span class="o">.</span><span class="n">Int</span><span class="p">()</span>
<span class="m">12.</span>         <span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="n">variableToTrace</span><span class="p">)</span>
<span class="m">13.</span>         <span class="n">time</span><span class="o">.</span><span class="n">Sleep</span><span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">Second</span><span class="p">)</span>
<span class="m">14.</span>     <span class="p">}</span>
<span class="m">15.</span> <span class="p">}</span>
<span class="m">16.</span> 
</code></pre></div></div>
<h2 id="solution">Solution</h2>
<p>Now that we know what we want to achieve. Let’s go step by step and solve the problem statement.</p>

<p>The first step is to pause the sample program before it prints the random int. That means we have to set the breakpoint at line number 11.</p>

<p>To set the breakpoint at line number 11, we must find the address of the instruction corresponding to that line.</p>

<p>Some of us know from high school that every high-level language is eventually compiled down to assembly. So, how do we find the address of the instruction for a given source line?</p>

<p><img src="/assets/images/cathow.jpg" alt="cathow" /></p>

<p>Luckily, compilers emit debug information alongside the optimized assembly instructions in the output binary. This debug information contains, among other things, the mapping between assembly code and high-level source lines.
For Linux binaries, debug information is usually encoded in the DWARF format.</p>

<blockquote>
  <p>DWARF is a debugging file format used by many compilers and debuggers to support source level debugging. It addresses the requirements of a number of procedural languages, such as C, C++, and Fortran, and is designed to be extensible to other languages. DWARF is architecture independent and applicable to any processor or operating system. It is widely used on Unix, Linux and other operating systems, as well as in stand-alone environments. source: http://www.dwarfstd.org/</p>
</blockquote>

<p>The DWARF data can be parsed using the objdump tool.</p>

<p>The command below outputs every instruction address along with its mapping to a line number and file name.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>objdump <span class="nt">--dwarf</span><span class="o">=</span>decodedline ./sample
</code></pre></div></div>

<p>The objdump command produces output similar to this:</p>
<pre><code class="language-txt">File name                            Line number    Starting address    View    Stmt

/home/poonai/debugger-example/sample.go:
sample.go                                      9            0x498200               x
sample.go                                      9            0x498213               x
sample.go                                     10            0x498221               x
sample.go                                     11            0x498223               x
sample.go                                     11            0x498225        
sample.go                                     12            0x498233               x
sample.go                                     12            0x498236        
sample.go                                     13            0x4982be               x
sample.go                                     13            0x4982cb        
sample.go                                     11            0x4982cd               x
sample.go                                     12            0x4982d2        
sample.go                                      9            0x4982d9               x
sample.go                                      9            0x4982de        
sample.go                                      9            0x4982e0               x
sample.go                                      9            0x4982e5               x
</code></pre>
<p>The output clearly states that <code class="language-plaintext highlighter-rouge">0x498223</code> is the starting address of line number 11 in the sample.go file.</p>
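<p>If you’d rather not shell out to objdump, the same lookup can be done in Go with the standard <code class="language-plaintext highlighter-rouge">debug/elf</code> and <code class="language-plaintext highlighter-rouge">debug/dwarf</code> packages. A minimal sketch (the binary path, file name, and target line are assumptions matching our example; real debuggers like delve do a more careful version of this):</p>

```go
package main

import (
	"debug/dwarf"
	"debug/elf"
	"fmt"
	"os"
	"strings"
)

// findLineAddr walks the DWARF line tables of an ELF binary and returns
// the first statement address mapped to the given file and line.
func findLineAddr(binPath, file string, line int) (uint64, error) {
	f, err := elf.Open(binPath)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	dw, err := f.DWARF()
	if err != nil {
		return 0, err
	}
	rdr := dw.Reader()
	for {
		cu, err := rdr.Next()
		if err != nil || cu == nil {
			break
		}
		// line tables hang off compile-unit entries
		if cu.Tag != dwarf.TagCompileUnit {
			continue
		}
		lr, err := dw.LineReader(cu)
		if err != nil || lr == nil {
			continue
		}
		var entry dwarf.LineEntry
		for lr.Next(&entry) == nil {
			if entry.IsStmt && entry.Line == line &&
				entry.File != nil && strings.HasSuffix(entry.File.Name, file) {
				return entry.Address, nil
			}
		}
	}
	return 0, fmt.Errorf("no address found for %s:%d", file, line)
}

func main() {
	addr, err := findLineAddr("./sample", "sample.go", 11)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("0x%x\n", addr)
}
```

<p>Running this against our sample binary should print the same address that objdump reported.</p>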

<p>The next step is to pause the program at the address <code class="language-plaintext highlighter-rouge">0x498223</code>.</p>

<h2 id="trick-to-pause-the-program-execution">Trick to pause the program execution</h2>
<p>The CPU will interrupt the program whenever it executes the int 3 instruction, whose opcode is 0xCC. So, we just have to overwrite the byte at the address <code class="language-plaintext highlighter-rouge">0x498223</code> with 0xCC to pause the program.</p>
<blockquote>
  <p>In computing and operating systems, a trap, also known as an exception or a fault, is typically a type of synchronous interrupt caused by an exceptional condition (e.g., breakpoint, division by zero, invalid memory access). source: wikipedia</p>
</blockquote>

<p>Does that mean we have to rewrite the binary on disk? No, we can patch the running process’s memory at <code class="language-plaintext highlighter-rouge">0x498223</code> using ptrace.</p>

<h2 id="ptrace-to-rescue">Ptrace to rescue</h2>
<blockquote>
  <p>ptrace is a system call found in Unix and several Unix-like operating systems. By using ptrace (the name is an abbreviation of “process trace”) one process can control another, enabling the controller to inspect and manipulate the internal state of its target. ptrace is used by debuggers and other code-analysis tools, mostly as aids to software development. source:wikipedia</p>
</blockquote>

<p>ptrace is a syscall that allows us to read and write the traced process’s registers and its memory at a given address.</p>

<p>Now we know which address to pause at, how to map source lines to instruction addresses, and how to manipulate the sample program’s memory. So, let’s put all this knowledge into action.</p>

<p>First, exec the sample program with the Ptrace flag set to true, so that we can use ptrace on the spawned process.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">process</span> <span class="o">:=</span> <span class="n">exec</span><span class="o">.</span><span class="n">Command</span><span class="p">(</span><span class="s">"./sample"</span><span class="p">)</span>
<span class="n">process</span><span class="o">.</span><span class="n">SysProcAttr</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">syscall</span><span class="o">.</span><span class="n">SysProcAttr</span><span class="p">{</span><span class="n">Ptrace</span><span class="o">:</span> <span class="no">true</span><span class="p">,</span> <span class="n">Setpgid</span><span class="o">:</span> <span class="no">true</span><span class="p">,</span>    
<span class="n">Foreground</span><span class="o">:</span> <span class="no">false</span><span class="p">}</span>
<span class="n">process</span><span class="o">.</span><span class="n">Stdout</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">Stdout</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">process</span><span class="o">.</span><span class="n">Start</span><span class="p">();</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
    <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The breakpoint can be set at <code class="language-plaintext highlighter-rouge">0x498223</code> by replacing the original byte with the int 3 opcode (0xCC). This can be done with <code class="language-plaintext highlighter-rouge">PtracePokeData</code>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">setBreakpoint</span><span class="p">(</span><span class="n">pid</span> <span class="kt">int</span><span class="p">,</span> <span class="n">addr</span> <span class="kt">uintptr</span><span class="p">)</span> <span class="p">[]</span><span class="kt">byte</span> <span class="p">{</span>
    <span class="n">data</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">([]</span><span class="kt">byte</span><span class="p">,</span> <span class="m">1</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtracePeekData</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
    <span class="p">}</span>
    <span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtracePokeData</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="p">[]</span><span class="kt">byte</span><span class="p">{</span><span class="m">0xCC</span><span class="p">});</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">data</span>
<span class="p">}</span>
</code></pre></div></div>
<p>You might be wondering why <code class="language-plaintext highlighter-rouge">PtracePeekData</code> is used in addition to <code class="language-plaintext highlighter-rouge">PtracePokeData</code>. <code class="language-plaintext highlighter-rouge">PtracePeekData</code> allows us to read the memory at a given address. I’ll explain later why I’m reading the data at the address <code class="language-plaintext highlighter-rouge">0x498223</code> before overwriting it.</p>

<p>Now that the breakpoint is set, we’ll continue the program and wait for the interrupt to happen. This can be done with <code class="language-plaintext highlighter-rouge">PtraceCont</code> and <code class="language-plaintext highlighter-rouge">Wait4</code>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtraceCont</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="m">0</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
     <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="o">.</span><span class="n">Error</span><span class="p">())</span>
 <span class="p">}</span>
 <span class="c">// wait for the interrupt to come.</span>
 <span class="k">var</span> <span class="n">status</span> <span class="n">unix</span><span class="o">.</span><span class="n">WaitStatus</span>
 <span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">Wait4</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">status</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="no">nil</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
     <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="o">.</span><span class="n">Error</span><span class="p">())</span>
 <span class="p">}</span>
 <span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"breakpoint hit"</span><span class="p">)</span>
</code></pre></div></div>
<p>After the breakpoint hits, we want the program to continue as usual. But since we already modified the data at <code class="language-plaintext highlighter-rouge">0x498223</code>, it can’t. So we need to replace the int 3 with the original data.</p>

<p>Remember, we captured the original data at <code class="language-plaintext highlighter-rouge">0x498223</code> using <code class="language-plaintext highlighter-rouge">PtracePeekData</code> while setting the breakpoint. Let’s just revert to the original data at <code class="language-plaintext highlighter-rouge">0x498223</code>.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtracePokeData</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="o">.</span><span class="n">Error</span><span class="p">())</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Just reverting to the original data isn’t enough, because by the time the breakpoint hits, the CPU has already executed the int 3 at <code class="language-plaintext highlighter-rouge">0x498223</code> and moved past it.
So, we need to tell the CPU to execute the instruction at <code class="language-plaintext highlighter-rouge">0x498223</code> again.</p>

<p><img src="/assets/images/registersintro.png" alt="registers" /></p>

<p>The CPU executes the instruction that the instruction pointer points to. If you studied microprocessors at university, you might remember this.</p>

<p><img src="/assets/images/dejavu.jfif" alt="dejavu" />
So, that means if we set the instruction pointer to <code class="language-plaintext highlighter-rouge">0x498223</code> then the CPU will execute the instruction at <code class="language-plaintext highlighter-rouge">0x498223</code> again. CPU registers can be manipulated using <code class="language-plaintext highlighter-rouge">PtraceGetRegs</code> and <code class="language-plaintext highlighter-rouge">PtraceSetRegs</code>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">regs</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">unix</span><span class="o">.</span><span class="n">PtraceRegs</span><span class="p">{}</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtraceGetRegs</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">regs</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
   <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">regs</span><span class="o">.</span><span class="n">Rip</span> <span class="o">=</span> <span class="kt">uint64</span><span class="p">(</span><span class="n">addr</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtraceSetRegs</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">regs</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
      <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
 <span class="p">}</span>
</code></pre></div></div>
<p>Now that we’ve modified the register, continuing the program would resume the normal flow. But we want to hit the breakpoint again, so we’ll tell ptrace to execute only the next instruction and then set the breakpoint again. <code class="language-plaintext highlighter-rouge">PtraceSingleStep</code> allows us to execute exactly one instruction.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">resetBreakpoint</span><span class="p">(</span><span class="n">pid</span> <span class="kt">int</span><span class="p">,</span> <span class="n">addr</span> <span class="kt">uintptr</span><span class="p">,</span> <span class="n">originaldata</span> <span class="p">[]</span><span class="kt">byte</span><span class="p">)</span> <span class="p">{</span>
   <span class="c">// revert back to original data</span>
    <span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtracePokeData</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="n">originaldata</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="o">.</span><span class="n">Error</span><span class="p">())</span>
    <span class="p">}</span>
    <span class="c">// set the instruction pointer to execute the instruction again</span>
    <span class="n">regs</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">unix</span><span class="o">.</span><span class="n">PtraceRegs</span><span class="p">{}</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtraceGetRegs</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">regs</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
    <span class="p">}</span>
    <span class="n">regs</span><span class="o">.</span><span class="n">Rip</span> <span class="o">=</span> <span class="kt">uint64</span><span class="p">(</span><span class="n">addr</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtraceSetRegs</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">regs</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
    <span class="p">}</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtraceSingleStep</span><span class="p">(</span><span class="n">pid</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
    <span class="p">}</span>
    <span class="c">// wait for it's execution and set the breakpoint again</span>
    <span class="k">var</span> <span class="n">status</span> <span class="n">unix</span><span class="o">.</span><span class="n">WaitStatus</span>
    <span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">Wait4</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">status</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="no">nil</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="o">.</span><span class="n">Error</span><span class="p">())</span>
    <span class="p">}</span>
    <span class="n">setBreakpoint</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">addr</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
<p>So far we have learned how to manipulate registers and set breakpoints. Let’s put all of this into a for loop and drive the program.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pid</span> <span class="o">:=</span> <span class="n">process</span><span class="o">.</span><span class="n">Process</span><span class="o">.</span><span class="n">Pid</span>
<span class="n">data</span> <span class="o">:=</span> <span class="n">setBreakpoint</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="m">0x498223</span><span class="p">)</span>
<span class="k">for</span> <span class="p">{</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">PtraceCont</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="m">0</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="o">.</span><span class="n">Error</span><span class="p">())</span>
    <span class="p">}</span>
    <span class="c">// wait for the interrupt to come.</span>
    <span class="k">var</span> <span class="n">status</span> <span class="n">unix</span><span class="o">.</span><span class="n">WaitStatus</span>
    <span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unix</span><span class="o">.</span><span class="n">Wait4</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">status</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="no">nil</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="o">.</span><span class="n">Error</span><span class="p">())</span>
    <span class="p">}</span>
    <span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"breakpoint hit"</span><span class="p">)</span>
    <span class="c">// reset the breakpoint</span>
    <span class="n">resetBreakpoint</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="m">0x498223</span><span class="p">,</span> <span class="n">data</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Phew! Finally, we are able to print <code class="language-plaintext highlighter-rouge">breakpoint hit</code> before our sample program prints each random int.</p>

<pre><code class="language-txt">breakpoint hit
6129484611666145821
breakpoint hit
4037200794235010051
breakpoint hit
3916589616287113937
breakpoint hit
6334824724549167320
breakpoint hit
605394647632969758
breakpoint hit
1443635317331776148
breakpoint hit
894385949183117216
</code></pre>

<p>You can find the full source code at <a href="https://github.com/poonai/debugger-example">https://github.com/poonai/debugger-example</a>.</p>

<p>That’s all for now. Hope you folks learned something new. In the next post, I’ll cover how to extract the values of variables by reading DWARF info. You can follow me on <a href="https://twitter.com/poonai_">Twitter</a> to get notified about part 2.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[The first thing I do when I create a project is to create the debugger launch config at the .vscode folder. Debuggers help me to avoid putting print statements and building the program again. I always wondered how a debugger can stop the program on the line number I want and be able to inspect variables. Debugger workings have always been dark magic for me. At last, I managed to learn the dark art by reading several articles and grokking the source code of delve.]]></summary></entry><entry><title type="html">Can differential privacy protect our privacy?</title><link href="https://poonai.xyz/2021/03/29/can-differential-privacy-protect-our-privacy.html" rel="alternate" type="text/html" title="Can differential privacy protect our privacy?" /><published>2021-03-29T00:00:00+00:00</published><updated>2021-03-29T00:00:00+00:00</updated><id>https://poonai.xyz/2021/03/29/can-differential-privacy-protect-our-privacy</id><content type="html" xml:base="https://poonai.xyz/2021/03/29/can-differential-privacy-protect-our-privacy.html"><![CDATA[<p>I’m a mediocre engineer who does systems work and never had experience in the typical user-facing software space. I’ve contributed to software that scales but never really had a chance to experience the vibe of serving millions of users.</p>

<p>Recently, one of my friends explained to me the kind of events they track at their startup. I felt sick after hearing about them, since they are often very personal to the user. Companies collect data ranging from the user’s geolocation to the names of installed apps (maybe to hike prices if a competitor’s app is not present on the user’s phone).</p>

<p>I decided to dig deeper and see whether any privacy-friendly tracking solution exists, and that took me to <a href="https://docs-assets.developer.apple.com/ml-research/papers/learning-with-privacy-at-scale.pdf">Apple’s paper</a>. The paper explains how Apple leveraged a count-min sketch with added noise to infer aggregate user behaviour without compromising user privacy.</p>
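<p>For the curious, the count-min sketch at the heart of the paper is simple to build. Below is a minimal, noise-free version in Go (the paper’s private variant additionally randomizes each client’s contribution before upload; that step is omitted here, and the row/column sizes are arbitrary choices for illustration):</p>

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Sketch is a minimal count-min sketch: a matrix of counters,
// one row per (salted) hash function.
type Sketch struct {
	rows, cols int
	counts     [][]uint64
}

func NewSketch(rows, cols int) *Sketch {
	c := make([][]uint64, rows)
	for i := range c {
		c[i] = make([]uint64, cols)
	}
	return &Sketch{rows: rows, cols: cols, counts: c}
}

// index hashes the item with a per-row salt and maps it to a column.
func (s *Sketch) index(row int, item string) int {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d:%s", row, item)
	return int(h.Sum64() % uint64(s.cols))
}

// Add increments one counter per row for the item.
func (s *Sketch) Add(item string) {
	for r := 0; r < s.rows; r++ {
		s.counts[r][s.index(r, item)]++
	}
}

// Estimate returns the minimum counter across rows: an upper bound on
// the true count that is tight when hash collisions are rare.
func (s *Sketch) Estimate(item string) uint64 {
	min := s.counts[0][s.index(0, item)]
	for r := 1; r < s.rows; r++ {
		if c := s.counts[r][s.index(r, item)]; c < min {
			min = c
		}
	}
	return min
}

func main() {
	sk := NewSketch(4, 1024)
	for i := 0; i < 9000; i++ {
		sk.Add("board")
	}
	for i := 0; i < 6000; i++ {
		sk.Add("list")
	}
	fmt.Println(sk.Estimate("board"), sk.Estimate("list"))
}
```

<p>The appeal is that the server stores only this small counter matrix, never a per-user event log.</p>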

<p>Researchers often bake in assumptions that may not hold everywhere, so to validate the paper I ran a small <a href="https://github.com/poonai/diffrential_privacy/blob/master/cms_test.go#L30">experiment</a>: estimating the most popular view in a project management app.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c">// we are tracking what view users are using in their project management app.</span>
    <span class="c">// 6k users are using list view.</span>
    <span class="n">track</span><span class="p">(</span><span class="m">6000</span><span class="p">,</span> <span class="s">"list"</span><span class="p">)</span>
    <span class="c">// 9k user using board view.</span>
    <span class="n">track</span><span class="p">(</span><span class="m">9000</span><span class="p">,</span> <span class="s">"board"</span><span class="p">)</span>
    <span class="c">// 2k user using calendar view.</span>
    <span class="n">track</span><span class="p">(</span><span class="m">2000</span><span class="p">,</span> <span class="s">"calendar"</span><span class="p">)</span>
</code></pre></div></div>
<p>The gap between the actual counts and the estimates computed via differential privacy is not that large.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"estimate for list"</span><span class="p">,</span> <span class="n">server</span><span class="o">.</span><span class="n">Estimate</span><span class="p">([]</span><span class="kt">byte</span><span class="p">(</span><span class="s">"list"</span><span class="p">)))</span>
    <span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"estimate for board"</span><span class="p">,</span> <span class="n">server</span><span class="o">.</span><span class="n">Estimate</span><span class="p">([]</span><span class="kt">byte</span><span class="p">(</span><span class="s">"board"</span><span class="p">)))</span>
    <span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"estimate for calendar"</span><span class="p">,</span> <span class="n">server</span><span class="o">.</span><span class="n">Estimate</span><span class="p">([]</span><span class="kt">byte</span><span class="p">(</span><span class="s">"calendar"</span><span class="p">)))</span>
    <span class="c">// output</span>
    <span class="c">// estimate for list 6572.1029024055715</span>
    <span class="c">// estimate for board 9154.186791339975</span>
    <span class="c">// estimate for calendar 1157.8019026490715</span>
</code></pre></div></div>
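<p>The core trick behind this approach is that each device adds noise before anything leaves it, and the server statistically inverts the noise in aggregate. It can be illustrated with classic randomized response; here is a toy sketch (the 30% true rate, user count, and seed are made up for illustration):</p>

```go
package main

import (
	"fmt"
	"math/rand"
)

// randomizedResponse flips a fair coin: on heads it reports the truth,
// on tails it reports a second independent coin flip. Any single
// reported answer is therefore plausibly deniable.
func randomizedResponse(truth bool, r *rand.Rand) bool {
	if r.Intn(2) == 0 {
		return truth
	}
	return r.Intn(2) == 0
}

// estimateTrueRate simulates n users whose true "yes" rate is trueRate,
// collects their noisy answers, and inverts the noise:
// P(reported yes) = 0.5*trueRate + 0.25, so trueRate is about 2p - 0.5.
func estimateTrueRate(n int, trueRate float64, seed int64) float64 {
	r := rand.New(rand.NewSource(seed))
	observed := 0
	for i := 0; i < n; i++ {
		truth := r.Float64() < trueRate
		if randomizedResponse(truth, r) {
			observed++
		}
	}
	p := float64(observed) / float64(n)
	return 2*p - 0.5
}

func main() {
	est := estimateTrueRate(100000, 0.30, 42)
	fmt.Printf("estimated true rate: %.3f\n", est)
}
```

<p>No individual answer can be trusted, yet the aggregate estimate lands close to the true 30%: the same trade-off the count-min-sketch experiment above exploits.</p>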

<p>I presented it to my friend and asked him whether his company would consider using such tech.</p>

<p>Unfortunately, aggregate inference of users’ behaviour alone is not enough; companies also want to send notifications based on certain events.</p>

<p>For example, if the user dropped off on a certain app screen without performing an action, the tracking system should be able to send a notification that nudges the user to complete the action. For that, tracking services have to track events at the <a href="https://docs.moengage.com/docs/tracking-user-attributes#default-user-attributes">user level</a>.</p>

<p>Companies don’t just want analytics; they also want to target the user, which reminds me of a quote from the movie <code class="language-plaintext highlighter-rouge">The Social Dilemma</code>:</p>
<blockquote>
  <p>We want to psychologically figure out how to manipulate you as fast as possible</p>
</blockquote>

<h3 id="closing-thoughts">Closing Thoughts:</h3>
<p>Differential privacy looks good on paper, but on its own it is not enough to cater to big companies’ needs. Still, there are some use cases where I think it can be deployed:</p>
<ul>
<li>tracking customers’ sensitive data in aggregate.</li>
  <li>privacy-focused applications.</li>
</ul>

<p>I would love to hear where else differential privacy can be plugged in. Here’s <a href="https://twitter.com/poonai_">my profile</a> if anyone wants to reach out.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[I’m a mediocre engineer who does systems work and never had experience in the typical user-facing software space. I’ve contributed to software that scales but never really had a chance to experience the vibe of serving millions of users.]]></summary></entry></feed>