Author: Ryan Syed
Unless you’ve been living under a rock, you’ve heard of “vibe coding.” The term, coined by OpenAI co-founder Andrej Karpathy, refers to the art of “giving in to the vibes, embracing exponentials, and forgetting that the code even exists […] because the LLMs (e.g., Cursor Composer w Sonnet) are getting too good.” As a result, software development has changed dramatically over the last few years, for better or for worse. Earlier this year, Y Combinator reported that 25% of the startups in its latest batch wrote the majority of their code using AI, and some anecdotal evidence suggests that vibe coding simple tasks can boost productivity.
To put vibe coding to the test, I asked two different LLMs to build an app from scratch using the same prompt. Then I reviewed the resulting code and identified the vulnerabilities each introduced to get a better sense of where vibe coding can go wrong.
Vibe coding is a contentious topic, but there’s a reason it’s gaining popularity: it’s quick! Instead of writing code by hand, anyone can simply describe the functionality they want in plain language. If the AI gets something wrong, you tell it there was an error, paste the output, and rinse and repeat until your code works. The use of AI to help write code is finding its way into every level of software engineering.
That said, vibe coding is not without its drawbacks. A fundamental weakness of LLMs is their finite context window, that is, the size of their “memory” before they begin to lose track of context within the codebase and start to hallucinate. Once your prompts and responses have exceeded this window, it quickly becomes apparent that the LLM no longer remembers certain components of the architecture, which hurts the quality of the code it produces. People have tried to find workarounds, such as compacting the conversation or maintaining a master set of documentation describing the code, but it’s clear that this limitation hasn’t been solved directly.
From a security perspective, an LLM cannot guarantee that the code it produces is secure. Knowledgeable developers acting as a human in the loop may be able to catch and fix the issues LLMs introduce. However, this also means that one rushed code review might be the difference between preventing a vulnerable application and shipping one.
Someone could create a new application on a whim for a number of reasons. For example, let’s imagine our company needs an application for employees to create invoices but doesn’t want to pay for enterprise software. We can vibe code an app for that!
Follow along as I vibe code an invoice app using Claude and ChatGPT.
For consistency, we’ll use the same prompt for each app. We’ll also keep it simple with a “one-shot” method, meaning we won’t iterate on the prompt below as we create the invoice application. We’ll use each LLM’s web UI and stick to the free tier.
Important note: Some touch-ups were necessary to make the vibe-coded app usable, but no changes were made to improve the security of the application.
Prompt: Can you build me an app for my company so managers can create and render invoices for the clients they work with? Here’s a list of the features I want:
For the backend and storing user data and invoice information, use a local SQLite database. Rendered PDFs can be saved to a folder on disk. I will be running this application on Linux.
Apart from telling the LLM to use SQLite and Linux (solely to make the code easier to run), the prompt gives the LLM free rein to architect the app however it chooses.
Before breaking down the results, it’s worth noting that neither LLM chose to implement active security protections on its own (e.g., MFA, rate limiting, and secrets management). For the purposes of this blog, we’re explicitly looking at vulnerability classes that an attacker would exploit to cause real damage, but it’s important to recognize that most vibe coders won’t set up additional controls unless someone or some project specifically asks for them.
Claude opted to use Flask, a Python-based web framework, and put all 667 lines of code in a single file. Since Python is the most popular language according to the TIOBE Index, it makes sense that an LLM trained on a large corpus of data would default to it. It is worth noting that Claude tends to default to React for code that renders in the browser; since we explicitly asked for a backend, it did not attempt to use React at all.
While reviewing the source code, one of the first things I noticed was that all of the SQL queries were parameterized, meaning SQL injection was not possible.
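As a point of reference, parameterization means user input is passed to the database driver as a bound value instead of being spliced into the SQL string. A minimal sketch of the difference using Python’s sqlite3 module (illustrative, not Claude’s exact code):

import sqlite3

conn = sqlite3.connect("invoices.db")

def get_invoices_unsafe(client_name):
    # Vulnerable: the input is concatenated into the SQL string, so a value like
    # "x' OR '1'='1" changes the structure of the query itself.
    return conn.execute(f"SELECT * FROM invoices WHERE client = '{client_name}'").fetchall()

def get_invoices_safe(client_name):
    # Parameterized: the driver binds the value, so it is only ever treated as data.
    return conn.execute("SELECT * FROM invoices WHERE client = ?", (client_name,)).fetchall()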
That said, the application was not free from injection vulnerabilities. For example, on line 490, the application passes the contents of one of the invoice fields directly to eval(). This is both dangerous and unnecessary, as it introduces a code injection vulnerability: eval() executes the string it receives as Python. Since the invoice[7] variable stores a user-controlled list of items, an attacker may be able to craft a malicious invoice that runs arbitrary Python on the server.
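Claude’s exact code isn’t reproduced here, but the danger of the pattern is easy to demonstrate, along with a safer alternative (ast.literal_eval, which only parses Python literals):

import ast

# A stored invoice item list that an attacker controls could contain a payload like this:
malicious_items = "__import__('os').system('id')"

# What the vulnerable pattern effectively does when rendering the invoice:
items = eval(malicious_items)  # executes the attacker's command on the server

# Safer: literal_eval parses lists, dicts, strings, and numbers only,
# and raises an error on anything else.
items = ast.literal_eval("[{'description': 'Consulting', 'amount': 1200}]")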
However, that’s not even the easiest way to break this app. The html variable defined on line 492 echoes values from an invoice directly into the HTML string that is eventually passed to render_template_string(). Not only does this open the door to cross-site scripting (XSS), it also enables server-side template injection (SSTI), which can lead to remote code execution.
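Again, this is not the verbatim generated code, but the sketch below shows why concatenating user data into the template source is exploitable, while passing it as template context is not:

from flask import Flask, render_template_string

app = Flask(__name__)

# Attacker-controlled invoice field; a real attacker would escalate from here
# to remote code execution via Jinja2's object graph.
client_name = "{{ 7 * 7 }}"

with app.app_context():
    # Vulnerable: the user's data becomes part of the template itself,
    # so Jinja2 evaluates the injected expression (this renders "49").
    print(render_template_string(f"<h1>Invoice for {client_name}</h1>"))

    # Safer: the data is passed as context, where it is treated (and autoescaped)
    # strictly as a value (this renders the literal "{{ 7 * 7 }}").
    print(render_template_string("<h1>Invoice for {{ name }}</h1>", name=client_name))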
Surprisingly, Claude handled authentication and authorization better than expected, though not at a level that would be sufficient for an enterprise environment. Instead of leveraging Python decorators, a JWT library, or any other pattern applied consistently across the application, Claude performed ad hoc database lookups against Flask sessions in each endpoint.
This might work for now, but as applications grow, having to write a custom authorization check for every endpoint is technical debt that could easily hurt a developer down the line. It wouldn’t be difficult for a software engineer to recognize this or even prompt an LLM to consider secure design patterns, but the average “vibe coder” might not notice. Even then, it’s well known that LLMs can struggle in larger codebases since their context window is finite, so there’s a good chance that this authorization pattern could be forgotten with time.
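To make the contrast concrete, here is a rough sketch of the per-endpoint pattern (illustrative, not Claude’s verbatim code): every route repeats the same session check and database lookup inline, and a shared decorator such as the admin_required() example shown later would keep that check in one place.

from flask import Flask, abort, session
import sqlite3

app = Flask(__name__)
app.secret_key = "change-me"  # illustrative only
db = sqlite3.connect("invoices.db", check_same_thread=False)

@app.route("/invoices")
def list_invoices():
    # The same block is repeated in every route; forgetting it once
    # silently leaves an endpoint unprotected.
    if "user_id" not in session:
        abort(401)
    user = db.execute("SELECT id, role FROM users WHERE id = ?", (session["user_id"],)).fetchone()
    if user is None:
        abort(401)
    # ...fetch and return the current user's invoices...
    return "ok"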
Like Claude, ChatGPT also opted to use Flask. Unlike Claude, though, it took a bit more effort to get a functional app: ChatGPT did not originally return the HTML for the templates, nor did it build a form to create an invoice (though the endpoint was there).
A notable difference between how ChatGPT and Claude approached development was that ChatGPT opted to break functionality out into multiple files. ChatGPT’s code also followed idiomatic Flask design patterns, such as defining the data model with SQLAlchemy. Although breaking code up into multiple files doesn’t have an immediate impact on security, using language-specific patterns like the model sketched below does make the application more secure by design.
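The generated model isn’t reproduced verbatim, but the Flask-SQLAlchemy pattern it followed looks roughly like this, and queries built through the ORM are parameterized by default:

from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()

# Illustrative sketch of a declarative model, not ChatGPT's verbatim code
class Invoice(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    client_name = db.Column(db.String(120), nullable=False)
    amount = db.Column(db.Numeric(10, 2), nullable=False)
    created_by = db.Column(db.Integer, db.ForeignKey("user.id"), nullable=False)

# ORM queries bind values rather than splicing them into SQL:
# Invoice.query.filter_by(client_name=name).all()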
This version of the app also does a better job of clearly identifying which endpoints should only be accessible to admin users. The admin_required() function, along with defining routes as separate Blueprints, significantly reduces the likelihood that someone makes a mistake and gives low-privileged users access to admin functionality.
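Roughly, that combination looks like the sketch below (names are illustrative, not ChatGPT’s verbatim code): every route registered on the admin Blueprint gets the same decorator, so forgetting the check on a single endpoint becomes much harder.

from functools import wraps
from flask import Blueprint, abort, session

admin_bp = Blueprint("admin", __name__, url_prefix="/admin")

def admin_required(view):
    @wraps(view)
    def wrapped(*args, **kwargs):
        # Reject anyone whose session is not flagged as an admin
        if not session.get("is_admin"):
            abort(403)
        return view(*args, **kwargs)
    return wrapped

@admin_bp.route("/users")
@admin_required
def list_users():
    # ...admin-only user management...
    return "ok"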
Despite these controls, the LLM did make one significant mistake when it came to returning invoice information.
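The generated routes aren’t reproduced verbatim here, but they followed roughly this shape (names and helpers are illustrative, reusing the Invoice model from the earlier sketch and Flask-Login for authentication): both endpoints confirm that someone is logged in, yet never check that the requested invoice belongs to that user.

from flask import Blueprint, render_template, send_file
from flask_login import login_required

invoices_bp = Blueprint("invoices", __name__)

# Illustrative sketch of the vulnerable pattern, not ChatGPT's verbatim code
@invoices_bp.route("/invoice/<int:invoice_id>")
@login_required
def view_invoice(invoice_id):
    # Missing: a check that invoice.created_by == current_user.id
    invoice = Invoice.query.get_or_404(invoice_id)
    return render_template("invoice.html", invoice=invoice)

@invoices_bp.route("/invoice/<int:invoice_id>/pdf")
@login_required
def invoice_pdf(invoice_id):
    # Same gap: any authenticated user can download any invoice's PDF
    invoice = Invoice.query.get_or_404(invoice_id)
    return send_file(invoice.pdf_path)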
Neither of these functions has a proper authorization check, meaning this is an insecure direct object reference (IDOR) waiting to happen! On top of this, since the invoice_id is sequential, it would be very easy for an attacker to create their own account and enumerate all of the invoices. Once authenticated, all the attacker would have to do is run a shell command similar to the one below.
seq 1 100 | while read num; do curl "https://insecure-website.com/invoice/$num/pdf" -o invoice_$num.pdf; done
Even though ChatGPT’s app only had an IDOR vulnerability, this does not mean that ChatGPT is better or worse than Claude. This experiment has a very small sample size, and since LLMs are probabilistic, their output will not be identical even with the same prompt, so variance is guaranteed. There’s a good chance that the authorization gap in the ChatGPT app was introduced as a result of needing multiple prompts to make sure all of the features were there.
Ultimately, the apps we built and tested here are very small compared to the feature-rich targets out on the web, so it shouldn’t come as a surprise that the LLMs were able to follow some secure design patterns. Even so, from a security standpoint, LLM-written code should not be blindly trusted and should be scrutinized by developers, because models do not guarantee secure code. In fact, models won’t secure code without explicit direction.
There is a clear gap in default vibe-coded output: neither Claude nor ChatGPT implemented essential security protections (e.g., MFA, rate limiting, or secrets management) without being specifically prompted, and common vulnerability classes appeared in both apps.
Despite using modern frameworks, the vibe-coded apps were susceptible to serious vulnerabilities, from code injection and server-side template injection to IDOR.
Additionally, Claude’s app, which relies on per-endpoint database lookups with Flask sessions, introduced technical debt, and technical debt is a gold mine for hidden vulnerabilities and persistent attackers. Finally, the inherent limitations of LLMs, such as the finite context window and the probabilistic nature of their output, mean that simple fixes and consistent security patterns can easily be “forgotten” or varied between iterations, potentially introducing vulnerabilities that were not present before.
Between the context window issues and probabilistic design of these AI models, it can be an exciting time to be involved with application security and bug bounty. LLMs can make exploits considered “rare” and “unlikely” more common than you would expect, so doubling down on methods to detect vulnerabilities will become increasingly valuable to organizations looking to integrate AI into more of their workflows.