YC reveals the secrets of prompt engineering for top AI agents: no longer a "black box", but evolvable "code" and a manageable "employee"

Written by
Silas Grey
Updated on: June 13, 2025
Recommendation

Uncover the secrets of prompt engineering behind top AI agents and explore how to build an efficient, reliable AI conversation experience.

Core content:
1. The core role of prompts in AI interaction and strategies for improving their reliability in production environments
2. An in-depth analysis of prompt failure cases, and how companies run quality tests and optimize LLM output
3. A ParaHelp case study: how a six-page prompt drives an AI customer-service agent, covering key elements such as role setting and task definition

Prompts are often described as the "incantations" of large language models. They have evolved into the core of interacting with AI and a key link in building efficient, reliable AI applications.

Recently, on YC's Lightcone podcast, Garry, Harj, Diana, and Jared, experts at the frontier of AI entrepreneurship and technology, dug into the lessons they have accumulated from working with hundreds of LLM founders.

They discuss why prompts still matter, where they tend to fail, and how top teams improve their reliability in production. They share real-world examples of prompt failures, reveal how companies run quality tests, and explain how top teams make LLM output useful and predictable.

ParaHelp in action: a six-page prompt that helps customer-service agents understand you better

To understand state-of-the-art prompt engineering, let's start with a concrete example. The AI customer-service company ParaHelp provides customer support for well-known AI companies such as Perplexity, Replit, and Bolt. Its AI agents are driven by carefully designed prompts, and ParaHelp has generously published one of its core prompts, giving us a peek inside.

The first impression of this prompt is that it is long and detailed: six pages in total. Its core design ideas include:

  • Role Setting : Tell the LLM clearly what role it plays, such as "You are a manager of customer service agents," and list its responsibilities in detail using bullet points.
  • Task Definition : Clearly describe the task to be completed, such as "approve or reject a tool call."
  • Step-by-Step Plan : Break down the task into specific steps, such as steps 1, 2, 3, 4, 5.
  • Behavioral constraints : Spell out what the model must pay attention to while performing the task, such as never calling unauthorized tools.
  • Structured Output : Specifies the format of output to facilitate collaboration between different agents and API calls. ParaHelp's Prompt requires output in a specific format (such as acceptance or rejection) for subsequent processing.
  • Markdown style formatting : Use Markdown formatting (such as headings, subheadings, bullet points) to organize prompt content to make it easier to read and clearer.
  • Reasoning : The best prompts explain how to think and reason about the task.
  • XML tag format : Use XML-like tags in the prompt to specify plans and steps. Because many LLMs were exposed to XML-like input during the RLHF (Reinforcement Learning from Human Feedback) phase, this format is easier for them to follow and produces better results. ParaHelp's planning prompt uses XML-like tags such as <plan> and <step>.
  • Conditional Logic : Tags such as <if_block> enable conditional judgments, allowing the agent to take different steps depending on the situation. Interestingly, ParaHelp deliberately does not allow the model to use "else" blocks; instead, an explicit "if" condition must be defined for every path, which they found improves performance in evaluations.
  • Variable references : The model can use variable names (for example, a placeholder for the result of a tool call, or {{policy_variable}} for a variable in a specific policy), so it can plan a process across multiple tool calls without knowing the concrete output values. A combined sketch of these elements follows this list.
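To make these elements concrete, here is a minimal sketch in Python that combines the role setting, step-by-step plan, conditional logic, variable references, and structured output described above. It is not ParaHelp's actual prompt; the exact tag names, placeholders, and output schema are illustrative assumptions.

```python
# A minimal, illustrative sketch of the prompt structure described above.
# It is NOT ParaHelp's actual prompt; the tag names (<plan>, <step>, <if_block>)
# and the exact output schema are assumptions used only to show the pattern.

MANAGER_PROMPT = """
# Role
You are a manager of customer service agents.
Your responsibilities:
- Review every tool call an agent proposes before it is executed.
- Approve only tool calls that are explicitly allowed by policy.

# Task
Approve or reject the tool call proposed by the agent.

# Plan
<plan>
  <step>Read the customer's request and the proposed tool call.</step>
  <step>Check the call against {{policy_variable}}.</step>
  <if_block condition="the tool is on the approved list and its arguments match policy">
    <step>Approve the call.</step>
  </if_block>
  <if_block condition="the tool is not on the approved list">
    <step>Reject the call and name the policy it violates.</step>
  </if_block>
</plan>

# Output format (structured, so other agents and API calls can parse it)
Respond with exactly one JSON object:
{"decision": "accept" | "reject", "reason": "<one sentence>"}
"""

if __name__ == "__main__":
    print(MANAGER_PROMPT)
```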

Prompts like this, designed for vertical AI agents, are usually regarded as a company's core intellectual property. The YC partners also pointed out that in practice, prompts are split into different levels (a sketch of how these layers compose follows the list):

  • System Prompt : Defines the company-level operating logic, like a high-level API, such as the general framework ParaHelp demonstrated.
  • Developer Prompt : Customized for a specific customer or scenario, containing specific context; for example, the way RAG issues are handled for Perplexity may differ from Bolt.
  • User Prompt : Input directly by the end user, for example a Replit user typing "help me generate a website with these buttons." ParaHelp's product form means it may not have a direct user prompt.
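A hedged sketch of how these three layers might compose into a single request. The customer names come from the example above, but the snippets, the "developer" role label, and the build_messages helper are illustrative assumptions rather than ParaHelp's actual implementation.

```python
# A sketch of composing the three prompt layers into one chat request.
# Snippets, role labels, and the helper function are illustrative only.

SYSTEM_PROMPT = "You are a customer-support agent. Follow the escalation and tool-call policies below..."

DEVELOPER_PROMPTS = {
    # Per-customer context; e.g. RAG handling may differ between customers.
    "perplexity": "Answer using the retrieved documentation snippets and cite the snippet id you used.",
    "bolt": "Prefer linking to the project dashboard; never expose internal build logs.",
}

def build_messages(customer: str, user_input: str) -> list[dict]:
    """Assemble the system / developer / user layers for a single request."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "developer", "content": DEVELOPER_PROMPTS[customer]},
        {"role": "user", "content": user_input},
    ]

if __name__ == "__main__":
    for m in build_messages("perplexity", "My API key stopped working, can you help?"):
        print(m["role"].upper(), "->", m["content"][:60])
```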

There are a lot of entrepreneurial opportunities in tool development around Prompt engineering, such as automatically extracting best practices from customer datasets and integrating them into Prompt workflows, eliminating manual work.

Metaprompting: Let prompts evolve themselves

An exciting trend is metaprompting. Garry compares it to programming in 1995: the tools are not perfect yet, but the potential is huge. The core idea of metaprompting is to have a prompt dynamically generate a better version of itself (a minimal sketch follows the list below).

  • Prompt Folding : A classifier prompt can dynamically generate a more specialized prompt based on the previous query.
  • Improve with failed cases : Feed the LLM the cases where the prompt failed and let it help improve the original prompt. This works well because the LLM "knows" itself better than we do.
  • Play the role of an expert : A simple metaprompting trick is to have the LLM play the role of a "prompt engineering expert," then feed it your own prompt; it will return detailed suggestions for improvement. Harj said this process can be iterated continuously.
  • Big model optimization, small model execution : Diana mentioned that some companies use more powerful models (such as Claude 3 Opus or GPT-4) for metaprompting, optimize until they get a high-quality prompt, and then run it on a smaller, faster model. This matters especially for voice AI agents that need low latency to pass the "Turing test."
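A minimal sketch of the metaprompting loop described above: feed failed cases plus the current prompt to a larger model acting as a "prompt engineering expert," then run the improved prompt on a smaller, faster model. The call_llm stub, the model names, and the METAPROMPT wording are assumptions, not any company's actual pipeline.

```python
# Metaprompting sketch: a big model rewrites the prompt, a small model executes it.
# `call_llm` is a placeholder; swap in whatever client you actually use.

def call_llm(model: str, prompt: str) -> str:
    """Placeholder: swap in your actual LLM client here."""
    raise NotImplementedError

METAPROMPT = """You are an expert prompt engineer.
Below is a prompt and examples where it failed.
Rewrite the prompt so these failures would be handled correctly.

<prompt>
{prompt}
</prompt>

<failed_cases>
{failures}
</failed_cases>

Return only the improved prompt."""

def improve_prompt(prompt: str, failures: list[str], big_model: str = "large-model") -> str:
    """Ask a stronger model to rewrite the prompt using its failure cases."""
    failures_text = "\n---\n".join(failures)
    return call_llm(big_model, METAPROMPT.format(prompt=prompt, failures=failures_text))

def run_task(task_input: str, prompt: str, small_model: str = "small-fast-model") -> str:
    """Execute the optimized prompt on a cheaper, lower-latency model."""
    return call_llm(small_model, prompt + "\n\nInput:\n" + task_input)
```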

The power of high-quality examples

In addition to metaprompting, providing high-quality examples is key to improving LLM output. For example, a company called Jazzberry gets the model to automatically find bugs in code by feeding the LLM many examples of complex code defects (such as N+1 queries) that only expert programmers would catch.

This "teaching by example is worse than teaching how to fish" approach can help LLM understand and handle complex tasks that are difficult to describe accurately in words, similar to "test-driven development" (TDD) in programming.

How to keep the LLM from spouting confident nonsense

Sometimes an LLM will confidently make things up, that is, hallucinate, just to satisfy the required output format. That is why the LLM must be given an "escape hatch."

  • Clearly state "I don't know" : The LLM needs to be told that if the information is insufficient to make a judgment, it should not make up an answer but stop and ask.
  • "Complaint" mechanism : An innovative approach among YC companies is to add a "debug info" parameter to the LLM's response format, letting the LLM "complain" to developers about vague or insufficient input. These complaints form a to-do list that guides developers in improving the prompt. A sketch of both ideas follows this list.

Evals are king: where the real moat lies

Although the prompt itself is important, evaluations (evals) are the real "crown jewels" and data assets of these AI companies. ParaHelp is willing to publish its prompt in part because they believe that without the evals, it is impossible to understand why the prompt is designed the way it is, and hard to improve it.
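A minimal sketch of what such an eval set can look like: hand-labeled real-world cases with expected decisions, scored against the current prompt. The cases and the call_llm stub are illustrative; the real value is the domain knowledge encoded in the cases, as the next paragraph describes.

```python
# Minimal eval-harness sketch: labeled cases scored against the current prompt.
# `call_llm` is a placeholder; the cases are illustrative only.

def call_llm(prompt: str, case_input: str) -> str:
    """Placeholder for your actual model call."""
    raise NotImplementedError

EVAL_CASES = [
    {"input": "Customer asks for a refund 40 days after purchase (policy: 30 days).",
     "expected": "reject"},
    {"input": "Customer reports a duplicate charge and attaches the receipt.",
     "expected": "accept"},
]

def run_evals(prompt: str) -> float:
    """Return the fraction of eval cases the current prompt handles correctly."""
    passed = 0
    for case in EVAL_CASES:
        output = call_llm(prompt, case["input"]).strip().lower()
        if output == case["expected"]:
            passed += 1
    return passed / len(EVAL_CASES)
```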

Garry agrees, and believes that the ability to obtain high-quality evaluation data is crucial for vertical AI and SaaS companies. This requires a deep understanding of the real workflow of specific users. For example, "You have to sit next to the regional manager of tractor sales in Nebraska and understand what he cares about, what his incentives are, and how he handles invoices and warranty issues." Turning these first-hand observations into concrete evaluation criteria is where the real value lies, and it is a startup's strongest answer to the charge that "you are just a wrapper around the model." This is the startup's moat.

The founder as a "forward deployed engineer" (FDE): deep understanding of user scenarios is the key to success

This extreme insight into user scenarios led to the concept of "Founder as a Forward Deployed Engineer (FDE)". Garry, who previously worked at Palantir, explained that the concept of FDE stems from Palantir sending engineers directly to the offices of customers (such as FBI agents), working side by side with them to understand their real needs and pain points, and quickly transforming these insights into usable software solutions.

  • Engineers face customers directly : Unlike traditional sales staff, Palantir dispatches engineers to communicate directly with customers, which enables them to understand problems more deeply and iterate products quickly.
  • Rapid prototyping and feedback : The core of the FDE model is "show, not tell." After meeting with customers, engineers can quickly build prototypes and get real feedback instead of spending weeks or even months on sales documents and contracts.
  • Accelerator in the AI era : Diana pointed out that combining the FDE model with AI lets vertical AI companies rise rapidly. The founding team can talk directly with the purchasing decision makers of large enterprises, gather contextual information, quickly adjust prompts, and even show an impressive demo the next day, thereby signing six-figure or even seven-figure deals. Giga ML, mentioned by Harj, and Happy Robot, mentioned by Diana, are both success stories.
  • Multiple roles of the founder : In this model, the founder must be a technical expert, product manager, user researcher and designer.

Big Model "Personality" Differences: The Art of Prompt in Teaching Students in Accordance with Their Aptitude

An interesting observation is that different large models seem to have their own "personalities." Diana mentioned that Claude is generally considered more "helpful" and easy to guide, while Llama may require more explicit instructions and feels more like communicating with a developer, which may be related to how much RLHF training it has received.

Harj shared their experience using different models for investor scoring: they gave the LLM a rubric and asked for a score from 0 to 100 (a sketch of this setup follows the list below).

  • Claude  : Behaving very "rigidly", strictly adhering to the evaluation standards, and imposing heavy penalties for not fully meeting the standards.
  • Gemini : shows greater "flexibility". It applies evaluation criteria but also understands and takes into account some exceptions, just like an employee with high autonomy who can think more deeply.
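A sketch of the setup Harj describes: the same rubric and 0-100 scale are sent to two different models so their scoring behavior can be compared. The rubric text, the model names, and the call_llm stub are illustrative assumptions, not the actual rubric used.

```python
# Rubric-scoring sketch: compare how strictly two models apply the same rubric.
# `call_llm`, model names, and the rubric text are illustrative placeholders.

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for your actual model client."""
    raise NotImplementedError

RUBRIC_PROMPT = """Score this investor from 0 to 100 using the rubric below.
Rubric:
- Responds to founders within 48 hours (30 points)
- Has relevant domain expertise (40 points)
- Offers clear, founder-friendly terms (30 points)

Investor notes:
{notes}

Return JSON only: {{"score": <0-100>, "debug_info": "<reasoning or caveats>"}}"""

def score_with_models(notes: str, models: tuple[str, str] = ("model-a", "model-b")) -> dict:
    """Send the same rubric to two models to compare how strictly each applies it."""
    prompt = RUBRIC_PROMPT.format(notes=notes)
    return {model: call_llm(model, prompt) for model in models}
```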

Garry points out that this difference is very useful for evaluating complex situations, such as deciding whether an investor is worth accepting. Some investors may have impeccable processes, while others may be very capable but slow to respond because of busy schedules. When the LLM handles these nuances, its "debug info" and final judgment become very interesting.

A new understanding of prompt engineering: coding, management, and "improvement"

Garry concluded that prompt engineering today is, on the one hand, like going back to the early days of programming in 1995, when the tools were imperfect, little was standardized, and everything was uncharted exploration; on the other hand, it is very much like learning how to manage an employee, which requires clearly communicating goals, expectations, and evaluation criteria.

On a deeper level, this embodies the philosophy of "continuous improvement" (kaizen), a concept that originated in Japanese manufacturing and helped the Japanese automobile industry take off in the 1970s and 1980s. It emphasizes that the people inside a process are the best candidates to improve it, which echoes the metaprompting idea of letting the prompt iterate on itself.

We are in an exciting new era. Prompt engineering is no longer a mysterious black box; it has gradually evolved into a sophisticated craft that combines coding skill, management wisdom, and a philosophy of continuous improvement. In the future, we will undoubtedly witness more innovations and breakthroughs around prompts.