Hacker News

This is exactly what I'm doing. Some papers I'm studying:

TextGrad: Automatic "Differentiation" via Text: https://arxiv.org/abs/2406.07496

LLM-AutoDiff: Auto-Differentiate Any LLM Workflow: https://arxiv.org/abs/2501.16673

Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs: https://arxiv.org/abs/2406.16218

GReaTer: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers: https://arxiv.org/abs/2412.09722

PromptWizard: Task-Aware Prompt Optimization Framework: https://arxiv.org/abs/2405.18369



I was trying to pick n-shot examples from a data set. The idea was that, given thousands of candidate examples for a prompt, finding an optimal combination of n of them could be advantageous. But for large n, brute-forcing every combination is impossible... so can we find an optimal set with an efficient search?

But the problem was that the search space wasn't informative. The best single example didn't appear in the best pair of examples, so I couldn't build up incrementally to sets of 5, 6, 7 examples.
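One cheap middle ground between pure greedy selection and brute force is beam search over example subsets: keeping several candidate sets per step partly guards against exactly that problem, where the best 1-example set doesn't extend to the best 2-example set. A minimal sketch, with a toy scoring function standing in for a real eval run (the synergy term is an invented illustration, not data from any paper):

```python
def beam_search_examples(examples, score, n, beam_width=3):
    """Select n few-shot examples by beam search: at each step, extend
    every surviving candidate set by one example and keep the top
    beam_width sets by score. beam_width=1 reduces to greedy selection."""
    beams = [tuple()]
    for _ in range(n):
        candidates = set()
        for beam in beams:
            for ex in examples:
                if ex not in beam:
                    candidates.add(tuple(sorted(beam + (ex,))))
        beams = sorted(candidates, key=score, reverse=True)[:beam_width]
    return beams[0]

def toy_score(subset):
    """Hypothetical scorer with an interaction: examples 1 and 2 are weak
    alone but strong together, the kind of structure greedy search misses."""
    base = {0: 0.5, 1: 0.4, 2: 0.1}
    s = sum(base[e] for e in subset)
    if 1 in subset and 2 in subset:
        s += 0.6  # synergy term
    return s

print(beam_search_examples([0, 1, 2], toy_score, n=2))               # -> (1, 2)
print(beam_search_examples([0, 1, 2], toy_score, n=2, beam_width=1))  # -> (0, 1)
```

With beam_width=1 the search greedily keeps only the best single example (0) and ends at (0, 1), missing the synergistic pair (1, 2) that the wider beam finds. It's still heuristic, of course; if interactions only show up at large n, no local search fixes that.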


I guess this really depends on the problem but from the PromptWizard (PW) paper:

    | Approach | API calls | IO tokens/call | Total tokens | Cost ($) |
    |----------|-----------|----------------|--------------|----------|
    | Instinct | 1730      | 67             | 115910       | 0.23     |
    | InsZero  | 18600     | 80             | 1488000      | 2.9      |
    | PB       | 5000      | 80             | 400000       | 0.8      |
    | EvoP     | 69        | 362            | 24978        | 0.05     |
    | PW       | 69        | 362            | 24978        | 0.05     |
They ascribe this gain in efficiency to balancing exploration and exploitation: a first phase mutates instructions, followed by a phase where the instruction and the few-shot examples are optimized jointly. They also rely on "textual gradients" (criticism enhanced by CoT), as well as on synthesizing examples and counter-examples.

What I gathered from reading those papers (plus a few more) is that textual feedback, i.e. using an LLM to reason about how to carry out a step of the optimization process, is what gives structure to the search space.
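The core loop behind "textual gradients" is small: a critic model produces natural-language criticism of the current prompt (the "gradient"), and an editor model applies it (the "update"). A minimal sketch of one step, with `critic` and `editor` as hypothetical LLM-call wrappers, stubbed here with canned strings just to show the data flow (this is my paraphrase of the idea, not any paper's exact procedure):

```python
def textual_gradient_step(prompt, failures, critic, editor):
    """One step of textual-gradient optimization: criticize, then rewrite.
    critic/editor take a string and return a string (e.g. an LLM call)."""
    critique = critic(
        f"Prompt:\n{prompt}\n\nFailing cases:\n{failures}\n"
        "Explain, step by step, why the prompt fails on these cases."
    )
    return editor(
        f"Prompt:\n{prompt}\n\nCritique:\n{critique}\n"
        "Rewrite the prompt so that it addresses the critique."
    )

# Stubs standing in for real model calls.
critic = lambda msg: "The prompt never asks for explicit step-by-step reasoning."
editor = lambda msg: "Solve the task. Think step by step before answering."

new_prompt = textual_gradient_step(
    "Solve the task.", ["2+2*2 -> answered 8"], critic, editor
)
print(new_prompt)
```

In a real system you'd run this in a loop over a validation set, keeping the rewrite only when it improves the score; the critique text is what makes each move through prompt space directed rather than random mutation.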


Super interesting.

I will have to read it. I'll be looking to figure out whether the tasks they are working on are significant/realistic, and whether the improvements they are finding are robust.


The tasks these methods are tackling are generally significant and realistic. Think complex QA like HotPotQA or Google-Proof QA, math reasoning (GSM8K), coding challenges, and even agentic systems. It's not just about toy problems anymore.

Are the improvements robust? It's an evolving space, but the big win seems to be for smaller, open-source LLMs. These techniques can genuinely lift them to near the performance of larger, proprietary models, which is massive for cost reduction and accessibility. For already-SOTA models, the headline metric gains may be in the low single digits on very hard tasks, but that often translates into crucial improvements in reliability and in the model's ability to follow complex instructions accurately.

"Textual gradient"-like mechanisms (or execution traces, or actual gradients over reasoning, as in some newer work) are becoming essential. Manually fine-tuning complex prompt workflows, or prompts with many distinct nodes or components, just doesn't scale. These automated methods provide a more principled and systematic way to guide and refine LLM behavior.

So, less "spectacular" gains on the absolute hardest tasks with the biggest models, yes, but still valuable. More importantly, it's a powerful optimization route for making capable AI more efficient and accessible. And critically, it's shifting prompt design from a black art to a more transparent, traceable, and robust engineering discipline. That foundational aspect is probably the most significant contribution right now.





