PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation

🎉 Our paper has been accepted to EMNLP 2025 System Demonstrations!

Eliya Habba^*, Noam Dahan^*, Gili Lior, Gabriel Stanovsky
The Hebrew University of Jerusalem
^*Equal contribution

PromptSuite framework showing modular prompt components and their perturbations

Python Example

# Import the library
from promptsuite import PromptSuite

# Initialize
ps = PromptSuite()

# 1) Load dataset directly from HF
ps.load_dataset("rajpurkar/squad")

# 2) Set up the template and 3) choose perturbations
template = {
    'instruction': 'Please answer the following questions.',
    'prompt format': 'Q: {question}\nA: {answer}',
    'instruction variations': ['paraphrase_with_llm'],
    'prompt format variations': ['format structure'],
}
ps.set_template(template)

# 4) Generate variations
ps.configure(variations_per_field=3, api_platform="OpenAI",
              model_name="gpt-4o-mini", api_key="YOUR_OPENAI_API_KEY")

variations = ps.generate(verbose=True)

# Export results
ps.export("output.json", format="json")

PromptSuite's web interface for easy prompt variation generation

🎯 Why Use PromptSuite? 🎯

Transform single-prompt evaluation into robust multi-prompt evaluation

Why PromptSuite?

Single-prompt evaluation is unreliable
Small changes cause significant performance differences
Manual prompt variation is time-consuming
Need for standardized multi-prompt evaluation

What Does It Do?

Generates diverse prompt variations automatically
Supports different tasks out-of-the-box
Modular design with component-wise perturbations
Both Python API and web interface

How to Use It?

Install via pip: pip install promptsuite
Load your dataset (CSV, JSON, HuggingFace)
Configure template with variation types
Generate variations with one command
Export in multiple formats

🌐 Try Our Interactive Web Interface

No coding required - generate prompt variations through our user-friendly interface

🎮 Features:

📁 Upload Data or use sample datasets
🔧 Build Templates with smart suggestions

⚡ Generate Variations with real-time progress
💾 Export Results in multiple formats

Abstract

Evaluating LLMs with a single prompt has proven unreliable, with small changes leading to significant performance differences. However, generating the prompt variations needed for a more robust multi-prompt evaluation is challenging, limiting its adoption in practice.

To address this, we introduce PromptSuite, a framework that enables the automatic generation of various prompts. PromptSuite is flexible – working out of the box on a wide range of tasks and benchmarks. It follows a modular prompt design, allowing controlled perturbations to each component, and is extensible, supporting the addition of new components and perturbation types.

Through a series of case studies, we show that PromptSuite provides meaningful variations to support strong evaluation practices. It is available through both a Python API and a user-friendly web interface.

Get started: https://github.com/eliyahabba/PromptSuite

🤗 Public Dataset on Hugging Face

We've made our benchmark results publicly available on Hugging Face! The dataset contains prompt variations and model outputs in DOVE format across multiple tasks and models.

📊 What's Included

2 Models: GPT-4o-mini, Llama-3.3-70B-Instruct
9 Task Categories: Multiple-choice reasoning (MMLU), Mathematical problem solving (GSM8K), Sentiment analysis (SST), Translation (WMT14), Summarization (CNN/DailyMail), Multi-hop QA (MuSiQue), Reading comprehension (SQuAD), Graduate-level reasoning (GPQA Diamond), Code generation (HumanEval)
Scale: 50 examples per task, up to 25 variations per example
Total: up to ~1,250 evaluated prompts per task (50 examples × 25 variations)

🔧 Quick Usage

from datasets import load_dataset

# Load MMLU anatomy with GPT-4o-mini
dataset = load_dataset(
    "nlphuji/PromptSuite",
    data_files="GPT-4o-mini/en/2_shots/mmlu:dataset=mmlu.anatomy*.parquet"
)

# Load all tasks for a model
all_tasks = load_dataset(
    "nlphuji/PromptSuite",
    data_files="GPT-4o-mini/en/2_shots/*.parquet"
)

The dataset includes metadata for prompt variations, model responses, and evaluation metrics. Perfect for studying prompt sensitivity and robustness across different tasks and models.
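
As a rough illustration of what studying prompt sensitivity can look like, the sketch below groups model outputs by prompt variation and reports the accuracy spread across variations. The field names `variation` and `correct` are hypothetical stand-ins, not the dataset's actual column names, and the toy records are invented for the example rather than taken from the benchmark.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Toy records (made up for illustration): one entry per evaluated prompt.
# "variation" and "correct" are hypothetical field names.
records = [
    {"variation": 0, "correct": True},
    {"variation": 0, "correct": True},
    {"variation": 0, "correct": True},
    {"variation": 0, "correct": False},
    {"variation": 1, "correct": True},
    {"variation": 1, "correct": False},
    {"variation": 1, "correct": False},
    {"variation": 1, "correct": False},
    {"variation": 2, "correct": True},
    {"variation": 2, "correct": True},
    {"variation": 2, "correct": False},
    {"variation": 2, "correct": False},
]


def accuracy_per_variation(rows):
    """Return a mapping from variation id to mean accuracy."""
    hits = defaultdict(list)
    for row in rows:
        hits[row["variation"]].append(1.0 if row["correct"] else 0.0)
    return {variation: mean(scores) for variation, scores in hits.items()}


acc = accuracy_per_variation(records)
spread = max(acc.values()) - min(acc.values())  # best minus worst variation
std = pstdev(acc.values())                      # dispersion across variations

for variation, value in sorted(acc.items()):
    print(f"variation {variation}: accuracy {value:.2f}")
print(f"spread: {spread:.2f}, std: {std:.3f}")
```

A large spread between the best and worst variation is the kind of signal that a single-prompt evaluation would hide.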

🔧 Supported Perturbation Types

Perturbation Type	Applicable Components	Description
Formatting	Instruction, prompt format, demonstrations, instance content	Adds surface-level noise to the text: inserting extra spaces, introducing typos, changing letter casing, and altering punctuation. Follows Sclar et al. (2023).
Paraphrase	Instruction	Creates semantically equivalent variations of the instruction that differ in phrasing and style. Follows Mizrahi et al. (2024).
Context addition	Instance content	Uses an LLM to add text related to the instance content without revealing or changing the answer. Follows Liu et al. (2023) and Levy et al. (2024).
Demonstration editing	Demonstrations	Changes the few-shot examples, their order, and their number. Follows Lu et al. (2021).

Table 1: Overview of the perturbation types supported by PromptSuite. The "Applicable Components" column specifies which prompt components each perturbation can be applied to. For example, paraphrasing is applicable to the instruction component.
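
To make the "Formatting" row more concrete, here is a minimal standard-library sketch of surface-level perturbations (extra spaces, an adjacent-letter typo, and a casing change). It is not PromptSuite's implementation; the function names, probabilities, and seed handling are all assumptions chosen for illustration.

```python
import random


def add_extra_spaces(text: str, rng: random.Random, prob: float = 0.3) -> str:
    """Randomly double some of the single spaces between words."""
    words = text.split(" ")
    pieces = []
    for i, word in enumerate(words):
        pieces.append(word)
        if i < len(words) - 1:
            pieces.append("  " if rng.random() < prob else " ")
    return "".join(pieces)


def swap_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent inner letters in one randomly chosen word."""
    words = text.split(" ")
    candidates = [i for i, w in enumerate(words) if len(w) >= 4 and w.isalpha()]
    if not candidates:
        return text
    i = rng.choice(candidates)
    word = words[i]
    j = rng.randrange(1, len(word) - 2)  # keep first and last letters in place
    words[i] = word[:j] + word[j + 1] + word[j] + word[j + 2:]
    return " ".join(words)


def change_case(text: str, rng: random.Random) -> str:
    """Apply one randomly chosen casing style to the whole text."""
    return rng.choice([str.upper, str.lower, str.title])(text)


def formatting_variations(text: str, n: int, seed: int = 0) -> list:
    """Return n perturbed copies of text, one random perturbation each."""
    rng = random.Random(seed)
    ops = [add_extra_spaces, swap_typo, change_case]
    return [rng.choice(ops)(text, rng) for _ in range(n)]


text = "Please answer the following questions."
variations = formatting_variations(text, n=5, seed=0)
for v in variations:
    print(repr(v))
```

Each perturbation leaves the multiset of letters untouched, so the meaning of the instruction stays recoverable while its surface form changes.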

Using PromptSuite

📦 Installation

Install PromptSuite from PyPI:

pip install promptsuite

Or install from source:

git clone https://github.com/eliyahabba/PromptSuite.git
cd PromptSuite
pip install -e .

🚀 Basic Usage

Generate prompt variations with just a few lines of code:

from promptsuite import PromptSuite
import pandas as pd

# Initialize
ps = PromptSuite()

# Load data
data = [{"question": "What is 2+2?", "answer": "4"}]
ps.load_dataframe(pd.DataFrame(data))

# Configure template
template = {
    'instruction': 'Please answer the following questions.',
    'prompt format': 'Q: {question}\nA: {answer}',
    'instruction variations': ['paraphrase_with_llm'],
    'prompt format variations': ['format structure'],
}
ps.set_template(template)

# Generate variations
ps.configure(variations_per_field=3, api_platform="OpenAI", model_name="gpt-4o-mini")
variations = ps.generate(verbose=True)

# Export results
ps.export("output.json", format="json")
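
To see how the placeholders in 'prompt format' relate to the data columns, the sketch below fills the same template with plain `str.format`. This only illustrates the placeholder convention; `render_prompt` is a hypothetical helper, and PromptSuite's own rendering (including how it joins the instruction and the prompt format) may differ.

```python
# Illustrative only: plain string formatting, not PromptSuite's internals.
rows = [{"question": "What is 2+2?", "answer": "4"}]

instruction = "Please answer the following questions."
prompt_format = "Q: {question}\nA: {answer}"


def render_prompt(instruction: str, prompt_format: str, row: dict) -> str:
    """Fill each {placeholder} with the matching column of one data row."""
    return instruction + "\n" + prompt_format.format(**row)


prompts = [render_prompt(instruction, prompt_format, row) for row in rows]
print(prompts[0])
```

Every placeholder name in the prompt format must match a column name in the loaded data; otherwise `str.format` raises a `KeyError` in this sketch.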

🎯 Advanced Examples

Sentiment Analysis

import pandas as pd
from promptsuite import PromptSuite

data = pd.DataFrame({
    'text': ['I love this movie!', 'This book is terrible.'],
    'label': ['positive', 'negative']
})

template = {
    'instruction': 'Classify the sentiment',
    'instruction variations': ['paraphrase_with_llm'],
    'prompt format': 'Text: {text}\nSentiment: {label}',
    'text': ['typos and noise'],
}

ps = PromptSuite()
ps.load_dataframe(data)
ps.set_template(template)
ps.configure(variations_per_field=3, max_variations_per_row=2)
variations = ps.generate(verbose=True)

Multiple Choice with Few-shot

import pandas as pd
from promptsuite import PromptSuite

template = {
    'instruction': 'Answer the question:\nQuestion: {question}\nAnswer: {answer}',
    'instruction variations': ['paraphrase_with_llm'],
    'question': ['semantic'],
    'gold': 'answer',
    'few_shot': {
        'count': 2,
        'format': 'same_examples__synchronized_order_variations',
        'split': 'train'
    }
}

# qa_data is assumed to be a pandas DataFrame with 'question' and 'answer' columns
ps = PromptSuite()
ps.load_dataframe(qa_data)
ps.set_template(template)
ps.configure(variations_per_field=2)
variations = ps.generate(verbose=True)
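
The few-shot settings above relate to the demonstration editing perturbation, which changes the examples and their order. The following standard-library sketch shows one way such order variations could be produced: pick a fixed set of demonstrations once, then enumerate distinct orderings. The helper names and the toy training pool are hypothetical, and this reading of the `format` option is an assumption rather than a description of the library's behavior.

```python
import random
from itertools import permutations


def demonstration_orders(examples, count, max_orders, seed=0):
    """Pick `count` examples once, then return up to `max_orders` orderings."""
    rng = random.Random(seed)
    chosen = rng.sample(examples, count)  # same examples for every variation
    orders = [list(order) for order in permutations(chosen)]
    rng.shuffle(orders)
    return orders[:max_orders]


def format_few_shot(demos, question):
    """Render demonstrations followed by the unanswered target question."""
    blocks = [f"Question: {d['question']}\nAnswer: {d['answer']}" for d in demos]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Toy training pool, invented for illustration.
train_pool = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "What is 3+5?", "answer": "8"},
    {"question": "What is 7-4?", "answer": "3"},
]

orders = demonstration_orders(train_pool, count=2, max_orders=2)
few_shot_prompts = [format_few_shot(order, "What is 6+1?") for order in orders]
for prompt in few_shot_prompts:
    print(prompt)
    print("---")
```

With two demonstrations there are exactly two orderings, so both prompts contain the same examples in opposite order.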

💾 Export Formats

# JSON - Full data with metadata
ps.export("output.json", format="json")

# CSV - Flattened for spreadsheets
ps.export("output.csv", format="csv")

# TXT - Plain prompts only
ps.export("output.txt", format="txt")

Citation

@misc{habba2025promptsuitetaskagnosticframeworkmultiprompt,
      title={PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation},
      author={Eliya Habba and Noam Dahan and Gili Lior and Gabriel Stanovsky},
      year={2025},
      eprint={2507.14913},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.14913},
}

License

This project is licensed under the MIT License. See the LICENSE file for details.