Spaces: algorithmicsuperintelligence/prompt-optimizer

codelion committed 22 days ago

Commit bc07d04 (verified) · 1 Parent(s): 769e325

Upload app.py

Files changed (1)

app.py +29 -23

@@ -734,22 +734,27 @@ Bad patterns to avoid:
 ```
 # Task
-Rewrite the prompt above to improve its performance on the specified metrics.
-REMEMBER:
-- SHORTER is usually BETTER - avoid adding unnecessary words
-- Keep the EXACT same output format (especially placeholder variables like {{input}})
-- Focus on DIRECTNESS - what's the clearest way to ask for what we need?
-- Avoid conversational language that might confuse the model
-- For classification: ask directly for the label, don't ask for explanations
-Provide a complete new version of the prompt that:
-1. Maintains the same input/output format (keep placeholders like {{input}}, {{text}}, etc.)
-2. Is brief and direct
-3. Clearly asks for the classification/answer without asking for reasoning
-4. Will cause the model to output the label word in its response
-Output ONLY the new prompt text between ```text markers:
 ```text
 Your improved prompt here
@@ -761,11 +766,11 @@ Your improved prompt here
     config = {
         "llm": {
-            "primary_model": model,
             "api_base": "https://openrouter.ai/api/v1",  # Use OpenRouter endpoint
-            "temperature": 1.0,  # Higher temperature for more creative variations
         },
-        "max_iterations": 10,  # More iterations for better exploration
         "checkpoint_interval": 1,  # Save checkpoints every iteration to preserve prompt history
         "diff_based_evolution": False,  # Use full rewrite mode for prompts (not diff/patch mode)
         "language": "text",  # CRITICAL: Optimize text/prompts, not Python code!
@@ -775,11 +780,11 @@ Your improved prompt here
             "template_dir": templates_dir,  # Use our custom prompt engineering templates
         },
         "evolution": {
-            "population_size": 10,  # Smaller population but more iterations
             "num_islands": 1,  # Single island for simpler evolution
-            "elite_ratio": 0.3,  # Keep top 30% (3 best prompts)
-            "explore_ratio": 0.2,  # Less random exploration
-            "exploit_ratio": 0.5,  # More exploitation of good prompts
         },
         "database": {
             "log_prompts": True,  # Save prompts used to generate each program
@@ -979,11 +984,12 @@ def optimize_prompt(initial_prompt: str, dataset_name: str, dataset_split: str,
 ### Summary
 - **Dataset**: {dataset_name} ({dataset_split} split)
-- **Model**: {model}
 - **Initial Eval**: 50 samples
 - **Final Eval**: 50 samples (same samples for fair comparison)
-- **Evolution**: 50 samples per variant (SAME samples as initial/final for consistency!)
-- **Iterations**: 10
 ### Results
 - **Initial Accuracy**: {initial_eval['accuracy']:.2f}% ({initial_eval['correct']}/{initial_eval['total']})

 ```
 # Task
+Rewrite the prompt to MAXIMIZE accuracy on sentiment classification.
+CRITICAL REQUIREMENTS (these DIRECTLY affect score):
+1. ✓ MUST include word "sentiment" → model response will contain "sentiment" keyword
+2. ✓ MUST use pattern "[Action] sentiment: {{input}}" → triggers correct response format
+3. ✓ MUST be SHORT (under 35 chars) → prevents verbose/conversational responses
+4. ✓ MUST keep {{input}} placeholder EXACTLY as-is
+PROVEN WORKING PATTERNS (use these!):
+- "Classify sentiment: {{input}}" ← BEST (scores ~90%)
+- "Determine sentiment: {{input}}" ← Also works well (~85%)
+- "Sentiment of: {{input}}" ← Good (~80%)
+PATTERNS THAT FAIL (avoid!):
+- ❌ "What is the sentiment?" - question format, no {{input}}
+- ❌ "Review: {{input}}" - missing "sentiment" keyword
+- ❌ "Please analyze the sentiment..." - too long, word "please"
+Generate a SHORT, DIRECT prompt using the working pattern above.
+Output ONLY the new prompt between ```text markers:
 ```text
 Your improved prompt here
     config = {
         "llm": {
+            "primary_model": "meta-llama/llama-3.1-8b-instruct",  # Use STRONGER model for prompt generation
             "api_base": "https://openrouter.ai/api/v1",  # Use OpenRouter endpoint
+            "temperature": 1.2,  # Even higher temperature for more creative variations
         },
+        "max_iterations": 5,  # Fewer iterations (each is expensive)
         "checkpoint_interval": 1,  # Save checkpoints every iteration to preserve prompt history
         "diff_based_evolution": False,  # Use full rewrite mode for prompts (not diff/patch mode)
         "language": "text",  # CRITICAL: Optimize text/prompts, not Python code!
             "template_dir": templates_dir,  # Use our custom prompt engineering templates
         },
         "evolution": {
+            "population_size": 15,  # Larger population = more variants per generation
             "num_islands": 1,  # Single island for simpler evolution
+            "elite_ratio": 0.4,  # Keep top 40% (6 best prompts)
+            "explore_ratio": 0.1,  # Minimal random exploration (only 1-2 prompts)
+            "exploit_ratio": 0.5,  # 50% exploitation of best prompts
         },
         "database": {
             "log_prompts": True,  # Save prompts used to generate each program
 ### Summary
 - **Dataset**: {dataset_name} ({dataset_split} split)
+- **Evaluation Model**: {model}
+- **Evolution Model**: meta-llama/llama-3.1-8b-instruct (larger model for better prompt generation)
 - **Initial Eval**: 50 samples
 - **Final Eval**: 50 samples (same samples for fair comparison)
+- **Evolution**: 50 samples per variant (SAME samples as initial/final!)
+- **Iterations**: 5 (population: 15, elite: 40%, explore: 10%, exploit: 50%)
 ### Results
 - **Initial Accuracy**: {initial_eval['accuracy']:.2f}% ({initial_eval['correct']}/{initial_eval['total']})
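
Review note: two claims in this commit can be checked mechanically. The sketch below is not part of app.py. `check_prompt` and `slot_counts` are hypothetical helpers written for this note. The first covers only the three requirements of the new `# Task` template that are easy to test (the word "sentiment", the `{{input}}` placeholder, and the length limit). The second assumes each evolution ratio is simply multiplied by `population_size`, which the diff itself does not confirm.

```python
PLACEHOLDER = "{{input}}"
MAX_CHARS = 35  # the template asks for "under 35 chars"


def check_prompt(prompt: str) -> list[str]:
    """Return the violated requirements; an empty list means the prompt passes."""
    problems = []
    if "sentiment" not in prompt.lower():
        problems.append("missing the word 'sentiment'")
    if PLACEHOLDER not in prompt:
        problems.append("missing the {{input}} placeholder")
    if len(prompt) >= MAX_CHARS:
        problems.append(f"{len(prompt)} chars, limit is under {MAX_CHARS}")
    return problems


def slot_counts(population_size: int, ratios: dict[str, float]) -> dict[str, float]:
    """ASSUMPTION: each ratio is multiplied by the population size."""
    return {name: round(population_size * ratio, 2) for name, ratio in ratios.items()}


# One pattern the template lists as working, one it lists as failing.
for candidate in ["Classify sentiment: {{input}}", "Review: {{input}}"]:
    print(repr(candidate), check_prompt(candidate) or "ok")

# Old and new evolution settings taken from the diff.
old_slots = slot_counts(10, {"elite": 0.3, "explore": 0.2, "exploit": 0.5})
new_slots = slot_counts(15, {"elite": 0.4, "explore": 0.1, "exploit": 0.5})
print("old:", old_slots)
print("new:", new_slots)
```

Under that multiplication assumption, the new settings give 6 elite slots and 1.5 explore slots, which is consistent with the "6 best prompts" and "only 1-2 prompts" comments. The old settings give 3 elite slots, matching the removed comment. All three patterns the template lists as working fit within the 35-character limit.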