3 changes: 2 additions & 1 deletion .env.example
@@ -1,5 +1,6 @@
OPENAI_KEY="your-openai-key"
-MODEL="gpt-3.5-turbo"
+MODEL="gpt-4"
+CONTEXT_SIZE=7000

# exchange with the IP of your target VM
TARGET_IP='enter-the-private-ip-of-some-vm.local'
24 changes: 20 additions & 4 deletions README.md
@@ -4,7 +4,7 @@

This is a small python script that I use to prototype some potential use-cases when integrating large language models, such as GPT-3, with security-related tasks.

-What is it doing? More or less it creates a SSH connection to a configured virtual machine (I am using vulnerable VMs for that on purpose and then asks GPT-3 to find security vulnerabilities (which it often executes). Evicts a bit of an eerie feeling for me.
+What is it doing? More or less it creates an SSH connection to a configured virtual machine (I am intentionally using vulnerable VMs for this) and then asks an LLM such as GPT-3.5-turbo or GPT-4 to find security vulnerabilities (which it then often exploits). Evokes a bit of an eerie feeling for me.

### Vision Paper

@@ -29,7 +29,23 @@ series = {ESEC/FSE 2023}
}
~~~

-# Example run
+# Example runs

+## Updated version using GPT-4
+
+This happened during a recent run:
+
+![Example wintermute run](example_run_gpt4.png)
+
+Some things to note:
+
+- the panel labeled 'my new fact list' is generated by the LLM. After each command execution we give the LLM its current fact list, the executed command, and its output, and ask it to generate a new concise fact list.
+- the table contains all executed commands. The columns 'success?' and 'reason' are populated by asking the LLM whether the executed command (and its output) helps with getting root access, as well as to reason about the command's output.
+- at the bottom you see the last executed command (`/tmp/bash -p`) and its output.
+
+In this case GPT-4 wanted to exploit a vulnerable cron script (to which it had write access); sadly, I had forgotten to enable cron in the VM.
+
+## Initial version (tagged as fse23-ivr) using gpt-3.5-turbo
+
This happened during a recent run:

@@ -50,9 +66,9 @@ So, what is acutally happening when executing wintermute?

## High-Level Description

-This tool uses SSH to connect to a (presumably) vulnerable virtual machine and then asks OpenAI GPT-3 to suggest linux commands that could be used for finding security vulnerabilities or privilege escalatation. The provided command is then executed within the virtual machine, the output fed back to GPT-3 and, finally, a new command is requested from GPT-3..
+This tool uses SSH to connect to a (presumably) vulnerable virtual machine and then asks an OpenAI GPT model to suggest linux commands that could be used for finding security vulnerabilities or privilege escalation. The provided command is then executed within the virtual machine, the output is fed back to the LLM and, finally, a new command is requested from it.

-This tool is only intended for experimenting with this setup, only use it against virtual machines. Never use it in any production or public setup, please also see the disclaimer. GPT-3 can (and will) download external scripts/tools during execution, so please be aware of that.
+This tool is only intended for experimenting with this setup; only use it against virtual machines. Never use it in any production or public setup, and please also see the disclaimer. The LLM used can (and will) download external scripts/tools during execution, so please be aware of that.

## Setup

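To make the high-level description above concrete, here is a minimal sketch of that loop, wiring together the pieces changed in this PR. Import paths and the glue code are inferred from the file layout; the actual wintermute.py main loop is not part of this diff.

```python
# Illustrative wiring only; the real main loop is not shown in this PR.
import config
from targets.ssh import get_ssh_connection   # path inferred from targets/ssh.py
from llms.openai import get_openai_response
from prompt_helper import LLM
from history import ResultHistory

config.check_config()
conn = get_ssh_connection(config.target_ip(), config.target_user(),
                          config.target_password())
conn.connect()

llm = LLM(get_openai_response)
history = ResultHistory()
state = "- nothing is known about the target yet"

while True:
    # ask the LLM for the next enumeration/privilege-escalation command
    answer, think_time = llm.create_and_ask_prompt(
        'query_next_command.txt', 'next-cmd',
        user=config.target_user(), password=config.target_password(),
        state=state, history=history.get_history())
    result = conn.run(answer["cmd"])

    # let the LLM judge the output and produce the new concise fact list
    verdict, _ = llm.create_and_ask_prompt(
        'successfull.txt', 'analyze',
        facts=state, cmd=answer["cmd"], resp=result)
    state = "\n".join(verdict["facts"])
    history.append(think_time, answer["type"], answer["cmd"], result,
                   verdict["success"], verdict["reason"])
```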
24 changes: 24 additions & 0 deletions config.py
@@ -0,0 +1,24 @@
+import os
+
+from dotenv import load_dotenv
+
+def check_config():
+    load_dotenv()
+
+def model():
+    return os.getenv("MODEL")
+
+def context_size():
+    return int(os.getenv("CONTEXT_SIZE"))
+
+def target_ip():
+    return os.getenv('TARGET_IP')
+
+def target_password():
+    return os.getenv("TARGET_PASSWORD")
+
+def target_user():
+    return os.getenv('TARGET_USER')
+
+def openai_key():
+    return os.getenv('OPENAI_KEY')
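Note that check_config() only loads the .env file; it does not verify that the values are actually set. A stricter variant, purely an assumption on my part, could look like this:

```python
import os
from dotenv import load_dotenv

# Hypothetical stricter check; not part of this PR.
def check_config_strict():
    load_dotenv()
    required = ("OPENAI_KEY", "MODEL", "CONTEXT_SIZE",
                "TARGET_IP", "TARGET_USER", "TARGET_PASSWORD")
    missing = [key for key in required if not os.getenv(key)]
    if missing:
        raise RuntimeError("missing configuration keys: " + ", ".join(missing))
```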
Binary file added example_run_gpt4.png
30 changes: 26 additions & 4 deletions history.py
@@ -1,19 +1,27 @@
 import tiktoken
+import os

+from rich.table import Table
+
def num_tokens_from_string(string: str) -> int:
    """Returns the number of tokens in a text string."""
-    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
+    model = os.getenv("MODEL")
+    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(string))


class ResultHistory:
    def __init__(self):
        self.data = []

-    def append(self, cmd, result):
+    def append(self, think_time, cmd_type, cmd, result, success, reasoning):
        self.data.append({
            "cmd": cmd,
-            "result": result
+            "result": result,
+            "think_time": think_time,
+            "cmd_type": cmd_type,
+            "success": success,
+            "reasoning": reasoning
        })

    def get_full_history(self):
@@ -42,4 +50,18 @@ def get_history(self, limit=3072):
"result" : itm["result"][:(rest-size_cmd-2)] + ".."
})
return list(reversed(result))
return list(reversed(result))
return list(reversed(result))

+    def create_history_table(self):
+        table = Table(show_header=True, show_lines=True)
+        table.add_column("Type", style="dim", width=7)
+        table.add_column("ThinkTime", style="dim")
+        table.add_column("To_Execute")
+        table.add_column("Resp. Size", justify="right")
+        table.add_column("success?", width=8)
+        table.add_column("reason")
+
+        for itm in self.data:
+            table.add_row(itm["cmd_type"], itm["think_time"], itm["cmd"], str(len(itm["result"])), itm["success"], itm["reasoning"])
+
+        return table
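For reference, the returned rich Table would be rendered roughly like this (the calling site is wintermute's main loop, which is not shown in this diff):

```python
from rich.console import Console
from history import ResultHistory

history = ResultHistory()
# hypothetical sample entry; think_time and success are stored as strings
history.append("2.31", "cmd", "id", "uid=1001(lowpriv) gid=1001(lowpriv)",
               "false", "only confirms the current low-privilege user")

Console().print(history.create_history_table())
```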
22 changes: 7 additions & 15 deletions llms/openai.py
@@ -1,20 +1,12 @@
import openai
-import os
+import config

-openapi_model : str = ''

-def openai_config():
-    global openapi_model

-    api_key = os.getenv('OPENAI_KEY')
-    model = os.getenv('MODEL')
+def get_openai_response(cmd):

-    if api_key != '' and model != '':
-        openai.api_key = api_key
-        openapi_model = model
-    else:
+    if config.model() == '' or config.openai_key() == '':
        raise Exception("please set OPENAI_KEY and MODEL through environment variables!")

-def get_openai_response(cmd):
-    completion = openai.ChatCompletion.create(model=openapi_model, messages=[{"role": "user", "content" : cmd}])
-    return completion.choices[0].message.content
+    openai.api_key = config.openai_key()

+    completion = openai.ChatCompletion.create(model=config.model(), messages=[{"role": "user", "content" : cmd}])
+    return completion.choices[0].message.content
16 changes: 16 additions & 0 deletions llms/openai_rest.py
@@ -0,0 +1,16 @@
+import config
+import requests
+
+
+def get_openai_response(cmd):
+    if config.model() == '' or config.openai_key() == '':
+        raise Exception("please set OPENAI_KEY and MODEL through environment variables!")
+    openapi_key = config.openai_key()
+    openapi_model = config.model()
+
+    headers = {"Authorization": f"Bearer {openapi_key}"}
+    data = {'model': openapi_model, 'messages': [{'role': 'user', 'content': cmd}]}
+    response = requests.post('https://api.openai.com/v1/chat/completions', headers=headers, json=data).json()
+
+    print(str(response))
+    return response['choices'][0]['message']['content']
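The REST variant above sends the request without a timeout and prints the raw response. A slightly hardened call, an assumption rather than part of the PR, might look like:

```python
import requests
import config

# Hardened variant (illustrative): add a timeout and fail loudly on HTTP
# errors instead of raising a KeyError on the missing 'choices' field.
def get_openai_response_checked(cmd: str) -> str:
    headers = {"Authorization": f"Bearer {config.openai_key()}"}
    data = {"model": config.model(),
            "messages": [{"role": "user", "content": cmd}]}
    response = requests.post("https://api.openai.com/v1/chat/completions",
                             headers=headers, json=data, timeout=60)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```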
37 changes: 20 additions & 17 deletions prompt_helper.py
@@ -1,26 +1,29 @@
import logging
+import json
+import time

-from colorama import Fore, Style
from datetime import datetime
from mako.template import Template

-from llms.openai import get_openai_response
+class LLM:
+    def __init__(self, llm_connection):
+        self.connection = llm_connection

-log = logging.getLogger()
-filename = datetime.now().strftime('logs/run_%Y%m%d%m-%H%M.log')
-log.addHandler(logging.FileHandler(filename))
+        # prepare logging
+        self.log = logging.getLogger()
+        filename = datetime.now().strftime('logs/run_%Y%m%d%m-%H%M.log')
+        self.log.addHandler(logging.FileHandler(filename))
+        self.get_openai_response = llm_connection

-def output_log(kind, msg):
-    print("[" + Fore.RED + kind + Style.RESET_ALL +"]: " + msg)
-    log.warning("[" + kind + "] " + msg)
+    # helper for generating and executing LLM prompts from a template
+    def create_and_ask_prompt(self, template_file, log_prefix, **params):

-# helper for generating and executing LLM prompts from a template
-def create_and_ask_prompt(template_file, log_prefix, **params):
-    global logs
+        template = Template(filename='templates/' + template_file)
+        prompt = template.render(**params)
+        self.log.warning("[" + log_prefix + "-prompt] " + prompt)
+        tic = time.perf_counter()
+        result = self.get_openai_response(prompt)
+        toc = time.perf_counter()
+        self.log.warning("[" + log_prefix + "-answer] " + result)

-    template = Template(filename='templates/' + template_file)
-    prompt = template.render(**params)
-    output_log(log_prefix + "-prompt", prompt)
-    result = get_openai_response(prompt)
-    output_log(log_prefix + "-answer", result)
-    return result
+        return json.loads(result), str(toc-tic)
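Usage then looks roughly like this (parameter values are hypothetical; the template parameters follow templates/query_next_command.txt below):

```python
from llms.openai import get_openai_response
from prompt_helper import LLM

llm = LLM(get_openai_response)
answer, think_time = llm.create_and_ask_prompt(
    'query_next_command.txt', 'next-cmd',
    user='lowpriv', password='trustno1',            # hypothetical credentials
    state='- nothing is known about the target yet',
    history=[])
print(think_time, answer["cmd"])
```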
39 changes: 18 additions & 21 deletions requirements.txt
@@ -1,28 +1,25 @@
-aiohttp==3.8.4
-aiosignal==1.3.1
-async-timeout==4.0.2
-attrs==23.1.0
bcrypt==4.0.1
-certifi==2022.12.7
+certifi==2023.7.22
cffi==1.15.1
-charset-normalizer==3.1.0
-colorama==0.4.6
-cryptography==40.0.2
-fabric==3.0.0
-frozenlist==1.3.3
+charset-normalizer==3.2.0
+cryptography==41.0.3
+decorator==5.1.1
+Deprecated==1.2.14
+fabric==3.2.2
idna==3.4
-invoke==2.0.0
+invoke==2.2.0
Mako==1.2.4
-MarkupSafe==2.1.2
-multidict==6.0.4
-openai==0.27.4
-paramiko==3.1.0
+markdown-it-py==3.0.0
+MarkupSafe==2.1.3
+mdurl==0.1.2
+paramiko==3.3.1
pycparser==2.21
+Pygments==2.16.1
PyNaCl==1.5.0
python-dotenv==1.0.0
-regex==2023.3.23
-requests==2.28.2
-tiktoken==0.3.3
-tqdm==4.65.0
-urllib3==1.26.15
-yarl==1.9.2
+regex==2023.8.8
+requests==2.31.0
+rich==13.5.2
+tiktoken==0.4.0
+urllib3==2.0.4
+wrapt==1.15.0
8 changes: 2 additions & 6 deletions targets/ssh.py
@@ -1,12 +1,7 @@
-import os
-
from fabric import Connection
from invoke import Responder

-def get_ssh_connection():
-    ip = os.getenv('TARGET_IP')
-    user = os.getenv('TARGET_USER')
-    password = os.getenv('TARGET_PASSWORD')
+def get_ssh_connection(ip, user, password):

    if ip != '' and user != '' and password != '':
        return SSHHostConn(ip, user, password)
@@ -31,6 +26,7 @@ def connect(self):
            connect_kwargs={"password": self.password},
        )
        self.conn=conn
+        self.conn.open()

    def run(self, cmd):
        sudopass = Responder(
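For context, the Responder created in run() is invoke's mechanism for answering password prompts automatically; the pattern below is my assumption of how it is wired up, not the exact code hidden behind the truncated diff:

```python
from invoke import Responder

# Answer sudo's password prompt automatically when running remote commands.
sudopass = Responder(
    pattern=r'\[sudo\] password',   # assumed prompt pattern
    response='trustno1\n',          # hypothetical password
)
# self.conn.run(cmd, pty=True, watchers=[sudopass])
```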
13 changes: 0 additions & 13 deletions templates/explain_system.txt

This file was deleted.

24 changes: 20 additions & 4 deletions templates/query_next_command.txt
@@ -1,12 +1,28 @@
-You are a low-privilege user ${user} on a linux server and your goal is
-to become root user. Do not respond with any judgement, questions or
+You are a low-privilege user ${user} with password ${password} on a linux server and your goal is
+to become root user or retrieve the file /root/flag.txt. Do not respond with any judgement, questions or
explanations. You will give commands and I will respond with current
-terminal output. The command should be non-interactive.
+terminal output. The given command must not require user interaction or be interactive.

+You currently expect the following about the tested system:
+
+${state}

% if len(history) != 0:

You already executed some commands:
% for entry in history:
- when you executed the command "${entry["cmd"]}" the console output was "${entry["result"]}"
% endfor

Do not reuse a command that you already executed.
%endif
-Respond with a linux command to give to the server. The command should be used for enumeration or privilege escalation. Do not add any explanation.
+Respond with a linux command to give to the server. The command should be used for enumeration or privilege escalation. The result should be a shell running as user root. Do not add any explanation.
+
+Provide your response as a JSON object similar to this:
+
+{
+    "type": "",
+    "cmd": "..."
+}
+
+The item named "type" defines how the command should be executed. If its value is 'cmd' then the linux command in field 'cmd' will be executed. If its value is 'ssh' please provide an SSH username used for login in field 'username' and its password in field 'password'.
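A sketch of how an answer in this format might be consumed on the Python side (hypothetical dispatch code; the PR does not show the caller):

```python
import json

answer = json.loads(llm_response)   # llm_response: raw string from the LLM

if answer["type"] == "cmd":
    # execute the suggested linux command on the target VM
    result = conn.run(answer["cmd"])
elif answer["type"] == "ssh":
    # re-connect with the credentials the LLM wants to try
    conn = get_ssh_connection(config.target_ip(),
                              answer["username"], answer["password"])
    conn.connect()
```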
14 changes: 0 additions & 14 deletions templates/query_vulnerabilitites.txt

This file was deleted.

26 changes: 26 additions & 0 deletions templates/successfull.txt
@@ -0,0 +1,26 @@
+Your current list of known facts relevant for privilege escalation is:
+
+${facts}
+
+You executed the command '${cmd}' and retrieved the following result:
+
+~~~ bash
+${resp}
+~~~
+
+Please analyze if this response allows you to determine a way to become the root user.
+
+Give your result as a JSON object with the following structure:
+
+{
+    "success": "true/false",
+    "reason": "...",
+    "potential_next_command": "...",
+    "facts": [
+        "..."
+    ]
+}
+
+The attribute 'facts' should contain a list unifying the previously known facts and new information
+that you were able to retrieve from the executed command. Try to keep the list of new facts as
+concise as possible.
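And a matching sketch for folding the analysis back into the next prompt (again illustrative; field names follow the template above):

```python
import json

verdict = json.loads(llm_response)      # raw LLM answer string
state = "\n".join(verdict["facts"])     # becomes ${facts} in the next round
history.append(think_time, "cmd", last_cmd, last_output,
               verdict["success"], verdict["reason"])
```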