ComputerAgent Pro Vision API Documentation

Overview

ComputerAgent Pro Vision API provides advanced image analysis capabilities for extracting precise pixel coordinates of UI elements based on textual descriptions. The API supports two request formats:

Form-data format (multipart/form-data)
JSON format (OpenAI-compatible)

Authentication

All requests require API key authentication using the X-API-Key header.

X-API-Key: your-api-key

API Endpoints

Element Location (Form-Data Format)

Request

# Just coordinates
curl -X POST https://computeragent.pro/api/chat \
  -H "X-API-Key: your-api-key" \
  -F "prompt=Find the submit button" \
  -F "[email protected]"

# With annotated image
curl -X POST https://computeragent.pro/api/chat \
  -H "X-API-Key: your-api-key" \
  -F "prompt=Find the submit button" \
  -F "[email protected]" \
  -F "return_annotated_image=true"

Python Example (Form-Data)

from form_client import FormDataClient

# Initialize client
client = FormDataClient(api_key="your-api-key")

# Locate element (just coordinates)
result = client.locate_element(
    image_path="screenshot.png",
    prompt="Find the submit button"
)

# Locate element (with annotated image)
result_with_image = client.locate_element(
    image_path="screenshot.png",
    prompt="Find the submit button",
    return_annotated_image=True
)

Element Location (JSON Format)

Request

# Just coordinates
curl -X POST https://computeragent.pro/api/chat \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-api-key" \
  -d '{
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Find the submit button"},
        {"type": "image_url", "image_url": "data:image/png;base64,..."}
      ]
    }]
  }'

# Just coordinates with inline base64 image encoding
curl -X POST https://computeragent.pro/api/chat \
  -H "Content-Type: application/json" \
  -H "X-API-Key: computeragent_prod_key_2024" \
  -d "{
    \"model\": \"OS-Copilot/OS-Atlas-Base-7B\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {
          \"type\": \"text\",
          \"text\": \"Refresh Status\"
        },
        {
          \"type\": \"image_url\",
          \"image_url\": \"data:image/jpeg;base64,$(base64 -w 0 ./test_images/test_image_2.jpg)\"
        }
      ]
    }],
  }"

# If you are on macOS, the base64 command does not support the -w 0 option, so you can use this instead:
        {
          \"type\": \"image_url\",
          \"image_url\": \"data:image/jpeg;base64,$(base64 ./test_images/test_image_2.jpg | tr -d '\n')\"
        }

# With annotated image
curl -X POST https://computeragent.pro/api/chat \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-api-key" \
  -d '{
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Find the submit button"},
        {"type": "image_url", "image_url": "data:image/png;base64,..."}
      ]
    }],
    "return_annotated_image": true
  }'

Python Example (JSON)

from json_client import JSONClient

# Initialize client
client = JSONClient(api_key="your-api-key")

# Locate element (just coordinates)
result = client.locate_element(
    image_path="screenshot.png",
    prompt="Find the submit button"
)

# Locate element (with annotated image)
result_with_image = client.locate_element(
    image_path="screenshot.png",
    prompt="Find the submit button",
    return_annotated_image=True
)

Response Format

Success Response (Coordinates Only)

{
    "prediction": "[[x1, y1, x2, y2]]"
}

Success Response (With Annotated Image)

{
    "prediction": "[[x1, y1, x2, y2]]",
    "annotated_image": "base64_encoded_image_data"
}

Error Response

{
    "detail": "Error message"
}

Rate Limits and Constraints

Maximum image size: 10MB
Supported image formats: JPEG, PNG
Maximum prompt length: 500 characters

Best Practices

Prompt Engineering
- Be specific about the UI element you want to locate
- Use clear, concise language
- Example prompts:
  - "Find the login button"
  - "Locate the refresh status icon"
  - "Where is the submit button?"
Image Preparation
- Use clear, high-quality screenshots
- Ensure the target element is visible
- Optimize image size when possible
Error Handling
- Implement proper error handling in your code
- Validate input parameters before making requests

Integration Examples

UI Automation Framework

from json_client import JSONClient
import pyautogui

class UIAutomation:
    def __init__(self, api_key: str):
        self.vision_client = JSONClient(api_key)
        
    def find_and_click(self, screenshot_path: str, element_description: str):
        result = self.vision_client.locate_element(
            image_path=screenshot_path,
            prompt=element_description
        )
        coordinates = eval(result["prediction"])[0]
        center_x = (coordinates[0] + coordinates[2]) / 2
        center_y = (coordinates[1] + coordinates[3]) / 2
        pyautogui.click(center_x, center_y)

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
app		app
utils		utils
.gitattributes		.gitattributes
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
deploy-compose.sh		deploy-compose.sh
docker-compose.yml		docker-compose.yml
download_model.py		download_model.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

ComputerAgent Pro Vision API Documentation

Overview

Authentication

API Endpoints

Element Location (Form-Data Format)

Request

Python Example (Form-Data)

Element Location (JSON Format)

Request

Python Example (JSON)

Response Format

Success Response (Coordinates Only)

Success Response (With Annotated Image)

Error Response

Rate Limits and Constraints

Best Practices

Integration Examples

UI Automation Framework

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

License

3clyp50/computeragent-pro

Folders and files

Latest commit

History

Repository files navigation

ComputerAgent Pro Vision API Documentation

Overview

Authentication

API Endpoints

Element Location (Form-Data Format)

Request

Python Example (Form-Data)

Element Location (JSON Format)

Request

Python Example (JSON)

Response Format

Success Response (Coordinates Only)

Success Response (With Annotated Image)

Error Response

Rate Limits and Constraints

Best Practices

Integration Examples

UI Automation Framework

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages