feat: add support for passing url to guard endpoint by homanp · Pull Request #1075 · superagent-ai/superagent

homanp · 2025-11-22T22:08:13Z

Description

Add support for passing a url to the guard method.

Checklist

I tested my changes
I reviewed my own code

vercel · 2025-11-22T22:08:19Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
docs	Ready	Preview	Comment	Nov 22, 2025 10:13pm

claude · 2025-11-22T22:10:21Z

Code Review: Add URL Support to Guard Endpoint

Thank you for this contribution! This is a well-structured PR that adds URL support to the guard endpoint. Here's my detailed review:

✅ Strengths

Comprehensive Implementation: The PR adds URL support across all layers - API, Python SDK, TypeScript SDK, documentation, and tests.
Good Documentation: Excellent documentation updates with clear examples in both SDK docs and README files.
Test Coverage: Both Python and TypeScript SDKs include new test cases for the URL functionality.
Consistent API Design: The URL detection logic is consistent between Python and TypeScript implementations.

🐛 Potential Bugs & Issues

1. URL Validation Logic Has Edge Cases (sdk/python/src/superagent_ai/client.py:222-224, sdk/typescript/src/index.ts:304-305)

The current URL detection is too simplistic:

is_url = isinstance(input, str) and (
    input.startswith("http://") or input.startswith("https://")
)

Issues:

A string like "http://not-a-real-url" or "https://" would be detected as a URL
No validation that the URL is well-formed
Plain text submitted for analysis would be misrouted as a URL if it happens to start with http://

Recommendation:

# Python
from urllib.parse import urlparse

def is_valid_url(s: str) -> bool:
    try:
        result = urlparse(s)
        return all([result.scheme in ('http', 'https'), result.netloc])
    except ValueError:
        return False

is_url = isinstance(input, str) and is_valid_url(input)

// TypeScript
function isValidUrl(s: string): boolean {
  try {
    const url = new URL(s);
    return url.protocol === 'http:' || url.protocol === 'https:';
  } catch {
    return false;
  }
}

const isUrl = typeof input === 'string' && isValidUrl(input);

2. Conflicting Input Validation (sdk/python/src/superagent_ai/client.py:226-227)

if not is_file and not input:
    raise GuardError("input must be a non-empty string, file, or URL")

This check runs AFTER is_url is determined. An empty string can never satisfy is_url, so the check still rejects it, but validating emptiness only after type detection makes the logic flow harder to follow.

Recommendation:

# Check empty input first, before type determination
if not input:
    raise GuardError("input must be a non-empty string, file, or URL")

3. File Detection Logic is Fragile (sdk/python/src/superagent_ai/client.py:219)

is_file = hasattr(input, 'read') or not isinstance(input, str)

The or not isinstance(input, str) part means ANY non-string input (integers, booleans, None, etc.) would be treated as a file, which could cause confusing errors downstream.

Recommendation:

is_file = hasattr(input, 'read') or hasattr(input, 'file') or (
    not isinstance(input, str) and hasattr(input, '__fspath__')
)
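
The three points above (URL detection, empty-input checks, file detection) could be combined by classifying the input once, up front. The following is a minimal sketch of that idea, not the SDK's implementation: `classify_input` and its string labels are hypothetical names, and treating `os.PathLike` objects as files is an assumption.

```python
import os
from urllib.parse import urlparse


def classify_input(value: object) -> str:
    """Return "url", "file", or "text" (hypothetical helper, not part of the SDK)."""
    if isinstance(value, str):
        # Reject empty strings before any type-specific detection.
        if not value:
            raise ValueError("input must be a non-empty string, file, or URL")
        try:
            parsed = urlparse(value)
        except ValueError:  # e.g. malformed IPv6 brackets
            return "text"
        # A URL needs both an http(s) scheme and a host part.
        if parsed.scheme in ("http", "https") and parsed.netloc:
            return "url"
        return "text"
    # File-like objects expose read(); path objects implement os.PathLike.
    if hasattr(value, "read") or isinstance(value, os.PathLike):
        return "file"
    # Integers, booleans, None, etc. are rejected instead of being treated as files.
    raise TypeError(f"unsupported input type: {type(value).__name__}")
```

With this shape, `"https://"` is classified as text rather than as a URL, and an integer raises a `TypeError` immediately instead of failing later in the upload path.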

🔒 Security Concerns

1. SSRF Vulnerability Risk ⚠️ HIGH PRIORITY

The API will download PDFs from ANY URL provided by the user. This creates a Server-Side Request Forgery (SSRF) vulnerability where attackers could:

Access internal network resources (http://localhost:6379, http://169.254.169.254/latest/meta-data/)
Scan internal ports
Access cloud metadata endpoints
Bypass firewall restrictions

Recommendations:

URL validation on the backend - Block private IP ranges, localhost, link-local addresses
Implement allowlist/blocklist - Consider allowing only specific domains or blocking dangerous ones
Add timeout limits - Prevent hanging on slow/malicious endpoints
Limit file size - Prevent DoS via large file downloads
Follow redirects carefully - Limit redirect depth and validate redirect targets

Example validation (for backend implementation):

import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url: str) -> bool:
    try:
        parsed = urlparse(url)

        # Only allow http/https URLs that include a hostname
        if parsed.scheme not in ('http', 'https') or not parsed.hostname:
            return False

        # Resolve hostname to IP (IPv4 only; socket.getaddrinfo also covers IPv6)
        ip = socket.gethostbyname(parsed.hostname)
        ip_obj = ipaddress.ip_address(ip)

        # Block private/local addresses
        if ip_obj.is_private or ip_obj.is_loopback or ip_obj.is_link_local:
            return False

        return True
    except (ValueError, OSError):
        # ValueError: malformed URL or address; OSError: DNS resolution failed
        return False

2. Missing Content-Type Validation

There's no validation that the URL actually points to a PDF file. An attacker could provide URLs to:

Extremely large files (DoS)
Non-PDF files that might cause parsing errors
Malicious PDFs with exploits

Recommendation:

Check Content-Type header before downloading
Validate file size before full download (use HEAD request)
Verify PDF magic bytes after download
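
As a rough illustration of these three checks, the sketch below inspects response headers before downloading and the first bytes afterwards. The function names, the 10 MB limit, and the assumption that header keys are already lowercased are all hypothetical; the actual backend is not visible in this PR.

```python
from typing import Mapping

PDF_MAGIC = b"%PDF-"  # every PDF file starts with this signature
MAX_PDF_BYTES = 10 * 1024 * 1024  # assumed limit, not taken from the PR


def check_pdf_headers(headers: Mapping[str, str], max_bytes: int = MAX_PDF_BYTES) -> bool:
    """Pre-download check on HEAD response headers (keys assumed lowercase)."""
    # Drop parameters such as "; charset=binary" before comparing.
    content_type = headers.get("content-type", "").split(";")[0].strip().lower()
    if content_type != "application/pdf":
        return False
    length = headers.get("content-length")
    # A missing Content-Length passes here, so the download itself must still be capped.
    if length is not None and (not length.isdigit() or int(length) > max_bytes):
        return False
    return True


def has_pdf_magic(body: bytes) -> bool:
    """Post-download check: the body must begin with the PDF signature."""
    return body.startswith(PDF_MAGIC)
```

Headers are supplied by the remote server and can be wrong or misleading, so the magic-byte check after download is what actually confirms the payload looks like a PDF.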

⚡ Performance Considerations

1. Synchronous URL Downloads Block Request Processing

The API downloads the PDF synchronously during the request. For large files or slow networks, this could:

Block the API server thread/worker
Cause request timeouts
Create poor user experience

Recommendations:

Implement async processing: Return immediately with a job ID, process in background
Add streaming: Stream the download and processing
Set reasonable timeouts: Add explicit timeout for URL fetches (5-10 seconds)
Add file size limits: Reject downloads over a certain size (e.g., 10MB)
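
The size limit can be enforced while streaming, so an oversized body is rejected before it is fully buffered. This is a sketch under assumptions: `read_capped` and `DownloadTooLarge` are hypothetical names, and `chunks` stands for whatever chunk iterator the HTTP client in use provides.

```python
from typing import Iterable


class DownloadTooLarge(Exception):
    """Hypothetical error raised when a download exceeds the configured limit."""


def read_capped(chunks: Iterable[bytes], max_bytes: int) -> bytes:
    """Accumulate streamed chunks, aborting as soon as the limit would be exceeded."""
    buf = bytearray()
    for chunk in chunks:
        if len(buf) + len(chunk) > max_bytes:
            raise DownloadTooLarge(f"download exceeds {max_bytes} bytes")
        buf.extend(chunk)
    return bytes(buf)
```

Because the check runs per chunk, it also covers responses that omit or misreport Content-Length.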

2. No Caching Strategy

If the same URL is analyzed multiple times, it will be downloaded repeatedly.

Recommendation:

Consider caching PDF downloads with TTL (time-to-live)
Respect Cache-Control headers
Store hash of content to detect duplicates
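
A TTL cache keyed by URL could look roughly like the sketch below. `TTLCache` is a hypothetical in-memory helper, not something present in the PR; a real deployment would more likely use a shared store, and Cache-Control handling is left out.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory URL -> content cache with a fixed time-to-live (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, url: str) -> Optional[bytes]:
        entry = self._entries.get(url)
        if entry is None:
            return None
        expires_at, content = entry
        if self._clock() >= expires_at:
            # Expired entries are dropped on access.
            del self._entries[url]
            return None
        return content

    def put(self, url: str, content: bytes) -> str:
        """Store content and return its SHA-256 digest, usable for duplicate detection."""
        self._entries[url] = (self._clock() + self._ttl, content)
        return hashlib.sha256(content).hexdigest()
```

The returned digest lets the caller notice when two different URLs serve identical content and reuse the earlier analysis result.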

🧪 Test Coverage

Positive Aspects:

Both SDKs have new test cases for URL functionality
Tests use real URLs (arxiv.org PDF)

Gaps:

No error-case tests for:
- Invalid URLs
- Unreachable URLs (404, 500 errors)
- Non-PDF content
- URLs that time out
- URLs that redirect
- Very large files
No mocked tests: the tests use real external URLs, which:
- Makes tests flaky (dependent on an external service)
- Slows test execution
- Can fail if the URL becomes unavailable

Recommendations:

# Python test additions
@pytest.mark.asyncio
async def test_guard_rejects_invalid_url():
    client = create_client(api_base_url=API_BASE_URL, api_key=API_KEY)
    
    with pytest.raises(GuardError):
        await client.guard("not-a-url")

@pytest.mark.asyncio
async def test_guard_handles_unreachable_url():
    client = create_client(api_base_url=API_BASE_URL, api_key=API_KEY)
    
    # Use a URL that will definitely 404
    result = await client.guard("https://example.com/nonexistent.pdf")
    # Assert appropriate error handling

# Add mocked tests
@pytest.mark.asyncio
async def test_guard_url_with_mock(httpx_mock):
    httpx_mock.add_response(
        url="https://test.com/doc.pdf",
        content=b"%PDF-1.4...",  # Mock PDF content
    )
    # Test with mocked response

📝 Minor Issues

1. OpenAPI Schema Inconsistency (docs/openapi.json:431-449)

The schema marks none of the properties as required, but the description says "At least one of text, file, or url must be provided." This requirement isn't enforced at the schema level.

Recommendation:
Use anyOf (or oneOf, if the inputs are meant to be mutually exclusive) or add a custom validation description to make the requirement clearer.

2. Error Message Could Be More Specific (sdk/typescript/src/index.ts:308)

throw new GuardError("input must be a non-empty string, file, or URL.");

Consider making the error message indicate what was actually received:

throw new GuardError(`input must be a non-empty string, file, or URL. Received: ${typeof input}`);

3. Changelog Version Bump Premature

The CHANGELOG shows version 0.0.18, but this should typically be updated as part of the release process, not in the feature PR itself. Consider removing version bumps from the PR and handling them separately during release.

🎯 Recommendations Summary

High Priority:

1. ⚠️ Address SSRF vulnerability - Add URL validation on backend to prevent internal network access
2. Add proper URL validation instead of simple startsWith check
3. Implement timeouts and file size limits for URL downloads

Medium Priority:
4. Improve error handling for invalid/unreachable URLs
5. Add comprehensive test coverage for error cases
6. Consider async processing for large file downloads

Low Priority:
7. Add response caching for frequently accessed URLs
8. Use mocked tests to avoid external dependencies
9. Improve error messages with more context

✨ Overall Assessment

This is a solid feature addition with good documentation and basic test coverage. The main concerns are around security (SSRF) and robustness (URL validation, error handling). With the suggested improvements, this will be a strong addition to the codebase.

Great work on maintaining consistency across multiple SDKs and keeping documentation in sync!

claude · 2025-11-22T22:15:41Z

Pull Request Review: Add URL Support to Guard Endpoint

Overview

This PR adds support for passing URLs (particularly PDF URLs) to the guard endpoint. The implementation is well-structured and consistent across SDKs, with good test coverage and documentation updates.

✅ Strengths

Code Quality & Consistency

Excellent consistency: Both TypeScript and Python SDKs follow identical patterns for URL detection and handling
Clean implementation: URL detection using startsWith("http://") or startsWith("https://") is simple and effective
Good separation of concerns: Three distinct code paths for text, file, and URL inputs
Proper error handling: Maintains existing error handling patterns

Documentation & Testing

Comprehensive documentation updates: All SDK docs, API docs, and OpenAPI spec updated
Good test coverage: Both SDKs include tests for URL functionality using a real PDF URL (arxiv.org)
Helpful examples: CLI and SDK documentation include clear usage examples
Updated CHANGELOG: Properly documented the new feature

SDK Implementation

Both SDKs properly handle the new URL input:

TypeScript (sdk/typescript/src/index.ts:305-306)
Python (sdk/python/src/superagent_ai/client.py:221-224)

The URL requests correctly use JSON format with {"url": input} and appropriate headers.

🔍 Areas for Improvement

1. Security Concerns (HIGH PRIORITY)

Server-Side Request Forgery (SSRF) Risk

Location: Backend API implementation (not visible in PR diff)

Issue: The API will download content from user-provided URLs, which creates potential SSRF vulnerabilities.

Recommendations:

Implement URL validation and sanitization
Whitelist allowed protocols (only https://, not http://)
Blacklist private IP ranges (RFC 1918: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
Blacklist localhost (127.0.0.0/8, ::1)
Blacklist link-local addresses (169.254.0.0/16)
Blacklist cloud metadata endpoints (169.254.169.254)
Implement timeout and size limits for downloads
Consider using a dedicated service with network isolation for URL fetching
Add rate limiting per user/API key for URL requests

Example validation code:

from urllib.parse import urlparse
import ipaddress

def is_safe_url(url: str) -> bool:
    parsed = urlparse(url)
    
    # Only allow HTTPS
    if parsed.scheme != 'https':
        return False
    
    # Check the host directly when it is an IP literal (no DNS lookup happens here)
    try:
        ip = ipaddress.ip_address(parsed.hostname)
        # Block private/internal IPs
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            return False
    except ValueError:
        pass  # not an IP literal; resolve the hostname and check the result separately
    
    return True
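
The example above only inspects the host when it is an IP literal. A hostname also needs to be resolved, and every address it maps to checked. The sketch below shows one way to do that; `resolve_all` and `is_public_host` are hypothetical names, and the resolver is injectable purely so the check can be exercised without network access.

```python
import ipaddress
import socket
from typing import Callable, Iterable


def resolve_all(hostname: str) -> Iterable[str]:
    """Resolve a hostname to every IPv4/IPv6 address it maps to."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def is_public_host(
    hostname: str,
    resolver: Callable[[str], Iterable[str]] = resolve_all,
) -> bool:
    """Return True only if every resolved address is globally routable."""
    try:
        addresses = list(resolver(hostname))
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
    if not addresses:
        return False
    # is_global is False for private, loopback, link-local, and other reserved ranges.
    # split("%") drops an IPv6 zone identifier such as "%eth0" if one is present.
    return all(ipaddress.ip_address(addr.split("%")[0]).is_global for addr in addresses)
```

Validation and fetching should use the same resolved address. If the fetch performs its own DNS lookup afterwards, a hostname that changes its answer between the two lookups (DNS rebinding) can still reach an internal address.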

Open Redirect / Unvalidated Redirects

Issue: If the backend follows redirects, malicious users could exploit this.

Recommendations:

Disable automatic redirect following OR
Validate redirect destinations using the same SSRF checks
Log all redirects for security monitoring
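
Validating each redirect hop could be sketched as follows. Everything here is an assumption for illustration: `resolve_redirects`, `UnsafeRedirect`, the limit of 3 hops, and the `fetch` callable, which is expected to issue a request without following redirects and return the status code and the Location header.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

MAX_REDIRECTS = 3  # assumed limit
REDIRECT_STATUSES = (301, 302, 303, 307, 308)


class UnsafeRedirect(Exception):
    """Hypothetical error for blocked or excessive redirects."""


def resolve_redirects(
    url: str,
    fetch: Callable[[str], Tuple[int, Optional[str]]],
    is_safe: Callable[[str], bool],
    max_redirects: int = MAX_REDIRECTS,
) -> str:
    """Follow redirects manually, applying the same safety check to every hop."""
    for _ in range(max_redirects + 1):
        if not is_safe(url):
            raise UnsafeRedirect(f"blocked URL: {url}")
        status, location = fetch(url)
        if status not in REDIRECT_STATUSES or location is None:
            return url  # final destination reached
        url = urljoin(url, location)  # Location may be relative
    raise UnsafeRedirect(f"more than {max_redirects} redirects")
```

The `is_safe` argument is where the SSRF checks from the previous section would plug in, so a public URL cannot bounce the backend to an internal address.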

2. Performance Considerations

No Timeout Documentation

Locations:

sdk/typescript/src/index.ts:339-350
sdk/python/src/superagent_ai/client.py:247-257

Issue: URL downloads could take significantly longer than text/file analysis, but timeout behavior isn't documented.

Recommendations:

Document expected timeout behavior for URL requests in SDK docs
Consider implementing URL-specific longer timeouts
Add retry logic with exponential backoff for transient network failures
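
Retry with exponential backoff could be sketched as below. The helper names, the default delays, and the choice of `ConnectionError` and `TimeoutError` as the "transient" failures are assumptions; an actual SDK would catch whatever exceptions its HTTP client raises.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Upper bound on the delay before retry `attempt` (0-based): base * 2**attempt, capped."""
    return min(cap, base * (2 ** attempt))


async def retry_with_backoff(
    call: Callable[[], Awaitable[T]],
    retries: int = 3,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Run `call`, retrying transient failures up to `retries` times."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except (ConnectionError, TimeoutError):
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Full jitter spreads out retries from concurrent clients.
            await sleep(random.uniform(0, backoff_delay(attempt)))
    raise AssertionError("unreachable")
```

Only failures that are plausibly transient are retried; validation errors or a blocked URL should fail immediately rather than be retried.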

Resource Consumption

Issue: Large PDF files could consume significant memory/bandwidth.

Recommendations:

Implement and document file size limits (e.g., max 10MB)
Stream large files instead of loading entirely into memory
Return clear error messages when limits are exceeded

3. Error Handling & User Experience

Limited URL Validation in Client

Locations:

sdk/typescript/src/index.ts:305-306
sdk/python/src/superagent_ai/client.py:221-224

Issue: Client only checks if URL starts with http:// or https://, but doesn't validate URL format.

Recommendations:

// Add URL validation
const isUrl = typeof input === 'string' && 
  (input.startsWith("http://") || input.startsWith("https://"));

if (isUrl) {
  try {
    new URL(input); // Validates URL format
  } catch {
    throw new GuardError("Invalid URL format provided.");
  }
}

Missing Error Context

Issue: When URL downloads fail, users may not get clear error messages.

Recommendations:

Add specific error types for URL-related failures (e.g., URLDownloadError, URLTimeoutError)
Include more context in error messages (e.g., "Failed to download PDF from URL: Connection timeout")
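
In the Python SDK, the suggested error types might be layered on top of the existing `GuardError` roughly like this. The sketch defines a local stand-in for `GuardError` so it is self-contained; `URLDownloadError` and `URLTimeoutError` are the names proposed above and do not exist in the SDK as far as this PR shows.

```python
class GuardError(Exception):
    """Stand-in for the SDK's GuardError so the sketch is self-contained."""


class URLDownloadError(GuardError):
    """Proposed error for any failure while fetching a URL."""

    def __init__(self, url: str, reason: str) -> None:
        super().__init__(f"Failed to download PDF from URL {url}: {reason}")
        self.url = url
        self.reason = reason


class URLTimeoutError(URLDownloadError):
    """Proposed error for a fetch that exceeded its time limit."""

    def __init__(self, url: str, timeout_seconds: float) -> None:
        super().__init__(url, f"Connection timeout after {timeout_seconds:g}s")
        self.timeout_seconds = timeout_seconds
```

Subclassing `GuardError` keeps existing `except GuardError` handlers working while letting callers that care distinguish URL failures.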

4. Testing Improvements

Test Reliability

Locations:

sdk/typescript/tests/guard.test.ts:47-68
sdk/python/tests/test_guard.py:64-83

Issue: Tests depend on external arxiv.org URL being available and accessible.

Recommendations:

Add tests with mocked responses for more reliable CI/CD
Test error cases: invalid URLs, unreachable URLs, non-PDF content, oversized files
Consider using a dedicated test file hosted on a controlled server
Add integration test flag to skip URL tests when running offline

Example test cases to add:

# Test invalid URL
with pytest.raises(GuardError):
    await client.guard("not-a-valid-url")

# Test non-HTTPS URL (if blocked)
with pytest.raises(GuardError):
    await client.guard("http://example.com/file.pdf")

# Test unreachable URL
with pytest.raises(GuardError):
    await client.guard("https://this-does-not-exist-12345.com/file.pdf")

5. Documentation Enhancements

Missing Information

No documentation about supported file types (only PDF? what about other formats?)
No information about maximum file size limits
No guidance on URL authentication (what if PDF requires authentication?)
Missing information about redirect handling

Recommendations:
Add to SDK documentation:

### URL Requirements
- Must use HTTPS protocol (HTTP is not supported for security)
- Maximum file size: 10MB
- Supported formats: PDF only
- Authentication: Public URLs only (no authentication headers)
- Redirects: Up to 3 redirects followed
- Timeout: 30 seconds for download + analysis time

6. API Design Considerations

Mutual Exclusivity Not Enforced in Schema

Location: docs/openapi.json:428-451

Issue: OpenAPI schema says "Only one input method should be provided" but doesn't enforce it.

Recommendations:

Use OpenAPI oneOf to enforce mutual exclusivity
Backend should validate and return clear error if multiple inputs provided
Document precedence if multiple inputs are provided (currently unclear)

{
  "oneOf": [
    {
      "required": ["text"],
      "properties": { "text": {...} }
    },
    {
      "required": ["file"],
      "properties": { "file": {...} }
    },
    {
      "required": ["url"],
      "properties": { "url": {...} }
    }
  ]
}
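
Schema-level oneOf does not replace a server-side check. A minimal sketch of that validation follows; `select_input` is a hypothetical helper, and treating `None` and empty strings as "not provided" is an assumption about the request payload.

```python
from typing import Any, Mapping

INPUT_FIELDS = ("text", "file", "url")


def select_input(payload: Mapping[str, Any]) -> str:
    """Return the name of the single provided input field, or raise ValueError."""
    # None and empty strings are treated as "not provided".
    provided = [name for name in INPUT_FIELDS if payload.get(name) not in (None, "")]
    if len(provided) != 1:
        found = ", ".join(provided) if provided else "none"
        raise ValueError(
            f"exactly one of text, file, or url must be provided (got: {found})"
        )
    return provided[0]
```

Raising on multiple inputs avoids having to define a precedence order at all, and the error message tells the caller which fields were seen.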

📊 Test Coverage Assessment

Good coverage for:

Happy path: URL acceptance and processing ✅
Basic callback functionality ✅
Response structure validation ✅

Missing coverage for:

Invalid URL formats ❌
Network failures (timeout, DNS failure, connection refused) ❌
Non-PDF content at URL ❌
Oversized files ❌
HTTP (non-HTTPS) URLs ❌
URLs with authentication requirements ❌
Malformed responses from backend ❌

🏗️ Architecture Questions

Backend Implementation: This PR doesn't show the backend API changes. Has the backend been updated to:
- Validate URLs for SSRF?
- Implement size limits?
- Handle network failures gracefully?
- Return appropriate error codes?
Rate Limiting: Are URL requests rate-limited separately from text/file requests (since they consume more resources)?
Monitoring: Are URL requests logged/monitored for security analysis?

🎯 Recommendations Priority

Must Fix (Security):

1. ⚠️ Implement SSRF protection in backend
2. ⚠️ Validate URLs on server-side (whitelist HTTPS, blacklist private IPs)
3. ⚠️ Add file size limits

Should Fix (Reliability):
4. Add comprehensive error handling for network failures
5. Improve test coverage for error cases
6. Document timeout behavior and limits

Nice to Have (Polish):
7. Enforce mutual exclusivity in OpenAPI schema
8. Add client-side URL validation
9. Create dedicated error types for URL-related failures

✨ Overall Assessment

This is a well-implemented feature with good consistency across SDKs and comprehensive documentation. The main concerns are around security (SSRF risks) and test coverage for failure scenarios.

Recommendation: Approve with required changes for security hardening before merging to production.

Checklist for Merge:

SSRF protection implemented in backend
File size limits enforced
Timeout behavior documented
Error handling improved
Additional test cases added for failure scenarios
Security review completed

Great work on the implementation! The code is clean and well-structured. Address the security concerns and this will be ready to ship. 🚀

Review generated with assistance from Claude Code

add support for passing url to guard endpoint

b3f348c

homanp self-assigned this Nov 22, 2025

vercel bot deployed to Preview November 22, 2025 22:08 View deployment

bump cli and mcp

3931333

vercel bot deployed to Preview November 22, 2025 22:13 View deployment

homanp merged commit 12faa6d into main Nov 24, 2025
5 checks passed

homanp merged 2 commits into main from feat/add-file-url-support
