
feat: add support for passing url to guard endpoint #1075

Merged
homanp merged 2 commits into main from feat/add-file-url-support
Nov 24, 2025

@homanp
Collaborator

@homanp homanp commented Nov 22, 2025

Description

Add support for passing a url to the guard method.

Checklist

  • I tested my changes
  • I reviewed my own code

@homanp homanp self-assigned this Nov 22, 2025
@vercel

vercel bot commented Nov 22, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Preview | Comments | Updated (UTC)
docs | Ready | Preview | Comment | Nov 22, 2025 10:13pm

@claude

claude bot commented Nov 22, 2025

Code Review: Add URL Support to Guard Endpoint

Thank you for this contribution! This is a well-structured PR that adds URL support to the guard endpoint. Here's my detailed review:

✅ Strengths

  1. Comprehensive Implementation: The PR adds URL support across all layers - API, Python SDK, TypeScript SDK, documentation, and tests.
  2. Good Documentation: Excellent documentation updates with clear examples in both SDK docs and README files.
  3. Test Coverage: Both Python and TypeScript SDKs include new test cases for the URL functionality.
  4. Consistent API Design: The URL detection logic is consistent between Python and TypeScript implementations.

🐛 Potential Bugs & Issues

1. URL Validation Logic Has Edge Cases (sdk/python/src/superagent_ai/client.py:222-224, sdk/typescript/src/index.ts:304-305)

The current URL detection is too simplistic:

is_url = isinstance(input, str) and (
    input.startswith("http://") or input.startswith("https://")
)

Issues:

  • A string like "http://not-a-real-url" or "https://" would be detected as a URL
  • No validation that the URL is well-formed
  • Text submitted for analysis could be misclassified as a URL if it happens to start with http://

Recommendation:

# Python
from urllib.parse import urlparse

def is_valid_url(s: str) -> bool:
    try:
        result = urlparse(s)
        return all([result.scheme in ('http', 'https'), result.netloc])
    except ValueError:  # urlparse raises ValueError for malformed hosts
        return False

is_url = isinstance(input, str) and is_valid_url(input)

// TypeScript
function isValidUrl(s: string): boolean {
  try {
    const url = new URL(s);
    return url.protocol === 'http:' || url.protocol === 'https:';
  } catch {
    return false;
  }
}

const isUrl = typeof input === 'string' && isValidUrl(input);
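A quick way to see what a parse-based check does and does not catch is to run it against the edge cases listed above. The following is a minimal Python sketch, assuming a helper named `is_valid_url` (illustrative only, not part of the SDK). Note that structural validation still accepts `http://not-a-real-url`, because whether a host exists cannot be decided by parsing alone.

```python
from urllib.parse import urlparse


def is_valid_url(s: str) -> bool:
    """Return True when s parses as an http(s) URL with a non-empty host."""
    try:
        result = urlparse(s)
    except ValueError:  # raised for malformed hosts such as "http://[bad"
        return False
    return result.scheme in ("http", "https") and bool(result.netloc)


samples = (
    "https://",                     # scheme only, no host
    "http://not-a-real-url",        # well-formed, but host may not exist
    "https://example.com/doc.pdf",  # ordinary URL
    "http is a protocol",           # plain text that starts with "http"
)
for sample in samples:
    print(f"{sample!r}: {is_valid_url(sample)}")
```

Reachability and safety of the host still have to be decided on the server, which is where the SSRF checks discussed later apply.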

2. Conflicting Input Validation (sdk/python/src/superagent_ai/client.py:226-227)

if not is_file and not input:
    raise GuardError("input must be a non-empty string, file, or URL")

This check happens AFTER determining is_url, but since is_url requires isinstance(input, str) to be true, the not input check will never catch URL cases properly. The logic flow could be clearer.

Recommendation:

# Check empty input first, before type determination
if not input:
    raise GuardError("input must be a non-empty string, file, or URL")

3. File Detection Logic is Fragile (sdk/python/src/superagent_ai/client.py:219)

is_file = hasattr(input, 'read') or not isinstance(input, str)

The or not isinstance(input, str) part means ANY non-string input (integers, booleans, None, etc.) would be treated as a file, which could cause confusing errors downstream.

Recommendation:

is_file = hasattr(input, 'read') or hasattr(input, 'file') or (
    not isinstance(input, str) and hasattr(input, '__fspath__')
)

🔒 Security Concerns

1. SSRF Vulnerability Risk ⚠️ HIGH PRIORITY

The API will download PDFs from ANY URL provided by the user. This creates a Server-Side Request Forgery (SSRF) vulnerability where attackers could:

  • Access internal network resources (http://localhost:6379, http://169.254.169.254/latest/meta-data/)
  • Scan internal ports
  • Access cloud metadata endpoints
  • Bypass firewall restrictions

Recommendations:

  1. URL validation on the backend - Block private IP ranges, localhost, link-local addresses
  2. Implement allowlist/blocklist - Consider allowing only specific domains or blocking dangerous ones
  3. Add timeout limits - Prevent hanging on slow/malicious endpoints
  4. Limit file size - Prevent DoS via large file downloads
  5. Follow redirects carefully - Limit redirect depth and validate redirect targets

Example validation (for backend implementation):

import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url: str) -> bool:
    try:
        parsed = urlparse(url)

        # Only allow http/https, and require a hostname
        if parsed.scheme not in ('http', 'https') or not parsed.hostname:
            return False

        # Resolve hostname to an IPv4 address (gethostbyname does not cover IPv6)
        ip = socket.gethostbyname(parsed.hostname)
        ip_obj = ipaddress.ip_address(ip)

        # Block private/local addresses
        if ip_obj.is_private or ip_obj.is_loopback or ip_obj.is_link_local:
            return False

        return True
    except (ValueError, OSError):  # malformed URL or failed DNS lookup
        return False

2. Missing Content-Type Validation

There's no validation that the URL actually points to a PDF file. An attacker could provide URLs to:

  • Extremely large files (DoS)
  • Non-PDF files that might cause parsing errors
  • Malicious PDFs with exploits

Recommendation:

  • Check Content-Type header before downloading
  • Validate file size before full download (use HEAD request)
  • Verify PDF magic bytes after download
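These three checks can be expressed as small pure functions that the backend calls before and after the download. The sketch below is an assumption about how that might look: the function names and the 10 MB cap are illustrative, and the magic-byte check is deliberately strict (it requires `%PDF-` at offset 0, although some readers tolerate leading bytes).

```python
from typing import Optional

MAX_BYTES = 10 * 1024 * 1024  # assumed cap; the real limit is a product decision
PDF_MAGIC = b"%PDF-"


def headers_look_ok(content_type: Optional[str], content_length: Optional[str]) -> bool:
    """Check response headers (for example from a HEAD request) before downloading."""
    if content_type is None:
        return False
    # Ignore parameters such as "; charset=binary".
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "application/pdf":
        return False
    if content_length is not None:
        # A missing Content-Length is allowed here; the download itself must still be capped.
        if not content_length.isdigit() or int(content_length) > MAX_BYTES:
            return False
    return True


def has_pdf_magic(body: bytes) -> bool:
    """Check the downloaded bytes, since headers can lie."""
    return body.startswith(PDF_MAGIC)


print(headers_look_ok("application/pdf", "2048"))
print(has_pdf_magic(b"%PDF-1.7\n..."))
```

Header checks are only a first filter, because the server controls them; the size cap and the magic-byte check on the actual bytes are what protect the parser.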

⚡ Performance Considerations

1. Synchronous URL Downloads Block Request Processing

The API downloads the PDF synchronously during the request. For large files or slow networks, this could:

  • Block the API server thread/worker
  • Cause request timeouts
  • Create poor user experience

Recommendations:

  1. Implement async processing: Return immediately with a job ID, process in background
  2. Add streaming: Stream the download and processing
  3. Set reasonable timeouts: Add explicit timeout for URL fetches (5-10 seconds)
  4. Add file size limits: Reject downloads over a certain size (e.g., 10MB)
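The size limit is easiest to enforce while streaming, so an oversized response is abandoned early instead of being buffered in full. A minimal sketch follows, assuming the HTTP client exposes the body as a file-like stream; `read_capped` and `DownloadTooLarge` are hypothetical names.

```python
import io
from typing import BinaryIO


class DownloadTooLarge(Exception):
    """Raised when a response body exceeds the configured limit."""


def read_capped(stream: BinaryIO, max_bytes: int, chunk_size: int = 64 * 1024) -> bytes:
    """Read the stream in chunks and stop as soon as max_bytes is exceeded."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise DownloadTooLarge(f"body exceeds {max_bytes} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


body = read_capped(io.BytesIO(b"%PDF-1.4 example"), max_bytes=1024)
print(len(body))
```

This does not replace a network timeout: a slow server can still trickle bytes under the cap, so the fetch needs its own connect and read timeouts as well.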

2. No Caching Strategy

If the same URL is analyzed multiple times, it will be downloaded repeatedly.

Recommendation:

  • Consider caching PDF downloads with TTL (time-to-live)
  • Respect Cache-Control headers
  • Store hash of content to detect duplicates
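As an illustration of the TTL and content-hash ideas, here is a small in-memory sketch. It is an assumption about one possible shape, not the project's design: `TTLCache` is a hypothetical class, entries are keyed by URL, and a real deployment would more likely use a shared store and honor `Cache-Control`.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Cache downloaded bodies by URL and expire them after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, str, bytes]] = {}

    def put(self, url: str, body: bytes) -> str:
        """Store the body and return its SHA-256 digest for duplicate detection."""
        digest = hashlib.sha256(body).hexdigest()
        self._entries[url] = (self._clock(), digest, body)
        return digest

    def get(self, url: str) -> Optional[bytes]:
        """Return the cached body, or None when missing or expired."""
        entry = self._entries.get(url)
        if entry is None:
            return None
        stored_at, _digest, body = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[url]
            return None
        return body


cache = TTLCache(ttl_seconds=300)
cache.put("https://example.com/doc.pdf", b"%PDF-1.4 example")
print(cache.get("https://example.com/doc.pdf") is not None)
```

The digest returned by `put` lets the caller notice when two different URLs serve identical content and reuse an earlier analysis result.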

🧪 Test Coverage

Positive Aspects:

  • Both SDKs have new test cases for URL functionality
  • Tests use real URLs (arxiv.org PDF)

Gaps:

  1. No error case testing for:

    • Invalid URLs
    • Unreachable URLs (404, 500 errors)
    • Non-PDF content
    • URLs that timeout
    • URLs that redirect
    • Very large files
  2. No mock testing: Tests use real external URLs which:

    • Makes tests flaky (dependent on external service)
    • Slows test execution
    • Could fail if URL becomes unavailable

Recommendations:

# Python test additions
@pytest.mark.asyncio
async def test_guard_rejects_invalid_url():
    client = create_client(api_base_url=API_BASE_URL, api_key=API_KEY)
    
    with pytest.raises(GuardError):
        await client.guard("not-a-url")

@pytest.mark.asyncio
async def test_guard_handles_unreachable_url():
    client = create_client(api_base_url=API_BASE_URL, api_key=API_KEY)
    
    # Use a URL that will definitely 404
    result = await client.guard("https://example.com/nonexistent.pdf")
    # Assert appropriate error handling

# Add mocked tests
@pytest.mark.asyncio
async def test_guard_url_with_mock(httpx_mock):
    httpx_mock.add_response(
        url="https://test.com/doc.pdf",
        content=b"%PDF-1.4...",  # Mock PDF content
    )
    # Test with mocked response

📝 Minor Issues

1. OpenAPI Schema Inconsistency (docs/openapi.json:431-449)

The schema shows properties as not required, but the description says "At least one of text, file, or url must be provided." This requirement isn't enforced at the schema level.

Recommendation:
Use oneOf or add custom validation description to make the requirement clearer.

2. Error Message Could Be More Specific (sdk/typescript/src/index.ts:308)

throw new GuardError("input must be a non-empty string, file, or URL.");

Consider making the error message indicate what was actually received:

throw new GuardError(`input must be a non-empty string, file, or URL. Received: ${typeof input}`);

3. Changelog Version Bump Premature

The CHANGELOG shows version 0.0.18, but this should typically be updated as part of the release process, not in the feature PR itself. Consider removing version bumps from the PR and handling them separately during release.


🎯 Recommendations Summary

High Priority:

  1. ⚠️ Address SSRF vulnerability - Add URL validation on backend to prevent internal network access
  2. Add proper URL validation instead of simple startsWith check
  3. Implement timeouts and file size limits for URL downloads

Medium Priority:

  4. Improve error handling for invalid/unreachable URLs
  5. Add comprehensive test coverage for error cases
  6. Consider async processing for large file downloads

Low Priority:

  7. Add response caching for frequently accessed URLs
  8. Use mocked tests to avoid external dependencies
  9. Improve error messages with more context


✨ Overall Assessment

This is a solid feature addition with good documentation and basic test coverage. The main concerns are around security (SSRF) and robustness (URL validation, error handling). With the suggested improvements, this will be a strong addition to the codebase.

Great work on maintaining consistency across multiple SDKs and keeping documentation in sync!

@claude

claude bot commented Nov 22, 2025

Pull Request Review: Add URL Support to Guard Endpoint

Overview

This PR adds support for passing URLs (particularly PDF URLs) to the guard endpoint. The implementation is well-structured and consistent across SDKs, with good test coverage and documentation updates.

✅ Strengths

Code Quality & Consistency

  • Excellent consistency: Both TypeScript and Python SDKs follow identical patterns for URL detection and handling
  • Clean implementation: URL detection using startsWith("http://") or startsWith("https://") is simple and effective
  • Good separation of concerns: Three distinct code paths for text, file, and URL inputs
  • Proper error handling: Maintains existing error handling patterns

Documentation & Testing

  • Comprehensive documentation updates: All SDK docs, API docs, and OpenAPI spec updated
  • Good test coverage: Both SDKs include tests for URL functionality using a real PDF URL (arxiv.org)
  • Helpful examples: CLI and SDK documentation include clear usage examples
  • Updated CHANGELOG: Properly documented the new feature

SDK Implementation

Both SDKs properly handle the new URL input:

  • TypeScript (sdk/typescript/src/index.ts:305-306)
  • Python (sdk/python/src/superagent_ai/client.py:221-224)

The URL requests correctly use JSON format with {"url": input} and appropriate headers.

🔍 Areas for Improvement

1. Security Concerns (HIGH PRIORITY)

Server-Side Request Forgery (SSRF) Risk

Location: Backend API implementation (not visible in PR diff)

Issue: The API will download content from user-provided URLs, which creates potential SSRF vulnerabilities.

Recommendations:

  • Implement URL validation and sanitization
  • Whitelist allowed protocols (only https://, not http://)
  • Blacklist private IP ranges (RFC 1918: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
  • Blacklist localhost (127.0.0.0/8, ::1)
  • Blacklist link-local addresses (169.254.0.0/16)
  • Blacklist cloud metadata endpoints (169.254.169.254)
  • Implement timeout and size limits for downloads
  • Consider using a dedicated service with network isolation for URL fetching
  • Add rate limiting per user/API key for URL requests

Example validation code:

from urllib.parse import urlparse
import ipaddress

def is_safe_url(url: str) -> bool:
    parsed = urlparse(url)

    # Only allow HTTPS, and require a hostname
    if parsed.scheme != 'https' or not parsed.hostname:
        return False

    # Check literal IP hosts only; this does not resolve DNS names
    try:
        ip = ipaddress.ip_address(parsed.hostname)
        # Block private/internal IPs
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            return False
    except ValueError:
        pass  # not an IP literal; resolved addresses must be checked separately

    return True
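The address ranges listed in the recommendations can also be checked explicitly against a block list, which makes the policy easy to audit. This is a sketch under assumptions: `is_blocked_address` is a hypothetical helper that takes an already-resolved IP address string, and the list covers only the ranges named in this review, so it is not a complete SSRF defense.

```python
import ipaddress

# Ranges named in the review: RFC 1918 private ranges, loopback, link-local.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8",
        "172.16.0.0/12",
        "192.168.0.0/16",
        "127.0.0.0/8",
        "169.254.0.0/16",  # includes the 169.254.169.254 metadata address
        "::1/128",
    )
]


def is_blocked_address(address: str) -> bool:
    """Return True when the resolved address falls in a blocked range."""
    ip = ipaddress.ip_address(address)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it cannot bypass the IPv4 rules.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return any(ip in network for network in BLOCKED_NETWORKS)


for candidate in ("169.254.169.254", "10.1.2.3", "::ffff:127.0.0.1", "93.184.216.34"):
    print(candidate, is_blocked_address(candidate))
```

The check has to run on every address the hostname resolves to, and the connection should be made to the address that was checked; otherwise a DNS answer that changes between validation and fetch can slip through.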

Open Redirect / Unvalidated Redirects

Issue: If the backend follows redirects, malicious users could exploit this.

Recommendations:

  • Disable automatic redirect following OR
  • Validate redirect destinations using the same SSRF checks
  • Log all redirects for security monitoring
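Validating redirect destinations amounts to following redirects by hand and re-running the safety check on every hop. The sketch below assumes two injected callables so it stays independent of any HTTP library: `fetch(url)` returns a `(status, location)` pair without following redirects, and `is_safe(url)` is whatever SSRF check the backend uses. All names here are hypothetical.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


class UnsafeRedirect(Exception):
    """Raised when a hop is blocked or the redirect limit is exceeded."""


def follow_redirects(
    url: str,
    fetch: Callable[[str], Tuple[int, Optional[str]]],
    is_safe: Callable[[str], bool],
    max_hops: int = 3,
) -> str:
    """Return the final URL, validating the start URL and every redirect target."""
    for _ in range(max_hops + 1):
        if not is_safe(url):
            raise UnsafeRedirect(f"blocked URL: {url}")
        status, location = fetch(url)
        if status not in REDIRECT_STATUSES or not location:
            return url
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise UnsafeRedirect(f"more than {max_hops} redirects")


# Usage with a fake fetcher: a public URL that redirects to a loopback address.
responses = {
    "https://a.example/x": (302, "http://127.0.0.1/admin"),
}
try:
    follow_redirects(
        "https://a.example/x",
        fetch=lambda u: responses.get(u, (200, None)),
        is_safe=lambda u: "127.0.0.1" not in u,
    )
except UnsafeRedirect as exc:
    print(exc)
```

The string-based `is_safe` in the usage example is only a stand-in for the demonstration; a real check would resolve the host and inspect the resulting addresses.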

2. Performance Considerations

No Timeout Documentation

Locations:

  • sdk/typescript/src/index.ts:339-350
  • sdk/python/src/superagent_ai/client.py:247-257

Issue: URL downloads could take significantly longer than text/file analysis, but timeout behavior isn't documented.

Recommendations:

  • Document expected timeout behavior for URL requests in SDK docs
  • Consider implementing URL-specific longer timeouts
  • Add retry logic with exponential backoff for transient network failures
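Retry with exponential backoff can be sketched in a few lines. This is illustrative only: `retry_with_backoff` is a hypothetical helper, the choice of which exceptions count as transient is an assumption, and the `sleep` parameter is injectable so the behavior can be tested without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying transient failures with delays of base_delay * 2**n."""
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error to the caller
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")


# Usage: an operation that fails twice before succeeding.
calls = {"count": 0}

def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(retry_with_backoff(flaky, sleep=lambda seconds: None))
```

Only failures that are plausibly transient should be retried; a blocked URL or an oversized file will fail the same way every time.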

Resource Consumption

Issue: Large PDF files could consume significant memory/bandwidth.

Recommendations:

  • Implement and document file size limits (e.g., max 10MB)
  • Stream large files instead of loading entirely into memory
  • Return clear error messages when limits are exceeded

3. Error Handling & User Experience

Limited URL Validation in Client

Locations:

  • sdk/typescript/src/index.ts:305-306
  • sdk/python/src/superagent_ai/client.py:221-224

Issue: Client only checks if URL starts with http:// or https://, but doesn't validate URL format.

Recommendations:

// Add URL validation
const isUrl = typeof input === 'string' && 
  (input.startsWith("http://") || input.startsWith("https://"));

if (isUrl) {
  try {
    new URL(input); // Validates URL format
  } catch {
    throw new GuardError("Invalid URL format provided.");
  }
}

Missing Error Context

Issue: When URL downloads fail, users may not get clear error messages.

Recommendations:

  • Add specific error types for URL-related failures (e.g., URLDownloadError, URLTimeoutError)
  • Include more context in error messages (e.g., "Failed to download PDF from URL: Connection timeout")
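A small exception hierarchy is one way to carry that context. In the sketch below, `GuardError` is redefined locally as a stand-in so the example is self-contained; `URLDownloadError` and `URLTimeoutError` take their names from the suggestion above, and their constructor arguments are assumptions.

```python
class GuardError(Exception):
    """Stand-in for the SDK's GuardError so this sketch runs on its own."""


class URLDownloadError(GuardError):
    """The backend could not download the document at the given URL."""

    def __init__(self, url: str, reason: str):
        super().__init__(f"Failed to download PDF from URL: {reason}")
        self.url = url
        self.reason = reason


class URLTimeoutError(URLDownloadError):
    """The download did not complete within the allowed time."""

    def __init__(self, url: str, timeout_seconds: float):
        super().__init__(url, f"Connection timeout after {timeout_seconds:g}s")
        self.timeout_seconds = timeout_seconds


try:
    raise URLTimeoutError("https://example.com/doc.pdf", timeout_seconds=30)
except GuardError as exc:  # existing handlers for GuardError keep working
    print(exc)
```

Because both new types derive from `GuardError`, callers that already catch `GuardError` are unaffected, while callers that care can catch the narrower types.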

4. Testing Improvements

Test Reliability

Locations:

  • sdk/typescript/tests/guard.test.ts:47-68
  • sdk/python/tests/test_guard.py:64-83

Issue: Tests depend on external arxiv.org URL being available and accessible.

Recommendations:

  • Add tests with mocked responses for more reliable CI/CD
  • Test error cases: invalid URLs, unreachable URLs, non-PDF content, oversized files
  • Consider using a dedicated test file hosted on a controlled server
  • Add integration test flag to skip URL tests when running offline

Example test cases to add:

# Test invalid URL
with pytest.raises(GuardError):
    await client.guard("not-a-valid-url")

# Test non-HTTPS URL (if blocked)
with pytest.raises(GuardError):
    await client.guard("http://example.com/file.pdf")

# Test unreachable URL
with pytest.raises(GuardError):
    await client.guard("https://this-does-not-exist-12345.com/file.pdf")

5. Documentation Enhancements

Missing Information

  • No documentation about supported file types (only PDF? what about other formats?)
  • No information about maximum file size limits
  • No guidance on URL authentication (what if PDF requires authentication?)
  • Missing information about redirect handling

Recommendations:
Add to SDK documentation:

### URL Requirements
- Must use HTTPS protocol (HTTP is not supported for security)
- Maximum file size: 10MB
- Supported formats: PDF only
- Authentication: Public URLs only (no authentication headers)
- Redirects: Up to 3 redirects followed
- Timeout: 30 seconds for download + analysis time

6. API Design Considerations

Mutual Exclusivity Not Enforced in Schema

Location: docs/openapi.json:428-451

Issue: OpenAPI schema says "Only one input method should be provided" but doesn't enforce it.

Recommendations:

  • Use OpenAPI oneOf to enforce mutual exclusivity
  • Backend should validate and return clear error if multiple inputs provided
  • Document precedence if multiple inputs are provided (currently unclear)

{
  "oneOf": [
    {
      "required": ["text"],
      "properties": { "text": {...} }
    },
    {
      "required": ["file"],
      "properties": { "file": {...} }
    },
    {
      "required": ["url"],
      "properties": { "url": {...} }
    }
  ]
}
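On the backend, the same exactly-one rule could be enforced with a small validation step before any processing. This is a hedged sketch: `select_input` is a hypothetical function, the field names come from the schema discussed above, and treating empty strings as "not provided" is an assumption.

```python
from typing import Any, Mapping

INPUT_FIELDS = ("text", "file", "url")


def select_input(payload: Mapping[str, Any]) -> str:
    """Return the single input field that was provided, or raise ValueError."""
    provided = [name for name in INPUT_FIELDS if payload.get(name) not in (None, "")]
    if len(provided) != 1:
        found = ", ".join(provided) if provided else "none"
        raise ValueError(
            f"exactly one of text, file, or url must be provided (got: {found})"
        )
    return provided[0]


print(select_input({"url": "https://example.com/doc.pdf"}))
```

Returning an explicit error for multiple inputs avoids having to define and document a precedence order between them.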

📊 Test Coverage Assessment

Good coverage for:

  • Happy path: URL acceptance and processing ✅
  • Basic callback functionality ✅
  • Response structure validation ✅

Missing coverage for:

  • Invalid URL formats ❌
  • Network failures (timeout, DNS failure, connection refused) ❌
  • Non-PDF content at URL ❌
  • Oversized files ❌
  • HTTP (non-HTTPS) URLs ❌
  • URLs with authentication requirements ❌
  • Malformed responses from backend ❌

🏗️ Architecture Questions

  1. Backend Implementation: This PR doesn't show the backend API changes. Has the backend been updated to:

    • Validate URLs for SSRF?
    • Implement size limits?
    • Handle network failures gracefully?
    • Return appropriate error codes?
  2. Rate Limiting: Are URL requests rate-limited separately from text/file requests (since they consume more resources)?

  3. Monitoring: Are URL requests logged/monitored for security analysis?

🎯 Recommendations Priority

Must Fix (Security):

  1. ⚠️ Implement SSRF protection in backend
  2. ⚠️ Validate URLs on server-side (whitelist HTTPS, blacklist private IPs)
  3. ⚠️ Add file size limits

Should Fix (Reliability):

  4. Add comprehensive error handling for network failures
  5. Improve test coverage for error cases
  6. Document timeout behavior and limits

Nice to Have (Polish):

  7. Enforce mutual exclusivity in OpenAPI schema
  8. Add client-side URL validation
  9. Create dedicated error types for URL-related failures

✨ Overall Assessment

This is a well-implemented feature with good consistency across SDKs and comprehensive documentation. The main concerns are around security (SSRF risks) and test coverage for failure scenarios.

Recommendation: Approve with required changes for security hardening before merging to production.

Checklist for Merge:

  • SSRF protection implemented in backend
  • File size limits enforced
  • Timeout behavior documented
  • Error handling improved
  • Additional test cases added for failure scenarios
  • Security review completed

Great work on the implementation! The code is clean and well-structured. Address the security concerns and this will be ready to ship. 🚀


Review generated with assistance from Claude Code

@homanp homanp merged commit 12faa6d into main Nov 24, 2025
5 checks passed