feat: OPTIC-1938: Proxy for storages when presigned urls are off #7354

Merged on Apr 20, 2025 (90 commits)

Commits
e3889f7
feat: OPTIC-1938: Proxy for storages when presigned urls are off
makseq Apr 8, 2025
143f4ab
Add first version of good implementation
makseq Apr 9, 2025
36700b1
Revert
makseq Apr 9, 2025
5d01e25
Fixes
makseq Apr 9, 2025
6698b3f
Fix
makseq Apr 9, 2025
999b48b
Fix
makseq Apr 9, 2025
4e009b4
Always generate /presign
makseq Apr 9, 2025
fc310dd
Redesign
makseq Apr 9, 2025
76e0366
Fix
makseq Apr 9, 2025
3f4b678
Move resover s3 to storage mixin
makseq Apr 9, 2025
845be0f
Sync Follow Merge dependencies
robot-ci-heartex Apr 9, 2025
21202b7
Merge branch 'develop' of github.com:heartexlabs/label-studio into fb…
makseq Apr 9, 2025
36c39df
Merge branch 'fb-optic-1938' of github.com:heartexlabs/label-studio i…
makseq Apr 9, 2025
72b784d
Fix storage list project undefined
makseq Apr 9, 2025
702bf34
Remove useEffect
makseq Apr 9, 2025
94fd140
Move gcs to storagemixin
makseq Apr 9, 2025
771ffde
Fix azure
makseq Apr 9, 2025
b834dbf
Fixes
makseq Apr 10, 2025
c6003ec
Fixes
makseq Apr 11, 2025
6386d8a
Remove fflag_fix_all_lsdv_4711_cors_errors_accessing_task_data_short
makseq Apr 11, 2025
8f05e10
Merge branch 'develop' of github.com:heartexlabs/label-studio into fb…
makseq Apr 11, 2025
62a44d1
Merging
makseq Apr 11, 2025
c1b93a9
Build
makseq Apr 11, 2025
1f84bfc
Fix
makseq Apr 11, 2025
0a292a9
test_presign_storage_data.py fixed
makseq Apr 11, 2025
79e49cd
Turn on ff
makseq Apr 11, 2025
14d8338
Fix test with 1
makseq Apr 11, 2025
b6d030c
Linters
makseq Apr 11, 2025
a416ab9
Blue
makseq Apr 11, 2025
5cf75b4
Merge branch 'develop' into 'fb-optic-1938'
robot-ci-heartex Apr 11, 2025
e3d92a2
Sync Follow Merge dependencies
robot-ci-heartex Apr 11, 2025
0f9e80e
Fix test
makseq Apr 11, 2025
e680508
Merge branch 'fb-optic-1938' of github.com:heartexlabs/label-studio i…
makseq Apr 11, 2025
34d70a7
Fix
makseq Apr 11, 2025
54a1302
Fix
makseq Apr 11, 2025
eee27db
serviceworker change to allow the new presign resolving endpoint to b…
bmartel Apr 11, 2025
3624620
Merge branch 'fb-optic-1938' of github.com:HumanSignal/label-studio i…
bmartel Apr 11, 2025
22d8f33
fixed serviceworker cache logic
bmartel Apr 11, 2025
1732313
fixed serviceworker cache logic
bmartel Apr 11, 2025
b275ad2
simplify the logic
bmartel Apr 11, 2025
413e1df
Add more unit tests
makseq Apr 12, 2025
9c943bc
Merge branch 'fb-optic-1938' of github.com:heartexlabs/label-studio i…
makseq Apr 12, 2025
8cf799b
Linters
makseq Apr 12, 2025
1c16686
Fix
makseq Apr 12, 2025
ae4da82
Fix
makseq Apr 12, 2025
246fdf1
Sync Follow Merge dependencies
robot-ci-heartex Apr 14, 2025
d9bd25f
Fix cache invalidate
makseq Apr 14, 2025
ece1b0b
Merge branch 'fb-optic-1938' of github.com:heartexlabs/label-studio i…
makseq Apr 14, 2025
9909bb9
Merge branch 'develop' of github.com:heartexlabs/label-studio into fb…
makseq Apr 14, 2025
aa20d3d
Fixes
makseq Apr 14, 2025
48020fc
Fix blue
makseq Apr 15, 2025
6c09fe8
blue
makseq Apr 15, 2025
9257191
Merge branch 'develop' into 'fb-optic-1938'
robot-ci-heartex Apr 15, 2025
64eabfc
Sync Follow Merge dependencies
robot-ci-heartex Apr 15, 2025
6718f3f
Sync Follow Merge dependencies
robot-ci-heartex Apr 15, 2025
a6cced0
Sync Follow Merge dependencies
robot-ci-heartex Apr 16, 2025
fe212b1
Add TimeoutRangedFileResponse
makseq Apr 16, 2025
57d2ec2
Linter
makseq Apr 16, 2025
eb20cf8
Merge branch 'fb-optic-1938' of github.com:heartexlabs/label-studio i…
makseq Apr 16, 2025
efa9128
Sync Follow Merge dependencies
robot-ci-heartex Apr 16, 2025
660af02
Add direct stream from S3
makseq Apr 16, 2025
6530f88
Merge branch 'fb-optic-1938' of github.com:heartexlabs/label-studio i…
makseq Apr 16, 2025
58b9c6a
Merge branch 'develop' of github.com:heartexlabs/label-studio into fb…
makseq Apr 16, 2025
c620965
Add normal streaming for s3
makseq Apr 16, 2025
3a9840a
Before iter_chunks
makseq Apr 17, 2025
f22757c
Working streaming for s3 iter_chunks
makseq Apr 17, 2025
6c319ac
Second working version
makseq Apr 17, 2025
1b2e223
Fixes with working version of streaming s3
makseq Apr 17, 2025
acd3fe3
Working cleanup
makseq Apr 17, 2025
ca2490b
Blue & ruff
makseq Apr 17, 2025
9ce2cb5
Fixes
makseq Apr 17, 2025
58af33d
Linters
makseq Apr 17, 2025
27c0274
Merge branch 'develop' of github.com:heartexlabs/label-studio into fb…
makseq Apr 17, 2025
5dad575
fix tests
makseq Apr 17, 2025
6af0df9
Working GCS streaming
makseq Apr 18, 2025
156147d
Working Azure
makseq Apr 19, 2025
88f2206
==> GCS old streaming implementation and new with direct http request…
makseq Apr 19, 2025
1bfc758
New GCS streaming over streaming http request
makseq Apr 19, 2025
5d7cdf6
Fix tests
makseq Apr 19, 2025
2d90d42
Tests are fixed
makseq Apr 19, 2025
57d9b7a
Sync Follow Merge dependencies
makseq Apr 19, 2025
af8279e
Merge branch 'develop' into 'fb-optic-1938'
makseq Apr 19, 2025
5e1d590
Add GCS timeout
makseq Apr 19, 2025
76b7dad
Merge branch 'develop' of github.com:heartexlabs/label-studio into fb…
makseq Apr 19, 2025
e7df0e3
Merge branch 'fb-optic-1938' of github.com:heartexlabs/label-studio i…
makseq Apr 19, 2025
f6bfb83
Tests passed. Everything is working
makseq Apr 19, 2025
8a055bb
Revert azure
makseq Apr 19, 2025
10f3754
Fix for GCS
makseq Apr 19, 2025
d397c6c
Linters
makseq Apr 19, 2025
10e2538
Fix tests again
makseq Apr 19, 2025
36 changes: 24 additions & 12 deletions label_studio/core/all_urls.json
@@ -275,18 +275,6 @@
     "name": "data_import:storage-data-upload",
     "decorators": ""
   },
-  {
-    "url": "/tasks/<int:task_id>/presign/",
-    "module": "data_import.api.TaskPresignStorageData",
-    "name": "data_import:task-storage-data-presign",
-    "decorators": ""
-  },
-  {
-    "url": "/projects/<int:project_id>/presign/",
-    "module": "data_import.api.ProjectPresignStorageData",
-    "name": "data_import:project-storage-data-presign",
-    "decorators": ""
-  },
   {
     "url": "/api/dm/views/",
     "module": "data_manager.api.ViewAPI",
@@ -947,6 +935,30 @@
     "name": "storages:api:export-storage-localfiles-form",
     "decorators": ""
   },
+  {
+    "url": "/tasks/<int:task_id>/resolve/",
+    "module": "io_storages.proxy_api.TaskResolveStorageUri",
+    "name": "storages:task-storage-data-resolve",
+    "decorators": ""
+  },
+  {
+    "url": "/projects/<int:project_id>/resolve/",
+    "module": "io_storages.proxy_api.ProjectResolveStorageUri",
+    "name": "storages:project-storage-data-resolve",
+    "decorators": ""
+  },
+  {
+    "url": "/tasks/<int:task_id>/presign/",
+    "module": "io_storages.proxy_api.TaskResolveStorageUri",
+    "name": "storages:task-storage-data-presign",
+    "decorators": ""
+  },
+  {
+    "url": "/projects/<int:project_id>/presign/",
+    "module": "io_storages.proxy_api.ProjectResolveStorageUri",
+    "name": "storages:project-storage-data-presign",
+    "decorators": ""
+  },
   {
     "url": "/api/ml/",
     "module": "ml.api.MLBackendListAPI",
11 changes: 11 additions & 0 deletions label_studio/core/settings/base.py
@@ -820,3 +820,14 @@ def collect_versions_dummy(**kwargs):
 }

 LOGOUT_REDIRECT_URL = get_env('LOGOUT_REDIRECT_URL', None)
+
+RESOLVER_PROXY_BUFFER_SIZE = int(get_env('RESOLVER_PROXY_BUFFER_SIZE', 512 * 1024))
+RESOLVER_PROXY_TIMEOUT = int(get_env('RESOLVER_PROXY_TIMEOUT', 20))
+RESOLVER_PROXY_MAX_RANGE_SIZE = int(get_env('RESOLVER_PROXY_MAX_RANGE_SIZE', 8 * 1024 * 1024))
+RESOLVER_PROXY_GCS_DOWNLOAD_URL = get_env(
+    'RESOLVER_PROXY_GCS_DOWNLOAD_URL',
+    'https://storage.googleapis.com/download/storage/v1/b/{bucket_name}/o/{blob_name}?alt=media',
+)
+RESOLVER_PROXY_GCS_HTTP_TIMEOUT = int(get_env('RESOLVER_PROXY_GCS_HTTP_TIMEOUT', 5))
+RESOLVER_PROXY_ENABLE_ETAG_CACHE = get_bool_env('RESOLVER_PROXY_ENABLE_ETAG_CACHE', True)
+RESOLVER_PROXY_CACHE_TIMEOUT = int(get_env('RESOLVER_PROXY_CACHE_TIMEOUT', 3600))
5 changes: 3 additions & 2 deletions label_studio/core/static/js/sw.js
@@ -43,19 +43,20 @@ async function handlePresignedUrl(event) {
   // even when it is not expired.
   // This is so the server can just naively proxy the presign request
   // and the performance is not degraded.
-  const requestPathToCache = "/presign/";
+  const requestsPathToCache = /\/presign\/|\/resolve\//;

   // Check if the request URL doesn't match the specified path part
   // or if it is not a request to the same origin we will just fetch it
   if (
     !event.request.url.startsWith(self.location.origin) ||
-    !event.request.url.includes(requestPathToCache) ||
+    !requestsPathToCache.test(event.request.url) ||
     // This is to avoid an error trying to load a direct presign URL in a new tab
     !event.request.referrer ||
     // Easier to leave this uncached as if we were to handle this caching
     // it would be more complex and not really worth the effort as it is not the most repeated
     // request and it is not a big deal if it is not cached.
     // If this was to be cached it forces a full resource request instead of a chunked ranged one.
+    // This also avoids caching the data directly when resolving uri -> data, which is not the goal of this service worker
     event.request.headers.get("range")
   ) {
     // For other requests, just allow the network to handle it
90 changes: 1 addition & 89 deletions label_studio/data_import/api.py
@@ -1,11 +1,9 @@
 """This file and its contents are licensed under the Apache License 2.0. Please see the included NOTICE for copyright information and LICENSE for a copy of the license.
 """
-import base64
 import json
 import logging
 import mimetypes
 import time
-from typing import Union
 from urllib.parse import unquote, urlparse

 import drf_yasg.openapi as openapi
@@ -18,7 +16,7 @@
 from csp.decorators import csp
 from django.conf import settings
 from django.db import transaction
-from django.http import HttpRequest, HttpResponse, HttpResponseRedirect
+from django.http import HttpResponse
 from django.utils.decorators import method_decorator
 from drf_yasg.utils import swagger_auto_schema
 from projects.models import Project, ProjectImport, ProjectReimport
@@ -748,89 +746,3 @@ def get(self, request, *args, **kwargs):
         response['X-Accel-Redirect'] = redirect
         response['Content-Disposition'] = 'attachment; filename="{}"'.format(filepath)
         return response
-
-
-class PresignAPIMixin:
-    def handle_presign(self, request: HttpRequest, fileuri: str, instance: Union[Task, Project]) -> Response:
-        model_name = type(instance).__name__
-
-        if not instance.has_permission(request.user):
-            return Response(status=status.HTTP_403_FORBIDDEN)
-
-        # Attempt to base64 decode the fileuri
-        try:
-            fileuri = base64.urlsafe_b64decode(fileuri.encode()).decode()
-        # For backwards compatibility, try unquote if this fails
-        except Exception as exc:
-            logger.debug(
-                f'Failed to decode base64 {fileuri} for {model_name} {instance.id}: {exc} falling back to unquote'
-            )
-            fileuri = unquote(fileuri)
-
-        try:
-            resolved = instance.resolve_storage_uri(fileuri)
-        except Exception as exc:
-            logger.error(f'Failed to resolve storage uri {fileuri} for {model_name} {instance.id}: {exc}')
-            return Response(status=status.HTTP_404_NOT_FOUND)
-
-        if resolved is None or resolved.get('url') is None:
-            return Response(status=status.HTTP_404_NOT_FOUND)
-
-        url = resolved['url']
-        max_age = 0
-        if resolved.get('presign_ttl'):
-            max_age = resolved.get('presign_ttl') * 60
-
-        # Proxy to presigned url
-        response = HttpResponseRedirect(redirect_to=url, status=status.HTTP_303_SEE_OTHER)
-        response.headers['Cache-Control'] = f'no-store, max-age={max_age}'
-
-        return response
-
-
-class TaskPresignStorageData(PresignAPIMixin, APIView):
-    """A file proxy to presign storage urls at the task level."""
-
-    swagger_schema = None
-    http_method_names = ['get']
-    permission_classes = (IsAuthenticated,)
-
-    def get(self, request, *args, **kwargs):
-        """Get the presigned url for a given fileuri"""
-        request = self.request
-        task_id = kwargs.get('task_id')
-        fileuri = request.GET.get('fileuri')
-
-        if fileuri is None or task_id is None:
-            return Response(status=status.HTTP_400_BAD_REQUEST)
-
-        try:
-            task = Task.objects.get(pk=task_id)
-        except Task.DoesNotExist:
-            return Response(status=status.HTTP_404_NOT_FOUND)
-
-        return self.handle_presign(request, fileuri, task)
-
-
-class ProjectPresignStorageData(PresignAPIMixin, APIView):
-    """A file proxy to presign storage urls at the project level."""
-
-    swagger_schema = None
-    http_method_names = ['get']
-    permission_classes = (IsAuthenticated,)
-
-    def get(self, request, *args, **kwargs):
-        """Get the presigned url for a given fileuri"""
-        request = self.request
-        project_id = kwargs.get('project_id')
-        fileuri = request.GET.get('fileuri')
-
-        if fileuri is None or project_id is None:
-            return Response(status=status.HTTP_400_BAD_REQUEST)
-
-        try:
-            project = Project.objects.get(pk=project_id)
-        except Project.DoesNotExist:
-            return Response(status=status.HTTP_404_NOT_FOUND)
-
-        return self.handle_presign(request, fileuri, project)
6 changes: 0 additions & 6 deletions label_studio/data_import/urls.py
@@ -23,10 +23,4 @@
     # special endpoints for serving imported files
     path('data/upload/<path:filename>', api.UploadedFileResponse.as_view(), name='data-upload'),
     path('storage-data/uploaded/', api.DownloadStorageData.as_view(), name='storage-data-upload'),
-    path('tasks/<int:task_id>/presign/', api.TaskPresignStorageData.as_view(), name='task-storage-data-presign'),
-    path(
-        'projects/<int:project_id>/presign/',
-        api.ProjectPresignStorageData.as_view(),
-        name='project-storage-data-presign',
-    ),
 ]
71 changes: 70 additions & 1 deletion label_studio/io_storages/README.md
@@ -115,4 +115,73 @@ All these states are present in both the open-source and enterprise editions for
4. RQ workers were killed manually => `storage_background_failure` wasn't called.
5. Job was removed from RQ Queue => it's not a failure, but we need to update storage status somehow.

To handle these cases correctly, all these conditions must be checked in ensure_storage_status when the Storage List API is retrieved.

## Storage Proxy API

The Storage Proxy API is a critical component that handles access to files stored in cloud storages (S3, GCS, Azure, etc.). It serves two main purposes:

1. **Security & Access Control**: It acts as a secure gateway to cloud storage resources, enforcing Label Studio's permission model and preventing direct exposure of cloud credentials to the client.

2. **Flexible Content Delivery**: It supports two modes of operation based on the storage configuration:
- **Redirect Mode** (`presign=True`): Generates pre-signed URLs with temporary access and redirects the client to them. This is efficient as content flows directly from the storage to the client.
- **Proxy Mode** (`presign=False`): Streams content through the Label Studio server. This provides additional security and is useful when storage providers don't support pre-signed URLs or when administrators want to enforce stricter access control.

### How It Works

1. When tasks contain references to cloud storage URIs (e.g., `s3://bucket/file.jpg`), these are converted to proxy URLs (`/tasks/{task_id}/resolve/?fileuri=base64encodeduri`).

2. When a client requests this URL, the Proxy API:
- Decodes the URI and locates the appropriate storage connection
- Validates user permissions for the task/project
- Either redirects to a pre-signed URL or streams the content directly, based on the storage's `presign` setting

3. The API handles both task-level and project-level resources through dedicated endpoints:
- `/tasks/<task_id>/resolve/` - for resolving files referenced in tasks
- `/projects/<project_id>/resolve/` - for resolving project-level resources

This architecture ensures secure, controlled access to cloud storage resources while maintaining flexibility for different deployment scenarios and security requirements.
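
As a rough illustration of step 1, the sketch below builds a task-level resolve URL from a storage URI. The decoding side mirrors the `base64.urlsafe_b64decode` call in the handler this PR removes from `data_import/api.py`; the helper names `build_resolve_url` and `decode_fileuri` are hypothetical and are not part of Label Studio.

```python
import base64


def build_resolve_url(task_id: int, storage_uri: str) -> str:
    """Hypothetical helper: encode a storage URI into a task-level proxy URL."""
    # URL-safe base64 keeps characters such as ':' and '/' out of the query string.
    encoded = base64.urlsafe_b64encode(storage_uri.encode()).decode()
    return f"/tasks/{task_id}/resolve/?fileuri={encoded}"


def decode_fileuri(fileuri: str) -> str:
    """Hypothetical helper: reverse the encoding on the server side."""
    return base64.urlsafe_b64decode(fileuri.encode()).decode()


url = build_resolve_url(42, "s3://bucket/file.jpg")
# Round trip: the query parameter decodes back to the original storage URI.
original = decode_fileuri(url.split("fileuri=", 1)[1])
```

The project-level endpoint would follow the same pattern with `/projects/{project_id}/resolve/`.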

### Proxy Mode Optimizations

The Proxy Mode has been optimized with several mechanisms to improve performance, reliability, and resource utilization:

* *Range Header Processing*

The `override_range_header` function processes and intelligently modifies Range headers to limit stream sizes:

- It enforces a maximum size for range requests (controlled by `RESOLVER_PROXY_MAX_RANGE_SIZE`)
- Converts unbounded range requests (`bytes=123456-`) to bounded ones
- Handles various range request formats including header probes (`bytes=0-`)
- Prevents worker exhaustion by chunking large file transfers
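
The bounding behaviour described above can be sketched as follows. This is a simplified illustration, not the actual `override_range_header` implementation: the function name `clamp_range_header` is hypothetical, and the real function may treat probes such as `bytes=0-` and other formats differently.

```python
import re
from typing import Optional

# Assumed to mirror the RESOLVER_PROXY_MAX_RANGE_SIZE default in settings.
MAX_RANGE_SIZE = 8 * 1024 * 1024


def clamp_range_header(range_header: Optional[str], max_size: int = MAX_RANGE_SIZE) -> Optional[str]:
    """Hypothetical sketch: bound a single byte range to at most max_size bytes."""
    if not range_header:
        return None
    match = re.fullmatch(r"bytes=(\d+)-(\d*)", range_header.strip())
    if match is None:
        # Suffix ranges (bytes=-500) and multi-range requests are left untouched here.
        return range_header
    start = int(match.group(1))
    end_text = match.group(2)
    max_end = start + max_size - 1
    # An open-ended range becomes a bounded one; an oversized range is truncated.
    end = max_end if end_text == "" else min(int(end_text), max_end)
    return f"bytes={start}-{end}"
```

With this approach a client that asks for `bytes=123456-` receives one bounded chunk and is expected to issue follow-up range requests for the rest of the file.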

* *Time-Limited Streaming*

The `time_limited_chunker` generator provides controlled streaming with timeout protection:

- Stops yielding chunks after a configurable timeout period (`RESOLVER_PROXY_TIMEOUT`)
- Uses buffer-sized chunks (`RESOLVER_PROXY_BUFFER_SIZE`) for efficient memory usage
- Tracks stream performance statistics and logs them, including any timeouts, as debug info
- Properly closes streams to prevent resource leaks
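
A minimal sketch of the idea, assuming a file-like stream with `read()` and `close()`. The name `time_limited_chunks` and its defaults are illustrative only; the real `time_limited_chunker` also records statistics, which is omitted here.

```python
import io
import time
from typing import BinaryIO, Iterator


def time_limited_chunks(stream: BinaryIO, chunk_size: int = 512 * 1024, timeout: float = 20.0) -> Iterator[bytes]:
    """Hypothetical sketch: yield chunks until the stream ends or the deadline passes."""
    deadline = time.monotonic() + timeout
    try:
        while time.monotonic() < deadline:
            chunk = stream.read(chunk_size)
            if not chunk:
                break  # end of stream
            yield chunk
    finally:
        # Always release the underlying stream, even if the consumer stops early.
        stream.close()


data = io.BytesIO(b"x" * 10)
chunks = list(time_limited_chunks(data, chunk_size=4, timeout=5.0))
```

The deadline is only checked between chunks, so a single slow `read()` can still overrun it; a separate network-level timeout would be needed to cover that case.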

* *Response Header Management*

The `prepare_headers` function manages HTTP response headers for optimal client handling:

- Forwards important headers from storage providers (Content-Length, Content-Range, Last-Modified)
- Enables range requests with Accept-Ranges header
- Implements cache control with configurable TTL (`RESOLVER_PROXY_CACHE_TIMEOUT`)
- Generates ETags based on user permissions to invalidate cache when access changes
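
The header handling above might look roughly like the sketch below. Everything here is an assumption made for illustration: the function name `build_proxy_headers`, the exact `Cache-Control` value, and the inputs hashed into the ETag are not taken from the actual `prepare_headers` implementation.

```python
import hashlib
from typing import Dict, Mapping

# Headers forwarded from the storage provider, as listed above.
FORWARDED_HEADERS = ("Content-Length", "Content-Range", "Last-Modified")


def build_proxy_headers(
    upstream: Mapping[str, str],
    user_id: int,
    can_access: bool,
    fileuri: str,
    cache_timeout: int = 3600,
) -> Dict[str, str]:
    """Hypothetical sketch: assemble response headers for a proxied file."""
    headers = {name: upstream[name] for name in FORWARDED_HEADERS if name in upstream}
    headers["Accept-Ranges"] = "bytes"
    # Assumed cache policy; the TTL corresponds to RESOLVER_PROXY_CACHE_TIMEOUT.
    headers["Cache-Control"] = f"private, max-age={cache_timeout}"
    # The ETag changes whenever the user's access changes, so cached copies stop validating.
    fingerprint = f"{user_id}:{can_access}:{fileuri}".encode()
    headers["ETag"] = '"' + hashlib.sha256(fingerprint).hexdigest() + '"'
    return headers
```

Tying the ETag to the permission state means a conditional request no longer matches once access is revoked, which is the invalidation effect the last bullet describes.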

### *Environment Variables*

The Storage Proxy API behavior can be configured using the following environment variables:

| Variable | Description | Default |
|----------|-------------|---------|
| `RESOLVER_PROXY_BUFFER_SIZE` | Size in bytes of each chunk when streaming data | 512*1024 |
| `RESOLVER_PROXY_TIMEOUT` | Maximum time in seconds a streaming connection can remain open | 20 |
| `RESOLVER_PROXY_MAX_RANGE_SIZE` | Maximum size in bytes for a single range request | 8*1024*1024 |
| `RESOLVER_PROXY_CACHE_TIMEOUT` | Cache TTL in seconds for proxy responses | 3600 |

These optimizations ensure that the Proxy API remains responsive and resource-efficient, even when handling large files or many concurrent requests.
17 changes: 12 additions & 5 deletions label_studio/io_storages/all_api.py
@@ -127,8 +127,8 @@
             response = view(request._request, *args, **kwargs)
             payload = response.data
             if not isinstance(payload, list):
-                raise ValueError('Response is not list')
-            return response.data
+                raise ValueError(f'Response is not list: {payload}')
+            return payload
         except Exception:
             logger.error(f"Can't process {api.__class__.__name__}", exc_info=True)
             return []
@@ -158,9 +158,16 @@
     permission_required = all_permissions.projects_change

     def _get_response(self, api, request, *args, **kwargs):
-        view = api.as_view()
-        response = view(request._request, *args, **kwargs)
-        return response.data
+        try:
+            view = api.as_view()
+            response = view(request._request, *args, **kwargs)
+            payload = response.data
+            if not isinstance(payload, list):
+                raise ValueError(f'Response is not list: {payload}')
+            return payload
+        except Exception:
+            logger.error(f"Can't process {api.__class__.__name__}", exc_info=True)
+            return []

     def list(self, request, *args, **kwargs):
         list_responses = sum(