Do not wait for reindex completion on filtered index creation #2980

Open
@AetherUnbound

Problem

Filtered index creation was recently throttled because of its effect on production API performance (#2975). This has extended the time it takes to complete the create_and_populate_filtered_index step, specifically the reindex call here:

self.es.reindex(
    body={
        "source": {
            "index": source_index,
            "query": {
                "bool": {
                    "must_not": [
                        # Use `terms` query for exact matching against
                        # unanalyzed raw fields
                        {"terms": {f"{field}.raw": sensitive_terms}}
                        for field in ["tags.name", "title", "description"]
                    ]
                }
            },
        },
        "dest": {"index": destination_index},
    },
    slices="auto",
    wait_for_completion=True,
)

The request appears to have a default read timeout of 43200 seconds (12 hours), per a recent exception:

Traceback (most recent call last):
  File "/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 466, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 461, in _make_request
    httplib_response = conn.getresponse()
                       ^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/http/client.py", line 1378, in getresponse
    response.begin()
  File "/usr/local/lib/python3.11/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/socket.py", line 706, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/venv/lib/python3.11/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 798, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 714, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 468, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 357, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='openverse-es-8-8-2-elasticsearch-production.private', port=9200): Read timed out. (read timeout=43200)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/venv/lib/python3.11/site-packages/elasticsearch/connection/http_requests.py", line 166, in perform_request
    response = self.session.send(prepared_request, **send_kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/requests/adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='openverse-es-8-8-2-elasticsearch-production.private', port=9200): Read timed out. (read timeout=43200)

Description

We should remove the wait_for_completion=True parameter from the reindex call and instead wait on the resulting task using Elasticsearch's task management API (or any existing mechanism the ingestion server already has for this). This will require adding steps to the create filtered media index DAG that wait for the reindex to complete before issuing the refresh command (which ensures replicas exist). We may also need to add a REFRESH action to the ingestion server API, which Airflow can call once the reindex step is complete.
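As a rough illustration of the proposal above, the sketch below submits the reindex as a background task and polls the task management API until it reports completion. The function name reindex_and_wait, the poll_interval and max_wait parameters, and the injectable sleep argument are all hypothetical and not part of the ingestion server; es is assumed to be an elasticsearch-py client exposing reindex and tasks.get. In the DAG, the polling half would more likely live in a separate Airflow step.

```python
import time


def reindex_and_wait(es, body, poll_interval=60.0, max_wait=None, sleep=time.sleep):
    """Start a reindex as a background task and poll until it finishes.

    Hypothetical helper: `es` is assumed to be an elasticsearch-py client.
    """
    # With wait_for_completion=False, Elasticsearch responds immediately
    # with a task ID instead of holding the HTTP connection open.
    submitted = es.reindex(body=body, slices="auto", wait_for_completion=False)
    task_id = submitted["task"]

    waited = 0.0
    while True:
        status = es.tasks.get(task_id=task_id)
        if status.get("completed"):
            # A finished task carries either an "error" or a "response".
            if "error" in status:
                raise RuntimeError(f"Reindex task {task_id} failed: {status['error']}")
            return status.get("response", {})
        if max_wait is not None and waited >= max_wait:
            raise TimeoutError(f"Reindex task {task_id} still running after {waited}s")
        sleep(poll_interval)
        waited += poll_interval
```

Because each poll is a short request, no single HTTP call has to stay open for the duration of the reindex, so the 12-hour read timeout no longer bounds how long the reindex may run.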

Alternatives

Alternatively, we could override the request_timeout parameter, which is available on all elasticsearch-py methods, with a value greater than 43200 seconds. This could serve as a short-term workaround.
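A minimal sketch of that workaround follows. The helper name reindex_blocking and the 24-hour value are assumptions chosen for illustration only; es is again assumed to be an elasticsearch-py client that accepts request_timeout as a per-request keyword argument.

```python
# Hypothetical value: any number above the current 43200 s default would do.
REINDEX_TIMEOUT_SECONDS = 24 * 60 * 60


def reindex_blocking(es, body, request_timeout=REINDEX_TIMEOUT_SECONDS):
    """Run a blocking reindex with a longer per-request read timeout.

    Hypothetical helper: `es` is assumed to be an elasticsearch-py client.
    """
    return es.reindex(
        body=body,
        slices="auto",
        wait_for_completion=True,
        # Overrides the client's default timeout for this call only.
        request_timeout=request_timeout,
    )
```

This keeps the existing blocking behaviour, so it only moves the ceiling: a reindex that outlasts the new value would fail in the same way.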

Additional context
