Memory Leak in Sharded Zarr Indexing #3164

Open
@rm1113

Description

Zarr version

v3.0.8

Numcodecs version

v0.16.1

Python Version

3.12

Operating System

Linux (WSL2)

Installation

pip

Description

The RAM consumption when reading a local Zarr array created with the shards option depends on the indexing method: selecting the same data via a slice or via a list of indices yields dramatically different memory usage. If the shards parameter is omitted from the create_array call, memory usage for both selection methods is similar.
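For context, plain in-memory NumPy arrays show an analogous (and well-documented) split: basic slicing returns a view, while integer-list (fancy) indexing allocates a full copy. For a stored Zarr array both selections materialize new arrays, so the gap reported here presumably comes from how the sharded chunks are fetched rather than from this view/copy distinction, but the NumPy baseline is useful to keep in mind (a minimal sketch, not the Zarr code path):

```python
import numpy as np

x = np.zeros((1_000, 200))

s = x[0:500]              # basic slicing: a view, no new data buffer
f = x[list(range(500))]   # fancy indexing: allocates a full copy

assert s.base is x        # the view shares x's buffer
assert f.base is None     # the copy owns its own buffer
```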

[Memory-profile plots: slice selection (~50 MiB) vs list-of-indices selection (~800 MiB)]

Steps to reproduce

import zarr
import numpy as np

N_ROWS = 10_000
N_COLS = 2_000
CHUNK_ROW = 5_000
CHUNK_COL = 100
ARRAY_PATH = "/tmp/tmp.zarr"

# create a sharded array
rng = np.random.default_rng(seed=42)
x = rng.random((N_ROWS, N_COLS))

array = zarr.create_array(
    store=ARRAY_PATH,
    data=x,
    chunks=(CHUNK_ROW, CHUNK_COL),
    shards=(CHUNK_ROW, 500),
    overwrite=True,
)

# read data
array = zarr.open_array(ARRAY_PATH)

x = array[slice(0, CHUNK_ROW)]    # RAM consumption 50 MiB 
y = array[list(range(CHUNK_ROW))] # RAM consumption 800 MiB 
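One possible workaround (an untested sketch; indices_to_slices is a hypothetical helper, not part of the zarr API) is to collapse a contiguous index list into slice objects and read those instead, staying on the cheaper slice-selection path:

```python
def indices_to_slices(indices):
    """Group a sorted list of unique ints into contiguous slice objects."""
    slices = []
    start = prev = indices[0]
    for i in indices[1:]:
        if i == prev + 1:
            prev = i
            continue
        slices.append(slice(start, prev + 1))  # close the current run
        start = prev = i                       # open a new run
    slices.append(slice(start, prev + 1))      # close the final run
    return slices

# e.g. [0, 1, 2, 7, 8] -> [slice(0, 3), slice(7, 9)]
```

Reading array[s] for each resulting slice and concatenating the pieces should keep memory close to the slice case above, at the cost of multiple reads.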

Additional output

No response
