diff --git a/web/pandas/pdeps/0018-nullable-object-dtype.md b/web/pandas/pdeps/0018-nullable-object-dtype.md new file mode 100644 index 0000000000000..78f0000d43cac --- /dev/null +++ b/web/pandas/pdeps/0018-nullable-object-dtype.md @@ -0,0 +1,175 @@ +# PDEP-18: Nullable Object Dtype for Pandas + +- Created: 07 June 2025 +- Status: Draft +- Discussion: [#32931](https://github.com/pandas-dev/pandas/issues/32931) +- Author: [Simon Hawkins](https://github.com/simonjayhawkins) +- Revision: 1 + +## Abstract + +This proposal outlines the introduction of a nullable object +dtype to the pandas library. The goal is to provide a +dedicated dtype for handling arbitrary Python objects with +consistent missing value semantics using `pd.NA`. Unlike the +traditional `object` dtype which lacks robust missing data +handling, this new nullable dtype will add clarity and +consistency in representing missing or undefined values +within object arrays. + +## Motivation + +Currently, the `object` dtype in pandas is a catch-all for +heterogeneous Python objects, but it does not enforce any +particular missing-value semantics. As pandas has evolved to +include extension types (like `string[python]`, `Int64`, or +`boolean`), there is a clear benefit in extending these +improvements to the object datatype. A nullable object dtype +would help: +- **Consistency**: Enforce a uniform approach to managing +missing values with `pd.NA` across all dtypes. +- **Interoperability**: Enable cleaner and more predictable +behavior when performing operations on data previously +stored as generic objects. +- **Clarity**: Help users distinguish between truly “object” +data and data that is better represented by a nullable +container supporting missing values. + +This proposal is driven by frequent community discussions +and development efforts that aim to unify missing value +handling across pandas data types. + +## Detailed Proposal + +### Definition + +The proposal introduces a new extension type, tentatively +named `"object_nullable"`, that stores an underlying array +of Python objects alongside a boolean mask that indicates +missing (i.e., `pd.NA`) values. The API should mimic that of +existing extension arrays, ensuring that missing value +propagation, casting, and arithmetic comparisons (where +applicable) behave consistently with other nullable types. + +### Key Features +1. **Consistent Missing Value Semantics**: + - Missing entries will be represented by `pd.NA`, + ensuring compatibility with pandas nullable dtypes that + use `pd.NA` as the missing value indicator as well as + the experimental `ArrowDType`. + - Operations that encounter missing values will handle + `pd.NA` uniformly consistent with other pandas nullable + dtypes that use `pd.NA` as the missing value indicator. +2. **Underlying Data Storage**: + - The core data structure will consist of a NumPy array + of Python objects and an associated boolean mask. (not + so different from the current `object` backed nullable + string array variant that uses `pd.NA` as the missing + value.) + - Consideration should be given to performance, ensuring + that operations remain as vectorized as possible despite + the inherent overhead of handling Python objects. +3. **API Integration**: + - The new dtype will implement the ExtensionArray + interface. + - Methods such as `astype`, `isna`, `fillna`, and + element-wise operations are already defined to respect + missing values in the other pandas nullable dtypes. + - All operations on a nullable object array will return + a pandas nullable array except where requested, such as + `astype`. Methods like `fillna` would still return a + nullable object array even though there are no missing + values to avoid introducing mixed-propagation behavior. + - Ensure compatibility with pandas functions, like + groupby, concatenation, and merging, where the semantics + of missing values are critical. +4. **Transition and Interoperability**: + - Users should be able to convert from the legacy object + dtype to object_nullable using a constructor or an + explicit method (e.g., `pd.array(old_array, + dtype="object_nullable")`) using the existing api. + - Operations on existing pandas nullable dtypes that + would normally produce an object dtype should be updated + (or made configurable as a transition path) to yield + "object_nullable" in all cases even when missing values + are not present to avoid introducing mixed-propagation + behavior. + - `ArrowDType` does not offer an `object` dtype for + heterogeneous Python objects and therefore a user + requesting arrow dtypes could be given "object_nullable" + arrays where appropriate to avoid mixed `pd.NA`/`np.nan` + semantics when using `dtype_backend="pyarrow"`. + + +### Implementation Considerations +1. **Performance**: + - Handling arbitrary Python objects is inherently slower + than operations on native numerical types. + - Expanding the EA interface to 2D is outside the scope + of this PDEP. + +2. **Backward Compatibility**: + - Existing code that uses the traditional object dtype + should not break. (Making the pandas nullable object + dtype the default is not part of this proposal and would + be discussed in conjunction with moving the other pandas + nullable dtypes to be default.) + - Existing code that uses the pandas nullable dtypes + should not break without warnings, even though they are + considered experimental, as these dtypes have been + available to users for a long time. The new dtype can be + offered as an opt-in feature initially. +3. **Testing and Documentation**: + - Extensive tests will be required to validate behavior + against edge cases. + - Updated documentation should explain differences + between the legacy object dtype and object_nullable, + including examples and migration tips. +4. **Community Feedback**: + - Continuous discussions on GitHub, mailing lists, and + related channels will inform refinements. The nullable + object dtype should be available as opt-in for at least + 2 minor versions to allow sufficient time for feedback + before the return types of the existing pandas nullable + dtypes are changed. + +## Alternatives Considered +- Continuing with the Legacy Object Dtype: + - Retaining the ambiguous missing value semantics of the + legacy object dtype does not provide a robust and + consistent solution, aligning with the design of other + extension arrays. + - Not having a nullable object dtype could potentially + be a blocker for a potential future nullable by default + policy. + +## Drawbacks and Future Directions +1. **Overhead Cost**: +The additional memory required for a boolean mask and +possible performance penalties in highly heterogeneous +arrays are acknowledged trade-offs. +2. **Integration Complexity**: +Ensuring seamless integration with the full suite of pandas +functionality may reveal edge cases that require careful +handling. +3. **Incompatibility**: +The existing object array can hold any python object, even +`pd.NA` itself. The proposed nullable object array will be +unable to hold `np.nan`, `None` or `pd.NaT` as these will be +considered missing in the constructors and other conversions +when following the existing API for the other nullable +types. Users will not be able to round-trip between the +legacy and nullable object dtypes. + +## Conclusion +Introducing a nullable object dtype in pandas will offer a +clearer semantic for missing values and align the behavior +of object arrays with other nullable types. This proposal is +aimed at fostering discussion and soliciting community +feedback to refine the design and implementation roadmap. + + + +## PDEP-18 History + +- 07 June 2025: Initial version.