Desktop: Importing from OneNote: Fix large .one files with many attachments failing to import (#15117)
Refactors reader.rs
…mport/fix-large-file-handling
This reverts commit d3db685.
This reverts commit 3b83843.
Credit: Issues were flagged by Claude Code during a review. (The implemented solutions are my own.)
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough

Replaces eager in-memory file reads with streamed, seekable I/O using u64/BigInt sizes across Node/WASM/native layers; Reader and FileBlob adopt file-backed, lazy-loading models; FileHandle exposes byte_length(); a new error variant for integer conversion is added.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant JS as Node (node_functions.js)
    participant WASM as WASM bindings
    participant Rust as parser-utils (Reader / FileHandle)
    participant FS as Filesystem
    JS->>WASM: fileReader(path) -> create handle
    JS-->>WASM: size() returns BigInt
    JS->>WASM: read(bigPosition, bigLength) (BigInt)
    WASM->>Rust: read(offset:u64, size:u64)
    Rust->>Rust: Reader.seek/read -> Vec<u8>
    Rust->>FS: fs.read via native driver (BufReader)
    Rust-->>WASM: Uint8Array
    WASM-->>JS: bytes returned
```
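The walkthrough mentions a new error variant for integer conversion. A minimal sketch of the kind of checked u64-to-usize conversion such a variant supports (the function name and error type here are illustrative, not the crate's API):

```rust
// Hypothetical helper: convert a decoded u64 size to usize, failing loudly
// instead of truncating. On wasm32 (usize = 32 bits) an oversized value
// would otherwise wrap around silently.
fn checked_size(size: u64) -> Result<usize, String> {
    usize::try_from(size).map_err(|_| format!("size {size} does not fit in usize"))
}

fn main() {
    assert_eq!(checked_size(42), Ok(42));
    println!("ok");
}
```

The same shape applies anywhere a parsed 64-bit length is about to index or allocate host memory.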
Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 5
🧹 Nitpick comments (5)
packages/onenote-converter/parser/src/shared/compact_u64.rs (1)
23-82: Please add regression cases for every width branch.

This parser now has separate decoding and EOF paths for 7/14/21/28/35/42/49/64-bit values, but the exercised tests in this patch still only cover the zero case. A small table-driven suite here would make the refactor much safer.
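A table-driven suite along those lines could look like this sketch. The decoder below is a simplified stand-in (one tag bit selecting a 7-bit immediate versus a full-width value), not the real `CompactU64::parse`; the point is the test shape:

```rust
// Stand-in decoder, NOT the real CompactU64::parse: bit 0 of the first byte
// selects a 7-bit immediate, otherwise an 8-byte little-endian value follows.
fn parse(input: &[u8]) -> Result<u64, &'static str> {
    let first = *input.first().ok_or("unexpected EOF")?;
    if first & 1 != 0 {
        Ok((first >> 1) as u64)
    } else {
        let rest: [u8; 8] = input
            .get(1..9)
            .ok_or("unexpected EOF")?
            .try_into()
            .unwrap();
        Ok(u64::from_le_bytes(rest))
    }
}

fn main() {
    // One row per branch, plus truncated inputs to pin the EOF behaviour.
    let cases: &[(&[u8], Result<u64, &'static str>)] = &[
        (&[0b0000_0001], Ok(0)),               // zero case
        (&[0b1010_1011], Ok(0x55)),            // 7-bit branch
        (&[0, 1, 0, 0, 0, 0, 0, 0, 0], Ok(1)), // full-width branch
        (&[0, 1, 2], Err("unexpected EOF")),   // truncated full-width input
        (&[], Err("unexpected EOF")),          // empty input
    ];
    for &(input, expected) in cases {
        assert_eq!(parse(input), expected, "input: {input:?}");
    }
    println!("all cases passed");
}
```

A real suite would add one row per width (7/14/21/28/35/42/49/64 bits) against the actual parser.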
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/onenote-converter/parser/src/shared/compact_u64.rs` around lines 23 - 82: Add table-driven regression tests that cover every width branch in CompactU64::parse: test the zero-case, the 7-bit branch (first_byte & 1), 14-bit (first_byte & 2), 21-bit (first_byte & 4), 28-bit (8), 35-bit (16), 42-bit (32), 49-bit (64) and the 64-bit full path (first_byte & 128). For each case construct a Reader input that exercises both normal decoding and EOF/error behavior (e.g., truncated byte slices) and assert the returned CompactU64 value or the error; reference the parse function and Reader.get_u8/get_u64/read methods to locate where to craft inputs and expected outputs.

packages/onenote-converter/renderer/src/page/image.rs (1)
11-15: Stream image payloads straight to disk.

Line 11 still materialises the whole attachment before Line 15 writes it, so one very large image can still hit the same memory ceiling this PR is trying to remove. A chunked write path here would close most of the remaining gap for single huge attachments.
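A chunked copy loop of the kind suggested might look like this sketch (the 3 MB buffer mirrors the learning quoted below; function name and buffer size are illustrative):

```rust
use std::io::{Read, Write};

// Sketch: copy from any reader to any writer in fixed-size chunks so the
// full payload is never resident in memory at once.
fn write_streamed(mut src: impl Read, mut dst: impl Write) -> std::io::Result<u64> {
    let mut buf = vec![0u8; 3 * 1024 * 1024]; // 3 MB chunks
    let mut total = 0u64;
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        dst.write_all(&buf[..n])?;
        total += n as u64;
    }
    Ok(total)
}

fn main() -> std::io::Result<()> {
    let data = vec![7u8; 10_000];
    let mut out = Vec::new();
    assert_eq!(write_streamed(&data[..], &mut out)?, 10_000);
    assert_eq!(out, data);
    println!("ok");
    Ok(())
}
```

`std::io::copy` provides the same behaviour with its own default buffer, if no custom chunk size is needed.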
Based on learnings: "The fix uses `streamFileDataUri` to read large files in 3 MB chunks ... so the full file content is never in memory at once."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/onenote-converter/renderer/src/page/image.rs` around lines 11 - 15: The current code calls image.data()? which materializes the entire payload before writing; change to a streaming/chunked write so large images are never fully in memory: use determine_image_filename(image)? to get filename and fs_driver().join(&self.output, &filename) for path, then open/create a writable file handle via the fs_driver API (or add a write_file_stream method) and iterate over the image's streaming API (e.g., streamFileDataUri or whatever yields chunks from image) writing each chunk to the file handle until finished; remove the image.data()? call and replace fs_driver().write_file(&path, &data[..])? with the chunked write loop so only a small buffer is allocated at a time.

packages/onenote-converter/parser-utils/src/file_api/native_driver.rs (1)
91-95: Consider propagating metadata errors instead of defaulting to 0.

If `metadata()` fails on an already-opened file, returning `0` will cause subsequent seeks to fail with misleading "out-of-bounds" errors. Whilst this scenario is rare, propagating the actual error would aid debugging.

Note: This would require changing the `FileHandle` trait's `byte_length()` signature to return `Result<u64>`, which may be a larger refactor.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/onenote-converter/parser-utils/src/file_api/native_driver.rs` around lines 91 - 95: The current FileHandle implementation for BufReader<fs::File> swallows metadata() errors by returning 0 in byte_length(), causing misleading out-of-bounds seeks; update the FileHandle::byte_length signature to return Result<u64, std::io::Error> (or the crate's Error type) and change the impl for BufReader<fs::File> to propagate get_ref().metadata()?.len() as Ok(len) (or map_ok) so metadata errors are returned instead of defaulting to 0; update callers of FileHandle::byte_length (and any trait impls) to handle the Result accordingly.

packages/onenote-converter/parser/src/shared/file_data_ref.rs (2)
15-18: `PartialEq` now means "same loader instance", not "same blob".

Two `FileBlob`s with identical bytes but different loaders now compare unequal, including independently-created empty blobs. If callers still treat this as a value type, that behavioural shift is going to be surprising. I'd either drop `PartialEq` or make the identity check explicit via a separate helper.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/onenote-converter/parser/src/shared/file_data_ref.rs` around lines 15 - 18, The current PartialEq impl on FileBlob compares loader identity (Rc::ptr_eq) rather than blob bytes, which makes value-equal blobs compare unequal; remove the impl PartialEq for FileBlob to avoid surprising behavior and instead add an explicit identity helper like FileBlob::is_same_loader(&self, other: &FileBlob) -> bool that returns Rc::ptr_eq(&self.loader, &other.loader); if true value-equality semantics are required elsewhere, add a separate method FileBlob::equals_bytes(&self, other: &FileBlob) -> bool that compares size and actual bytes (reading via the loader) rather than reintroducing PartialEq.
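The explicit identity helper described above could be sketched like this (`FileBlob` here is a stand-in holding only a loader field; the real type wraps a `FileDataLoader` trait object):

```rust
use std::rc::Rc;

// Stand-in for FileBlob; the real loader is Rc<dyn FileDataLoader>.
struct FileBlob {
    loader: Rc<Vec<u8>>,
}

impl FileBlob {
    /// Explicit loader-identity check, instead of a surprising PartialEq impl.
    fn is_same_loader(&self, other: &FileBlob) -> bool {
        Rc::ptr_eq(&self.loader, &other.loader)
    }
}

fn main() {
    let loader = Rc::new(vec![1u8, 2, 3]);
    let a = FileBlob { loader: Rc::clone(&loader) };
    let b = FileBlob { loader: Rc::clone(&loader) };
    let c = FileBlob { loader: Rc::new(vec![1u8, 2, 3]) }; // equal bytes, different loader
    assert!(a.is_same_loader(&b));
    assert!(!a.is_same_loader(&c)); // identity, not value, equality
    println!("ok");
}
```

Naming the method `is_same_loader` keeps callers from mistaking pointer identity for byte equality.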
21-23: Avoid letting `size` drift from the loader output.

`FileBlob` now has two sources of truth for length: the stored `size` and whatever `load()` returns. If they ever disagree, `len()` lies and the mismatch is only discovered much later. Consider deriving the size from the loader or validating `data.len() == self.size` in `load()`.

Also applies to: 53-65

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/onenote-converter/parser/src/shared/file_data_ref.rs` around lines 21 - 23, FileBlob currently trusts a stored size field while load() from FileDataLoader can return a different byte vector, leading len() being wrong; update FileBlob::load() to read the data via the FileDataLoader::load() implementation, then validate and reconcile the size: after obtaining data from load(), either set self.size = data.len() (deriving the canonical size from the loader output) or return an error if data.len() != self.size (choose consistent behavior for your domain), and ensure FileBlob::len() uses the reconciled self.size; refer to the FileDataLoader trait, FileBlob::load(), and FileBlob::len() when making the change.
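The validation variant of that suggestion is a one-line check on the load path; a sketch with stand-in types (the real loader is a `FileDataLoader` trait object, simulated here with a function pointer):

```rust
// Stand-in blob: `loader` simulates FileDataLoader::load().
struct FileBlob {
    size: usize,
    loader: fn() -> Vec<u8>,
}

impl FileBlob {
    fn load(&self) -> Result<Vec<u8>, String> {
        let data = (self.loader)();
        // Fail fast if the two sources of truth disagree, instead of
        // letting len() lie until much later.
        if data.len() != self.size {
            return Err(format!(
                "size mismatch: expected {}, loader returned {}",
                self.size,
                data.len()
            ));
        }
        Ok(data)
    }
}

fn main() {
    let good = FileBlob { size: 3, loader: || vec![1, 2, 3] };
    let bad = FileBlob { size: 5, loader: || vec![1, 2, 3] };
    assert!(good.load().is_ok());
    assert!(bad.load().is_err());
    println!("ok");
}
```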
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/onenote-converter/parser-utils/node_functions.js`:
- Around line 35-49: The read() and size() helpers currently coerce
bigPosition/bigLength to Number which can silently lose precision; preserve
BigInt end-to-end by passing bigPosition (and bigLength where supported) as
BigInt to fs.readSync (or validate and explicitly range-check before coercing to
Number), and have size() return the BigInt obtained from fs.fstatSync(fd, {
bigint: true }) rather than constructing BigInt(size) from a lost-precision
Number; update references in read, size, and any fstat usage to use the bigint
path and only convert to Number when you have validated the value fits safely.
In `@packages/onenote-converter/parser-utils/src/reader.rs`:
- Around line 152-159: The buffer-backed branch of as_data_ref (when matching
ReaderData::BufferRef in reader.rs) slices buffer[start..start + size] without
bounds checking and can panic; add a check comparing size to self.remaining()
(using self.data_offset) and if size > remaining() return Err(UnexpectedEof)
(consistent with read_exact behavior) instead of slicing, otherwise proceed to
create ReaderDataRef::Vec from the slice; ensure the same validation logic
applies conceptually to the BufferRef path to prevent malformed .one files from
causing a panic.
- Around line 230-236: The From<Box<dyn FileHandle>> impl for Reader<'a> assumes
the handle is at offset 0 but sets data_offset = 0 without synchronising the
actual FileHandle; update this constructor to either (A) seek/rewind the
underlying handle to position 0 and read its current position to set data_offset
consistently (use the FileHandle seek/rewind API before wrapping in
ReaderData::File), or (B) replace the From impl with a fallible constructor
(e.g., Reader::from_file(handle) -> Result<Reader, Error>) that queries and/or
seeks the handle to enforce the invariant and returns an error on failure;
ensure you update data_len via value.byte_length() and set data_offset from the
actual file cursor after the seek/query so Reader bookkeeping and the real file
cursor stay in sync.
- Around line 263-275: The FilePointer branch in ReaderDataRef uses
file.borrow_mut() and returns early on errors from file.seek or file.read_exact,
leaving the shared file cursor moved; fix by performing the seek/read inside a
sub-scope/closure or using a local Result (e.g. let read_result = (|| { ...
})();) so the `?` inside that closure returns an Err into read_result instead of
exiting the function, then always restore the cursor with
file.seek(SeekFrom::Start(original_offset))?; update the
ReaderDataRef::FilePointer code around original_offset, file.seek, and
read_result so cursor restoration runs regardless of read success or failure.
In `@packages/onenote-converter/parser/src/fsshttpb/data/binary_item.rs`:
- Around line 15-16: The code casts CompactU64::parse(...).value() (named size)
directly to usize before calling reader.read, which can truncate on platforms
where usize is smaller than u64; change the unchecked cast to a checked
conversion (e.g. use usize::try_from(size) or size.try_into()) and return an
error if the conversion fails, then pass the safely converted usize into
reader.read; update the CompactU64::parse usage and reader.read(size) call sites
(variables size and reader.read) to use the checked conversion and an
appropriate parse/error return on overflow.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1fb11b20-240c-46de-bbc5-f07d09094c32
📒 Files selected for processing (26)
- packages/onenote-converter/parser-utils/node_functions.js
- packages/onenote-converter/parser-utils/src/errors.rs
- packages/onenote-converter/parser-utils/src/file_api/api.rs
- packages/onenote-converter/parser-utils/src/file_api/native_driver.rs
- packages/onenote-converter/parser-utils/src/file_api/wasm_driver.rs
- packages/onenote-converter/parser-utils/src/reader.rs
- packages/onenote-converter/parser/src/bin/inspect.rs
- packages/onenote-converter/parser/src/fsshttpb/data/binary_item.rs
- packages/onenote-converter/parser/src/fsshttpb/data/stream_object.rs
- packages/onenote-converter/parser/src/fsshttpb/data_element/data_element_fragment.rs
- packages/onenote-converter/parser/src/fsshttpb/data_element/object_data_blob.rs
- packages/onenote-converter/parser/src/local_onestore/common/file_chunk_reference.rs
- packages/onenote-converter/parser/src/local_onestore/file_node/file_node.rs
- packages/onenote-converter/parser/src/local_onestore/file_structure/file_node_list.rs
- packages/onenote-converter/parser/src/local_onestore/file_structure/file_node_list_fragment.rs
- packages/onenote-converter/parser/src/local_onestore/one_store_file.rs
- packages/onenote-converter/parser/src/onenote/embedded_file.rs
- packages/onenote-converter/parser/src/onenote/image.rs
- packages/onenote-converter/parser/src/onenote/mod.rs
- packages/onenote-converter/parser/src/onestore/mod.rs
- packages/onenote-converter/parser/src/shared/compact_u64.rs
- packages/onenote-converter/parser/src/shared/file_data_ref.rs
- packages/onenote-converter/parser/src/shared/property.rs
- packages/onenote-converter/renderer/src/lib.rs
- packages/onenote-converter/renderer/src/page/embedded_file.rs
- packages/onenote-converter/renderer/src/page/image.rs
Has this now been fixed in the current beta? Where can I download it for macOS (Apple Silicon)? Then I will test it again for my 1.95 onepkg import into Joplin.
This has not yet been fixed in the current beta. (This change will need to be merged before it is available in a prerelease.)
Actionable comments posted: 1
♻️ Duplicate comments (2)
packages/onenote-converter/parser-utils/src/reader.rs (2)
152-164: ⚠️ Potential issue | 🟠 Major

Reject oversized `as_data_ref` requests before slicing.

`read_exact()` and `advance()` already normalise short input to `UnexpectedEof`, but Line 158 still panics on buffer-backed readers when `size > remaining()`, and Line 160 can still hand back a deferred invalid file range. Please validate `size` up front here as well.

Proposed fix

```diff
 pub fn as_data_ref(&mut self, size: usize) -> Result<ReaderDataRef> {
+    if self.remaining() < size as u64 {
+        return Err(
+            ErrorKind::UnexpectedEof("Unexpected EOF (Reader.as_data_ref)".into()).into(),
+        );
+    }
+
     match &mut self.data {
         ReaderData::BufferRef { buffer } => {
             let start = self.data_offset as usize;
             // Cloning needs to be done early with BufferRef, since we don't own the original
             // data. Large data should generally use `ReaderData::File`.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/onenote-converter/parser-utils/src/reader.rs` around lines 152 - 164, In Reader::as_data_ref, reject requests where size > self.remaining() up front to avoid panics and invalid file ranges: before matching on self.data (and before slicing buffer[start..start+size] or returning a FilePointer), check if size <= self.remaining() and return the same UnexpectedEof error variant used by read_exact()/advance() when the request is too large; apply this validation for both ReaderData::BufferRef (so slicing can't panic) and ReaderData::File (so the returned FilePointer does not expose an invalid offset/size).
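The up-front guard amounts to two lines; a runnable sketch with a buffer-only stand-in for the reader (the real type also has a file-backed variant that needs the same check):

```rust
// Stand-in for the buffer-backed reader variant.
struct Reader {
    data: Vec<u8>,
    data_offset: u64,
}

impl Reader {
    fn remaining(&self) -> u64 {
        self.data.len() as u64 - self.data_offset
    }

    fn as_data_ref(&mut self, size: usize) -> Result<Vec<u8>, &'static str> {
        // Same normalisation read_exact()/advance() use: reject oversized
        // requests instead of panicking on the slice below.
        if self.remaining() < size as u64 {
            return Err("Unexpected EOF (Reader.as_data_ref)");
        }
        let start = self.data_offset as usize;
        Ok(self.data[start..start + size].to_vec())
    }
}

fn main() {
    let mut r = Reader { data: vec![1, 2, 3, 4], data_offset: 1 };
    assert_eq!(r.as_data_ref(3), Ok(vec![2, 3, 4]));
    assert!(r.as_data_ref(4).is_err()); // would have panicked without the guard
    println!("ok");
}
```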
230-236: ⚠️ Potential issue | 🟠 Major

Synchronise the file handle before wrapping it.

This constructor sets `data_offset` to `0`, but never rewinds or checks the underlying `FileHandle`. If the caller passes a pre-seeked handle, the invariant documented on Lines 52-53 is broken immediately and later reads/peeks start from the wrong place.

Proposed fix

```diff
-impl<'a> From<Box<dyn FileHandle>> for Reader<'a> {
-    fn from(value: Box<dyn FileHandle>) -> Self {
-        Self {
+impl<'a> Reader<'a> {
+    pub fn from_file_handle(mut value: Box<dyn FileHandle>) -> Result<Self> {
+        value.seek(SeekFrom::Start(0))?;
+        Ok(Self {
             data_len: value.byte_length(),
             data_offset: 0,
             data: ReaderData::File(Rc::new(RefCell::new(value))),
-        }
+        })
     }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/onenote-converter/parser-utils/src/reader.rs` around lines 230 - 236, The From<Box<dyn FileHandle>> for Reader<'a> impl must ensure the incoming FileHandle is positioned at the start before assuming data_offset = 0; rewind or seek the boxed handle to position 0 on the FileHandle trait (e.g. call its rewind/seek-to-start method) before wrapping it in ReaderData::File and setting data_offset to 0, and handle/report any error from that operation so the Reader invariant (start at file beginning) holds for later reads/peeks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/onenote-converter/parser/src/shared/compact_u64.rs`:
- Around line 23-24: Change the parse function to accept a mutable reference to
the Reader (e.g. fn parse(reader: &mut Reader) -> Result<CompactU64>) because
Reader::get_u8() and Reader::read() require &mut self; update the function
signature for CompactU64::parse and any call sites (tests already pass &mut
Reader) so internal calls to reader.get_u8() and reader.read() compile.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: badd912d-42c7-481a-a9ad-ed5bb1249633
📒 Files selected for processing (4)
- packages/onenote-converter/parser-utils/src/errors.rs
- packages/onenote-converter/parser-utils/src/reader.rs
- packages/onenote-converter/parser/src/fsshttpb/data/binary_item.rs
- packages/onenote-converter/parser/src/shared/compact_u64.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- packages/onenote-converter/parser/src/fsshttpb/data/binary_item.rs
- packages/onenote-converter/parser-utils/src/errors.rs
🧹 Nitpick comments (1)
packages/onenote-converter/parser/src/onenote/mod.rs (1)
78-83: Breaking API change: `parse_onestore_raw` now accepts a path instead of raw bytes.

The signature changed from `parse_onestore_raw(&mut self, data: &[u8])` to `parse_onestore_raw(&mut self, path: &str)`. While the doc comment indicates this is "for debugging purposes", any existing callers passing raw byte data will break.

If raw byte parsing is still needed for debugging or testing, consider keeping a separate method or adding an overload that accepts `&[u8]`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/onenote-converter/parser/src/onenote/mod.rs` around lines 78 - 83, The public API was changed by replacing parse_onestore_raw(&mut self, data: &[u8]) with a path-based variant; restore backwards compatibility by keeping the existing path-based version (which uses fs_driver().open_file and Reader::try_from) and add back an overload parse_onestore_raw(&mut self, data: &[u8]) that constructs a Reader from the byte slice and delegates to parse_onestore (using Reader::try_from or a byte-cursor equivalent), so callers that pass raw bytes continue to work while the path-based helper remains available for debugging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 89d44d2e-f0a2-4cec-998e-7baea4a17cac
📒 Files selected for processing (2)
- packages/onenote-converter/parser-utils/src/reader.rs
- packages/onenote-converter/parser/src/onenote/mod.rs
Problem
Previously, Joplin would attempt to load entire `.one` files into memory before import. This could cause import to fail for large `.one` files. This could be a problem, for example, for `.one` files that include many large attachments, since attachments are bundled within the file.

In particular, `.one` files larger than 2 GB would fail to import completely, with a "RangeError: File size (3180223840) is greater than 2 GiB".

Notes:

- This issue is specific to `.one` files.
- `.onepkg` and `.zip` files are archives that can include one or more `.one` files. As such, `.onepkg` and `.zip` files larger than 2 GiB could previously be imported, provided that none of the `.one` files they contain are large.
Refactor to avoid loading full
.onefiles into memory during import.Remaining issues:
Testing
Windows:
- `.one` file and import it.

Screencast.from.2026-04-15.21-24-47.webm