Description
Hi all,
I do apologize in advance if this topic was already discussed and documented somewhere - if that's the case, I'd be grateful for any pointers to those discussions.
I've been recently looking into some of the ways we can produce preview1-compatible WASM modules and at the same time not affecting the overall development of WASI. In our case, we run WASM runtime on customer devices, and the runtime in some cases can't be updated - so we're stuck with preview1 in the host for at least 4-5 years (and likely longer).
I've explored a few different options to stay up-to-date with the tooling (mainly, wasi-libc) and still remain compatible with the old runtimes we have in production:
- I considered having a bunch of ifdefs in the code; this is pretty much what @abrown proposed here: Document a plan for transitioning to preview 2 and beyond wasi-libc#476 and I think that's a good short-term plan (as mentioned in the doc, it's just for "transition"), I don't think that's going to be maintainable given that we need to keep preview1 around for quite some time
- Fork wasi-libc - that's another option we consider, but we'd rather avoid that given that backporting bugfixes and improvements from the upstream would become more and more difficult over time
- Implement wasi preview2 -> wasi preview 1 adapter
Whereas all three options are still on the table, I think the 3rd one is the least intrusive one, requires very little maintenance overhead (I think it's almost an one-time effort) and enables us to completely abandon preview1 references in the tooling.
I've started working on a prototype of the adapter. So far I've prototyped (it's very buggy and limited, don't use it yet :) ) clock and sockets APIs (using WAMR-specific preview1-like interfaces). I was able to run a simple tcp client/server application using @dicej 's wasi-libc branch. When implementing the streams API I realized that interfaces that return a list, e.g.
read: func(
len: u64
) -> list<u8>;
don't allow users to pass a pre-allocated buffer (e.g. on the stack); instead, host is expected to call WASM's malloc to allocate a memory for the return buffer. This is a problem for many embedded applications where dynamic allocations are not recommended or even not allowed.
In addition to that, for this to work with libc, the buffer allocated by the host must be then copied to a buffer provided as a parameter for the libc function, so we have something like (very high-level flow, missing lots of details but hopefully clearly explains the problem):
1. [wasm] user allocates a buffer (e.g. on the stack, char buf[1024])
2. [wasm] user calls wants to read data: recv(buf, 1024, ...other parameters)
3. [wasm] recv implementation calls host's read function
4. [host] read implementation calls WASM's malloc/realloc to allocate buffer for return value
5. [host] copies data into the newly allocated buffer
6. [host] returns a pointer back to WASM
7. [wasm] recv implementation copies data from the pointer returned from host into the user-provided buffer
8. [wasm] recv function frees the memory allocated by the host
So we not only spend some cycles to go from host back to WASM to call allocation function (which depending on the implementation might be slow) but also we only allocate a memory temporarily to copy the value back to a buffer already allocated by the user. This is how it works in wasi-libc and I'm not sure if other languages have similar problem - even if not, C and other languages relying on wasi-libc like C++ or Rust are probably popular enough to not neglect this problem.
I understand why list (and perhaps other data types) used as return values require allocating memory - WASM code doesn't know how big the return value is, so it's reasonable to let runtime do the allocation. However, there are some cases (like read function) where user already provides the maximum requested size, and in those cases they might want to also provide a buffer that's been already allocated.
I wonder if the problem of efficient data passing between host and WASM was discussed, and if so, what's the recommendation? I'm not sure whether this is a problem with a component model per se, or is it more a problem with a design of specific interfaces (or both)? For example, would it be possible to have a read
function to be something like:
read: func(out_data: list<u8>) -> error
? So the host knows that the list already points to a buffer of a specific length, and it should fill it with the data? This is just one of the ideas, but I'm curious what others think.
Please note it's not just a problem for the adapter from preview2 to preview1, and it's not only about any specific interface (I can imagine lots of different proposals follow the same pattern). If the problem is not being addressed, I think it might be a blocker for some of the embedded usecases to onboard to preview2+ (those projects would either stick to preview1 in some form, or not use WASM at all).
Activity
lukewagner commentedon Feb 28, 2024
Being able to use caller-supplied memory buffers for dynamically-sized return values of import calls has been discussed and also filed in #175. Having thought about it a bit more in the interim, it does seem quite doable (even addressing the tricky cases mentioned in #175) to add as a new
canonopt
. One reason it hasn't been prioritized yet is that there is a cute little hack you can do that mostly achieves the same effect (and, iiuc, this is what is currently done by the preview1-to-2 adapter):global
srealloc
function that returns the specialglobal
memory if the requested size is in boundsIf you're writing a preview2-to-1 adapter by hand (and thus you know the semantics of the call and thus that, in the function signature
read: func(length: u64, offset: u64) -> result<tuple<list<u8>, bool>, error-code>
, thatlength
will be the length of the returnedlist<u8>
, then you could:cabi_realloc(0, 0, 1, length)
to get a right-sized bufferand if the preview2 caller does the above cute-little-hack, then I think the whole path ends up forwarding the original caller-supplied buffer all the way to the preview1 call. There is the additional overhead of the
realloc
call, but perhaps it could be inlined.loganek commentedon Feb 29, 2024
Thanks @lukewagner , I'm glad to see that this scenario was taken into account already and there's a clean solution for that. I've created a simple example just to make sure I understand what you propose (there's lots of corner cases not being handled, it's just for demonstration purposes) - let me know if I got this right:
user_allocator_alloc
function returns the buffer or callsmalloc
in case the buffer is not big enough (that shouldn't happen though): https://github.com/loganek/wasm-wsp2-efficient-memory/blob/main/user_allocator.cread_data
function that mimicsread
-like libc function. The function initializes the allocator and calls theread
wit function. In caseread
function returns the user-provided buffer, I don't free the list, and I don't copy the data.user_allocator_alloc
to get the buffer and use it to store the data.The implementation is slightly different to what you proposed as I don't override realloc implementation. By doing that, we could probably apply the same optimization for non-adapter usecases, but I'd rather keep the first iteration as non-invasive as possible (although this approach won't work if the interface returns more than one list, which I don't think is the case for the preview1 adapter). We can extend the approach as next step if needed (or just wait for the #175 to have a proper support for that).
Given that supporting the adapter would require changes in wasi-libc, I'm curious if the maintainers would be open for those modifications (@sunfishcode , @sbc100, @abrown ). I think we'd need to have a high-level agreement whether the adapter is a way to go to support preview1 in wasi-libc, but for now just testing the water - would that be something we could perhaps have in wasi-libc?
The other potential place for this change would be a wit-bindgen itself (which eventually generates a code for wasi-libc anyway) but given we just want it (for now) for the preview2->preview1 adapter, and there are some edge cases that won't work for any WIT file, perhaps better to start with wasi-libc.
tschneidereit commentedon Feb 29, 2024
The wasmtime-(compiled to wasm)-based adapter @cpetig has been working on might fit the bill here.
loganek commentedon Feb 29, 2024
Thanks for sharing. I actually used some of the work done by @cpetig shared on Zulip to generate core modules with preview2 cannonical ABI to test my adapter :) (thanks @cpetig for the work BTW).
lukewagner commentedon Feb 29, 2024
@loganek I might be missing a constraint, but I think it should be possible to write this preview2-to-1 adapter without any changes to wasi-libc, since the adapter doesn't have to know about this realloc trick; it can just call realloc as a black box (and maybe a caller-supplied buffer is used, and maybe not). The trick is how to get line 22 of your adapter to be able to call the preview2-core-module-defined
realloc
function. If we're talking about vanilla wasi-sdk output, I think this is justcabi_realloc
, and so the question is how to link tocabi_realloc
from your adapter module (via static func indices or dynamic function table indices, etc).loganek commentedon Feb 29, 2024
Hmm, I can't imagine making the change without modifications of the wasi-libc; the wasi-libc code is the only place where the caller-supplied buffer is available. Once the preview2 function is being called, this information is lost. So I think that somewhere before the preview2 function is called, the wasi-libc must store the pointer of the caller-supplied buffer (for example, see here: https://github.com/WebAssembly/wasi-libc/pull/477/files#diff-36afeb04f659e0e36e627b55010e437bd0855e5a0b6edf82348cb7528b3c3d17R48).
Also, wasi-libc unconditionally copies the data from the buffer returned by preview2 function into a caller-supplied buffer, and frees the list (e.g. here https://github.com/WebAssembly/wasi-libc/pull/477/files#diff-36afeb04f659e0e36e627b55010e437bd0855e5a0b6edf82348cb7528b3c3d17R60-R61). If the buffer is the one that the caller provided, that function will try to release that buffer (which is wrong). So I don't really see how we could do that optimization without changing a bit wasi-libc code.
I think preview1->preview2 is a bit different - preview1 implementation calls preview2 function. Because preview1 has direct access to the buffer, it can do the "pseudo" allocation within the adapter.
Let me know if I'm missing anything obvious here. I'd love to encapsulate all the hacky logic within the adapter, but can't think of any good way to do that.
lukewagner commentedon Mar 1, 2024
Ah, I think the missing link is that while, iiuc, the preview1-to-2 adapter does do this caller-provided-buffer optimization, it looks like the native preview2 wasi-libc implementation does not (yet). But let's say that native preview2 wasi-libc added the caller-provided-buffer optimization (which would be well-motivated as a pure optimization): then this is not a detail that the preview2-to-1 adapter would have to care about: the preview2-to-1 adapter simply calls wasi-libc's
cabi_realloc
(as it would in any case), and it's just a wasi-libc impl detail that thiscabi_realloc
call is optimized to return the caller-provided buffer. Thus, while wasi-libc's preview2 support should probably do this optimization, it's not like it must or else the preview2-to-1 adapter breaks or anything like that.Does that make sense? If so, then what's nice is that there is no extra ABI contract beyond what plain preview2 specifies, which should reduce coupling.
loganek commentedon Mar 2, 2024
@lukewagner ok looks like we both talked about the same thing :) So we both agree that wasi-libc changes are necessary - I'll prepare a brief plan with changes and present it to the community. In my example repo I'm doing exactly that:
user_allocator_alloc
function acts as acabi_realloc
https://github.com/loganek/wasm-wsp2-efficient-memory/blob/8996b31bc329ff65ff7d7092002393a432a637f5/user_allocator.c#L26user_allocator_alloc
except for thecabi_realloc
: https://github.com/loganek/wasm-wsp2-efficient-memory/blob/main/adapter.c#L22I implement it as a separate function because I simply didn't want to modify the auto-generated
cabi_realloc
- and that's what I meant when I said that perhapswit-bindgen
is the place where the change should happen. I guess we could either add an option to wit-bindgen to not generatecabi_realloc
and let the consumer of those bindings provide their custom implementation, or already implement the allocator and optimizations there. I'll think a bit about it and propose a solution soon.Let me know if you're aligned with above and if so, I think we can close the issue.
lukewagner commentedon Mar 4, 2024
Ah, ok, gotcha, good to hear. Overall, it seems like a good optimization to have, so I think we're aligned that it'd be a good idea to do somehow. I'm not sure where exactly in the toolchain is the right place to make those changes, but what you're saying sounds plausible, so I'll leave you to it to discuss with the folks working on the individual parts.
xwang98 commentedon Mar 8, 2024
Hi @lukewagner , could you please explain a bit more about how to use using caller-supplied buffer for avoiding allocation and copy during calling read() API. I made a diagram for the workingflow based on my understanding. It may be inaccurate. Could you please share how to apply this method as you mentioned?
xwang98 commentedon Mar 8, 2024
Or here is another way to implement the read() API for libc with p2 in my guess. The right side flow is about how the data is returned from callee to the user caller. (Is the new wasi-libc source code that calls the p2 import functions available? I haven't found it. )
xwang98 commentedon Mar 8, 2024
Talk it a bit with my colleague Liang He. We figure it might be something like below:
This workflow only works for the native as the import provider. It implies the wasm runtime will do private handling for the native implementer of the WIT interface, like skipping all the memory reallocation and copying during the function call. And the native implementation of import and libc read() will share some special information.
@lukewagner we would appreciate your comments for clarification.
loganek commentedon Mar 8, 2024
Hi @xwang98 I think there important thing to know is that there's a
cabi_realloc
method implemented in WASM code that must be imported by the runtime. The runtime calls this method whenever it needs to allocate a memory, and it's up to WASM code how the memory should be allocated.The proposal here is that in the libc's
read()
function, we'll save a pointer to a caller-provided buffer;cabi_realloc
method will then return this buffer when runtime needs to allocate the memory for the read buffer. So in a very high level, it'd be something like:The implementation above just presents a high-level idea and obviously misses a lot of edge cases, but I hope that makes things clear. When it comes to preview2->preview1 adapter, the
preview2_read()
implementation in the adapter (so also in WASM code), and instead of callingfread()
, we'll be calling WASIfd_read
.I've implemented the idea here: https://github.com/loganek/wasm-wsp2-efficient-memory/ (please note though that I don't use
cabi_realloc
as it is autogenerated for now, instead, I calluser_allocator_alloc
but that's pretty much the same). I'm also now working on a long-term proposal to support preview1 in the toolchaing using the adapter and that efficient memory passing will be part of the proposal. I hope to finish it in a few weeks and present it to the community for review and approval.xwang98 commentedon Mar 9, 2024
@loganek it looks like our lastest diagram pretty much follows what you explained. The following question is what if there are two
list
arguments in the WIT interface?It looks like a potential need for WIT to support a dual-access memory concept. It is not preferred to introduce too many tricks in the libc implementation as it is so foundational.
loganek commentedon Mar 9, 2024
@xwang98 I was mainly interested in the preview2 to preview1 adapter scenario and I don't think there's a need for more than one pointer (but didn't look at the all API s yet, so I might be wrong). One idea that comes to my mind though is to have a list/stack of buffer pointers in the allocator instead of just one, and consecutive calls to realloc would get/pop pointers from the stack. That'd however require a contract between the wash code and runtime code that defines the order of allocations.
3 remaining items