Bug description
We are running z2jh (https://z2jh.jupyter.org/en/stable/) and found there is a socket leak in the proxy pod.
The number of sockets is constantly increasing (over 60k), and after a week the kernel generates an error: kernel: TCP: out of memory -- consider tuning tcp_mem.
I have checked the number of sockets using lsof.
/srv/configurable-http-proxy $ lsof
1 /usr/local/bin/node socket:[48829679]
1 /usr/local/bin/node socket:[48829681]
1 /usr/local/bin/node socket:[48825415]
1 /usr/local/bin/node socket:[48825417]
1 /usr/local/bin/node socket:[48829792]
1 /usr/local/bin/node socket:[48829790]
1 /usr/local/bin/node socket:[48829783]
1 /usr/local/bin/node socket:[48829785]
/srv/configurable-http-proxy $ lsof | wc -l
64708
/srv/configurable-http-proxy $ lsof | wc -l
64719
Your personal set up
- Version(s): JupyterHub Helm chart v1.1.2 (https://jupyterhub.github.io/helm-chart/)
This chart uses the proxy Docker image jupyterhub/configurable-http-proxy:4.5.0.
The config.yaml settings related to the proxy:
proxy:
  secretToken: xxxx
  service:
    loadBalancerIP: x.x.x.x
  https:
    enabled: true
    hosts:
      - "example.com"
    letsencrypt:
      contactEmail: "help@example.com"
  chp: # proxy pod, running jupyterhub/configurable-http-proxy
    livenessProbe:
      enabled: true
      initialDelaySeconds: 60
      periodSeconds: 20
      failureThreshold: 10 # retry 10 times before declaring failure
      timeoutSeconds: 3
      successThreshold: 1
    resources:
      requests:
        cpu: 1000m # 0m - 1000m
        memory: 5000Mi # Recommended is 100Mi - 600Mi -- we seem to run at 4.3GB a lot
  traefik: # autohttps pod (optional, running traefik/traefik)
    resources:
      requests:
        cpu: 1000m # 0m - 1000m
        memory: 512Mi # 100Mi - 1.1Gi
  secretSync: # autohttps pod (optional, sidecar container running small Python script)
    resources:
      requests:
        cpu: 10m
        memory: 64Mi
welcome commented on Oct 12, 2022
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋
Welcome to the Jupyter community! 🎉
manics commented on Oct 12, 2022
This sounds similar to #388
Would you mind adding your comment on that issue and closing this one to avoid duplicates? Thanks!
yuvipanda commented on Oct 31, 2022
@manics that looks like a memory leak, while this is a socket leak. Should be different, no?
yuvipanda commented on Oct 31, 2022
We just saw a node with a lot of proxy pods spew a lot of
out of memory -- consider tuning tcp_mem
while a lot of the proxies stopped working. Could be related?
consideRatio commented on Nov 7, 2022
minrk commented on Feb 6, 2023
ws release is unlikely to be relevant, since it's only used in tests.
consideRatio commented on Nov 27, 2023
I think the node version could be important debugging info, because we are relying on node's standard library http(s) server here: configurable-http-proxy/lib/configproxy.js, lines 233 to 245 in cb03f77
CHP version: node version
4.6.1: v20.10.0
4.6.0: v18.18.2
4.5.6: v18.17.1
4.5.5: v18.15.0
4.5.4: v18.12.1
4.5.3: v16.17.0
4.5.2: v16.17.0
4.5.1: v16.13.2
4.5.0: v14.17.3
consideRatio commented on Feb 2, 2024
I think this remains a problem in 4.6.1 using node 20 on GKE with Linux kernel 5.15, based on info from @shaneknapp.
consideRatio commented on Feb 2, 2024
CHP has multiple node HTTP servers working in parallel: one for its own REST API, one for proxying, and one for metrics.
It would be good to determine whether the growing TCP memory / socket count is associated with a specific one of these.
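One way to narrow that down could be to poll each server for its open connection count. A standalone sketch, not CHP's actual code (the three servers below are only stand-ins for CHP's API, proxy, and metrics servers):

const http = require("http");

// Stand-ins for CHP's REST API, proxying, and metrics servers.
const servers = {
  api: http.createServer((req, res) => res.end("api")).listen(8001),
  proxy: http.createServer((req, res) => res.end("proxy")).listen(8000),
  metrics: http.createServer((req, res) => res.end("metrics")).listen(8002),
};

// Log each server's open connection count once a minute; the server whose
// count grows without bound is the one accumulating leaked sockets.
setInterval(() => {
  for (const [name, server] of Object.entries(servers)) {
    server.getConnections((err, count) => {
      if (!err) console.log(`${name}: ${count} open connections`);
    });
  }
}, 60 * 1000);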
consideRatio commented on Feb 2, 2024
Looked at one CHP process and saw, for example...
This wasn't expected to have anything close to 500 open connections or similar, so I think it's very safe to say that this reproduces. This is from the latest CHP running with node 20 on Linux kernel 5.15 nodes.
consideRatio commented on Feb 2, 2024
I'm not sure when I expect a socket to be closed. When it times out based on a "timeout" option? I think the timeout option may be infinite.
Is the issue that there is simply nothing that makes us destroy sockets once created, because we default to an infinite timeout?
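If I read the Node docs right, a plain http.Server defaults to never timing idle sockets out, which would match that suspicion. A minimal sketch of the alternative (illustrative values only, not CHP's code):

const http = require("http");

const server = http.createServer((req, res) => res.end("ok"));

// server.timeout defaults to 0 in recent Node versions, i.e. idle sockets
// are never timed out by the server itself. With a finite value, Node
// destroys sockets that have been idle that long (unless a 'timeout'
// listener overrides the default behaviour).
server.timeout = 24 * 60 * 60 * 1000; // e.g. the 24 hours discussed below

server.listen(8000);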
consideRatio commented on Feb 2, 2024
@minrk and others, is it safe to change the default for a timeout value here to something quite extreme, like 24 hours? I don't want us to disrupt users who are semi-active and run into issues at the 24th hour - but they wouldn't, as long as they stay semi-active, right?
We have two timeout args matching the node-http-proxy options at https://github.com/http-party/node-http-proxy?tab=readme-ov-file#options, timeout and proxyTimeout - should we try setting both to 24 hours, or just one?
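For reference, a sketch of what setting both of those documented node-http-proxy options would look like (the target and values are illustrative, not CHP's actual configuration):

const httpProxy = require("http-proxy");

const proxy = httpProxy.createProxyServer({
  target: "http://127.0.0.1:8888", // hypothetical backend
  timeout: 24 * 60 * 60 * 1000, // socket timeout for incoming requests
  proxyTimeout: 24 * 60 * 60 * 1000, // timeout for outgoing proxy requests
});

// Run the standalone proxy server on a port.
proxy.listen(8000);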
consideRatio commented on Feb 2, 2024
There is also a related issue reported in node-http-proxy: http-party/node-http-proxy#1510.
node-http-proxy was forked and that issue was fixed with a one-line commit at Jimbly/http-proxy-node16@56283e3, using a .destroy call instead of a .end call. I figure the difference may be that .end allows re-use of a process file descriptor, perhaps?
Looking closer at that fork, there is also another memory leak fixed in a separate commit, according to the commit message: Jimbly/http-proxy-node16@ba0c414. This is detailed in a PR as well: http-party/node-http-proxy#1559.
Those two memory-fix commits are the only actual fixes in the fork; the rest is just docs etc.
Maybe we should do a build of CHP based on the forked node-http-proxy project and push a tag that users can opt into? Like a jupyterhub/configurable-http-proxy:4.6.1-fork image?
consideRatio commented on Mar 25, 2025
Wieeee!! Thank you for following up @jyounggo!!!
shaneknapp commented on Mar 25, 2025
fix for hub-proxy memory leak
williamstein commented on May 8, 2025
Hi JupyterHub devs, I did a modern rewrite of http-proxy this week (it's not that much code!), because even with http-proxy-node16 I was seeing some major socket leaks on cocalc.com, where we use proxying very heavily, and it also seems really sad that http-proxy isn't maintained. My rewrite is at https://www.npmjs.com/package/http-proxy-3 (MIT licensed) in case anybody wants to test it out. It's a drop-in replacement for http-proxy, with no API changes. Another motivation is that "npm audit" showed a lot of issues, so I updated all dependencies to the latest versions; some code in the implementation also used Node.js APIs that were deprecated due to security concerns.
-- William
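Since it is described as a drop-in replacement, usage should presumably look the same as with http-proxy. A sketch under that assumption (the target and port are made up):

const httpProxy = require("http-proxy-3"); // assumed to expose the http-proxy API

const proxy = httpProxy.createProxyServer({
  target: "http://127.0.0.1:8888", // hypothetical backend
  ws: true, // also proxy websocket upgrades
});

// Basic error handling so a failed upstream doesn't hang the client.
proxy.on("error", (err, req, res) => {
  if (res && res.writeHead && !res.headersSent) {
    res.writeHead(502);
  }
  if (res && res.end) {
    res.end("proxy error");
  }
});

proxy.listen(8000);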
benz0li commented on May 8, 2025
@williamstein No API changes = no HTTP/2 support?
Cross references:
How does it compare to https://github.com/unjs/httpxy?
williamstein commented on May 8, 2025
Correct for now. And just to be clear: this is definitely not ready for production use yet. I'm hoping to stir up a little interest in there being a modern, maintained successor to http-proxy that merges PRs, etc.
minrk commented on May 8, 2025
Yeah, I'd absolutely be interested. Maybe collaborating with httpxy?
FWIW, my own effort on this was to try to vendor https://github.com/nxtedition/node-http2-proxy into this repo since it's also unmaintained, but if there's a maintained http-proxy fork, that would be easier to switch to.
williamstein commented on May 8, 2025
@minrk thanks! I looked into httpxy, and it has the same motivation as the http-proxy-3 I made. It seems like a big difference is that I'm writing a lot of jest unit tests for http-proxy-3, whereas httpxy has very few. This could make a difference longer term regarding how easy it is to contribute to http-proxy-3. Today I hope to write tests that specifically address the socket leak issue discussed here. Anyway, if or when http-proxy-3 is "ready for production use" and has been tested for a while, I'll announce that again here, so you can consider using it as a drop-in replacement.
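A sketch of what such a leak regression test could look like (assuming a jest setup and that http-proxy-3 keeps the http-proxy API; none of this is taken from the actual test suite):

const http = require("http");
const httpProxy = require("http-proxy-3"); // assumed drop-in http-proxy API

test("proxy does not accumulate sockets after requests finish", async () => {
  // A trivial backend and a frontend server that proxies everything to it.
  const target = http.createServer((req, res) => res.end("ok"));
  await new Promise((resolve) => target.listen(0, resolve));
  const proxy = httpProxy.createProxyServer({
    target: `http://127.0.0.1:${target.address().port}`,
  });
  const front = http.createServer((req, res) => proxy.web(req, res));
  await new Promise((resolve) => front.listen(0, resolve));

  // Push a batch of short-lived requests through the proxy.
  for (let i = 0; i < 20; i++) {
    await new Promise((resolve, reject) => {
      http
        .get(
          { port: front.address().port, headers: { Connection: "close" } },
          (res) => res.on("end", resolve).resume()
        )
        .on("error", reject);
    });
  }

  // Give closing sockets a moment, then assert nothing is left open on the
  // frontend server; a leak would show up as a non-zero count here.
  await new Promise((resolve) => setTimeout(resolve, 200));
  const open = await new Promise((resolve) =>
    front.getConnections((err, count) => resolve(count))
  );
  expect(open).toBe(0);

  proxy.close();
  front.close();
  target.close();
});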
minrk commented on May 9, 2025
That would be awesome, I'll be very happy to switch when you're ready
williamstein commented on May 11, 2025
I spent all week and modernized ALL the unit tests from the original http-proxy, and also made all of their examples into unit tests that are verified to work (they were not actually tested in http-proxy). I fixed various vulnerabilities and issues (some revealed by unit testing). I also added some tests involving websockets, along with instrumentation that counts how many sockets are open, to ensure leaks like the one reported here don't recur. It's running live now on http://cocalc.com. So now I think this is ready for general use:
https://www.npmjs.com/package/http-proxy-3
When @shaneknapp above says "fwiw, ucb datahubs are still seeing this, albeit nowhere nearly as much as before.... i guess YMMV. :)", that was also my experience -- with cocalc we still had issues even after switching to http-proxy-node16. I added additional code to further clean up sockets, added instrumentation to verify it is working, and unit tests, as mentioned above. Now things seem to work fine without leaks, for cocalc at least. I hope JupyterHub sees similarly improved reliability when using http-proxy-3.
shaneknapp commented on May 11, 2025
@williamstein you. are. the. best.
i'm also glad that it wasn't just me going completely bonkers, and doubly glad you had the cycles to get this sorted and get to the bottom of it!
fwiw we had a workaround of increasing the ephemeral port range on the hub pods.
research here: #557
PR to increase eph ports here: https://github.com/berkeley-dsep-infra/datahub/pull/6441/files
minrk commented on May 11, 2025
Nice! #572 switches to http-proxy-3. Only one test fails.