
Socket leak  #434

Description

@jyounggo

Bug description

We are running z2jh (https://z2jh.jupyter.org/en/stable/) and found there is a socket leak in the proxy pod.
The number of sockets is constantly increasing (over 60k), and after about a week the kernel logs an error: kernel: TCP: out of memory -- consider tuning tcp_mem.

I have checked the number of sockets using lsof.

/srv/configurable-http-proxy $ lsof
1	/usr/local/bin/node	socket:[48829679]
1	/usr/local/bin/node	socket:[48829681]
1	/usr/local/bin/node	socket:[48825415]
1	/usr/local/bin/node	socket:[48825417]
1	/usr/local/bin/node	socket:[48829792]
1	/usr/local/bin/node	socket:[48829790]
1	/usr/local/bin/node	socket:[48829783]
1	/usr/local/bin/node	socket:[48829785]
/srv/configurable-http-proxy $ lsof | wc -l
64708
/srv/configurable-http-proxy $ lsof | wc -l
64719
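The same kind of count can also be cross-checked without lsof, straight from /proc (a sketch; "self" below stands in for the chp pid):

```shell
# Sketch: count a process's open file descriptors (sockets included)
# directly from /proc, as an alternative to lsof.
# "self" is a placeholder; substitute the chp pid inside the container.
ls /proc/self/fd | wc -l
```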

Your personal set up

The config.yaml related to proxy

proxy:
  secretToken: xxxx

  service:
    loadBalancerIP: x.x.x.x
  https:
    enabled: true
    hosts:
     - "example.com"
    letsencrypt:
      contactEmail: "help@example.com"
  chp: # proxy pod, running jupyterhub/configurable-http-proxy
    livenessProbe:
      enabled: true
      initialDelaySeconds: 60
      periodSeconds: 20
      failureThreshold: 10 # retry 10 times before declaring failure
      timeoutSeconds: 3
      successThreshold: 1
    resources:
      requests:
        cpu: 1000m # 0m - 1000m
        memory: 5000Mi # Recommended is 100Mi - 600Mi -- we seem to run at 4.3GB a lot
  traefik: # autohttps pod (optional, running traefik/traefik)
    resources:
      requests:
        cpu: 1000m # 0m - 1000m
        memory: 512Mi # 100Mi - 1.1Gi
  secretSync: # autohttps pod (optional, sidecar container running small Python script)
    resources:
      requests:
        cpu: 10m
        memory: 64Mi

Activity

welcome commented on Oct 12, 2022

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.

You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

manics (Member) commented on Oct 12, 2022

This sounds similar to #388

Would you mind adding your comment on that issue and closing this one to avoid duplicates? Thanks!

yuvipanda (Contributor) commented on Oct 31, 2022

@manics that looks like a memory leak, while this is a socket leak. They should be different, no?

yuvipanda (Contributor) commented on Oct 31, 2022

We just saw a node with a lot of proxy pods spew many "out of memory -- consider tuning tcp_mem" errors while many of the proxies stopped working. Could this be related?

consideRatio commented on Nov 7, 2022

minrk (Member) commented on Feb 6, 2023

ws release is unlikely to be relevant, since it's only used in tests.

consideRatio (Member) commented on Nov 27, 2023

I think the node version could be important debugging info, because we rely on the Node standard library's http(s) server here:

// proxy requests separately
var proxyCallback = logErrors(this.handleProxyWeb);
if (this.options.ssl) {
  this.proxyServer = https.createServer({ ...this.options.ssl, ...httpOptions }, proxyCallback);
} else {
  this.proxyServer = http.createServer(httpOptions, proxyCallback);
}
// proxy websockets
this.proxyServer.on("upgrade", bound(this, this.handleProxyWs));
this.proxy.on("proxyRes", function (proxyRes, req, res) {
  that.metrics.requestsProxyCount.labels(proxyRes.statusCode).inc();
});

CHP     node      Date
4.6.1   v20.10.0  Nov 27, 2023
4.6.0   v18.18.2  Sep 19, 2023
4.5.6   v18.17.1  Aug 10, 2023
4.5.5   v18.15.0  Apr 3, 2023
4.5.4   v18.12.1  Dec 5, 2022
4.5.3   v16.17.0  Sep 9, 2022
4.5.2   v16.17.0  Aug 19, 2022
4.5.1   v16.13.2  Feb 3, 2022
4.5.0   v14.17.3  Jul 19, 2021

consideRatio (Member) commented on Feb 2, 2024

I think this remains a problem in 4.6.1 using node 20 on GKE with Linux kernel 5.15, based on info from @shaneknapp.

consideRatio (Member) commented on Feb 2, 2024

CHP has multiple Node HTTP servers working in parallel: one for its own REST API, one for proxying, and one for metrics.

It would be good to determine whether the growing TCP memory / socket counts are associated with a specific one of these servers.

consideRatio (Member) commented on Feb 2, 2024

Looked at one CHP process and saw, for example:

# cd /proc/<pid of chp process>
cat net/sockstat
sockets: used 556
TCP: inuse 174 orphan 2 tw 36 alloc 1572 mem 40536
UDP: inuse 0 mem 6
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

This process wasn't expected to have anything close to 500 open connections, so I think it's very safe to say that the issue reproduces here. This is from the latest chp running with node 20 on Linux kernel 5.15 nodes.

consideRatio (Member) commented on Feb 2, 2024

I'm not sure when I should expect a socket to be closed. When it times out, based on a "timeout" option? I think the default timeout may be infinite.

Is the issue simply that nothing ever destroys sockets once they're created, because we default to an infinite timeout?

consideRatio (Member) commented on Feb 2, 2024

@minrk and others, is it safe to change the default for a timeout value here to something quite extreme, like 24 hours? I don't want to disrupt users who are semi-active and then run into issues at the 24th hour - but they wouldn't, as long as they stay semi-active, right?

We have two timeout args matching the node-http-proxy options at https://github.com/http-party/node-http-proxy?tab=readme-ov-file#options, timeout and proxyTimeout - should we try setting both to 24 hours, or just one?
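For reference, a sketch (hypothetical proxyOptions name) of those two options as they would appear in a node-http-proxy options object; per that README, timeout applies to the incoming (client) socket and proxyTimeout to the outgoing socket toward the target, both in milliseconds:

```javascript
// Sketch: the two node-http-proxy timeout options discussed above,
// as they would be passed to httpProxy.createProxyServer(options).
// Only the plain options object is shown here.
const proxyOptions = {
  timeout: 24 * 60 * 60 * 1000,      // incoming (client) socket, ms
  proxyTimeout: 24 * 60 * 60 * 1000, // outgoing (target) socket, ms
};
```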

consideRatio (Member) commented on Feb 2, 2024

There is also a related issue reported in node-http-proxy - http-party/node-http-proxy#1510.

node-http-proxy was forked, and that issue was fixed there with a one-line commit, Jimbly/http-proxy-node16@56283e3, using a .destroy call instead of a .end call. I figure the difference may be that .end keeps the process file descriptor around for possible re-use, perhaps?

Looking closer at that fork, there is also another memory leak fixed in a separate commit, according to the commit message: Jimbly/http-proxy-node16@ba0c414. This is detailed in a PR as well: http-party/node-http-proxy#1559.

Those two memory-fix commits are the only actual fixes in the fork; the rest is just docs etc.

Maybe we should do a build of chp based on the forked node-http-proxy project and push a tag that users can opt into? Like jupyterhub/configurable-http-proxy:4.6.1-fork image?

(35 remaining items collapsed)

consideRatio (Member) commented on Mar 25, 2025

Wieeee!! Thank you for following up @jyounggo!!!

shaneknapp commented on Mar 25, 2025

shaneknapp added a commit that references this issue on May 5, 2025
williamstein commented on May 8, 2025

Hi JupyterHub Devs, I did a modern rewrite of http-proxy this week (it's not that much code!), because even with http-proxy-node16 I was seeing some major socket leaks on cocalc.com, where we use proxying very heavily, and it also seems really sad that http-proxy isn't maintained. My rewrite is at https://www.npmjs.com/package/http-proxy-3 (MIT licensed) in case anybody wants to test it out. It's a drop-in replacement for http-proxy, with no API changes. Another motivation is that "npm audit" showed a lot of issues, so I updated all dependencies to the latest versions; also, some code in the implementation used Node.js APIs that were deprecated due to security concerns.

-- William

benz0li commented on May 8, 2025

> My rewrite is at https://www.npmjs.com/package/http-proxy-3 (MIT licensed) in case anybody wants to test it out. It's a drop in replacement for http-proxy, with no api changes.

@williamstein No API changes = no HTTP/2 support?

Cross references:


How does it compare to https://github.com/unjs/httpxy?

williamstein commented on May 8, 2025

> @williamstein No api changes = No HTTP/2 support?

Correct, for now. And just to be clear: this is definitely not ready for production use yet. I'm hoping to start stirring up a little interest in a modern, maintained successor to http-proxy that merges PRs, etc.

minrk (Member) commented on May 8, 2025

Yeah, I'd absolutely be interested. Maybe collaborating with httpxy?

FWIW, my own effort on this was to try to vendor https://github.com/nxtedition/node-http2-proxy into this repo since it's also unmaintained, but if there's a maintained http-proxy fork, that would be easier to switch to.

williamstein commented on May 8, 2025

@minrk thanks! I looked into httpxy, and it has the same motivation as http-proxy-3. A big difference seems to be that I'm writing a lot of jest unit tests for http-proxy-3, whereas httpxy has very few; that could matter longer-term for how easy it is to contribute to http-proxy-3. Today I hope to write tests that specifically address the socket leak issue discussed here. Anyway, if or when http-proxy-3 is "ready for production use" and has been tested for a while, I'll announce it again here, so you can consider using it as a drop-in replacement.

minrk (Member) commented on May 9, 2025

That would be awesome, I'll be very happy to switch when you're ready

williamstein commented on May 11, 2025

> That would be awesome, I'll be very happy to switch when you're ready

I spent all week and modernized ALL the unit tests from the original http-proxy, and also turned all of its examples into unit tests that are verified to work (they were not actually tested in http-proxy). I fixed various vulnerabilities and issues (some revealed by unit testing). I also added tests involving websockets, along with instrumentation counting how many sockets are open, to ensure leaks like the one reported here don't recur. It's running live now on https://cocalc.com. So now I think this is ready for general use:

https://www.npmjs.com/package/http-proxy-3

When @shaneknapp above says "fwiw, ucb datahubs are still seeing this, albeit nowhere nearly as much as before.... i guess YMMV. :)", that was also my experience - with cocalc we still had issues after switching to http-proxy-node16. I added additional code to further clean up sockets, added instrumentation to verify it is working, and unit tests, as mentioned above. Now things seem to work without leaks, for cocalc at least. I hope JupyterHub sees a similar reliability improvement when using http-proxy-3.

shaneknapp commented on May 11, 2025

@williamstein you. are. the. best.

i'm also glad that it wasn't just me going completely bonkers, and doubly glad you had the cycles to get this sorted and get to the bottom of it!

fwiw we had a workaround: increasing the ephemeral port range on the hub pods.
research here: #557
PR to increase eph ports here: https://github.com/berkeley-dsep-infra/datahub/pull/6441/files
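The knob behind that workaround is the kernel's ephemeral port range; a sketch of inspecting and widening it (the linked PR's exact values and mechanism may differ):

```shell
# Sketch: show the current ephemeral (local) port range.
cat /proc/sys/net/ipv4/ip_local_port_range
# Widening it is privileged, e.g.:
#   sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# (in Kubernetes, typically done via pod securityContext sysctls;
#  the PR's exact values may differ)
```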

minrk (Member) commented on May 11, 2025

Nice! #572 switches to http-proxy-3. Only one test fails.
