Reduce rescanning on startup #1179


Closed
cdecker opened this issue Mar 6, 2018 · 13 comments

Comments

@cdecker
Member

cdecker commented Mar 6, 2018

Now that we have blockchain tracking with #1117 it would be nice to reduce the need to rescan from a fixed blockheight on every startup. The main issue I see is that onchaind may be running, having been triggered by seeing a spend on-chain, and would not get triggered again after the restart.

A simple proposal would be to have a relative rescan window that is larger than the maximum onchaind lifetime, e.g., 288 blocks. However, that window may need to be rather large, in part because of HTLC timeouts.

The more involved solution would be to remember the state of onchaind across restarts, though I'm not clear on how much we'd need to remember there, or what format we could use. Again, a simple solution would be to just store the messages we send to onchaind in an append-only log in the DB, pruning it once onchaind is happy and closes on its own.

What do you guys think makes the most sense in this situation? I have reports from users that take hours to catch up with the blockchain, and I think it may also be causing a few of the awaiting funding_locked issues that get reported (the funding depth callback not being triggered on the remote end).
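To illustrate the append-only log idea, here is a minimal in-memory sketch of the intended lifecycle (the names `log_append`/`log_replay`/`log_prune` and the string messages are hypothetical, not actual c-lightning code; in lightningd the log would be a DB table):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical append-only log of messages sent to onchaind,
 * keyed by channel id.  Modeled in memory here to show the
 * append / replay-on-restart / prune-on-exit lifecycle. */
struct log_entry {
	unsigned channel_id;
	char msg[64];
	struct log_entry *next;
};

static struct log_entry *log_head;

/* Append: called every time lightningd hands a message to onchaind. */
static void log_append(unsigned channel_id, const char *msg)
{
	struct log_entry *e = malloc(sizeof(*e));
	struct log_entry **tail = &log_head;
	e->channel_id = channel_id;
	snprintf(e->msg, sizeof(e->msg), "%s", msg);
	e->next = NULL;
	while (*tail)
		tail = &(*tail)->next;
	*tail = e;
}

/* Replay: on restart, re-feed the stored messages to a fresh
 * onchaind instead of rescanning the chain to regenerate them.
 * Returns the number of messages replayed. */
static unsigned log_replay(unsigned channel_id,
			   void (*feed)(const char *msg))
{
	unsigned n = 0;
	for (struct log_entry *e = log_head; e; e = e->next)
		if (e->channel_id == channel_id) {
			feed(e->msg);
			n++;
		}
	return n;
}

/* Prune: once onchaind is happy and exits on its own, its
 * portion of the log is no longer needed. */
static void log_prune(unsigned channel_id)
{
	struct log_entry **p = &log_head;
	while (*p) {
		if ((*p)->channel_id == channel_id) {
			struct log_entry *dead = *p;
			*p = dead->next;
			free(dead);
		} else
			p = &(*p)->next;
	}
}
```

The point of the model is that replay makes restart independent of chain depth: the cost is proportional to the number of logged messages, not the number of blocks since the channel opened.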

@cdecker
Member Author

cdecker commented Mar 6, 2018

Ping @rustyrussell and @ZmnSCPxj

@ZmnSCPxj
Contributor

ZmnSCPxj commented Mar 6, 2018

The "best" solution I can think of is indeed to save onchaind state on disk. I think it should be possible to design a DB table for onchaind state and have well-defined changes to that table. However, we would almost need to rewrite onchaind: instead of writing to in-memory structures, it would perform DB updates.

The alternative, easier solution is indeed to log on disk the messages that were sent in the interim. This feels like a hackish solution, though...

The issue is that once we save it on disk, we are practically committing ourselves to supporting that format, or at least to being able to upgrade from it.

We could start with the "message log" solution, though, and add some kind of versioning (similar to db_migrations). Then, if we later switch to a proper on-disk table, we can translate the message log by starting from the initial state, replaying the messages, and deleting the message log, trusting that onchaind updates the on-disk table correctly and thus ends up in the appropriate state given the message log.
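The migration step described above amounts to a fold over the log: start from the initial state, apply each message, and the result is what the new state table should contain. A toy sketch (the `enum state` values and message strings are simplified stand-ins; real onchaind has many more states and wire messages):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, heavily simplified onchaind states. */
enum state { WATCHING, RESOLVING, DONE };

/* Apply one logged message to the state, as onchaind would. */
static enum state apply(enum state s, const char *msg)
{
	if (!strcmp(msg, "funding_spent"))
		return RESOLVING;
	if (!strcmp(msg, "all_irrevocably_resolved"))
		return DONE;
	return s;
}

/* Migration from the "message log" format: start from the
 * initial state and replay the log; the result is the row to
 * write into the new state table.  Afterwards the message log
 * can be deleted. */
static enum state migrate(const char *const *log, int n)
{
	enum state s = WATCHING;
	for (int i = 0; i < n; i++)
		s = apply(s, log[i]);
	return s;
}
```

Because the translation is mechanical, the message-log format doesn't lock the project in: a later db_migrations step can run `migrate` once per channel and drop the log table.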

@cdecker
Member Author

cdecker commented Mar 7, 2018

So I guess one final solution would be to adjust the first_blocknum logic to just start before the first-ever funding_tx spend, which would maintain the current behavior with minimal code changes. It's quick and easy, but it may result in us rescanning up to 2016 blocks with some settings. At 6 blocks processed per second on my machine, that's still 5.6 minutes.
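A sketch of that arithmetic and of the adjusted starting point (the function names `first_blocknum`/`rescan_seconds` are illustrative, not the actual lightningd code): the rescan would begin one block before the earliest funding transaction we still track, and the worst case of 2016 blocks at ~6 blocks/s works out to 336 s, i.e. the 5.6 minutes mentioned above.

```c
#include <assert.h>

/* Hypothetical first_blocknum: instead of a fixed rescan depth,
 * start just before the earliest funding_tx height among the
 * channels we still care about. */
static unsigned first_blocknum(const unsigned *funding_heights, int n,
			       unsigned tip)
{
	unsigned min = tip;
	for (int i = 0; i < n; i++)
		if (funding_heights[i] < min)
			min = funding_heights[i];
	return min ? min - 1 : 0;
}

/* Worst case with some settings is one retarget period:
 * 2016 blocks at ~6 blocks processed per second. */
static double rescan_seconds(unsigned blocks, double blocks_per_sec)
{
	return blocks / blocks_per_sec;
}
```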

@robtex

robtex commented Mar 26, 2018

I would love for this issue to be prioritised. It takes one of my nodes over a week to catch up with all blocks from the oldest channel; throwing more CPU and RAM at the node made little difference.
My fastest hardware running lightning has 24 cores and 128 GB of RAM and still takes "only" 3-4 hours.

@robtex

robtex commented Apr 1, 2018

Still can't get my node in sync because of constant crashes (#1308).
Without the crashes it would now take two weeks to catch up.
Please advise.

@Sjors
Contributor

Sjors commented Apr 3, 2018

@robtex are you sure c-lightning is the bottleneck and not bitcoind?

@ZmnSCPxj
Contributor

ZmnSCPxj commented Apr 6, 2018

Promoting to 0.6, it is affecting people paying SLEEPYARK and slowing down Blockstream world domination.

@cdecker
Member Author

cdecker commented Apr 6, 2018

Limited the rescan on SLEEPYARK, but that's just a stopgap solution.

@robtex

robtex commented Apr 16, 2018

@cdecker is that something I can try? Otherwise I think my node will have caught up in less than a week from now; it has stopped crashing. Touch wood.

2018-04-07T12:39:23.543Z lightningd(27562): Adding block 460000: 000000000000000000ef751bbce8e744ad303c47ece06c8d863e4d417efc258c
2018-04-10T04:42:30.457Z lightningd(27562): Adding block 470000: 0000000000000000006c539c722e280a0769abd510af0073430159d71e6d7589
2018-04-12T09:11:13.414Z lightningd(27562): Adding block 480000: 000000000000000001024c5d7a766b173fc9dbb1be1a4dc7e039e631fd96a8b1
2018-04-14T17:04:43.322Z lightningd(27562): Adding block 490000: 000000000000000000de069137b17b8d5a3dfbd5b145b2dcfb203f15d0c4de90

@robtex

robtex commented Apr 16, 2018

@Sjors
Hard to tell exactly what makes it slow, but bitcoind is using far less CPU than c-lightning.
It is a VPS; I tried upgrading it to a 20 vCPU / 96 GB RAM version, but it didn't help much at all.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27562 robban    20   0  187208 169432   3168 R  92.7  8.3  14001:27 lightningd
27570 robban    20   0   41632  33644   2136 R  60.8  1.6   7447:17 lightning_gossi
 1743 robban    20   0 1877232 625192  35428 S  15.6 30.5   4395:37 bitcoind

@cdecker
Member Author

cdecker commented Apr 16, 2018

@robtex I'm working on the patch now, but it's tricky: we were replaying some of the state from on-chain to drive onchaind and closingd, so we now need facilities to restore that state from the DB instead.

@cdecker
Member Author

cdecker commented Apr 16, 2018

If you update to the latest commit, the rescan will not go below the blockheight of the first mainnet channels (504500), so that'll at least limit the rescan time considerably. I'll ping you as soon as I have the no-rescan PR ready.

@robtex

robtex commented Apr 16, 2018

Great, thanks! I appreciate your work and am looking forward to the PR!
