Description
- Hardware: ESP8266EX
- Core Version: SDK:2.2.2-dev(38a443e)/Core:2.6.3=20603000/lwIP:STABLE-2_1_2_RELEASE/glue:1.2-16-ge23a07e/BearSSL:89454af
- Development Env: Arduino IDE
- Operating System: Windows
- Module: LOLIN Wemos D1 mini Pro & Wemos D1 r2 mini
- Flash Size: 16MB
- lwIP Variant: v2 Lower Memory and Higher Bandwidth
- Flash Frequency: 40MHz
- CPU Frequency: 80MHz
- Upload Using: SERIAL
- Upload Speed: 460800
With the lwIP v2 variants (Lower Memory and Higher Bandwidth) I see a memory leak on each loop (each 30-second publish cycle): 32 bytes or more. I tried many options, with and without debug, and so on. After about 26 minutes of running, the ESP runs out of memory (:oom) and reboots with a dump. Sometimes one, two, or three loops pass without a leak, but then the leak continues.
If I switch to the lwIP variant v1.4 Higher Bandwidth, the memory leak stops and everything works fine
(SDK:2.2.2-dev(38a443e)/Core:2.6.3=20603000/lwIP:1.4.0rc2/BearSSL:89454af: that variant works fine). I can provide its debug log, but it is the same as the one below, except for the memory leak.
```cpp
#include <PubSubClient.h>
#include <ESP8266WiFi.h>

// #define ESP8266  // not needed: the ESP8266 core already defines ESP8266 in its build flags
#define DEBUG 1

const char* ssid = "zzzzzz";
const char* password = "xxxxxxx";
const char* mqtt_server = "111.222.222.222";
const char* mqtt_user = "zzzz";
const char* mqtt_pass = "zzzz";
const char* mqtt_clientId = "gost-temp";
const char* mqtt_ping_topic = "apr/home/ping";
const char* mqtt_online_message = "online";
const char* mqtt_last_will = "offline";
const char* mqtt_topic_base = "apr/home/";
const int ping_time = 30;        // seconds between publishes
bool firstRun = true;

WiFiClient espClient;
PubSubClient client(espClient);
uint32_t originalram;            // free heap recorded at the end of setup()
unsigned long lastMeasure = 0;

void setup_wifi() {
  int zz = 0;
  delay(10);
#ifdef DEBUG
  Serial.println();
  Serial.print("Connecting to ");
  Serial.println(ssid);
#endif
  WiFi.mode(WIFI_STA);
  WiFi.begin(ssid, password);
  delay(500);
  // Wait up to ~10 s (100 x 100 ms) for the connection, then restart.
  while (WiFi.status() != WL_CONNECTED) {
    delay(100);
    zz = zz + 1;
#ifdef DEBUG
    Serial.print(".");
    Serial.print(zz);
#endif
    if (zz >= 100) {
      ESP.restart();
    }
  }
#ifdef DEBUG
  Serial.println("");
  Serial.print("WiFi connected - ESP IP address: ");
  Serial.println(WiFi.localIP());
#endif
  delay(1000);
}

void reconnect() {
  delay(500);
  while (!client.connected()) {
#ifdef DEBUG
    Serial.print("Attempting MQTT connection...");
#endif
    // Random suffix so the broker sees a unique client ID on each attempt.
    String mqtt_clientIdRand = mqtt_clientId;
    mqtt_clientIdRand += String(random(0xffff), HEX);
    // Last will: retained "offline" message (QoS 0) on the ping topic.
    if (client.connect(mqtt_clientIdRand.c_str(), mqtt_user, mqtt_pass,
                       mqtt_ping_topic, 0, 1, mqtt_last_will)) {
#ifdef DEBUG
      Serial.println("connected");
#endif
    } else {
#ifdef DEBUG
      Serial.print("failed, rc=");
      Serial.print(client.state());
      Serial.println(" try again in 5 seconds");
#endif
      delay(5000);
    }
  }
}

void setup() {
  randomSeed(millis());
  Serial.begin(115200);
  setup_wifi();
  client.setServer(mqtt_server, 1883);
  if (!client.connected()) {
    reconnect();
  }
  client.loop();
  client.publish(mqtt_ping_topic, mqtt_online_message);
  client.loop();
  originalram = ESP.getFreeHeap();
}

void loop() {
  // Note: client.loop(), which also sends the MQTT keepalive pings, runs only
  // once per ping_time (30 s) here, longer than PubSubClient's default
  // MQTT_KEEPALIVE of 15 s; loop() also never calls reconnect().
  if ((millis() - lastMeasure) > (ping_time * 1000)) {
    lastMeasure = millis();
    client.loop();
    client.publish(mqtt_ping_topic, mqtt_online_message);
    client.loop();
    uint32_t ram = ESP.getFreeHeap();
    // Convert before subtracting so a heap drop prints as a negative number.
    Serial.printf("RAM: %u change %d\n", (unsigned)ram, (int)ram - (int)originalram);
  }
  delay(30);
}
```
Debug log:
07:49:35.414 -> SDK:2.2.2-dev(38a443e)/Core:2.6.3=20603000/lwIP:STABLE-2_1_2_RELEASE/glue:1.2-16-ge23a07e/BearSSL:89454af
07:49:35.414 ->
07:49:35.414 -> Connecting to XXXXX
07:49:35.414 -> bcn 0
07:49:35.414 -> del if1
07:49:35.414 -> usl
07:49:35.414 -> mode : sta(84:f3:eb:db:5a:3c)
07:49:35.414 -> add if0
07:49:35.586 -> wifi evt: 8
07:49:36.171 -> .1.2wifi evt: 2
07:49:36.378 -> .3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23scandone
07:49:39.389 -> state: 0 -> 2 (b0)
07:49:39.389 -> .24state: 2 -> 3 (0)
07:49:39.389 -> state: 3 -> 5 (10)
07:49:39.389 -> add 0
07:49:39.389 -> aid 1
07:49:39.389 -> cnt
07:49:39.492 -> .25
07:49:39.492 -> connected with XXXXXX, channel 6
07:49:39.561 -> dhcp client start...
07:49:39.561 -> wifi evt: 0
07:49:39.561 -> .26ip:192.168.88.122,mask:255.255.255.0,gw:192.168.88.1
07:49:39.596 -> wifi evt: 3
07:49:39.666 -> .27
07:49:39.666 -> WiFi connected - ESP IP address: 192.168.88.122
07:49:41.182 -> Attempting MQTT connection...[hostByName] Host: 111.111.111.111 is a IP!
07:49:41.182 -> :ref 1
07:49:41.216 -> :wr 70 0
07:49:41.216 -> :wrc 70 70 0
07:49:41.216 -> :ack 70
07:49:41.216 -> :rn 4
07:49:41.216 -> :c0 1, 4
07:49:41.216 -> connected
07:49:41.216 -> :wr 23 0
07:49:41.216 -> :wrc 23 23 0
07:49:41.285 -> :ack 23
07:49:49.372 -> pm open,type:2 0
07:50:03.705 -> :rcl
07:50:03.705 -> :abort
07:50:05.117 -> RAM: 50008 change 1296
07:50:35.151 -> RAM: 49976 change 1264
07:51:05.188 -> RAM: 49944 change 1232
07:51:35.216 -> RAM: 49944 change 1232
07:52:05.217 -> RAM: 49944 change 1232
07:52:35.249 -> RAM: 49880 change 1168
07:53:05.287 -> RAM: 49816 change 1104
07:53:35.307 -> RAM: 49784 change 1072
07:54:05.336 -> RAM: 49784 change 1072
07:54:35.374 -> RAM: 49688 change 976
07:55:05.382 -> RAM: 49560 change 848
07:55:35.429 -> RAM: 49464 change 752
07:56:05.438 -> RAM: 49464 change 752
07:56:35.483 -> RAM: 49336 change 624
07:57:05.512 -> RAM: 49080 change 368
07:57:35.547 -> RAM: 48856 change 144
07:58:05.566 -> RAM: 48648 change -64
07:58:35.574 -> RAM: 48488 change -224
07:59:05.604 -> RAM: 48008 change -704
07:59:35.638 -> RAM: 47528 change -1184
08:00:05.669 -> RAM: 47368 change -1344
08:00:35.683 -> RAM: 47176 change -1536
Jeroen88 commented on Feb 3, 2020
I suspect lwIP in a similar situation on the ESP32, see here. In that case (using a WiFiClientSecure) I suspect ssl_client->socket = lwip_socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) in ssl_client, because even if I add lwip_close(ssl_client->socket) directly after this line, memory has leaked. These are lwIP calls, so they have nothing to do with mbedTLS/SSL and should be comparable to your situation. I am not sure, because I do not know whether lwip_socket() followed by lwip_close() should release all memory; I am not familiar enough with lwIP, but that seems logical to me. I am also not sure whether the ESP32 uses the same lwIP version.
TD-er commented on Feb 3, 2020
I have something similar here.
It looks like the leak is related to making a WiFiClient connection.
I am running a test here in which my node cannot connect to a MQTT broker, so every N seconds (30 I believe) it will re-attempt a connection and after roughly 18 minutes it is out of memory.
Not sure if it is related to unsuccessful reconnects or not.
I've also tried to recreate the WiFiClient object again at each reconnect attempt, but that doesn't seem to make any difference.
mqtt = WiFiClient(); // workaround see: https://github.com/esp8266/Arduino/issues/4497#issuecomment-373023864
civilman2006 commented on Feb 4, 2020
This seems to be a different case: I create only one connection in setup() and then send just one MQTT message per loop cycle, and the memory leak is present even without any MQTT message. So this is about something inside the lwIP library, because switching to the older version 1.4 resolves it!
In your case, I think, it is about creating new connections: after a failed attempt to establish one, the connection stays in the TIME_WAIT state, which is destroyed only after a 2-minute delay.
TD-er commented on Feb 4, 2020
@civilman2006 I agree it does look like a different set of symptoms.
Last night I also tested with the latest Git head of esp8266/Arduino, and that does at least seem to solve this reboot issue due to memory exhaustion.
It looks like this commit may have fixed that. So maybe you can also test with the latest merges, just to be sure?
The issue I was talking about needed roughly 18 minutes to run out of memory, so I am not sure those idle connection attempts were destroyed after 2 minutes.
d-a-v commented on Feb 4, 2020
I just tried the OP's sketch and I do not get the same output.
Current git master, no debug:
2.6.3, debug enabled
d-a-v commented on Feb 4, 2020
@civilman2006 The lwip2 version you use is "glue:1.2-16", which is about the one shipped with 2.6.3, while I have used "glue:1.2-31". That may explain it.
Unfortunately, #6887 is not merged yet.
Are you able to test it?
If not I can try to generate an alpha release so you can try with the arduino board installer.
d-a-v commented on Feb 4, 2020
Because I don't see what could have changed regarding a possible memory leak, I tried with the same lwip2 version, and unfortunately I see no leak.
civilman2006 commented on Feb 4, 2020
Thanks for the reply!
I spent about 2 hours trying to find out how to update glue from 1.2-16 to 1.2-31+, but I can't find the right way to do it on Windows, because I don't have 'make' and so on. I updated the board from git, which works fine, but I still have the same glue version (1.2-16) and get the same memory leak.
Could you tell me how to update glue?
My current version of SDK & so on:
SDK:2.2.2-dev(38a443e)/Core:2.6.3-44-g6be56161=20603044/lwIP:STABLE-2_1_2_RELEASE/glue:1.2-16-ge23a07e/BearSSL:0645c68 (from esp8266 debug message)
And in your next message you wrote that there is no problem with the old glue, but that version is different: glue:1.2-17, while mine is 1.2-16.
d-a-v commented on Feb 4, 2020
1.2-16 and 1.2-17 make no difference.
lwip2 is updated by #6887, which is yet to be merged. You can, however, try the pull request with, for example, this gist.
But as I said, it may not be the issue. We need more testers who are able to reproduce your issue.
civilman2006 commented on Feb 5, 2020
Thanks for the reply! I can't find a way to update with #6887, because I use the Arduino IDE on Windows with no compiler or make tool. If you can point me to some instructions, I can try to test this upcoming merge, or I can try an alpha release.
Juppit commented on Feb 10, 2020
civilman2006 commented on Feb 16, 2020
Thanks for the script!
I installed and successfully built a local version, BUT I have the same memory issue, apparently because the glue version is not updated?!
SDK:2.2.2-dev(38a443e)/Core:2.6.3-56-g5efdc776=20603056/lwIP:STABLE-2_1_2_RELEASE/glue:1.2-16-ge23a07e/BearSSL:0645c68
I think I installed the boards correctly: "C:\Users\Дмитрий\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.7.0-dev-nightly+20200216"
Only this build is installed. I select Generic 8266 -> 4M (3SPIFF). I don't know why glue is still the same version.
laercionit commented on Mar 15, 2020
Hello, I'm having this problem using the 2.6.3 core, and I still haven't found anything that solves it.
Has anyone been successful in solving this?
(27 further comments not shown)
philbowles commented on Apr 17, 2020
Latest: the user has throttled his router, so the ESP now receives at most 100k/s of total traffic, and the heap is stable at 95% of its starting value. So it's back to looking like a rate problem: the ESP / core / lwIP cannot free packets fast enough, AND that triggers a memory leak.
I don't know what else I can do, but I'm happy to try sensible and polite suggestions.
TD-er commented on Apr 17, 2020
The idea I had when suggesting it was something like this.
Just assume the unprocessed packets remain in memory for N seconds until they are cleared.
So the amount of allocated heap can remain constant as long as the node has enough time to process all stored packets, i.e. as long as packets arrive no faster than it can process them.
If you exceed that threshold once, you will see it uses a bit more memory, but it should keep up later on.
This all goes well as long as the time needed to process it remains constant and the rate of messages fluctuates so that it gets below the limit the node can handle every now and then.
But the time needed to process is not constant if the memory gets more fragmented.
New allocations do take more time on fragmented memory.
So after each 'burst' of messages, it may use a bit more memory and the heap does get more fragmented and thus increases the processing time.
Given this theory, you would see the free heap decline at an increasing rate, at least as long as the average packet rate remains constant and close to the initial threshold of what the node can handle. The rate must also fluctuate for this to happen.
TD-er commented on Apr 17, 2020
A simple test could be to run the ESP at 160 MHz. If it can keep up longer with the other conditions the same, then my theory is a bit more plausible.
philbowles commented on Apr 17, 2020
@TD-er that's our plan for tomorrow: Run 1: unthrottle the router and rerun at 160MHz; Run 2: throttle the router to 100kb/s, then rerun.
d-a-v commented on Apr 17, 2020
@philbowles #6895 was intended to solve a similar UDP issue (just read #6831).
Are you using 2.6.3, or did you and your tester try the latest master?
philbowles commented on Apr 17, 2020
@d-a-v Sorry, still using 2.6.3 while trying to nail the beast. If it's been fixed we are wasting time, so I will try to get him to test the latest master tomorrow, thanks! (And greetings from L'Orne 61330) :)
devyte commented on Apr 18, 2020
The stack dump above says core 2.5.2, not 2.6.3.
devyte commented on Apr 18, 2020
@d-a-v in case you haven't noticed it, @philbowles is using the AsyncUDP lib, not ours.
philbowles commented on Apr 18, 2020
Good news and bad news. "My man" reran some tests this morning and... a debugger's worst nightmare: it's gone away. Nothing he has tried (reverting to 2.5.2, 2.6.3, etc.) now causes the heap loss.
Even more surprising, his SkyQ box is still flooding the network: my logger shows it peaking at 55 broadcasts/second, yet the heap is stable as a rock, bouncing up and down between 80% and 95% as the rate fluctuates, but basically "flatlining".
Tail of this morning's log after 45 minutes of uptime:
His bewildered suggestion is that his boxes received a firmware update overnight. At a total loss for an explanation, I tend to agree with him, but only because I can think of few other realistic explanations. :(
The only positive from this is that the rate no longer seems to be the core issue. We think a "bad packet from the Sky box (now fixed)" is/was the answer.
I wish I could tell you something different, but now neither of us can reproduce the problem.
I am still happy to try to help if I can, of course.
TD-er commented on Apr 18, 2020
Maybe you also switched the ESP to a WiFi channel with fewer disturbances (fewer retransmits)?
FinduschkaLi commented on Apr 18, 2020
Hi, I am following this thread closely, since I have had a problem with my ESP8266 resetting for 6 months.
I run a WebSocket client and observe resets several times a day, in particular with one WiFi hotspot that is a WiFi repeater; I could not yet reproduce the same behaviour on my mobile phone hotspot.
I basically lose heap on every WiFi reconnection attempt when my router is switched off and can't be reached.
I tried a bunch of different things, with no solution yet. I am currently running tests with the beta 0.0.2 mentioned above, but the behaviour stays the same. I will share the next stack traces. I am also trying to make a minimal version to be able to reproduce the behaviour.
Library Version 2.6.3 + Beta 0.0.2
Lwip v2 - lower Memory
CPU 80Mhz
Let me know if I can be of any help to this.
devyte commented on Apr 18, 2020
@FinduschkaLi that sounds like a completely different problem. Please don't hijack this thread, which is specific to a reported mem leak on each loop.
devyte commented on Apr 18, 2020
@philbowles it sounds to me like your friend had a corrupted build. I've seen that reported, and a clean build from scratch would make the problem go away.
devyte commented on Apr 18, 2020
@civilman2006 Your original MCVE uses PubSub MQTT. I've seen memory leaks reported when using that 3rd-party lib, and failure to reproduce without it using just our core. I suggest working with the authors of PubSub to reach an MCVE that uses only our core.
I'm closing this. If any of the involved parties can produce an MCVE that shows the memory leak and doesn't use 3rd-party libs, please open a new issue and follow the instructions in the issue template.