March 3rd, 2017, 03:25 AM
502 bad gateway - how to debug
I thought I posted this last night (UK time) but can't find the post. If it wasn't applicable here it should have been moved to another forum rather than removed without letting me know, but perhaps I didn't press send. Either way, I apologise.
The problem is that every so often our server (powering the apps) goes down for a few minutes, and I am not sure how to debug this. I have looked at all the logs in /var/log/nginx but can't see anything to say why it went down.
The most I see is a "connection timed out" in some of the logs.
My server is also behind a load balancer. Not sure if that makes a difference?
Finally, I am running nginx, but there is no forum for that or a general server forum.
March 3rd, 2017, 04:12 AM
Well, you can't do much looking at the server that's doing the proxying/load balancing. You have to look at what's behind it.
What is that server? Does it go down on a regular basis? Are you saying you looked at its logs or those of the proxy? How about other system logs besides the ones for the web server?
March 3rd, 2017, 04:25 AM
I literally looked at every log I could in /var/log: nginx, mysql, server logs, etc. But I couldn't see why the server would go down, i.e. some catastrophic bug in the code or the server, which I hadn't touched when (or immediately before) the server went down.
Server is Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-65-generic x86_64)
BTW, the only error that looked useful to me was this one:
WARNING: [pool www] server reached pm.max_children setting (5), consider raising it
But that is happening all the time, so if that could bring the server down then it would be down all the time.
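One way to check whether the pm.max_children warning really is constant, or whether it clusters around the outages, would be to count it per hour. This is only a rough sketch: the timestamp layout in the regex is an assumption about what the PHP-FPM log lines look like (the thread only quotes the message part), and the function name is made up for illustration.

```python
import re
from collections import Counter

# Assumed PHP-FPM log line shape (the thread only shows the message part):
# [07-Mar-2017 23:01:37] WARNING: [pool www] server reached pm.max_children setting (5), consider raising it
LINE_RE = re.compile(
    r"^\[(\d{2}-\w{3}-\d{4}) (\d{2}):\d{2}:\d{2}\] WARNING: .*pm\.max_children"
)


def warnings_per_hour(lines):
    """Count pm.max_children warnings per (date, hour) bucket, in log order."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Feeding it the lines of the PHP-FPM log (for example via open() on wherever that log lives on the server) gives a count per hour. If the hours around an outage show far more warnings than a normal hour, the setting may matter after all; if the counts are flat, it probably is just background noise.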
March 3rd, 2017, 06:12 AM
Can you tell if there's an absence of any logging during the downtime? That would at least help narrow down whether it's the system or just the web server. You could set up a no-op cron job that runs every minute (a mere ":" as the command should work), then after the outage check the cron logs to see if it executed during the window.
Do you have any performance monitoring of the server? This would be a good reason to set it up.
And are you able to wait until the outage and do something while it's still going on? Like, keep an SSH window open and go about your day waiting until it happens. Would mean you'd have to know about the outage as soon as it starts.
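For the "absence of logging" check, something like the following could scan a syslog-style file for gaps. It is a minimal sketch, not a finished tool: it assumes lines start with a "Mar 7 23:01:37" style timestamp, that the year is 2017 (syslog lines don't carry one), and the two-minute threshold is an arbitrary choice that fits a once-a-minute cron job.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year=2017):
    """Parse the leading 'Mar 7 23:01:37' timestamp; return None if absent."""
    parts = line.split()
    if len(parts) < 3:
        return None
    try:
        # The year is an assumption: classic syslog timestamps omit it.
        return datetime.strptime(
            f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
        )
    except ValueError:
        return None


def find_gaps(lines, max_gap=timedelta(minutes=2), year=2017):
    """Return (previous, current) timestamp pairs more than max_gap apart."""
    gaps = []
    previous = None
    for line in lines:
        current = parse_syslog_time(line, year)
        if current is None:
            continue  # skip lines without a recognisable timestamp
        if previous is not None and current - previous > max_gap:
            gaps.append((previous, current))
        previous = current
    return gaps
```

Running find_gaps over the cron log after an outage should show whether the every-minute job stopped being logged during the window, which would point at the whole system rather than just the web server.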
March 3rd, 2017, 09:11 AM
There's definitely an absence of logging. I can't remember if it was in all logs or just some, but I did notice it. I will add more details the next time the server goes down. I am also adding another server + load balancer to see if the issue occurs in both or just one.
March 3rd, 2017, 10:06 AM
Is it virtualized? How? Could that be to blame? For instance, a failover cluster on Windows Server will suspend an instance as it migrates across hosts.
March 7th, 2017, 05:34 PM
Hi - it happened again. This time, though, I figured out that it was the load balancer that was playing up, so I did the usual looking at all the logs (syslog, error, access) and didn't see anything.
One thing that concerned me was what one of the logs showed around the time it went down (below).
After this there were no more log entries until I restarted the load balancer.
Mar 7 23:01:37 domain.com-loadbalancer systemd-timesyncd: Timed out waiting for reply from 220.127.116.11:123 (ntp.ubuntu.com).
Mar 7 23:01:48 domain.com-loadbalancer systemd-timesyncd: Timed out waiting for reply from 18.104.22.168:123 (ntp.ubuntu.com).
Mar 7 23:01:58 domain.com-loadbalancer systemd-timesyncd: Timed out waiting for reply from 22.214.171.124:123 (ntp.ubuntu.com).
Mar 7 23:02:08 domain.com-loadbalancer systemd-timesyncd: Timed out waiting for reply from 126.96.36.199:123 (ntp.ubuntu.com).
Mar 7 23:02:18 domain.com-loadbalancer systemd-timesyncd: Timed out waiting for reply from [2001:67c:1560:8003::c8]:123 (ntp.ubuntu.com).
Mar 7 23:02:29 domain.com-loadbalancer systemd-timesyncd: Timed out waiting for reply from [2001:67c:1560:8003::c7]:123 (ntp.ubuntu.com).
March 7th, 2017, 05:45 PM
Actually, there is a bit more to the previous log. The last few lines are below.
Notice the time on the final line is wrong: it should not be 06:25, it should be 23:18 or something like that.
Mar 7 23:17:01 domain-com-loadbalancer CRON: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Mar 7 06:25:02 domain-com-loadbalancer rsyslogd: message repeated 8 times: [ [origin software="rsyslogd" swVersion="8.16.0" x-pid="1129" x-info="http://www.rsyslog.com"] rsyslogd was HUPed]
March 8th, 2017, 12:43 AM
- I doubt all those ntp.ubuntu.com IPs were actually inaccessible. Sounds like more networking problems.
- Without the NTP sync the clock shouldn't go crazy - at least not while the OS is still running. Again, is this virtualized?
- rsyslogd being HUPed right when the hourly cron runs could be because of logrotate, so that's not necessarily a problem. But 8 times is odd. This and the time issue could be because the log messages were combined (like to save space) and 06:25 was the time of the most recent message.
The NTP thing looks like another symptom of networking problems on that load balancer. The other issues could be explained normally - try temporarily turning off rsyslogd's RepeatedMsgReduction setting to stop combining messages.
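To see whether the 06:25 line is a one-off or whether the timestamps jump backwards elsewhere too, a quick scan like this might help. Again just a sketch under assumptions: it expects the "Mar 7 23:17:01" prefix seen in the pasted lines, guesses the year as 2017, and the function names are only illustrative.

```python
from datetime import datetime


def parse_syslog_time(line, year=2017):
    """Parse the leading 'Mar 7 23:17:01' timestamp; return None if absent."""
    parts = line.split()
    if len(parts) < 3:
        return None
    try:
        # The year is an assumption: classic syslog timestamps omit it.
        return datetime.strptime(
            " ".join([str(year)] + parts[:3]), "%Y %b %d %H:%M:%S"
        )
    except ValueError:
        return None


def find_backward_jumps(lines, year=2017):
    """Return (line_number, previous_time, current_time) where time goes backwards."""
    jumps = []
    previous = None
    for number, line in enumerate(lines, start=1):
        current = parse_syslog_time(line, year)
        if current is None:
            continue  # skip lines without a recognisable timestamp
        if previous is not None and current < previous:
            jumps.append((number, previous, current))
        previous = current
    return jumps
```

If the only backward jumps are on "message repeated N times" lines, that would fit the combined-messages explanation; jumps on ordinary lines would be more suspicious.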
March 8th, 2017, 03:26 AM
No, the 06:25 line showed up at around 23:30, when I was looking at the logs, so there is no way it should have said 06:25. And looking at the logs, the timings otherwise seem to be in order, i.e. the earliest time at the top and the latest at the bottom.
Virtualised? It's a DigitalOcean droplet. I am not 100% sure, but it's a VPS, so I am assuming it's virtualised.
PS: which logs should I really be looking at when the server goes down? There are so many, and because I don't know where I should be looking it's hard, so I just check ALL of them. I am guessing the error.log ones rather than syslog / access logs?
March 8th, 2017, 07:59 AM
Do you have support with DigitalOcean? They could help.
Unfortunately I don't really know of any logs that would help with large server-scale issues like this. Mostly it's just the kernel and dmesg logs, but they aren't always helpful - just messages like disks that need fscking.