This is a complete shot in the dark. Without a lot more information to go on, no one could fully diagnose this, but I'm wondering if there are any bright ideas out there I can investigate.

We have 6 web servers, all of which host both a UI and an API. The UI always calls the API.

We have a simple login procedure: we generate a huge randomized hash (the chance of it ever repeating is vanishingly small), store it for that user, and return the hash in a cookie. We also check that the hash isn't currently in use, though to be honest, that check probably never matters. The hash cookie is then validated on each call (from the UI to the API).
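A minimal Python sketch of the login flow described above, for reference. This is an illustration, not our production code: the 256-bit token length, the in-memory `sessions` dict standing in for the real session store, and the function names `create_session`, `user_for_token`, `logout`, and `collision_probability` are all hypothetical. The last function gives a birthday-bound estimate of how unlikely a token collision is.

```python
import secrets
from typing import Dict, Optional

# Hypothetical in-memory store standing in for the real session table.
sessions: Dict[str, int] = {}

TOKEN_BYTES = 32  # assumption: 256 bits of randomness per token


def create_session(user_id: int) -> str:
    """Generate a random token, make sure it is unused, and store it for the user."""
    while True:
        token = secrets.token_hex(TOKEN_BYTES)  # 64 hex characters
        if token not in sessions:  # the "is this hash already in use?" check
            sessions[token] = user_id
            return token


def user_for_token(token: str) -> Optional[int]:
    """Look up which user the cookie value belongs to on each API call."""
    return sessions.get(token)


def logout(token: str) -> None:
    """Remove the session server-side; unknown tokens are ignored."""
    sessions.pop(token, None)


def collision_probability(num_tokens: int, bits: int = TOKEN_BYTES * 8) -> float:
    """Birthday-bound approximation: p ~= n^2 / 2^(bits + 1)."""
    return num_tokens ** 2 / 2 ** (bits + 1)
```

With 256-bit tokens, even a billion issued sessions gives a collision probability far below 1e-50, which is why the uniqueness check should essentially never fire. The sketch uses `secrets` rather than `random` because the `random` module is not intended for security-sensitive tokens.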

This had been working fine for years. About a month ago, people reported logging in and seeing other people's information. Completely: it was as if they were logged in as that person. That person could be across the country, with a different service provider, etc. Among 3 senior devs, we can find no way that our code could log a user into the wrong account, or that two users could end up with the same session ID. However, it does appear that both users were logged into the system at about the same time.

We can never reproduce it, and it's only happened a few times, but data security is extremely important. We've seen screenshots, but we have never caught the problem before it resolved itself. Users typically report logging in and out a few times, after which it's fine. That's also important to note. That said, we've also had users log in and out several times, even reboot, and still get into someone else's account.

We have also been having some hardware issues, with spikes in response times and NIC problems, and we aren't very happy with our server administration lately.

Other details:
- we use Redis, but it's not used anywhere in the login process
- we have a cron job that clears out old sessions
- when a user logs out, their session is cleared on their end as well
- the original user has logged in at various times before the user who ended up in their account, usually within an hour but never at exactly the same time

Any bright ideas for things to test? Is there any way the API could be getting "confused" by a networking issue and sending back the wrong information?

Thanks for any ideas you can throw out there.