Hi everyone, the recent outage has been resolved, and the SOURCE OF THE PROBLEM FINALLY IDENTIFIED.
This is the same issue that has been affecting the site, on-and-off, for the last 2-3 weeks.
I don't normally describe in detail the causes of these types of problems, but in this case I think it is warranted so everyone knows exactly what was going on.
- about a month and a half ago we switched our websocket for streaming trade data, order data, etc. from our own custom solution to the hosted service www.pusher.com
- pusher.com (to their credit) was extremely easy to implement and worked very, very well for the first few weeks.
- our server has very strict firewall settings, so when pusher.com was set up, a hole was opened in our firewall to allow outgoing connections from our server to the pusher API server. We need those connections to push events to it, which it then re-broadcasts over the websocket to all connected clients
- at some point over the past few weeks, the IP address of the pusher.com API server started changing occasionally (maybe low-priority round-robin DNS for automatic failover on their end?), so our server would sometimes connect to the correct IP that our firewall allows, and sometimes (very rarely at first, but with increasing frequency) try to connect to a different, FIREWALLED IP address.
- last night, it appears the IP address changed more permanently (or, who knows, maybe it will change again)
- the reason the site was running slow, or not at all, is that whenever the site was active, pusher.com events were triggered during key database session transactions (such as within the locked trade transaction), and the system waited 30 seconds per pusher event for the pusher connection to time out before continuing. This caused all the other users/clients waiting on that locked transaction to wait for access to that data.
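To illustrate the mismatch described above, here is a rough sketch in Python of a check that compares what a hostname currently resolves to against a firewall allowlist. It is an illustration only, not the code running on the site: the function names, the hostname `api.example.com` and the addresses are all placeholders I made up for the example.

```python
import socket


def resolved_ips(hostname, port=443, resolver=socket.getaddrinfo):
    """Return the set of IP addresses the hostname currently resolves to."""
    # getaddrinfo returns 5-tuples; the last item is the socket address,
    # whose first element is the IP address as a string.
    infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def unlisted_ips(hostname, allowed, port=443, resolver=socket.getaddrinfo):
    """Return resolved IPs that are NOT in the firewall allowlist."""
    return resolved_ips(hostname, port, resolver) - set(allowed)


if __name__ == "__main__":
    # Placeholder hostname and allowlist (assumptions for this example).
    allowed = {"203.0.113.10"}
    blocked = unlisted_ips("api.example.com", allowed)
    if blocked:
        print("resolves to IPs the firewall would block:", sorted(blocked))
```

Run periodically, a check like this would flag the moment the API hostname starts resolving to an address the firewall does not allow, instead of that showing up as connection timeouts.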
SOLUTIONS:
- I have now added the new IP to the firewall, and have asked their support staff for a list of all possible IPs that it might resolve to, much in the way Cloudflare provides a list of their IPs.
- I have lowered the timeout from 30 seconds to 2 seconds
- I have moved all pusher.com events outside of locked transactions
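The last two changes can be sketched roughly as follows, in Python. This is an illustration only, not the site's real code: `trade_lock`, `execute_trade` and `send_event` are hypothetical names, the lock stands in for the locked trade transaction, and `send_event` stands in for whatever call pushes an event to pusher.com.

```python
import threading

PUSH_TIMEOUT_SECONDS = 2  # lowered from 30 seconds

# Stand-in for the locked trade transaction (assumption for this example).
trade_lock = threading.Lock()


def execute_trade(trade, send_event):
    """Update the trade under the lock, then push events after releasing it."""
    pending = []
    with trade_lock:
        # Only local work happens while the lock is held: no network calls.
        trade["status"] = "filled"
        # Queue the event instead of sending it here.
        pending.append(("trades", "trade-filled", dict(trade)))

    # The lock is released at this point, so a slow or unreachable
    # push server can no longer hold up other users waiting on it.
    for channel, name, payload in pending:
        try:
            send_event(channel, name, payload, timeout=PUSH_TIMEOUT_SECONDS)
        except OSError:
            # A failed or timed-out push must not fail the trade itself.
            pass
    return trade
```

With this shape, the worst case for an unreachable push server is a 2 second delay for the one request that triggered the event, rather than a 30 second stall for everyone queued behind the lock.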
With this, I hope the issue is finally resolved. I sincerely apologize for the recent problems you may have experienced with the site over the past few weeks, and thank you for your patience while we figured out the root cause.
If you have any questions, please don't hesitate to ask.
Cheers,
James