spiccioli
Legendary
Offline
Activity: 1379
Merit: 1003
nec sine labore
|
|
June 26, 2011, 10:20:49 PM |
|
I don't know how to change MSL in Linux, though I think it's possible by patching the kernel. This will however not fix the issue completely. I'm using tcp_max_tw_bucket (or something similar) atm - that limits TW-sockets to 10k max. It works, but it does not solve the problem
Jine, I've googled around a little. tcp_max_tw_buckets puts a limit (default 180000) on the maximum number of sockets in TIME_WAIT; if this limit is exceeded, sockets entering the TIME_WAIT state are closed without waiting and a warning is issued... but first you have to reach the limit.
tcp_fin_timeout, or net.ipv4.tcp_fin_timeout, on the other hand, seems to be what you need. By lowering it (it could be 60 seconds right now) you should be able to limit TIME_WAIT sockets to a lower number without/before reaching the bucket limit. I mean, if you set it to 3 seconds, you'll end up having as many TIME_WAIT sockets as your incoming connections in a 3-second period, so your 25K sockets should go down to 1500-2000.
spiccioli.
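The estimate above is a steady-state count: sockets lingering at any moment are roughly the arrival rate times the linger time. A minimal sketch of that arithmetic, assuming the 25K figure was observed with a 60-second linger time (the post only says it "could be" 60 seconds) and assuming, as the post does, that the lowered timeout really bounds how long each socket lingers. The function names are made up for illustration.

```python
def implied_connection_rate(lingering_sockets: float, linger_seconds: float) -> float:
    """Connections per second that would keep `lingering_sockets` sockets lingering."""
    return lingering_sockets / linger_seconds


def steady_state_sockets(connections_per_second: float, linger_seconds: float) -> float:
    """Sockets lingering at any moment: arrival rate times linger time."""
    return connections_per_second * linger_seconds


# Assumed inputs: 25K TIME_WAIT sockets seen with a 60 s linger time.
rate = implied_connection_rate(25_000, 60)
after_tuning = steady_state_sockets(rate, 3)
print(f"about {rate:.0f} connections/s, about {after_tuning:.0f} sockets at 3 s")
```

Under these assumptions the count comes out near 1250, the same ballpark as the 1500-2000 guessed above; a higher real connection rate would push it up proportionally.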
|
|
|
|
JoelKatz
Legendary
Offline
Activity: 1596
Merit: 1012
Democracy is vulnerable to a 51% attack.
|
|
June 26, 2011, 10:27:05 PM Last edit: June 26, 2011, 10:38:54 PM by JoelKatz |
|
Sorry to hear that. I'll audit my changes for the kinds of things that typically cause seg faults. Update: Found the problem. Fixing it now.
|
I am an employee of Ripple. Follow me on Twitter @JoelKatz 1Joe1Katzci1rFcsr9HH7SLuHVnDy2aihZ BM-NBM3FRExVJSJJamV9ccgyWvQfratUHgN
|
|
|
Jine (OP)
|
|
June 26, 2011, 10:30:38 PM |
|
I mean, if you set it to 3 seconds, you'll end up having as many TIME_WAIT sockets as your incoming connections in a 3-second period, so your 25K sockets should go down to 1500-2000.
tcp_fin_timeout is already set to 3s; I forgot to mention that in my previous post. These settings are currently in use (since a week back), and we still have the problems described before:
echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
echo 0 > /proc/sys/net/ipv4/tcp_timestamps
echo 3 > /proc/sys/net/ipv4/tcp_fin_timeout
echo 30 > /proc/sys/net/ipv4/tcp_keepalive_intvl
echo 5 > /proc/sys/net/ipv4/tcp_keepalive_probes
echo 1 > /proc/sys/net/ipv4/icmp_echo_ignore_all
echo 1 > /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts
echo 1 > /proc/sys/net/ipv4/tcp_syncookies
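Values written with echo into /proc/sys are lost on reboot. A small sketch of how the same settings could be turned into `key = value` lines in the style of a sysctl.conf file; the helper names are hypothetical, and the simple slash-to-dot conversion is only safe for keys like these that contain no dots of their own.

```python
PROC_SYS = "/proc/sys/"

# The paths and values quoted in the post above.
SETTINGS = [
    ("/proc/sys/net/ipv4/tcp_tw_recycle", 1),
    ("/proc/sys/net/ipv4/tcp_tw_reuse", 1),
    ("/proc/sys/net/ipv4/tcp_timestamps", 0),
    ("/proc/sys/net/ipv4/tcp_fin_timeout", 3),
    ("/proc/sys/net/ipv4/tcp_keepalive_intvl", 30),
    ("/proc/sys/net/ipv4/tcp_keepalive_probes", 5),
    ("/proc/sys/net/ipv4/icmp_echo_ignore_all", 1),
    ("/proc/sys/net/ipv4/icmp_echo_ignore_broadcasts", 1),
    ("/proc/sys/net/ipv4/tcp_syncookies", 1),
]


def proc_path_to_sysctl_key(path: str) -> str:
    """Turn /proc/sys/net/ipv4/foo into the dotted key net.ipv4.foo."""
    if not path.startswith(PROC_SYS):
        raise ValueError(f"not a /proc/sys path: {path}")
    return path[len(PROC_SYS):].replace("/", ".")


def render_sysctl_conf(settings) -> str:
    """Render (path, value) pairs as 'key = value' lines, one per setting."""
    return "\n".join(f"{proc_path_to_sysctl_key(p)} = {v}" for p, v in settings)


print(render_sysctl_conf(SETTINGS))
```

Note that tcp_tw_recycle no longer exists on newer kernels (it was removed in Linux 4.12), so that line only applies to older systems like the one discussed here.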
|
Previous founder of Bit LC Inc. | I've always loved the idea of bitcoin.
|
|
|
Jine (OP)
|
|
June 26, 2011, 10:31:21 PM |
|
Sorry to hear that. I'll audit my changes for the kinds of things that typically cause seg faults.
Thanks, please do! If we solve this permanently, I'm sending 20 BTC your way...
|
|
|
|
JoelKatz
|
|
June 26, 2011, 10:41:44 PM Last edit: June 26, 2011, 11:29:22 PM by JoelKatz |
|
Okay, new build is up. It passes a pretty aggressive stress test, but I'll keep stressing it, just in case. Thanks for trying it.
|
|
|
|
|
Jine (OP)
|
|
June 27, 2011, 12:03:50 AM |
|
Okay, new build is up. It passes a pretty aggressive stress test, but I'll keep stressing it, just in case. Thanks for trying it.
Same URL? Will try it out in a moment.
That's actually better, BUT it would require modifications to both bitcoind and pushpoold. This was the previous main target, but I couldn't find anyone able to implement it. It would be really. really. awesome. Offering another 10 BTC for such an implementation.
|
|
|
|
JoelKatz
|
|
June 27, 2011, 12:15:53 AM |
|
Same URL? Will try it out in a moment.
Yes, same URL.
That's actually better, BUT it would require modifications to both bitcoind and pushpoold. This was the previous main target, but I couldn't find anyone able to implement it.
It's better in some ways and worse in others. Alone, it won't help the fact that bitcoind can't respond while it's waiting for another connection to respond. But it would eliminate the TIME_WAIT states that pile up because UNIX domain sockets don't have them. You'd need a multi-listening patch like mine to listen on a UNIX-domain socket and a TCP socket at the same time (or you could call select, but that's ugly). I haven't seen the code on the other side yet, so I don't know what's involved in making it use a UNIX domain socket. Of course, if you have multi-threaded listening, keep-alives, and UNIX domain sockets, that should *really* solve the problem.
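The idea of listening on a TCP socket and a UNIX-domain socket at the same time can be sketched with one accept thread per listener. This is not the actual bitcoind patch (which is C++ and not shown in the thread); the socket file name `rpc.sock`, the `ok` reply format, and all function names are assumptions made for illustration.

```python
import os
import socket
import tempfile
import threading


def serve(listener: socket.socket) -> None:
    """Accept loop for one listener; answers every request line with 'ok <line>'."""
    while True:
        conn, _ = listener.accept()
        with conn, conn.makefile("rb") as lines:
            for line in lines:
                conn.sendall(b"ok " + line)


def ask(family: int, address, line: bytes) -> bytes:
    """Send one request line to a listener and return the one-line reply."""
    with socket.socket(family, socket.SOCK_STREAM) as client:
        client.connect(address)
        client.sendall(line + b"\n")
        with client.makefile("rb") as replies:
            return replies.readline()


def demo():
    tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    tcp.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    tcp.listen()

    path = os.path.join(tempfile.mkdtemp(), "rpc.sock")  # hypothetical socket name
    unix = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    unix.bind(path)
    unix.listen()

    # One daemon thread per listener, so neither accept() blocks the other.
    for listener in (tcp, unix):
        threading.Thread(target=serve, args=(listener,), daemon=True).start()

    return (
        ask(socket.AF_INET, tcp.getsockname(), b"getinfo"),
        ask(socket.AF_UNIX, path, b"getinfo"),
    )


if __name__ == "__main__":
    print(demo())
```

Each accept thread here serves its connection inline, so a slow client still blocks others on the same listener; a fuller version would hand each connection to its own worker.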
|
|
|
|
hamdi
|
|
June 27, 2011, 12:17:44 AM |
|
The best option would be to remove the network path between bitcoind and pushpoold.
On the one hand it's shitty to force both to run on one machine, but the network bottleneck would be gone if the data were shared via RAM instead of the network.
|
|
|
|
JoelKatz
|
|
June 27, 2011, 12:18:27 AM |
|
The best option would be to remove the network path between bitcoind and pushpoold.
On the one hand it's shitty to force both to run on one machine, but the network bottleneck would be gone if the data were shared via RAM instead of the network.
It is shared by RAM. Connections between two processes on the same machine don't actually flow over any network. It just emulates a network, and with that comes both good things and bad things.
|
|
|
|
Jine (OP)
|
|
June 27, 2011, 01:01:47 AM |
|
Same URL? Will try it out in a moment.
Yes, same URL.
That's actually better, BUT it would require modifications to both bitcoind and pushpoold. This was the previous main target, but I couldn't find anyone able to implement it.
It's better in some ways and worse in others. Alone, it won't help the fact that bitcoind can't respond while it's waiting for another connection to respond. But it would eliminate the TIME_WAIT states that pile up because UNIX domain sockets don't have them. You'd need a multi-listening patch like mine to listen on a UNIX-domain socket and a TCP socket at the same time (or you could call select, but that's ugly). I haven't seen the code on the other side yet, so I don't know what's involved in making it use a UNIX domain socket. Of course, if you have multi-threaded listening, keep-alives, and UNIX domain sockets, that should *really* solve the problem.
Unfortunately, I could not get this version to work in our production environment either. The strange thing is that this DOES work if I run it separately... I downloaded a nightly build of the blockchain, tried it out on port 1337 and it started; I could issue getwork requests to it, and it seems to have reused the socket too.
On the other hand, when I replace our bitcoind with this patched version, it starts, runs for 4 seconds and then silently dies. I was able to issue one getinfo request from its client, but then it died.
Straaaaange... but I really think we're onto something here.
|
|
|
|
JoelKatz
|
|
June 27, 2011, 03:13:11 AM Last edit: June 27, 2011, 04:30:29 AM by JoelKatz |
|
It must be a problem that only appears under load or under some particular combination of requests. I'll try to do some more troubleshooting. I don't have much time left today, but I should have a few hours tomorrow that I can dedicate.
Update1: Do you compile bitcoind with any non-standard settings? And do you start it with any command line flags other than '-daemon'? Do you enable RPC over SSL?
Update2: I made a few cleanups and fixed a few very minor issues. But I can't replicate your issue, which means either I solved it or it requires something unique to your setup to replicate. If you get a chance to try my latest (same place) please do. If you can, compile it with '-g' (I believe that's provided by default), make sure not to strip the executable, do a 'ulimit -c unlimited' before you execute it, and if it crashes, run 'gdb' on the core file like this: 'gdb /path/to/bitcoind /path/to/core.filename' and then message me the output of a 'where' command. (You may have to hit 'enter' a few times to get the full output. It'll be the last few lines that will be the most helpful.)
As soon as you have the core file, you can restart the original bitcoind. You don't need to have any downtime. Just make sure to pass 'gdb' the path to the bitcoind executable that generated the core file.
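If the daemon is started from a launcher script rather than an interactive shell, the 'ulimit -c unlimited' step can be done in the launcher itself. A minimal sketch, assuming a Python launcher (nothing in the thread says one is used); the function names are hypothetical, and the path in the usage note is the placeholder from the post above.

```python
import resource
import subprocess


def enable_core_dumps():
    """Raise the soft core-size limit to the hard limit for this process."""
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def launch_with_core_dumps(argv):
    """Start `argv` with the raised limit applied in the child process only."""
    return subprocess.Popen(argv, preexec_fn=enable_core_dumps)
```

Usage would look like `launch_with_core_dumps(["/path/to/bitcoind"])`. If the hard limit is itself 0, core dumps stay disabled and the limit has to be raised by an administrator first.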
|
|
|
|
JoelKatz
|
|
June 27, 2011, 08:40:34 AM Last edit: June 28, 2011, 02:01:34 AM by JoelKatz |
|
There's another new version up. I realized the JSON code wasn't re-entrant, which creates a problem when you try to use it in more than one thread. Unfortunately, the 'getblock' code isn't re-entrant, and the simplest way to deal with that is to wrap all the RPC handlers in a big mutex. It doesn't seem to have any effect on performance though, so I'll leave it that way because it's much safer.
This version makes the JSON code re-entrant, but single-threads the calls that do the actual RPC. This is slightly less than optimum, but in all of my tests it made no difference. Multi-threading the actual RPC calls would carry a significant risk that some part of that code would break for no significant benefit and unless the 'getblock' code was pessimized for the most common case, it wouldn't benefit anyway. (Plus, there would have to be invasive changes to the code that handles when you successfully find a block, and that scares me because it's so critical and so hard to test.)
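The "big mutex" arrangement described above can be sketched as follows: request parsing runs concurrently in every thread, while the handlers themselves run one at a time under a single lock. This is an illustration in Python, not the C++ patch; the handler, its shared counter, and all names are invented for the example.

```python
import json
import threading

_rpc_lock = threading.Lock()        # the one big mutex around all handlers
_state = {"work_served": 0}         # shared handler state that is not thread-safe


def _getwork(params):
    """Stand-in handler: a read-modify-write on shared state."""
    current = _state["work_served"]
    _state["work_served"] = current + 1
    return current + 1


HANDLERS = {"getwork": _getwork}


def handle_request(raw: str) -> str:
    request = json.loads(raw)       # per-call parsing: safe to run in parallel
    with _rpc_lock:                 # only one handler executes at a time
        result = HANDLERS[request["method"]](request.get("params", []))
    return json.dumps({"result": result, "id": request.get("id")})


def run_concurrently(n_threads: int, calls_per_thread: int) -> int:
    """Hammer the dispatcher from several threads; return how many calls were counted."""
    before = _state["work_served"]
    raw = json.dumps({"method": "getwork", "params": [], "id": 1})

    def worker():
        for _ in range(calls_per_thread):
            handle_request(raw)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return _state["work_served"] - before


if __name__ == "__main__":
    print(run_concurrently(8, 200))
```

The trade-off matches the post: handlers never overlap, so none of them has to be re-entrant, at the cost of serialising the actual RPC work.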
Please test this version. It should solve the problem.
I also have a version with UNIX domain sockets available if anyone's interested (it's not up at the moment, but PM me if you want it). It's very ugly right now because I haven't had time to polish it, but it does work. It supports a '-unixsocket=<filename>' option. The protocol is a single line query and a single line response, no headers, no authentication (so put the socket in a directory only the authorized user can access). There is also code to issue RPC calls over the UNIX-domain socket, so you can see how to do it and see that it works. The biggest issue with it right now is that if you make any errors, it just closes the socket. You can issue multiple requests over a single connection though, and of course there are no stale socket issues.
The biggest ugliness is that I couldn't figure out how to bind a basic_istream to a local::stream_protocol::socket. So I had to use a 'receive' call instead of 'getline'. If anyone knows how to do that, I'd appreciate a PM. (I always meant to learn boost.)
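The protocol described above (one line in, one line out, no headers, several requests per connection, socket closed on any error) can be sketched like this. It is a Python illustration, not the posted implementation: the JSON request shape, the `getinfo` reply contents, and the socket file name are assumptions. The socket lives in a directory created by `tempfile.mkdtemp`, which is accessible only to the creating user, in the spirit of the access-control advice above.

```python
import json
import os
import socket
import tempfile
import threading
from typing import Optional


def answer(line: bytes) -> Optional[bytes]:
    """Return a one-line reply, or None if the request is malformed or unknown."""
    try:
        method = json.loads(line)["method"]
    except (ValueError, KeyError, TypeError):
        return None
    if method != "getinfo":          # only one made-up method in this sketch
        return None
    return json.dumps({"result": {"version": "sketch"}, "error": None}).encode() + b"\n"


def serve(listener: socket.socket) -> None:
    while True:
        conn, _ = listener.accept()
        with conn, conn.makefile("rb") as lines:
            for line in lines:       # several requests may share one connection
                reply = answer(line)
                if reply is None:    # on any error, just close the socket
                    break
                conn.sendall(reply)


def demo():
    path = os.path.join(tempfile.mkdtemp(), "bitcoind.sock")  # hypothetical name
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    listener.bind(path)
    listener.listen()
    threading.Thread(target=serve, args=(listener,), daemon=True).start()

    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(path)
        with client.makefile("rb") as replies:
            client.sendall(b'{"method": "getinfo"}\n')
            first = replies.readline()
            client.sendall(b'{"method": "getinfo"}\n')
            second = replies.readline()
            client.sendall(b"not json\n")
            after_error = replies.readline()  # empty: the server hung up
    return first, second, after_error


if __name__ == "__main__":
    print(demo())
```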
In truth, none of these are the right solution. I have some ideas for the 'right' solution (bitcoind should push changes to the mining controller so it doesn't have to poll), and I'll try to get them thought out and proposed as modifications to the official source. (Think of it as extending long polling back one more link in the chain.)
|
|
|
|
NANO
Newbie
Offline
Activity: 28
Merit: 0
|
|
June 28, 2011, 01:40:23 AM |
|
JoelKatz seems to be a Hero...
|
|
|
|
ius
Newbie
Offline
Activity: 56
Merit: 0
|
|
June 28, 2011, 02:17:25 AM |
|
In truth, none of these are the right solution. I have some ideas for the 'right' solution (bitcoind should push changes to the mining controller so it doesn't have to poll), and I'll try to get them thought out and proposed as modifications to the official source. (Think of it as extending long polling back one more link in the chain.)
Even so it would make much more sense to do it properly (so we end up having a useful pull request against bitcoin); the asio route appears to be the way to go (patch is there, yet bugged) instead of spawning multiple threads.
|
|
|
|
JoelKatz
|
|
June 28, 2011, 02:23:34 AM |
|
Even so it would make much more sense to do it properly (so we end up having a useful pull request against bitcoin); the asio route appears to be the way to go (patch is there, yet bugged) instead of spawning multiple threads.
Oh, do you know where the patch is? There's a good chance I could debug it. I've been meaning to learn about boost anyway. (My day job involves high-performance, multi-threaded TCP server code.) Asio won't actually gain you much over my patch. The main advantage of asio is when you have large numbers of connections, most of which aren't very active or when large numbers of them become active at the same time. You would want it in a mining controller. Think about the large number of connections, the long periods of inactivity, and the sudden burst when a new block comes out. Without asio, you have to have a context switch for each connection. With asio, you do not. That said, there's basically no downside, and it's also the right thing to do.
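The mining-controller scenario described above (many connections, long idle periods, then a burst when a new block arrives) can be illustrated with a single-threaded event loop. Python's asyncio stands in for Boost.Asio here, which is a substitution for illustration rather than anything used in the patch; the `longpoll` request line, the reply text, and the client count are all made up.

```python
import asyncio


async def demo(n_clients: int = 50):
    new_block = asyncio.Event()

    async def on_client(reader, writer):
        await reader.readline()      # the long-poll request
        await new_block.wait()       # idle: no thread is parked per connection
        writer.write(b"new block\n")
        await writer.drain()
        writer.close()
        await writer.wait_closed()

    server = await asyncio.start_server(on_client, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    async def client():
        reader, writer = await asyncio.open_connection("127.0.0.1", port)
        writer.write(b"longpoll\n")
        await writer.drain()
        line = await reader.readline()   # returns only when the block arrives
        writer.close()
        await writer.wait_closed()
        return line

    tasks = [asyncio.create_task(client()) for _ in range(n_clients)]
    await asyncio.sleep(0.1)         # let the clients connect and go idle
    new_block.set()                  # the burst: every waiter wakes in one loop
    replies = await asyncio.gather(*tasks)

    server.close()
    await server.wait_closed()
    return replies


if __name__ == "__main__":
    print(len(asyncio.run(demo())))
```

All the waiting connections are serviced by one thread, which is the property the post attributes to asio; a thread-per-connection design would need one blocked thread for each idle client.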
|
|
|
|
Jine (OP)
|
|
June 28, 2011, 02:25:13 AM |
|
Let's try this out right now. Compiling the code as we speak. I'll give you an update and/or coredump in a while.
The "right way" seems pretty awesome to me...
|
|
|
|
JoelKatz
|
|
June 28, 2011, 02:32:08 AM Last edit: June 28, 2011, 02:52:46 AM by JoelKatz |
|
The "right way" seems pretty awesome to me...
Okay. If I get the existing asio work for bitcoind, I'll work on that. I've grabbed the source code to the mining daemon and am looking at how it implements long polling.
Update: It looks like pushpoold already has a way to do this, with blkmond. So the only issue is the connection buildup, which I think I've already fixed.
|
|
|
|
Jine (OP)
|
|
June 28, 2011, 03:24:28 AM |
|
Something goes wrong when running your current patched bitcoind in our prod. environment. Now the process just hung; I couldn't get any coredump or similar out of it.
Suggestions? I'll try this one in my personal dev-environment and see if I can replicate the issue.
|
|
|
|
JoelKatz
|
|
June 28, 2011, 03:42:08 AM Last edit: June 28, 2011, 04:09:41 AM by JoelKatz |
|
Suggestions? I'll try this one in my personal dev-environment and see if i can replicate the issue.
I'm guessing that there's some path that doesn't release the RPC mutex. I put up a new build that might fix it, but it's hard to be sure since I can't replicate the problem.
Update: If that fails, I can just pull the multi-threading stuff and only fix keep-alives. That's a no-brainer (two lines change in trivial ways) and is very unlikely to cause any problems.
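A path that exits without releasing a mutex produces exactly the reported symptom: the process hangs instead of crashing, because the next caller waits forever. A small sketch of the failure and of the scoped-locking fix, in Python for illustration (in C++ the equivalent would be a scoped lock object rather than manual lock/unlock calls); the handlers are invented and are not the actual bitcoind code.

```python
import threading


def leaky_handler(lock, fail: bool) -> str:
    """Manual acquire/release: the early exit skips release()."""
    lock.acquire()
    if fail:
        raise ValueError("bad request")  # the lock stays held from here on
    lock.release()
    return "ok"


def safe_handler(lock, fail: bool) -> str:
    """Scoped locking: the lock is released on every path out of the block."""
    with lock:
        if fail:
            raise ValueError("bad request")
        return "ok"


def still_usable_after_error(handler) -> bool:
    """Run one failing request, then check whether the next caller can get the lock."""
    lock = threading.Lock()
    try:
        handler(lock, fail=True)
    except ValueError:
        pass
    got_it = lock.acquire(timeout=0.1)   # a real second request would hang instead
    if got_it:
        lock.release()
    return got_it


if __name__ == "__main__":
    print(still_usable_after_error(leaky_handler), still_usable_after_error(safe_handler))
```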
|
|
|
|
|