Development discussions
Mon, 18 May 2020, 08:00pm #1
I2P Legend

There has been much effort to speed up the streaming lib by improving the handling of packet loss. Actually this is message loss since without connection breakup NTCP never loses packets and SSU only after an ultra-rare isolated sequence of four timeouts.

Messages usually get eaten by tail drops or CoDel drops on the various queues employed. Because many 24/7 I2P machines relaying lots of traffic may be cost and energy efficient ones, I used octacore ARM32 Odroid HC1 at 50% CPU, 2-3 MBps outbound, > 3000 tunnels, > 20% local traffic, > 70 ms garbage collection pauses for research.

Outbound side, timing issue independent of CPU speed:
Healthy SSU connections can run well over 100 KBps. At least any single ACK packet lost then causes more than a dozen retransmits typical. Thus the outbound queue runs over long before the connection is allowed to return to previous speeds. >1000 messages lost have been seen in the logs on a single occurrence while the connection continued.

Inbound side: issues depend on CPU speed:
Queues between threads are sized based on configured memory. In my opinion they should be sized by the inverse of CPU speed. No matter whether queues are correctly sized I think it is import to remove the bottlenecks that cause traffic to backlog and finally get dropped.

Methodology used to identify serious issues:
- find occasions of > 500 messages (from SSU, NTCP and I2CP in total) piling up in the router
- take thread dump if last dump > 60 sec ago
- record issue if majority of worker threads stuck in the same code section (always)

That revealed six recurring patterns (in addition to open trac tickets, namely UDP, Tunnel GW Pumper and FragmentHandler) that alone or in combination potentially cause messages drops unless appropriate buffering is employed.

1.) Top issue is worker threads stuck storing stats using getProfile() while a PeerManager Reorg (300-3000ms on the test machine) is going on. For each peer that beast iterates over all other peers, so it has at least O(n^2) complexity. Solved by using nonblocking calls and ditching the stats if unsuccessful.
2.) multiple threads blocking over the random number generator. See I2Speed for lockless retrieval.
3.) High CPU due to crypto functions. Code should not be surprised that traffic backlogs just because there is temporary high CPU usage most probably because of a traffic peak in local apps.
4.) I have often reported high CPU issues in the Sun security provider. These also occur intermittently with all worker threads stuck in the SHA256. NOT related to reading the same system properties many thousands of times per second which also must go through the security provider. Partial solution posted.
5.) Threads stuck within the central send function GetBids(). They block over isUnreachable() or spend time in functions like haveCapacity(), that finally go down to system properties and the security provider, checking which parameters were supplied on the command line. No reason can be seen why these functions that return near constant results must be called tens of thousands of times per second. Same holds true for checking whether an address is routable. Same holds true for determining the maximum number of connections multiple times and iterating over thousands of connected peers just to send a single message. Partially solved by storing the system properties at the application level and making the reachable check concurrent.
6.) Threads blocked on the poll() in the SHA256 generator. Blocks worker threads and/or most of the router including OCMOSJ on activeFloodQueries(). Will discuss below. Solution is to use lock free alternatives.

Solving issues 1, 2, 6 and part of 4 and 5, (will post source if someone cares or work on trac tickets targetting a release) eased most of the problem, did not get to the 500 msg threshold again. Lowered the threshold to 200 messages and saw reports again. On nearly every occurrence now all threads were RUNNABLE (which includes locking issues with the JVM using spinlocks), so only high CPU remains as an issue. Outside crypto functions only locks and retrieval of properties stood out.

LinkedBlockingQueues are used all over I2P. They are NOT lockless as the I2P docs suggest, they are full of locks and I view them as unsuitable for usage on any critical path. Reasons:
- When coupling threads through LBQ at high frequency, the CPU used by hammering the locks may exceed the actual execution time for app code
- In a MPMC scenario like the SHA256 generator, where all worker threads must go through, once a thread holding a lock gets preempted, the JVM suspends all other worker threads. Until the JVM scheduler and then the OS scheduler have everything going again, some traffic will have piled up.
Using any type of lock to avoid object churn creates CPU overhead on the hot path. Make it either lock free or generate new objects.

a note on CoDel: This assumes there is TCP-like flow control exactly at the same level packets are dropped. Drops that cause more data retransmitted than dropped make problems worse, like one UDP drop inbound or outbound may cause a flood of retransmissions as the packet may contain fragments from multiple messages and multiple ACKs.

Finally one must keep in mind the power saving features of Windoze and MacOS batching together thread executions which will extend a 1 ms sleep up to 60 seconds in low usage scenarios. Application code should not be surprised by the clock jumping any time or 1 sec worth of traffic suddenly appearing in some buffer or queue.

Tue, 19 May 2020, 10:08pm #2

Good stuff.

1) So we'd want to convert more getProfile() calls to getProfileNonblocking(). Do you have a list of the top callsites?

2) Definitely would like to look. You have a patch I could look at?

4) Not sure if you're saying the system properties is or is not a problem. We could maybe copy the system properties to a HashMap at startup? There's a couple places where we set system properties but we could find those. Where is your partial solution posted?

5) Worth looking at, getBids() could certainly be a hot spot

6) What's your recommendation for lock-free caching?

Sat, 23 May 2020, 07:27pm #3
I2P Legend

Complete browseable source here: http://i2speed.i2p/I2Speed.200522.0.9.45/
Zipped : http://i2speed.i2p/I2Speed.200522.0.9.45.zip
Updater: http://i2speed.i2p/i2pupdate.zip

1.) I noted all calls blocking during high backlog and changed them here: http://i2speed.i2p/I2Speed.200522.0.9.45/router...

2.) This one maintains separate 16k buffers for the data types most used, retrieves lockfree and locks down only once 16k are exhausted : http://i2speed.i2p/I2Speed.200522.0.9.45/core/j...

4.) That is the updated faster SHA2.java (requires recompiling your JDK) posted here : http://zzz.i2p/topics/2887-solving-the-sha256-i.... For system properties see 5.)

5.) On-The-Fly-migration of system properties in getProperty(String propName) here: http://i2speed.i2p/I2Speed.200522.0.9.45/core/j...
reachable check made concurrent: http://i2speed.i2p/I2Speed.200522.0.9.45/router...

6.) This one beats the TryCache by a magnitude in terms of speed and churn : http://i2speed.i2p/I2Speed.200522.0.9.45/core/j...
With zero to little code modification the following does the same for ConcurrentLinkedQueue and its extensions as well as the ConcurrentHashSet in many uses cases. See source doc.

1 day ago #4

6) I like LockFreeRingCache. If we took it I'd probably just drop the code into TryCache.

It would take quite a bit more staring at it to convince myself that poll() could never return the same entry to two callers, because that would be an impossible-to-debug disaster. I'd like to have zlatinb take a look at it also. And I'd like to do my own testing to see if I see the "magnitude" benefit you're promising.

LockFreeRingBuffer makes me nervous due to the comments about dups and code mod required, but it isn't really clear to me what's "imperfect" about it if anything.

2) is like a simple version of 6). I like this one also. Again, would take a little more staring at it to make sure that it can't ever return the same data twice, and would want to do a speed test also.

Not sure if 3 caches is any better than one. Also I'd just use the DataHelper.fromLong() method instead of the object churn of bytebuffer.wrap().

Also, in nextBytes() you need a check for length > CACHE_SIZE and throw IAE I guess, that would infinite loop otherwise.

1) This one looks good also. I looked at the original 2011 checkin to see why I only made some of the calls nonblocking, or if I had a plan to make the rest of them nonblocking later, but I didn't say anything on the subject. I see you didn't change all of them either... that was based on which ones you noted as frequently blocking?

5a) is interesting, would have to look at every place we set a system property in our code and make sure it wouldn't be after something else checks it

5b) looks like an easy change to switch _unreadhableEntries to a concurrent map. Not sure what the point of lastUnreachablePeer is?

It would be a lot easier to review your proposals if you submitted patches. Running the diffs here yields a bunch of extraneous stuff, due to unrelated changes both on your side and on ours.

8 hours ago #5
I2P Legend

Happy to submit patches for trac tickets with a fixed target release.

Short remarks:

2.) 3 buffers to reduce frequency of the spinlocks hitting.
nextbytes() works, but discards current buffer if length requested > 16k. Now going directly to fortuna in that case.
Datahelper is better, if IAE for negative gets patched out there.

1.) yes

5b) That is just a one item cache. Given typical traffic patterns those have up to 50% hit rate in I2P.

6 hours ago #6
I2P Legend

Remark for LockFreeRingBuffer: Without dup check it behaves like a queue. With dup check it is also like CHS for small quantities. When using poll() you may still get dups, that you want to check for work to do, when handing over work items between threads which is the use case for LockFreeRingBuffer.

Updated source for #5 2.) here: http://i2speed.i2p/