Cannot open more than 1023 sockets

I am developing code that simulates networking hardware. I need to run several thousand simulated "agents", and each one needs to connect to a service. The problem is that after 1023 connections have been opened, further connections start timing out and the whole thing crashes.

The main code is in Go, but I wrote a trivial Python script that reproduces the problem.

One somewhat unusual thing is that we need to set the local address on the socket when we create it. This is because the equipment that the agents connect to expects the apparent source IP to match the one it believes it is talking to. For this I have configured 10,000 virtual interfaces (eth0:1 to eth0:10000), each assigned a unique IP address on a private network.

The Python script is simple (it only attempts about 2000 connections):

import socket

i = 0
for b in range(10, 30):
    for d in range(1, 100):
        i += 1
        ip = "1.%d.1.%d" % (b, d)
        print("Conn %i   %s" % (i, ip))
        s = socket.create_connection(("1.6.1.1", 5060), 10, (ip, 5060))

      

If I remove the last argument in socket.create_connection (source address), then I can get all 2000 connections.

The difference when using a local address is that a bind must be done before the connect, so the output of the program running under strace looks like this:

Conn 1023   1.20.1.33
bind(3, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
bind(3, {sa_family=AF_INET, sin_port=htons(5060), sin_addr=inet_addr("1.20.1.33")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(5060), sin_addr=inet_addr("1.6.1.1")}, 16) = -1 EINPROGRESS (Operation now in progress)

      

If I run without a local address, the AF_INET bind disappears and it works.
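For reference, a minimal sketch of what passing a source address amounts to at the socket level: an explicit bind() followed by connect(), matching the two AF_INET calls in the strace output. The helper name connect_from is hypothetical and is not part of the original script.

```python
import socket


def connect_from(source_ip, dest, source_port=0, timeout=10):
    """Bind to a local address first, then connect to dest.

    Hypothetical helper: roughly what socket.create_connection() does
    when it is given a source_address argument.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.settimeout(timeout)
        # The extra bind() that shows up in strace; port 0 lets the
        # kernel choose an ephemeral source port.
        s.bind((source_ip, source_port))
        s.connect(dest)
        return s
    except OSError:
        s.close()
        raise
```

A call such as connect_from("1.10.1.1", ("1.6.1.1", 5060), 5060) would correspond to one iteration of the loop above. Appending the returned sockets to a list keeps every connection open instead of letting each one be closed when the variable is reassigned.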

So it seems that there is some limit on the number of binds that can be made. I have followed all kinds of links about tuning TCP on Linux, and I have tried messing with tcp_tw_reuse / tcp_tw_recycle, shortening fin_timeout, and other things I can't remember.

This is running on Ubuntu Linux (11.04, kernel 2.6.38, 64 bit). It is a virtual machine in a VMware ESX cluster.

Just before posting this, I tried running a second instance of the Python script with an additional range starting at 1.30.1.1. The first script got up to 1023 connections, but the second could not even make its first one, which suggests that the problem is related to the large number of virtual interfaces. Could some internal data structure be limited? Is there a maximum-memory setting somewhere?

Can anyone think of some limitation in Linux that might cause this?

Update:

This morning I decided to try an experiment. I modified the Python script to use the IP of the "main" interface as the source IP, with source ports in the 10000+ range. The script now looks like this:

import socket

i = 0
for i in range(1, 2000):
    print("Conn %i" % i)
    s = socket.create_connection(("1.6.1.1", 5060), 10, ("1.1.1.30", i + 10000))

      

This script works fine, which adds to my confidence that the problem is related to the large number of aliased IPs.

+3


3 answers


What a DOH moment. I was watching the server with netstat, and since I didn't see many connections, I didn't think there was a problem there. But I finally wised up and checked /var/log/kernel, where I found this:

Mar  8 11:03:52 TestServer01 kernel: ipv4: Neighbour table overflow.

      



This led me to this post: http://www.serveradminblog.com/2011/02/neighbour-table-overflow-sysctl-conf-tunning/ which explains how to increase the limit. Raising the gc_thresh3 value fixed the problem right away.
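As a rough way to check for this condition, the sketch below reads the neighbour-table thresholds and counts the current ARP entries. It assumes the standard Linux procfs locations /proc/sys/net/ipv4/neigh/default and /proc/net/arp; the function names are hypothetical. On many kernels gc_thresh3 defaults to 1024, which would be consistent with failures starting at 1023 connections.

```python
import os

# Assumed standard Linux procfs locations.
NEIGH_DIR = "/proc/sys/net/ipv4/neigh/default"
ARP_TABLE = "/proc/net/arp"


def read_gc_thresholds(neigh_dir=NEIGH_DIR):
    """Return the gc_thresh1..gc_thresh3 values as a dict of ints."""
    values = {}
    for name in ("gc_thresh1", "gc_thresh2", "gc_thresh3"):
        with open(os.path.join(neigh_dir, name)) as f:
            values[name] = int(f.read().strip())
    return values


def count_arp_entries(arp_table=ARP_TABLE):
    """Count IPv4 neighbour entries; the first line of the file is a header."""
    with open(arp_table) as f:
        lines = [line for line in f if line.strip()]
    return max(0, len(lines) - 1)


def near_overflow(entries, thresholds):
    """True when the entry count has reached the hard limit gc_thresh3."""
    return entries >= thresholds["gc_thresh3"]
```

If near_overflow(count_arp_entries(), read_gc_thresholds()) returns True, the limit can be raised with sysctl -w net.ipv4.neigh.default.gc_thresh3=<value>, where the value depends on how many neighbours the host has to track.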

+2


You can look at the sysctl settings related to net.ipv4. These include things like maxconntrack and other related settings that you can tweak.

0


Are you absolutely sure the issue is not the server side failing to close sockets? That is, what does lsof -n -p show for the server process? What does plimit -p show for the server process? The server side may be stuck, unable to accept any more connections, while the client side gets the EINPROGRESS result.

Check the ulimit for the number of open files on both sides of the connection. 1023 is too close to the usual ulimit of 1024 to be a coincidence.
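From inside a Python process, the same limit can be inspected with the standard resource module. This is a minimal sketch; the helper names are hypothetical, and it only applies to the process that runs it, so the server side has to be checked separately.

```python
import resource


def open_file_limits():
    """Return (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Raise the soft limit toward target, capped by the hard limit.

    Never lowers the limit. Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if target <= soft:
        return soft  # nothing to raise
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target
```

Calling open_file_limits() before opening the connections shows whether the soft limit is anywhere near the number of sockets the test needs.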

0

