Data structure for storing IP ranges to quickly search for a given IP (Java)
I am trying to present an efficient data structure for representing IP ranges as described below. I know what I want to try, pretty easy, I just can't seem to show up on it.
So, let's say I have hundreds of separate IP ranges in the format 1.1.1.0 - 1.1.2.255 or whatever (but not CIDR format like 1.1.1.0/24).
The different ranges are not sequential, so there could be millions of IP addresses between the end of one and the start of the next. They could / would have been represented in integer format instead if preferred (i.e. 16843008 - 16843519 in this example).
There would be no overlap of IP addresses in other ranges.
Basically, these ranges are ASN network blocks, if you're interested. And I need to make a tool to determine if any given IP address fits in one of these ranges, but the tool should be pretty fast (less than 0.5 seconds, ideally).
Now if I have hundreds or thousands of these ranges that cover millions of IPs and you want to know if a given IP is in one of the ranges (or not), which would be the fastest way and also not too memory intensive?
There are several options I can think of:
-
Create a HashSet that contains every IP from all ranges and just create (ip) with it. I would expect around 50 million IP addresses. Fast, but seems a little wasteful, memory wise?
-
You have a TreeMap whose key is the starting IP address of each range and whose value is the ending IP address. Walk through the tree and check each key if the test IP is greater than this key but less than the next key. If so, then investigate the value (i.e. the end of the IP range), and if the IP is less than the value of the card, then the IP is in the range - if not, then there is no point in continuing and might assume the IP is not in any of the ranges ... Can a binary search through the keys of the tree jump to the output faster rather than checking the order?
-
Another idea is to have HashMap keys that will be all possible subnets in all ranges (I understand there will be a lot of them), for example "123.123.123, 123.123.124, 123.123.125, 211.211.211, 211.211.215" etc. .d. Then if I am asked to check IP 123.123.124.144, I can see if its subnet (123.123.124) is the key on the map first. The map value can be a custom object containing the starting and ending IP addresses of the range associated with this particular subnet. Then you can just use that to check if the full IP address is in the range. This special object will be shared by many entries on the map, since it is obvious that there can be many subnets in a given range.
So, any thoughts / ideas / opinions? I have a feeling for my second idea, what might be a good way to go? Thanks for the info ... really glad to hear your ideas!
source to share
If the ranges do not contain sub-ranges, you can check the guava RangeSet.
https://code.google.com/p/guava-libraries/wiki/NewCollectionTypesExplained#RangeSet
In fact, I haven't analyzed the time and space complexity of the RangeSet, but the RangeSet seems to fit your requirements well enough.
source to share
I am using an AVL tree with an IP range as a node value and a suitable comparison function. (Where the range is a..b (a <= b) when comparing two ranges r1 and r2: r1 <r2 if r1.b <r2.a; r1 "==" r2 if r1.a> = r2.a and r1.b = r2.b; r1> r2 if r1.a> r2.b. So, "==" means r1 is equal to or included in r2.)
If you don't have a match, that's enough. If you have overlaps (like mine, but I handle network prefixes), you end up with AVL trees nested within AVL trees.
When you say that there is no overlap of ASN network blocks, I am assuming that if the ASN has / yy delegated to it, you split parent / xx into separate but contiguous network blocks.
Since your network block list doesn't change that often, you probably don't need an AVL tree. You can just sort the netblocks and get listed with a binary chop. If you need something that is faster than a binary tree / chop, you can have an auxiliary set of pointers to the binary chop using the ms byte of the start of the search range to determine the first and last ranges to look for in.
This is the structure I am using. Suppose there is another table in this situation, a table locations
to see the purpose and use of IP ranges in a real situation.
--
-- Table structure for table `locations`
--
CREATE TABLE IF NOT EXISTS `locations` (
`location_id` int(10) unsigned NOT NULL,
`parent_id` int(10) unsigned NOT NULL,
`location_name` varchar(64) NOT NULL,
PRIMARY KEY (`location_id`),
KEY `parent_id` (`parent_id`)
);
--
-- Table structure for table `locations_to_ip_ranges`
--
CREATE TABLE IF NOT EXISTS `locations_to_ip_ranges` (
`l_ip_r` int(10) unsigned NOT NULL AUTO_INCREMENT,
`location_id` int(10) unsigned NOT NULL,
`starting_ip` varchar(45) NOT NULL,
`ending_ip` varchar(45) NOT NULL,
`starting_cidr` int(10) unsigned NOT NULL,
`ending_cidr` int(10) unsigned NOT NULL,
PRIMARY KEY (`l_ip_r`),
KEY `location_id` (`location_id`)
);
Here are some records from the second table
l_ip_r location_id starting_ip ending_ip starting_cidr ending_cidr
-------------------------------------------------------------------------------
94005 47 217.147.0.0 217.147.15.255 3650289664 3650293759
94004 47 217.146.32.0 217.146.47.255 3650232320 3650236415
94003 47 217.145.144.0 217.145.159.255 3650195456 3650199551
94002 47 217.145.16.0 217.145.31.255 3650162688 3650166783
94001 47 217.144.176.0 217.144.191.255 3650138112 3650142207
Below is a useful function to convert an IP address to a CIDR number. It is written in PHP, but I believe it would be easy to convert it to Java. The function used here explode()
splits the string according to the specified delimiter.
function ip_address_to_cidr($ip_address){
$ips = explode(".", $ip_address);
return ($ips[3] + $ips[2] * 256 + $ips[1] * 65536 + $ips[0] * 16777216);
}
So, if you want to get the country for a given IP, you would have something like this
// call the ip_ip_address_to_cidr function passing the remote_address as an argument
$cidr = ip_address_to_cidr($_SERVER['REMOTE_ADDR']);
// pass the returned $cidr to the following query and get the location_id
$set = mysql_query("
SELECT location_id
FROM locations_to_ip_ranges
WHERE " . $cidr . " BETWEEN starting_cidr AND ending_cidr
");
$row = mysql_fetch_object($set);
echo $row->location_id
Since I need multi-language support, there is another table locations_to_languages
I left behind to keep it simple and straightforward. These tables currently contain tens of millions of data and I have no performance issues.
Side note: I haven't used Java for a long time, so the snippet above is in PHP, but I find it would not be easy to understand the logic and convert it to Java if needed.
source to share
We can use btrees through which we can map the ip address in primary memory and then map them to secondary memory. Since we know that we need to store a large enough number of ip addresses, we cannot store them in primary memory, since this is the case, it is better if we use btree. Since it is similar to the hashing principle, it is also memory efficient.
source to share