What can I do with this bottleneck

Question

What can I do with this bottleneck

I just found a way to optimize my code's algorithm for 50 to 15 minutes, but it takes 14 minutes. This will be part of a larger modeling system, so I cannot let it run any longer. Since I have to compare all the values of my vector that have about 100,000 values (10 billion comparison), I was wondering if there is a way to optimize the code.

struct Coor 
{ 
    double x1; double y1; //Coordinate of Node 1
    double x2; double y2; //Coordinate of Node 2
    std::vector<int> C1; //Index of the edges connected to Node 1
    std::vector<int> C2; //Index of the edges connected to Node 2
};

std::vector<Coor> Connection_S(std::vector<Coor> Nodes)
{
    N = Nodes.size();

    for (i = 0; i < N; ++i)
    {
        for (j = 0; j < N; ++j)
        {
            if (i == j)
            { 
                continue; 
            }
            if ( ( Nodes[i].x1 == Nodes[j].x1 && Nodes[i].y1 == Nodes[j].y1 ) ||
                 ( Nodes[i].x1 == Nodes[j].x2 && Nodes[i].y1 == Nodes[j].y2 ) )
            {
                Nodes[i].C1.push_back(j);
            }
            if ( ( Nodes[i].x2 == Nodes[j].x1 && Nodes[i].y2 == Nodes[j].y1 ) ||
                 ( Nodes[i].x2 == Nodes[j].x2 && Nodes[i].y2 == Nodes[j].y2 ) )
            {
                Nodes[i].C2.push_back(j);
            }
        }
    }
    return Nodes;
}

I'm still relatively new to C ++, so I'm not used to everything the language has to offer and the distinction that makes a function faster than another.

+3

c ++ optimization algorithm for-loop graph-algorithm

Dom C. June 26. 15 at 15:26

source to share

5 answers

astraujums · Answer 1 · 2015-06-26T15:51:20+0000

The function Connection_S_optimized

in the example code below can process an array of 100,000 Coor in 3 seconds, and the original code in 55 seconds.

The basic idea behind the code below is that the endpoints are sorted and put into a multimap. Both coordinates node 1 and node 2 are placed on the same map, marking which one it was.

Then with one pass through the groups of endpoints that match, we fill the C1 and C2 arrays of each Coor.

Note that the optimized version works differently than the original on vectors for which x1, y1 ir is the same as x2, y2.

As someone already pointed out, comparing doubles for equality can be dangerous, so you might want to adjust the function is_same_coordinate

to compare approximately.

This SEEMS code works, but use it at your own risk.

#include <vector>
#include <map>
#include <algorithm>
#include <time.h>
#include <iostream>

struct Coor
{
    double x1; double y1; //Coordinate of Node 1
    double x2; double y2; //Coordinate of Node 2
    std::vector<int> C1; //Index of the edge connected to Node 1
    std::vector<int> C2; //Index of the edge connected to Node 2
};

bool is_same_coordinate(std::pair<double, double> e1, std::pair<double, double> e2)
{
    return (e1.first == e2.first) && (e1.second == e2.second);
}

std::vector<Coor> Connection_S(std::vector<Coor> Noeuds)
{
    size_t N = Noeuds.size();

    for (size_t i = 0; i < N; ++i)
    {
        for (size_t j = 0; j < N; ++j)
        {
            if (i == j)
            {
                continue;
            }
            if ((Noeuds[i].x1 == Noeuds[j].x1 && Noeuds[i].y1 == Noeuds[j].y1) || (Noeuds[i].x1 == Noeuds[j].x2 && Noeuds[i].y1 == Noeuds[j].y2))
            {
                Noeuds[i].C1.push_back(j);
            }
            if ((Noeuds[i].x2 == Noeuds[j].x1 && Noeuds[i].y2 == Noeuds[j].y1) || (Noeuds[i].x2 == Noeuds[j].x2 && Noeuds[i].y2 == Noeuds[j].y2))
            {
                Noeuds[i].C2.push_back(j);
            }
        }
    }
    return Noeuds;
}


void Connection_S_optimized(std::vector<Coor>& Noeuds)
{
    // A map of an endpoint coordinates to the information about this enpoint <is it the Node 1 (x1, y1), index in Noeuds>
    std::multimap<std::pair<double, double>, std::pair<bool, size_t>> node_index;

    for (size_t i = 0; i < Noeuds.size(); i++)
    {
        Coor& c = Noeuds[i];
        node_index.insert(std::make_pair(std::pair<double, double>(c.x1, c.y1), std::pair<bool, size_t>(true, i)));
        node_index.insert(std::make_pair(std::pair<double, double>(c.x2, c.y2), std::pair<bool, size_t>(false, i)));
    }
    auto s_representative_it = node_index.begin();
    for (auto s_it = node_index.begin();; s_it++)
    {
        if (s_it == node_index.end() || !is_same_coordinate(s_representative_it->first, s_it->first))
        {
            auto start = s_representative_it;
            auto end = s_it;
            auto current = s_representative_it;
            while (current != end)
            {
                bool is_node_1 = current->second.first;
                Coor& current_coor = Noeuds[current->second.second];
                auto it = start;
                while (it != end)
                {
                    if (it != current)
                    {
                        if (is_node_1)
                        {
                            current_coor.C1.push_back(it->second.second);
                        }
                        else
                        {
                            current_coor.C2.push_back(it->second.second);
                        }
                    }
                    it++;
                }
                current++;
            }
            if (s_it == node_index.end())
            {
                break;
            }
            s_representative_it = s_it;
        }
    }
}

const size_t NUM_COORS = 100000;

void generate_sample_set(std::vector<Coor>& Noeuds)
{
    Coor c;
    size_t degenerated = 0;
    for (size_t i = 0; i < NUM_COORS + degenerated; i++)
    {
        c.x1 = i % 23;
        c.x2 = i % 13;
        c.y1 = i % 5;
        c.y2 = i % 17;
        if (is_same_coordinate(std::make_pair(c.x1, c.y1), std::make_pair(c.x2, c.y2)))
        {
            degenerated++;
            continue;
        }
        Noeuds.push_back(Coor(c));
    }
}

int main(int argc, char** argv)
{
    std::vector<Coor> Noeuds_input;

    generate_sample_set(Noeuds_input);

    std::vector<Coor> Noeuds_original = Noeuds_input;
    std::vector<Coor> Noeuds_optimized = Noeuds_input;

    double time_original = clock();
    Noeuds_original = Connection_S(Noeuds_original);
    time_original = (clock() - time_original) / CLOCKS_PER_SEC;

    double time_optimized = clock();
    Connection_S_optimized(Noeuds_optimized);
    time_optimized = (clock() - time_optimized) / CLOCKS_PER_SEC;

    for (size_t i = 0; i < std::min(Noeuds_input.size(), 100u); i++)
    {
        std::cout << i << ": " << Noeuds_original[i].C1.size() << "," << Noeuds_original[i].C2.size()
            << " vs " << Noeuds_optimized[i].C1.size() << "," << Noeuds_optimized[i].C2.size() << std::endl;
    }

    std::cout << "Processing time for " << NUM_COORS << " items (in seconds):" << std::endl;
    std::cout << "  original: " << time_original << std::endl;
    std::cout << " optimized: " << time_optimized << std::endl;

    return 0;
}

Thomas matthews · Answer 2 · 2015-06-26T16:55:19+0000

Compiler optimization options

First, print out a list of assembly languages.
Then set the optimization levels to High for Speed; recompile.
Compare an optimized build with a non-optimized build.

Variable preloads

You can get some savings by preloading the comparison values. (Note: the compiler can do this already, check your local assembly language for the truth.)
Example:

const double ni_x1(Nodes[i].x1);
const double ni_x2(Nodes[i].x2);
const double nj_y1(Nodes[i].y1);
const double nj_y2(Nodes[i].y2);

if (((ni_x1 == nj_x1) && (ni_y1 == nj_y1))
// ...

The optimization technique is to let the processor preprogram the data into its data cache.

Branch reduction instructions

Branch instructions take longer to execute by the processor than data instructions. Therefore, if possible, eliminate them.
(Some processors have enough cache to load a loop into the instruction cache without having to reboot. In any case, the processor still has additional logic to execute, which takes longer than processing a data instruction.)

You can use some boolean algebra. Again, take a look at assembly language to see if you have reached any speed. Example:

bool is_equal = false;
is_equal = (ni_x1 == nj_x1);
is_equal = is_equal && (ni_y1 == nj_y1);

The above may allow the compiler to generate conditional build instructions if your processor has that. Hopefully the compiler can generate continuous data instructions.

Fixed point arithmetic

Another option is to use fixed point arithmetic. This will allow you to perform integral arithmetic operations, which are usually faster than floating point operations.

For example, given volumes in Liters, it is possible to have 3.141 liters. If the value is in milliliters, the value will be cumulative: 3141.

Benefits: Improved accuracy and equity. For example, with a 32-bit processor, you might have 32 bits of the mantissa, whereas a floating point might only have 24 bits of the mantissa, because some bits are reserved for sign and exponent.

Satish chalasani · Answer 3 · 2015-06-26T15:30:39+0000

you can resize the Noeuds [i] vectors before the loop. This will improve memory management and performance. Pass the Noeuds structure by reference. Now he copies it. It will take time .

Change Connection_S(std::vector<Coor> Noeuds)

to Connection_S(std::vector<Coor> &Noeuds)

When you follow a link, you don't need to return it. The original will be updated directly.

On a side note: comparing doubles to == doesn't always give you the same result. So it's dangerous.

Pandrei · Answer 4 · 2015-06-26T15:41:12+0000

there are several things you can do:

sort the entire vector in advance. This operation is O (nlogn) compared to what you are currently doing, it is O (N * N). Than you can find a more efficient way to fill C1 / C2.
You can also try doing more than one comparison at a time; j will increase by 2, 4..s. This will improve your locality of your cache and generate more compact code.

Wouter verhelst · Answer 5 · 2015-06-26T15:43:19+0000

Side note: plural "Node" - "Nodes"; -)

std::vector

internally uses dynamically allocated arrays, which means that an operation push_back

as described here may need to reallocate memory and then copy all data already into a vector. Depending on the size of your vector, this can be quite an expensive operation.

Consider using some other containers instead. std::list

is efficient enough for adding elements at the end, but does not support random access. Alternatively, you can use std::deque

which does not guarantee that all elements are in contiguous storage (like a vector), but still allows for pretty efficient random access.

What can I do with this bottleneck

Compiler optimization options

Variable preloads

Branch reduction instructions

Fixed point arithmetic

More articles: