Speed up building a graph from 2.92M data points

I have 2.92M data points in a 3.0 GB CSV file and I need to loop over it twice to build a graph that I want to load into NetworkX. At the current rate it will take me days to generate this graph. How can I speed it up?

require 'csv'

similarity = 8

graph = {}
topic_pages = {}

# First pass: each row is a page id followed by its topic attributes.
CSV.foreach("topic_page_node_and_edge.csv") do |row|
  topic_pages[row[0]] = row[1..-1]
end

# Second pass: compare every page against every later page and record an
# edge when they share enough attributes.
CSV.open("generate_graph.csv", "wb") do |csv|
  i = 0
  topic_pages.each do |row|
    i += 1
    row = row.flatten
    topic_pages_attributes = row[1..-1]
    graph[row[0]] = []
    topic_pages.to_a[i..-1].each do |row2|
      row2 = row2.flatten
      topic_pages_attributes2 = row2[1..-1]
      num_matching_attributes = (topic_pages_attributes2 & topic_pages_attributes).count
      if num_matching_attributes >= similarity ||
         num_matching_attributes == topic_pages_attributes2.count ||
         num_matching_attributes == topic_pages_attributes.count
        graph[row[0]].push(row2[0])
      end
    end
    csv << [row[0], graph[row[0]]].flatten
  end
end

      



2 answers


  • A) Benchmark and profile your code, for example with cProfile if you are working in Python (the code in the question is Ruby, where ruby-prof or the standard benchmark library plays the same role; see the first sketch at the end of this answer). It is easy to have a costly inefficiency in your code, and in an intensive application it can dominate your runtime.

    Fancy-looking code such as

    (topic_pages_attributes2 & topic_pages_attributes).count

    can be a major factor in your runtime and can often be replaced by more conventional code.

  • B) Use a more efficient language. For example, on benchmarksgame.alioth, across a number of computation-intensive problems the fastest Python 3 program is a median of 63x slower than the fastest C program (Ruby is 67x, JRuby 33x). Yes, the gap can be that large even with well-optimized Python code, and if your code is not well optimized it can be bigger still; by switching to a more efficient language and optimizing your code carefully you may be able to get a 100x-1000x speedup.

  • C) Consider a smarter formulation of your problem. For example, instead of iterating over every pair of nodes, iterate over each edge once: in your case that probably means building an inverted index, topic -> pages. This is very similar to how text search engines work, and it is a popular way to run such computations on clusters, because individual topics can be handled on separate nodes. How much it helps depends on the sparsity of your data, but it can drastically reduce the runtime of your algorithm (see the inverted-index sketch at the end of this answer).

    You have about 3 million documents, and judging by your total data size they probably average fewer than 100 topics each. Your pairwise comparison approach needs on the order of (3 million)^2 comparisons, and that is what is killing you. If the most popular topics are each used by only 30,000 documents, you can get away with roughly 30,000^2 * number-of-topics comparisons; assuming you have 100 of these very popular topics (rare topics barely matter), that is already a 100x speedup.

  • D) Simplify your problem. For example, first merge all documents that have exactly the same set of topics; a sort makes that easy. To make it more effective, also eliminate topics that occur exactly once. There are probably only around 10,000-100,000 distinct topic sets, so this step, which is easily handled with sorting, can make your problem roughly 900x-90,000x smaller (assuming the ranges above). A sketch of this step also follows at the end of this answer.



Some of these numbers may be overly optimistic. For example, IO is not accounted for at all, and if your problem is I/O bound, C or Java may not help much; and there may be some extremely popular topics that hurt the approach in C). For D) you need O(n log n) time to sort your data, but there are very good implementations for that, and it is definitely a simplification you should make. Note that the merged documents will also form fully connected cliques in your final graph, which may hurt other analyses as well.
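
As a rough illustration of A) in the question's own language: a minimal benchmarking sketch using Ruby's standard benchmark library, with made-up attribute lists and iteration count, that times the intersection expression from the inner loop against a plain Set-based count. It only measures; whether the alternative is actually faster on your data is something the benchmark has to show.

require 'benchmark'
require 'set'

# Hypothetical attribute lists roughly the size of one page's topics.
attrs_a = (1..100).map { |i| "topic_#{i}" }
attrs_b = (50..150).map { |i| "topic_#{i}" }
set_b   = attrs_b.to_set

n = 100_000
Benchmark.bm(22) do |x|
  # The expression used in the question's inner loop.
  x.report("Array#& then count")    { n.times { (attrs_a & attrs_b).count } }
  # A plainer alternative that avoids building an intermediate array.
  x.report("count with Set lookup") { n.times { attrs_a.count { |t| set_b.include?(t) } } }
end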
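
A rough Ruby sketch of the inverted-index idea from C), reusing the file names and threshold from the question; the variable names are mine, and the pair counting still needs memory proportional to the number of co-occurring page pairs, so it relies on individual topics being reasonably rare.

require 'csv'

SIMILARITY = 8

# Invert the input: topic -> pages that mention it.
pages_by_topic = Hash.new { |h, k| h[k] = [] }
topics_by_page = {}

CSV.foreach("topic_page_node_and_edge.csv") do |row|
  page, *topics = row
  topics_by_page[page] = topics
  topics.each { |t| pages_by_topic[t] << page }
end

# Count shared topics only for page pairs that co-occur under some topic.
shared = Hash.new(0)
pages_by_topic.each_value do |pages|
  pages.combination(2) do |a, b|
    key = a < b ? [a, b] : [b, a]
    shared[key] += 1
  end
end

# Keep the question's edge rule: enough shared topics, or one topic set
# fully contained in the other.
edges = Hash.new { |h, k| h[k] = [] }
shared.each do |(a, b), count|
  if count >= SIMILARITY ||
     count == topics_by_page[a].size ||
     count == topics_by_page[b].size
    edges[a] << b
  end
end

CSV.open("generate_graph.csv", "wb") do |csv|
  topics_by_page.each_key { |page| csv << [page, *edges[page]] }
end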
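
And a sketch of the deduplication step from D), again with the question's file name assumed. Dropping topics that occur only once changes the containment checks from the original code, so treat this as a preprocessing idea rather than a drop-in replacement.

require 'csv'

# First pass: count topic frequencies, so topics that occur exactly once
# (they can never be shared by two pages) can be dropped.
topic_counts = Hash.new(0)
CSV.foreach("topic_page_node_and_edge.csv") do |row|
  row[1..-1].each { |t| topic_counts[t] += 1 }
end

# Second pass: keep only repeated topics and group pages by their
# canonicalised (sorted, de-duplicated) topic set.
pages_by_topic_set = Hash.new { |h, k| h[k] = [] }
CSV.foreach("topic_page_node_and_edge.csv") do |row|
  page, *topics = row
  kept = topics.select { |t| topic_counts[t] > 1 }.sort.uniq
  pages_by_topic_set[kept] << page
end

puts "#{pages_by_topic_set.size} distinct topic sets for #{topic_counts.size} topics"
# The quadratic comparison can now run over pages_by_topic_set.keys, and
# every edge found applies to all pages sharing those two sets.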



Most of the time is spent loading the data from disk, I believe. Parallelize reading the data across multiple threads or processes, then build the graph.



You could also build the subgraphs on different machines and combine them later.
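
A rough sketch of the "build subgraphs separately and combine them later" idea on a single machine, with an assumed worker count and file names. MRI Ruby's global VM lock means threads would not speed up the CPU-bound comparison, so this forks worker processes (Unix only); each undirected edge is found from both endpoints, roughly doubling the comparisons, in exchange for each part file being self-contained.

require 'csv'

WORKERS    = 4            # assumed; tune to the number of cores
SIMILARITY = 8

# Load once in the parent; forked children share it via copy-on-write.
topic_pages = {}
CSV.foreach("topic_page_node_and_edge.csv") { |row| topic_pages[row[0]] = row[1..-1] }

keys   = topic_pages.keys
slices = keys.each_slice((keys.size / WORKERS.to_f).ceil).to_a

pids = slices.each_with_index.map do |slice, idx|
  Process.fork do
    CSV.open("generate_graph.part#{idx}.csv", "wb") do |csv|
      slice.each do |page|
        attrs = topic_pages[page]
        neighbours = keys.select do |other|
          next false if other == page
          shared = (topic_pages[other] & attrs).count
          shared >= SIMILARITY ||
            shared == attrs.count ||
            shared == topic_pages[other].count
        end
        csv << [page, *neighbours]
      end
    end
  end
end

pids.each { |pid| Process.wait(pid) }
# Afterwards: cat generate_graph.part*.csv > generate_graph.csv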
