Comparing CPU speeds to justify a hardware purchase

I have a C# console application that runs a Monte Carlo simulation; its execution time is inversely proportional to the number of allocated threads/cores (I keep a 1:1 ratio between threads and cores).

It currently runs daily on:

AMD Opteron 275 @ 2.21 GHz (4 cores)

The application is multithreaded and uses 3 threads; the 4th core is reserved for a separate process-controller application.

It takes 15 hours a day to run.

I need to estimate, as accurately as possible, how long the same job will take to run on a system configured with either of the following processors:

http://en.wikipedia.org/wiki/Intel_Nehalem_(microarchitecture)
2 x X5570
2 x X5540 


and compare the two cases. I will recode the application to use whatever threads are available. I want to justify buying a server with 2 x X5570 processors versus the cheaper X5540 (both support two processors on one motherboard). That would expose 8 cores / 16 hardware threads (which I believe Nehalem chips support) to the operating system, so my application would get 15 threads for the Monte Carlo simulation.

Any ideas how to do this? Is there a site where I can see benchmark data for all 3 CPUs in a single comparable test? Then I could extrapolate to my case and thread count. I have access to the current system, so I can install and run a test if needed.
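Since you can install and run a test on the current system, one way to get directly comparable numbers is to time a fixed batch of simulation iterations at different thread counts and run the same binary on any candidate machine. A minimal sketch, assuming a stand-in workload (`DoIterations` below is a dummy dart-throwing kernel, not your real simulation — replace it with a representative slice of your actual code):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public class ScalingBenchmark
{
    // Stand-in workload: one batch of Monte Carlo iterations.
    // Replace with a representative slice of the real simulation kernel.
    public static double DoIterations(int count, int seed)
    {
        var rng = new Random(seed);
        double hits = 0;
        for (int i = 0; i < count; i++)
        {
            double x = rng.NextDouble(), y = rng.NextDouble();
            if (x * x + y * y <= 1.0) hits += 1;
        }
        return hits;
    }

    // Times 'totalIterations' of work split evenly across 'threads'
    // worker threads. With ideal scaling the result halves each time
    // the thread count doubles.
    public static double SecondsFor(int threads, int totalIterations)
    {
        var workers = new Thread[threads];
        var sw = Stopwatch.StartNew();
        for (int t = 0; t < threads; t++)
        {
            int seed = t;
            workers[t] = new Thread(
                () => DoIterations(totalIterations / threads, seed));
            workers[t].Start();
        }
        foreach (var w in workers) w.Join();
        sw.Stop();
        return sw.Elapsed.TotalSeconds;
    }
}
```

Running `SecondsFor` for 1, 2, and 3 threads on the Opteron box, and again on an evaluation Nehalem box if you can borrow one, gives you a measured per-thread speedup instead of an extrapolated one.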

Please note that the business also dictates the workload for this application: over the next 3 months it will increase by about 20x, and it must still complete within 24 hours.

Any help is greatly appreciated.

You posted this here as well: http://www.passmark.com/forum/showthread.php?t=2308 — hopefully they can explain their benchmarking better, so you can get a per-core score, which would be much more helpful...



5 answers


tomshardware.com has a comprehensive list of CPU benchmarks. However... you can't just read the numbers off: you need as close to an apples-to-apples comparison as possible, and you won't quite get one, because how each chip performs depends on the instruction mix of your particular workload.

I would suggest — and please don't take this as gospel, you need real data for this — that you assume roughly a 1.5x - 1.75x single-thread speedup, if the work is CPU-bound and not heavily vectorized.

You also need to take into account that: 1) You are using C# and the CLR; unless you have taken steps to avoid it, garbage collection can interrupt and serialize your threads. 2) Nehalems have hyper-threading, so you won't see a perfect 16x speedup; most likely you will see between 8x and 12x depending on how optimized your code is. Be optimistic here (just don't expect 16x). 3) I don't know how much contention you have; good scaling on 3 threads != good scaling on 16 threads. There can be dragons here (and usually are).
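One classic source of the "dragons" in point 3 for Monte Carlo code is a shared `System.Random` instance: locking around it serializes every thread, and sharing it without a lock corrupts its state. A sketch of the usual fix — one generator per thread, seeded distinctly (the class and seeding scheme here are illustrative, not from the original post):

```csharp
using System;
using System.Threading;

public class PerThreadRng
{
    private static int _seedSource = Environment.TickCount;

    // One Random per thread, each seeded differently, so the sampling
    // hot path never takes a lock and never shares generator state.
    private static readonly ThreadLocal<Random> Rng =
        new ThreadLocal<Random>(
            () => new Random(Interlocked.Increment(ref _seedSource)));

    public static double NextSample()
    {
        return Rng.Value.NextDouble();
    }
}
```

`ThreadLocal<T>` requires .NET 4.0; on 3.5 the `[ThreadStatic]` attribute with lazy initialization achieves the same effect.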

I would approach it like this:

15 hours * 3 threads / 1.5x speedup = 30 hours of single-threaded work on a Nehalem.

30/12 = 2.5 hours (best case)



30/8 = 3.75 hours (worst case)

Implied parallel execution times if there is indeed a 20x increase: 2.5 hours * 20 = 50 hours (best case)

3.75 hours * 20 = 75 hours (worst case)

How much have you profiled? Can you squeeze 2x out of the app? That might be enough, but most likely it won't be.

And while you're at it, try the Task Parallel Library in .NET 4.0 (or the .NET 3.5 CTP), which should help with things like this.
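As a rough sketch of what a Task Parallel Library version might look like — estimating pi by dart-throwing is a stand-in here for the real simulation kernel, and the chunking scheme is an assumption, not the poster's design:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public class ParallelPi
{
    // Estimates pi by sampling points in the unit square. Work is
    // split into chunks; each chunk gets its own Random, and the only
    // cross-thread synchronization is one Interlocked.Add per chunk.
    public static double Estimate(int chunks, int samplesPerChunk)
    {
        long totalHits = 0;
        int seedSource = 12345;

        Parallel.For(0, chunks, chunk =>
        {
            var rng = new Random(Interlocked.Increment(ref seedSource));
            long hits = 0;
            for (int i = 0; i < samplesPerChunk; i++)
            {
                double x = rng.NextDouble(), y = rng.NextDouble();
                if (x * x + y * y <= 1.0) hits++;
            }
            Interlocked.Add(ref totalHits, hits);
        });

        return 4.0 * totalHits / ((double)chunks * samplesPerChunk);
    }
}
```

`Parallel.For` sizes itself to the available cores, so the same code would use 4 threads on the Opteron and 16 on a dual-socket Nehalem without changes.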

-Rick



Have you considered rewriting the algorithm in CUDA? It uses the GPU to scale up computations like this by 10x to 100x. All you'd need to buy is a high-end graphics card.





Finding a single server that can scale to meet the needs you described will be difficult. I would recommend looking at Sun CoolThreads or other high-thread-count servers, even if their individual clock speeds are lower. http://www.sun.com/servers/coolthreads/overview/performance.jsp

The T5240 supports 128 threads: http://www.sun.com/servers/coolthreads/t5240/index.xml

Memory and processor cache bandwidth may be a limiting factor for you if the datasets are as large as they sound. How long does it take to read the data from disk? Can the amount of RAM and cache be increased?

You might also want to step back and see whether there is another algorithm that can produce the same or similar results with less computation.

It sounds like you've spent a lot of time optimizing the computation itself, but is every computation being performed actually important to the end result?

Is there a way to shorten computation anywhere?

Is there a way to identify elements that have little impact on the end result and skip those calculations?

Can a lower resolution model be used for early iterations with detail added in progressive iterations?

The Monte Carlo algorithms I am familiar with are nondeterministic, and execution time is proportional to the number of samples; is there a way to optimize the sampling model to reduce the number of samples examined?
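On that sampling question: if an approximate answer is acceptable, one option is to stop when the estimate's standard error drops below a tolerance instead of running a fixed sample count. A hedged sketch of that idea — the `sample` delegate stands in for one draw of your model, and the 100-sample warm-up is an arbitrary choice:

```csharp
using System;

public class AdaptiveSampler
{
    // Draws samples until the standard error of the running mean falls
    // below 'tolerance' (or maxSamples is reached), using Welford's
    // online algorithm to track mean and variance in a single pass.
    public static double Run(Func<double> sample, double tolerance,
                             long maxSamples, out long samplesUsed)
    {
        double mean = 0, m2 = 0;
        long n = 0;
        while (n < maxSamples)
        {
            n++;
            double x = sample();
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);

            // Need a few samples before the variance estimate means anything.
            if (n >= 100)
            {
                double stdErr = Math.Sqrt(m2 / (n - 1) / n);
                if (stdErr < tolerance) break;
            }
        }
        samplesUsed = n;
        return mean;
    }
}
```

For a well-behaved estimator this trades a controlled loss of precision for a potentially large reduction in samples, which attacks the 20x growth problem directly rather than through hardware.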

Obviously, I don't know what problem domain or dataset you are processing, but there may be another approach that can produce equivalent results.



I'm going to go out on a limb and say that even a dual-socket X5570 system won't be able to handle the workload you envision. You will need to distribute your computation across multiple systems. Simple math:

Current workload:

3 cores * 15 real-world hours = 45 CPU-hours

Proposed 20x workload:

45 CPU-hours * 20 = 900 CPU-hours
900 CPU-hours / (20 hours per day per core) = 45 cores

Thus, to achieve your goal you would need the equivalent of 45 of your 2.2 GHz Opteron cores (even after stretching the processing window from 15 to 20 hours per day), assuming perfectly linear scaling. Even if the Nehalem processors are 3x faster per thread, you will still be at the outer edge of your performance envelope, with no headroom for growth. This also assumes that hyper-threading works well for your application.
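The arithmetic above generalizes into a one-line capacity check that can be re-run as assumptions change. A sketch — the 3 cores, 15 hours, 20x growth, and 20-hour daily budget are the figures from this thread; everything else is a parameter:

```csharp
using System;

public class CapacityCheck
{
    // Cores needed = current CPU-hours * growth factor / daily per-core budget.
    public static double CoresNeeded(double coresNow, double hoursNow,
                                     double growthFactor,
                                     double hoursPerDayBudget)
    {
        double cpuHours = coresNow * hoursNow * growthFactor;
        return cpuHours / hoursPerDayBudget;
    }
}
```

For example, `CoresNeeded(3, 15, 20, 20)` gives the 45 Opteron-equivalent cores computed above; dividing by a measured per-core speedup converts that into Nehalem cores.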

The best benchmarks I've seen would put the X5570 at perhaps 2x the performance of your existing Opteron.

Source: http://www.dailytech.com/Server+roundup+Intel+Nehalem+Xeon+versus+AMD+Shanghai+Opteron/article15036.htm



This is swinging a big hammer, but it might make sense to look at some heavy-iron 4-way servers. They are expensive, but at least you could get up to 24 physical cores in a single box. If you've exhausted all other optimization avenues (including SIMD), then it's something to consider.

I would also be wary of other bottlenecks, such as memory bandwidth. I don't know the performance characteristics of Monte Carlo simulations, but scaling up one resource may simply expose a bottleneck somewhere else.







