How can I handle a heavy workload without gobbling CPU?

I know this is not the best question. Let me explain.

I am doing a TON of text processing which converts natural language to XML. The text files load pretty quickly and are thrown onto a queue. From there they are pulled one by one by the background worker, which calls our parser (built on Boost.Spirit) to convert the text to XML and load the relevant parts into our DB.

The parser can handle about 100 files at a time. I have rate limiters on the background worker so it only polls our queue every so often, which keeps it from running at full speed. Right now I can't run more than one worker because my HTTP requests start to fail - the worker and the web server live on the same machine, and I believe this is due to the 80-95% CPU usage, though we could also use more RAM.

I need to scale this better. How would you do it?

In answer to a few questions:

  • we use Amazon Web Services, so buying cheap extra hardware is a little different from spinning up a new Amazon instance - has anyone written code that auto-spawns instances to match the load?

  • We have an HTTP server that just drops our files onto the queue, so the only reason it would be affected is that the CPU is busy with a ton of parsing-related work.

  • I'm already limiting the number of background workers, although we don't apply that limit inside the parser itself

  • I haven't tried it yet, but I've used it in the past - I need to write some tests around this

  • the parser is completely separate from the web server - we have nginx/Merb as our web application server and a rake task calling C++ as our background worker - but they live on the same machine

+1




10 replies


I would buy a couple of cheap computers and use them to do the processing. As Jeff says in a recent post, "Always try to spend your way out of a performance problem first by throwing faster hardware at it."



+4




Maybe just run the background worker at a lower scheduling priority (e.g. with nice ). That way your server can handle requests whenever it needs to, and when it isn't busy, the machine can devote its full power to the text processing.



This will obviously do you much more good than arbitrarily throttling the background worker.

+8




I'm not sure I'm following your question exactly, but it sounds like you have an HTTP front end that feeds a pending queue, right? A background thread then takes those requests off the queue and does the heavy lifting, right?

So it sounds like the background process is compute-bound, and the foreground process is essentially I/O-bound ... or at least limited enough that new work can always be submitted.

The best way to optimize such a setup is to run the background process at a lower priority than the foreground process. That ensures the background process still gets work done without starving the foreground. Then cap the depth of the interprocess queue at the maximum amount of work you're willing to have outstanding at once.
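A capped queue like the one described can be sketched with a mutex and two condition variables; once `max_depth` jobs are pending, `push()` blocks, which naturally throttles intake. This is a generic sketch (class and member names are assumptions, not from the question's code):

```cpp
// Bounded job queue: producers block in push() when the queue is full,
// consumers block in pop() when it is empty.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t max_depth) : max_depth_(max_depth) {}

    // Called by the HTTP side; blocks once max_depth jobs are pending.
    void push(std::string job) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [&] { return q_.size() < max_depth_; });
        q_.push(std::move(job));
        not_empty_.notify_one();
    }

    // Called by the background worker; blocks until a job is available.
    std::string pop() {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [&] { return !q_.empty(); });
        std::string job = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return job;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(m_);
        return q_.size();
    }

private:
    mutable std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::queue<std::string> q_;
    std::size_t max_depth_;
};
```

The blocking `push()` is the backpressure mechanism: when the parser falls behind, the intake slows down instead of the backlog growing without bound.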

+3




One thing I have done, if you have the option, is to migrate those parsing services to a cloud hosting service.

I moved a few of my distributed services (search indexing, bulk email, error logging) from my main machine to a cloud service, and it took a fantastic amount of load off our main web server.

Plus, cloud computing is cheap and scales almost infinitely.

+1




I don't understand why you're worried about your CPU being at 100%. If the work is needed and it isn't I/O-bound, then your CPU should be at 100%.

That leaves two questions:

  • Do you have enough CPU to do all the work you need in the time available?

If not, you need more machines, a faster processor, or more CPU-efficient algorithms. The first two options are probably cheaper than the third - depending on the size of your business!

  • Are there tasks that need to be more responsive than others?

It sounds like there are. You want the HTTP server to stay responsive, while the parser jobs can run at their own pace (as long as the queue empties faster than it fills). As others have noted, nice tells the OS to schedule low-priority processes onto the CPU cycles left over after higher-priority processes have taken what they need (though it's not quite that black and white).

+1




I am assuming you have multiple threads, each belonging to one of two groups:

  • group A, which loads text files
  • group B, which converts text to XML

If you think group B is what's limiting your throughput, I would put its threads at a lower priority. If there is enough work, the CPU will still run at 100%, but the loading won't be affected.

If my assumption is correct, you would also benefit from multi-core and multi-processor machines, since your workload should scale very well across more processors.

0




I would put the parser on its own machine. That way it won't affect the web server.

If you don't have the budget for another computer, use virtualization ( OpenVZ is great if your web server runs Ubuntu or CentOS) to limit the parser's CPU quota.

0




If you are having trouble serving requests, you can try lowering the priority of the CPU-bound tasks and/or raising the priority of the HTTP server. Basically, use the system scheduler to your advantage and don't treat all tasks as equals.

0




I don't know what OS you're using, but most of them have facilities for prioritizing threads and processes. As long as the parser process runs at a lower priority than the HTTP process, you should be fine.

0




Never forget energy/hosting prices. Try to find the bottleneck in your code - if you haven't profiled yet, I'm sure you can cut CPU consumption by 25-50%.

0








