How can I handle a heavy processing load without gobbling up the CPU?
I know this is not the best question. Let me explain.
I am doing a TON of text processing that converts natural language to XML. The text files load pretty quickly and are thrown onto a queue. From there they are pulled one by one into a background worker, which calls our parser (using Boost.Spirit) to convert the text to XML and load the relevant parts into our DB.
The parser can handle about 100 of them at a time. I have rate limiters on the background worker so that it only polls our queue every so often, which means it doesn't work that fast. Right now I can't run more than one worker because my HTTP requests start failing - the worker and the webserver live on the same machine, and I believe this is due to the 80-95% CPU usage, though we could also use more memory.
I need to scale this better. How would you do it?
In answer to a few of the questions:
- we use Amazon Web Services, so buying cheap extra hardware is a little different from spinning up a new Amazon instance - maybe someone has written code that auto-spawns instances to match the load?
- we have an HTTP server that just drops our files into the queue, so the only way it would be affected is if the CPU is busy with a ton of parsing-related work
- I'm already limiting the number of background workers, although we don't apply any limit inside the parser itself
- I haven't tried it yet, though I've used it in the past - I need to write some tests around this
- the parser is completely separate from the web server - we have nginx/merb as our web application server and a rake task calling C++ as our background worker - but they live on the same machine
I would buy a couple of cheap machines and spread the processing across them. As Jeff says in a recent post, "Always try to spend your way out of a performance problem first by throwing faster hardware at it."
Maybe just run the background worker at a lower scheduling priority (e.g. with nice). That way your server can handle requests whenever it needs to, but when it isn't busy, the text processing can run at full speed.
Obviously, this will get you much more value than arbitrarily throttling the background worker.
I'm not sure I'm following your question exactly, but it sounds like you have an HTTP front end that feeds a queue of pending work. Right? A background thread then pulls those requests off the queue and does the heavy lifting, right?
So it sounds like the background process is compute-bound and the foreground process is essentially I/O-bound ... or at least limited enough that new work can always be submitted.
The best way to tune a setup like this is to run the background process at a lower priority than the foreground process. That keeps the foreground responsive while the background still soaks up the spare cycles. Then cap the depth of the interprocess queue so that it never holds more than the maximum amount of work you are willing to defer at once.
One thing I have done, if you have the option, is to migrate those parsing services to a cloud hosting service.
I moved a few of my distributed services (search engine, bulk email, error logging) from our main machine to a cloud service, and it took a fantastic load off our main web server.
Plus, cloud computing is cheap and scales almost without limit.
I don't understand why you would worry about your CPU being at 100%. If the work is needed and it isn't I/O-bound, then your CPU should be at 100%.
That leaves two questions:
- Do you have enough CPU to do all the work you need in the time available?
If not, you need more machines, a faster processor, or more CPU-efficient algorithms. The first two options are probably cheaper than the third - depending on the size of your business!
- Are there any tasks that need to be more responsive than others?
It sounds like there are. You want the HTTP server to stay responsive, while the parser jobs can run at their own pace (as long as the queue empties faster than it fills). As others have noted, nice tells the OS to give low-priority processes only the CPU cycles left over after higher-priority processes have taken what they need (though it's not quite that black and white).
I am assuming you have multiple threads, each belonging to one of two groups:
- group A, which loads the text files
- group B, which converts text to XML
If you think group B is what's limiting you, I would run its threads at a lower priority. If there is enough work, the CPU will still be at 100%, but the web server's responsiveness won't be affected.
If my assumption is correct, you should also benefit from multi-core and multi-processor machines, since this workload should scale very well with more processors.