Posting a large body crashes our stack - but only in production mode?

Question

Our Rails application has two environments that we deploy servers to: the staging environment and the default (production) environment.

The staging.rb file is a copy of the production.rb file from the config/environments folder. The only difference between the two is that whiny nils is set to true in staging:

config.whiny_nils = true

Since the Rails app is primarily used as an API, we ran it on one of our internal staging servers for the developers who work with it. It worked without a hitch for almost 4 months. When it came time to move to our production server, the stack started to crash consistently whenever a POST or PUT came in with a large (sometimes VERY VERY large) body. When testing between the two servers, the same requests were processed fine on the staging server.

The most frustrating part of the crashes/freezes was the lack of logs or any trace of where in the stack (nginx, Phusion Passenger, Ruby 1.9 patch level 243, Rails 2.3.4) the crash occurred. Nothing showed up in the nginx error log, the Rails logs, or anywhere else we could find. Since we were running the production server with newer versions of nginx, Passenger and Ruby (a higher patch level than staging, but still 1.9), we started rolling back each component one at a time, even going as far as copying all the executables and support files (basically everything we installed in /usr/local) over to the production machine, to no avail. Just as we were about to wipe the machine and redo every step, someone suggested switching the production machine to the "staging" environment, and like magic, problem solved!

Wanting to know what might have caused the error, we started combing the Rails core, our own code and all our plugins for some clue as to what could cause such a massive hang/crash only in the production environment, again to no avail.

The only clue I could find was in the behavior. When testing the running application (one of the pages actually goes through the Rails application), I could crash the application by sending a request; then, after several refreshes (usually 3-4), I could get an error out of Nginx, and eventually the application would start processing requests again. The error is as follows:

    Error during failsafe response: incompatible character encodings: UTF-8 and ASCII-8BIT
    2009/10/09 17:52:40 [error] 8691#0: *88 upstream prematurely closed connection while reading response header from upstream, client: *my ip address*, server: myapp.mydomain.com, request: "GET /api/sections/4/edit HTTP/1.1", upstream: "passenger://unix:/tmp/passenger.8677/master/helper_server.sock:", host: "myapp.mydomain.com"
    *** Exception NoMethodError in application (undefined method `each' for nil:NilClass) (process 8703):
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/rack/request_handler.rb:95:in `process_request'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_request_handler.rb:206:in `main_loop'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/railz/application_spawner.rb:376:in `start_request_handler'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/railz/application_spawner.rb:334:in `block in handle_spawn_application'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/utils.rb:182:in `safe_fork'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/railz/application_spawner.rb:332:in `handle_spawn_application'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server.rb:351:in `main_loop'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server.rb:195:in `start_synchronously'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server.rb:162:in `start'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/railz/application_spawner.rb:213:in `start'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/spawn_manager.rb:261:in `block (2 levels) in spawn_rails_application'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server_collection.rb:126:in `lookup_or_add'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/spawn_manager.rb:255:in `block in spawn_rails_application'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server_collection.rb:80:in `block in synchronize'
    from: 8: in `synchronize'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server_collection.rb:79:in `synchronize'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/spawn_manager.rb:254:in `spawn_rails_application'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/spawn_manager.rb:153:in `spawn_application'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/spawn_manager.rb:286:in `handle_spawn_application'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server.rb:351:in `main_loop'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server.rb:195:in `start_synchronously'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/bin/passenger-spawn-server:61:in ''

Usually when a character encoding error occurs, my first suspect is Ruby 1.9. However, as you can tell from my testing, it was the same version on both machines!

After all this, I guess I'm wondering ... does anyone have any idea what is going on? Obviously we can run our application under development, but I'm worried that I may have stumbled onto something deeper that needs to be addressed. Any ideas about where I should look next?

Our setup: Mac OS X Server: 10.6.1,
Rails 2.3.4,
Ruby 1.9p243,
Nginx 0.8.17,
Passenger 2.2.5

Our required gems:
environment.rb
daemons
RMagick
test.rb
RSpec
rspec-rails
factory_girl
rack

Our installed plugins:
acts-as-dag (ActiveRecord plugin for creating directed acyclic graphs)
daemon_generator
globalize2
no-peep (for testing)
thinking-sphinx

UPDATE (In response to khelll):

I tried adding config.whiny_nils = true to the production environment, but the crash still happens.

Also, I went back to our staging server and set the environment to "production". Same crash!

Some clarification on what I mean by a "large" request body. One POST/PUT that consistently crashed the application was about 20,000 characters (JSON). Since the API is used throughout the day with small PUTs/POSTs and stayed up, and only crashed/hung when these large requests were made, I assumed the two were related.

As for Rack / Ruby 1.9: because of the amount of information out there about problems with Rack and 1.9, I updated our Rack gem to the latest in the git repository (which supposedly fixed some of the 1.9 issues). I read about significant difficulties involving rewindable_input, Ruby 1.9 and more ... however, since I was not getting the rewindable_input error that I had run into with another 1.9 application, I assumed it was a different problem. I also ruled out Rack because changing the Rails environment resolved the issue (when I searched the Rack source code, there didn't seem to be any environment-specific code paths that would throw the error).
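For anyone trying to reproduce this outside the browser, here is a minimal sketch of building a PUT with a JSON body of roughly that size. The host and path are taken from the Nginx log above; the `section`/`body` payload keys and the `build_large_put` helper are hypothetical and only there to pad the request to about 20,000 characters.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical helper: builds (but does not send) a PUT request whose JSON
# body contains a filler string of target_chars characters.
def build_large_put(url, target_chars = 20_000)
  uri = URI.parse(url)
  # Assumed payload shape; the real API's parameters are not shown in the post.
  payload = { "section" => { "body" => "x" * target_chars } }

  request = Net::HTTP::Put.new(uri.request_uri)
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(payload)
  [uri, request]
end

uri, request = build_large_put("http://myapp.mydomain.com/api/sections/4")

# To actually send it against a test server (left commented out on purpose):
# response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
# puts response.code
```

Sending the same request against the staging and production environments, and varying `target_chars`, should help narrow down the body size at which the hang starts.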

Hope this helps!

UPDATE in response to pauliephonic

There are no requests hitting the Rails logs at all (which is what prompted me to spend some time going through our web stack for this issue). My clue that a crash/freeze has occurred is that after a large request, the application only returns 500 errors for every request, yet those 500 errors do not show up in the Rails logs.

Our database configuration is identical (we used a MySQL cluster, so it was literally identical; each server now uses a local MySQL database, but I confirmed that the error occurs regardless of the database used).

Regarding multibyte / unicode: we are working on an internationalized application. However, I don't think Rails' handling of unicode changes between production and the other environments? As I said above, this happened on POST or PUT. The way I tested during my debugging was to go to the same edit page of one of my large, heavily nested models and just try to "save" it. This crashes the application in production, but not in development. Every time I tested: same characters, same content, same button, same behavior... different response based on environment. I couldn't even put puts statements everywhere in my code because (it seemed like) the requests weren't getting to the Rails app. I have not received any error messages in the Rails logs or Nginx error logs (other than the ones I posted above).

ruby ruby-on-rails passenger rack

BushyMark 10 oct. '09 at 1:52

4 answers

khelll · Answer 1 · 2009-10-10T02:51:59+0000

I understand that config.whiny_nils makes all the difference. You could look at the activesupport/lib/active_support/whiny_nils.rb file (which looks quite simple) and try to play around from there to see what makes the difference. I believe this has to do with the type of exception you are getting in production, which may not be thrown when using whiny_nils.

I believe you need to give more details on "posting a large body crashes our stack", because that might be a problem with Rack and Ruby 1.9.

Bob aman · Answer 2 · 2009-10-19T23:14:27+0000

By the looks of that error, Ruby is getting hung up on encodings. You have a string that it considers to be UTF-8 and you are treating it as raw bytes, perhaps correctly. You need to identify the problem string and call buggy_string.force_encoding(Encoding::ASCII_8BIT). But heck if I know which part of the code is working on that string or why it only happens in production. I would not be surprised if the problem turns out to be deep within the bowels of Passenger.

As for the difference between staging and production, it is almost certainly an issue with something that needs to be handled as a byte string being treated as a character string. There's a ton of code that only runs in production (like caching), and if any of that code does this, there's your problem.

The whiny_nils thing probably doesn't matter.
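To illustrate the kind of failure described here, a minimal sketch in plain Ruby (no Rails or Passenger involved): concatenating a UTF-8 string with a binary string that contains high bytes raises the same "incompatible character encodings" error, and forcing both sides to the same encoding avoids it. The variable names are made up for the example, and whether ASCII-8BIT is the right target depends on what the real data is.

```ruby
# A UTF-8 string containing a non-ASCII character.
utf8 = "caf\u00e9"

# The same two bytes for "é", but tagged as raw bytes (ASCII-8BIT),
# the way a request body read from a socket typically is.
raw = "\xC3\xA9".b

error_message = nil
begin
  utf8 + raw
rescue Encoding::CompatibilityError => e
  # Same class of error as the "failsafe response" line in the question.
  error_message = e.message
end

# Treat both sides as bytes so they can be combined without an error.
combined = utf8.dup.force_encoding(Encoding::ASCII_8BIT) + raw
```

Note that `force_encoding` only relabels the bytes; it does not transcode them, so it is appropriate only when the data really is meant to be handled as bytes.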

pjb3 · Answer 3 · 2009-10-20T03:42:24+0000

I would try running your application in production on Apache/Passenger to see if the problem is specific to nginx/Passenger.

Adam Elhardt · Answer 4 · 2009-11-06T19:32:03+0000

I've had a similar problem over the past few months, but not often enough to really debug it until recently. In my case, telling Passenger where to put its temporary buffer files did the trick. What made it difficult to find was not only the absence of error messages in the log files, but also the fact that this buffer is used not only for multipart messages but for any type of large message body.

PassengerUploadBufferDir /tmp
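
Before changing the setting, it may be worth checking whether the user the application runs as can actually write a body of that size to the candidate directory. The sketch below assumes the underlying cause is a buffer directory that is missing or not writable, which the answer does not state explicitly; `can_buffer_in?` is a hypothetical helper, not part of Passenger.

```ruby
require "tempfile"
require "tmpdir"

# Hypothetical check: can the current user create a temp file of `size`
# bytes inside `dir`? Returns true or false instead of raising.
def can_buffer_in?(dir, size = 20_000)
  return false unless File.directory?(dir) && File.writable?(dir)

  # The block form removes the file again when the block finishes.
  Tempfile.create("upload-buffer-check", dir) do |file|
    file.write("x" * size)
    file.flush
    File.size(file.path) == size
  end
rescue SystemCallError
  # Covers permission problems, a full disk, and similar OS-level errors.
  false
end

puts can_buffer_in?(Dir.tmpdir)
```

Run it as the same user the application processes run under, passing the directory you intend to use for the buffer.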
