Browser client should limit the number of parallel requests

Bug #1912834 reported by Bill Erickson
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenSRF
New
High
Unassigned

Bug Description

Evergreen 3.6

Following up on bug #1896285.

It's very easy to write Angular/AngularJS code that results in large numbers of parallel API calls, which can result in overwhelming EG servers -- or at least maxing out service drones, which negatively impacts usability of the system.

Instead of playing wack-a-mole with the errant code, I propose we bake a hard limit into the Ang/AngJS networking code to prevent more than X active parallel requests at any one time. With this, it should not be possible for a single client to max-out a service.

I propose a max of 5 active requests per browser tab, but I could see making it configurable.

Revision history for this message
Bill Erickson (berick) wrote :

On reflection, I think it makes more sense to bake this into OpenSRF's websocket client library so we only have to modify one part of the code.

Revision history for this message
Bill Erickson (berick) wrote :

Here's a commit which bakes the request limiting into opensrf.js

https://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/berick/lp1912834-max-parallel-net

To test:

[1] Navigate to the record bucket interface in the browser client.
[2] Create a bucket
[3] Go to Record Query, set Row Count to 50, and search for e.g. piano
[4] Select and add all of the result records to the bucket (46 records in my test)
[5] Note a blast of API calls in the console and a warning message: "Net throttling activated on max requests".

Also note that even though 40+ open-ils.actor calls are made, the open-ils.actor service drone counts stay within a reasonable range (5 for me).

Prior to this patch, the steps above would max out my actor drones.

Changed in opensrf:
milestone: none → 3.2.2
tags: added: pullrequest
Changed in evergreen:
assignee: Bill Erickson (berick) → nobody
Changed in opensrf:
importance: Undecided → High
no longer affects: evergreen
Revision history for this message
Bill Erickson (berick) wrote :

Note I just force-pushed a fix to remove an old bogus commit from the branch that should not have been included.

Revision history for this message
Bill Erickson (berick) wrote :

Pushed a new branch with an improved implementation, based on issues found while testing.

https://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/berick/lp1912834-max-parallel-net-v2

Revision history for this message
Chris Sharp (chrissharp123) wrote :

More results from testing this fix on PINES production (3.6.1). Dojo Acquisitions UIs do not fully load and we're seeing console errors like the following:

Failed to load resource: the server responded with a status of 400 (Bad Request)
VM283:180 Uncaught Error: Transport error method=open-ils.acq.fund.check_balance_percentages, status=400
    at Object.args.ontransporterror (eval at dojo.eval (dojo.js:16), <anonymous>:180:23)
    at OpenSRF.XHRequest.transport_error_handler (eval at dojo.eval (dojo.js:16), <anonymous>:130:19)
    at OpenSRF.XHRequest.core_handler (eval at dojo.eval (dojo.js:16), <anonymous>:104:21)
    at XMLHttpRequest.xreq.onreadystatechange (eval at dojo.eval (dojo.js:16), <anonymous>:59:29)
/osrf-http-translator:1 Failed to load resource: the server responded with a status of 400 (Bad Request)
VM283:180 Uncaught Error: Transport error method=open-ils.acq.fund.check_balance_percentages, status=400
    at Object.args.ontransporterror (eval at dojo.eval (dojo.js:16), <anonymous>:180:23)
    at OpenSRF.XHRequest.transport_error_handler (eval at dojo.eval (dojo.js:16), <anonymous>:130:19)
    at OpenSRF.XHRequest.core_handler (eval at dojo.eval (dojo.js:16), <anonymous>:104:21)
    at XMLHttpRequest.xreq.onreadystatechange (eval at dojo.eval (dojo.js:16), <anonymous>:59:29)
/osrf-http-translator:1 Failed to load resource: the server responded with a status of 400 (Bad Request)
VM283:180 Uncaught Error: Transport error method=open-ils.acq.fund.check_balance_percentages, status=400
    at Object.args.ontransporterror (eval at dojo.eval (dojo.js:16), <anonymous>:180:23)
    at OpenSRF.XHRequest.transport_error_handler (eval at dojo.eval (dojo.js:16), <anonymous>:130:19)
    at OpenSRF.XHRequest.core_handler (eval at dojo.eval (dojo.js:16), <anonymous>:104:21)
    at XMLHttpRequest.xreq.onreadystatechange (eval at dojo.eval (dojo.js:16), <anonymous>:59:29)
/osrf-http-translator:1 Failed to load resource: the server responded with a status of 400 (Bad Request)
VM283:180 Uncaught Error: Transport error method=open-ils.acq.fund.check_balance_percentages, status=400
    at Object.args.ontransporterror (eval at dojo.eval (dojo.js:16), <anonymous>:180:23)
    at OpenSRF.XHRequest.transport_error_handler (eval at dojo.eval (dojo.js:16), <anonymous>:130:19)
    at OpenSRF.XHRequest.core_handler (eval at dojo.eval (dojo.js:16), <anonymous>:104:21)
    at XMLHttpRequest.xreq.onreadystatechange (eval at dojo.eval (dojo.js:16), <anonymous>:59:29)
/osrf-http-translator:1 Failed to load resource: the server responded with a status of 400 (Bad Request)
VM283:180 Uncaught Error: Transport error method=open-ils.acq.fund.check_balance_percentages, status=400
    at Object.args.ontransporterror (eval at dojo.eval (dojo.js:16), <anonymous>:180:23)
    at OpenSRF.XHRequest.transport_error_handler (eval at dojo.eval (dojo.js:16), <anonymous>:130:19)
    at OpenSRF.XHRequest.core_handler (eval at dojo.eval (dojo.js:16), <anonymous>:104:21)
    at XMLHttpRequest.xreq.onreadystatechange (eval at dojo.eval (dojo.js:16), <anonymous>:59:29)

Rolling back the patch resolves that issue.

Revision history for this message
Bill Erickson (berick) wrote :

Here's a version that just logs an annoying error message when the max is exceeded. It does no message throttling. I also upped the max to 10 to better focus on the worst offenders.

https://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/berick/lp1912834-max-parallel-net-logging

The hope is the error message will be enough to clarify situations that must be fixed and, over time, direct coders to modify code in progress when it's discovered too many network requests are going out.

Revision history for this message
Blake GH (bmagic) wrote :

Confirmed - the acq issue happens for us as well. We really don't like when production goes down because of overloaded connections, so we want this patch! But we don't want acq to be a casualty. Right now, we forked the acq traffic to a different application server that does not have this patch. Trying to have our cake and eat it too!

Other than the acq issue, I can say that the production environment has bee much more stable so far today. The number of apache2 processes spike up once in a while but don't say there. Quickly return to "normal" and the brick CPU goes back to normal. The day is young but I'm hopeful.

bill++
chris++

tags: added: parallel-requests
Galen Charlton (gmc)
Changed in opensrf:
milestone: 3.2.2 → none
Revision history for this message
Galen Charlton (gmc) wrote :

Noting while noodling around with Bill's lp1912834-max-parallel-net-v2 that the following patch will allow the Dojo acquisitions UI to perform a search and retrieve line items:

diff --git a/src/javascript/opensrf.js b/src/javascript/opensrf.js
index 4e138f5f..c168c1ec 100644
--- a/src/javascript/opensrf.js
+++ b/src/javascript/opensrf.js
@@ -708,7 +708,7 @@ function log(msg) {
 OpenSRF.Stack.push = function(net_msg, callbacks) {
     var ses = OpenSRF.Session.find_session(net_msg.thread);
     if (!ses) return;
- ses.remote_id = net_msg.from;
+ ses.remote_id = null; // was net_msg.from;

     // NetMessage's from websocket connections are parsed before they get here
     osrf_msgs = net_msg.osrf_msg;

Now, this is hardly a fix, as it surely breaks stateful connections, but it does point to why the HTTP translator starts returning 400 errors: eventually ses.remote_id gets set to a non-null value, so XHR sends an X-OpenSRF-to header rather than an X-OpenSRF-service one - but one that somehow doesn't correspond to the correct cached session.

Revision history for this message
Bill Erickson (berick) wrote :

Just noting if/when we adopt the Rust weboscket gateway, I added request throttling with a configurable max-parallel value (defaults to 8, like ye old timey httprequests). It does not help on the Dojo side, but that'll be gone soon enough.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.