Ceilometer

mongo aggregation pipeline for resource retrieval fails with excessive memory use

Bug #1262571 reported by Eoghan Glynn on 2013-12-19

This bug affects 3 people

Affects		Status	Importance	Assigned to	Milestone
	Ceilometer	Fix Released	High	Eoghan Glynn	Ceilometer 2014.1 "icehouse"
	Havana	Fix Released	High	Eoghan Glynn	Ceilometer 2013.2.2

Bug Description

The mongodb storage driver currently uses an aggregation pipeline over the meter collection in order to construct a list of resources adorned with first & last sample timestamps etc.

The problem with this approach is that the mongodb aggregation framework performs sorting in-memory, in this case operating over a potentially very large collection (particularly if the GET /v2/resources request was not constrained with query params, e.g. to limit it to a single tenant).

It turns out the mongodb innards are hardcoded to abort any sorts in an aggregation pipeline that will consume more than 10% of physical memory. The net result is that we see failures in production such as:

ERROR wsme.api [-] Server-side error: "command SON([('aggregate', u'meter'), ('pipeline', [{'$match': {}},
{'$sort': {'timestamp': -1, 'project_id': -1, 'user_id': -1}}, {'$group': {'meters_unit': {'$push': '$counter_unit'},
'source': {'$first': '$source'}, 'project_id': {'$first': '$project_id'},
'user_id': {'$first': '$user_id'}, 'last_sample_timestamp': {'$max': '$timestamp'},
'meters_name': {'$push': '$counter_name'}, 'first_sample_timestamp': {'$min': '$timestamp'},
'meters_type': {'$push': '$counter_type'}, '_id': '$resource_id', 'metadata': {'$first': '$resource_metadata'}}}])])
failed: exception: terminating request: request heap use exceeded 10% of physical RAM"

Discussion of the fossil record on gerrit indicates that the use of the aggregation framework in this context was primarily for convenience:

https://review.openstack.org/35297

Switching over to storing the first and last timestamps in the resource collection directly (and updating these on every sample insert) is not a workable approach, as there are no universal first and last timestamps for a resource that will always be applicable regardless of the constraints on the resource query.
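To illustrate why stored timestamps cannot serve every query, here is a minimal pure-Python sketch. It assumes the resource query can be constrained to a time window; the sample data and the helper name `first_last` are hypothetical and not taken from Ceilometer.

```python
from datetime import datetime

# Hypothetical samples for a single resource (timestamps are made up).
samples = [
    {"resource_id": "r1", "timestamp": datetime(2013, 12, 1)},
    {"resource_id": "r1", "timestamp": datetime(2013, 12, 10)},
    {"resource_id": "r1", "timestamp": datetime(2013, 12, 19)},
]


def first_last(samples, start=None, end=None):
    """Return (first, last) sample timestamps within an optional window."""
    stamps = [
        s["timestamp"]
        for s in samples
        if (start is None or s["timestamp"] >= start)
        and (end is None or s["timestamp"] < end)
    ]
    return (min(stamps), max(stamps)) if stamps else (None, None)


# Values that could be stored once on the resource document.
unconstrained = first_last(samples)
# The same resource, queried with a start-of-window constraint.
windowed = first_last(samples, start=datetime(2013, 12, 5))
```

The first timestamp in `windowed` differs from the one in `unconstrained`, so a single pair stored on the resource would be wrong for the constrained query.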

Hence the workable approaches to resolving this issue are:

1. avoid the need for sorting in-memory by ensuring sufficient indices exist on the meter collection (currently the sort instructions for resource retrieval default to timestamp, project_id, user_id all descending)

2. avoid the aggregation framework altogether and instead revert to the equivalent map-reduce
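For approach 1, the index would need to match the default sort order quoted above. The following is only a sketch of how such a key specification might be built: the helper name `sort_index_keys` is hypothetical, the direction constant is defined locally (mirroring `pymongo.DESCENDING`) so that the snippet runs without pymongo, and whether MongoDB actually uses the index for the pipeline's sort depends on the server version and query planner.

```python
# Direction constant mirroring pymongo.DESCENDING, defined locally so the
# sketch runs without pymongo installed.
DESCENDING = -1

# Default sort order for resource retrieval, as described in the report.
DEFAULT_SORT_FIELDS = ("timestamp", "project_id", "user_id")


def sort_index_keys(fields=DEFAULT_SORT_FIELDS, direction=DESCENDING):
    """Build a compound-index key list matching the sort specification."""
    return [(field, direction) for field in fields]


keys = sort_index_keys()
# With a live pymongo collection (assumed here to be named `meter`),
# the keys could be applied as, for example:
#   db.meter.create_index(keys)
```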

Note that resource retrieval is the only case where the aggregation framework is currently used by the mongodb storage driver.
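For approach 2, the grouping done by the aggregation pipeline can be expressed as a map step and a reduce step. The sketch below emulates that logic in pure Python for illustration only: it is not the code from the eventual fix (which would run JavaScript map and reduce functions inside MongoDB), and the function names, integer timestamps and sample data are all hypothetical.

```python
from collections import defaultdict


def map_sample(sample):
    """Map step: emit (resource_id, partial summary) for one sample."""
    ts = sample["timestamp"]
    meter = (
        sample["counter_name"],
        sample["counter_type"],
        sample["counter_unit"],
    )
    return sample["resource_id"], {
        "first_sample_timestamp": ts,
        "last_sample_timestamp": ts,
        "meters": {meter},
    }


def reduce_values(values):
    """Reduce step: merge partial summaries for one resource.

    The result has the same shape as a mapped value, so it can be
    fed back into another reduce call.
    """
    return {
        "first_sample_timestamp": min(v["first_sample_timestamp"] for v in values),
        "last_sample_timestamp": max(v["last_sample_timestamp"] for v in values),
        "meters": set().union(*(v["meters"] for v in values)),
    }


def summarize_resources(samples):
    """Group mapped values by resource_id and reduce each group."""
    grouped = defaultdict(list)
    for sample in samples:
        key, value = map_sample(sample)
        grouped[key].append(value)
    return {key: reduce_values(values) for key, values in grouped.items()}


# Made-up samples; integer timestamps keep the example short.
samples = [
    {"resource_id": "r1", "timestamp": 3, "counter_name": "cpu",
     "counter_type": "cumulative", "counter_unit": "ns"},
    {"resource_id": "r1", "timestamp": 1, "counter_name": "cpu",
     "counter_type": "cumulative", "counter_unit": "ns"},
    {"resource_id": "r2", "timestamp": 2, "counter_name": "disk.read.bytes",
     "counter_type": "cumulative", "counter_unit": "B"},
]
resources = summarize_resources(samples)
```

No global sort is needed here: the first and last timestamps fall out of `min` and `max` within each group. Keeping the reduce output in the same shape as the mapped values matters because MongoDB may call a reduce function more than once for the same key on partial results.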

Eoghan Glynn (eglynn) on 2014-01-07

Changed in ceilometer:
milestone:	none → icehouse-2
assignee:	nobody → Eoghan Glynn (eglynn)
importance:	Undecided → High
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2014-01-10: Fix proposed to ceilometer (master)

Fix proposed to branch: master
Review: https://review.openstack.org/65962

OpenStack Infra (hudson-openstack) wrote on 2014-01-15: Fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/65962
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=ba6641afacfc52e7391d2095751ee96d62a64c25
Submitter: Jenkins
Branch: master

commit ba6641afacfc52e7391d2095751ee96d62a64c25
Author: Eoghan Glynn <email address hidden>
Date: Thu Jan 9 16:30:10 2014 +0000

    Replace mongo aggregation with plain ol' map-reduce

    Fixes bug 1262571

    Previously, the mongodb storage driver used an aggregation pipeline
    over the meter collection in order to construct a list of resources
    adorned with first & last sample timestamps etc.

    However, the mongodb aggregation framework performs sorting in-memory,
    in this case operating over a potentially very large collection.
    It is also hardcoded to abort any sorts in an aggregation pipeline
    that will consume more than 10% of physical memory, which is
    observed in this case.

    Now, we avoid the aggregation framework altogether and instead
    use an equivalent map-reduce.

    Change-Id: Ibef4a95acada411af385ff75ccb36c5724068b59

Changed in ceilometer:
status:	In Progress → Fix Committed

Eoghan Glynn (eglynn) on 2014-01-15

tags:	added: havana-backport-potential

OpenStack Infra (hudson-openstack) wrote on 2014-01-15: Fix proposed to ceilometer (stable/havana)

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/66861

Thierry Carrez (ttx) on 2014-01-22

Changed in ceilometer:
status:	Fix Committed → Fix Released

Alan Pevec (apevec) on 2014-02-04

tags:	removed: havana-backport-potential

Thierry Carrez (ttx) on 2014-04-17

Changed in ceilometer:
milestone:	icehouse-2 → 2014.1


Duplicates of this bug

Bug #1259446
