test_list_servers_negative failing two tests

Bug #1061167 reported by Jay Pipes
This bug affects 1 person
Affects: tempest
Status: Fix Released
Importance: Critical
Assigned to: Jay Pipes
Milestone: havana-3

Bug Description

Repeatable locally and in the latest Tempest full gate run:

21:39:35 ======================================================================
21:39:35 ERROR: Return an empty list for servers detail when no active servers
21:39:35 ----------------------------------------------------------------------
21:39:35 Traceback (most recent call last):
21:39:35 File "/opt/stack/tempest/tempest/tests/compute/test_list_servers_negative.py", line 331, in test_list_servers_detail_when_no_servers_running
21:39:35 self.client.wait_for_server_termination(server['id'])
21:39:35 File "/opt/stack/tempest/tempest/services/nova/json/servers_client.py", line 167, in wait_for_server_termination
21:39:35 raise exceptions.BuildErrorException(server_id=server_id)
21:39:35 BuildErrorException: Server b04debc5-943f-49a3-93df-a1e446bfb436 failed to build and is in ERROR status
21:39:35
21:39:35 ======================================================================
21:39:35 ERROR: Return an empty list when there are no active servers
21:39:35 ----------------------------------------------------------------------
21:39:35 Traceback (most recent call last):
21:39:35 File "/opt/stack/tempest/tempest/tests/compute/test_list_servers_negative.py", line 89, in test_list_servers_when_no_servers_running
21:39:35 self.client.wait_for_server_termination(server['id'])
21:39:35 File "/opt/stack/tempest/tempest/services/nova/json/servers_client.py", line 167, in wait_for_server_termination
21:39:35 raise exceptions.BuildErrorException(server_id=server_id)
21:39:35 BuildErrorException: Server b04debc5-943f-49a3-93df-a1e446bfb436 failed to build and is in ERROR status
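
For context, wait_for_server_termination polls the server until the GET returns a 404 and treats any transition to ERROR as a failed build. A minimal sketch of the loop, reconstructed from the traceback above (the real method lives in tempest/services/nova/json/servers_client.py; names and details here are approximate, not verbatim):

import time

from tempest import exceptions


def wait_for_server_termination(self, server_id):
    # self.get_server and self.build_interval come from the client class.
    while True:
        try:
            resp, body = self.get_server(server_id)
        except exceptions.NotFound:
            # The server is gone -- deletion completed normally.
            return
        if body['status'] == 'ERROR':
            # This is the raise at servers_client.py line 167 in the
            # traceback: a server that lands in ERROR while being
            # deleted is reported as a failed build.
            raise exceptions.BuildErrorException(server_id=server_id)
        time.sleep(self.build_interval)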

Looking into the log files, I found the following in the n-sch (nova-scheduler) log:

Attempting to build 1 instance(s) from (pid=22058) schedule_run_instance /opt/stack/nova/nova/scheduler/filter_scheduler.py:66
2012-10-02 21:31:38 DEBUG nova.scheduler.filters.retry_filter [req-3010657d-1acc-483a-95c0-f9a377af1757 ListServersNegativeTest-user ListServersNegativeTest-tenant] Previously tried hosts: [u'devstack-1349212287']. (host=devstack-1349212287) from (pid=22058) host_passes /opt/stack/nova/nova/scheduler/filters/retry_filter.py:39
2012-10-02 21:31:38 DEBUG nova.scheduler.host_manager [req-3010657d-1acc-483a-95c0-f9a377af1757 ListServersNegativeTest-user ListServersNegativeTest-tenant] Host filter function <bound method RetryFilter.host_passes of <nova.scheduler.filters.retry_filter.RetryFilter object at 0x4b41e10>> failed for devstack-1349212287 from (pid=22058) passes_filters /opt/stack/nova/nova/scheduler/host_manager.py:253
2012-10-02 21:31:38 WARNING nova.scheduler.driver [req-3010657d-1acc-483a-95c0-f9a377af1757 ListServersNegativeTest-user ListServersNegativeTest-tenant] [instance: b04debc5-943f-49a3-93df-a1e446bfb436] Setting instance to ERROR state.

I'm not very familiar with this RetryFilter or when it was added to Nova. I will chat with folks on IRC to try to figure out why this is suddenly failing...

Jay Pipes (jaypipes) wrote:

Relevant IRC conversation:

<jaypipes> comstud, sandywalsh: around? I need some assistance trying to figure out the root cause of https://bugs.launchpad.net/tempest/+bug/1061167
<uvirtbot> Launchpad bug 1061167 in tempest "test_list_servers_negative failing two tests" [Critical,Confirmed]
<comstud> jaypipes: yes
<jaypipes> comstud, sandywalsh: has to do with "RetryFilter" in scheduler... I'm not familiar with that. Is it recently added?
<comstud> a couple months ago
<comstud> belliott added it
<comstud> but i'm familiar with it
<jaypipes> comstud: can you give me the skinny?
<comstud> essentially almost all instances -> ERROR are kicked back to the scheduler and retried
<jaypipes> I see.
<comstud> any exception caught in compute manager while building
<comstud> it's configurable... you can turn it off
<comstud> however.. it's needed to handle race conditions
<jaypipes> comstud: k.
<comstud> where we think compute has enough room
<comstud> but it really doesn't
<comstud> this goes along with the new resource tracking stuff
<comstud> (nova/compute/resource_tracker.py)
<jaypipes> k
<comstud> jaypipes: There needs to be some better logging, etc, in the scheduler... and questionably a fault created...
<comstud> so we know better which hosts were tried
<comstud> jaypipes: i think right now, you have to scan through your hosts for the instance uuid to find the traceback
<comstud> for the ultimate reason for failure
<jaypipes> comstud: that is what I have been doing :)
<comstud> yeah :)
<jaypipes> comstud: http://paste.openstack.org/show/21279/
<jaypipes> comstud: that's the "error" that occurred on the initial scheduling request
<jaypipes> comstud: which doesn't make sense to me.
<comstud> weird
<comstud> tempest is okay with us for xen
<comstud> at least on code within the past week
<jaypipes> comstud: and the line directly after that is even weirder...
<jaypipes> 2012-10-02 21:31:25 AUDIT nova.compute.manager [req-3010657d-1acc-483a-95c0-f9a377af1757 ListServersNegativeTest-user ListServersNegativeTest-tenant] [instance: b04debc5-943f-49a3-93df-a1e446bfb436] Starting instance...
<comstud> lol
<jaypipes> comstud: so perhaps the QEMU error is indeed expected.
<comstud> same request id
<comstud> yeah
<jaypipes> comstud: and it's just output when an instance with the instance ID does not exist (checked prior to launch)
<comstud> what's the last log message for req-3010657d?
<comstud> or last log messages
<comstud> if something bombed out, i'd expect it to be a traceback
<comstud> they should still be logged before compute kicks it back to scheduler for retrying

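To make comstud's explanation concrete: the RetryFilter rejects any host that this instance has already failed to build on, using the retry info the scheduler attaches to the request. A minimal sketch based on the n-sch log and the discussion above (it follows nova/scheduler/filters/retry_filter.py in spirit, but is not the verbatim code):

class RetryFilter(object):
    """Skip hosts that previous build attempts for this instance
    already failed on."""

    def host_passes(self, host_state, filter_properties):
        retry = filter_properties.get('retry')
        if not retry:
            # Retries are disabled; never filter a host out.
            return True
        # Hosts this instance was already scheduled to and failed on.
        # In the log above that list is [u'devstack-1349212287'], which
        # is also the only host in the deployment, so no host passes
        # and the instance is set to ERROR.
        return host_state.host not in retry.get('hosts', [])

As comstud says, the retry behaviour is configurable; if I remember right it is governed by the scheduler_max_attempts option in nova.conf, and setting it to 1 effectively disables retries.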

David Kranz (david-kranz) wrote:

I don't think I agree with comstud's tentative suggestion that a deleted server could be expected to be in an ERROR state on its way to being deleted. But this is really a spec issue. The question from an API standpoint is: what does it mean if a server is listed and shows a state of ERROR? Does it depend on which API was called last? Or next-to-last, etc.? Was this a behavior change?

Jay Pipes (jaypipes) wrote:

This was a behaviour change introduced by the expected status exception classes, and it is highlighted in Tempest only because the server is created and destroyed in such rapid succession...

I have substantially rewritten the test_list_servers_negative module and will shortly submit a pull request that addresses this bug.

OpenStack Infra (hudson-openstack) wrote: Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/14049

Changed in tempest:
status: Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote:

Fix proposed to branch: master
Review: https://review.openstack.org/14066

OpenStack Infra (hudson-openstack) wrote: Fix merged to tempest (master)

Reviewed: https://review.openstack.org/14066
Committed: http://github.com/openstack/tempest/commit/444c3e64ed1347adcfc4204c1e585e2d91022722
Submitter: Jenkins
Branch: master

commit 444c3e64ed1347adcfc4204c1e585e2d91022722
Author: Jay Pipes <email address hidden>
Date: Thu Oct 4 19:26:35 2012 -0400

    Refactor list servers negative test and fix remaining gate bugs

    test_list_servers_negative had so many issues with it, it's hard
    to list them here... but at least the following has been fixed:

    * No longer will the test flatly delete all instances the tenant
      and alt tenant have during every test method run (!!)
    * Fixtures -- 2 active and 1 deleted server -- are created in the
      class' setUpClass() method instead of having test methods
      constantly re-launch instances -- this reduces the test time
      by about 3 minutes
    * Removed all non-negative tests
    * Removed all tests that were skipped due to bugs that have since
      been fixed
    * Skip the XML create_service() test for Identity Admin tests
      because of the upstream Keystone bug 1061738

    This patch also includes patches for bug fixes that Sean Gallagher
    contributed but needed to be rebased into this one in order
    to get the full gate to pass.

    Change-Id: I0b38cf045520d93031c7916dae2479daf69d0411
    fixes: LP bug #1059802
    fixes: LP bug #1060373
    fixes: LP bug #1061740
    fixes: LP bug #1061167

Changed in tempest:
status: In Progress → Fix Released

Matt Riedemann (mriedem) wrote:

I've noticed recently that if I run the full test_list_servers_negative module with allow_tenant_isolation=false in tempest.conf, I can get test_list_servers_by_changes_since to consistently fail. If I run it by itself, it's OK. If I run the full module with allow_tenant_isolation=true (the default value), it's also fine.

I'm running against havana master.

This is what I get:

FAIL: tempest.api.compute.servers.test_list_servers_negative.ListServersNegativeTestJSON.test_list_servers_by_changes_since[gate]
----------------------------------------------------------------------
_StringException: Traceback (most recent call last):
  File "/tmp/tempest/tempest/tempest/api/compute/servers/test_list_servers_negative.py", line 181, in test_list_servers_by_changes_since
    self.assertEqual(num_expected, len(body['servers']))
  File "/usr/lib/python2.6/site-packages/testtools/testcase.py", line 322, in assertEqual
    self.assertThat(observed, matcher, message)
  File "/usr/lib/python2.6/site-packages/testtools/testcase.py", line 417, in assertThat
    raise MismatchError(matchee, matcher, mismatch, verbose)
MismatchError: 3 != 4

I suppose I should open a new bug...
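
The shape of the failure, for anyone reproducing it: a minimal sketch assuming the test records a timestamp up front and then lists everything changed since then (names here are illustrative, not the verbatim Tempest code):

# Timestamp captured when the class fixtures were created.
params = {'changes-since': self.since_time}
resp, body = self.client.list_servers(params)
# With allow_tenant_isolation=false the tenant is shared, so servers
# left behind by other test modules also match changes-since, and the
# count comes back higher than this class' own fixtures -- hence the
# "MismatchError: 3 != 4" above.
self.assertEqual(num_expected, len(body['servers']))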

Sean Dague (sdague)
Changed in tempest:
milestone: none → havana-3