SSH timeout in tempest.test.boto.test_ec2_instance_run

Bug #1117555 reported by John Griffith
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Invalid
Undecided
Unassigned
tempest
Fix Released
Critical
Attila Fazekas

Bug Description

quantum devstack gate failure:
http://logs.openstack.org/21179/1/gate/gate-tempest-devstack-vm-quantum/5156 : FAILURE

2013-02-06 16:26:40.052 | ======================================================================
2013-02-06 16:26:40.053 | ERROR: tempest.tests.boto.test_ec2_instance_run.InstanceRunTest.test_integration_1
2013-02-06 16:26:40.054 | ----------------------------------------------------------------------
2013-02-06 16:26:40.054 | _StringException: Traceback (most recent call last):
2013-02-06 16:26:40.054 | File "/opt/stack/new/tempest/tempest/tests/boto/test_ec2_instance_run.py", line 205, in test_integration_1
2013-02-06 16:26:40.054 | pkey=self.keypair.material)
2013-02-06 16:26:40.054 | File "/opt/stack/new/tempest/tempest/common/utils/linux/remote_client.py", line 31, in __init__
2013-02-06 16:26:40.054 | if not self.ssh_client.test_connection_auth():
2013-02-06 16:26:40.055 | File "/opt/stack/new/tempest/tempest/common/ssh.py", line 142, in test_connection_auth
2013-02-06 16:26:40.055 | connection = self._get_ssh_connection()
2013-02-06 16:26:40.055 | File "/opt/stack/new/tempest/tempest/common/ssh.py", line 75, in _get_ssh_connection
2013-02-06 16:26:40.055 | password=self.password)
2013-02-06 16:26:40.055 | SSHTimeout: Connection to the 172.24.4.228 via SSH timed out.
2013-02-06 16:26:40.055 | User: cirros, Password: None
2013-02-06 16:26:40.055 |
2013-02-06 16:26:40.055 | -------------------- >> begin captured logging << --------------------
2013-02-06 16:26:40.056 | tempest.config: INFO: Using tempest config file /opt/stack/new/tempest/etc/tempest.conf
2013-02-06 16:26:40.056 | requests.packages.urllib3.connectionpool: INFO: Starting new HTTP connection (1): 127.0.0.1
2013-02-06 16:26:40.056 | requests.packages.urllib3.connectionpool: DEBUG: "POST /v2.0/tokens HTTP/1.1" 200 None
2013-02-06 16:26:40.056 | requests.packages.urllib3.connectionpool: INFO: Starting new HTTP connection (1): 127.0.0.1
2013-02-06 16:26:40.056 | requests.packages.urllib3.connectionpool: DEBUG: "GET /v2.0/users/40d750dffc934dc293df7f5ce621ad33/credentials/OS-EC2 HTTP/1.1" 200 None
2013-02-06 16:26:40.056 | requests.packages.urllib3.connectionpool: INFO: Starting new HTTP connection (1): 127.0.0.1
2013-02-06 16:26:40.056 | requests.packages.urllib3.connectionpool: DEBUG: "POST /v2.0/tokens HTTP/1.1" 200 None
2013-02-06 16:26:40.057 | requests.packages.urllib3.connectionpool: INFO: Starting new HTTP connection (1): 127.0.0.1
2013-02-06 16:26:40.057 | requests.packages.urllib3.connectionpool: DEBUG: "GET /v2.0/users/40d750dffc934dc293df7f5ce621ad33/credentials/OS-EC2 HTTP/1.1" 200 None
2013-02-06 16:26:40.057 | requests.packages.urllib3.connectionpool: INFO: Starting new HTTP connection (1): 127.0.0.1
2013-02-06 16:26:40.057 | requests.packages.urllib3.connectionpool: DEBUG: "POST /v2.0/tokens HTTP/1.1" 200 None
2013-02-06 16:26:40.057 | requests.packages.urllib3.connectionpool: INFO: Starting new HTTP connection (1): 127.0.0.1
2013-02-06 16:26:40.057 | requests.packages.urllib3.connectionpool: DEBUG: "GET /v2.0/users/40d750dffc934dc293df7f5ce621ad33/credentials/OS-EC2 HTTP/1.1" 200 None

James E. Blair (corvus)
affects: openstack-ci → tempest
Revision history for this message
Christopher Yeoh (cyeoh-0) wrote :

This was a problem for a while but doesn't seem to occur anymore and I'm unable to replicate it.

John are you able to still see the problem with the latest version of devstack?

Changed in tempest:
status: New → Incomplete
Revision history for this message
Christopher Yeoh (cyeoh-0) wrote :

This appears to be related to the security group rules for the test not ending up being correctly set resulting in the instance not being accessible via ssh which eventually results in a timeout. I saw the problem briefly today again, but it went away before I could fully debug it.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/23129

Changed in tempest:
assignee: nobody → Christopher Yeoh (cyeoh-0)
status: Incomplete → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/23300

Changed in tempest:
assignee: Christopher Yeoh (cyeoh-0) → Sean Dague (sdague-b)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/23300
Committed: http://github.com/openstack/tempest/commit/305fe99820334ecd0c630232bf34891ba189b0db
Submitter: Jenkins
Branch: master

commit 305fe99820334ecd0c630232bf34891ba189b0db
Author: Sean Dague <email address hidden>
Date: Fri Mar 1 12:00:35 2013 -0500

    skip ec2 test until it can be debugged

    punts on bug #1117555 until it can be looked at more carefully

    Change-Id: I2a5d332f2ccd3c3f111fbda0756a4a4eb1c8cf03

Changed in tempest:
status: In Progress → Fix Released
Changed in tempest:
status: Fix Released → New
importance: Undecided → Critical
status: New → Triaged
Changed in tempest:
assignee: Sean Dague (sdague-b) → Attila Fazekas (afazekas)
Revision history for this message
Christopher Yeoh (cyeoh-0) wrote :

The extra assertions in https://review.openstack.org/#/c/23129/1 when it goes through should help narrow down the problem to whether it is an infrastructure issue or not.

Unfortunately I've found this problem almost next to impossible to replicate. The only time I've managed to get the exactly same error (the timeout in the initial connection, not the console creating output command) occur is when I forgot to set my local ethernet interface to promisc mode before starting up devstack (this is needed apparently on Ubuntu 12.10). This results in public ips never being accessible and so the initial connection times out.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/23129
Committed: http://github.com/openstack/tempest/commit/86732f93387ef9561c68b68ad7b92dc31c96b5ae
Submitter: Jenkins
Branch: master

commit 86732f93387ef9561c68b68ad7b92dc31c96b5ae
Author: Chris Yeoh <email address hidden>
Date: Wed Feb 27 20:52:06 2013 +1030

    More assertions for test_integration_1

    For the boto/test_ec2_instance_run.py test_integration_1 test
    add assertions to check that the security group modifications
    and public ip association calls actually succeed.

    This is to help debug the intermittent bug 1117555 which appears
    to be due to the instance sometimes being unreachable via its
    public ip.

    Change-Id: I8c1ffeee62e1021ecf24f05ac26518ced640c5a1

Changed in tempest:
status: Triaged → Fix Released
Revision history for this message
Attila Fazekas (afazekas) wrote :

The strange issue:
http://logs.openstack.org/22765/4/check/gate-tempest-devstack-vm-full/6460/
The above log tels us tempest connected successfully once (the server's client version logged), but later the test case failed.

The ssh.py by default considers a connection dead if does not gets input on the transport channel for more than 10 sec.

Revision history for this message
Christopher Yeoh (cyeoh-0) wrote :

I think the problem seen in http://logs.openstack.org/22765/4/check/gate-tempest-devstack-vm-full/6460/ should be "fixed" by https://review.openstack.org/#/c/23067/3 which ups the timeout to 60 secs.

There's two timeout's we see here: One in RemoteClient's __init__ making the initial connection which has a long timeout and appears to be caused by the public ip not being reachable (perhaps the firewall settings aren't getting through properly) and the timeout with the console output, which I *think* should now be fixed.

I'm moving the status of the bug back to open because 23067 doesn't fix the first issue.

Changed in tempest:
status: Fix Released → Triaged
Revision history for this message
Attila Fazekas (afazekas) wrote :

The 'select' and 'threading' and break condition usage in the ssh.py looks very strange...

Revision history for this message
Attila Fazekas (afazekas) wrote :

Looks like the paramiko using pipes to give you the ability to use the channel in a select.

We should verify is the OrPipe used in a context where thread safety matters and is it thread-safe on a multi-core machine.

ssh.py not fully correct in extreme situations, but in those cases you would see a different exception.

- Is it possible the Cirros forgot his IP-address ?
- Is it possible the sshd or the qemu or the guest OS crashed ?
- Is it possible the firewall changes caused pocket lost ?
- is it possible the above mentioned promisc mode temporary turned off by an internal operation ?

Revision history for this message
Attila Fazekas (afazekas) wrote :

Looks like it can get an IP address information from metadata too.

Some relevant log event collected into a single file from http://logs.openstack.org/22765/4/check/gate-tempest-devstack-vm-full/6460/.

Revision history for this message
Vish Ishaya (vishvananda) wrote :

I've seen issues where paramiko has trouble connecting in the past. I wonder if it might be better to set a shorter timeout and retry a couple of times?

Revision history for this message
Attila Fazekas (afazekas) wrote :

I have switched to multi-core (4) test VM, and now with a very low chance I can see the issue on F18 with python-paramiko-1.7.7.1-3. (latest upstream version: 1.10.0)
Looks like it is not a timeout issue.

2013-03-05 10:10:12,930 [chan 1] Sesch channel 1 request ok
2013-03-05 10:10:12,930 [chan 1] EOF received (1)
2013-03-05 10:10:12,931 [chan 1] EOF sent (1)

* We have "received" the EOF earlier than we sent (at least it is logged in this order).
Our channel.shutdown_write() (EOF sent) is before the select()...

* select on a closed socket normally throws an exception, but we have a "fake socket" and in this case the expected select behavior is return immediately with a read event.

Note: Some cases the libvirtd modifies the iptables rules as well.
Note: I did not found evidence (yet) about the cirrios tires do anything else with local-ipv4 metadata than downloading.

Revision history for this message
Attila Fazekas (afazekas) wrote :

I did not found any write attempt on the pipe.

Revision history for this message
Attila Fazekas (afazekas) wrote :

Looks like both peer can know the channel is EOF -ed in both direction, but non of them tries to close the channel.

Revision history for this message
Attila Fazekas (afazekas) wrote :

Suspected issue:

channel._handle_eof() would set the read indicator pipe to a "forever read event" (set_forever) on an incoming EOF or CLOSE message, but the channel.fileno(), which creates and set the pipe as event indicator just called after this test case ERRORed .

The remote test command not expected write to the stdout, so the pipe does not expected to indicate a normal read event anyway.

So if the EOF message arrives before our select() gets it's real file descriptor from the fileno() , we get our SSHTimeout exception.

It should be reported to the paramiko issue tracker and we should have a workaround.

Revision history for this message
Attila Fazekas (afazekas) wrote :

s/after this test case ERRORed/after channel._handle_eof()/
s/expected write to/expected to write to/

(time to sleep, sorry about the noise, where is the comment edit button ? :) )

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/23666

Changed in tempest:
status: Triaged → In Progress
Revision history for this message
Attila Fazekas (afazekas) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/23666
Committed: http://github.com/openstack/tempest/commit/e14e5a47253bbe43fc5d265dc8907993b58b5314
Submitter: Jenkins
Branch: master

commit e14e5a47253bbe43fc5d265dc8907993b58b5314
Author: Attila Fazekas <email address hidden>
Date: Wed Mar 6 07:52:51 2013 +0100

    Have paramico to register the event pipe in time

    paramiko is multi-thread package which has a dedicated thread for input
    processing.
    paramiko using a pipe trick to communicate his buffered_pipe state for
    event based wait operations.
    It is possible the reading (transport thread) gets an EOF (or CLOSE) message before
    the event pipe created, unfourtunatly it leads to the EOF and CLOSE
    messages not communicated correctly by the event pipe which is created later.

    The applied change expected to be compatible with newer paramiko
    versions, basically it haves the paramiko to create the pipe earlier.

    test_integration_1 test case enabled.
    Fixing bug ##1117555.

    Change-Id: I5e7bff253ed21fc424acc426545cc48f22e96d74

Changed in tempest:
status: In Progress → Fix Released
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

looks like this is fixed on the tempest side, marking as invalid from nova's point of view.

Changed in nova:
status: New → Invalid
Sean Dague (sdague)
Changed in tempest:
milestone: none → havana-3
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.