naily should not run more than one instance

Bug #1268688 reported by Matthew Mosesohn
This bug affects 3 people
Affects: Fuel for OpenStack
Status: Fix Released
Importance: Critical
Assigned to: Aleksandr Didenko
Milestone: (none listed)

Bug Description

Supervisor somehow allowed 4 naily master processes to run, causing errors with SSH key generation. We should prevent supervisor from launching more than one instance.

tags: added: nailgun
Changed in fuel:
assignee: nobody → Fuel Python Team (fuel-python)
Evgeniy L (rustyrobot) wrote :

This is OK: there should be 4 instances (1 master and 3 workers). The number of workers is set by the -w option in the supervisor config:

    command=/usr/bin/nailyd -c /etc/naily/nailyd.conf -l /var/log/naily/naily.log -w 3

>> causing errors with ssh key generation

Multiple instances are not the problem; each instance works on one task.

Changed in fuel:
status: New → Invalid
importance: Undecided → Medium
milestone: none → 4.1
Matthew Mosesohn (raytrac3r) wrote :

You marked this invalid too soon. There were 4 masters with 3 workers for each, with a total of 16 processes.

Changed in fuel:
status: Invalid → New
Dmitry Pyzhov (dpyzhov) wrote :

I think the SSH issue was not related to the 4 masters. My understanding is that the master does nothing besides keeping the workers alive, and our architecture allows any number of workers.

So having lots of workers is not a big problem. Anyway, it should be investigated and fixed.

Changed in fuel:
status: New → Confirmed
assignee: Fuel Python Team (fuel-python) → nobody
Changed in fuel:
assignee: nobody → Fuel Python Team (fuel-python)
Dmitry Borodaenko (angdraug) wrote :

This issue has not recurred since it was first reported. I think it's safe to set it back to Invalid until it occurs again.

Evgeniy L (rustyrobot) wrote :

++ Dmitry,

I think it was a supervisor or configuration issue.

Changed in fuel:
status: Confirmed → Invalid
tags: added: customer-found
Evgeniy L (rustyrobot) wrote :

We have started to hit this problem very often. Here we can see several instances of naily, only one of them under supervisor:

    root 8658 0.0 2.7 217832 28496 ? Sl 13:43 0:01 naily master
    root 8726 0.0 3.7 628876 38356 ? Sl 13:43 0:02 naily worker[0]
    root 8728 0.0 2.6 354392 27328 ? Sl 13:43 0:01 naily worker[1]
    root 8732 2.3 4.0 698240 40868 ? Sl 13:43 1:33 naily worker[2]

    root 8809 0.1 1.1 197024 11508 ? Ss 13:43 0:04 /usr/bin/python /usr/bin/supervisord
    root 8813 0.0 2.7 217832 28500 ? Sl 13:43 0:01 naily master
    root 8869 0.0 2.6 354524 27324 ? Sl 13:43 0:01 naily worker[0]
    root 8873 0.0 2.6 354528 27328 ? Sl 13:43 0:01 naily worker[1]
    root 8877 0.0 2.6 354532 27332 ? Sl 13:43 0:01 naily worker[2]

Also, in the attached file we can see that nailgun and ostf are not attached to supervisor either, and they fail with the error "cannot bind to port N because it's already in use".
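
A rough way to spot this condition is to count naily process titles in `ps aux`-style output. The sketch below is illustrative only: the function name `count_naily` is hypothetical, and the sample text is simply the process listing from this comment. With `-w 3`, a healthy system would be expected to show 1 master and 3 workers.

```python
from collections import Counter

# Sample lines copied from the process listing in this comment.
SAMPLE = """\
root 8658 0.0 2.7 217832 28496 ? Sl 13:43 0:01 naily master
root 8726 0.0 3.7 628876 38356 ? Sl 13:43 0:02 naily worker[0]
root 8728 0.0 2.6 354392 27328 ? Sl 13:43 0:01 naily worker[1]
root 8732 2.3 4.0 698240 40868 ? Sl 13:43 1:33 naily worker[2]
root 8809 0.1 1.1 197024 11508 ? Ss 13:43 0:04 /usr/bin/python /usr/bin/supervisord
root 8813 0.0 2.7 217832 28500 ? Sl 13:43 0:01 naily master
root 8869 0.0 2.6 354524 27324 ? Sl 13:43 0:01 naily worker[0]
root 8873 0.0 2.6 354528 27328 ? Sl 13:43 0:01 naily worker[1]
root 8877 0.0 2.6 354532 27332 ? Sl 13:43 0:01 naily worker[2]
"""


def count_naily(ps_output):
    """Count naily master and worker processes in `ps aux`-style text."""
    counts = Counter()
    for line in ps_output.splitlines():
        # `ps aux` prints 10 fixed columns; the 11th field is the command.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        command = fields[10].strip()
        if command == "naily master":
            counts["master"] += 1
        elif command.startswith("naily worker["):
            counts["worker"] += 1
    return counts


counts = count_naily(SAMPLE)
print(counts["master"], counts["worker"])
```

More than one master in the result suggests that a stale instance survived a supervisor restart.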

Changed in fuel:
status: Invalid → Confirmed
importance: Medium → High
Evgeniy L (rustyrobot) wrote :
Changed in fuel:
importance: High → Critical
assignee: Fuel Python Team (fuel-python) → Fuel Library Team (fuel-library)
Evgeniy L (rustyrobot) wrote :

Actually, I don't understand how this ever worked, because when I ran 'kill -KILL supervisor_pid' (killproc) by hand, it killed supervisor but the children stayed alive. I think we should rewrite the stop function in the init.d/supervisor file.
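
For illustration, the general pattern a stop function can follow is: send SIGTERM, wait for the process to exit, and use SIGKILL only as a last resort. SIGKILL cannot be caught, so a process killed that way gets no chance to shut down its children. The following is a minimal Python sketch of that pattern, not the init script itself; `stop_gracefully` and its timeout value are hypothetical.

```python
import subprocess
import sys


def stop_gracefully(proc, timeout=10.0):
    """Ask a process to exit with SIGTERM; fall back to SIGKILL on timeout."""
    proc.terminate()  # SIGTERM: the process can handle it and clean up
    try:
        proc.wait(timeout=timeout)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: cannot be handled, so no cleanup happens
        proc.wait()
        return "killed"


if __name__ == "__main__":
    # Stand-in for a long-running daemon.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print(stop_gracefully(child, timeout=5.0))
```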

Vladimir Sharshov (vsharshov) wrote :

I also hit a situation where old naily workers did not exit. I think the problem is in the restart method, because when I use stop/start the situation does not recur.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/71899

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Vladimir Kuklin (vkuklin)
status: Confirmed → In Progress
Vladimir Kuklin (vkuklin) wrote :

Guys, I have increased the timeout. Please comment in Gerrit on whether it helps.

Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Matthew Mosesohn (raytrac3r)
Matthew Mosesohn (raytrac3r) wrote :

It looks like we should use a newer init script. I proposed a new patch to 71899.

Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Alexander Didenko (adidenko)
Changed in fuel:
assignee: Alexander Didenko (adidenko) → Matthew Mosesohn (raytrac3r)
Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Alexander Didenko (adidenko)
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/71899
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=518146d5c0a69918e2850f6ad7dab0b1e4e94c03
Submitter: Jenkins
Branch: master

commit 518146d5c0a69918e2850f6ad7dab0b1e4e94c03
Author: Vladimir Kuklin <email address hidden>
Date: Fri Feb 7 14:31:08 2014 +0100

    Increase supervisord stop delay

    Applied redhat initscripts from commit
    64f85465c6aaeb37b2385f3331004cf0a7d2182f
    from Supervisor project.

    Change-Id: Iaade68a30803ac6ebb24b484ea6cb510793d3bdf
    Closes-bug: #1268688
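
The commit title refers to a longer stop delay. The general idea behind such a delay, polling until the old process is really gone before reporting success or starting a new one, can be sketched as below. This is not the code from the commit (which changes an init script); `wait_until_stopped`, the `is_running` callback, and the default values are hypothetical.

```python
import time


def wait_until_stopped(is_running, timeout=30.0, interval=0.5):
    """Poll is_running() until it returns False or the timeout expires.

    Returns True if the process stopped in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while is_running():
        if time.monotonic() >= deadline:
            return False  # still running: the caller should not start a new instance
        time.sleep(interval)
    return True


if __name__ == "__main__":
    # Fake process that reports "running" twice and then "stopped".
    states = iter([True, True, False])
    print(wait_until_stopped(lambda: next(states), timeout=5.0, interval=0.01))
```

If the wait is too short, a restart can launch a new master while the old one is still alive, which matches the duplicate processes reported in this bug.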

Changed in fuel:
status: In Progress → Fix Committed
Anastasia Palkina (apalkina) wrote :

Verified on ISO #169
"build_id": "2014-02-20_12-38-56",
"mirantis": "no",
"build_number": "169",
"nailgun_sha": "1cafb7c9a81946a056dcaa6554d48bf396c90e9e",
"ostf_sha": "380d376b8f16d1cf040b7cabbe9133fd0dcbeadd",
"fuelmain_sha": "15637d29a59f299ee8ffe6560245a6884e954cbe",
"astute_sha": "3d43abeefb60677ce6cae83d31ebbba1ff3cdbe2",
"release": "4.1",
"fuellib_sha": "35299a0aa5c9f75ee20c5b05003403a3d51af11c"

Changed in fuel:
status: Fix Committed → Fix Released