Bug #1452823 “retype ignores extra_specs info including backend_...” : Bugs : Cinder

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-05-07: Fix proposed to cinder (master)

#1

Fix proposed to branch: master
Review: https://review.openstack.org/181079

Changed in cinder:
assignee:	nobody → John Griffith (john-griffith)
status:	New → In Progress

Revision history for this message

Huang Zhiteng (zhiteng-huang) wrote on 2015-05-08:

#2

retype does go through scheduler (https://github.com/openstack/cinder/blob/master/cinder/volume/api.py#L1417) and scheduler does verifies the input (extra specs for the new type) against all possible candidates (https://github.com/openstack/cinder/blob/master/cinder/scheduler/manager.py#L222, https://github.com/openstack/cinder/blob/master/cinder/scheduler/filter_scheduler.py#L124).

Revision history for this message

Huang Zhiteng (zhiteng-huang) wrote on 2015-05-08:

#3

John, are you able to actually reproduce the issue with the steps you described? I doubt that.

Revision history for this message

John Griffith (john-griffith) wrote on 2015-05-08:

#4

@Huang
Not quite sure what to say on this, initial report was lvm-->lvm, I reproduced and tested SolidFire --> lvm (both with volume_backend_name set)

The call makes it to the manager and drops right into the driver.retype call and manager then proceeds to consider the retype as complete, updates all of the db entries etc

Revision history for this message

Huang Zhiteng (zhiteng-huang) wrote on 2015-05-08:

#5

John, it'd be helpful if you can post more details about the experiment you have don, configurations, volume type definitions and error log etc. Anything belongs to extra specs (except being put under scoped key, and the scoped key name isn't capabilities) should be verified by scheduler filters. If it wasn't, then we should fix the filter instead of reimplementing a sub-set of similar logic in volume manager.

Revision history for this message

John Griffith (john-griffith) wrote on 2015-05-08:

#6

@Huang
Umm, well yeah. So maybe I can try to explain this better, I think we might be looking at this from different perspectives. Here's the scenario:
1. volume_type: lvmdriver-1 extra-specs:volume_backend_name=lvmdriver-1
2. volume_type: solidfire extra-specs:voume_backend_name=solidfire
3. cinder create --volume-type solidfire 1
4. cinder retype <volume-uuid> lvmdriver-1 --migration-policy on-demand

The retype goes through the filter scheduler and verifies the new/requested type. It also does go through and find valid hosts for the placement via weighed_hosts. So this is all "good" it checks for valid hosts for the new type and comes back with "lvmdriver-1" and only "lvmdriver-1" just like it should. That's all fantastic.

The problem is the stupid code in the manager issues a driver.retype to the original volume regardless of valid_hosts etc. It just sends the call to the driver (maybe some folks think they'll have their own internal retype w/ migrate impl).

The fact that it sends that call to the driver is the problem. The drivers don't currently duplicate (nor should they) any sort of capability filtering. Most of them just iterate through the extra-spec keys, pick up and set the ones that are valid and call it good. None of them do things like raise, return error etc in cases where they get an invalid (to them) key.

So the result is that even though we told the manager via the scheduler that "Yes, here's the one and only valid host", the manager doesnt know/care about any of that and just sees "driver returned "ok" saying the migration is done, nothing to do here... moving on"

BUT the driver didn't DO anything. So now the manager code just blindly updates all the DB info and everything and you're pretty well jacked.

My comments about the missing scheduler bits here pertain to "we shouldn't even send the driver.retype call if that device isn't in the supported hosts". But in the case of anybody that does an internal migrate impl that creates an issue.

Do you understand what I'm saying now? It's actually rather straight-forward (well as straight forward as any of that particular code segment is).

I hope this clears it up.

In this case I'm

@Huang
Umm, well yeah.  So maybe I can try to explain this better, I think we might be looking at this from different perspectives.  Here's the scenario:
1.  volume_type: lvmdriver-1 extra-specs:volume_backend_name=lvmdriver-1
2. volume_type: solidfire extra-specs:voume_backend_name=solidfire
3. cinder create --volume-type solidfire 1
4. cinder retype <volume-uuid> lvmdriver-1 --migration-policy on-demand

The retype goes through the filter scheduler and verifies the new/requested type.  It also does go through and find valid hosts for the placement via weighed_hosts.  So this is all "good" it checks for valid hosts for the new type and comes back with "lvmdriver-1" and only "lvmdriver-1" just like it should.  That's all fantastic.

The problem is the stupid code in the manager issues a driver.retype to the original volume regardless of valid_hosts etc.  It just sends the call to the driver (maybe some folks think they'll have their own internal retype w/ migrate impl).

The fact that it sends that call to the driver is the problem.  The drivers don't currently duplicate (nor should they) any sort of capability filtering.  Most of them just iterate through the extra-spec keys, pick up and set the ones that are valid and call it good.  None of them do things like raise, return error etc in cases where they get an invalid (to them) key.

So the result is that even though we told the manager via the scheduler that "Yes, here's the one and only valid host", the manager doesnt know/care about any of that and just sees "driver returned "ok" saying the migration is done, nothing to do here... moving on"

BUT the driver didn't DO anything.  So now the manager code just blindly updates all the DB info and everything and you're pretty well jacked.

My comments about the missing scheduler bits here pertain to "we shouldn't even send the driver.retype call if that device isn't in the supported hosts".  But in the case of anybody that does an internal migrate impl that creates an issue.

Do you understand what I'm saying now?  It's actually rather straight-forward (well as straight forward as any of that particular code segment is).

I hope this clears it up.

In this case I'm

Revision history for this message

John Griffith (john-griffith) wrote on 2015-05-08:

#7

BTW, one additional note. It turns out you in fact won't see this on LVM --> LVM. The reason being that LVM driver has no retype implementation so it will always return "failed" and the manager will fall through to the migrate routine.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-05-11: Change abandoned on cinder (master)

#8

Change abandoned by John Griffith (<email address hidden>) on branch: master
Review: https://review.openstack.org/181079
Reason: I have a much simpler approach after thinking about this some more.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-05-11: Fix proposed to cinder (master)

#9

Fix proposed to branch: master
Review: https://review.openstack.org/182055

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-05-20: Fix merged to cinder (master)

#10

Reviewed: https://review.openstack.org/182055
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=bbb83aff023c90cfc186459603320ad3c622a40b
Submitter: Jenkins
Branch: master

commit bbb83aff023c90cfc186459603320ad3c622a40b
Author: John Griffith <email address hidden>
Date: Mon May 11 13:55:41 2015 -0600

Check volume_backend in retype

    The retype command will always attempt a call
    to the driver.retype method. In *most* cases this
    will hit the default impl which returns False because
    most drivers don't implement any retype (most, a few do).

    The problem is that the drivers that do implement this in
    most cases will iterate through the settings and just make
    the changes that are valid and ignore the rest, and then
    return True. I think this is "ok" for the drivers to do,
    drivers should be allowed to be somewhat dumb WRT Cinder
    state management and placement info. If we give them an
    invalid command (which we're doing here) then it's on us
    higher up the chain IMO.

    The result is that for example if you're trying to retype
    from backend-a to backend-b and backend-a implements retype
    it can return True telling the manager that the volume was
    succesfully retyped, even when it wasn't.

    There's a lot of confusion around this bug, YES the
    filter scheduler is used to determine if the retype is
    valid and to what host. That's not the issue, the issue
    is that regardless of the source and destination host settings
    that are provided from the filter-scheduler, we always make the
    call to the driver, introducing the opportunity for a false
    success status being reported back.

    This patch adds a very simple check between the source and
    destination host settings as provided by the scheduler and in
    the case that the two are "different"(not including pool designations)
    we skip calling the driver.retype method altogether and fall through
    to the migrate process.

This introduces a new hosts_are_equivalent method in
cinder.volume.utils

Change-Id: Idfd7dfa2284fcea66cf23c4273efda61ee8f7eba
Closes-Bug: #1452823

Reviewed:  https://review.openstack.org/182055
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=bbb83aff023c90cfc186459603320ad3c622a40b
Submitter: Jenkins
Branch:    master

commit bbb83aff023c90cfc186459603320ad3c622a40b
Author: John Griffith <john.griffith8@gmail.com>
Date:   Mon May 11 13:55:41 2015 -0600

Check volume_backend in retype
    
    The retype command will always attempt a call
    to the driver.retype method.  In *most* cases this
    will hit the default impl which returns False because
    most drivers don't implement any retype (most, a few do).
    
    The problem is that the drivers that do implement this in
    most cases will iterate through the settings and just make
    the changes that are valid and ignore the rest, and then
    return True. I think this is "ok" for the drivers to do,
    drivers should be allowed to be somewhat dumb WRT Cinder
    state management and placement info.  If we give them an
    invalid command (which we're doing here) then it's on us
    higher up the chain IMO.
    
    The result is that for example if you're trying to retype
    from backend-a to backend-b and backend-a implements retype
    it can return True telling the manager that the volume was
    succesfully retyped, even when it wasn't.
    
    There's a lot of confusion around this bug, YES the
    filter scheduler is used to determine if the retype is
    valid and to what host.  That's not the issue, the issue
    is that regardless of the source and destination host settings
    that are provided from the filter-scheduler, we always make the
    call to the driver, introducing the opportunity for a false
    success status being reported back.
    
    This patch adds a very simple check between the source and
    destination host settings as provided by the scheduler and in
    the case that the two are "different"(not including pool designations)
    we skip calling the driver.retype method altogether and fall through
    to the migrate process.
    
    This introduces a new hosts_are_equivalent method in
    cinder.volume.utils
    
    Change-Id: Idfd7dfa2284fcea66cf23c4273efda61ee8f7eba
    Closes-Bug: #1452823

Changed in cinder:
status:	In Progress → Fix Committed

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-05-21: Fix proposed to cinder (stable/kilo)

#11

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/184827

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-05-29: Fix merged to cinder (stable/kilo)

#12

Reviewed: https://review.openstack.org/184827
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=03562c362cfa217ec794dd98f197f623c1ffe45d
Submitter: Jenkins
Branch: stable/kilo

commit 03562c362cfa217ec794dd98f197f623c1ffe45d
Author: John Griffith <email address hidden>
Date: Mon May 11 13:55:41 2015 -0600

Check volume_backend in retype

    The retype command will always attempt a call
    to the driver.retype method. In *most* cases this
    will hit the default impl which returns False because
    most drivers don't implement any retype (most, a few do).

    The problem is that the drivers that do implement this in
    most cases will iterate through the settings and just make
    the changes that are valid and ignore the rest, and then
    return True. I think this is "ok" for the drivers to do,
    drivers should be allowed to be somewhat dumb WRT Cinder
    state management and placement info. If we give them an
    invalid command (which we're doing here) then it's on us
    higher up the chain IMO.

    The result is that for example if you're trying to retype
    from backend-a to backend-b and backend-a implements retype
    it can return True telling the manager that the volume was
    succesfully retyped, even when it wasn't.

    There's a lot of confusion around this bug, YES the
    filter scheduler is used to determine if the retype is
    valid and to what host. That's not the issue, the issue
    is that regardless of the source and destination host settings
    that are provided from the filter-scheduler, we always make the
    call to the driver, introducing the opportunity for a false
    success status being reported back.

    This patch adds a very simple check between the source and
    destination host settings as provided by the scheduler and in
    the case that the two are "different"(not including pool designations)
    we skip calling the driver.retype method altogether and fall through
    to the migrate process.

This introduces a new hosts_are_equivalent method in
cinder.volume.utils

Conflicts:
cinder/volume/utils.py

    Closes-Bug: #1452823
    (cherry picked from commit bbb83aff023c90cfc186459603320ad3c622a40b)
    Change-Id: Idfd7dfa2284fcea66cf23c4273efda61ee8f7eba

Reviewed:  https://review.openstack.org/184827
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=03562c362cfa217ec794dd98f197f623c1ffe45d
Submitter: Jenkins
Branch:    stable/kilo

commit 03562c362cfa217ec794dd98f197f623c1ffe45d
Author: John Griffith <john.griffith8@gmail.com>
Date:   Mon May 11 13:55:41 2015 -0600

Check volume_backend in retype
    
    The retype command will always attempt a call
    to the driver.retype method.  In *most* cases this
    will hit the default impl which returns False because
    most drivers don't implement any retype (most, a few do).
    
    The problem is that the drivers that do implement this in
    most cases will iterate through the settings and just make
    the changes that are valid and ignore the rest, and then
    return True. I think this is "ok" for the drivers to do,
    drivers should be allowed to be somewhat dumb WRT Cinder
    state management and placement info.  If we give them an
    invalid command (which we're doing here) then it's on us
    higher up the chain IMO.
    
    The result is that for example if you're trying to retype
    from backend-a to backend-b and backend-a implements retype
    it can return True telling the manager that the volume was
    succesfully retyped, even when it wasn't.
    
    There's a lot of confusion around this bug, YES the
    filter scheduler is used to determine if the retype is
    valid and to what host.  That's not the issue, the issue
    is that regardless of the source and destination host settings
    that are provided from the filter-scheduler, we always make the
    call to the driver, introducing the opportunity for a false
    success status being reported back.
    
    This patch adds a very simple check between the source and
    destination host settings as provided by the scheduler and in
    the case that the two are "different"(not including pool designations)
    we skip calling the driver.retype method altogether and fall through
    to the migrate process.
    
    This introduces a new hosts_are_equivalent method in
    cinder.volume.utils
    
    Conflicts:
    	cinder/volume/utils.py
    
    Closes-Bug: #1452823
    (cherry picked from commit bbb83aff023c90cfc186459603320ad3c622a40b)
    Change-Id: Idfd7dfa2284fcea66cf23c4273efda61ee8f7eba

Thierry Carrez (ttx) on 2015-06-23

Changed in cinder:
milestone:	none → liberty-1
status:	Fix Committed → Fix Released

Thierry Carrez (ttx) on 2015-10-15

Changed in cinder:
milestone:	liberty-1 → 7.0.0

Affects		Status	Importance	Assigned to	Milestone
	Cinder	Fix Released	Undecided	John Griffith	Cinder 7.0.0 "liberty"
	Kilo	Fix Released	Undecided	John Griffith	Cinder 2015.1.1

Cinder

retype ignores extra_specs info including backend_name

Bug Description

Other bug subscribers

Remote bug watches