XenServer resize: the second resize of the same vm fails

Bug #892765 reported by Giuseppe Civitella
This bug report is a duplicate of: Bug #949477: Prevent bad images from corrupting SR.
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

I'm seeing the following problem when running a repeated-resize
test with XenServer (versions 5.6 SP2 and 6) as the hypervisor.
The first resize works as expected. If I do another resize a
few minutes later, I get a "VHD coalesce attempts exceeded"
error on the destination host.
Here's an extract from nova-compute.log:
http://paste.openstack.org/show/3368/

On the hypervisor's /var/log/SMlog I can find errors like this:
http://paste.openstack.org/show/3369/

Thierry Carrez (ttx)
Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Chris Behrens (cbehrens) wrote :

The timeout for waiting on coalesce is too low. I'd try raising the value of this flag:

FLAGS.xenapi_vhd_coalesce_max_attempts

You might need to raise it substantially; a coalesce could take up to an hour. The max attempts flag works together with this flag:

FLAGS.xenapi_vhd_coalesce_poll_interval

We check whether the coalesce is done <max_attempts> times, once every <poll_interval> seconds. With the default poll_interval of 5, you may want to set max_attempts to 720 to get one hour total.

Try that and see what happens? If that fixes your problem, we can raise the default values in nova.
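
As a rough illustration of how the two flags combine, here is a minimal Python sketch of a bounded polling loop. It is not nova's actual implementation: the function names `total_wait_seconds` and `wait_for_coalesce`, the `is_coalesced` callback, and the injectable `sleep` parameter are all hypothetical, chosen only to show the arithmetic (attempts times interval) and the point at which a "coalesce attempts exceeded" style error would be raised.

```python
import time


def total_wait_seconds(max_attempts, poll_interval):
    """Upper bound on the time spent waiting: attempts times interval."""
    return max_attempts * poll_interval


def wait_for_coalesce(is_coalesced, max_attempts, poll_interval=5.0,
                      sleep=time.sleep):
    """Poll is_coalesced() up to max_attempts times.

    Hypothetical helper: is_coalesced is any zero-argument callable that
    returns True once the VHD chain has been coalesced. Returns the number
    of the attempt that succeeded, or raises if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        if is_coalesced():
            return attempt
        # Not done yet: wait one poll interval before checking again.
        sleep(poll_interval)
    raise RuntimeError(
        "VHD coalesce attempts exceeded (%d attempts)" % max_attempts)


if __name__ == "__main__":
    # 720 attempts at the default 5 second interval is one hour in total.
    print(total_wait_seconds(720, 5))
```

Under these assumptions, raising either value lengthens the total wait, but it only helps if the coalesce would eventually finish; if something on the SR blocks the coalesce entirely, the loop still ends in the error.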

Giuseppe Civitella (gcivitella) wrote :

I tried raising the coalesce flags twice:
once with xenapi_vhd_coalesce_max_attempts=720 alone, and once with xenapi_vhd_coalesce_max_attempts=720 and xenapi_vhd_coalesce_poll_interval=10.0.

This did not solve the problem.

Rick Harris (rconradharris) wrote :

Giuseppe: Are you still seeing this issue?

I wasn't able to replicate this on 5.6.100-39215p.

One thing that may help debug this: start with a fresh SR, replicate the issue, and paste more or less the full SMlog output.

We should then be able to see which VDI is erroneously holding a reference to the base-copy and preventing the VHD from coalescing.

Giuseppe Civitella (gcivitella) wrote :

Hi Rick,

I'm not seeing the problem anymore.
I should be able to do some final tests on my environment by
the end of next week.
As soon as I get the results I'll update the bug.
In any case, I found a solution by following Dan Prince's advice about
VM image installation, reported here:
https://bugs.launchpad.net/nova/+bug/862653
Increasing xenapi_vhd_coalesce_poll_interval also helped with
resizing large images (10 to 15 GB thin-provisioned disks).

Giuseppe Civitella (gcivitella) wrote :

Hi Rick,

I can confirm I'm not having the problem anymore.
My environment is XenServer 6, nova Essex (milestones 2 and 3), and images
built without using xenconverter's OVA.
A huge thanks to Todd Deshane and Paul Voccio for their help and patience:
they can barely imagine how much I owe them :-)

Regards
Giuseppe

Giuseppe Civitella (gcivitella) wrote :

When troubleshooting problems like this, it can be useful to look at
/var/log/SMlog on XenServer's dom0.
This failure can be confused with a general VDI coalesce problem.
As an example, have a look at this:

http://paste.openstack.org/show/4685/

In this case the error was caused by an earlier failed deploy.
To solve the problem I just had to delete the VDI with uuid
a6661c5d-31ca-416d-b749-c41ea13c29f3, which was totally unrelated to
the resize process that was reporting the error.

Hope it helps someone else.
Giuseppe

Rick Harris (rconradharris) wrote :

Giuseppe:

Yep, we've seen that traceback before, it's related to https://bugs.launchpad.net/nova/+bug/949477.

That has since been fixed, so we shouldn't be corrupting the SR anymore, and we've included a vdi_chain_cleanup script in `tools/xenserver` to fix up any bad SRs.

Unless you can think of something else we should be worrying about, I think it's safe to close this bug now.
