db replicator should quarantine on missing columns

Bug #1414588 reported by Gil Vernik
This bug affects 1 person
Affects                           Status     Importance  Assigned to  Milestone
OpenStack Object Storage (swift)  Confirmed  Medium      Unassigned

Bug Description

I am not sure whether this is a bug or something missing in my configuration.
I noticed these errors in the logs of my test swift cluster,
which I constantly upgrade from trunk.
I know that 'object_count' was added with Storage Policies, but I thought there was some backward compatibility and these errors should not appear.

Can someone explain to me what is wrong?

Jan 26 11:32:25 ubuntu-swift-dev container-replicator: ERROR reading db /srv/3/node/sdb3/containers/33/bc3/084910b7fb0e80a62ee67010a1ac8bc3/084910b7fb0e80a62ee67010a1ac8bc3.db: #012Traceback (most recent call last):#012 File "/home/swift/OpenStack/swift/swift/common/db_replicator.py", line 436, in _replicate_object#012 info = broker.get_replication_info()#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 543, in get_replication_info#012 info = self.get_info()#012 File "/home/swift/OpenStack/swift/swift/container/backend.py", line 411, in get_info#012 ''') % (trailing_sync, trailing_pol)).fetchone()#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 129, in execute#012 self.timeout, self.db_file, lambda: sqlite3.Cursor.execute(#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 67, in _db_timeout#012 return call()#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 130, in <lambda>#012 self, *args, **kwargs))#012OperationalError: no such column: object_count
Jan 26 11:32:25 ubuntu-swift-dev container-replicator: ERROR reading db /srv/3/node/sdb3/containers/544/3e5/8811f9249de0f711d6f652ab48aba3e5/8811f9249de0f711d6f652ab48aba3e5.db: #012Traceback (most recent call last):#012 File "/home/swift/OpenStack/swift/swift/common/db_replicator.py", line 436, in _replicate_object#012 info = broker.get_replication_info()#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 543, in get_replication_info#012 info = self.get_info()#012 File "/home/swift/OpenStack/swift/swift/container/backend.py", line 411, in get_info#012 ''') % (trailing_sync, trailing_pol)).fetchone()#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 129, in execute#012 self.timeout, self.db_file, lambda: sqlite3.Cursor.execute(#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 67, in _db_timeout#012 return call()#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 130, in <lambda>#012 self, *args, **kwargs))#012OperationalError: no such column: object_count
Jan 26 11:32:25 ubuntu-swift-dev container-replicator: ERROR reading db /srv/3/node/sdb3/containers/632/054/9e20091f129e0ef83dfd9d112a9d8054/9e20091f129e0ef83dfd9d112a9d8054.db: #012Traceback (most recent call last):#012 File "/home/swift/OpenStack/swift/swift/common/db_replicator.py", line 436, in _replicate_object#012 info = broker.get_replication_info()#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 543, in get_replication_info#012 info = self.get_info()#012 File "/home/swift/OpenStack/swift/swift/container/backend.py", line 411, in get_info#012 ''') % (trailing_sync, trailing_pol)).fetchone()#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 129, in execute#012 self.timeout, self.db_file, lambda: sqlite3.Cursor.execute(#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 67, in _db_timeout#012 return call()#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 130, in <lambda>#012 self, *args, **kwargs))#012OperationalError: no such column: object_count

Gil Vernik (gilv)
description: updated
Gil Vernik (gilv) wrote :

Adding more to this: I have a couple of container DBs that constantly fail in container replication.
I am getting errors like this:

 File "/home/swift/OpenStack/swift/swift/common/db.py", line 130, in <lambda>#012 self, *args, **kwargs))#012OperationalError: no such column: object_count
Jan 26 13:47:08 ubuntu-swift-dev container-replicator: Replication run OVER
Jan 26 13:47:08 ubuntu-swift-dev container-replicator: Attempted to replicate 105 dbs in 0.32922 seconds (318.93944/s)
Jan 26 13:47:08 ubuntu-swift-dev container-replicator: Removed 0 dbs
Jan 26 13:47:08 ubuntu-swift-dev container-replicator: 202 successes, 4 failures
Jan 26 13:47:08 ubuntu-swift-dev container-replicator: no_change:202 ts_repl:0 diff:0 rsync:0 diff_capped:0 hashmatch:0 empty:0
Jan 26 13:47:11 ubuntu-swift-dev container-replicator: Beginning replication run
Jan 26 13:47:11 ubuntu-swift-dev container-replicator: ERROR reading db /srv/4/node/sdb4/containers/544/3e5/8811f9249de0f711d6f652ab48aba3e5/8811f9249de0f711d6f652ab48aba3e5.db: #012Traceback (most recent call last):#012 File "/home/swift/OpenStack/swift/swift/common/db_replicator.py", line 436, in _replicate_object#012 info = broker.get_replication_info()#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 543, in get_replication_info#012 info = self.get_info()#012 File "/home/swift/OpenStack/swift/swift/container/backend.py", line 411, in get_info#012 ''') % (trailing_sync, trailing_pol)).fetchone()#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 129, in execute#012 self.timeout, self.db_file, lambda: sqlite3.Cursor.execute(#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 67, in _db_timeout#012 return call()#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 130, in <lambda>#012 self, *args, **kwargs))#012OperationalError: no such column: object_count
Jan 26 13:47:11 ubuntu-swift-dev container-replicator: ERROR reading db /srv/4/node/sdb4/containers/462/54e/7383600001a3888360f31a2c9257a54e/7383600001a3888360f31a2c9257a54e.db: #012Traceback (most recent call last):#012 File "/home/swift/OpenStack/swift/swift/common/db_replicator.py", line 436, in _replicate_object#012 info = broker.get_replication_info()#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 543, in get_replication_info#012 info = self.get_info()#012 File "/home/swift/OpenStack/swift/swift/container/backend.py", line 411, in get_info#012 ''') % (trailing_sync, trailing_pol)).fetchone()#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 129, in execute#012 self.timeout, self.db_file, lambda: sqlite3.Cursor.execute(#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 67, in _db_timeout#012 return call()#012 File "/home/swift/OpenStack/swift/swift/common/db.py", line 130, in <lambda>#012 self, *args, **kwargs))#012OperationalError: no such column: object_count

Alistair Coles (alistair-coles) wrote :

Gil - I'm not hugely familiar with that piece of the code but 'object_count' was not a new column with storage policies so OperationalError: no such column: object_count suggests maybe a db corruption? You can use swift-container-info to exercise the same code path (get_info in db.py) e.g.

$ swift-container-info /srv/4/node/sdb4/containers/544/3e5/8811f9249de0f711d6f652ab48aba3e5/8811f9249de0f711d6f652ab48aba3e5.db

and repeat on each replica of the db to verify whether it is just one bad replica.
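As a rough complement to swift-container-info, the column check can also be done directly against the SQLite file. This is only a minimal sketch using Python's standard sqlite3 module; the function names `columns_of` and `has_object_count` are made up for illustration and are not part of Swift. It opens the file read-only so that inspecting a replica does not modify it.

```python
import sqlite3
from pathlib import Path


def columns_of(db_path, relation="container_stat"):
    """Return the column names SQLite reports for a table or view.

    Hypothetical helper for illustration; not part of Swift.
    """
    # mode=ro opens the file read-only and fails if it does not exist,
    # instead of silently creating an empty database.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        # PRAGMA table_info returns one row per column; index 1 is the name.
        # The relation name is interpolated, so only pass trusted names.
        rows = conn.execute("PRAGMA table_info(%s)" % relation).fetchall()
        return [row[1] for row in rows]
    finally:
        conn.close()


def has_object_count(db_path):
    """True if container_stat exposes an object_count column."""
    return "object_count" in columns_of(db_path)
```

Running `has_object_count()` against each replica's .db path would show whether only one copy is missing the column.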

clayg (clay-gerrard) wrote :

So, talking with Gil about this in #openstack-swift, it seemed these containers had a container_stat *table* with no object_count field. That is confusing, because back when container_stat was a table it always had an object_count field, and after the storage policy migration the container_stat table gets migrated to container_info and replaced with a view in a transaction.

Not sure of the exact sequence that leads to this state, but it would be nice to verify at a minimum that it will get flagged by container-auditor, which I think maybe it will not (container.auditor.ContainerAuditor.container_audit doesn't seem to rename the db on errors?)
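The table-versus-view distinction mentioned above can be checked from SQLite's own schema catalog. The following is a sketch only, assuming nothing about Swift beyond the relation names container_stat and container_info given in this comment; `relation_kind` is a hypothetical helper name.

```python
import sqlite3


def relation_kind(conn, name):
    """Return 'table', 'view', or None if no such relation exists.

    Hypothetical helper for illustration; not part of Swift.
    """
    # sqlite_master lists every schema object together with its type.
    row = conn.execute(
        "SELECT type FROM sqlite_master "
        "WHERE name = ? AND type IN ('table', 'view')",
        (name,),
    ).fetchone()
    return row[0] if row else None
```

On a db in the state described here, `relation_kind(conn, "container_stat")` would be expected to report 'table' rather than 'view'.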

Tim Burke (1-tim-z) wrote :

Yeah, container-auditor definitely doesn't cover as much quarantining as container-replicator.

I'm still at a loss as to how we might get in this state, but it seems like we should

- quarantine on missing columns during get_info in container-replicator and
- quarantine on missing tables or missing columns during get_info in container-auditor.
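The proposal in the two bullets above could look roughly like the following. This is a standalone sketch using only the Python standard library, not Swift's replicator or auditor code: `quarantine` and `read_info_or_quarantine` are hypothetical names, the query is a stand-in for get_info, and moving the db's parent directory under a quarantine root is an assumption about what quarantining means here.

```python
import os
import shutil
import sqlite3

# Substrings of sqlite3.OperationalError messages that indicate a broken schema.
SCHEMA_ERRORS = ("no such table", "no such column")


def quarantine(db_path, quarantine_root):
    """Hypothetical helper: move the db's directory under quarantine_root."""
    src_dir = os.path.dirname(db_path)
    dest = os.path.join(quarantine_root, os.path.basename(src_dir))
    os.makedirs(quarantine_root, exist_ok=True)
    shutil.move(src_dir, dest)
    return dest


def read_info_or_quarantine(db_path, quarantine_root):
    """Return ('ok', row) or ('quarantined', new_dir).

    Errors other than a missing table or column are re-raised unchanged.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Stand-in for get_info(): touch the column the bug report names.
        row = conn.execute(
            "SELECT object_count FROM container_stat").fetchone()
        return ("ok", row)
    except sqlite3.OperationalError as err:
        if not any(marker in str(err) for marker in SCHEMA_ERRORS):
            raise
    finally:
        # Close before moving the file out from under the connection.
        conn.close()
    return ("quarantined", quarantine(db_path, quarantine_root))
```

Matching on the error text is crude, but it mirrors what the logged message ("no such column: object_count") makes available to the caller.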

I find it interesting that truncating a container db gets the auditor to quarantine, while dropping an account db in the containers dir doesn't :-/ Account DB will get quarantined by container-replicator though!

Changed in swift:
status: New → Confirmed
importance: Undecided → Medium
summary: - db replicator - object count not found
+ db replicator should quarantine on missing columns