EC reconstructor rebuilds an undetectably corrupt fragment if one of the other fragments is corrupt.

Bug #1971546 reported by bakawang
This bug affects 1 person
Affects: OpenStack Object Storage (swift)
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

The reconstructor rebuilds a corrupt EC fragment, and the corruption cannot be detected, when one of the available source fragments is itself corrupt.

Reproduce:

1. upload file
```
# generate 10M file
$dd if=/dev/urandom of=10M bs=10M count=1
# upload
$swift --os-storage-url http://127.0.0.1:8080/v1/AUTH_TEST upload test 10M
$swift --os-storage-url http://127.0.0.1:8080/v1/AUTH_TEST test 10M
```

2. corrupt fragment

A healthy fragment may become corrupt due to bit rot or other causes.
Here we simply write a zero byte at an arbitrary offset in fragment#0 to simulate bit rot:
```
# check md5 before bit-rot
$md5sum 1650604523.21965#0#d.data
9acf9e57969a27af0694b830a812e828 1650604523.21965#0#d.data

# write zero to position 1000
$dd if=/dev/zero of=1650604523.21965#0#d.data bs=1 seek=1000 count=1 conv=notrunc

# check md5 changed
$md5sum 1650604523.21965#0#d.data
a170f58736fac750d8c77ac0e5f40f1a 1650604523.21965#0#d.data
```
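The same simulation can be sketched in Python. This is a minimal illustration on a scratch file rather than a real fragment archive, and the helper names `md5_of`, `zero_byte` and `simulate_bit_rot` are made up for this example. Note that if the byte at the chosen offset already happens to be zero, the file and its MD5 will not change; the sketch avoids that by filling the scratch file with `0xFF`.

```python
import hashlib
import os
import tempfile


def md5_of(path):
    """Return the hex MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def zero_byte(path, offset):
    """Overwrite one byte at offset with 0x00 without truncating the file."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(b"\x00")


def simulate_bit_rot(size=4096, offset=1000):
    """Create a scratch file, zero one byte, return (md5_before, md5_after)."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            # 0xFF filler guarantees the target byte is non-zero beforehand.
            f.write(b"\xff" * size)
        before = md5_of(path)
        zero_byte(path, offset)
        after = md5_of(path)
        return before, after
    finally:
        os.remove(path)


if __name__ == "__main__":
    before, after = simulate_bit_rot()
    print("before:", before)
    print("after: ", after)
    print("changed:", before != after)
```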

3. reconstruct

Remove fragment#1, then run the reconstructor on the neighboring primary node (#2):
```
$swift-object-reconstructor {config} -p {partition} -o
```
After reconstruction, fragment#0 is quarantined, but fragment#1 has already been rebuilt "successfully" from it.
The rebuilt fragment#1's etag matches its md5, so the corruption in it cannot be detected.
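Why the rebuilt fragment looks healthy can be shown with a toy model. The sketch below uses a single XOR parity fragment instead of the `isa_l_rs_vand` Reed-Solomon code from this environment, and every name in it is hypothetical. It also assumes that the rebuilt fragment's etag is computed over the rebuilt bytes, which is consistent with the observation above but is not taken from Swift's code. It only illustrates that a checksum computed over rebuilt data is self-consistent even when an input to the rebuild was corrupt.

```python
import hashlib


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def md5_hex(data):
    return hashlib.md5(data).hexdigest()


def demo():
    # Toy 2+1 code: two data fragments and one XOR parity fragment.
    frag0 = b"A" * 16
    frag1 = b"B" * 16
    parity = xor_bytes(frag0, frag1)
    object_etag = md5_hex(frag0 + frag1)

    # Bit rot: one byte of fragment #0 becomes zero.
    corrupt0 = frag0[:5] + b"\x00" + frag0[6:]

    # Fragment #1 is missing and is rebuilt from the corrupt fragment #0.
    rebuilt1 = xor_bytes(corrupt0, parity)
    # Assumption: the etag stored with the rebuilt fragment is the MD5
    # of the bytes that were actually produced by the rebuild.
    rebuilt1_etag = md5_hex(rebuilt1)

    return {
        # The rebuilt fragment is not the original fragment #1 ...
        "rebuilt_differs_from_original": rebuilt1 != frag1,
        # ... yet it passes a check of its own data against its own etag.
        "rebuilt_passes_own_checksum": md5_hex(rebuilt1) == rebuilt1_etag,
        # Even with a good fragment #0, the assembled object is wrong.
        "object_etag_matches": md5_hex(frag0 + rebuilt1) == object_etag,
    }


if __name__ == "__main__":
    for name, value in demo().items():
        print(name, value)
```

In this toy model only the object-level etag check at download time notices the problem, which mirrors the download failure in step 4.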

4. download

The download fails with an etag mismatch:
```
$swift --os-storage-url http://127.0.0.1:8080/v1/AUTH_TEST download test 10M
Error downloading object 'test/10M': 'Error downloading test: md5sum != etag, 1473872ce26c5e0cf19b545d14f07cb8 != 37143a3844b46a90100197a0f1334f1f'
```

Environment:

swift version - wallaby
policy_type - erasure_coding
ec_type - isa_l_rs_vand
ec_num_data_fragments - 16
ec_num_parity_fragments - 4

clayg (clay-gerrard) wrote :

Maybe it's because we don't (can't?) send the md5 of the reconstructed fragment along with the PUT to the restored node? Like, *that* (invalid) frag would think it's correct.

It could be that the reconstructor is missing an opportunity to check the etag of the fragments it's receiving (by the end it could have noticed, same as the object-server quarantine) - but since there's no two-phase EC commit in ssync, I don't see how it could notify the receiver.
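The check suggested above might look roughly like the following. This is a sketch only, not Swift's reconstructor code: `verify_fragment_stream`, `FragmentCorrupt`, `consume` and the `expected_etag` parameter are hypothetical, and it assumes each source fragment comes with the MD5 of its full contents. Because a mismatch is only known once the last chunk has been read, a streaming rebuild would already have forwarded data by then, which is the notification problem described above.

```python
import hashlib


class FragmentCorrupt(Exception):
    """Raised when a fragment's contents do not match its expected etag."""


def verify_fragment_stream(chunks, expected_etag):
    """Yield chunks unchanged; raise FragmentCorrupt after the final
    chunk if the MD5 of everything yielded differs from expected_etag."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
        yield chunk
    if digest.hexdigest() != expected_etag:
        raise FragmentCorrupt(
            "md5 %s != etag %s" % (digest.hexdigest(), expected_etag))


def consume(chunks, expected_etag):
    """Read a verified stream; return (bytes_received, is_valid)."""
    received = bytearray()
    try:
        for chunk in verify_fragment_stream(chunks, expected_etag):
            received.extend(chunk)
    except FragmentCorrupt:
        # All data was already handed to the consumer before the
        # mismatch could be detected.
        return bytes(received), False
    return bytes(received), True


if __name__ == "__main__":
    etag = hashlib.md5(b"abcdef").hexdigest()
    print(consume([b"abc", b"def"], etag))
    print(consume([b"abc", b"dEf"], etag))
```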

A future roadmap could include rebuilds moving off the ssync protocol.

Changed in swift:
importance: Undecided → High
status: New → Triaged