fluentd not reconnecting to ES on failures

Bug #1830724 reported by Krzysztof Klimonda
This bug affects 3 people
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| kolla-ansible | Fix Released | Medium | Doug Szumski | |
| Rocky | New | Medium | Unassigned | |
| Stein | Fix Released | Medium | Radosław Piliszek | |
| Train | Fix Released | Medium | Radosław Piliszek | |
| Ussuri | Fix Released | Medium | Doug Szumski | |

Bug Description

According to the fluent-plugin-elasticsearch documentation, the plugin by default reconnects to the ES cluster only when it receives a "host unreachable" exception. This can be changed by setting `reconnect_on_error` to `true`. This is even more strongly recommended when connecting to ES clusters running security guard.
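For illustration, a minimal sketch of where the option would go in a fluentd output section. The `<match **>` pattern and the host `es.example.com` are placeholders assumed for this example, not values taken from this deployment or from the kolla-ansible templates:

```
# Hypothetical output section; pattern and host are placeholders.
<match **>
  @type elasticsearch
  host es.example.com
  port 9200
  scheme https
  # Reset the connection on any error, not only on "host unreachable",
  # so the next flush opens a new session to the cluster.
  reconnect_on_error true
</match>
```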

What I'm currently experiencing in my deployment seems to be related: once the fluentd ES plugin loses connectivity to the ES cluster, it never recovers and logs are no longer sent:

```
2019-05-22 21:47:32 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2019-05-22 21:47:33 +0000 chunk="58980e875da18f46c6c1030714d07a5d" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"monitor.region1.\", :port=>9200, :scheme=>\"https\", :user=>\"logstash\", :password=>\"obfuscated\"}): read timeout reached"
2019-05-23 19:04:44 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2019-05-23 19:04:45 +0000 chunk="58992c060e9445fe909cb4dadc1751ab" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"monitor.region1.\", :port=>9200, :scheme=>\"https\", :user=>\"logstash\", :password=>\"obfuscated\"}): end of file reached (EOFError)"
2019-05-23 19:04:45 +0000 [warn]: #0 failed to flush the buffer. retry_time=1 next_retry_seconds=2019-05-23 19:04:46 +0000 chunk="58992c060e9445fe909cb4dadc1751ab" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"monitor.region1.\", :port=>9200, :scheme=>\"https\", :user=>\"logstash\", :password=>\"obfuscated\"}): end of file reached (EOFError)"
2019-05-23 19:04:46 +0000 [warn]: #0 failed to flush the buffer. retry_time=2 next_retry_seconds=2019-05-23 19:04:48 +0000 chunk="58992c060e9445fe909cb4dadc1751ab" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"monitor.region1.\", :port=>9200, :scheme=>\"https\", :user=>\"logstash\", :password=>\"obfuscated\"}): end of file reached (EOFError)"
[...]
```

If I wait long enough, I can see that fluentd gives up on pushing the chunks and drops them.

I'll open a review with a proposed configuration change that I've just deployed on one of my controller nodes to see if it helps.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/661747

Changed in kolla-ansible:
assignee: nobody → Krzysztof Klimonda (kklimonda)
status: New → In Progress

Mark Goddard (mgoddard) wrote :

We have seen this on Rocky-based clouds. Jack Heskett and Doug Szumski spent some time on it, so they may be able to help. I added them as reviewers.

Changed in kolla-ansible:
assignee: Krzysztof Klimonda (kklimonda) → Doug Szumski (dszumski)
Changed in kolla-ansible:
assignee: Doug Szumski (dszumski) → Krzysztof Klimonda (kklimonda)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/671080

OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla-ansible (stable/stein)

Change abandoned by Will Szumski (<email address hidden>) on branch: stable/stein
Review: https://review.opendev.org/671080
Reason: Not merged in master

Changed in kolla-ansible:
assignee: Krzysztof Klimonda (kklimonda) → Michal Nasiadka (mnasiadka)
Changed in kolla-ansible:
assignee: Michal Nasiadka (mnasiadka) → Doug Szumski (dszumski)
Changed in kolla-ansible:
assignee: Doug Szumski (dszumski) → Michal Nasiadka (mnasiadka)

Radosław Piliszek (yoctozepto) wrote :

Duplicate: https://bugs.launchpad.net/kolla-ansible/+bug/1855528

It seems to give up after some time and then works again for a bit.
I suspect there is also a bug in pooling, because nothing else indicates a connectivity problem between fluentd and ES; at most there could have been some intermittent load.

Changed in kolla-ansible:
assignee: Michal Nasiadka (mnasiadka) → Radosław Piliszek (yoctozepto)
Changed in kolla-ansible:
assignee: Radosław Piliszek (yoctozepto) → Doug Szumski (dszumski)

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/661747
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=0c573062fc25e208bfa1206146fb31b401c8b7e5
Submitter: Zuul
Branch: master

commit 0c573062fc25e208bfa1206146fb31b401c8b7e5
Author: Krzysztof Klimonda <email address hidden>
Date: Tue May 28 12:05:48 2019 +0000

    Make fluentd-elasticsearch configuration more robust

    Enable reconnect_on_error option so that ES plugin re-establishes
    a new session to the ES cluster on errors. Also, enable buffering
    to the file, so that the buffer survives container restarts.

    Co-Authored-By: Michal Nasiadka <email address hidden>
    Co-Authored-By: Radosław Piliszek <email address hidden>
    Co-Authored-By: Doug Szumski <email address hidden>
    Closes-Bug: #1830724
    Change-Id: Ia40685b9d4fc02194e03c8791ddeb3d29d7f07f6
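As a rough sketch of the two changes the commit message describes (this is not the actual kolla-ansible template), the output section might combine `reconnect_on_error` with a file-backed `<buffer>`. The match pattern, host, and buffer path below are hypothetical, and the buffer only survives a container restart if that path is on a volume that outlives the container:

```
# Hypothetical sketch; pattern, host and path are placeholders.
<match **>
  @type elasticsearch
  host es.example.com
  port 9200
  reconnect_on_error true
  <buffer>
    # Keep pending chunks on disk instead of in memory.
    @type file
    # Assumed location; must be on persistent storage.
    path /var/lib/fluentd/data/elasticsearch.buffer
  </buffer>
</match>
```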

Changed in kolla-ansible:
status: In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/700927

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/stein)

Reviewed: https://review.opendev.org/671080
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=7b3b1def82262dd44ba8e3865b53855a7e3a3143
Submitter: Zuul
Branch: stable/stein

commit 7b3b1def82262dd44ba8e3865b53855a7e3a3143
Author: Krzysztof Klimonda <email address hidden>
Date: Tue May 28 12:05:48 2019 +0000

    Make fluentd-elasticsearch configuration more robust

    Enable reconnect_on_error option so that ES plugin re-establishes
    a new session to the ES cluster on errors. Also, enable buffering
    to the file, so that the buffer survives container restarts.

    Co-Authored-By: Michal Nasiadka <email address hidden>
    Co-Authored-By: Radosław Piliszek <email address hidden>
    Co-Authored-By: Doug Szumski <email address hidden>
    Closes-Bug: #1830724
    Change-Id: Ia40685b9d4fc02194e03c8791ddeb3d29d7f07f6
    (cherry picked from commit 0c573062fc25e208bfa1206146fb31b401c8b7e5)

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/train)

Reviewed: https://review.opendev.org/700927
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=51adfd0100e00353daacc032f155919a818c0289
Submitter: Zuul
Branch: stable/train

commit 51adfd0100e00353daacc032f155919a818c0289
Author: Krzysztof Klimonda <email address hidden>
Date: Tue May 28 12:05:48 2019 +0000

    Make fluentd-elasticsearch configuration more robust

    Enable reconnect_on_error option so that ES plugin re-establishes
    a new session to the ES cluster on errors. Also, enable buffering
    to the file, so that the buffer survives container restarts.

    Co-Authored-By: Michal Nasiadka <email address hidden>
    Co-Authored-By: Radosław Piliszek <email address hidden>
    Co-Authored-By: Doug Szumski <email address hidden>
    Closes-Bug: #1830724
    Change-Id: Ia40685b9d4fc02194e03c8791ddeb3d29d7f07f6
    (cherry picked from commit 0c573062fc25e208bfa1206146fb31b401c8b7e5)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 8.1.0

This issue was fixed in the openstack/kolla-ansible 8.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 9.0.1

This issue was fixed in the openstack/kolla-ansible 9.0.1 release.
