
Use the persisted broker ID instead of the address as the unique identifier #5989

Closed
@TheR1sing3un


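Current problems

Currently brokerAddr serves as the unique identifier of a broker, while brokerId is treated as an identifier that may be lost. This causes problems in the following scenarios:

  • In container environments, every Broker restart changes its IP, so the records left under the previous brokerAddr (for example, syncStateSet) can no longer be associated with the restarted broker.
  • Temporarily adding a VIP, or changing the VIP, also changes the brokerAddr.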

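Proposed improvement

On the Controller side, use ClusterName:BrokerName:BrokerId as the unique identifier instead of BrokerAddr, and persist the BrokerId. Because ClusterName and BrokerName are both set in the configuration file at startup, only the allocation and persistence of the BrokerId need to be handled.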
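To illustrate why keying on ClusterName:BrokerName:BrokerId survives an address change, here is a minimal sketch. The class and method names are hypothetical and are not taken from the RocketMQ codebase; the addresses are made-up examples.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: these names are illustrative, not RocketMQ APIs.
final class BrokerIdentitySketch {
    // Controller-side view: identity key -> last known broker address.
    private final Map<String, String> addressByIdentity = new HashMap<>();

    // Builds the ClusterName:BrokerName:BrokerId key described above.
    static String identityKey(String clusterName, String brokerName, long brokerId) {
        return clusterName + ":" + brokerName + ":" + brokerId;
    }

    // Registering again under the same identity only updates the address.
    void register(String clusterName, String brokerName, long brokerId, String brokerAddr) {
        addressByIdentity.put(identityKey(clusterName, brokerName, brokerId), brokerAddr);
    }

    String addressOf(String clusterName, String brokerName, long brokerId) {
        return addressByIdentity.get(identityKey(clusterName, brokerName, brokerId));
    }

    int knownBrokers() {
        return addressByIdentity.size();
    }
}
```

With this shape, a broker that restarts with a new IP overwrites its own address entry rather than appearing as a new broker.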

Registration flow

First startup

1. GetNextBrokerIdReq

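When a Broker starts for the first time, it has only the ClusterName and BrokerName from its configuration file, plus its own BrokerAddr. It therefore needs to negotiate with the Controller an identifier that stays unique for the whole lifetime of the cluster: the BrokerId. BrokerIds start at 1. When a Broker is elected Master, it uses BrokerId 0 for the duration of its term; when it becomes a Slave again, it reverts to its original BrokerId.
At this point the Broker sends a GetNextBrokerId request to the Controller to obtain the next BrokerId available for allocation (allocation starts at 1).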
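The Master/Slave rule for the BrokerId can be sketched as a small pure function. The names below are hypothetical; only the values (0 for Master, allocated IDs starting at 1) come from the description above.

```java
// Hypothetical sketch of the role-dependent BrokerId rule.
final class BrokerRoleIdSketch {
    static final long MASTER_ID = 0L;

    // Returns the id a broker reports: 0 while it is Master,
    // otherwise the id it was originally allocated (>= 1).
    static long effectiveBrokerId(long assignedBrokerId, boolean isMaster) {
        if (assignedBrokerId < 1) {
            throw new IllegalArgumentException("allocated BrokerIds start at 1");
        }
        return isMaster ? MASTER_ID : assignedBrokerId;
    }
}
```

The allocated ID is never overwritten, so a demoted Master can always fall back to it.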

1.1 ReadFromDLedger

On receiving the request, the Controller reads the state machine's nextBrokerId through DLedger.

2. GetNextBrokerIdResp

The Controller returns nextBrokerId to the Broker.

2.1, 2.2 CreateTempMetaFile

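After receiving NextBrokerId, the Broker creates a temporary file, .broker.meta.temp, that records the NextBrokerId (that is, the brokerId it intends to apply for). The Broker also generates a Code of its own and persists it in the same temporary file.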
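A minimal sketch of this step follows. The on-disk layout (brokerId on the first line, Code on the second) and the use of a random UUID as the Code are assumptions made for illustration; the issue does not specify either.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical sketch: file layout and Code format are assumptions.
final class TempMetaFileSketch {
    static final String TEMP_FILE = ".broker.meta.temp";

    // Persists the expected brokerId and a freshly generated Code,
    // then returns the Code so it can be sent with ApplyBrokerIdReq.
    static String createTempMetaFile(Path storeDir, long nextBrokerId) throws IOException {
        String code = UUID.randomUUID().toString(); // assumed Code format
        String content = nextBrokerId + "\n" + code + "\n";
        Files.write(storeDir.resolve(TEMP_FILE), content.getBytes(StandardCharsets.UTF_8));
        return code;
    }
}
```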

3. ApplyBrokerIdReq

The Broker sends an ApplyBrokerId request to the Controller, carrying its basic data (ClusterName, BrokerName, and BrokerAddress) together with the BrokerId it intends to apply for and the Code.

3.1 CASApplyBrokerId

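The Controller appends this event through DLedger consensus. When the event (log entry) is applied to the state machine, the state machine checks whether the brokerId can be applied (the request fails if the BrokerId has already been allocated). On success, it also records the mapping between the BrokerId and the Code.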

4. ApplyBrokerIdResp

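If the previous step applied the BrokerId successfully, the Controller returns success to the Broker; if it failed, the Controller returns the current nextBrokerId.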
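The check performed when the event reaches the state machine might look like the sketch below. It is an in-memory stand-in, not the DLedger-backed implementation, and all names are hypothetical. Two points are assumptions: a fresh request is accepted only when it asks for exactly nextBrokerId, and a repeated request for an already allocated BrokerId succeeds when its Code matches the recorded one (the retry case described later in this issue).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the Controller state machine.
final class BrokerIdStateMachineSketch {
    private final Map<Long, String> codeByBrokerId = new HashMap<>();
    private long nextBrokerId = 1L; // allocation starts at 1

    long nextBrokerId() {
        return nextBrokerId;
    }

    // Returns true if the brokerId is (or already was) allocated to this Code.
    // On false, the caller replies with nextBrokerId().
    boolean applyBrokerId(long brokerId, String code) {
        String recorded = codeByBrokerId.get(brokerId);
        if (recorded != null) {
            // Already allocated: only the original applicant may succeed.
            return recorded.equals(code);
        }
        if (brokerId != nextBrokerId) {
            // Assumption: only the current nextBrokerId can be claimed.
            return false;
        }
        codeByBrokerId.put(brokerId, code);
        nextBrokerId++;
        return true;
    }
}
```

Recording the Code alongside the BrokerId is what lets the Controller tell a retry by the same broker apart from a competing request for the same ID.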

4.1, 4.2 CreateMetaFileFromTemp

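If the previous step applied the BrokerId successfully, the BrokerId can be considered allocated from the Broker's point of view, and its information must now be persisted for good. The Broker does this by atomically deleting .broker.meta.temp and creating .broker.meta. The delete and the create must together form one atomic operation.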
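One way to get the delete-and-create pair as a single atomic step is to rename the temporary file. The sketch below uses `Files.move` with `StandardCopyOption.ATOMIC_MOVE`; this is a suggestion under the assumption that both files live in the same directory, not necessarily how the final implementation does it. The class name is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: promote the temp file by an atomic rename.
final class PromoteMetaFileSketch {
    static void promote(Path storeDir) throws IOException {
        Path temp = storeDir.resolve(".broker.meta.temp");
        Path meta = storeDir.resolve(".broker.meta");
        // Throws AtomicMoveNotSupportedException if the file system
        // cannot perform the move atomically.
        Files.move(temp, meta, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

After a crash, the broker then sees either .broker.meta.temp or .broker.meta, never a half-written mixture of the two.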

After this flow, a broker starting for the first time and the controller have negotiated a brokerId that both sides agree on, and each has persisted it.

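Startup after a normal restart

After a normal restart, both sides have already agreed on a unique BrokerId and the Broker has it stored locally in .broker.meta, so the registration flow is skipped and the Broker simply continues with the rest of startup.

If the Broker crashes at any point during the normal registration flow, the following procedures ensure that the BrokerId is still allocated correctly.

CreateTempMetaFile failed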

image.png
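If the failure occurs in the part of the flow shown in the figure above, then after the Broker restarts, the Controller's state machine has not allocated any BrokerId and the Broker has not persisted any data. The Broker therefore simply runs the flow above again from the beginning.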

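CreateTempMetaFile succeeded, ApplyBrokerId did not

If the Controller considers this ApplyBrokerId request invalid (it asks for a BrokerId that is already allocated, or the Code does not match), it returns the current NextBrokerId to the Broker. The Broker then deletes the .broker.meta.temp file, goes back to step 2, and repeats that step and the ones after it.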
image.png

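ApplyBrokerId succeeded, CreateMetaFileFromTemp did not

This can happen in two cases: the apply result is lost, or the atomic delete-and-create of .broker.meta fails.
After the restart, the Controller already regards the apply as successful and has already updated the BrokerId allocation in its state machine, so the Broker resumes directly from step 3, that is, it sends the ApplyBrokerId request again.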
image.png
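Because the .broker.meta.temp file still exists, the Broker can read from it the BrokerId and Code that were previously applied on the Controller side and send them to the Controller again. If the Controller already holds that BrokerId and its recorded Code equals the Code in the request, the request is treated as successful.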
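Taken together, the recovery cases above reduce to a decision the Broker can make at startup from which files exist on disk. The sketch below is one possible reading of that logic; the enum and class names are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the startup decision implied by the cases above.
final class StartupRecoverySketch {
    enum NextStep {
        ALREADY_REGISTERED,     // .broker.meta exists: skip registration
        RESEND_APPLY_BROKER_ID, // only .broker.meta.temp exists: resume at step 3
        GET_NEXT_BROKER_ID      // nothing persisted: start from step 1
    }

    static NextStep decide(Path storeDir) {
        if (Files.exists(storeDir.resolve(".broker.meta"))) {
            return NextStep.ALREADY_REGISTERED;
        }
        if (Files.exists(storeDir.resolve(".broker.meta.temp"))) {
            return NextStep.RESEND_APPLY_BROKER_ID;
        }
        return NextStep.GET_NEXT_BROKER_ID;
    }
}
```

If the resent ApplyBrokerId request is rejected, the Broker falls back to the "CreateTempMetaFile succeeded, ApplyBrokerId did not" case: delete the temp file and continue from step 2.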

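Using BrokerId as the unique identifier after registration

Once registration has completed, all subsequent broker requests and state records use the brokerId as the unique identifier. Heartbeats and similar data are also recorded under the brokerId.
The controller also records the current address of each brokerId, which it uses to notify brokers of the master's address, for example during a master/slave switch.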

