
Thread: Resolving a split brain on replication

  1. #1
    Join Date
    Aug 2008
    Posts
    236

    Resolving a split brain on replication

    Hi guys. I have a problem with two nodes where I cannot re-establish replication. I've tried almost everything, including removing the contents of the units on both nodes, and nothing works. Here is what is being logged on the primary:

    block drbd0: sock_recvmsg returned -110
    block drbd0: peer( Secondary -> Unknown ) conn( WFBitMapS -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
    block drbd0: short read expecting header on sock: r=-110
    block drbd0: sock_sendmsg returned -32
    block drbd0: short sent ReportBitMap size=4096 sent=2904
    block drbd0: asender terminated
    block drbd0: Terminating asender thread
    block drbd0: Connection closed
    block drbd0: conn( BrokenPipe -> Unconnected )
    block drbd0: receiver terminated
    block drbd0: Restarting receiver thread
    block drbd0: receiver (re)started
    block drbd0: conn( Unconnected -> WFConnection )
    block drbd0: Handshake successful: Agreed network protocol version 91
    block drbd0: conn( WFConnection -> WFReportParams )
    block drbd0: Starting asender thread (from drbd0_receiver [26312])
    block drbd0: data-integrity-alg: <not-used>
    block drbd0: drbd_sync_handshake:
    block drbd0: self E32B7128246F202F:E5E9F5BDC1092F1F:0000000000000004:0000000000000000 bits:326677276 flags:0
    block drbd0: peer 4D573A2961561D98:3D7D2ECABFAA8F87:0000000000000004:0000000000000000 bits:326677275 flags:0
    block drbd0: uuid_compare()=-100 by rule 100
    block drbd0: Split-Brain detected, dropping connection!
    block drbd0: helper command: /sbin/drbdadm split-brain minor-0
    block drbd0: meta connection shut down by peer.
    block drbd0: conn( WFReportParams -> NetworkFailure )
    block drbd0: asender terminated
    block drbd0: Terminating asender thread
    block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
    block drbd0: conn( NetworkFailure -> Disconnecting )
    block drbd0: error receiving ReportState, l: 4!
    block drbd0: Connection closed
    block drbd0: conn( Disconnecting -> StandAlone )
    block drbd0: receiver terminated

    This doesn't seem resolvable from any of the options in the GUI or on the console. We have removed the contents of the units and restored the factory configuration and setup; the results are the same. Any thoughts?
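
    For reference, -110 is ETIMEDOUT and -32 is EPIPE, so the connection is timing out and the peer is closing the pipe mid-exchange. From the DRBD user's guide, the generic manual split-brain recovery would be the discard-my-data procedure below (r0 is a placeholder for whatever resource name backs drbd0), but I don't see any way to run it from the interfaces at my disposal:

    ===
    # On the node whose changes should be thrown away (the split-brain victim):
    drbdadm secondary r0
    drbdadm disconnect r0                      # only needed if not already StandAlone
    drbdadm -- --discard-my-data connect r0

    # On the node whose data should survive (if it is sitting in StandAlone):
    drbdadm connect r0
    ===

    Once the victim reconnects, DRBD resynchronizes it from the surviving node.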

  2. #2
    Join Date
    Oct 2010
    Location
    GA
    Posts
    935

    Make sure the nodes can ping each other.

    Clear the metadata on both nodes, and verify the source/destination settings.

    Re-establish the connections and then add the task.
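
    If you can get to a console, something like this verifies basic reachability and shows what DRBD itself thinks (the peer address is just an example; use your replication VLAN IP):

    ===
    # Can the nodes reach each other over the replication link?
    ping -c 4 192.168.0.2

    # Connection state, role, and disk state of each DRBD minor
    cat /proc/drbd
    ===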

  3. #3
    Join Date
    Aug 2008
    Posts
    236

    I've already done this. The nodes can see each other fine. I've practically redone the configuration from scratch and the result is always the same.
    We have also tried replicating from the secondary node back to the primary.

  4. #4
    Join Date
    Oct 2010
    Location
    GA
    Posts
    935

    Check that the NIC settings are the same on both nodes... jumbo frames?

    Also make sure the volumes are the same size.

    Have you done anything to tune DRBD? Same settings on both nodes?
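
    If you have console access, this is one way to compare the replication NICs and prove jumbo frames actually pass end to end (eth1 and the peer IP are examples):

    ===
    # MTU and link settings of the replication interface on each node
    ip link show eth1
    ethtool eth1

    # With MTU 9000, an 8972-byte unfragmented ping must succeed:
    # 8972 = 9000 - 20 (IP header) - 8 (ICMP header); -M do forbids fragmentation
    ping -M do -s 8972 -c 4 192.168.0.2
    ===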

  5. #5
    Join Date
    Aug 2008
    Posts
    236

    Keep in mind that, as I've already said, I have completely restored the factory configuration and setup, removed the contents of the units, etc. These units are on the same VLAN, and when you configure replication you can see the volume on the other end, which indicates good connectivity. In addition, the task gets created on the remote node.
    As I said previously, it does not appear that this error is resolvable via any of the interfaces I have at my disposal.

  6. #6
    Join Date
    Oct 2010
    Location
    GA
    Posts
    935

    Send me the logs from each node, via support or in a PM.

  7. #7
    Join Date
    Aug 2008
    Posts
    236

  8. #8
    Join Date
    Oct 2010
    Location
    GA
    Posts
    935

    There are socket errors on the network connection between the nodes:

    ===
    block drbd0: [drbd0_worker/21962] sock_sendmsg time expired, ko = 4294967142
    block drbd0: sock_recvmsg returned -110
    block drbd0: sock_sendmsg returned -32
    block drbd0: peer( Secondary -> Unknown ) conn( WFBitMapS -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
    block drbd0: short read expecting header on sock: r=-110
    block drbd0: short sent ReportBitMap size=4096 sent=1504
    ===

    Reset the DRBD configuration to the defaults and see if it works.

    Default values:
    max-buffers=2048
    max-epoch-size=2048
    unplug-watermark=128
    sndbuf-size=0
    al-extents=127
    no-disk-barrier=off
    no-disk-flushes=off
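
    In 8.3-style drbd.conf terms (your logs show agreed protocol version 91, which falls in the DRBD 8.3 series), those defaults would look roughly like this; the resource name r0 is a placeholder, since the appliance generates the real configuration internally:

    ===
    resource r0 {
      net {
        max-buffers       2048;
        max-epoch-size    2048;
        unplug-watermark   128;
        sndbuf-size          0;   # 0 lets the kernel auto-tune the send buffer
      }
      syncer {
        al-extents         127;
      }
      disk {
        # no-disk-barrier and no-disk-flushes are bare flags when enabled;
        # "off" above just means they are absent here
      }
    }
    ===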

  9. #9
    Join Date
    Aug 2008
    Posts
    236

    Thanks for looking it over.
    We started with the defaults and modified them as a last resort to see if that would fix it.
    In short, we've already tried re-establishing it using the default settings.

  10. #10
    Join Date
    Oct 2010
    Location
    GA
    Posts
    935

    Quote Originally Posted by enealDC
    Thanks for looking it over.
    We started with the defaults and modified them as a last resort to see if that would fix it.
    In short, we've already tried re-establishing it using the default settings.
    OK, I will dig deeper.
