Cluster becomes unresponsive

Description

A premier customer is running a three-node cluster (*.38.7, *.38.4, *.38.5), but at random one of the members drops out of the cluster and leaves the whole cluster non-functional until they kill the problematic instance.

They consistently see this message when the cluster member becomes non-operational:

2021-02-16 15:18:52,668 DEBUG [com.hazelcast.spi.impl.operationservice.impl.InvocationMonitor] [hz._hzInstance_1_MainRepository-24890ef7-652d-494a-8621-66379715efed.InvocationMonitorThread] [10.190.38.7]:5701 [MainRepository-24890ef7-652d-494a-8621-66379715efed] [3.12] Invocations:925 timeouts:0 backup-timeouts:0

2021-02-16 15:18:53,172 WARN [com.hazelcast.spi.impl.operationservice.impl.Invocation] [hz._hzInstance_1_MainRepository-24890ef7-652d-494a-8621-66379715efed.InvocationMonitorThread] [10.190.38.7]:5701 [MainRepository-24890ef7-652d-494a-8621-66379715efed] [3.12] Retrying invocation: Invocation{op=com.hazelcast.map.impl.operation.PutOperation{serviceName='hz:impl:mapService', identityHash=1081336880, partitionId=66, replicaIndex=0, callId=6133677, invocationTime=1613488731768 (2021-02-16 15:18:51.768), waitTimeout=-1, callTimeout=60000, name=cache.ticketsCache}, tryCount=250, tryPauseMillis=500, invokeCount=140, callTimeoutMillis=60000, firstInvocationTimeMs=1613488640323, firstInvocationTime='2021-02-16 15:17:20.323', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[10.190.38.15]:5701, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=3, /10.190.38.7:5701->/10.190.38.15:37695, qualifier=null, endpoint=[10.190.38.15]:5701, alive=false, type=MEMBER]}, Reason: com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target: [10.190.38.15]:5701 - 5b1af170-4e92-4b4b-8419-10ee7b6cc663, partitionId: 66, operation: com.hazelcast.map.impl.operation.PutOperation, service: hz:impl:mapService
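The retried PutOperation above keeps targeting [10.190.38.15]:5701 after Hazelcast has already removed that member (alive=false), which is what produces the TargetNotMemberException. As a minimal diagnostic sketch, assuming direct access to the embedded Hazelcast 3.12 instance (the MemberDropLogger class name is hypothetical), the following listener logs the exact moment a member is removed, which can then be correlated with these retries:

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.MemberAttributeEvent;
import com.hazelcast.core.MembershipEvent;
import com.hazelcast.core.MembershipListener;

public class MemberDropLogger {

    // Registers a listener that records when Hazelcast adds or removes a member;
    // removals mark the point where retried operations start failing with
    // TargetNotMemberException, as in the snippet above.
    public static void register(HazelcastInstance hz) {
        hz.getCluster().addMembershipListener(new MembershipListener() {
            @Override
            public void memberAdded(MembershipEvent event) {
                System.out.println("Member joined: " + event.getMember().getAddress());
            }

            @Override
            public void memberRemoved(MembershipEvent event) {
                System.out.println("Member removed: " + event.getMember().getAddress());
            }

            @Override
            public void memberAttributeChanged(MemberAttributeEvent event) {
                // Not relevant for this diagnosis.
            }
        });
    }
}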

In addition, the customer has members that are not part of the cluster, yet SplitBrainJoinMessage sends to these non-cluster members are consistently observed. Please see the log snippet below:

2021-02-16 15:20:52,735 DEBUG [com.hazelcast.cluster.impl.TcpIpJoiner] [hz._hzInstance_1_MainRepository-24890ef7-652d-494a-8621-66379715efed.cached.thread-35] [10.190.38.5]:5701 [MainRepository-24890ef7-652d-494a-8621-66379715efed] [3.12] [10.190.38.15]:5701 is local? false
2021-02-16 15:20:52,735 DEBUG [com.hazelcast.cluster.impl.TcpIpJoiner] [hz._hzInstance_1_MainRepository-24890ef7-652d-494a-8621-66379715efed.cached.thread-35] [10.190.38.5]:5701 [MainRepository-24890ef7-652d-494a-8621-66379715efed] [3.12] [10.190.38.6]:5701 is local? false
2021-02-16 15:20:52,735 DEBUG [com.hazelcast.cluster.impl.TcpIpJoiner] [hz._hzInstance_1_MainRepository-24890ef7-652d-494a-8621-66379715efed.cached.thread-35] [10.190.38.5]:5701 [MainRepository-24890ef7-652d-494a-8621-66379715efed] [3.12] [10.190.38.7]:5701 is local? false
2021-02-16 15:20:52,735 DEBUG [com.hazelcast.cluster.impl.TcpIpJoiner] [hz._hzInstance_1_MainRepository-24890ef7-652d-494a-8621-66379715efed.cached.thread-35] [10.190.38.5]:5701 [MainRepository-24890ef7-652d-494a-8621-66379715efed] [3.12] Sending SplitBrainJoinMessage to [10.190.38.6]:5701
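The TcpIpJoiner probes the addresses in its configured TCP/IP member list, so addresses that are configured but not currently cluster members ([10.190.38.15]:5701 and [10.190.38.6]:5701 in the snippet above) still receive the "is local?" checks and SplitBrainJoinMessage sends. A minimal sketch of the underlying Hazelcast 3.12 join configuration follows; Alfresco generates this configuration itself, so this is an illustration rather than a recommended change, and the JoinConfigSketch class and exact member addresses are hypothetical:

import com.hazelcast.config.Config;
import com.hazelcast.config.JoinConfig;
import com.hazelcast.config.TcpIpConfig;

public class JoinConfigSketch {

    // Illustration only: the TcpIpJoiner probes every address in this member list,
    // so listed addresses that never join still get split-brain checks.
    public static Config buildConfig() {
        Config config = new Config();
        JoinConfig join = config.getNetworkConfig().getJoin();
        join.getMulticastConfig().setEnabled(false);
        TcpIpConfig tcpIp = join.getTcpIpConfig().setEnabled(true);
        // Hypothetical member list limited to the three intended cluster nodes.
        tcpIp.addMember("10.190.38.7");
        tcpIp.addMember("10.190.38.4");
        tcpIp.addMember("10.190.38.5");
        return config;
    }
}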

All logs from the three servers are available on the collab site, along with the alfresco-global.properties from the active cluster nodes:

https://collab.alfresco.com/share/page/site/premier-worldwide-documentation/documentlibrary#filter=path%7C%2FCustomer%2520Documentation%2FS-Z%2FUnited%2520States%2520Department%2520of%2520the%2520Navy%2FExternal%2520Share%7C&page=1

Further, we have been tuning in response to messages about the transactional caches becoming full:

2021-02-16 14:32:27,635 WARN [org.alfresco.repo.cache.TransactionalCache.org.alfresco.cache.siteNodeRefTransactionalCache] [http-nio-127.0.0.1-8080-exec-3855] Transactional update cache 'org.alfresco.cache.siteNodeRefTransactionalCache' is full (5000).
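A possible follow-up, assuming the standard cache.<name>.tx.maxItems pattern in alfresco-global.properties also applies to this cache (the value below is illustrative, not a confirmed fix for this case), would be to raise the transactional cache limit for the cache named in the warning:

# Hypothetical tuning; the warning above suggests the current limit is 5000
cache.siteNodeRefSharedCache.tx.maxItems=10000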

Configuration settings that have been tuned so far:

1) JVM parameters adjusted: memory was increased to 96G
2) Adjusted the garbage collector settings
3) Other cache settings
cache.contentDataSharedCache.cluster.type=invalidating
cache.contentUrlSharedCache.cluster.type=invalidating
cache.authenticationSharedCache.cluster.type=invalidating
cache.permissionsAccessSharedCache.cluster.type=invalidating
cache.readersSharedCache.cluster.type=invalidating
cache.nodeOwnerSharedCache.cluster.type=invalidating
cache.personSharedCache.cluster.type=invalidating
cache.aclSharedCache.cluster.type=invalidating
cache.aclEntitySharedCache.cluster.type=invalidating

cache.authorizationCache.cluster.type=invalidating
cache.siteNodeRefSharedCache.cluster.type=invalidating
cache.solrFacetNodeRefSharedCache.cluster.type=invalidating

cache.contentDataSharedCache.timeToLiveSeconds=3600
cache.contentUrlSharedCache.timeToLiveSeconds=3600
cache.authenticationSharedCache.timeToLiveSeconds=3600
cache.permissionsAccessSharedCache.timeToLiveSeconds=3600
cache.readersSharedCache.timeToLiveSeconds=3600
cache.nodeOwnerSharedCache.timeToLiveSeconds=3600
cache.personSharedCache.timeToLiveSeconds=3600
cache.aclSharedCache.timeToLiveSeconds=3600
cache.aclEntitySharedCache.timeToLiveSeconds=3600

cache.authorizationCache.maxIdleSeconds=3600
cache.siteNodeRefSharedCache.timeToLiveSeconds=3600
cache.solrFacetNodeRefSharedCache.timeToLiveSeconds=3600

4) Changed db.pool.max to 500
5) Updated alfresco.hazelcast.max.no.heartbeat.seconds to 60 in alfresco-global.properties (see the sketch after this list for how this maps down to Hazelcast)
6) Additional note: the customer generally encounters this error during the weekdays. We have requested thread dumps from all three cluster nodes when the issue occurs, so this information will probably be available by Monday.
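For context on item 5, here is a minimal sketch of the Hazelcast 3.x property that controls how long a silent member is tolerated before it is removed. The assumption that alfresco.hazelcast.max.no.heartbeat.seconds feeds this value, and the HeartbeatTimeoutSketch class itself, are illustrative rather than confirmed:

import com.hazelcast.config.Config;

public class HeartbeatTimeoutSketch {

    // Hazelcast 3.x removes a member that has not sent a heartbeat within
    // hazelcast.max.no.heartbeat.seconds. With a value of 60, a node that stays
    // silent for a minute (for example during a long GC pause) would be dropped.
    public static Config withHeartbeatTimeout(int seconds) {
        Config config = new Config();
        config.setProperty("hazelcast.max.no.heartbeat.seconds", String.valueOf(seconds));
        return config;
    }
}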

Environment

Affected Version: 6.1.1.7; RM Version 3.2.0.10 (the hotfixed version)

Testcase ID

None

Activity

Mohammad Janjua
March 18, 2021, 6:39 PM

If you can provide me with the hotfix, I can have the customer validate it in their environment.

Mohammad Janjua
March 16, 2021, 6:27 PM

I have updated the impacted version list. Thanks.

Mohammad Janjua
March 16, 2021, 5:35 PM

ESC JIRA has been created

Mohammad Janjua
March 16, 2021, 5:25 PM

Thanks. Will submit it shortly.

Mohammad Janjua
March 16, 2021, 3:16 PM

Can we complete the proposed fix as a hotfix? If yes, then what will be the ETA on this?

Fixed

Assignee

Eva Vasquez

Reporter

Mohammad Janjua

Escalated By

CSM

Hot Fix Version

ACT Numbers

00334929

Premier Customer

Yes

Delivery Team

Customer Excellence

Bug Priority

Category 1