BigMemory WAN Replication provides logging with messages that are easily parsed by third-party log watchers or scrapers.
The most significant WAN replication log messages concern synchronization and incremental updates. The messages below provide important markers in the WAN replication logs.
Log message issued when the Master cache is sending a synchronization batch to the Replica cache:
Log message reporting the total number of SYNC_UPDATEs that should be processed by the Replica:
Log message issued once the synchronization between Master and Replica has completed:
Log message issued when the Replica cache has started synchronizing with the Master cache:
Log message issued when the Replica cache has processed a batch of synchronization updates from the Master cache:
Log message issued when the Replica cache has completed synchronization with Master cache, but the Replica is not yet activated:
Log message issued when the Replica cache is successfully activated after synchronization:
Log messages for incremental updates are for Bidirectional mode only. Note that watermarks are for batches of updates, which have been batched as part of the internal WAN process.
Log messages issued when the Master is receiving acknowledgements of a Replica successfully storing a batch of updates in the TSA:
INFO [master-0] UnitReplicator - Replica cache 'tc_clustered-ehcache|__DEFAULT|wan-test-cache1@localhost:9002' was at watermark '1131', is now at '1199'
INFO [master-0] MasterCache - Lowest watermark across all replicas for cache 'tc_clustered-ehcache|__DEFAULT|wan-test-cache1' is now '1199'
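To illustrate the bookkeeping these two messages describe, the Master records a per-replica watermark and derives the lowest watermark across all replicas, which is the point up to which every replica has confirmed storage. The sketch below is purely illustrative; the class and method names are hypothetical, not part of BigMemory's API.

```python
# Illustrative sketch of the watermark bookkeeping described above.
# Names are hypothetical, not part of BigMemory's API.

class WatermarkTracker:
    def __init__(self):
        self.replica_watermarks = {}  # replica id -> last acknowledged watermark

    def acknowledge(self, replica, watermark):
        """Record that a replica has stored all updates up to `watermark`."""
        self.replica_watermarks[replica] = watermark

    def lowest_watermark(self):
        """The point up to which *every* known replica has confirmed storage."""
        return min(self.replica_watermarks.values())

tracker = WatermarkTracker()
tracker.acknowledge("wan-test-cache1@localhost:9002", 1131)
tracker.acknowledge("wan-test-cache1@localhost:9003", 1199)
print(tracker.lowest_watermark())  # -> 1131
```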
Log messages issued when a Replica has successfully acknowledged the storage of the updates in the TSA to the Master cache:
INFO [New I/O worker #4] ReplicaCache - Replica 'tc_clustered-ehcache|__DEFAULT|wan-test-cache1@localhost:9002' successfully acknowledged watermark 1131 with master 'localhost:9001'
INFO [New I/O worker #4] ReplicaCache - Replica 'tc_clustered-ehcache|__DEFAULT|wan-test-cache1@localhost:9002' successfully acknowledged watermark 1199 with master 'localhost:9001'
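Because these acknowledgement messages follow a fixed shape, they are straightforward to extract with a regular expression. A minimal sketch follows; the pattern is written against the sample lines above only, not against any guaranteed log format.

```python
import re

# Matches the replica acknowledgement lines shown above. The pattern is
# derived from the sample output only; treat it as a starting point.
ACK_PATTERN = re.compile(
    r"ReplicaCache - Replica '(?P<replica>[^']+)' "
    r"successfully acknowledged watermark (?P<watermark>\d+) "
    r"with master '(?P<master>[^']+)'"
)

def parse_ack(line):
    """Return (replica, watermark, master) for an ack line, or None."""
    m = ACK_PATTERN.search(line)
    if m is None:
        return None
    return m.group("replica"), int(m.group("watermark")), m.group("master")

line = ("INFO [New I/O worker #4] ReplicaCache - Replica "
        "'tc_clustered-ehcache|__DEFAULT|wan-test-cache1@localhost:9002' "
        "successfully acknowledged watermark 1131 with master 'localhost:9001'")
print(parse_ack(line))
```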
If a cache that was registered with a TSA subsequently becomes inactive, then when the cache becomes active again, the TSA will attempt to verify that it has the same configuration. Therefore, in order to change the WAN enabled/disabled status of caches that have already been registered in a TSA, the existing WAN information must first be removed. The cleanup-wan-metadata utility, found in the /server/bin directory of the BigMemory kit, is provided for this purpose.
To run the script:
cleanup-wan-metadata.sh -f "configLocation"
The script requires the configLocation argument, which is the URI location of either a wan-config.xml file or an ehcache.xml file. If configLocation specifies an ehcache.xml, the script performs cleanup for all the caches mentioned in it. If configLocation specifies a wan-config.xml, the script will look for all the ehcache.xml locations listed in it and clean up their caches.
The cleanup is performed for every cache, whether WAN-enabled or not. This is useful when you want to convert a WAN-enabled cache to non-WAN cache, or a non-WAN cache to a WAN-enabled cache.
Note that the script does not modify wan-config.xml files in any way. It simply removes WAN-related information from Terracotta servers.
In a WAN-enabled system, if you need to either convert a WAN-enabled cache to a non-WAN cache, or convert a non-WAN cache to a WAN-enabled cache, follow these steps:
1. Update the ehcache.xml files as per the new configuration requirement.
2. Run the cleanup-wan-metadata script with the new configuration as the parameter.
The Orchestrator activates a Replica cache when it has successfully synchronized the state of the cache with the Master cache. A Master cache can be immediately activated by an Orchestrator, as there is no synchronization to perform. Orchestrators managing Master caches perform continuous health-checking to verify the active status of the Replica caches managed by the other Orchestrators.
There are two main cases when the state of a Replica cache needs to be resynchronized by the Master cache: bootstrapping a new cache and recovering after a failure.
Perform these steps when starting a cache for the first time, or after the cache was fully cleared.
On startup, the new Replica cache will be inactive while synchronizing (clients cannot use the cache). In this mode, it is receiving incremental updates and synchronizing the full state of the cache. Once fully synchronized, the Replica cache will be active.
BigMemory WAN Replication is built with fault tolerance features to automatically handle short-term failures and most longer-term failures.
When a replica reconnects with the master, its behavior is governed by the Orchestrator configuration parameter replicaDisconnectBehaviorType. If this parameter is set to:
- remainDisconnected, the replica will remain disconnected from the master, and will continue to operate offline.
- reconnectResync, the replica will reconnect to the master and will resync the contents of its cache. In this case, all local changes on the replica region will be dropped in favor of whatever is in the master region.
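As a rough illustration only, the parameter would be set in the Orchestrator's section of wan-config.xml along the following lines. The element names and nesting here are assumptions made for readability, not the published wan-config.xml schema; consult the Orchestrator Configuration Parameters reference for the authoritative syntax.

```xml
<!-- HYPOTHETICAL sketch: element names and nesting are assumptions,
     not the authoritative wan-config.xml syntax. -->
<orchestrator>
  <!-- allowed values: remainDisconnected | reconnectResync -->
  <replicaDisconnectBehavior>reconnectResync</replicaDisconnectBehavior>
</orchestrator>
```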
Recovery from most failures is accomplished automatically by the WAN replication service; however, some scenarios require user intervention.
It is recommended to run multiple Orchestrators in a region, either on the same machine or rack, or in different racks to ensure availability. Although only one master Orchestrator is mandatory, you can start one or more standby Orchestrators at any time during application run time to provide failover protection.
Standby Orchestrators are passive, so running extra Orchestrator processes will have minimal runtime impact under normal operations. Upon failure, the other regions will look for the next available Orchestrator and resume replication (there will be a full sync for replica caches after master failover).
When a Master cache fails, control is given to the failover Master (if any) listed in
wan-config.xml. A Replica cache will not take over as a Master. If there is no failover Master, the Replicas will continue to operate in isolation (i.e., no replication will take place). When the Master re-starts, the behavior of its Replicas is governed by the
replicaDisconnectBehavior property in
wan-config.xml. By default, the Replicas will attempt to reconnect to the Master and to the failover Master listed in
wan-config.xml. Upon reconnection to a Master, the Replicas are deactivated, cleared, resynchronized to the Master cache, and then reactivated. All local changes on the Replica region will be dropped in favor of whatever is in the Master region. (Similarly, even if the Master does not fail but its Replicas become disconnected from their Master, their behavior is also controlled by the
replicaDisconnectBehavior property.) For more information, refer to Orchestrator Configuration Parameters.
If no failover Master is listed in
wan-config.xml and the master is lost, the operator has the option to restart a Replica to act as a Master, as described below.
In a bi-directional configuration without a failover Master, you should choose your most important Replica region to take over as the Master because any changes that occurred since the Master failed will be lost in all other regions. To do this, change your
wan-config.xml to reflect that the Replica is now the Master, and then restart the Replica. It would be a good idea to remove or comment out the old Master in case it comes back.
In a uni-directional configuration without a failover Master, since the "writes" are only performed in one region, you will not lose any changes. You may simply designate that Replica region as the new Master.
Upon any disconnection from its TSA, the Orchestrator deactivates and waits the amount of time specified by the
l2.l1reconnect.timeout.millis property. This property is described in the Automatic Client Reconnect section.
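For example, this property can be tuned through the tc-properties section of the Terracotta configuration. The snippet below is a sketch; the 30000 value is purely illustrative, not a recommendation.

```xml
<!-- Illustrative only: 30000 ms is an example value, not a recommendation. -->
<tc-properties>
  <property name="l2.l1reconnect.enabled" value="true"/>
  <property name="l2.l1reconnect.timeout.millis" value="30000"/>
</tc-properties>
```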
When communication between the TSA and the Orchestrator is resumed:
If the master region fails entirely, there will most likely be data writes that were not completed (replicated to the Replica region), though the application may believe they were. As described in this section, the recommended action is to promote the Replica region's caches to Master.
If a TSA failure takes place in a region with Master caches, upon recovery the Master caches cannot automatically force a resynchronization of all live Replicas across the WAN, because that could result in data loss. In this case, when an entire region is down (for example, your data center is offline), you must use a manual recovery process. You will need to designate new Master caches, update the configuration, and restart the Orchestrators.
In preparation for disaster recovery, two Orchestrator configurations should be ready:
|Orchestrator Configuration|Region A|Region B|
|Configuration 1|Master|Replica|
|Configuration 2|Replica|Master|
If Region A goes down, then Configuration 2 can be applied in Region B. Ideally, Region B would already be serving as a read-only or backup region. When Region B becomes the Master, it can then be used to resynchronize Region A.