Monitor Oracle Standby Databases In OCI With Custom Metrics

Monitoring an Oracle standby database (Data Guard) has always been a tricky task. By its very nature (the instance is not open in read-write mode), it is hard to gather information about it. Even specialized tools like Oracle Enterprise Manager require SYSDBA credentials in order to monitor standby databases effectively. But what about when running them on OCI?

The Oracle OCI Monitoring service allows us to monitor cloud resources using metrics and alarms.

Oracle OCI Monitoring Service Architecture

In this post I want to show you how you can create a custom metric to monitor the “Apply Lag” on your Oracle Standby database, so you can create an alarm if it crosses a threshold.

I’m going to follow most of the steps detailed by Liu-Wei in this post: “https://qiita.com/liu-wei/items/5e8e04f1e58cc6406ca9”.

Step 1 – Prerequisites

Add an API Key

First of all, you will need to designate an OCI user with the proper permissions to access the Monitoring service metrics and post custom metrics. This could be your own account or a service account. Once you have designated this user, log in to the OCI console and choose the region where the standby DB resides. Then click on the Profile icon, followed by the account name.

Once there, scroll down and click on API Keys from the left menu.

Then click on the Add API Key button.

Then generate the API keys, save and store them in a secure place, and click Add.

This will allow the script to authenticate to the Monitoring service in order to post custom metric data.

Create an OCI configuration file

For this exercise we will use the API keys we just generated and create a config file on the host where the standby DB is running, using the oracle account.

I used the location /home/oracle/.oci in order to store the OCI config file and the private key. You may use another location depending on your internal standards.

Using the Configuration File Preview, copy its contents and save them to the configuration file we are creating on the DB host.

This preview already has the correct settings for the user, fingerprint, tenancy and region. However, you should amend the key_file setting, which is the path where your private key file is stored.

For this exercise it will be:

key_file=/home/oracle/.oci/mykey_private.pem

At the end of this, you should have two files on the DB host: the config file and the private key file.

[oracle]$ ls
config mykey_private.pem
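For reference, the config file follows the standard OCI SDK format. The values below are placeholders; yours come from the Configuration File Preview:

[DEFAULT]
user=ocid1.user.oc1..<your_user_ocid>
fingerprint=<your_api_key_fingerprint>
tenancy=ocid1.tenancy.oc1..<your_tenancy_ocid>
region=us-ashburn-1
key_file=/home/oracle/.oci/mykey_private.pem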

Setup Python on DB Host

For this exercise we are going to use Python in order to consume the required REST APIs to post the metric data to the Monitoring Service.

Verify the Python installation on the DB host using the oracle account. Python 3 was already installed on this host.

[oracle]$ which python3
/bin/python3

However, we need to install the oci Python module. Before installing it, we need to upgrade pip.

For this, log out from the oracle account, switch to the opc account, and execute the command below:

[opc]$ sudo pip3 install --upgrade pip

Login again with the oracle account and execute:

[oracle]$ pip3 install -U oci

This should install the oci module correctly.
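You can sanity-check the installation by importing the module and printing its version:

[oracle]$ python3 -c "import oci; print(oci.__version__)"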

Step 2 – Create the Python script

In this step we are going to create the Python script that connects to the Standby DB, gathers the Apply Lag and posts the data to the Monitoring service.

Copy the code below and paste it into a file named post_lag_value.py.

#!/usr/bin/python3

# This is a sample Python script that posts a custom metric (lag_value) to OCI Monitoring.
# Run this script on the host that you want to monitor.
# Command: python3 post_lag_value.py

import oci,subprocess,os,datetime
from pytz import timezone

# using default configuration file (~/.oci/config)
from oci.config import from_file
config = from_file()

# initialize service client with default config file
monitoring_client = oci.monitoring.MonitoringClient(config,service_endpoint="https://telemetry-ingestion.us-ashburn-1.oraclecloud.com")

os.environ['ORACLE_HOME'] = "<YOUR ORACLE HOME>"
os.environ['ORACLE_SID'] = "<YOUR SID>"

def run_sqlplus(sqlplus_script):

    """
    Run a sql command or group of commands against
    a database using sqlplus.
    """

    # sqlplus lives under $ORACLE_HOME/bin; reuse the ORACLE_HOME set above
    p = subprocess.Popen([os.environ['ORACLE_HOME'] + '/bin/sqlplus','-s','/nolog'],
        stdin=subprocess.PIPE,stdout=subprocess.PIPE,stderr=subprocess.PIPE)
    (stdout,stderr) = p.communicate(sqlplus_script.encode('utf-8'))
    stdout_lines = stdout.decode('utf-8').split("\n")

    return stdout_lines

sqlplus_script="""
connect / as sysdba
set heading off
SELECT extract(day from p.val) *1440 + extract(hour from p.val)*60 +
extract(minute from p.val) + extract(second from p.val)/60 lag_minutes
from (SELECT name,to_dsinterval(value) val from v$dataguard_stats where name ='apply lag') p;
exit
"""

sqlplus_output = run_sqlplus(sqlplus_script)

# keep the last non-empty line of the output (the lag value in minutes)
for line in sqlplus_output:
    if line.strip():
        lag_value = float(line)

print(lag_value)

times_stamp = datetime.datetime.now(timezone('UTC'))

# post custom metric to oci monitoring
# replace "compartment_ocid string with your compartmet ocid
post_metric_data_response = monitoring_client.post_metric_data(
    post_metric_data_details=oci.monitoring.models.PostMetricDataDetails(
        metric_data=[
            oci.monitoring.models.MetricDataDetails(
                namespace="<YOUR CUSTOM NAMESPACE>",
                compartment_id="<YOUR COMPARTMENT ID>",
                name="<YOUR METRIC NAME>",
                dimensions={'server_id': '<YOUR SERVER ID>'},
                datapoints=[
                    oci.monitoring.models.Datapoint(
                        timestamp=datetime.datetime.strftime(
                            times_stamp,"%Y-%m-%dT%H:%M:%S.%fZ"),
                        value=lag_value)]
                )]
    )
)

# Get the data from response
print(post_metric_data_response.data)
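As a quick sanity check on the SQL above: v$dataguard_stats reports the apply lag as a day-to-second interval, so a value of +00 01:30:00 (1 hour and 30 minutes) becomes 0*1440 + 1*60 + 30 + 0/60 = 90 lag_minutes.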

Amend the placeholders below to match your DB and OCI configuration:

  • <YOUR ORACLE HOME>
  • <YOUR SID>
  • <YOUR CUSTOM NAMESPACE>
  • <YOUR COMPARTMENT ID>
  • <YOUR METRIC NAME>
  • <YOUR SERVER ID>

One important thing to mention is the ingestion service endpoint. I’m using Ashburn as my region, therefore my ingestion endpoint is “https://telemetry-ingestion.us-ashburn-1.oraclecloud.com”. Yours will differ depending on your region.

The full list of regional endpoints is available in the Monitoring API documentation: https://docs.oracle.com/en-us/iaas/api/#/en/monitoring/20180401/
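If you prefer not to hardcode the endpoint, a small tweak to the script derives it from the region already present in your config file. This is a sketch that assumes the telemetry-ingestion.<region>.oraclecloud.com pattern used by the commercial OCI regions:

# build the ingestion endpoint from the region in ~/.oci/config
endpoint = "https://telemetry-ingestion.{}.oraclecloud.com".format(config["region"])
monitoring_client = oci.monitoring.MonitoringClient(config, service_endpoint=endpoint)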

Next, let’s make the post_lag_value.py file executable.

[oracle]$ chmod +x post_lag_value.py

Let’s try our Python script.

./post_lag_value.py 
/home/oracle/.local/lib/python3.6/site-packages/oci/_vendor/httpsig_cffi/sign.py:10: CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team. Therefore, support for it is deprecated in cryptography. The next release of cryptography (40.0) will be the last to support Python 3.6.
  from cryptography.hazmat.backends import default_backend  # noqa: F401
0.0
{
  "failed_metrics": [],
  "failed_metrics_count": 0
}

As you can see from the output, the current lag is “0.0” minutes and failed_metrics_count is also “0”. This means we successfully posted the data to the Monitoring service.

Let’s now find out if our custom metric is visible from the OCI console.

Using the hamburger menu, navigate to “Observability & Management” and, under the Monitoring service, click on Metrics Explorer.

Inside Metrics Explorer, choose the correct compartment, namespace and metric (the ones you provided in the Python script) and verify you can see data in the graph.

The script is now posting Apply Lag data to the monitoring service.
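You can also pull the datapoints with the SDK instead of the console. Below is a minimal sketch, reusing the config file from earlier and the placeholder names from the script; note that metric queries go to the default Monitoring endpoint (not the ingestion one) and the query string uses MQL syntax:

#!/usr/bin/python3
# Query the last hour of our custom metric from the Monitoring service.
import oci, datetime
from pytz import timezone
from oci.config import from_file

config = from_file()
# no service_endpoint override: queries use the default endpoint
monitoring_client = oci.monitoring.MonitoringClient(config)

end = datetime.datetime.now(timezone('UTC'))
start = end - datetime.timedelta(hours=1)

response = monitoring_client.summarize_metrics_data(
    compartment_id="<YOUR COMPARTMENT ID>",
    summarize_metrics_data_details=oci.monitoring.models.SummarizeMetricsDataDetails(
        namespace="<YOUR CUSTOM NAMESPACE>",
        query="<YOUR METRIC NAME>[5m].max()",
        start_time=start,
        end_time=end))

for metric in response.data:
    print(metric.name, metric.aggregated_datapoints)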

Step 3 – Schedule

Now we need to schedule our Python script to run every “x” minutes. For this I’m using a cron job. Follow the instructions in MOS note “How To Use Crontab In OCI DBCS? (Doc ID 2639985.1)” to enable cron.

My Cron looks as follows:

[opc]$ sudo cat /etc/crontab
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root

# For details see man 4 crontabs

# Example of job definition:
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  * user-name  command to be executed


# System should configure AIDE for Periodic Execution 
05 4 * * * root /usr/sbin/aide --check

*/5 * * * * oracle /home/oracle/post_lag_value.py >> post_lag.log 2>&1

I scheduled this cron job to run every 5 minutes. You may adjust it to your desired frequency.

Step 4 – Create an Alarm

Go to the Monitoring service and create an Alarm using the Alarm Definitions option.

After this, we will receive a notification whenever the apply lag on our standby DB exceeds 60 minutes.
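If you prefer to create the alarm programmatically, here is a minimal sketch with the same SDK. The display name, compartment, namespace, metric name and Notifications topic OCID are placeholders, and the 60-minute threshold matches the example above:

#!/usr/bin/python3
# Create an alarm that fires when the apply lag exceeds 60 minutes.
import oci
from oci.config import from_file

config = from_file()
# alarms are managed through the default Monitoring endpoint
monitoring_client = oci.monitoring.MonitoringClient(config)

alarm = monitoring_client.create_alarm(
    create_alarm_details=oci.monitoring.models.CreateAlarmDetails(
        display_name="standby-apply-lag-over-60m",
        compartment_id="<YOUR COMPARTMENT ID>",
        metric_compartment_id="<YOUR COMPARTMENT ID>",
        namespace="<YOUR CUSTOM NAMESPACE>",
        # MQL: fire when the 5-minute maximum of the metric crosses 60
        query="<YOUR METRIC NAME>[5m].max() > 60",
        severity="CRITICAL",
        # OCID of a Notifications topic that alerts the DBA team
        destinations=["<YOUR NOTIFICATION TOPIC OCID>"],
        is_enabled=True))

print(alarm.data.id)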

This concludes this small exercise of monitoring the Apply Lag for a Standby Oracle Database using the OCI Monitoring service.

Hope this helps,
Alfredo

ORA-01555 query duration 0 seconds with Dataguard

How many times have we had calls from users complaining that their process failed due to an ORA-01555 error?

We know that if queries are not well tuned and modify a lot of data, the read-consistent image held in the UNDO tablespace may no longer be available. But have you ever seen this error immediately after executing a SQL statement against a table?

I just did, a couple of days ago. Here’s the story:

The ORA-01555 error appeared in the database’s alert log with a query duration of 0 seconds:

ORA-01555 caused by SQL statement below (SQL ID: d3rt4tyudufeu, Query Duration=0 sec, SCN: 0x034f.34f660b4)

Any query, plus an analyze table, failed right away with ORA-01555:

ERROR at line 1: ORA-01555: snapshot too old: rollback segment number 10 with name "SYSSMU11_1072300523734$" too small

So weird.
After researching a bit on MOS, I found a note regarding a bug.

Some minutes later we also started to receive ORA-600 errors related to SCN numbers:

ORA-error stack (00600[ktbdchk1: bad dscn])

The MOS note mentions the ORA-01555 and ORA-600 errors as part of bug 22241601 in a Data Guard configuration. It is worth mentioning that yes, we had recently been doing switchover testing in this 12.1.0.2 environment.
ALERT Bug 22241601 ORA-600 [kdsgrp1] / ORA-1555 / ORA-600 [ktbdchk1: bad dscn] / ORA-600 [2663] due to Invalid Commit SCN in INDEX (Doc ID 1608167.1)
The solution is to apply the patch, but there is also a tested workaround: rebuild all the indexes of the affected table online (ALTER INDEX <index_name> REBUILD ONLINE;).
Hope this helps.

Alfredo

WAIT_FOR_GAP. How to restore missing archivelogs from backup?

In a Data Guard configuration, Oracle’s RFS (Remote File Server) process writes redo data to the standby. When for any reason it can’t write this data, the MRP (Managed Recovery Process) will wait for the archivelog to be applied, showing the status “WAIT_FOR_LOG”. This leads the standby to be out of sync with the primary database.

Sometimes some archivelogs can’t be transferred from the primary database to the standby, leaving a gap in the archivelog sequence. In that case the MRP process will show the status “WAIT_FOR_GAP”.

In order to fix the archivelog gap, we have to manually transfer the missing archivelogs.

To find the gap, you can query v$archive_gap (gv$archive_gap for RAC):
SELECT INST_ID, THREAD#, HIGH_SEQUENCE#, LOW_SEQUENCE# FROM GV$ARCHIVE_GAP;
INST_ID    THREAD#    HIGH_SEQUENCE#    LOW_SEQUENCE#
-------    -------    --------------    -------------
      2          2               823              811
You can see that we are missing archivelogs from sequence 811 to 823 for thread 2. If these archivelogs are no longer available on the primary, we have to restore them from backup:
RMAN> RESTORE ARCHIVELOG FROM SEQUENCE 811 UNTIL SEQUENCE 823 THREAD=2;
Keep in mind that the THREAD parameter defaults to 1, so you must specify the thread number when restoring archivelogs from a different thread.
After restoring these archivelogs, they should be shipped automatically to the standby. Verify that the gap is fixed by querying gv$archive_gap again.
Thanks,

Alfredo

MRP0: Background Media Recovery terminated with error 1237

Some days back, while checking an Oracle physical standby database, I found that the DB was some hours behind the primary database.

alter session set nls_date_format='DD-MM-yyyy HH24:MI:SS';
show parameter dest
select thread#, max(sequence#) from gv$log_history group by thread#;

select (a.amct-b.bmct)*24 "Hours Standby is Behind: "
from (select max(completion_time) amct from v$archived_log) a,
(select max(completion_time) bmct from v$archived_log where applied='YES') b;

Hours Standby is Behind:
-------------------------
45.000054
The very next thing to check is what is going on with the MRP process.
select inst_id, process, status, sequence#, thread# from gv$managed_standby where process='MRP0';

no rows selected
So, the MRP process wasn’t running in the standby database. Let’s check the alert.log file.
MRP0: Background Media Recovery terminated with error 1237
ORA-01237: cannot extend datafile 13
The mount point where datafile 13 resides was 100% full; that’s why MRP couldn’t extend the datafile and was terminated by the instance.
In order to fix this, you should increase the size of the mount point or, if you have another mount point with enough free space, follow the steps below:

  • Shut down the standby database

SQL> shutdown immediate
ORA-01109: database not open
Database dismounted.
ORACLE instance shut down.

  • Copy the datafile to the new location

#> cp -p users03.dbf /u02/oradata/test/users03.dbf

  • Mount the standby database

SQL> startup nomount
ORACLE instance started.
SQL> alter database mount standby database;
Database altered.

  • Set the “standby_file_management” parameter to MANUAL

As per the Oracle documentation:

STANDBY_FILE_MANAGEMENT enables or disables automatic standby file management. When automatic standby file management is enabled, operating system file additions and deletions on the primary database are replicated on the standby database.

SQL> alter system set standby_file_management='MANUAL' scope=both;
System altered.

  • Rename the datafile to reflect the change in the standby control file

SQL> alter database rename file '/u01/oradata/test/users03.dbf' to '/u02/oradata/test/users03.dbf';
Database altered.

  • Reset “standby_file_management” to AUTO

SQL> alter system set standby_file_management='AUTO' scope=both;
System altered.

  • Start the MRP process again

SQL> alter database recover managed standby database disconnect from session;
Database altered.
After this, MRP was able to successfully apply archivelogs from the primary database.
We have to make sure that every time we increase the size of a datafile in the primary database, there is enough free space on the standby server to accommodate the new size.
Thanks,

Alfredo