Troubleshooting Basics for Advanced Replication
| 主题: | Troubleshooting Basics for Advanced Replication | |||
| 文档 ID: | 122039.1 | 类型: | TROUBLESHOOTING | |
| Modified Date: | 22-SEP-2008 | 状态: | PUBLISHED | |
PURPOSE
-------
The purpose of this article is to provide basic steps for troubleshooting
Advanced Replication setup, configuration and operation. Replication DBA's
can use this article as a starting point for diagnosing and resolving the most
common problems encountered in replicated environments. Additional notes are
referenced through out this article that address specific issues or provide
additional information on a particular component used by Advanced Replication.
SCOPE & APPLICATION
-------------------
To be used by Oracle support analysts and replication DBA's to understand and
employ basic troubleshooting techniques for Advanced Replication configuration
and components.
Troubleshooting Basics for Advanced Replication
===============================================
Article Contents
----------------
1. TROUBLESHOOTING SETUP/CONFIGURATION
1.1 Packages
1.2 Users and privileges
1.3 Database links
2. TROUBLESHOOTING OPERATION
2.1 Deferred Transactions
2.1.1 Deferred transactions Not being Created for Asynchronous Propagation
2.1.2 Deferred transactions created but not propagated
2.1.3 Deferred transactions propagated but not applied to receiving site
2.1.4 Deferred transactions propagated and no longer in the Deftran
view but def$_aqcall table at local site still has entries.
2.2 Administrative Requests
2.2.1 Admin requests generated but not propagated
2.2.2 Admin requests propagated but not applied at receiving site
2.3 Master Repgroup State
2.3.1 Stuck in 'Quiescing' mode when suspend_master_activity command issued
2.3.2 Group still quiesced after resume_master_activity command issued
2.4 Automated job not running
2.5 Snapshot refreshes
2.5.1 Snapshot refresh looses snapshot changes
2.5.2 Snapshot can not fast refresh
3. ISSUES WITH REPLICATION MANAGER SETUP WIZARD
-------------------------------------------------------------------------------
Sections:
1. TROUBLESHOOTING SETUP/CONFIGURATION
======================================
A large majority of the problems encountered by replication users are due to
incorrect setup and configuration of the replicated environment. The following
components must be correctly in place before replication can work properly.
1.1 Packages
------------
CATREP.SQL is the script that creates the necessary data dictionary entries
and API packages that enable the Advanced Replication feature. From Oracle9
onwards this script is automatically run from CATPROC.SQL.
The packages created by CATREP.SQL often have dependencies on objects created
by CATALOG.SQL and CATPROC.SQL, so it important that these have been run with
out error before CATREP.SQL is run.
Some things to check to make sure CATREP.SQL runs successfully:
1) Ensure CATALOG.SQL and CATPROC.SQL have been successfully run on the
database as SYS (never SYSTEM) from svrmgrl or from Oracle9 onwards
sqlplus.
2) Make sure all objects in the SYS and SYSTEM schema are valid - see
Note 73995.1 for more information on troubleshooting invalid objects
in the database.
3) CATREP.SQL tends to require a significant amount of rollback space so make
sure a large rollback segment is available and set transaction to that
rollback segment.
4) Prior to Oracle9 there were problems running CATREP.SQL from sqlplus (see
Bug 895928 and Note 116118.1 - there are still issues with this in
8.1.6). So make sure CATREP.SQL is run as SYS (never SYSTEM) from svrmgrl
(up to release 8.1.7, then in sqlplus from Oracle9 onwards). Make sure the
output is spooled to a file, so the log file can be used for troubleshooting
if needed.
If you have migrated or upgraded the database from a previous version, refer
to Note 91900.1 and Note 118416.1 for troubleshooting invalid packages
after migration.
1.2 Users and privileges
------------------------
Special, privileged users are needed with replication to ensure cohesive and
controlled administration of replicated objects, data and the environment.
Master site
Replication Administrator - usually REPADMIN (administers the replicated
objects and environment). Required privileges granted with:
dbms_repcat_admin.grant_admin_any_schema() - Global Repadmin
dbms_repcat_admin.grant_admin_schema() - Schema only Repadmin
Propagator - usually REPADMIN (pushes DML changes to other sites)
Required privileges granted with:
dbms_defer_sys.register_propagator() - Global
Note: You can only have one propagator registered for a database.
Also grant: Comment any table, Lock any table
Receiver - defaults to propagator, usually REPADMIN (receives DML
changes from other sites). Required privileges granted with (if not
propagator):
dbms_repcat_admin.register_user_repgroup() - Repgroup level Receiver
Note: use privilege_type => 'RECEIVER'
Updateable Snapshot Site
Snapshot Administrator - usually SNAPADMIN or REPADMIN (administers the
replicated objects and environment). Required privileges granted with:
dbms_repcat_admin.grant_admin_any_schema() - Global Admin
dbms_repcat_admin.grant_admin_schema() - Schema only Admin
Propagator - usually SNAPADMIN or REPADMIN (pushes DML changes to master
sites). Required privileges granted with:
dbms_defer_sys.register_propagator() - Global
Note: You can only have one propagator registered for a database.
Also grant: Comment any table, Lock any table
Refresher - defaults to propagator, usually SNAPADMIN or REPADMIN (pulls
data from master sites on refresh). Required privileges if not
propagator:
Create Session
Alter Any Snapshot
Comment any table
Lock any table
See setup note for user creation and required privileges in:
Oracle 8.1 onwards - Note 117434.1
1.3 Database links
------------------
Replication relies solely on database links to transfer data from one site to
another. If database links are not created correctly or are malfunctioning,
propagation will not take place.
Common things to look for when troubleshooting database link connection:
1) Make sure that the remote database to which the link attaches is started
and available through the network.
2) Make sure that each replication user has a private database link with the
same name as a public database link pointing to the remote database.
See the following setup note for the required database link configuration:
Oracle 8.1 onwards - Note 117434.1
3) Make sure the database links are created and work properly.
See Note 121716.1 for troubleshooting database links in a
replicated environment.
2. TROUBLESHOOTING OPERATION
============================
There are many mechanisms and components used throughout the propagation
operation of replication. It is essential to understand the flow of
propagation. Knowing 'how' a component does 'what', and 'when' it should do
it, will make troubleshooting the operations of a replicated environment more
manageable.
2.1 Deferred Transactions
-------------------------
2.1.1 Deferred transactions not being Created for Asynchronous Propagation
--------------------------------------------------------------------------
When changes are made to data in a replicated table, Oracle uses an internal
trigger ($RT) to create the deftran entry. The transaction must be successfully
committed at the local site before an entry is placed in deftran for other
master sites.
Check the following:
1) Make sure a 'commit' has been issued from the session making the change
$RT is an 'on commit trigger' and will not fire until the change has been
successfully committed to the table.
2) Make sure the 'commit' was successful. If the 'commit' for the transaction
was not successful, $RT will not generate an entry in deftran. This avoids
propagating a 'bad' transaction.
3) Check if there are any other master site members for the regroup to which
the updated table belongs.
When the $RT fires, it will check to see if there are other master sites
to which the data change needs to be propagated. If no other master sites
belong to the group, the $RT will not generate an entry in deftran.
To check this run the following query as REPADMIN:
select * from dba_repsites where gname = <repgroup>;
gname = repgroup to which the updated table belongs
If only one record is returned for the gname, there are no other master
sites designated for the repgroup and thus, no need to generate an entry
for deftran.
4) If the table update has been successfully committed and there are other
master sites designated for the group, it is likely that the $RT trigger
has become corrupted or destroyed. You will need to regenerate replication
support for the object.
2.1.2 Deferred transactions created but not propagated
------------------------------------------------------
There are transactions in the deftrandest view but they are not being
propagated. The component that is responsible for send pending transactions
to their remote destinations is the 'push' job.
In Oracle 7.3 the 'push' operation was carried out by dbms_defer_sys.execute.
This 'push' could be automated with the dbms_defer_sys.schedule_execute
procedure. In Oracle 8.0+, the calls are to dbms_defer_sys.push (function),
and dbms_defer_sys.schedule_push (procedure), respectively.
When the push is called, all pending transactions in the deferred transaction
queue for the specified destination are sent. Once the existing deferred
transactions are successfully sent, the procedure is finished. Transactions
generated after the procedure is called will remain in the deferred
transaction queue until the procedure is called again for that destination.
The schedule_push procedure creates an automated job in the job queue
(dba_jobs) that will make a call to the dbms_defer_sys.execute/push. If
transactions are not being propagated, it could be due to problems with the
'push' process itself, or the automated job process. To determine where the
problem lies, you will want to separate the two processes and troubleshoot
them individually. First, make sure that the automated job is configured
properly. If so, then try running the push manually by calling the
dbms_defer_sys.execute/push API directly.
Please see Note 1035874.6 section 3 for troubleshooting the push job.
2.1.3 Deferred transactions propagated but not applied to receiving site
------------------------------------------------------------------------
The deferred transactions are propagating, however you do not see the changes
at the remote site after propagation. More than likely, the transaction failed
on the remote site. Check the deferror view at the remote site, taking note of
the error number and message for the deferror entry. You can also view failed
transactions with Replication Manager.
If needed, you can determine the row values and row operation for each call
by using show_call scripts that can be found on metalink:
Note 2063747.4 V7.3 PRINTING DEFCALL FOR A SINGLE TRANSACTION
Note 2065172.102 V8.0.X PRINTING A SINGLE CALL FOR A TRANSACTION
Note 2103883.6 V8.0.X PRINTING DEFCALL FOR A SINGLE TRANSACTION
(Notes Note 2065172.102 and Note 2103883.6 can also be used for 8i)
Or, you can use Replication Manger to display the information by drilling down
through transactions and calls in the 'Local Errors' folder.
This information can be used to identify the row at the local master site for
manual correction.
Please see Note 1035874.6 section 4 for more information on troubleshooting
failed deferred transactions.
2.1.4 Deferred transactions propagated and no longer in the Deftran view
but def$_aqcall table at local site still has entries.
------------------------------------------------------------------------
In Oracle 7.3, the dbms_defer_sys.execute() package will automatically remove
the deferred transaction entry from the local def$_call when it has been
successfully sent to it's remote site. In Oracle 8.0+, this 'send and purge'
operation has been separated. Oracle 8.0+ replication relies on the 'purge'
job to clean propagated transactions from the local def$_aqcall table once
they have been successfully pushed to their remote site.
As with the 'push' job, the 'purge' job can be run manually with the function
dbms_defer_sys.purge(), or automated with the dbms_defer_sys.schedule.purge()
procedure. The automated job calls the dbms_defer_sys.purge() package. There
are two types of purges that can be done; a lazy purge (default), and a precise
purge.
The lazy purge will purge transactions with a cscn lower than the local low
water mark for propagated transactions (this is calculated based on the minimum
last_delivered in the local def$_destination). This low water mark can be lower
than some cscn numbers of some previously pushed transactions, so they will not
be purged immediately. This can happen if not all the push jobs have run and
still have active transactions. In this case, the transactions will remain in
the def$_aqcall until the low water mark rises above the cscn for the
transaction.
A precise purge will purge transactions with a cscn lower than the low water
mark for propagated transactions to it's specific destination. This means
that the purge will query the last_delivered for each dblink destination. All
transactions that have been pushed from the local site to that destination will
usually fall below the low water mark and be purged.
Users should NEVER manually remove entries from def$_aqcall unless directed by
an Oracle Support analyst.
Please see Note 1035874.6 section 4 for more information on troubleshooting
the purge operation.
2.2 Administrative Requests
---------------------------
2.2.1 Admin requests generated but not propagated
-------------------------------------------------
When you use the dbms_repcat package to administer a master repgroup or
repobject, you will see administrative requests generated in the dba_repcatlog
view, which is commonly referred to as the admin request queue.
When the administrative request (admin request) is first generated at the
master definition site, it will be in a 'READY' status. Execution of requests
on the queue depend on an automated job to apply them at the current site and
to propagate them to other master sites if needed. Most calls to dbms_repcat
at the master definition site will automatically run the local job when the
first set of admin requests generated by the call enter the queue, so you will
see some or all of the calls in the 'AWAITING CALLBACK' status almost
immediately at the master definition site. There are interdependencies between
some admin requests, so if one request errors, dependent requests will not run
until the error has been resolved.
To troubleshoot, please refer to the "Admin requests generated but not
propagated" section of Note 180014.1
2.2.2 Admin requests propagated but not applied at receiving site
-----------------------------------------------------------------
When an admin request is propagated to a remote master site, it is placed
in that master site's dba_repcatlog with a 'READY' status, and must wait for
the do_defer_repcat_admin job for the associated repgroup on that site to
run before it can be applied. When the job runs, the admin request status is
changed to 'DO CALLBACK' and remains so until the admin request has been
completed.
If the admin request is applied successfully, it will be removed from the
dba_repcatlog on both the master definition site and the master site. If the
admin request fails, it will be removed from the master site and marked with
status 'ERROR' at the master definition site.
Special note: If you loose the connection between the master definition site
and the master site, for what ever reason, while an admin request is in
AWAITING CALLBACK at the master definition site, the admin request becomes
corrupted and must be manually removed from both the master definition site
and the master site.
To troubleshoot, please refer to the "Admin requests propagated but not
applied at receiving site" section of Note 180014.1
2.3 Master Repgroup State
-------------------------
2.3.1 Stuck in 'Quiescing' mode when suspend_master_activity command issued
---------------------------------------------------------------------------
A master replication group, or repgroup, can be in one of three states:
Normal, Quiescing or Quiesced
When you suspend activity of a repgroup, Oracle will attempt to push all
waiting transactions in deftran to remote master sites. While this is running
the repgroup is in a Quiescing state. Once deftran has been cleared, the
repgroup then becomes quiesced. A repgroup cannot become quiesced until all
defered transactions have been pushed. If there are any problems pushing the
transactions, the repgroup will remain 'stuck' in quiescing mode.
To recover from this : refer to the troubleshooting guide Note 180014.1
(Master repgroup is stuck in 'QUIESCING' mode and does not go to quiesced).
2.3.2 Group still quiesced after resume_master_activity command issued
----------------------------------------------------------------------
A master repgroup cannot resume activity until all admin requests for the
group have been successfully applied at all master sites for the repgroup.
If you have issued the resume_master_activity command but the group is still
quiesced, there may be admin requests in the admin queue that need to be run
or that have failed with an error. To check this :
select id, request, status, master from dba_repcatlog;
If rows are returned, run the associated do_defer_recat_admin jobs at all
master sites for the repgroup until all admin requests have been applied. The
last admin request that you should see applied is the resume master activity
request.
If you have admin requests that have failed you will need to resolve the
errors, regenerate the admin request, and delete the requests with errors from
the admin request queue (see 2.2 Admin Requests). For additional information
please refer to Note 180014.1
2.4 Automated job not running
-----------------------------
Oracle Replication uses the job queue to automate the propagation and purging
of deferred transactions, the propagation and application of administrative
requests, and snapshot refreshes. If the jobs are not configured properly,
they will not run as expected and their associated tasks will not be completed.
To troubleshoot a job, address the following:
1) If this is a new database. Bounce the database at least once. You may
need to drop the job and reschedule it. This is one of those
idiosyncrasies of new databases and automated jobs.
2) Make sure the package the job is calling is valid (see section 1.1 Packages)
3) Make sure the log_user of the job has privileges to do what the job is doing
(see section 1.2 Users and privileges).
4) Make sure all database links required by the job are available and working
(see section 1.3 Database links).
5) Make sure the job is not broken or does not have failures.
Issue the following command:
select job, what, next_date, interval, log_user, broken, failures
from dba_jobs;
job - job identifier number
what - procedure the job will run
next_date - the next time the job will run
interval - how often the job should run
log_user - owner of the job; should be user with proper
privilages, usually repadmin user or propagator
broken - if true, job is broken and will not run automatically
failures - number of times a job has attempted to run and failed
If a job has failed 16 times, it will mark itself as broken and requires user
intervention to run again. If a job has failed, it will generate a trace file
snpxxx.trc in the background_dump_dest directory. The trace file will show the
cause of the failure. Resolve the problem, then run the job manually with
dbms_job.run(<job#>). Also see Note 103349.1
If the next_date of the job is in the past, try running the job manually.
If the job runs fine check your job queue parameters:
job_queue_processes = (number of jobs in dba_jobs that may run
concurrently) + 1
job_queue_interval = less than the interval for the job itself
(keep in mind that jobs may not run exactly on
time. The job could actually run as late as
the next_date + job_queue_interval)
If the job runs fine manually, make sure the next_date increments by the
interval. Monitor the job queue to see if the job runs as scheduled. If the
job still does not run automatically, contact Oracle Support.
2.5 Snapshot refreshes
----------------------
2.5.1 Snapshot refresh loses snapshot changes
---------------------------------------------
The base concept of a snapshot is to contain an exact copy of the data in it's
associated master table at the time of refresh. If changes are made at an
updateable snapshot site, but are lost when the snapshot is refreshed, the
changes either did not take place at the master site or were overwritten by
later changes. If your snapshot changes are lost after a successful refresh,
check the following:
1) Make sure the changes are being recorded in the deftran at the snapshot
site.
2) Make sure the refresh job is pushing the changes as part of the job:
The dbms_snapshot.refresh package parameter 'push_deferred_rpc' defaults
to true. This will cause the refresh job to push the entries in deftran
before it refreshes the snapshot. If this parameter is set to false, the
transactions will not be pushed and thus not applied at the master site.
When the snapshot refreshes, the changes will appear to be 'lost'. However,
when the transactions are pushed, usually by a separate push job, the
changes will then be visible at the master site and pulled to the snapshot
site on the next refresh.
3) Make sure the deferred transactions are propagated properly (see section
2.1 Deferred Transactions).
4) Make sure deferred transactions did not error out at the master site
(see section 2.1.3 of 2.1 Deferred Transactions).
5) If conflict resolution has been implemented at the master site, make sure
you are familiar with the results of the resolution methods. This could
be the cause of the difference between what you are seeing and what you
expect to see:
An example of this would be the EARLIEST TIMESTAMP resolution method which
will discard changes with a newer timestamp if a conflict is detected.
6) If more than one site can update the same record that the snapshot site
can update, the snapshot update may be replaced with a newer update when
refreshed.
2.5.2 Snapshot can not fast refresh
-----------------------------------
There are a number of mechanisms in place at both the snapshot site and the
master site to facilitate a fast refresh. The main reasons a snapshot cannot
be refreshed are because it is either too complex or the snapshot log on the
master site is not accessible or out of date.
See Note 179469.1 DIAGNOSING THE ORA-12004 for fast refresh problems and
resolutions.
See Note 35217.1 WHEN SNAPSHOT LOGS GET PURGED for discussion on how snapshot
logs on the master site are used for fast refreshes.
Special Note: In Oracle 8i, the term 'materialized view' replaces the term
snapshot. In Oracle 8i, materialized views can reside on the same site as
it's master table. This is NOT regarded as a replicated environment, but
rather a data warehousing environment. There are special conditions that
apply only to data warehousing materialized view fast refreshes and not to
replicated materialized view fast refreshes. The easiest way to determine
what you can and can not do with fast refreshes is look at where the master
table for the snapshot resides. If the master table resides on a remote site
(you have to go over a database link), your fast refresh is governed by the
rules laid forth in the Replication manual. If the master table resides on
the same site, your fast refresh is governed by the rules laid forth in the
Tuning manual.
3. ISSUES WITH REPLICATION MANAGER SETUP WIZARD:
================================================
Replication Manager Setup Wizard connects as user SYSTEM to create replication
users and grant privileges. SYSTEM does not have privileges to grant execute
on the specific packages used internally by the replication users.
Irregularities have been encountered when using the setup wizard to create a
replicated environment. It is suggested that you create the replication users
and grant privileges via command line connecting as SYS as SYSDBA.
Replication setup wizard assumes that the global_name and tnsalias for a
database are identical when creating the public database link. This is often
not the case, and you will get an ora-12154 when attempting to access the
database link. It is suggested that you create the replication users database
links via command line.
See the following setup note for Multi-Master replication environment setup
and configuration :
Oracle 8.1 onwards - Note 117434.1
See the following setup note for Single-Master replication environment setup
and configuration :
Oracle 8.1 onwards - Note 112075.1
RELATED DOCUMENTS
-----------------
Oracle Server Replication Manual
_______________________________________________________________________________
Oracle Support Services
.