Oracle Enterprise Manager 10G Grid Control - Lessons I've Learned the Hard Way

We now have several 10G analysis projects underway. I thought I’d give you a quick update on our accomplishments (or lack thereof) on one of them. This will hopefully prevent others from experiencing the same issues that we just have.

Lessons I have learned the hard way:

  • That you should never use the SHUTDOWN NORMAL command to stop an Oracle database.  I issued one in June of 1996 that I'm still waiting for a response from.
  • Watching the number of bytes increase on a file that you are restoring from a backup won't make it run any faster.
  • That you should never believe a student in an Intro to Oracle class when he says "I didn't even touch it, it just broke!"  Note to my fellow Oracle instructors - just restore the database.  He probably pulled the old ALTER, RENAME, RESIZE, REMOVE and REALLOCATE trick on the database's control files. 
  • That it's not OK to allow a "slightly broken" 10G agent to continue running.

This is one of the projects that we’ll have to temporarily dump into the “lack of accomplishments” column.  We’ll also file it in the “exasperating” and “patience building” columns.  You know, I have been in this field a long time and continue to find it challenging at times.  I have also found that since becoming a manager, my administrative and problem-solving skills are getting a little rusty, so to speak.  Personally, I find that somewhat concerning.  But let’s put my concerns aside and continue our discussion.

The Target Discovery Process
Before I give you the gory details on our problems, let me describe how nodes are added into the 10G Management Server.  In 9I and earlier releases of OEM, you fired up the agents on the targets and then asked the management console to “discover” the node.  You specified the node name in a wizard on the management console to add the new target node into the management framework.    The databases and listeners running on that node were identified during the discovery process. 

The discovery process in 10G is the exact opposite of 9I's.  During the agent install on the target, the Oracle Universal Installer asks for the name of the 10G management server that will be responsible for administering the new environment.  This information is recorded in the EMD.PROPERTIES configuration file on the target.  When the agent is started on the target, it “injects” itself into the management server identified during the installation.
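
If you want to confirm that the injection actually made it into the repository, a quick look at the SYSMAN schema helps.  The query below is a minimal sketch, assuming the repository table SYSMAN.MGMT_TARGETS and its TARGET_NAME, TARGET_TYPE and EMD_URL columns; verify the names against your own 10G repository before relying on it.

-- Minimal sketch: list what the repository thinks it is managing after an agent injects itself.
-- Assumes SYSMAN.MGMT_TARGETS with TARGET_NAME, TARGET_TYPE and EMD_URL columns.
SELECT target_name,
       target_type,
       emd_url
  FROM sysman.mgmt_targets
 ORDER BY target_type, target_name;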

We wanted to test deleting and re-adding an entire node from the management console.  BIG MISTAKE.  Apparently, this generates a problem or two with the management server.  Actually, it generates more than a few problems; it generates a LOT of them.  First, the server didn’t clean up any of the rows in the management server’s configuration tables.  When we attempted to add the node again, we received a SYSMAN trigger error stating that the node was in the process of being deleted.

Although the node was no longer showing up in OEM, there were rows in the configuration table SYSMAN.MGMT_TARGETS_DELETE that caused the trigger to think that the node was in the process of being deleted.  We reviewed the contents of the table and found that it contained a history of every target (database, listener, node) that we deleted in the past.   None of these previously deleted targets could ever be successfully added back into the environment.  That is a problem.
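
If you suspect the same situation, you can look for leftover delete records before fighting the trigger.  The query below is a minimal sketch, assuming SYSMAN.MGMT_TARGETS_DELETE carries TARGET_NAME and TARGET_TYPE columns (describe the table in your own repository first); 'agenthost' is just the placeholder host name used in this post.

-- Minimal sketch: check for stale delete records for the node you are trying to re-add.
-- Assumes TARGET_NAME and TARGET_TYPE columns; 'agenthost' is a placeholder host name.
SELECT target_name,
       target_type
  FROM sysman.mgmt_targets_delete
 WHERE target_name LIKE '%agenthost%';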

We created a TAR with Oracle and received a very quick response.  Of course, the analyst’s recommendation was for us to upgrade to version 10.1.0.3 of OEM.  Apparently, this is the “catch all” patch that solves a lot of problems relating to the agents.  We upgraded to 10.1.0.3 on both the management server and the target and still had the same problem adding the node back in.  We then deleted another target to determine whether the patched code would properly remove the corresponding rows from SYSMAN.MGMT_TARGETS_DELETE.  Although the patch did fix that original problem for newly deleted targets, the old rows were still sitting in SYSMAN.MGMT_TARGETS_DELETE.

We determined that the rows still residing in the configuration table SYSMAN.MGMT_TARGETS_DELETE continued to be the source of our problem.  So we did what any good DBA would do: we backed up the table and deleted all of the offending rows in it.  That fixed our first problem, but we then ran into another issue.  This time the following error was returned:

ERROR-400|ORA-20206: Target does not exist: Agent does not exist for http://agenthost:1830/emd/main/
ORA-06512: at "SYSMAN.EMD_LOADER", line 1656
ORA-06512: at line 1
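
For the record, the backup-and-delete that got us to this point amounted to roughly the following.  It is a sketch only, not a supported procedure: editing SYSMAN tables by hand is risky, so keep the backup and involve Oracle Support before attempting anything similar.

-- Sketch of the manual cleanup we performed; not a supported procedure.
-- Back up the table first so the rows can be restored if needed.
CREATE TABLE sysman.mgmt_targets_delete_bak AS
  SELECT * FROM sysman.mgmt_targets_delete;

-- Remove the stale delete records that were tripping the SYSMAN trigger
-- (add a WHERE clause if you only want to clear the problem node's rows).
DELETE FROM sysman.mgmt_targets_delete;

COMMIT;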

We made another visit to Metalink for further research.  Although we originally thought that we shouldn’t have tried to “trick” Oracle by deleting the rows, note 290527.1 on Metalink told us that we indeed had a new problem.  290527.1 stated that multiple agents running on the target would “confuse” the management server and prevent a successful target injection into the management server from occurring.
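
The note’s point about multiple agents can also be checked from the repository side.  The query below is a minimal sketch, assuming agents are registered in SYSMAN.MGMT_TARGETS with a target type of 'oracle_emd' and a HOST_NAME column; confirm both against your own repository.

-- Minimal sketch: find hosts that appear to have more than one agent registered.
-- Assumes agents show up as target_type 'oracle_emd' with a HOST_NAME column.
SELECT host_name,
       COUNT(*) AS agent_count
  FROM sysman.mgmt_targets
 WHERE target_type = 'oracle_emd'
 GROUP BY host_name
HAVING COUNT(*) > 1;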

Other day-to-day work needed our immediate attention, so we decided to try again the next morning.  The target agent was left in an active state.  This was a mistake on our part (really, my part).  The agent kept trying to inject itself into the management server every few minutes throughout the night.  Each time it failed, it generated a 5 MEG error log on the management server.  The error messages rapidly filled the $ORACLE_HOME/sysman/log directory on the management server, which ultimately caused OEM to fail and become totally unusable.


Lessons Learned

  • Do apply patch 10.1.0.3.  This would have prevented the whole comedic set of problems from occurring.  I found multiple articles on Metalink stating that it fixes numerous issues with both the management server and the target agents.
  • Don’t let an agent that is having problems inserting itself into the management server continue to run.  It will fill up the error log directory which may cause your management server to crash (like ours did).
  • Do use OEM 10G’s monitoring capabilities to monitor the space used in its error log directories.  We intend to monitor space utilization in the error log directory every 15 minutes. That way, if any of our agents “flakes out” and is automatically restarted by its monitoring process, we won’t fill up the error log directory.
  • Our Oracle application server DBA, Jeff Kondas, recommended that we customize the script he uses to move application server error logs to a hold directory.  He runs a daily cron job that moves all of the files in the error log directories to a hold directory, compresses the error logs, and then removes them from the hold directory on a semi-regular basis.  That allows him to keep a running history of logs in case historical error analysis is required.

 


 


Tuesday, February 01, 2005

lessons !!

Posted by saaya at 2006-08-01 03:09 AM
I guess that was pretty much a mess ... a hard way to learn lessons. As far as I understand, we need not mess with the OMS and its tables directly, as the tables have interdependencies. From "Management Systems" we can check whether the target has been deleted or not.
 
