1. TCP/IP measures, including flow rates, internal TCP queuing, connects pending,
and connects satisfied.
2. Non-preemptive threading measures, including thread memory, context
switching, hung threads, etc.
3. Higher-level capabilities in MEASCOM (or integration with
other tools) to do correlations/pattern analysis (MAW interesting).
I'm with John Nash too on this (point 10 below), although I
still *really* want pthread/Java thread
instrumentation.
We found one rather simple thing missing: There is no chance
to see the current temperature of any of the components.
@Mark: Concerning your question, you might find an
interesting article in the next issue of The Connection.
We are monitoring our NonStop using the open-source tool Nagios, because we
monitor all our other systems with Nagios and the NonStop should not be an
exception. Because of that we had to create our own "Nagios NonStop
client". During the creation of that client I found that the documentation
of the existing SPI interfaces is only partly available. For example, the
documentation of the SPI interfaces for IP/V6 and OSS is missing, and I suspect
that others are missing, too. As far as I know, RDF still has no SPI interface.
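For anyone who has not written a Nagios check before: whatever the client does on the NonStop side, Nagios only expects a plugin to print one status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). Below is a minimal sketch of just that convention. The metric name `cpu_busy`, the thresholds, and the function `evaluate` are hypothetical and are not taken from the client described above; how the value is actually collected (SPI or otherwise) is left out.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(name, value, warn, crit):
    """Map a numeric reading to (exit_code, status_line).

    `name`, `warn` and `crit` are illustrative; a value of None means the
    reading could not be obtained from the monitored system.
    """
    if value is None:
        return UNKNOWN, f"UNKNOWN - {name} could not be read"
    if value >= crit:
        code = CRITICAL
    elif value >= warn:
        code = WARNING
    else:
        code = OK
    # Text after '|' is Nagios performance data: label=value;warn;crit
    line = f"{LABELS[code]} - {name} is {value} | {name}={value};{warn};{crit}"
    return code, line


def main():
    # Hypothetical reading; a real client would fetch this from the NonStop.
    code, line = evaluate("cpu_busy", 42, warn=80, crit=95)
    print(line)
    return code
```

A script entry point would typically end with `sys.exit(main())` so that Nagios sees the exit code.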
The interaction between Guardian and OSS is limited: you might have an issue
in OSS but can't really see it. You really have to dig down, but OSS
performance differs from Guardian and the tools do not really help much in
that area.
ASAP is not that bad. But, you are correct;
there are no integrated tools out there that will do everything, as far as I
know.
@Mauricio, can you please give an example of
something you "really can't see"? At the level of, e.g., Measure, there
is hardly any difference between Guardian-side programs and OSS-side programs
(apart from the funny names of the latter -- and even
these names are often translated into OSS pathnames). Obviously I am missing
something I have never missed. Looking forward to your
paper, Mark.
We use NonStop for the National
Switch in Oman. As a pure switch (i.e. driving no devices), the standard tools
are sufficient. Users with more complex environments may be those with a
greater need.
Is there any product out there similar to Wireshark that can read PTrace files
(TCP/IP traces)? Also, it would be great if we could specify that only
a specific source / destination be traced.
As I understand it, it is currently all or nothing: no
filtering for IP trace.
@Derek: I imagine you realize this, but in case not, and for
others reading this, let me mention that for systems that use CIPs for the
TCP/IP communications, line traces are done using the Linux
tcpdump command, so the files certainly can be analyzed using Wireshark or any other
software that understands tcpdump files.
But you asked specifically about PTrace
files. I just looked at the Wireshark web site, and
it gives a very long list of file formats that Wireshark
can read. I did not see anything in the list that looked like it was naming the
PTrace format from a NonStop
system, but the PTrace format might be a duplicate of
one of those formats that Wireshark does recognize,
so it would be reasonable to ftp a PTrace file to a
Windows or Linux machine and give Wireshark a try on
it. You might be pleasantly surprised. If that doesn't work, I'll bet it would
take only a very simple program to convert a PTrace
file into a file that Wireshark would recognize, but
I don't know enough about the subject to attempt the job.
As for tracing activity only for a specific source or
destination, tcpdump can do that, so that is possible
when the system uses a CIP for the TCP/IP communications. I don't know about
the tracing with the other TCP/IP hardware. Perhaps you are correct that no such
filtering is possible with those devices.
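To make the source/destination point concrete: tcpdump accepts a filter expression such as `host <address>` at capture time. If a capture has already been taken without a filter, the same narrowing can be done afterwards. Here is a rough sketch that picks packets to or from one address out of a classic pcap file. It assumes an Ethernet capture carrying IPv4, the function name `packets_for_host` is made up for illustration, and it says nothing about the PTrace format.

```python
import socket
import struct

PCAP_MAGIC_LE = b"\xd4\xc3\xb2\xa1"  # classic pcap, little-endian, microsecond timestamps
PCAP_MAGIC_BE = b"\xa1\xb2\xc3\xd4"  # same format written big-endian
LINKTYPE_ETHERNET = 1


def packets_for_host(data, host):
    """Yield (src, dst, frame) for IPv4-over-Ethernet packets to or from `host`.

    `data` is the full contents of a classic pcap file as bytes.
    """
    magic = data[:4]
    if magic == PCAP_MAGIC_LE:
        endian = "<"
    elif magic == PCAP_MAGIC_BE:
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")
    # The link-layer type is the last 4 bytes of the 24-byte global header.
    linktype = struct.unpack(endian + "I", data[20:24])[0]
    if linktype != LINKTYPE_ETHERNET:
        raise ValueError("only Ethernet captures are handled in this sketch")
    offset = 24
    while offset + 16 <= len(data):
        # Per-packet header: ts_sec, ts_usec, captured length, original length.
        _sec, _usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + 16]
        )
        frame = data[offset + 16:offset + 16 + incl_len]
        offset += 16 + incl_len
        # Need a 14-byte Ethernet header plus a minimal 20-byte IPv4 header,
        # and EtherType 0x0800 (IPv4).
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue
        src = socket.inet_ntoa(frame[26:30])
        dst = socket.inet_ntoa(frame[30:34])
        if host in (src, dst):
            yield src, dst, frame
```

Usage would be along the lines of reading the file with `open(path, "rb").read()` and iterating over `packets_for_host(data, "10.1.2.3")`, where the path and address are placeholders.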
Ted's comment (12) was that there is a need for an event
throttling mechanism in EMS that does not require 3rd party apps or application
augmentation. My question is: What is it about the EMS abilities to configure
event burst detection and suppression or to entirely suppress events that match
an event filter loaded into a collector that doesn't meet the need Ted expresses?
I have never tried to use either of those mechanisms.
I just know that they exist, and I thought they were implemented to address the
kind of problem Ted seems to be talking about. Do they not do
all that is needed?
Okay, I guess what you are saying is that
the problem is with applications that still write text messages to $0. If those
applications go crazy and flood the collector with messages, there currently
isn't a good way to throttle that. Is that correct?
The documentation for burst filtering says that it will detect text events as
similar if the text is identical, and it can keep track of up to 128
simultaneous bursts. In the cases you have seen, are the contents of the text
events that ought to be detected and suppressed not identical? Does turning on
the burst detection and suppression overload $0 so that it cannot keep up with
the incoming event rate? Something else?
@Ted: I'm slightly confused by what you
wrote. I'm not saying you did not have a problem; I just don't understand what
happened yet.
All the messages in the $0 EMS log *are* in EMS format. Messages that the
application did not tokenize but wrote the old way as a simple text string get
turned into a default EMS format by $0. Those events have subsystem ID EMS,
event number 512, the subject is the system number and PID of the process that
wrote the message, and the message text is put into the ZEMS-TKN-TEXT token in
the event.
If you have the burst filtering on and you get a flood of text messages written
to $0, the burst filtering ought to detect the text messages that have the same
text content from the same process as being duplicates and the burst filtering
should suppress them. If the text is not identical, then the messages will not
be recognized as duplicates and so the burst suppression won't suppress them.
Do you believe the burst filtering did not
detect duplicate text messages, or were the messages in the flood of text
messages not exact duplicates? In the first case, it sounds like a bug in EMS
(unless I am misreading the manual); in the second case, it sounds like some
additional capabilities would be needed in the burst detection.
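To illustrate the distinction between those two cases, here is a small sketch of the duplicate rule as described above: an event counts as a repeat only if the same process sends exactly the same text, and only a limited number of bursts (128, per the documentation quoted earlier) are tracked at once. This is not the EMS implementation. The class name, the time window, and the evict-oldest policy are all assumptions made for the example.

```python
from collections import OrderedDict

MAX_TRACKED_BURSTS = 128  # figure quoted from the burst filtering documentation


class BurstSuppressor:
    """Illustrative only: suppress repeats of identical text from one process."""

    def __init__(self, window_seconds=5.0, max_bursts=MAX_TRACKED_BURSTS):
        self.window = window_seconds   # hypothetical parameter, not an EMS setting
        self.max_bursts = max_bursts
        # (pid, text) -> [last_seen_time, suppressed_count], oldest first
        self.bursts = OrderedDict()

    def accept(self, pid, text, now):
        """Return True if the event should be logged, False if suppressed."""
        key = (pid, text)
        entry = self.bursts.get(key)
        if entry is not None and now - entry[0] <= self.window:
            # Exact duplicate inside the window: count it and suppress it.
            entry[0] = now
            entry[1] += 1
            self.bursts.move_to_end(key)
            return False
        # New or expired burst: start tracking, evicting the oldest if full
        # (the eviction policy is an assumption for this sketch).
        if entry is None and len(self.bursts) >= self.max_bursts:
            self.bursts.popitem(last=False)
        self.bursts[key] = [now, 0]
        self.bursts.move_to_end(key)
        return True
```

With a rule like this, a flood of "LINK DOWN" from one process collapses to a single logged event, but text that embeds a changing timestamp or transaction ID never matches exactly and so is never suppressed, which is the second case asked about above.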
@Ted: I don't believe you need to have been the one to have
written the logging code in the application to answer the question I asked:
Were the messages that did not get suppressed truly duplicates (according to
the EMS rules), or did the content of the events differ enough that the EMS
rules did not classify them as duplicates? If you don't know enough about the
messages to know that, are you willing to talk with someone who does know?
The message rates are not, as far as I know, the
issue. Event burst detection and suppression is specifically intended to
operate in cases where large numbers of duplicate events are submitted very
rapidly. If it somehow fails to work properly at high event rates, I think it
is not operating properly.
This is not "One for the guys in the Lab"
yet. Until you, or someone, can definitely characterize the problem as one of
the two cases I ask about above, the "guys in the Lab" wouldn't know
whether they are looking for a bug in the current implementation or the need
for enhancement of the detection rules. I assume that the QA test suite already
contains tests that check that burst detection and suppression is working to
some degree. Without more details, I think it isn't reasonable to ask
Development to investigate this.
If you, or someone, has
already filed a problem report on this, then I apologize -- you've already done
what is necessary to call attention to the problem. If you could tell me the
case number (or customer name, some keywords you used, and approximate date), I
will see whether I can find out what has happened with it. But if the problem
hasn't been reported in enough detail so that it is clear where to start to
work on it, it really would help to get the details pinned down. I have no influence
on whether the problem would get fixed, but I could at least ask people closer
to Development whether they could look into it.
I never had the chance to test run Insider Technologies'
"Reflex" for NonStop. I'm an advocate of
MOMI, which I find the best value for the money, and a
tool with amazing drill-down capabilities and easy access to virtually all
Measure entities and counters we require for everyday performance analysis and
problem solving.
QIO segment current pool size increasing close to the segment
size; most times the issue is only found when the pool is exhausted.
Sockets in CLOSE-WAIT state indefinitely.
Easy access to SS7 over IP state, stats, etc.
Most performance/operation tools are geared toward banking applications, but Telco is using NonStop
for many cell phone applications.
Easy access to CLIM information/commands and a trace
facility are needed. climcmd is SLOW and a CLIM trace is
painful: you need to execute the trace on the CLIM,
make sure it stopped when terminated, and then move the file to Guardian to
use ptrace, or to your PC if using Wireshark
or similar. Oh, and make sure you don't leave too many trace files on the CLIM disk. Maybe incorporate ptrace
into the operation/performance tool for display in the GUI or such.
Things you can't see with Measure... let me check my list...
Just a few highlights - everywhere that the operating
system is using special APIs that don't increment counters: TMF's activity,
SQL/MX file file-busy-time (we've had a ton of issues with SQL/MX's use of
different APIs and ignoring the counters; it's getting better, very very very slowly), and TCP/IPv6's lack
of use of file-busy despite the fact the reads/writes are there (as compared
with conventional TCP/IP). With every operating system upgrade I open 5-10-15 cases
regarding Measure counters no longer being incremented or being incremented
differently.
Another one of the big things we would love to have is
Measure having knowledge of serverclass names without
us having to go through the hack we put together for reporting. A lot of
shops can use the program name as a proxy for serverclass,
but in my case the programs are OSS and I do most of the analysis and reporting
on another node and can't capture the OSS journal due to the OSS name server
overhead.
I'm sure there is more that I've swapped out, but I
need to get back to earning my pay - hope that helps.
I totally agree with John Nash about SQL/MX and TMF. Heap
Size is a counter that should be added to aid looking for Memory Leaks (it's in
PSTATE but not in Measure).
A true throttling mechanism for EMS that
works without the need for 3rd party apps or application
augmentation. Certain apps have literally thousands of amber alert messages
that are vital for troubleshooting, but that can kill a system in seconds when
activated. HP needs a true way of detecting that and preventing it from hurting
the system.
Everybody, don't give me that it's the Application Team's job to clean the junk
out of $0, because that just won't fly. Dedicated collectors work, but there
are just some things that need to be seen along with the regular operator's
view that helps the Application support team during incident management
reviews.
All messages need to be tokenized for the
throttling mechanisms to work. EMS Faststart is
designed to fix this problem when used properly, but getting application
support teams to spend the time on this "zero revenue" generating
effort is often pushed to the bottom of the stack, which
leads to volumes of messages that could potentially overrun your EMS logs in
the event of a network or hardware failure. Sorry for posting my comment
on the wrong group, Mark.
Keith, you are correct when used properly,
but there are old message systems that have a tremendous volume of conversion
effort necessary, and development teams often relegate this work to the back
burner. Having supported several environments with this condition over the past
20 years, it's frustrating to always have to gather EMS logs and perform triage
following a client-impacting incident, just to discover that this age-old
problem is the root cause of the issue, but gets quietly pushed under the
rug as a CPU/memory bottleneck related problem. MessageSYS
overruns do cause outages, contrary to popular opinion.
Burst filtering is on and does its job well, but the messages
that are not in EMS format are the ones which ultimately overwhelm $0 and
associated message processing. We have workarounds
to this issue, so not a big one, just an itch that is hard to scratch.
I didn't write the code on EMS, so I can't
answer that one for you. In a transaction processing world where sub-second
messages are sent back and forth to verify whether a transaction can be
authorized or reversed out of the system, how does EMS determine if the
messages are duplicates or not?
There are literally hundreds of messages a second when things go wrong, with
sub-second timestamps. One for the guys in the Lab. As I stated, we found a way
to limit the possibility of it occurring again via internal measures.
Nobody mentioned Prognosis. The learning curve is a bit
steep but the capabilities are truly enormous. I have gotten pretty good at it
myself (Nothing like getting thrown in the deep end:)) Many of the metrics come
from their custom servers that do SPI calls so the I/O overhead to a MEASURE
file is cut way down. It is actually much more efficient than MEASURE in many
cases. The weaknesses I have observed with Prognosis in general are the
things missing from Guardian itself: very limited information about
message bus I/O, ServerNet, CLIM metrics, and the
association between SQL table names and dynamic compiles for SQL query tuning
and performance bottlenecks (JDBC/ODBC queries) are a few that come to mind.
Prognosis is very good at slicing and dicing the data. For more complex
analysis, it can quickly generate Excel data that can be analysed with pivot
tables. If you have a well-thought-out Prognosis repository, you can do very
effective post-mortems on crisis events. If you put the repository on a Windows
clustered server with a RAID storage array (i.e. like DirecTV uses) you can
have a fault-tolerant solution and offload all the
disk I/O from the NonStop. It can do everything
Measure can do - and much more.