1.           LinkedIn Responses to: What’s missing for NonStop in 2012?

 

1.1.         Randall Becker - Development Team at ITUGLIB Engineering Team, Toronto

1. TCP/IP measures, including flow rates, internal TCP queuing, connects pending, and connects satisfied.

2. Non pre-emptive threading measures including thread memory, context switching, hung threads, etc.

3. Higher-level capabilities in MEASCOM (or integration with other tools) to do correlations/pattern analysis (MAW interesting).

I'm with John Nash too on this (point 10 below), although I still *really* want pthread/Java thread instrumentation.  

 

 

1.2.         Wolfgang Breidbach – System manager at BV Zahlungssysteme, Koln

We found one rather simple thing missing: There is no chance to see the current temperature of any of the components.
@Mark: Concerning your question, you might find an interesting article in the next issue of The Connection.

We monitor our NonStop with the open-source tool Nagios because we monitor all our other systems with Nagios and the NonStop should not be an exception. That meant we had to create our own "Nagios NonStop client". While building that client I found that the documentation of the existing SPI interfaces is only partly available. The documentation of the SPI interfaces for IP/V6 and OSS is missing, and I suspect that others are missing, too. As far as I know, RDF still has no SPI interface.
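
For readers unfamiliar with what such a client has to produce, here is a minimal Python sketch of the Nagios plugin convention: one status line (optionally followed by performance data after a "|") and an exit code of 0, 1, 2, or 3 for OK, WARNING, CRITICAL, or UNKNOWN. It is not the client described above. The read_cpu_busy function, the thresholds, and the CPU label are hypothetical placeholders; a real client would obtain the value from the NonStop, for example through an SPI request.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value onto a Nagios status (higher value is worse)."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_result(label, value, warn, crit):
    """Return (exit_code, output_line) in the usual 'text | perfdata' shape."""
    status = evaluate(value, warn, crit)
    if status == UNKNOWN:
        return status, f"{label} UNKNOWN - no data returned"
    perfdata = f"{label.lower()}={value}%;{warn};{crit}"
    return status, f"{label} {STATUS_NAMES[status]} - {value}% | {perfdata}"


def read_cpu_busy():
    """Hypothetical placeholder: a real client would query the NonStop here.
    Returns a fixed sample value so the sketch can run anywhere."""
    return 42


def main():
    """Print the status line and return the exit code Nagios expects."""
    code, text = format_result("CPU", read_cpu_busy(), warn=80, crit=95)
    print(text)
    return code
```

A script built this way would end with sys.exit(main()) so that Nagios receives the status through the process exit code.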

 

1.3.         Mauricio Bermudez – Advisor Systems Engineer at Syniverse Technologies, Tampa

The interaction between Guardian and OSS is limited: you might have an issue in OSS but can't really see it. You really have to dig down, but OSS performance differs from Guardian, and the tools do not really help much in that area.

 

 

1.4.         Patrick Levesque – Senior NonStop software architect at AJB Software Design Inc., Canada

ASAP is not that bad. But, you are correct; there are no integrated tools out there that will do everything, as far as I know.

 

 

1.5.         Frans Jongma – Sr. System Software Engineer, HP NonStop Advanced Technology Center, Rotterdam

@Mauricio, can you please give an example of something you "really can't see"? At the level of e.g. Measure, there is hardly any difference between Guardian side programs and OSS side programs (apart from the funny name of the latter -- and even these names are often translated into OSS pathnames). Obviously I am missing something I have never missed. Looking forward to your paper, Mark.

 

1.6.         John Russell - Consultant to the Central Bank of Oman

We use NonStop for the National Switch in Oman. As a pure switch (i.e. driving no devices), the standard tools are sufficient. Users with more complex environments may be those with a greater need.

 

1.7.         Derek Wallace - Base24 Developer at First National Bank (FNB), Johannesburg

Is there any product out there similar to Wireshark that can read PTrace files (TCP/IP traces)? It would also be great if we could specify that only a specific source / destination is traced.
As I understand it, it is currently all or nothing: no filtering for IP trace.

 

1.8.         Keith Dick - Independent Computer Software Professional, San Francisco

@Derek: I imagine you realize this, but in case not, and for others reading this, let me mention that for systems that use CIPs for the TCP/IP communications, line traces are done using the Linux tcpdump command, so the files certainly can be analyzed using Wireshark or any other software that understands tcpdump files.

But you asked specifically about PTrace files. I just looked at the Wireshark web site, and it gives a very long list of file formats that Wireshark can read. I did not see anything in the list that looked like it was naming the PTrace format from a NonStop system, but the PTrace format might be a duplicate of one of those formats that Wireshark does recognize, so it would be reasonable to ftp a PTrace file to a Windows or Linux machine and give Wireshark a try on it. You might be pleasantly surprised. If that doesn't work, I'll bet it would take only a very simple program to convert a PTrace file into a file that Wireshark would recognize, but I don't know enough about the subject to attempt the job.
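
As an illustration of how small the output side of such a converter could be, the following Python sketch writes the classic libpcap file layout that Wireshark reads: a 24-byte file header followed, for each packet, by a 16-byte record header and the captured bytes. Decoding the PTrace input is not shown because its record layout is not described in this thread. The sketch simply assumes the packets are already available as (seconds, microseconds, frame bytes) tuples and that they are Ethernet frames (link type 1); both are assumptions.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4    # classic pcap format, microsecond timestamps
LINKTYPE_ETHERNET = 1      # assumption: each record is a full Ethernet frame


def pcap_global_header(linktype=LINKTYPE_ETHERNET, snaplen=65535):
    """Build the 24-byte file header (little-endian, format version 2.4)."""
    # Fields: magic, major, minor, timezone offset, timestamp accuracy,
    # maximum captured length, link-layer type.
    return struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, snaplen, linktype)


def pcap_record(ts_sec, ts_usec, frame):
    """Build one packet record: 16-byte header followed by the frame bytes."""
    # Captured length and original length are the same because the sketch
    # assumes frames were not truncated.
    return struct.pack("<IIII", ts_sec, ts_usec, len(frame), len(frame)) + frame


def write_pcap(path, packets, linktype=LINKTYPE_ETHERNET):
    """Write an iterable of (ts_sec, ts_usec, frame_bytes) to a pcap file."""
    with open(path, "wb") as out:
        out.write(pcap_global_header(linktype))
        for ts_sec, ts_usec, frame in packets:
            out.write(pcap_record(ts_sec, ts_usec, frame))
```

The hard part of a real converter would be the input side, that is, extracting timestamps and frame bytes from the PTrace records.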

As for tracing activity only for a specific source or destination, tcpdump can do that, so that is possible when the system uses a CIP for the TCP/IP communications. I don't know about the tracing with the other TCP/IP hardware. Perhaps you are correct that no such filtering is possible with those devices.

Ted's comment (12) was that there is a need for an event throttling mechanism in EMS that does not require 3rd party apps or application augmentation. My question is: What is it about the EMS abilities to configure event burst detection and suppression, or to entirely suppress events that match an event filter loaded into a collector, that doesn't meet the need Ted expresses?

I have never tried to use either of those mechanisms. I just know that they exist, and I thought they were implemented to address the kind of problem Ted seems to be talking about. Do they not do all that is needed?

Okay, I guess what you are saying is that the problem is with applications that still write text messages to $0. If those applications go crazy and flood the collector with messages, there currently isn't a good way to throttle that. Is that correct?

The documentation for burst filtering says that it will detect text events as similar if the text is identical, and it can keep track of up to 128 simultaneous bursts. In the cases you have seen, are the contents of the text events that ought to be detected and suppressed not identical? Does turning on the burst detection and suppression overload $0 so that it cannot keep up with the incoming event rate? Something else?

@Ted: I'm slightly confused by what you wrote. I'm not saying you did not have a problem; I just don't understand what happened yet.

All the messages in the $0 EMS log *are* in EMS format. Messages that the application did not tokenize but wrote the old way as a simple text string get turned into a default EMS format by $0. Those events have subsystem ID EMS, event number 512, the subject is the system number and PID of the process that wrote the message, and the message text is put into the ZEMS-TKN-TEXT token in the event.

If you have the burst filtering on and you get a flood of text messages written to $0, the burst filtering ought to detect the text messages that have the same text content from the same process as being duplicates and the burst filtering should suppress them. If the text is not identical, then the messages will not be recognized as duplicates and so the burst suppression won't suppress them.

Do you believe the burst filtering did not detect duplicate text messages, or were the messages in the flood of text messages not exact duplicates? In the first case, it sounds like a bug in EMS (unless I am misreading the manual); in the second case, it sounds like some additional capabilities would be needed in the burst detection.
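
To make the two cases concrete, here is a small Python sketch of duplicate-based burst suppression. It illustrates the rule as described in this thread (same process, identical text, a bounded number of tracked bursts) and is not the EMS implementation. The threshold, the time window, the eviction policy, and the process name $APP1 used in the note below are invented for the example.

```python
from collections import OrderedDict

MAX_TRACKED_BURSTS = 128   # figure quoted from the manual in this thread


class BurstSuppressor:
    """Illustrative only: suppress exact duplicates inside a time window."""

    def __init__(self, threshold=5, window=1.0, max_bursts=MAX_TRACKED_BURSTS):
        self.threshold = threshold     # hypothetical: duplicates logged per window
        self.window = window           # hypothetical: window length in seconds
        self.max_bursts = max_bursts
        self._bursts = OrderedDict()   # (process, text) -> [window_start, count]

    def accept(self, process, text, now):
        """Return True to log the event, False to suppress it."""
        key = (process, text)
        entry = self._bursts.get(key)
        if entry is None or now - entry[0] >= self.window:
            # New key or expired window: start counting afresh.
            if entry is None and len(self._bursts) >= self.max_bursts:
                self._bursts.popitem(last=False)   # evict the oldest tracked burst
            self._bursts[key] = [now, 1]
            self._bursts.move_to_end(key)
            return True
        entry[1] += 1
        return entry[1] <= self.threshold
```

Under this rule, a run of identical messages from $APP1 is cut off once the threshold is reached, but messages whose text differs by a transaction number or a timestamp are never seen as duplicates, which corresponds to the second case asked about above.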

@Ted: I don't believe you need to have been the one to have written the logging code in the application to answer the question I asked: Were the messages that did not get suppressed truly duplicates (according to the EMS rules), or did the content of the events differ enough that the EMS rules did not classify them as duplicates? If you don't know enough about the messages to know that, are you willing to talk with someone who does know?

The message rates are not, as far as I know, the issue. Event burst detection and suppression is specifically intended to operate in cases where large numbers of duplicate events are submitted very rapidly. If it somehow fails to work properly at high event rates, I think it is not operating properly.

This is not "One for the guys in the Lab" yet. Until you, or someone, can definitely characterize the problem as one of the two cases I ask about above, the "guys in the Lab" wouldn't know whether they are looking for a bug in the current implementation or the need for enhancement of the detection rules. I assume that the QA test suite already contains tests that check that burst detection and suppression is working to some degree. Without more details, I think it isn't reasonable to ask Development to investigate this.

If you, or someone, has already filed a problem report on this, then I apologize -- you've already done what is necessary to call attention to the problem. If you could tell me the case number (or customer name, some keywords you used, and approximate date), I will see whether I can find out what has happened with it. But if the problem hasn't been reported in enough detail so that it is clear where to start to work on it, it really would help to get the details pinned down. I have no influence on whether the problem would get fixed, but I could at least ask people closer to Development whether they could look into it.

 

1.9.         Hector Gull - Intelligent Voice Platforms Support at Rogers Communications Inc., Toronto

I never had the chance to test run Insider Technologies' "Reflex" for NonStop. I'm an advocate of MOMI, which I find the best value for the money, and a tool with amazing drill-down capabilities and easy access to virtually all Measure entities and counters we require for everyday performance analysis and problem solving.

QIO segment current pool size growing close to the segment size; most times the issue is found only when the pool is exhausted.

Sockets in CLOSE-WAIT state indefinitely.

Easy access to SS7 over IP state, stats, etc.

Most performance/operation tools are geared toward banking applications, yet Telco is using NonStop for many cell phone applications.

Easy access to CLIM information/commands and a trace facility is needed. climcmd is SLOW and a CLIM trace is painful: you need to execute the trace on the CLIM, make sure it stopped when terminated, and then move the file to Guardian to use ptrace or to your PC if using Wireshark or similar. Oh, and make sure you don't leave too many trace files on the CLIM disk. Maybe incorporate ptrace into the operation/performance tool for display in the GUI or such.


1.10.      John Nash - Senior Systems Programmer at SHAZAM Network - ITS, Inc., Iowa Area

Things you can't see with Measure... let me check my list...
Just a few highlights: everywhere that the operating system is using special APIs that don't increment counters. That includes TMF's activity and SQL/MX file file-busy-time (we've had a ton of issues with SQL/MX's use of different APIs and ignoring the counters; it's getting better, very very very slowly), and TCP/IPv6's lack of use of file-busy despite the fact that the reads/writes are there (as compared with conventional TCP/IP). With every operating system upgrade I open 5, 10, or 15 cases regarding Measure counters no longer being incremented or being incremented differently.

Another one of the big things we would love to have is Measure having knowledge of serverclass names without us having to go through the hack we put together for reporting. A lot of shops can use the program name as a proxy for serverclass, but in my case the programs are OSS, and I do most of the analysis and reporting on another node and can't capture the OSS journal due to the OSS name server overhead.

I'm sure there is more that I've swapped out, but I need to get back to earning my pay - hope that helps.

 

1.11.      Mike Hoare - Owner, Kempoller Software, Cologne

I totally agree with John Nash about SQL/MX and TMF. Heap Size is a counter that should be added to aid looking for Memory Leaks (it's in PSTATE but not in Measure). 
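
If a heap size counter were sampled at regular intervals, a simple trend check would be enough to flag a suspected leak. The Python sketch below fits a least-squares line to the samples and reports the growth per interval. The function names, the sample values, and the threshold are hypothetical, and no Measure or PSTATE interface is implied.

```python
def growth_per_sample(samples):
    """Least-squares slope of the samples plotted against their index."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def looks_like_leak(samples, min_growth):
    """Flag a series whose fitted growth per interval reaches min_growth."""
    return growth_per_sample(samples) >= min_growth


# Hypothetical heap sizes (in KB) taken at equal intervals.
steady_growth = [100, 110, 120, 130]
noisy_but_flat = [100, 90, 105, 95]
```

A fitted slope is less sensitive to a single noisy sample than comparing only the first and last values, which is why it is used here.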

 

 

1.12.      Ted Rodgers - IT Director, Midrange Systems Analyst, Performance and Tuning, Disaster Avoidance, Systems Architect, Atlanta

A true throttling mechanism for EMS that works without the need for 3rd party apps or application augmentation. Certain apps have literally thousands of amber alert messages that are vital for troubleshooting but that can kill a system in seconds when activated. HP needs a true way of detecting that and preventing it from hurting the system.

Everybody, don't give me that it's the Application Team's job to clean the junk out of $0, because that just won't fly. Dedicated collectors work, but there are just some things that need to be seen along with the regular operator's view that helps the Application support team during incident management reviews.

All messages need to be tokenized for the throttling mechanisms to work. EMS Faststart is designed to fix this problem when used properly, but getting application support teams to spend the time on this "zero revenue" effort is hard, and it is often pushed to the bottom of the stack. That leads to volumes of messages that could potentially overrun your EMS logs in the event of a network or hardware failure. Sorry for posting my comment on the wrong group, Mark.

Keith, you are correct when used properly, but there are old message systems that would take a tremendous conversion effort, and development teams often relegate this work to the back burner. Having supported several environments with this condition over the past 20 years, I find it frustrating to always have to gather EMS logs and perform triage following a client-impacting incident, just to discover that this age-old problem is the root cause of the issue but gets quietly pushed under the rug as a CPU/memory bottleneck related problem. MessageSYS overruns do cause outages, contrary to popular opinion.

Burst filtering is on and does its job well, but the messages that are not in EMS format are the ones which ultimately overwhelm $0 and associated message processing. We have workarounds for this issue, so it is not a big one, just an itch that is hard to scratch.

I didn't write the code on EMS, so I can't answer that one for you. In a transaction processing world where sub-second messages are sent back and forth to verify whether a transaction can be authorized or reversed out of the system, how does EMS determine if the messages are duplicates or not?

There are literally hundreds of messages a second, with sub-second timestamps, when things go wrong. One for the guys in the Lab. As I stated, we found a way to limit the possibility of it occurring again via internal measures.

 

 

1.13.      Dean Malone - Independent NonStop Computer Professional, Charlottesville, Virginia Area

Nobody mentioned Prognosis. The learning curve is a bit steep, but the capabilities are truly enormous. I have gotten pretty good at it myself (nothing like getting thrown in the deep end :)). Many of the metrics come from their custom servers that do SPI calls, so the I/O overhead to a MEASURE file is cut way down. It is actually much more efficient than MEASURE in many cases. The weaknesses I have observed with Prognosis are in general the things missing from Guardian itself. Very limited information about message bus I/O, ServerNet, CLIM metrics, and the association between SQL table names and dynamic compiles for SQL query tuning and performance bottlenecks (JDBC/ODBC queries) are a few that come to mind. Prognosis is very good at slicing and dicing the data. For more complex analysis, it can quickly generate Excel data that can be analysed with pivot tables. If you have a well-thought-out Prognosis repository, you can do very effective post-mortems on crisis events. If you put the repository on a Windows clustered server with a RAID storage array (i.e. like DirecTV uses), you can have a fault-tolerant solution and offload all the disk I/O from the NonStop. It can do everything Measure can do, and much more.