Front cover POWER7 and POWER7+ Optimization and Tuning Guide Discover simple strategies to optimize your POWER7 environment Analyze and maximize performance with solid solutions Learn about the new POWER7+ processor Brian Hall Steve Munroe Mala Anand Francis P O’Connell Bill Buros...
International Technical Support Organization POWER7 and POWER7+ Optimization and Tuning Guide November 2012 SG24-8079-00...
IBM POWER7® and POWER7+™ processors. This advice is drawn from application optimization efforts across many different types of code that runs under the IBM AIX® and Linux operating systems, focusing on the more pervasive performance opportunities that are identified, and how to capitalize on them.
Miso Cilimdzic has been with IBM since 2000. Over the years, he has worked on a diverse set of projects, with a focus on IBM DB2® in the areas of performance, and the integration with, and exploitation of, hardware and design of workload optimized systems.
IBM POWER® performance and Java performance. Francis P O’Connell is a member of the IBM Systems and Technology Group in Austin, Texas. He is a Senior Technical Staff Member in Power Systems development, specializing in performance. He joined IBM in 1981 after receiving a Bachelor’s degree in Mechanical Engineering from the University of Connecticut and then earned a Master’s degree in...
US Army, he joined IBM in 1989 and worked on several operating system projects, including AIX, OSF, Project Monterey, and Linux. Most of his 23 years with IBM have been working with the AIX core kernel, specializing in processor and hardware bring-up, memory management, and virtualization.
Jerrold M (Jerry) Heyman, Software Engineer/Technical Consultant - IBM System p/AIX, Research Triangle Park, North Carolina Karen Lawrence, IBM Redbooks Technical Writer, Research Triangle Park, North Carolina Mirco Malessa, SAP on POWER Development - IBM i, IBM PureSystems™, Boeblingen Germany Anbazhagan Mani, Senior Software Engineer, Solutions Management Architecture,...
Comments welcome
Your comments are important to us! We want our books to be as helpful as possible. Send us your comments about this book or other IBM Redbooks publications in one of the following ways: Use the online Contact us review Redbooks form found at: ibm.com/redbooks...
Chapter 1. Optimization and tuning on IBM POWER7 and IBM POWER7+
This chapter describes the optimization and tuning of the IBM POWER7 system. It covers the following topics:
Introduction
Outline of this guide
Conventions that are used in this guide...
POWER7 also applies to POWER7+. This guide strives to focus on optimizations that tend to be positive across a broad set of IBM POWER processor chips and systems. While specific guidance is given for the POWER7 and POWER7+ processors, the general guidance is applicable to the IBM POWER6®, POWER5,...
In Chapter 5, “Linux” on page 97, we describe the two primary Linux operating systems that are used on POWER7: Red Hat Enterprise Linux (RHEL) for POWER and SUSE Linux Enterprise Server (SLES) for POWER. The minimum supported levels of Linux, including service pack (SP) levels, are SLES11/SP1 and RHEL6/GA, which provide full support and usage of POWER7 technologies and systems.
These performance tools are most often used as part of the advanced investigative techniques that are described in 1.5, “Optimizing performance on POWER7” on page 5, except for the new performance advisors, which are intended as investigative tools, appropriate for a broader audience of users.
This section provides guidance for optimizing code performance on POWER7 when you use the AIX or Linux operating systems. The POWER7+ processor is a superset of the POWER7 processor, so all optimizations described for POWER7 apply equally for POWER7+. We cover the more prominent performance opportunities that are noted in past optimization efforts.
This section covers building and performance testing applications on POWER7, and gives a brief introduction to the most important simple performance tuning opportunities that are identified for POWER7. More details about these and other opportunities are presented in the later chapters of this guide.
For example, when you target a multi-threaded application to scale up to four cores on POWER7, it is important that the test bed be at least a 4-core system and that tests are configured to run in various configurations (1-core, 2-core, and 4-core).
Java release must be installed on the performance test bed system. For IBM Java, tuning for POWER7 was introduced in Java 6 SR7, and that is the recommended minimum version. Newer versions of Java contain more improvements, though, as described in 7.1, “Java levels”...
Some of the possible options to consider are: – -qarch=ppc64 -qtune=pwr7 for an executable that is optimized to run on POWER7, but that can run on all 64-bit implementations of the Power Architecture (POWER7, POWER6, POWER5, and so on) –...
Java releases automatically fully utilize all of the new features of the target POWER7 or POWER7+ processor of the system an application is running on. For more information about Java performance, see Chapter 7, “Java” on page 125.
The Linux Advance Toolchain contains replacements for various standard system libraries. These replacement libraries are optimized for specific processor chips, including POWER5, POWER6, and POWER7. After you install the Linux Advance Toolchain, the dynamic linker automatically loads, for each program, the library that is optimized for the processor chip type in the system.
For certain types of non-numerical applications, turning off the default POWER7 hardware prefetching improves performance. In specific cases, disabling hardware prefetching is beneficial for Java programs, WebSphere Application Server, and DB2.
There are also mechanisms to control prefetching at the process level. POWER7 allows not only prefetching to be enabled or disabled, but it also allows the fine-tuning of the prefetch engine. Such fine-tuning is especially beneficial for scientific/engineering and memory-intensive applications.
Similarly, the multiple cores on a POWER7 processor share a chip-specific cache space. Again, arranging the software threads that are sharing the data to run on the same POWER7 processor (when the partition spans multiple sockets) often allows more efficient utilization of cache space and reduced data reference latencies.
Consider an example where you run four instances of WebSphere Application Server on a partition of 16 cores on a POWER7 system that is running in SMT4 mode. Each instance of WebSphere Application Server would be bound to run on four of the cores of the system.
Some important items to understand in this example are: For a particular number of instances and available cores, the most important consideration is that each instance of an application runs only on the cores of one POWER7 processor chip. Memory and logical processor binding is not done independently because doing so can negatively affect performance.
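The processor side of this example can be sketched as follows. This is a minimal illustration only: it assumes that logical processors are numbered contiguously per core (so that in SMT4 mode core 0 owns logical CPUs 0-3), which should be verified on the actual partition, and the function name logical_cpu_ranges is hypothetical. It does not address memory binding, which, as noted above, should not be handled independently of processor binding.

```python
def logical_cpu_ranges(total_cores, smt_threads, instances):
    """Split a partition's logical CPUs evenly among application instances.

    Assumption (hypothetical numbering): logical CPUs are contiguous per
    core, so core c owns IDs c*smt_threads .. c*smt_threads + smt_threads - 1.
    """
    if total_cores % instances != 0:
        raise ValueError("cores must divide evenly among instances")
    cores_per_instance = total_cores // instances
    cpus_per_instance = cores_per_instance * smt_threads
    ranges = []
    for i in range(instances):
        first = i * cpus_per_instance
        last = first + cpus_per_instance - 1
        ranges.append((first, last))
    return ranges


# Four instances on a 16-core partition in SMT4 mode:
# each instance gets 4 cores, that is, 16 logical CPUs.
for instance, (first, last) in enumerate(logical_cpu_ranges(16, 4, 4)):
    print(f"instance {instance}: logical CPUs {first}-{last}")
```

Each resulting range has the first-last form that binding tools typically accept; whether four consecutive cores actually sit on one POWER7 chip depends on the partition's placement and must be confirmed separately.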
For low CPU usage, look at the number of runnable threads reported by the operating system, and try to ensure that there are as many runnable threads as there are logical processors in the partition.
For Java programs, use Java Lock Monitor (see “Java Health Center” on page 177). For non-Java programs, use the splat tool on AIX (see “AIX trace-based analysis tools” on page 165).
For Java programs, the WAIT tool might be one of the first analysis tools to consider because of its versatility and ease of use. For more information about IBM Whole-system Analysis of Idle Time, which is the browser-based (that is, no-install) WAIT tool, go to: http://wait.researchlabs.ibm.com...
Chapter 2. The POWER7 processor
This chapter introduces the POWER7 processor and describes some of the technical details and features of this product. It covers the following topics:
Introduction to the POWER7 processor
Multi-core and multi-thread scalability
Using POWER7 features...
Figure 2-1 The POWER7 processor chip
Each core is a 64-bit implementation of the IBM Power ISA (Version 2.06 Revision B), and has the following features:
Multi-threaded design, capable of up to four-way SMT
32 KB, four-way set-associative L1 i-cache...
POWER7 processors. A single POWER7 chip can contain up to eight cores. With SMT, each POWER7 core can present four hardware threads. SMT is the ability of a single physical processor core to simultaneously dispatch instructions from more than one hardware thread context.
68. Additionally, the following specific scaling topics are described in 4.1, “AIX and system libraries” on page 68:
pthread tuning
malloc tuning
For more information about this topic, see 2.4, “Related publications” on page 51.
(which is always larger than the available real memory). The VMM must minimize the total processor time, disk bandwidth price, and response time to handle the virtual memory page faults. IBM Power Architecture provides support for multiple virtual memory page sizes, which provides performance benefits to an application because of hardware efficiencies that are associated with larger page sizes.
The pagesize -a command on AIX determines all of the page sizes that are supported by AIX on a particular system. IBM AIX 5L™ Version 5.3 with the 5300-04 Technology Level supports up to four different page sizes, but the actual page sizes that are supported by a particular system vary, based on processor chip type.
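From inside a program, the portable POSIX interface usually exposes only the base page size. The following is a minimal sketch, assuming a POSIX system on which Python's os.sysconf supports the SC_PAGE_SIZE name; it does not list the additional page sizes that pagesize -a reports.

```python
import os

# Base page size, in bytes, of the system this interpreter runs on.
page_size = os.sysconf("SC_PAGE_SIZE")
print(f"base page size: {page_size} bytes ({page_size // 1024} KB)")
```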
1. On the managed system, click Properties > Memory > Advanced Options > Show Details to change the number of 16 GB pages.
2. Assign 16 GB huge pages to a partition by changing the partition profile.
Rather than using the LDR_CNTRL environment variable, consider marking specific executable files to use large pages, because this limits the large page usage to the specific application that benefits from large page usage.
Power ISA Version 2.06 Revision B, available at: http://power.org/wp-content/uploads/2012/07/PowerISA_V2.06B_V2_PUBLIC.pdf
Hypervisor memory access permissions and controls. In POWER7 Systems, each chip consists of eight processor cores, each with on-core L1 instruction and d-caches, an L2 cache, and an L3 cache, as shown in Figure 2-2 on page 31.
For an introduction to the concepts of cache and memory affinity, see “The POWER7 processor and affinity performance effects” on page 14. The IBM POWER Hypervisor is responsible for:...
These design details change for every processor chip, even within the Power Architecture. Figure 2-2 shows the layout of a POWER7 chip, including the processor cores, caches, and local memory. Table 2-6 shows the cache sizes and related geometry information for POWER7.
Splitting Data Objects to Increase Cache Utilization (Preliminary Version, 9th October 1998), available at: http://www.ics.uci.edu/%7Efranz/Site/pubs-pdf/ICS-TR-98-34.pdf
Eliminate False Sharing, Stop your CPU power from invisibly going down the drain, available at: http://drdobbs.com/goparallel/article/showArticle.jhtml?articleID=217500206
These instructions can be used directly in hand-tuned assembly language code, or they can be accessed through compiler built-ins or directives. Prefetching is also automatically done by the POWER7 hardware and is configurable, as described in 2.3.7, “Data prefetching using d-cache instructions and the Data Streams Control Register (DSCR)”...
Hot locks result in intervention and can easily limit the ability to scale a workload because all updates to the lock are serialized. Tools such as splat (see “AIX trace-based analysis tools” on page 165) can be used to identify hot locks.
The POWER processor architecture uses SMT to provide multiple streams of hardware execution. POWER7 provides four SMT hardware threads per core and can be configured to run in SMT4, SMT2, or single-threaded mode (SMT1 mode or, as referred to in this publication, ST mode) while POWER6 and POWER5 provide two SMT threads per core and can be run in SMT2 mode or ST mode.
LPAR is running on, and not the processor compatible mode. Therefore, setting Very Low SMT priority only requires user level privilege on POWER7+ processors, even when running in P6-, P6+-, or P7-compatible modes. Power ISA Version 2.06 Revision B, available at: http://power.org/wp-content/uploads/2012/07/PowerISA_V2.06B_V2_PUBLIC.pdf...
2. Modify the SMT priority by using special no-ops.
3. Use the AIX thread_set_smt_priority system call.
On POWER7 and earlier, code that is running in problem-state can only set the SMT priority level to Low, Medium-Low, or Medium. On POWER7+, code that is running in problem-state can additionally set the SMT priority to Very-Low.
PowerPC storage model and AIX programming: What AIX programmers need to know about how their software accesses shared storage, Michael Lyons, et al, available at: http://www.ibm.com/developerworks/systems/articles/powerpc.html
lwarx (Load Word and Reserve Indexed) Instruction, available at: http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.aixassem/doc/alangref/idalangref_lwarx_lwri_instrs.htm
stwcx (Store Word Conditional Indexed) Instruction, available at: http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.aixassem/doc/alangref/idalangref_stwcx_instrs.htm
eieio instruction, available at: http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.aixassem/doc/alangref/idalangref_eieio_instrs.htm
Power ISA Version 2.06 Revision B, available at: http://power.org/wp-content/uploads/2012/07/PowerISA_V2.06B_V2_PUBLIC.pdf
Engineering and Scientific Subroutine Library (ESSL) and Parallel ESSL, available at: http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.essl.doc/esslbooks.html
What’s New in the Server Environment of Power ISA v2.06?, a white paper from Power.org, available at (registration required): https://www.power.org/documentation/whats-new-in-the-server-environment-of-power-isa-v2-06/
Also, the XL compilers are able to automatically generate VSX instructions from scalar code when they generate code that targets the POWER7 processor. This task is accomplished by using the -qsimd=auto option with the -O3 optimization level or higher.
VSX instruction set extensions, such as POWER7. You must specify the XL -qarch=pwr7 -qaltivec compiler options when you use this type, or the GCC -mcpu=power7 or -mvsx options. The hardware does not have instructions for supporting vector unsigned long long, vector bool long long, or vector signed long long.
You can use the NOSIMD directive to prevent the transformation of a particular loop.
Using a compiler: Compiler versions that recognize the POWER7 architecture are XL C/C++ 11.1 and XL Fortran 13.1, or recent versions of GCC, including the Advance Toolchain, and the SLES 11 SP1 or Red Hat RHEL6 GCC compilers: –...
IBM POWER6 and POWER7 processor-based systems provide hardware support for DFP arithmetic. The POWER6 and POWER7 microprocessor cores include a DFP unit that provides acceleration for the DFP arithmetic. The IBM Power instruction set is expanded: 54 new instructions were added to support the DFP unit architecture.
– The IBM XL C/C++ Compiler, release 9 or later for AIX and Linux, includes native DFP language support. Here is a list of compiler options for IBM XL compilers that are related to DFP: • -qdfp: Enables DFP support. This option makes the compiler recognize DFP literal suffixes, and the _Decimal32, _Decimal64, and _Decimal128 keywords.
Cache instructions, such as dcbt and dcbtst, allow applications to specify stream direction, prefetch depth, and number of units. These instructions can avoid the starting cost of the automatic stream detection mechanism.
Such load streams are detected when LSD = 0 and such store streams are detected when SSE = 1.
Bit 60 - SSE - Store Stream Enable
Enables hardware detection and initiation of Store streams.
Power ISA Version 2.06 Revision B, available at: http://power.org/wp-content/uploads/2012/07/PowerISA_V2.06B_V2_PUBLIC.pdf
Bits 55:57 - URG - Depth Attainment Urgency
This field is new in the POWER7+ processor. It indicates how quickly the prefetch depth should be reached for hardware-detected streams. Values and their meanings are as follows: –...
EPERM Operation not permitted (DSCR_SET_DEFAULT by non-root user).
ENOTSUP Data streams are not supported by the platform hardware.
Symbolic values for the following SSE and DPFD fields are defined in <sys/machine.h>:
DPFD_DEFAULT
DPFD_NONE
DPFD_SHALLOWEST
DPFD_SHALLOW
To query the characteristics of the hardware streams on the system, run the following command:
dscrctl -q
Here is an example of this command:
# dscrctl -q
Current DSCR settings:
number_of_streams = 16
platform_default_pd = 0x5 (DPFD_DEEP)
os_default_pd = 0xd (DSCR_SSE | DPFD_DEEP)
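The values in this sample output can be decoded by hand from the field positions given earlier (SSE at bit 60, URG at bits 55:57). The following is a minimal sketch of that decoding, not an AIX interface: the helper names field and decode_dscr are hypothetical, and the placement of DPFD at bits 61:63 is an assumption inferred from the sample output, where 0x5 is reported as DPFD_DEEP and 0xd as DSCR_SSE | DPFD_DEEP.

```python
# Bit positions use the Power ISA convention: bit 0 is the most
# significant bit of the 64-bit register, bit 63 the least significant.
def field(value, first_bit, last_bit):
    """Extract bits first_bit..last_bit (IBM numbering) from a 64-bit value."""
    width = last_bit - first_bit + 1
    shift = 63 - last_bit
    return (value >> shift) & ((1 << width) - 1)


def decode_dscr(value):
    """Split a DSCR value into the fields discussed in this section."""
    return {
        "URG": field(value, 55, 57),   # depth attainment urgency (POWER7+)
        "SSE": field(value, 60, 60),   # store stream enable
        "DPFD": field(value, 61, 63),  # default prefetch depth (assumed position)
    }


print(decode_dscr(0xD))  # os_default_pd from the sample output
print(decode_dscr(0x5))  # platform_default_pd from the sample output
```

Under these assumptions, both values carry the same prefetch depth (5), and the operating system default additionally sets the store stream enable bit.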
AIX Version 7.1 Release Notes, found at: http://publib.boulder.ibm.com/infocenter/aix/v7r1/index.jsp?topic=/com.ibm.aix.ntl/RELNOTES/GI11-9815-00.htm
Refer to the section, The dscrctl command.
Application configuration for large pages, found at: http://publib.boulder.ibm.com/infocenter/aix/v7r1/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/config_apps_large_pages.htm
False Sharing, found at: http://msdn.microsoft.com/en-us/magazine/cc872851.aspx
lwsync instruction, found at: http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.aixassem/doc/alangref/idalangref_sync_dcs_instrs.htm
The Performance of Runtime Data Cache Prefetching in a Dynamic Optimization System, found at: http://www.microarch.org/micro36/html/pdf/lu-PerformanceRuntimeData.pdf
POWER6 Decimal Floating Point (DFP), found at: http://www.ibm.com/developerworks/wikis/display/WikiPtype/Decimal+Floating+Poin
POWER7 Processors: The Beat Goes On, found at: http://www.ibm.com/developerworks/wikis/download/attachments/104533501/POWER7+-+The+Beat+Goes+On.pdf
Power Architecture ISA 2.06 Stride N prefetch Engines to boost Application's performance, found at: https://www.power.org/documentation/whitepaper-on-stride-n-prefetch-feature-of-...
Command, found at: http://publib.boulder.ibm.com/infocenter/aix/v7r1/index.jsp?topic=/com.ibm.aix.cmds/doc/aixcmds5/splat.htm
trace Daemon, found at: http://publib.boulder.ibm.com/infocenter/aix/v7r1/index.jsp?topic=/com.ibm.aix.cmds/doc/aixcmds5/trace.htm
What makes Apple's PowerPC memcpy so fast?, found at: http://stackoverflow.com/questions/1990343/what-makes-apples-powerpc-memcpy-so-fast
What programmers need to know about hardware prefetching?, found at: http://www.futurechips.org/chip-design-for-all/prefetching.html
(CPU, memory, and I/O), virtualization, capacity planning, and virtualization management. Some of these documents are shown in the reference section at the end of this section, which focuses on POWER7 Virtualization usually. As with any workload deployment, capacity planning, selecting the correct set of technologies, and appropriate tuning are critical to deploying high performing workloads.
(SPLPAR) environment. By using the preferred practices that are described in this guide, customers can attain optimum application performance in a shared resource environment. This guide covers preferred practices in the context of POWER7 Systems, so this section can be used as an addendum to other PowerVM preferred practice documents.
Entitlement is the capacity that an SPLPAR is ensured to get as its share from the shared pool. Uncapped mode allows a partition to receive excess cycles when there are free (unused) cycles in the system.
Entitlement also determines the number of SPLPARs that can be configured for a shared processor pool. The sum of the entitlement of all the SPLPARs cannot exceed the number of physical cores that are configured in a shared pool. For example, a shared pool has eight cores and 16 SPLPARs are created, each with 0.1 core entitlement and one virtual CPU.
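The entitlement rule in this example can be checked with simple arithmetic. The following is a minimal sketch of that check; the function name remaining_entitlement is hypothetical, and Decimal is used only to avoid binary floating-point rounding of values such as 0.1.

```python
from decimal import Decimal


def remaining_entitlement(pool_cores, entitlements):
    """Return unallocated entitlement in a shared pool, or raise if oversubscribed."""
    total = sum(entitlements, Decimal(0))
    if total > pool_cores:
        raise ValueError(f"entitlement {total} exceeds {pool_cores} cores")
    return Decimal(pool_cores) - total


# The example above: 8 cores, 16 SPLPARs with 0.1 core entitlement each.
lpars = [Decimal("0.1")] * 16
print(remaining_entitlement(8, lpars))
```

Here the 16 partitions consume 1.6 cores of entitlement in total, so the configuration is valid and most of the pool remains unallocated.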
POWER7 Systems. If free cores are available in the shared processor pool, then unfolding another virtual processor results in the LPAR getting another core along with its associated caches.
– There is less page walk time as page tables are small.
3.2.3 Placing LPAR resources to attain higher memory affinity
POWER7 PowerVM optimizes the allocation of resources for both dedicated and shared partitions as each LPAR is activated. Correct planning of the LPAR configuration enhances the possibility of getting both CPU and memory in the same domain in relation to the topology of a system.
At partition boot time, PowerVM is aware of all of the LPAR configurations, so placement of processors and memory is made regardless of the order of activation of the LPARs.
However, after the initial configuration, the setup might not stay static. Numerous operations take place, such as:
Reconfiguration of existing LPARs with new profiles
Reactivating existing LPARs and replacing them with new LPARs
Adding and removing resources to LPARs dynamically (DLPAR operations)
Any of these changes might result in memory fragmentation, causing LPARs to be spread across multiple domains.
For more information about this topic, see 3.3, “Related publications” on page 65.
3.2.4 Active memory expansion
Active memory expansion (AME) is a capability that is supported on POWER7 and later servers that employs memory compression technology to expand the effective memory capacity of an LPAR.
Word document P7 Virtualization Best Practice. https://www.ibm.com/developerworks/wikis/display/WikiPtype/Performance+Monitoring+Documentation
Comment: This document is intended to address POWER7 processor technology based PowerVM best practices to attain the best LPAR performance. This document should be used in conjunction with other PowerVM documents.
3.3 Related publications...
Virtual I/O (VIO) and Virtualization, found at: http://www.ibm.com/developerworks/wikis/display/virtualization/VIO
Virtualization Best Practice, found at: http://www.ibm.com/developerworks/wikis/display/virtualization/Virtualization+Best+Practice
Chapter 4. AIX
This chapter describes the optimization and tuning of a POWER7 processor-based server running the AIX operating system. It covers the following topics:
AIX and system libraries
AIX Active System Optimizer and Dynamic System Optimizer
AIX preferred practices...
417 - 448 25 - 28 97 - 112 224 - 240 449 - 480 29 - 32 113 - 128 241 - 256 481 - 512
This allocator is ideal for 64-bit memory-intensive applications.
This chapter covers a few of the suboptions that are more relevant to performance tuning. For a complete list of options, see System Memory Allocation Using the malloc Subsystem, available at: http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.genprogc/doc/genprogc/sys_mem_alloc.htm
Multiheap
By default, the malloc subsystem uses a single heap, which causes lock contention for internal locks that are used by malloc in multi-threaded applications.
The number of times to try a busy lock before yielding to another pthread is n. The default is 40 and n must be a positive value.
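The spin-then-yield behavior that this tunable controls can be illustrated in a few lines. This is a conceptual sketch only, not the AIX pthread library implementation: the function name acquire_spin_then_yield and its parameters are hypothetical, and time.sleep(0) stands in for yielding the processor.

```python
import threading
import time


def acquire_spin_then_yield(lock, spin_limit=40, yield_fn=lambda: time.sleep(0)):
    """Acquire lock; after spin_limit failed tries, yield and start over.

    Returns the number of times the caller yielded before succeeding.
    """
    if spin_limit <= 0:
        raise ValueError("spin_limit must be positive")
    yields = 0
    while True:
        for _ in range(spin_limit):
            # Non-blocking attempt: returns True only if the lock was free.
            if lock.acquire(blocking=False):
                return yields
        yield_fn()
        yields += 1


lock = threading.Lock()
print(acquire_spin_then_yield(lock))  # uncontended lock: no yield is needed
```

A larger spin limit favors locks that are held briefly, because the waiter avoids the cost of yielding; a smaller one gives the processor back sooner when locks are held for a long time.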
Efficient I/O event polling through the pollset interface on AIX contains a pollset summary and outlines the most advantageous use of Java. To see this topic, go to: http://www.ibm.com/developerworks/aix/library/au-pollset/index.html For more information about this topic, see 4.4, “Related publications” on page 94.
You can enable IOCP on AIX by running smitty iocp. Verify that IOCP is enabled by running the following command:
lsdev -Cc iocp
The resulting output should match the following example:
iocp0 Available I/O Completion Ports
Page-level protection must be set on the mapping (allows a 4K boundary). For more information, see General Programming Concepts: Writing and Debugging Programs, available at: http://publib16.boulder.ibm.com/doc_link/en_US/a_doc_lib/aixprggd/genprogc/understanding_mem_mapping.htm
For more information about this topic, see 4.4, “Related publications” on page 94.
64-bit and that have large shared memory regions can benefit from incorporating 1 TB segments. An overview of 1 TB segment usage can be found in the IBM AIX Version 7.1 Differences Guide, SG24-7910. For more information about this topic, see 4.4, “Related publications” on page 94.
The most significant issue is typically the porting effort (for existing applications), as changing between ILP32 and LP64 normally requires a port. Large memory addressability and scalability are normally the deciding factors when you choose an application execution model. For more information about this topic, see 4.4, “Related publications” on page 94.
Affinity APIs
Most applications must be bound to logical processors to get a performance benefit from memory affinity to prevent the AIX dispatcher from moving the application to processor cores...
-P -c 4-8 28026 partition rset.
detachrset: Detaches an RSET from a specified PID. For example:
detachrset 28026
Detaches an effective RSET from a PID.
detachrset -P 20828
Detaches a partition RSET from a PID.
execrset: Runs a specific program or command with a specified RSET. For example:
execrset sys/node.04.00000 -e test
Runs a program test with an effective RSET from the system registry.
execrset -c 0-1 -e test2
Runs program test2 with an effective RSET that contains logical CPU IDs 0 and 1.
A reboot is required to change the Enhanced Affinity status. In AIX V6.1.0 technology level 6100-05, Enhanced Affinity is enabled by default on POWER7 machines. Enhanced Affinity is available only on POWER7 machines. Enhanced Affinity is disabled by default on POWER6 and earlier machines. A vmo command tunable (enhanced_memory_affinity) is available to disable Enhanced Affinity support on POWER7 machines.
Some workloads do not run well with the SMT feature. This situation is not typical for commercial workloads, but has been observed with scientific (floating point intensive) workloads.
Simultaneous Multithreading, available at: http://publib.boulder.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.genprogc/doc/genprogc/smt.htm
AIX provides options to allow SMT customization. The smtctl command allows the SMT feature to be enabled, disabled, or capped (SMT2 versus SMT4 mode on POWER7). The partition-wide tuning option, smtctl, changes the SMT mode of all processor cores in the partition.
Sleep and wake-up primitives (thread_wait and thread_post) AIX provides proprietary thread_wait() and thread_post() APIs that can be used to optimize thread synchronization and communication (IPC) operations. AIX also provides several standard APIs that can be used for thread synchronization and communication. These APIs include pthread_cond_wait(), pthread_cond_signal(), and semop().
Some software products manage to never change their non-shareable files in their update process, so they do not need any special handling for updates.
(EFS) where all data at rest in the file system is encrypted. When AIX EFS runs on POWER7+, it uses the encryption accelerators, which can show up to a 40% advantage in file system I/O-intensive operations. Applications do not need to be aware of this situation, but application and workload deployments might be able to take advantage of higher levels of security by using AIX EFS for sensitive data.
ASO is available on the POWER7 platform in AIX V7.1 TL1 SP1 (4Q 2011) and AIX V6.1 TL8 SP1 (4Q 2012). DSO extensions are available in 4Q 2012, and require AIX V7.1 TL2 SP1 or AIX V6.1 TL8 SP1 (on the POWER7 platform).
Figure 4-1 Basic ASO architecture that shows an optimization flow on a POWER7 system
Optimization strategies
Two optimization strategies are provided with ASO:
Cache affinity optimization
Memory affinity optimization
DSO adds two more optimizations to the ASO framework:...
Also, in the current version, only workloads that fit within a single Scheduler Resource Affinity Domain (SRAD, a chip/socket in POWER7) are considered.
Large page optimization
AIX allows translations of multiple memory page sizes within the same segment. Although 4 KB and 64 KB translations are allowed in the current version of AIX (Version 6.1 and greater), Version 6.1 TL8 and Version 7.1 TL2 (4Q 2012) include dynamic 16 MB translation. For workloads that use large chunks of data, using pages larger than the default size is useful because the number of TLB/ERAT misses is reduced (for general information about page sizes and TLB/ERAT, see 2.3.1, “Page sizes (4 KB, 64 KB, 16 MB, and 16 GB)”...
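The effect of page size on the number of translations can be seen with a short calculation. This is a minimal sketch: the 16 GB footprint is an illustrative figure, the function name pages_needed is hypothetical, and the result counts pages only; it does not model actual TLB or ERAT capacity or miss rates.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

PAGE_SIZES = {"4 KB": 4 * KB, "64 KB": 64 * KB, "16 MB": 16 * MB}


def pages_needed(footprint_bytes, page_size):
    """Number of pages (and so distinct translations) needed to map the footprint."""
    return -(-footprint_bytes // page_size)  # ceiling division


for name, size in PAGE_SIZES.items():
    print(f"{name:>5}: {pages_needed(16 * GB, size):>9} pages for 16 GB")
```

Moving from 4 KB to 16 MB pages reduces the number of translations for the same data by a factor of 4096, which is why fewer of them miss in the translation caches.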
The workload memory footprint should have at least 16 GB of System V shared memory.
– CPU usage
CPU usage of the workload should be above eight cores. A workload may be either a multi-threaded process or a collection of single-threaded processes.
– Workload age
Workloads must be at least 10 minutes old to be considered.
Optimization time
When you test the effect of DSO on applications, it is important to run the tests for enough time. The duration depends on the type of optimization that is being measured. For example, in the case of large page optimization, there is a small increase in system usage (less than 2%) when the pages are being promoted.
Memory prefetch requires eight cores. For large page and memory prefetch optimization, the system should have a minimum of 20 GB system memory.
4.2.6 Installing DSO
The AIX DSO is available as a separately chargeable premium package that includes the two new types of optimizations: large page optimization and memory prefetch optimization. The package name is dso.aso and is installable using installp or smitty, as with any AIX package.
For logical partitions (LPARs) with Java applications, run and evaluate the output from the Java Performance Advisor, which can be run on POWER5 and POWER6, to determine if there is an existing issue before you migrate to POWER7. Instructions are available for Java Performance Advisor at: https://www.ibm.com/developerworks/wikis/display/WikiPtype/Java+Performance+Adv...
PERvasive (HIPER) fixes that continue to provide you with the system availability you expect from IBM Power Systems. Before you migrate to POWER7, you see more benefits if your AIX level contains the performance bundle set of APARs. Visit IBM Fix Central (http://www.ibm.com/support/fixcentral/) to download the latest service pack (SP) for...
Oracle Database and 1 TB Segment Aliasing, found at: http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD105761
pollset_create, pollset_ctl, pollset_destroy, pollset_poll, and pollset_query Subroutines, found at: http://publib.boulder.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.basetechref/doc/basetrf1/pollset.htm
For more information, go to ftp://ftp.software.ibm.com/systems/support/tools/mynotifications/overview.pdf.
POWER7 Virtualization Best Practice Guide, found at: https://www.ibm.com/developerworks/wikis/download/attachments/53871915/P7_virtualization_bestpractice.doc?version=1
ra_attach Subroutine, found at: http://publib.boulder.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.basetechref/doc/basetrf2/ra_attach_new.htm
Shared library memory footprints on AIX 5L, found at: http://www.ibm.com/developerworks/aix/library/au-slib_memory/index.html
thread_post Subroutine, found at: http://publib.boulder.ibm.com/infocenter/aix/v7r1/index.jsp?topic=/com.ibm.aix.basetechref/doc/basetrf2/thread_post.htm
thread_post_many Subroutine, found at: http://publib.boulder.ibm.com/infocenter/aix/v7r1/index.jsp?topic=/com.ibm.aix.basetechref/doc/basetrf2/thread_post_many.htm
thread_wait Subroutine, found at: http://publib.boulder.ibm.com/infocenter/aix/v7r1/index.jsp?topic=/com.ibm.aix.
This section contains information about Linux and system libraries. 5.1.1 Introduction When you work with IBM POWER7 processor-based servers, systems, and solutions, a solid choice for running enterprise-level workloads is Linux. Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise Server (SLES) provide operating systems that are optimized and targeted for the Power Architecture.
-mcpu= and -mtune= compiler flags might be the best option. For example, -mcpu=power7 allows the compiler to use all the new instructions, such as the Vector Scalar Extended category. The -mcpu=power7 option also implies -mtune=power7 if it is not explicitly set.
– POWER6 has DFP and a Vector Unit implementing the older VMX (vector float but no vector double) instructions.
– POWER7 has DFP and the new Vector Scalar Extended (VSX) Unit (the original VMX instructions plus Vector Double and more).
(POWER5, POWER6, or POWER7). If the dynamic linker finds the shared library in the subdirectory with the matching platform name, it loads that version; otherwise, the dynamic linker looks in the base lib64 directory and uses the default implementation.
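The search order described above can be sketched in a few lines of Python. This is only an illustration of the lookup rule, not the dynamic linker's actual implementation; the function name, the `power7` directory name, and the library file names are hypothetical.

```python
import os
import tempfile

def resolve_library(lib_dir, platform, lib_name):
    """Return the path a platform-aware lookup would pick.

    Illustrative only: mimics the search order described in the text,
    not the real dynamic linker.
    """
    candidate = os.path.join(lib_dir, platform, lib_name)
    if os.path.isfile(candidate):
        return candidate                      # platform-tuned build found
    return os.path.join(lib_dir, lib_name)    # default implementation

# Build a throwaway lib64 tree (hypothetical library names) to exercise the lookup.
root = tempfile.mkdtemp()
lib64 = os.path.join(root, "lib64")
os.makedirs(os.path.join(lib64, "power7"))
for rel in ("libdemo.so", os.path.join("power7", "libdemo.so"), "libother.so"):
    open(os.path.join(lib64, rel), "w").close()

tuned = resolve_library(lib64, "power7", "libdemo.so")      # has a tuned variant
fallback = resolve_library(lib64, "power7", "libother.so")  # only the default exists
print(tuned)
print(fallback)
```

In this sketch, a library with a platform-specific build resolves to the subdirectory copy, and any other library falls back to the base directory.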
This situation is partially mitigated by the larger (64 KB) default page size of the Red Hat Enterprise Linux and SUSE Linux Enterprise Server on Power Systems; there are fewer page faults than with 4 KB pages.
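To give a rough sense of the difference, the following sketch counts how many pages are needed to map the same memory footprint with 4 KB and 64 KB pages. The 1 GB footprint is a hypothetical value, and the page count is only an upper bound on first-touch page faults, not a measured result.

```python
KB = 1024

def pages_needed(footprint_bytes, page_size_bytes):
    """Number of pages required to map a footprint (ceiling division)."""
    return -(-footprint_bytes // page_size_bytes)

footprint = 1024 * 1024 * KB          # hypothetical 1 GB working set
pages_4k = pages_needed(footprint, 4 * KB)
pages_64k = pages_needed(footprint, 64 * KB)
ratio = pages_4k // pages_64k         # how many 4 KB pages one 64 KB page replaces
print(pages_4k, pages_64k, ratio)
```

The ratio depends only on the two page sizes, so the same factor applies to any footprint that is a multiple of 64 KB.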
Massif: a heap profiler, available at: http://valgrind.org/docs/manual/ms-manual.html For more details about memory management tools, see “Empirical performance analysis using the IBM SDK for PowerLinux” on page 172. For more information about tuning malloc parameters, see Malloc Tunable Parameters, available at: http://www.gnu.org/software/libtool/manual/libc/Malloc-Tunable-Parameters.html...
3. Set up the libhugetlbfs mount point by running the following commands:
– # mkdir -p /libhugetlbfs
– # mount -t hugetlbfs hugetlbfs /libhugetlbfs
4. Monitor large page usage by running the following command:
# cat /proc/meminfo | grep Huge
This command produces the following output:
HugePages_Total:
HugePages_Free:
HugePages_Rsvd:
HugePages_Surp:
Hugepagesize:
Where:
– HugePages_Total is the total pages that are allocated on the system for LP usage.
–...
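The same fields can be collected programmatically. The following is a minimal sketch that parses /proc/meminfo-style text; the helper name and all sample values are invented for illustration, and on a live system the text would come from reading /proc/meminfo.

```python
def parse_hugepage_info(meminfo_text):
    """Extract the Huge* fields from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        if "Huge" not in line:
            continue
        name, _, value = line.partition(":")
        fields[name.strip()] = int(value.split()[0])  # drop the unit (for example "kB"), if any
    return fields

# Sample values are invented for illustration only.
sample = """MemTotal:       8388608 kB
HugePages_Total:     100
HugePages_Free:       60
HugePages_Rsvd:       10
HugePages_Surp:        0
Hugepagesize:      16384 kB
"""
info = parse_hugepage_info(sample)
in_use = info["HugePages_Total"] - info["HugePages_Free"]
print(info, in_use)
```

Subtracting HugePages_Free from HugePages_Total gives the number of large pages currently in use, which is a convenient value to track over the life of a test run.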
Red Hat Enterprise Linux 6 Performance Tuning Guide, Optimizing subsystem throughput in Red Hat Enterprise Linux 6, Edition 3.0, found at: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html-single/Performance_Tuning_Guide/index.html
SUSE Linux Enterprise Server System Analysis and Tuning Guide (Version 11 SP2), found at: http://www.suse.com/documentation/sles11/pdfdoc/book_sle_tuning/book_sle_tuning.pdf
The IBM XL compilers are updated periodically to improve application performance and add processor-specific tuning and capabilities. The XLC11/XLF13 compilers for AIX and Linux are the first versions to include the capabilities of POWER7, and are the preferred version for projects that target current generation systems. The newer XLC12/XLF14 compilers provide performance improvements, and are preferred for template-heavy C++ codes.
The POWER7 processor supports the VSX instruction set, which improves performance for numerical applications over regular data sets. These performance features can increase the performance of some computations, and can be accessed manually by using the Altivec vector extensions, or automatically by the XL compiler by using the -qarch=pwr7 -qhot -O3 -qsimd options.
XL Compilers provide a full implementation of the OpenMP 3.0 specification in C, C++, and Fortran. You can program with OpenMP to capitalize on the incremental introduction of parallelism in an existing application by adding pragmas or directives to specify how the application can be parallelized.
For applications with available parallelism, OpenMP can provide a simple solution for parallel programming, without requiring low-level thread manipulation. The OpenMP implementation on the XL compilers is available by using the -qsmp=omp option. Whole-program analysis Traditional compiler optimizations operate independently on each application source file. Inter-procedural optimizations operate at the whole-program scope, using the interaction between parts of the application on different source files.
Specifying the -mveclibabi=mass option and linking to the MASS libraries enables more loops for -ftree-vectorize. The MASS libraries support only static archives for linking, and so they require explicit naming and library search order for each platform/mode:
– POWER7 32-bit: -L<MASS-dir>/lib -lmassvp7 -lmass_simdp7 -lmass -lm
– POWER7 64-bit: -L<...
OpenMP
The OpenMP API is an industry specification for shared-memory parallel programming. The current GCC compilers, starting with GCC 4.4 (Advance Toolchain 4.0+), provide a full implementation of the OpenMP 3.0 specification in C, C++, and Fortran. Programming with OpenMP allows you to benefit from the incremental introduction of parallelism in an existing application by adding pragmas or directives to specify how the application can be parallelized.
FDPR runs on both Linux and AIX and produces optimized code for all versions of the Power Architecture. POWER7 is its default target architecture.
64-bit applications. For more information, see AIX 5L Performance Tools Handbook, SG24-6039: Software Development Toolkit for PowerLinux: Available for use through the IBM SDK for PowerLinux. Linux distributions of Red Hat EL5 and above, and SUSE SLES10 and above are supported.
--dump-ascii-profile (-dap): This option dumps the profile file in a human readable ASCII format (extension .aprof). The .aprof file is useful for manual inspection or user-defined post-processing of the collected profile.
--verbose n (-v n), --print-inlined-funcs (-pif), and --journal file (-j file): These options generate different analyses of the optimized file. -v generates general and optimization-specific statistics (.stat extension). The amount of verbosity is set by n. Basic statistics are provided by -v 1. Optimization-specific statistics are added in level 2 and instruction mix in level 3.
Because environment variables are global in nature, when profiling several binary files at the same time, use explicit instrumentation options (-f, -fd, and -fdir) to differentiate between the profiles rather than using the environment variables (FDPR_PROF_FD and FDPR_PROF_NAME).
Instrumentation stack
The instrumentation uses the stack for saving registers by dynamically allocating space on the stack at a default location below the current stack pointer. On AIX, this default is at offset -10240, and on Linux it is -1800. In some cases, especially in multi-threaded applications where the stack space is divided between the threads, following a deep calling sequence, the application can be quite close to the end of the stack, which can cause the application to fail.
The factor parameter determines the aggressiveness of the optimization. With -O3, the optimization is invoked with -lu 9. By default, loops are unrolled two times. Use -lu factor to change that default.
The -m flag allows the user to specify the target machine model when known in cases where the program is not intended for use on multiple target platforms. The default target is POWER7. --align-code code (-A code): Optimizing the alignment and the placement of the code is crucial to the performance of the program.
-O: Performs code reordering (-RC) with branch prediction bit setting (-bp), branch folding (-bf), and NOOP instructions removal (-nop).
-O2: Adds to -O function de-virtualization (-pto), TOC-load optimization (-tlo), function inlining (-isf 8), and some function optimizations (-hr, -see 0, and -kr).
The publications that are listed in this section are considered suitable for a more detailed discussion of the topics that are covered in this chapter: C/C++ Cafe (IBM Rational), found at: http://www.ibm.com/rational/cafe/community/ccpp FDPR, Post-Link Optimization for Linux on Power, found at: https://www.ibm.com/developerworks/mydeveloperworks/groups/service/html/communi...
7.1 Java levels You should use Java 6 SR7 or later for POWER7 Systems for two primary reasons. First, 64 KB pages are used for JVM text, data, and stack memory segments and the Java heap by default on systems where 64 KB pages are available. Second, the JIT compiler in Java 6 SR7 and later takes advantage of POWER7 specific hardware features for performance.
For more information about this topic, see 7.6, “Related publications” on page 136. 7.3 Memory and page size considerations IBM Java can take advantage of medium (64 KB) and large (16 MB) page sizes that are supported by the current AIX versions and POWER processors. Using medium or large pages instead of the default 4 KB page size can improve application performance.
16 MB pages with less impact and can be suitable for workloads that benefit from large pages but do not take full advantage of 16 MB pages. Starting with IBM Java 6 SR7, the default page size is 64 KB. 7.3.2 Configuring large pages for Java heap and code cache Large pages must be configured on AIX by the system administrator by running vmo.
To alleviate this impact, use the -Xcompressedrefs option. When this option is enabled, the JVM uses 32-bit references to objects instead of 64-bit references wherever possible. Object references are compressed and extracted as necessary at minimal cost. The need for compression and decompression is determined by the overall heap size and the platform the JVM is running on;...
Ahead-of-time (AOT) compiled code 7.4 Java garbage collection tuning The IBM Java VM supports multiple garbage collection (GC) strategies to allow software developers an opportunity to prioritize various factors. Throughput, latency, and scaling are the main factors that are addressed by the different collection strategies. Understanding how...
7.4.2 GC strategy: Optavgpause This strategy prioritizes latency and response time by performing the initial mark phase of GC concurrently with the execution of the application. The application is halted only for the sweep and compact phases, minimizing the total time that the application is paused. Performing the mark phase concurrently with the execution of the application might affect throughput, because the CPU time that would otherwise go to the application can be diverted to low priority GC threads to carry out the mark phase.
An application's memory behavior can be determined by using various tools, including verbose GC logs. For more information about verbose GC logs and other tools, see “Java (either AIX or Linux)” on page 176.
SMT2 SMT4 The default SMT mode on POWER7 depends on the AIX version and the compatibility mode the processor cores are running with. Table 7-3 shows the default SMT modes. Table 7-3 SMT mode on POWER7 is dependent upon AIX and compatibility mode...
In general, RSETs are created on core boundaries. For example, a partition with four POWER7 cores that are running in SMT4 mode has 16 logical CPUs. Create an RSET with four logical CPUs by selecting four SMT threads that belong to one core. Create an RSET with eight logical CPUs by selecting eight SMT threads that belong to two cores.
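The arithmetic in this example can be sketched as follows. The sketch assumes that logical CPUs are numbered contiguously per core (core 0 owns logical CPUs 0 to 3, core 1 owns 4 to 7, and so on); that numbering and the helper name are assumptions for illustration, so confirm the real topology on the partition before building an RSET from it.

```python
SMT_THREADS_PER_CORE = 4  # SMT4 mode

def logical_cpus_for_cores(core_ids, threads_per_core=SMT_THREADS_PER_CORE):
    """List the logical CPU IDs that belong to the given cores.

    Assumes contiguous numbering per core; this is an illustrative
    assumption, not something the hardware guarantees.
    """
    cpus = []
    for core in core_ids:
        first = core * threads_per_core
        cpus.extend(range(first, first + threads_per_core))
    return cpus

total_logical_cpus = 4 * SMT_THREADS_PER_CORE     # four cores in SMT4 mode
rset_one_core = logical_cpus_for_cores([0])       # four logical CPUs, one core
rset_two_cores = logical_cpus_for_cores([0, 1])   # eight logical CPUs, two cores
print(total_logical_cpus, rset_one_core, rset_two_cores)
```

Selecting whole cores in this way keeps every RSET on a core boundary, so no core has its SMT threads split between two RSETs.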
To achieve the best performance with RSETs that are created across multiple cores, all cores of the RSET must be from the same chip and in the same scheduler resource allocation domain (SRAD). The lssrad command can be used to determine which logical CPUs belong to which SRAD, as shown in Example 7-2: Example 7-2 Use the lssrad command to determine which logical CPUs belong to which SRAD # lssrad -av...
7.6 Related publications
The publications that are listed in this section are considered suitable for a more detailed discussion of the topics that are covered in this chapter:
Java performance for AIX on POWER7 – best practices, found at: https://www-304.ibm.com/partnerworld/wps/servlet/ContentHandler/stg_ast_sys_java_performance_on_power7
Java Performance on POWER7, found at: https://www.ibm.com/developerworks/wikis/display/LinuxP/Java+Performance+on+POW...
Chapter 8. This chapter describes the optimization and tuning of the POWER7 processor-based server running IBM DB2. It covers the following topics: DB2 and the POWER7 processor Taking advantage of the POWER7 processor Capitalizing on the compilers and optimization tools for POWER7...
POWER7 guidelines and technologies. The focus of this chapter is to showcase how IBM DB2 10.1 uses various POWER7 features and preferred practices from this guide during its own software development cycle, which is done to maximize performance on the Power Architecture.
Consider using this variable only for well-defined workloads that have a relatively static database memory requirement. The POWER7 large page size support can be enabled by setting the DB2 registry variable DB2_LARGE_PAGE_MEM. Here are the steps to enable large page support in DB2 database system on AIX operating systems: 1.
8.3 Capitalizing on the compilers and optimization tools for POWER7 DB2 10.1 is built by using an IBM XL C/C++ Version 11 compiler using various compiler optimization flags along with optimization techniques based on the common three steps of software profiling:...
8.4.1 DB2 virtualization DB2 10.1 is engineered to take advantage of the many benefits of virtualization on POWER7 and therefore allows various types of workload to be deployed in a virtualized environment.
For file systems that support CIO, such as AIX JFS2, DB2 automatically uses this I/O method because of its performance benefits over DIO. The DB2 log file by default uses DIO, which brings similar performance benefits as avoiding file system cache for table spaces.
Although it is not mandatory, you should configure the AIX I/O completion port (IOCP) as part of the DB2 10.1 installation process for performance reasons. For more information, see Configuring IOCP (AIX), available at: http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/index.jsp?topic=/com.ibm.db2.luw.admin.perf.doc/doc/t0054518.html After IOCP is configured on AIX, DB2, by default, capitalizes on this feature for all asynchronous I/O requests.
For more information about this topic, see 8.8, “Related publications” on page 144. 8.7 Conclusion DB2 is positioned to capitalize on many Power features to maximize the ROI of the full IBM stack. During the entire DB2 development cycle, there is a targeted effort to take advantage of Power features and ensure that the highest level of optimization is employed on this platform.
Feedback Directed Program Restructuring (FDPR), found at: https://www.research.ibm.com/haifa/projects/systems/cot/fdpr/
FDPR-Pro - Usage: Feedback Directed Program Restructuring, found at: http://www.research.ibm.com/haifa/projects/systems/cot/fdpr/papers/fdpr_pro_usage_cs.pdf
IBM DB2 Version 10.1 Information Center, found at: http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/index.jsp?topic=/com.ibm.db2.luw.welcome.doc/doc/welcome.html
Smashing performance with OProfile, found at: http://www.ibm.com/developerworks/linux/library/l-oprof/index.html
tprof Command, found at: http://pic.dhe.ibm.com/infocenter/aix/v7r1/index.jsp?topic=/com.ibm.aix.cmds/do...
This chapter is intended to provide you with performance and functional considerations for running WebSphere Application Server middleware on Power Systems. It primarily describes POWER7. Even though WebSphere Application Server is designed to run on many operating systems and platforms, some specific capabilities of Power Systems are used by WebSphere Application Server as a part of platform optimization efforts.
Scalability challenges when moving from POWER5 or POWER6 to POWER7 By default, POWER7 runs in SMT4 mode. As such, there are four hardware threads (or four logical CPUs) per core that provide tremendous concurrency for applications. If the enterprise applications are migrated to POWER7 from an earlier version of POWER hardware...
WebSphere Application Server on POWER7 Systems. For an example of using the taskset and numactl commands in a Linux environment, see “Partition sizes and affinity” on page 14. More information about these topics is in Java Performance on POWER7 - Best practices, found at: http://public.dhe.ibm.com/common/ssi/ecm/en/pow03066usen/POW03066USEN.PDF 9.1.4 Performance analysis, problem determination, and diagnostic tests...
Total for process: 119214037
Allocation requests by bucket
Bucket   Maximum Block Size   Number of Allocations
-----   ----------   -----------
104906782 9658271 1838903 880723 300990 422310 143923 126939 157459 72162 87108 56136 63137 66160 45571
For more information, see System Memory Allocation Using the malloc Subsystem, available at: http://publib.boulder.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.genprogc/doc/genprogc/sys_mem_alloc.htm
Appendix A. Analyzing malloc usage under AIX...
Appendix B. Performance tooling and empirical performance analysis
This appendix describes the optimization and tuning of the POWER7 processor-based server from the perspective of performance tooling and empirical performance analysis. It covers the following topics:
– Introduction
– Performance advisors
– Linux
– Java (either AIX or Linux)
“Expert system advisors” on page 156. The fourth advisor is part of the IBM Rational Developer for Power Systems Software. It is a component of an integrated development environment (IDE), which provides a set of features for performance tuning of C and C++ applications on AIX and Linux.
All of the advisors follow the same reporting format, which is a single page XML file you can use to quickly assess conditions by visually inspecting the report and looking at the descriptive icons, as shown in Figure B-1. Figure B-1 Descriptive icons in expert system advisors (AIX Partition Virtualization, VIOS Advisor, and Java Performance Advisor) The XML reports generated by all of the advisors are interactive.
LPAR over time. The goal of the advisor is for the user to be able to self-assess the health of their LPAR and act to attain optimal performance.
LPAR configuration is optimized. If the advisor finds that the LPAR configuration is not optimal for the workload, it guides the user in determining the best possible configuration. The LPAR Performance Advisor can be found at: https://www.ibm.com/developerworks/wikis/display/WikiPtype/PowerVM+Virtualization+performance+advisor
Figure B-3 LPAR Virtualization Advisor...
The output of the run is a simple XML file that can be viewed by using the supplied XSL viewer and any browser. The Java Performance Advisor can be found at: https://www.ibm.com/developerworks/wikis/display/WikiPtype/Java+Performance+Adviso
Figure B-4 Java Performance Advisor
Performance Advisor, which provides a rich set of features for performance tuning C and C++ applications on IBM AIX and IBM PowerLinux systems. Although not directly related to the tooling described in “Expert system advisors” on page 156, Rational Performance Advisor has the same goal of helping users to best use Power hardware with tooling that offers simple collection, management, and analysis of performance data.
As such, this section does not address performance topics that are related to capacity planning, and system-level performance monitoring and tuning. For capacity planning, see the IBM Systems Workload Estimator, available at: http://www-912.ibm.com/estimator For system-level performance monitoring and tuning information for AIX, see Performance Management, available at: http://publib.boulder.ibm.com/infocenter/aix/v7r1/index.jsp?topic=/com.ibm.aix.prf...
CPU profiling A CPU profiler is a performance tool that shows in which code CPU resources are being consumed. Tprof is a powerful CPU profiler that encompasses a broad spectrum of profiling functionality: It can profile any program, library, or kernel extension that is compiled with C, C++, Fortran, or Java compilers.
More information about using AIX tprof for Java programs is available in “Hot method or routine analysis” on page 177. The functionality of tprof is rich. As such, it cannot be fully described in this guide. For complete tprof documentation, see tprof Command, available at: http://publib.boulder.ibm.com/infocenter/aix/v7r1/index.jsp?topic=/com.ibm.aix.cmds/doc/aixcmds5/tprof.htm
0.0162 0.0580 _esend(2a29f88) 26414 447.7029 2.06% 0.0169 0.0082 0.0426 _erecv(2a29e98)
trace Daemon, available at: http://publib.boulder.ibm.com/infocenter/aix/v7r1/index.jsp?topic=/com.ibm.aix.cmds/doc/aixcmds5/trace.htm
CPU Utilization Reporting Tool (curt), available at: http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.prftools/doc/prftools/idprftools_cpu.ht
Simple performance lock analysis tool (splat), available at: http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.prftools/doc/prftools/idprftools_splat.ht
splat Command, available at: http://publib.boulder.ibm.com/infocenter/aix/v7r1/index.jsp?topic=/com.ibm.aix.cmds/doc/aixcmds5/splat.htm
The accumulated time in the system call
The name of the system call (followed by the system call address in parentheses)
The process name, followed by the Process ID and Thread ID in parentheses
This report is useful in determining what system calls are blocking threads from proceeding. For example, threads appearing in this report with an unfinished recv call are waiting on data to be received over a socket. Another useful trace-based tool is splat, which is the Simple Performance Lock Analysis Tool.
The minimum, maximum, and average number of threads that are waiting on the lock, across the analysis interval
Recursion: The minimum, maximum, and average recursion depth to which each thread held the lock
Examples of alignment issues that are handled by microcode with a performance penalty in the POWER7 processor are loads that cross a 128-byte boundary and stores that cross a 4 KB page boundary. To give an indication of the penalty for this type of misalignment, on a 4 GHz processor, a nine-instruction loop that contains an 8 byte load that crosses a 128-byte boundary takes double the time of the same loop with the load correctly aligned.
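Whether a particular access falls into one of these cases can be checked with simple address arithmetic. The following is a minimal sketch; the function name and the sample addresses are hypothetical, and it only tests whether the first and last byte of an access land in different boundary-sized blocks.

```python
def crosses_boundary(address, size, boundary):
    """True if an access of `size` bytes starting at `address` spans two `boundary`-byte blocks."""
    first_block = address // boundary
    last_block = (address + size - 1) // boundary
    return first_block != last_block

aligned_load = crosses_boundary(0x1000, 8, 128)         # starts on a 128-byte boundary
straddling_load = crosses_boundary(0x107C, 8, 128)      # last 4 bytes spill into the next 128-byte block
page_crossing_store = crosses_boundary(0x1FFC, 8, 4096) # spans two 4 KB pages
print(aligned_load, straddling_load, page_crossing_store)
```

A check like this can be applied to the addresses of hot loads and stores that a profile identifies, to decide whether padding or realigning a data structure is worth trying.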
The key metric is the Emulation Delta (the number of instructions that are emulated during each interval). Non-zero values merit further investigation. Invoking tprof with the -E EMULATION flag generates a profile that shows where the emulated instructions are.
POWER7 processor chip that the software thread is running on. is memory that is attached to a different POWER7 processor that is in the same CEC (that is, the same node or building block in the case of a multi-CEC system, such as a Power 780) that the software thread is running on.
OProfile. OProfile can be run directly as a command-line tool or under the IBM SDK for PowerLinux. The OProfile tools can monitor the whole system (LPAR), including all the tasks and the kernel.
Using the IBM SDK for PowerLinux Trace Analyzer The IBM SDK for PowerLinux provides tools, including the SystemTap and pthread monitor, for tracking I/O and lock usage of a running application. The higher level Trace Analyzer tools can target a specific application for combined SystemTap syscall trace and Lock Trace.
In GCC, you must specify the -O3 optimization level and inform the compiler that you are running on a newer processor chip with the Vector ISA extensions. In fact, with GCC, you need both -O3 and -mcpu=power7 for the compiler to generate code that capitalizes on the new VSX feature of POWER7.
General/Code Analysis.
Hotspot profiling
IBM SDK for PowerLinux integrates the Linux OProfile hardware event profiling with the application source code view. This configuration is a convenient way to do hotspot analysis. The integrated Linux Tools profiler focuses on an application that is selected from the current SDK project.
Various tools and diagnostic options are available that can provide detailed information about the state of the JVM. The information that is provided can be used to guide tuning decisions to maximize performance for an application or workload.
For more information about the GC and Memory Visualizer, see Java diagnostics, IBM style, Part 2: Garbage collection with the IBM Monitoring and Diagnostic Tools for Java – Garbage Collection and Memory Visualizer, available at: http://www.ibm.com/developerworks/java/library/j-ibmtools2...
CPU resources.
Example B-11 Sample Java program
public class ProfileTest extends Thread {
    static Object o; /* used for locking to serialize threads */
    static Double A[], B[], C[];
    static int Num=1000;
    public static void main(String[] args) {
        o = new Object();
        new ProfileTest().start(); /* start 3 threads */
        new ProfileTest().start(); /* each thread executes the "run" method */
        new ProfileTest().start();
    }
    public void run() {
        double sum = 0.0;
        for (int i = 0;...
===== ====== ======= =====
libj9jit24.so 1157 27.51 900000003e81240 5c8878
libj9gc24.so 510 12.13 900000004534200 91d66
/usr/lib/libpthreads.a[shr_xpg5_64.o] 4.16 900000000b83200 30aa0
Profile: libj9jit24.so
Total Ticks For All Processes (libj9jit24.so) = 1157
Subroutine Ticks Source Address Bytes
========== ===== ====== ====== ======= =====
.jitMonitorEntry 1121 26.66 nathelp.s 549fc0
Garbage Collection impact: The impact of initializing new objects and of GC is shown in Example B-13 on page 180 as the 12.13% of ticks in the libj9gc24.so shared object. This high GC impact is related to the excessive creation of Double objects in the sample program.
A common case is when older java/util classes, such as Hashtable, do not scale well and cause a locking bottleneck. An easy solution is to use java/util/concurrent classes instead, such as ConcurrentHashMap.
JVM, such as GC locks. These statistics can be used to make decisions about GC policies, lock reservation, and so on, to make optimal usage of processing resources. For more information about the Java Lock Monitor, see Java diagnostics, IBM style, Part 3: Diagnosing synchronization and locking problems with the Lock Analyzer for Java, available at: http://www.ibm.com/developerworks/java/library/j-ibmtools3...
WAIT website. For more information about WAIT, go to: http://wait.researchlabs.ibm.com This site also has sample input files for WAIT, so users can try out the data analysis and visualization aspects without collecting any data.
Oracle environment. This section includes information specific to POWER7 in an Oracle environment. IBM is not aware of any POWER7 specific Oracle issues at the time of this writing. Most issues that show up on POWER7 are the result of not following preferred practices that apply to all Power Systems generations.
Java Performance on POWER7 - Best Practice. Information about migrating Java applications from POWER5 or POWER6 to POWER7. http://www.ibm.com/systems/power/hardware/whitepapers/java_perf.html
Appendix C. POWER7 optimization and tuning with third-party applications...
Oracle 11gR2 preferred practices for AIX V6.1 and AIX V7.1 on Power Systems
This section is a summary of preferred practices for stand-alone Oracle 11gR2 instances on AIX V6.1 and AIX V7.1 on POWER7 Systems. Except for references to symmetric multithreading, quad-threaded mode (SMT4 mode), all of the preferred practices apply to POWER6 as well.
– 64 KB page size for data, text, and stack regions is useful in environments with a large (for example, 64 GB+) SGA and many online transaction processing (OLTP) users. For smaller Oracle instances, 4 KB is sufficient for data, text, and stack.
LDR_CNTRL CPU specifications are as follows: SMT mode: POWER7 supports SMT4 mode, which is the AIX default. AIX and Oracle performance support encourages starting with the default. Virtual processor folding: This is a feature of Power Systems in which unused virtual processors are taken offline until demand requires that they be activated.
– For information about these parameters, see the help pages available at http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.cmds/doc/aixcmd s3/ioo.htm. Do not change AIX V6.1 or AIX V7.1 restricted tunables unless directed to do so by IBM AIX support. In AIX V6.1, j2_nBufferPerPagerDevice is a restricted tunable, and j2_dynamicBufferPreallocation is not. ASM considerations for stand-alone Oracle 11gR2: –...
Page 208
9000 bytes. They are used to reduce the number of frames to transmit a volume of network traffic, but they work only if enabled on every hop in the network infrastructure. Jumbo frames help reduce network and CPU processing impacts.
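The effect of the larger frame size on frame count can be sketched with simple arithmetic. The transfer size (1 GiB) and the 40 bytes of TCP/IPv4 header per frame are assumptions chosen for illustration; Ethernet framing overhead and TCP options are ignored.

```shell
# Hypothetical transfer size chosen for illustration: 1 GiB of payload.
bytes=$((1024 * 1024 * 1024))

# Assumed per-frame header overhead: 20 bytes IPv4 + 20 bytes TCP, no options.
hdr=40

# frames MTU: number of frames needed to carry $bytes, rounded up.
frames() {
    payload=$(($1 - hdr))
    echo $(( (bytes + payload - 1) / payload ))
}

std=$(frames 1500)     # standard Ethernet MTU
jumbo=$(frames 9000)   # jumbo frame MTU

echo "MTU 1500: $std frames"
echo "MTU 9000: $jumbo frames"
```

Under these assumptions, the jumbo-frame case needs roughly one sixth as many frames, which is where the reduction in per-frame network and CPU processing comes from.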
– http://www.sybase.com/detail?id=1096191 This paper is a joint IBM and Sybase publication that was originally written in 2006 and has been updated twice since then, most recently in Spring, 2012. The authors of the current version are Peter Barnett (IBM), Mark Kusma (Sybase), and Dave Putz (Sybase).
Page 210
Migrating Sybase ASE to IBM Power Systems, available at: http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP102105 The subtitle of this paper explains its purpose: It presents the case for POWER7 as the optimal platform for Sybase ASE, presents guidelines and preferred practices for migrations, and includes a field guide to migrations and proofs of concept.
However, there are some special environment variables and scheduler tunables that should be considered, based on joint work between Sybase and AIX development teams. These items were originally developed for POWER6, but are also applicable to POWER7. Some of irm suggestions...
Is IQ allocating too many threads in SMT4 mode? With POWER5 SMT2 mode, it is beneficial to set -iqnumbercpus to the number of cores or virtual processors. However, with POWER6 and POWER7, the best performance is generally obtained by leaving -iqnumbercpus at its default, even though IQ creates more threads.
Our findings were that, on POWER7 and AIX V6.1, you can achieve equal or better performance for the 18 queries in parallel if you allocate one more virtual processor than the target number of dedicated cores and: Give the IQ LPAR slightly higher entitlement than its neighbors...
Page 214
I/O, and paging space. Before you increase the user-process resource limits, such as memory, to high values, consider the potential consequences.
Page 215
• Release-behind sequential read flag (-rbr)
• Release-behind sequential write flag (-rbw)
• Release-behind sequential read and write flag (-rbrw)
The I/O stack layers are:
– Application
– File system (optional)
– LVM (optional)
– Subsystem device driver (SDD) or SDD Path Control Module (SDDPCM) (if used)
– hdisk device driver
– Adapter device driver
– Interconnect to the disk
– Disk subsystem
– Disk
Page 217
Disk maximum I/O that is issued.
queue_depth: Disk maximum number of simultaneous I/Os. The default is 20 but can be set as high as 256 for IBM Enterprise Storage Server® (ESS), IBM System Storage® DS6000™, and IBM System Storage DS8000®.
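Before you change queue_depth, it can help to look at the current value and the range that the device driver accepts. The following is a minimal sketch that uses the AIX lsattr command; the device name hdisk0 is hypothetical, and the uname check is there only so that the script does nothing on other operating systems.

```shell
# hdisk0 is a hypothetical device name; substitute one of your own disks.
disk=hdisk0

if [ "$(uname -s)" = "AIX" ]; then
    # Show the current queue_depth value for the disk.
    lsattr -El "$disk" -a queue_depth
    # Show the range of values that the device driver accepts.
    lsattr -Rl "$disk" -a queue_depth
else
    echo "Not running on AIX; skipping lsattr queries for $disk"
fi
```

A change is then typically made with chdev -l on the same device and attribute. Evaluate the new value against the storage vendor's guidance first, because a queue depth that exceeds what the disk subsystem can handle can degrade performance.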
Page 218
AIX limitations for total number of I/Os. Also, carefully evaluate the queue parameters before you implement any changes. For tuning guides specific to a particular IBM storage system such as the IBM System Storage DS4000®, DS6000, or DS8000, see Appendix B, “Performance tooling and empirical performance analysis” on page 155.
AIX V6 best practices for SAS Enterprise Business Intelligence (SAS eBI) users on IBM POWER6: http://www.sas.com/partners/directory/ibm/AIXBestPractice.pdf
IBM General Parallel File System (GPFS) wiki for SAS: http://www.ibm.com/developerworks/wikis/display/hpccentral/SAS
Understanding Processor Utilization on Power Systems - AIX: http://www.ibm.com/developerworks/wikis/display/WikiPtype/Understanding+Processor+Utilization+on+POWER+Systems+-+AIX
Migrating SAP BusinessObjects Business Intelligence platform...
SBOP BI applications, the Quick Sizer tool provides an SAP Application Performance Standard (SAPS) number and the memory necessary to run the applications. IBM can then provide the correct system configuration that is based on these numbers. The Quick Sizer tool is available at http://service.sap.com/quicksizer...
This advice is drawn from application optimization efforts across many different types of code that runs under the IBM AIX and Linux operating systems, focusing on the more pervasive performance opportunities that are identified, and how to capitalize on them. The...