Hello,
We upgraded our HANA scale-out system from rev. 69.1 to rev. 72.
About 4 hours after the upgrade, our HDB crashed with OOM errors on the master indexserver.
Since then we have been trying to start it back up, but we hit OOM errors during startup in the index rebuild phase.
The errors clearly state that HANA runs out of allocatable memory for Pool/IndexRebuildAllocator during startup.
It looks like HANA requires more memory during this phase than it did in rev. 69.1.
Our master node has 512 GB of physical memory, and the allocation limit is set to 95% of this (default).
Here is what we read from our logs during startup:
....
[3331]{-1}[-1/-1] 2014-03-24 01:35:36.735498 i Service_Startup ptime_master_start.cc(00719) : Rebuilding system indexes.
[3331]{-1}[-1/-1] 2014-03-24 01:35:37.236971 i Service_Startup ptime_master_start.cc(00733) : Rebuilding system indexes done.
[3331]{-1}[-1/-1] 2014-03-24 01:35:37.237002 i Service_Startup ptime_master_start.cc(00735) : Rebuilding indexes.
[3331]{-1}[-1/-1] 2014-03-24 01:35:37.244660 i Service_Startup IndexManager_rebuild.cc(00478) : Number of indexes: 2837
[3331]{-1}[-1/-1] 2014-03-24 01:35:37.245606 i Service_Startup IndexManager_rebuild.cc(00580) : Number of JobEx indexes: 2798
[3367]{-1}[9/-1] 2014-03-24 01:36:33.729391 w ResMan ResourceContainer.cpp(01300) : Information about shrink at 24.03.2014 01:28:46 000 Mon:
Reason for shrink: Precharge for big block allocation. User size: 6269307200
ShrinkCaller
....
[4123]{-1}[9/-1] 2014-03-24 01:36:42.465154 w Memory PoolAllocator.cpp(01060) : Out of memory for Pool/IndexRebuildAllocator, size 48B, flags 0x0
[4123]{-1}[9/-1] 2014-03-24 01:36:42.465168 e Memory ReportMemoryProblems.cpp(00733) : OUT OF MEMORY occurred.
[3538]{-1}[9/-1] 2014-03-24 01:36:42.465164 w Memory PoolAllocator.cpp(01060) : Out of memory for Pool/IndexRebuildAllocator, size 48B, flags 0x0
[3538]{-1}[9/-1] 2014-03-24 01:36:42.465177 e Memory ReportMemoryProblems.cpp(00733) : OUT OF MEMORY occurred.
[4123]{-1}[9/-1] 2014-03-24 01:36:42.465168 e Memory ReportMemoryProblems.cpp(00733) : Failed to allocate 48 byte.
[3538]{-1}[9/-1] 2014-03-24 01:36:42.465177 e Memory ReportMemoryProblems.cpp(00733) : Failed to allocate 48 byte.
[4123]{-1}[9/-1] 2014-03-24 01:36:42.465168 e Memory ReportMemoryProblems.cpp(00733) : Current callstack:
[3538]{-1}[9/-1] 2014-03-24 01:36:42.465177 e Memory ReportMemoryProblems.cpp(00733) : Current callstack:
[3562]{-1}[9/-1] 2014-03-24 01:36:42.465245 w Memory PoolAllocator.cpp(01060) : Out of memory for Pool/IndexRebuildAllocator, size 48B, flags 0x0
[3562]{-1}[9/-1] 2014-03-24 01:36:42.465256 e Memory ReportMemoryProblems.cpp(00733) : OUT OF MEMORY occurred.
....
GLOBAL_ALLOCATION_LIMIT (GAL) = 520645177866b (484.88gb), SHARED_MEMORY = 243742983024b (227gb), CODE_SIZE = 6919073792b (6.44gb)
PID=2987 (hdbnameserver), PAL=487793667686, AB=1596952576, UA=0, U=1415644701, FSL=0
PID=3196 (hdbcompileserve), PAL=487793667686, AB=447041536, UA=0, U=356200477, FSL=0
PID=3193 (hdbpreprocessor), PAL=487793667686, AB=416477184, UA=0, U=292814063, FSL=0
PID=3254 (hdbstatisticsse), PAL=54199296409, AB=1040187392, UA=0, U=862081043, FSL=0
PID=3257 (hdbxsengine), PAL=487793667686, AB=1100451840, UA=0, U=907234077, FSL=0
PID=3251 (hdbindexserver), PAL=487793667686, AB=265382010522, UA=0, U=218447843045, FSL=0
Total allocated memory= 520645177866b (484.88gb)
Total used memory = 472943874222b (440.46gb)
Sum AB = 269983121050
Sum Used = 222281817406
Heap memory fragmentation: 9
Top allocators (ordered descending by inclusive_size_in_use).
1: / 218448756093b (203.44gb)
2: Pool 214254480696b (199.53gb)
3: Pool/IndexRebuildAllocator 196297409104b (182.81gb)
4: Pool/PersistenceManager 9920106968b (9.23gb)
5: Pool/PersistenceManager/PersistentSpace(0) 9779577952b (9.10gb)
6: Pool/PersistenceManager/PersistentSpace(0)/RowStoreLPA 9473884512b (8.82gb)
7: Pool/ResourceContainer 2852314824b (2.65gb)
8: AllocateOnlyAllocator-unlimited 2832931544b (2.63gb)
9: Pool/malloc 2660566632b (2.47gb)
10: Pool/malloc/libhdbrskernel.so 2467813288b (2.29gb)
11: Pool/RowEngine 2284285840b (2.12gb)
12: AllocateOnlyAllocator-unlimited/FLA-UL<3145728,1>/MemoryMapLevel2Blocks 2135949312b (1.98gb)
13: AllocateOnlyAllocator-unlimited/FLA-UL<3145728,1> 2135949312b (1.98gb)
14: Pool/RowEngine/CpbTree 1417842512b (1.32gb)
15: AllocateOnlyAllocator-limited 1184520640b (1.10gb)
16: AllocateOnlyAllocator-limited/ResourceHeader 1184517680b (1.10gb)
17: Pool/RowEngine/LockTable 536881408b (512mb)
18: AllocateOnlyAllocator-unlimited/FLA-UL<120,256>/BigBlockInfoAllocator 360752520b (344.04mb)
19: AllocateOnlyAllocator-unlimited/FLA-UL<120,256> 360752520b (344.04mb)
20: Pool/PersistenceManager/PersistentSpace(0)/RowStoreConverter 239921680b (228.80mb)
Top allocators (ordered descending by exclusive_size_in_use).
1: Pool/IndexRebuildAllocator 196297409104b (182.81gb)
2: Pool/PersistenceManager/PersistentSpace(0)/RowStoreLPA 9473884512b (8.82gb)
3: Pool/ResourceContainer 2852314824b (2.65gb)
4: Pool/malloc/libhdbrskernel.so 2467813288b (2.29gb)
5: AllocateOnlyAllocator-unlimited/FLA-UL<3145728,1>/MemoryMapLevel2Blocks 2135949312b (1.98gb)
6: Pool/RowEngine/CpbTree 1417842512b (1.32gb)
7: AllocateOnlyAllocator-limited/ResourceHeader 1184517680b (1.10gb)
8: Pool/RowEngine/LockTable 536881408b (512mb)
9: AllocateOnlyAllocator-unlimited/FLA-UL<120,256>/BigBlockInfoAllocator 360752520b (344.04mb)
10: Pool/PersistenceManager/PersistentSpace(0)/RowStoreConverter/ConvPage 239075328b (228mb)
11: Pool/RowEngine/Internal 205837824b (196.30mb)
12: StackAllocator 176672768b (168.48mb)
13: AllocateOnlyAllocator-unlimited/FLA-UL<48,128>/FreeBigBlockInfoAllocator 144301008b (137.61mb)
14: Pool/RowEngine/Transaction 103391528b (98.60mb)
15: Pool/malloc/libhdbexpression.so 90507984b (86.31mb)
16: Pool/malloc/libhdbbasement.so 90380472b (86.19mb)
17: AllocateOnlyAllocator-unlimited/ReserveForUndoAndCleanupExec 84029440b (80.13mb)
18: AllocateOnlyAllocator-unlimited/ReserveForOnlineCleanup 84029440b (80.13mb)
19: Pool/Statistics 83825720b (79.94mb)
20: Pool/PersistenceManager/ContainerNameDirectory 59182968b (56.44mb)
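For anyone who wants to cross-check the figures in the dump above, here is a minimal Python sketch that parses the header line and the per-process lines and adds them up. It assumes that AB means allocated bytes and U means used bytes per process (the dump's own "Sum AB" and "Sum Used" lines suggest this), and that total allocated memory is the per-process sum plus SHARED_MEMORY plus CODE_SIZE. That relationship is read off the numbers here, not taken from SAP documentation. The names `DUMP`, `parse_dump` and `summarize` are made up for this illustration.

```python
import re

# Lines copied verbatim from the memory dump in this post.
DUMP = """\
GLOBAL_ALLOCATION_LIMIT (GAL) = 520645177866b (484.88gb), SHARED_MEMORY = 243742983024b (227gb), CODE_SIZE = 6919073792b (6.44gb)
PID=2987 (hdbnameserver), PAL=487793667686, AB=1596952576, UA=0, U=1415644701, FSL=0
PID=3196 (hdbcompileserve), PAL=487793667686, AB=447041536, UA=0, U=356200477, FSL=0
PID=3193 (hdbpreprocessor), PAL=487793667686, AB=416477184, UA=0, U=292814063, FSL=0
PID=3254 (hdbstatisticsse), PAL=54199296409, AB=1040187392, UA=0, U=862081043, FSL=0
PID=3257 (hdbxsengine), PAL=487793667686, AB=1100451840, UA=0, U=907234077, FSL=0
PID=3251 (hdbindexserver), PAL=487793667686, AB=265382010522, UA=0, U=218447843045, FSL=0
"""

HEADER_RE = re.compile(
    r"GLOBAL_ALLOCATION_LIMIT \(GAL\) = (\d+)b.*?"
    r"SHARED_MEMORY = (\d+)b.*?"
    r"CODE_SIZE = (\d+)b"
)
PROC_RE = re.compile(r"PID=\d+ \((\w+)\), PAL=\d+, AB=(\d+), UA=\d+, U=(\d+)")


def parse_dump(text):
    """Return (gal, shared, code, procs); procs maps process name -> (AB, U)."""
    gal, shared, code = (int(g) for g in HEADER_RE.search(text).groups())
    procs = {
        name: (int(ab), int(used))
        for name, ab, used in PROC_RE.findall(text)
    }
    return gal, shared, code, procs


def summarize(text):
    """Add up the per-process figures and compare them with the limit."""
    gal, shared, code, procs = parse_dump(text)
    sum_ab = sum(ab for ab, _ in procs.values())
    sum_used = sum(used for _, used in procs.values())
    # Assumption: total = per-process sum + shared memory + code size.
    total_allocated = sum_ab + shared + code
    total_used = sum_used + shared + code
    return {
        "sum_ab": sum_ab,
        "sum_used": sum_used,
        "total_allocated": total_allocated,
        "total_used": total_used,
        "headroom": gal - total_allocated,
    }


if __name__ == "__main__":
    for key, value in summarize(DUMP).items():
        print(f"{key:>16}: {value}")
```

Under those assumptions the sums reproduce the "Sum AB", "Sum Used", "Total allocated memory" and "Total used memory" lines of the dump, and the headroom against the global allocation limit comes out as zero, with SHARED_MEMORY taking up a large share of the limit next to Pool/IndexRebuildAllocator.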
To fix this bottleneck we first need to get HDB started, but how?
Is there a way to avoid loading the row store tables during startup? (This would leave enough memory for the index rebuild.)
Is there a way to skip the index rebuild during startup?
Can we increase the allocation limit to more than 95% of the physical memory? (For example, we could configure swap space to be used, just to get over the edge during startup, bring our HDB back up, and then work on reducing the memory requirement on the master indexserver.)
Kind regards,
Florian Wittmann
P.S. We also have a call with SAP.