Amion server performance summer / fall blog

11-24   The new server has been humming along for over two months. We've updated Amion2 to better mirror the primary site. Single sign-on is back. Amion should be ready for many years of growth. The server problems of late summer and fall are resolved.

Thanks for your patience through the early weeks when Amion was not as reliable as it needed to be.


9-24   The new server has handled all traffic this week as if Amion were a small startup with few visitors. Logs haven't shown a single moment of high load.

Monday through Wednesday, we fixed dozens of small bugs that never caused trouble on the old system. The server app is compiled C code, which runs fast but does whatever the code dictates, good or bad. With the new compiler, the app shuts down if it writes even a single byte past the end of a data element, leaving you with a server error. We've cleaned up most of those problems. Yesterday brought a few edge cases, but we've fixed those, too.
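For the curious, here's a toy example (not actual Amion code) of the kind of off-by-one bug involved. Built the old way, the extra byte silently lands in neighboring memory; built with hardening such as gcc's -D_FORTIFY_SOURCE=2 (we're simplifying, and our exact build flags aren't the point), the run-time check aborts the program on the spot, which is what reaches you as a server error:

    /* Illustrative only -- not taken from the ocs source. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char code[8];                 /* room for 7 characters plus '\0' */

        strcpy(code, "ABCD123");      /* fits exactly; fine */
        printf("%s\n", code);

        strcpy(code, "ABCD1234");     /* one byte too long: the old build quietly
                                         clobbered adjacent memory, a fortified
                                         build aborts the process right here */
        printf("%s\n", code);
        return 0;
    }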

The toolkit we use for single sign-on wasn't compatible with the new server. We're in the process of rebuilding our SSO interface and hope to have that back later today.


9-20   The move to the new server went smoothly. It handled the early AM mobile-app sync without logging a single moment of heavy load, a refreshing change from the last month when every overnight update pushed the load higher than usual.

Earlier this morning, though, a routine job that rotates system logs to the next day exposed a typo in a config file that kept HTTP services from restarting correctly. Mail, FTP, and all other functions continued running fine.

HTTP services were restored at approximately 4:55am.

9-17   We will switch to the new server at 7a tomorrow, 9/18. The transition may come with a few seconds of downtime. Both servers will be running up to and through the switch. We will copy recently revised schedules just before 7a and sync again once all traffic is hitting the new server.

9-14   We restarted the server at 4:30a Sat, 10:30a yesterday, and 11:10a today, but each time we got back to baseline in seconds rather than the minutes or hours it took when the problems began last month.

By the end of last week, we finished reworking the file storage. Every folder that used to hold tens of thousands of files has its data spread over 200 to 300 sub-folders.
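As a rough sketch of the idea (illustrative only, not our actual code), a file name can be hashed into one of a few hundred sub-folders, so no single directory ever holds more than a few hundred entries:

    /* Illustrative sketch of spreading one flat folder across sub-folders. */
    #include <stdio.h>

    #define NUM_BUCKETS 256          /* roughly the 200-300 range mentioned above */

    /* djb2 string hash: simple, fast, spreads typical file names evenly */
    static unsigned long hash_name(const char *name)
    {
        unsigned long h = 5381;
        for (; *name; name++)
            h = h * 33 + (unsigned char)*name;
        return h;
    }

    /* Build the new path for a file: <root>/<bucket>/<file> */
    static void bucket_path(char *out, size_t outsz,
                            const char *root, const char *file)
    {
        unsigned bucket = (unsigned)(hash_name(file) % NUM_BUCKETS);
        snprintf(out, outsz, "%s/%02x/%s", root, bucket, file);
    }

    int main(void)
    {
        char path[512];
        /* "dept_icu_2021.sch" is a made-up example file name */
        bucket_path(path, sizeof path, "schedules", "dept_icu_2021.sch");
        printf("%s\n", path);        /* prints the bucketed path, e.g. schedules/xx/... */
        return 0;
    }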

We have a new server ready. The hardware won't change but we'll get 10 years of kernel and other system updates. Our data-center guru believes file and process management will be far more efficient. We're testing the new server this week and will schedule a date for the move soon. The switch should not bring any down time. We will let you know when we plan to flip to the new server.

After each moment of high load, even ones that don't require a server restart, we review Cloudflare logs for sources of unusual traffic. We've worked with a few partner companies that pull data from Amion to reduce their impact. We're actively working with Doximity to make our own mobile app more efficient as it syncs assignments for every Amion schedule.

We're also making regular but small improvements to the 'ocs' server app to reduce overhead. Upgrades will continue until all server performance issues are resolved.

9-7   After a week of smooth operations we had to restart the server a little after 10a this morning due to yet another overload. The restart went quickly and we had the system back to baseline in just a minute or two.

Over the weekend, we continued restructuring the way Amion stores data. We're on the final folder that stores all schedule data. A filter limits which schedules move to the new layout. We started with a few small accounts and are up to 1/3 of schedules as of this morning. After today's restart, we'll open the filter up for all remaining accounts.

Each page Amion generates runs a copy of our 'ocs' server app. Each copy of the app requires a good-size block of memory for the program itself and CPU overhead to launch. Upon startup, the app runs a number of housekeeping chores to track open tasks and monitor the system load. We plan to break the housekeeping tasks into a separate app that runs under FastCGI. FastCGI apps remain resident in memory and a single instance can run many jobs simultaneously. The new app will be able to better manage traffic and block specific servers that request an unusual number of pages which might otherwise overload our system.
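Here's a bare-bones sketch of what a FastCGI worker looks like in C, using the open-source FastCGI development kit. It's purely illustrative, not the actual ocs code. The key difference from a regular CGI program is the accept loop: the process starts once, then answers request after request from memory instead of being relaunched for every page:

    /* Minimal FastCGI worker sketch (requires the fcgi devkit). */
    #include <stdlib.h>
    #include "fcgi_stdio.h"          /* wraps stdio so printf writes to the current request */

    int main(void)
    {
        long requests = 0;

        /* One-time startup cost (config, logs, housekeeping setup) goes here,
         * outside the loop, instead of on every page view. */

        while (FCGI_Accept() >= 0) { /* blocks until the web server hands us a request */
            requests++;
            printf("Content-Type: text/plain\r\n\r\n");
            printf("Handled %ld requests from this one resident process.\n", requests);
        }
        return EXIT_SUCCESS;
    }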

We still have many upgrades planned for the coming weeks. Details will come later.

8-31   We had a flood of traffic right around 5p that caused the server to back up and lock. We restarted the system and Amion is back to baseline. We were at a very comfortable load all day today and yesterday. We'll investigate the source of the 5p traffic shortly.

We're 80% of the way through reworking large folders into a more efficient layout. We had hoped the changes from the weekend and yesterday would be enough to put the logjams behind us, but we will get that last part completed tonight or early tomorrow. The biggest remaining folder holds all 2021 schedule files and is accessed once or many times for every schedule page Amion displays.

8-27   We had a minute of intentional down time at 6:25a to switch to a new filesystem that should better handle folders with large numbers of files.

Amion stores every active schedule in one folder. Customer account info goes in another. Each folder has tens of thousands of files. The folder that stores iCal data for Google and iPhone calendars has close to 100,000 files.

Our data-center guru suggests we split files over multiple sub-folders. Any time a file changes, the operating system locks the directory index to update it. The index for 20K files can be ~0.5 MB, which takes a significant number of CPU cycles to rewrite. While an index is locked, other jobs that need files in that folder are blocked momentarily. If the server is fairly busy, a folder could remain locked far longer than ideal and jobs could stack up.

We will be working on a new storage format today and hope to have some folders on the new layout this afternoon. We will start with the iCal folder since it's huge and doesn't impact regular Amion operations.

8-26   Amion was down off and on between 11:00a and 12:15p on Thursday 8-26. We had to restart the system a few times to return to normal operations.

We may have found one source of traffic that's contributing to Amion getting overloaded. Every 15 minutes, a partner company pulls 3 days of assignments for approximately 120 schedules. They pull each day separately, generating close to 400 requests within a narrow time frame. If their sync launches at a moment when Amion is already busy, jobs can back up on our system.

While getting Amion back online today, we shut down API access. That company's sync was underway and once blocked, it sent over 1000 requests to Amion to recover from the error.

If their sync drives the load up on Amion to where our app returns a "busy" message, their error handler may make the problem worse and not give us time to recover.

Over the next few days, we'll be evaluating architectural changes to our hosting setup to split traffic across different systems.

The partner has a fix ready based on feedback we provided them a few weeks ago when our server problems started. Today, we asked that they install the fix as an emergency patch instead of waiting until their next standard release.

If they can't roll the fix out quickly, we will limit the rate at which they can request data.
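One simple way to enforce such a limit, shown here purely as a sketch and not as code running on Amion today, is a token bucket: each client gets a small budget of requests that refills at a steady rate, and anything beyond that gets a "busy, retry later" response:

    /* Hypothetical per-client rate-limit sketch (token bucket). */
    #include <stdbool.h>
    #include <time.h>

    struct bucket {
        double tokens;       /* requests currently allowed */
        double capacity;     /* maximum burst size */
        double refill_rate;  /* tokens added per second */
        time_t last;         /* last time we refilled */
    };

    static bool allow_request(struct bucket *b)
    {
        time_t now = time(NULL);
        double elapsed = difftime(now, b->last);

        b->tokens += elapsed * b->refill_rate;
        if (b->tokens > b->capacity)
            b->tokens = b->capacity;
        b->last = now;

        if (b->tokens >= 1.0) {
            b->tokens -= 1.0;
            return true;     /* serve the request */
        }
        return false;        /* return a "busy" / retry-later response */
    }

    int main(void)
    {
        /* Example numbers only: bursts of 30 requests, refilled at 1 per second. */
        struct bucket partner = { 30.0, 30.0, 1.0, time(NULL) };
        return allow_request(&partner) ? 0 : 1;
    }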

8-23-21   We had to restart the server after another rush of traffic caused jobs to back up to the point where the system couldn't recover.

This morning, we installed new tools to gather data on why the server runs smoothly for hours and then gets overloaded.

We also adjusted the filter settings which protect the server from getting overloaded. We are monitoring the service non-stop.

Sorry for the ongoing trouble.

8-19-21  We are still seeing brief moments of high traffic. This morning, we installed a new filter to prioritize large hospital systems and Enterprise+ sites that standardize all schedules on Amion. The initial configuration let too much traffic through and we had to restart the system after adjusting settings.

Since Thursday of last week, we've been clearing the disk cache every five minutes. The few load spikes since then have been in the moments leading up to the reset. We set the cache to clear every three minutes instead of every five.
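The 8-13 entry below explains what this cache is. We haven't spelled out the exact housekeeping command here, but the standard Linux control for discarding cached directory entries and inodes is /proc/sys/vm/drop_caches, which a tiny root-run program (or cron job) can poke every few minutes. A sketch along those lines, offered as an illustration of the mechanism rather than our exact setup:

    /* Drop reclaimable dentry/inode cache entries (must run as root). */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        sync();   /* flush dirty data first so more cache entries are reclaimable */

        FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
        if (!f) {
            perror("drop_caches");   /* typically means we're not root */
            return 1;
        }
        fputs("2\n", f);             /* 1 = page cache, 2 = dentries and inodes, 3 = both */
        fclose(f);
        return 0;
    }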

Our goal is to ensure Amion remains fast and responsive for everyone. We will continue to make improvements until the traffic filter sits idle and we all forget it's there.

8-13-21 We sincerely apologize for the performance issues that started in the afternoon of Aug. 10 and continued intermittently through the morning of Aug. 11. You rely on Amion to communicate and delays affect patient care. We devoted all our resources toward resolving the problem but it took a few days of troubleshooting to find the root cause. Read on for a timeline and full explanation of what transpired.

Starting mid-day Sunday 8/8, Amion had difficulty generating pages for people on the site. The high load continued on and off through the afternoon and overnight. Monday at around 9:30a ET, we reconfigured the server to launch tasks more efficiently and the site returned to normal.

The system ran fine all day Monday and into Tuesday. At around 2p on Tuesday, the load started running high again leading to slow responses or, for many of you, the “busy” message that helps keep the system from getting completely overloaded.

We thought we were getting hit with a DDoS attack. To keep up, we doubled the resources behind Amion. When that didn't help, we doubled them again but still didn't see sufficient improvement.

Tuesday evening, we redirected the Amion.com domain through Cloudflare to help analyze and control traffic. Wednesday morning, we enabled a firewall rule to let only browsers access the site while blocking bots that might be overloading the system. That helped and seemed to get us back to normal but the load continued to cycle from very low (normal) to unusually high.

At around 5p ET on Wednesday, we opened up access to the APIs and schedule editing. The site continued to function smoothly enough but we were also near the end of the day on the east coast when traffic lightens.

From the start, we worked closely with the lead system admin at the data center that hosts Amion. Wednesday evening, he came up with a possible explanation for the problem. We made a few changes in the Amion server app to address his findings and made those live at around midnight. Thursday morning, the system admin set up a job that runs every five minutes to perform some Linux housekeeping. We had to reboot the server at around 10a Thursday but since then, it’s been running smoothly.

The problem was an odd interaction between Linux and our server app. The ‘ocs’ app handles every request to Amion. It can run a thousand, often several thousand times a minute. Each instance opens one or more schedule files, a license / account file, a handful of system files, and a few temp files.

Each time an app touches a file, the Linux operating system adds an entry to its disk cache. An app can create a file and delete it moments later but the cache entry remains. Linux clears entries from the cache as part of its normal operations but if apps touch files faster than the system clears unneeded slots, the cache keeps growing.
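To make that concrete, here's an illustrative sketch (not the ocs source) of the create-then-delete pattern: the temp file itself is gone in seconds, but the kernel hangs on to a cache entry for its name, and with a fresh unique name for every request those stale entries pile up far faster than the kernel reclaims them:

    /* Each request creates a uniquely named scratch file and deletes it. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        char path[] = "/tmp/ocs_scratch_XXXXXX";  /* hypothetical temp-file name */
        int fd = mkstemp(path);                   /* create a unique temp file */
        if (fd < 0) {
            perror("mkstemp");
            return 1;
        }

        dprintf(fd, "intermediate data for one page request\n");

        close(fd);
        unlink(path);  /* file is deleted, but its directory-cache entry lingers */
        return 0;
    }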

When the server was running slow, our disk cache had over 4 million entries. The entire Amion site has around 400,000 files and many of those are schedules from years long past that get viewed only rarely. Most of the entries in the cache pointed to temp files that no longer existed but hadn’t been cleared.

The cache consumed an outsized portion of the system memory, leaving too little for normal operations. The system would compensate by using disk space as temporary memory, but moving data to and from disk is slow. Once the system slowed, jobs started backing up, and it then had to spend yet more time switching between the many tasks. The system ended up spending most of its time moving memory and switching tasks, and too little time executing the work it's meant to run.

We’re confident the problem has been fully resolved but we have further infrastructure improvements in the works. We will be moving to a new server in the next few days. The new server will have updated system libraries and programs that should minimize or eliminate the disk cache problem altogether.

We plan to run the ‘ocs’ server app under FastCGI to reduce system overhead. We will update and improve Amion2.com, the mirror site that runs out of a different data center, so that it is a more reliable backup should our primary server become slow or unavailable. We also plan to discuss architectural changes so that Amion will be ready to handle the next several years of growth.

 
