11-24 The new server has been humming along for over two months. We've
updated Amion2 to better mirror the primary site. Single sign-on is back. Amion should
be ready for many years of growth. The server problems of late summer and fall are
resolved.
Thanks for your patience through the weeks this fall when Amion was not as reliable as
it needed to be.
9-24 The new server has handled all traffic this week as if Amion
were a small startup with few visitors. Logs haven't shown a single moment of high
load.
Monday through Wednesday, we fixed dozens of small bugs that never caused trouble
on the old system. The server app is compiled C code, which runs fast but does whatever
the code dictates, good or bad. With the new compiler, the app shuts down if it writes
a single byte past the end of a data element, leaving you with a server error. We've
cleaned up most of those problems. Yesterday brought a few edge cases but we've fixed
those, too.
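
For those curious about the class of bug involved: think of an off-by-one write into a
fixed-size field. A minimal sketch, assuming a toolchain with buffer-overflow checks such
as _FORTIFY_SOURCE enabled; the struct and names below are made up for illustration:

    #include <string.h>

    struct entry {
        char name[8];   /* holds up to 7 characters plus the trailing NUL */
        int  id;
    };

    void set_name(struct entry *e)
    {
        /* "oncall01" is 8 characters, so with the terminating NUL this
           writes 9 bytes -- one byte past the end of e->name.  An older
           build quietly scribbles on the next field; a fortified build
           detects the overflow at run time and aborts, which a visitor
           sees as a server error. */
        strcpy(e->name, "oncall01");
    }

The old toolchain let writes like this slide; the new one turns them into immediate
shutdowns, which is why a batch of long-dormant bugs surfaced all at once.
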
The toolkit we use for single sign-on wasn't compatible with the new server. We're
in the process of rebuilding our SSO interface and hope to have that back later today.
9-20 The move to the new server went smoothly. It handled the early
AM mobile-app sync without logging a single moment of heavy load, a refreshing change
from the last month when every overnight update pushed the load higher than usual.
Earlier this morning, though, a routine job that rotates system logs to the next day
exposed a typo in a config file that kept HTTP services from restarting correctly. Mail,
FTP, and all other functions were running fine.
HTTP services were restored at approximately 4:55am.
9-17 We will switch to the new server at 7a tomorrow, 9/18. The transition
may come with a few seconds of downtime. Both servers will be running up to and through
the switch. We will copy recently revised schedules just before 7a and sync again
once all traffic is hitting the new server.
9-14 We restarted the server at 4:30a Saturday, 10:30a yesterday, and 11:10a
today, but each time we were back to baseline within seconds rather than the minutes
or hours it took when the problems began last month.
By the end of last week, we finished reworking the file storage. Every folder that
used to hold tens of thousands of files has its data spread over 200 to 300 sub-folders.
We have a new server ready. The hardware won't change but we'll get 10 years of kernel
and other system updates. Our data-center guru believes file and process management
will be far more efficient. We're testing the new server this week and will schedule
a date for the move soon. The switch should not bring any downtime. We will let you
know when we plan to flip to the new server.
After each moment of high load, even ones that don't require a server restart, we
review Cloudflare logs for sources of unusual traffic. We've worked with a few partner
companies that pull data from Amion to reduce their impact. We're actively working
with Doximity to make our own mobile app more efficient as it syncs assignments for
every Amion schedule.
We're also making regular but small improvements to the 'ocs' server app to reduce
overhead. Upgrades will continue until all server performance issues are resolved.
9-7 After a week of smooth operations we had to restart the server a
little after 10a this morning due to yet another overload. The restart went quickly
and we had the system back to baseline in just a minute or two.
Over the weekend, we continued restructuring the way Amion stores data. We're on the
final folder that stores all schedule data. A filter limits which schedules move to
the new layout. We started with a few small accounts and are up to 1/3 of schedules
as of this morning. After today's restart, we'll open the filter up for all remaining
accounts.
Each page Amion generates runs a copy of our 'ocs' server app. Each copy of the app
requires a good-size block of memory for the program itself and CPU overhead to launch.
Upon startup, the app runs a number of housekeeping chores to track open tasks and
monitor the system load. We plan to break the housekeeping tasks into a separate app
that runs under FastCGI. FastCGI apps remain resident in memory and a single instance
can run many jobs simultaneously. The new app will be able to better manage traffic
and block specific servers that request an unusual number of pages which might otherwise
overload our system.
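
For context, a FastCGI program in C is built around a small accept loop: one resident
process answers request after request instead of paying the program-load and startup cost
for every page. A minimal sketch using libfcgi's standard wrapper, not our actual
housekeeping app:

    #include <fcgi_stdio.h>   /* libfcgi's drop-in stdio wrapper */

    int main(void)
    {
        /* The process stays resident; each pass through the loop handles
           one request with no program-load or startup overhead. */
        while (FCGI_Accept() >= 0) {
            printf("Content-type: text/plain\r\n\r\n");
            printf("ok\r\n");
        }
        return 0;
    }
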
We still have many upgrades planned for the coming weeks. Details will come later.
8-31 We had a flood of traffic right around 5p that caused the server
to back up and lock. We restarted the system and Amion is back to baseline. We were
at a very comfortable load all day today and yesterday. We'll investigate the source
of the 5p traffic shortly.
We're 80% of the way through reworking large folders into a more efficient layout.
We thought the changes from the weekend and yesterday would be enough to put the logjams
behind us, but we will get that last part completed tonight or early tomorrow. The
biggest remaining folder holds all 2021 schedule files and is accessed at least once,
and often several times, for every schedule page Amion displays.
8-27 We had a minute of intentional down time at 6:25a to switch to
a new filesystem that should better handle folders with large numbers of files.
Amion stores every active schedule in one folder. Customer account info goes in another.
Each folder has tens of thousands of files. The folder that stores iCal data for Google
and iPhone calendars has close to 100,000 files.
Our data-center guru suggests we split files over multiple sub-folders. Any time a
file changes, the operating system locks the directory index to update it. The index
for 20K files can be ~0.5 MB, which takes a fair number of CPU cycles
to write. If an index is locked, other jobs that need to access files in that folder
will be blocked momentarily. If the server is fairly busy, a folder could remain locked
far longer than ideal and jobs could stack up.
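
The usual fix is to hash each file name into a fixed set of sub-folders so no single
directory index grows large. A rough sketch of the idea; the folder names and bucket
count are illustrative, not our final layout:

    #include <stdio.h>

    /* Map a schedule file name to one of a few hundred sub-folders. */
    static unsigned bucket(const char *name, unsigned nbuckets)
    {
        unsigned h = 5381;                      /* djb2 string hash */
        while (*name)
            h = h * 33 + (unsigned char)*name++;
        return h % nbuckets;
    }

    int main(void)
    {
        char path[256];
        /* Instead of ical/12345.ics sitting in one giant folder, the
           file lands in one of 256 small folders, e.g. ical/042/12345.ics. */
        snprintf(path, sizeof path, "ical/%03u/%s.ics",
                 bucket("12345", 256), "12345");
        puts(path);
        return 0;
    }
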
We will be working on a new storage format today and hope to have some folders on
the new layout this afternoon. We will start with the iCal folder since it's huge
and doesn't impact regular Amion operations.
8-26 Amion was up and down between 11:00a and 12:15p on
Thursday 8-26. We had to restart the system a few times to return to normal operations.
We may have found one source of traffic that's contributing to Amion getting overloaded.
Every 15 minutes, a partner company pulls 3 days of assignments for approximately
120 schedules. They pull each day separately, generating close to 400 requests within
a narrow time frame. If their sync launches at a moment when Amion is already busy,
jobs can back up on our system.
While getting Amion back online today, we shut down API access. That company's sync
was underway and once blocked, it sent over 1000 requests to Amion to recover from
the error.
If their sync drives the load on Amion up to the point where our app returns a "busy"
message, their error handler may make the problem worse rather than give us time to recover.
Over the next few days, we'll be evaluating architectural changes to our hosting setup
to split traffic across different systems.
The partner has a fix ready based on feedback we provided them a few weeks ago when
our server problems started. Today, we asked that they install the fix as an emergency
patch instead of waiting until their next standard release.
If they can't roll the fix out quickly, we will limit the rate at which they can request
data.
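
If we do end up throttling, the likely shape is a per-partner token bucket: short bursts
pass, but sustained volume gets the same "busy" response the app already sends under load.
A minimal sketch of that approach, not code that is running today:

    #include <stdbool.h>
    #include <time.h>

    struct bucket {
        double tokens;   /* requests currently allowed          */
        double rate;     /* tokens replenished per second       */
        double burst;    /* maximum bucket size (burst limit)   */
        time_t last;     /* time of the last refill             */
    };

    /* Returns true if the request may proceed; false means send the
       "busy" response instead of doing the work. */
    static bool allow_request(struct bucket *b)
    {
        time_t now = time(NULL);
        b->tokens += (double)(now - b->last) * b->rate;
        if (b->tokens > b->burst)
            b->tokens = b->burst;
        b->last = now;

        if (b->tokens < 1.0)
            return false;
        b->tokens -= 1.0;
        return true;
    }
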
8-23-21 We had to restart the server after another rush of traffic
caused jobs to back up to the point where the system couldn't recover.
This morning, we installed new tools to gather data on why the server runs smoothly
for hours and then gets overloaded.
We also adjusted the filter settings which protect the server from getting overloaded.
We are monitoring the service non-stop.
Sorry for the ongoing trouble.
8-19-21 We are still seeing brief moments of high traffic. This
morning, we installed a new filter to prioritize large hospital systems and Enterprise+
sites that standardize all schedules on Amion. The initial configuration let
too much traffic through and we had to restart the system after adjusting settings.
Since Thursday of last week, we've been clearing the disk cache every five minutes.
The few load spikes since then have been in the moments leading up to the reset. We
set the cache to clear every three minutes instead of every five.
Our goal is to ensure Amion remains fast and responsive for everyone. We will continue
to make improvements until the traffic filter sits idle and we all forget it's there.
8-13-21 We sincerely apologize for the performance issues that started in the
afternoon of Aug. 10 and continued intermittently through the morning of Aug. 11.
You rely on Amion to communicate and delays affect patient care. We devoted all our
resources toward resolving the problem but it took a few days of troubleshooting to
find the root cause. Read on for a timeline and full explanation of what transpired.
Starting mid-day Sunday 8/8, Amion had difficulty generating pages for people on the
site. The high load continued on and off through the afternoon and overnight. Monday
at around 9:30a ET, we reconfigured the server to launch tasks more efficiently and
the site returned to normal.
The system ran fine all day Monday and into Tuesday. At around 2p on Tuesday, the
load started running high again leading to slow responses or, for many of you, the
“busy” message that helps keep the system from getting completely overloaded.
We thought we were getting hit with a DDoS attack. To keep up, we doubled the resources
behind Amion. When that didn’t help, we doubled them again but still didn’t see sufficient
improvement.
Tuesday evening, we redirected the Amion.com domain through Cloudflare to help analyze
and control traffic. Wednesday morning, we enabled a firewall rule to let only browsers
access the site while blocking bots that might be overloading the system. That helped
and seemed to get us back to normal but the load continued to cycle from very low
(normal) to unusually high.
At around 5p ET on Wednesday, we opened up access to the APIs and schedule editing.
The site continued to function smoothly enough but we were also near the end of the
day on the east coast when traffic lightens.
From the start, we worked closely with the lead system admin at the data center that
hosts Amion. Wednesday evening, he came up with a possible explanation for the problem.
We made a few changes in the Amion server app to address his findings and made those
live at around midnight. Thursday morning, the system admin set up a job that runs
every five minutes to perform some Linux housekeeping. We had to reboot the server
at around 10a Thursday but since then, it’s been running smoothly.
The problem was an odd interaction between Linux and our server app. The ‘ocs’ app
handles every request to Amion. It can run a thousand, often several thousand times
a minute. Each instance opens one or more schedule files, a license / account file,
a handful of system files, and a few temp files.
Each time an app touches a file, the Linux operating system adds an entry to its disk
cache. An app can create a file and delete it moments later but the cache entry remains.
Linux clears entries from the cache as part of its normal operations but if apps touch
files faster than the system clears unneeded slots, the cache keeps growing.
When the server was running slow, our disk cache had over 4 million entries. The entire
Amion site has around 400,000 files and many of those are schedules from years long
past that get viewed only rarely. Most of the entries in the cache pointed to temp
files that no longer existed but hadn’t been cleared.
The cache consumed an outsized portion of the system memory, leaving too little for
normal operations. The system would compensate by using disk space as temporary memory
but moving data to and from disk is slow. Once the system slowed, jobs started backing
up and it would then have to spend yet more time switching between the many tasks.
The system ended up spending most of its time moving memory and task switching and
too little time executing the tasks it's meant to run.
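
For the technically inclined, dropping the reclaimable cache entries looks roughly like
this; a simplified sketch, not the exact job the admin installed:

    #include <stdio.h>

    int main(void)
    {
        /* Writing "2" to /proc/sys/vm/drop_caches asks the kernel to free
           unused dentries and inodes.  It only releases clean, reclaimable
           objects, so it is safe to run on a live system; it simply trades
           a warm cache for memory that real work can use. */
        FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
        if (!f) {
            perror("drop_caches");
            return 1;
        }
        fputs("2\n", f);
        fclose(f);
        return 0;
    }
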
We’re confident the problem has been fully resolved but we have further infrastructure
improvements in the works. We will be moving to a new server in the next few days.
The new server will have updated system libraries and programs that should minimize
or eliminate the disk cache problem altogether.
We plan to run the ‘ocs’ server app under FastCGI to reduce system overhead. We will
update and improve Amion2.com, the mirror site that runs out of a different data center,
so that it is a more reliable backup should our primary server become slow or unavailable.
We also plan to discuss architectural changes so that Amion will be ready to handle
the next several years of growth.