The purpose of this article is not just to take a cheap shot at Windows 10, which, despite being out for years, everyone loves to hate. It’s also to review why it’s hated, to see what we, as developers, can learn from Windows 10’s missteps, and to look at how we can improve our own designs to avoid those pitfalls.
Windows 10 was officially released on the 29th of July, 2015. In the almost four years since then its usage has climbed to the point where just over half of desktop computers are currently running it. Yes, just over. Just over 53% according to StatCounter as of the 23rd of February 2019, which isn’t great for an operating system that was force-installed onto people’s machines as a security update in the middle of the night. Incidentally, the runner-up is Windows 7 at 35% (as of the time of writing) despite the fact that it goes out of support in a year; it’s probably still loved for reasons that I empathise with and will go over in this post.
I think the progress of Windows 10 has parallels with a worrying trend I’ve begun to notice in other parts of the software development industry, where speed of release and new features are taking priority over heavy testing and a focus on stability. I say this as someone who’s never been especially trusting of unit tests myself, but however they managed it, the Microsoft of the past had less access to today’s managed languages (like .NET and Python) yet still released fairly stable products without resorting to underhanded methods to monetise them. The reason I bring this up is that in today’s fast-paced news cycle it becomes too easy to forget past lessons as they drop off the horizon. I’ll iterate over the strangest issues that made it to customers during the Windows 10 development and release cycle to see what lessons other developers can learn from this, as well as what we can do to avoid being bitten by similar bugs that make it through the test cycle to release in the future.
Why managed languages are important.
Just as a quick aside: I made a comment earlier about the rise of managed languages. The earlier Windows versions (and the Linux kernel) were written largely in C, which is an unmanaged language. The managed part comes from the fact that the languages run inside a VM (in this case an application, almost always written in C, whose job is to run the provided code). This is often slower than pure C in terms of raw execution time, but not necessarily in terms of disk access and networking, as the VM has a higher-level view of what it needs to achieve and can take more shortcuts to get there. Additionally, the VM can apply standard optimisations to every piece of code that it gets, even if the code was written before the optimisation existed, by people who didn’t know enough to apply it at the time. In principle C is faster (and never requires a VM), but in practice it’s only faster if the programmer knows a lot of low-level detail and takes full advantage of C’s features. This means that in practice, managed languages like C# and Java can often be on par with C in terms of speed, but are much easier to make secure, as they automatically apply safety checks that in C have to be written manually, while also being much faster to write. The reason that not everything is written in managed languages is that they need more memory to run and cannot run in some environments (something has to run the VM to run the language); so boot loaders (the things that start up operating systems) and embedded systems (machines with no screen and barely any memory that run code directly, without an operating system) will often be running C and have no capability of running higher-level languages.
It would probably be more illustrative if I offered an example. C is vulnerable to a class of attacks called “buffer overflow” attacks. These are especially dangerous because they can be executed over the internet against a target computer without the user needing to do anything to become vulnerable. Completely remote attacks are far less of a threat against modern networks, but at one point they were very dangerous. A buffer overflow is an attack that allows the attacker to remotely run their own instructions on the target computer’s processor. C requires that all text areas (like the subject line in your email) be allocated a specific amount of memory before text is written to them. So, for example, your C email client needs to specify how long the subject line can be, allocate the correct amount of memory for the character set you’re using (ASCII, UTF-8, UTF-16, etc.), and then write the text of your email’s subject into that memory. What happens if you get an email with a longer subject than the amount of memory allocated, and the data spills over outside the section marked for text and into general memory? Probably nothing, unless the attacker has helpfully provided compiled machine instructions in their “slightly too large subject line”. Those instructions have now been copied into your computer’s general memory and can be executed as if they were always part of your application. The way to deal with this in C is to wrap every part of your application that deals with fixed amounts of memory in code that checks the data going in and out of that section, making sure it doesn’t exceed the allocated space, doesn’t contain code, and so on. Managed languages have a huge advantage in that they do this, and many other routine housekeeping tasks, automatically.
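To make that concrete, here’s a rough Python sketch of the kind of length check a C programmer has to write by hand (the buffer size and the `store_subject` helper are invented for illustration); a managed runtime performs an equivalent check on every write automatically, so the oversized input is rejected instead of silently spilling into adjacent memory.

```python
def store_subject(buf: bytearray, subject: bytes) -> None:
    """Copy `subject` into a fixed-size buffer, refusing anything too long.

    In C, forgetting this length check is exactly what opens the door to
    a buffer overflow; in a managed language the runtime enforces it.
    """
    if len(subject) > len(buf):
        raise ValueError("subject exceeds allocated buffer")
    buf[:len(subject)] = subject

# A 64-byte area allocated for the email subject line.
subject_buf = bytearray(64)
store_subject(subject_buf, b"Meeting notes")   # fits, copied in

try:
    store_subject(subject_buf, b"A" * 1000)    # attacker-sized input
    overflowed = True
except ValueError:
    overflowed = False                          # rejected, memory intact
```

The interesting part is the failure mode: the bad input produces a clean, catchable error rather than corrupted memory, which is the whole value of automatic bounds checking.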
It’s not a big deal to keep all this in mind for any single text field, but it becomes very complex when you expand out to an entire operating system’s worth of text areas, images (which can also be overflowed), temporary holding spaces for typed keyboard characters, and so on.
That might be excessive detail, but it sets the scene: even with these modern advantages in play, software can still be poorly designed and behave unexpectedly. It only becomes a problem, though, when developers start using the excuse that “software development is complicated” to explain away poor design and unexpected behaviour. Earlier systems were written in C and worked fairly reliably thanks to controlled releases and heavy testing. Modern systems can be built much faster and need far less testing to reach the same level of reliability; however, this does not mean they can now be released with no testing at all and no change controls. I’ve seen this excuse used often enough that hearing “programming is hard” has become a personal annoyance. No one would accept it if a car had issues that caused it to suddenly accelerate at random and the manufacturer said “making cars is hard”. While I don’t doubt that designing a new car is an incredibly complex process, you don’t see the same kinds of issues in cars as you do in software, largely because of a multi-step testing process and tight controls on what changes from revision to revision. Essentially, designing modern technology is hard, but if we’re smart about how we go about it, we can mitigate many of the risks. That’s why we shouldn’t accept poorly designed software causing financial loss by destroying data, or generally stuffing up, just because “it’s hard”.
I think the real lesson here is that you need a comprehensive, multi-step test plan, followed by testers who are trying to break the software on purpose, as well as heavy regression testing, to really guarantee reliability. You could probably get part of the way there by doing small roll-outs over time to groups like the Windows Insiders, but that only works if the insiders are a good cross-section of your normal user base, which they are not. I’ll go into this more further down as I look at some specific Windows design decisions.
There are several existing desktop editions of Windows 10. In general they are:
- Windows 10 Home
- Windows 10 Education
- Windows 10 Pro
- Windows 10 Pro for Workstations
- Windows 10 Enterprise
In addition to those versions there are also Windows 10 IoT and Windows 10 Mobile / Mobile Enterprise for various kinds of mobile and embedded devices. Within those versions, and especially within the Pro edition, which is the one most directly targeted at small to medium-sized businesses and independent workers, there is a list of independently maintained branches. The different versions might not be a problem, as they’re not too different from what we had with Windows XP or 7, but the different branches definitely raise a few red flags. When Windows 10 first came out, the idea was that there would be one major release (the last version of Windows) and that this one release would receive continual patching to ensure that every user was on the same version.
Consider the following: not all hardware has proven to be compatible with all the low-level changes in major Windows versions. Because of those compatibility issues there are users on older major revisions who are receiving security updates but aren’t being pushed to the next major revision. For example, you might be on 1703 (build 15063.1631), and because 1709 has a known incompatibility with your hardware you are upgraded to build 15063.1659 instead of being moved to 1809 (17763.316). This applies even more strongly to users who need to stay on a certain major revision due to security requirements, or to ensure compatibility with critical software, which is where LTSB (now called LTSC) comes in. It’s a version of Windows that is only available to corporate customers and sits on a specific upgrade path that will never be moved to the next major release. This means that instead of maintaining (and validating) one single version of Windows, Microsoft’s developers are required to ensure that every security update is also deployed across the other six (at the time of writing) versions of Windows, which are increasingly different from the latest release. Given that Microsoft plans to support each LTSB candidate for 10 years and that a new major revision comes out every 6 months on average, there will eventually be 20 major revisions that require testing with each security update Microsoft releases.
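The arithmetic behind that last claim is simple enough to sketch out; the support window and release cadence below are the figures from the paragraph above, so treat them as the article’s assumptions rather than an official commitment:

```python
support_years = 10       # planned LTSB/LTSC support window per release
releases_per_year = 2    # a new major revision roughly every 6 months

# In steady state, every revision released inside the support window is
# still receiving security updates, so each security fix must be tested
# against this many concurrent major revisions:
concurrent_revisions = support_years * releases_per_year
print(concurrent_revisions)
```

Printing `concurrent_revisions` gives 20, which is where the “20 major revisions” figure comes from.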
It’s worth noting that there are a few differences between the Long Term Service Branch (LTSB), which has now been renamed the Long Term Service Channel (LTSC), and normal Windows 10. Aside from the fact that LTSB versions of Windows are only available to enterprise customers who get their licenses through an enterprise-level agreement with Microsoft (so no company with fewer than a few hundred people, and no individuals), the LTSB editions of Windows 10 do not come with support for the App Store or any apps. Which is interesting, because it reads like an acknowledgement by Microsoft that the App Store is a net negative for Windows 10 stability. In fact the existence of LTSB is an open secret among IT professionals, to the point where people who complain about Windows 10 stability issues are advised to try getting an LTSB build. The problem is that this isn’t really a solution, because unless you’re provided a copy by the multinational corporation you work at, there are almost no legal ways for an individual to get one. This matters because making sure you have a legal copy of all your software isn’t just the ethical thing to do; it also ensures you’re legally covered for damages that might occur because you’ve been fed an untested patch. That isn’t hypothetical either, as some individuals have already begun pursuing lawsuits against Microsoft over forced updates.
The problem with all this is that the major revisions also contain changes to the kernel and other low-level functionality, which must make it harder to backport security fixes to all the currently maintained versions. Additionally, even when users do update from one version to the next, there are often intermittent issues during the upgrade that appear to stop it from completing at all. A user can end up stuck on their version in a perpetual update loop until Microsoft releases a patch for their specific issue. If I had to guess, I’d say this is related to major versions being closer to full reinstalls. That might make things simpler from a speed perspective, where you only need to install one thing instead of, as with Windows 7, installing 150+ fixes in order. But with Windows 7 you knew the fixes would come in one at a time and were dependent on each other, so the upgrade path could be tested more rigorously. The major revisions in Windows 10 don’t seem to guarantee a single path that all computers follow to upgrade, which appears to let some computers get stuck in a bad state where they perpetually download the update, try to install it, and fail.
What is the learning from this?
I’ll talk about Microsoft’s update policy more later, but the takeaway here is that it can’t be good to have so many “stable” versions of your main product that all have to be patched and maintained independently of each other. Instead of making things easier by switching to a continuous delivery model, the requirement for a fixed target that can be validated for security by external organisations (banking regulators, government agencies, etc.) means that Microsoft is now maintaining several major revisions, each of which has its own sub-revisions. We don’t have to guess that this isn’t going well, because we can see progressively more bugs slipping through compared to previous Windows versions. That’s doubly concerning because automated testing frameworks have only improved over time, which means mass testing is even easier to run now than it was before.
As developers, I think the takeaway is that rapid iteration only really works right at the start, before you build up a large user base. The more users rely on you, the better you serve them with large major releases that each pass through extensive testing. Microsoft also has a practice of pushing updates out slowly across systems and monitoring for issues in realtime as they go out. Aside from the obvious problem of a company using paying customers as guinea pigs, it’s unlikely that the metrics Microsoft gets from a remote machine are as useful as what it could get by investing in its automated testing frameworks.
What this suggests to me is that once you have an established product, you’re better off with an exceptionally slow iterative cycle. Make one change, test it to destruction until you’re reasonably certain you’ve covered the edge cases, then roll it out slowly to users and check that everything’s working as expected while you start on the next change. That seems far more sensible than blasting a set of untested, partly developed features into the wild all at once (e.g. Timeline and the new Settings panel) and exposing every user to them, bugs and all.
Under normal circumstances, I’d be the first person to jump on board with “iterative development”. Instead of doing massive releases where you deal with a huge wave of problems all at once, you tackle your potential issues slowly over time; this gives you a smoother release cycle with fewer interruptions. That’s also what the process should look like according to Microsoft’s “servicing strategy” guide.
The issue with this is that while you’re working with the software as a full-time job, your users are working with it as a tool to accomplish their own jobs. From their perspective, a one-off problem they have to deal with, like getting used to the quirks of a new car, becomes a perpetual stream of new annoyances. This may not be an issue if your software company only does one thing, like Uber, whose client-facing app has only five or so screens in total and whose iterative changes are both small and backed by a massive testing and development staff. But as the ratio of change to team size increases, it gets harder to keep the small issues each change generates from causing further knock-on problems for the users.
It gets much worse for companies that perform critical tasks and are accountable to their users and staff for the reliability of their infrastructure. That is why LTSB exists: those users complained to Microsoft until they were allowed to simply opt out of this process. The problem is that every company’s tasks are critical to the people doing them, not just the tasks of companies with massive budgets (and the same goes for self-employed individuals), because people depend on carrying out those tasks to make rent and pay salaries. Essentially, the existence of LTSB is an implied admission that the users subject to iterative development and the short, feature-driven, non-LTSB release cycles are effectively a test group for the “real” users. This might not sting as much as it does if bugs didn’t regularly slip through the testing process into production, to the point where Microsoft has specific policies about staggered update rollouts to mitigate it. So not only is the iterative development process flawed, but Microsoft is fully aware of those flaws and is perfectly happy to push out mandatory patches anyway to the users who don’t have the budget to get out of them.
I’ll talk about staggered updates later, because they’re worth keeping in mind. But the practical effects of the perpetual release cycle are direct. Eventually, everyone is going to run either WSUS (a custom update server) or O&O ShutUp10 (an installable app), both of which act to block Microsoft’s updates from your systems. Effectively, if you’re an IT person, the first time a forced update gets through and you have to explain to management why things are broken because of Microsoft’s actions is the moment you’ll start seriously investigating blocking updates on your systems. The same goes for home users and their customers, or students and their assignments, or even gamers losing a ranked match to a newly introduced bug and sliding down the rankings. Issues do make it to production, even serious ones, and even a user who has been lucky so far will get unlucky eventually; at that point they’ll get upset and try to escape the forced updates. Given enough time, an ever larger proportion of users is going to start hating the company.
A second problem with this development model concerns drivers, both hardware drivers and virtualization drivers that plug into the Windows kernel. The drivers currently in use on Windows are often for hardware that is no longer under active development but is still completely functional and still being used by the people who own it. An update to the Windows driver model can break that hardware, as happened with SSD drives and the Creators Update. If a hardware item is no longer made, it’s not reasonable to expect the manufacturer to go back and rewrite its drivers every time Microsoft makes a small change, which means that Windows’s biggest strength, its hardware and application support, is now randomly at risk unless you’re continually buying the latest hardware.
What this means for developers.
If you’re responsible for IT infrastructure then you’ll have to take steps to ensure that you don’t receive updates whenever Microsoft decides it’s time; you’ll want to get them on your own schedule to avoid late-night phone calls and panicked meetings. You’ve got a few options.
If you’re not dependent on Office or Active Directory, you can phase in Apple hardware for the desktops and run OSX on them. This will improve your security, prevent users from running strange attachments they receive over the internet, and let you guarantee that your machines won’t be suddenly rebooted without you being able to image them just beforehand and confirm their operation afterward. It’s worth noting that, if you need them to, OSX machines can connect to an Active Directory domain for authentication.
If you’re part of the IT staff for a larger company that is dependent on Microsoft software such as Active Directory then your IT department is probably already running WSUS, although it’s been mentioned that some “critical” updates have the capability to bypass WSUS entirely.
For people managing smaller networks where machines can be configured one at a time, you’ll probably want to use something like O&O ShutUp10. This lets you configure the hidden aspects of the Windows Update service to stop a machine checking for updates until you’ve had time to see whether other users are reporting issues, like the infamous data loss bug that got 1809 pulled from release for further testing. It’s worth emphasizing that users had been reporting for months that they were losing files when running updates, yet the build with this bug still made it into general release and became a candidate for an LTSB branch. Being able to control updates with O&O means you can take a backup, run the update, confirm directly that the machine is working as expected, and then disable updates again.
It’s also worth emphasizing that while more expensive versions of Windows with enterprise plans give progressively more control over this process, they also require more IT staff and bigger budgets. I’ve found it’s generally worth having a plan that provides reliability and protection even assuming a poor budget and standard non-enterprise equipment, since you can always guarantee you’ll have those on hand.
One thing people may not be aware of: you’ve probably heard the advice that, to avoid having a Windows 10 machine reboot without warning, you should click “Check for Updates” before beginning a large render or any other job that requires the machine to stay on for the whole process. This isn’t strictly true, as Windows machines do not always receive the latest updates when they check for them. If every Windows 10 machine in existence were to check for updates at the same time, only a small subsection would have updates advertised to them; the others would be told there were no new updates available. In this way updates are rolled out over time, to avoid potential issues like the 1809 data loss bug affecting every user at once.
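A common way to implement that kind of staggered rollout, and this is a generic sketch rather than Microsoft’s actual mechanism, is to hash each machine’s ID into a stable bucket and only advertise the update to buckets below the current rollout percentage. The machine IDs and function name here are invented for illustration:

```python
import hashlib

def update_offered(machine_id: str, rollout_percent: int) -> bool:
    """Deterministically map a machine to a bucket in [0, 100).

    The same machine always lands in the same bucket, so raising the
    rollout percentage only ever adds machines, never removes them.
    """
    digest = hashlib.sha256(machine_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent

# Simulate a fleet of 10,000 machines at a 10% rollout.
fleet = [f"machine-{i}" for i in range(10_000)]
offered = sum(update_offered(m, 10) for m in fleet)

# Roughly a tenth of the fleet sees the update; the rest are told
# "no new updates" until the percentage is raised.
print(f"{offered} of {len(fleet)} machines offered the update")
```

The deterministic hash is the important design choice: it lets the vendor widen the rollout gradually without keeping per-machine state on a server.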
However, there is an additional wrinkle for users who click “Check for new updates”. Those users are marked as “Seekers”, which Microsoft takes to mean users who want to try less tested updates. These flagged users receive less tested updates than the ones otherwise being rolled out to the computers around them, and a manual check still doesn’t pull down every update available for that machine. So it’s possible for a user to check for updates, be given an untested beta update, and then be rebooted without warning as their machine receives the normal updates during its nightly check.
Of course, most users won’t see any issue from this, as there won’t be beta updates available when they check, or new updates going out in the days afterwards; nor will there necessarily be any issues with the beta updates themselves. Only a small subsection of users will be affected. The main problem is that the users hurt by an unclear update process, losing work to a surprise reboot, don’t completely overlap with the users hurt by untested patches; and while the Seeker system seems to help Microsoft stagger the rollout of its less than completely tested software, it’s yet another source of trouble and dissatisfaction for Microsoft’s users.
What the Seeker setup teaches us
I think the takeaway for software developers is that if you’re going to have an update process, don’t feed the less tested updates to the more responsible people. The fact that a user can check for updates just before starting a long-running task (like a render) and actually get a less tested update than normal users, while still not removing the risk of a midnight reboot causing financial harm as their work is lost, is a problem. This looks like another process that upsets people in order to save Microsoft money, and since it’s not the only one, the various attempts to offload testing make Microsoft look like a company that can’t be relied on for quality. The lesson for developers is that if you want to keep your reputation, you need to pass up at least a few of the opportunities to offload your problems onto your users; otherwise what you’re really spending is your brand’s reputation, which you then have to buy back with expensive advertising.
I’m not speaking hypothetically here; I’ve passed on OneDrive and several other Microsoft products entirely because I worry that these questionable testing practices might extend to other desktop beta products. That was before the update that caused data loss on what turned out to be Windows 10 desktops with OneDrive enabled. Effectively, the people who were paranoid that an update might blow away their files were right; and this came after the updates that stopped machines from booting.
I also want to mention that people often say Windows 10 is a desktop OS and that if you want reliability you should get a server operating system. I despise this argument, because you only ever see it on internet forums, never in Microsoft’s documentation. If Windows 10 said “this is a desktop OS, don’t use it for anything serious or you might get data loss” then I would understand the internet debater’s point. But it doesn’t; the marketing material talks about Windows 10 as if it can be used for anything and is super reliable, and so do all the internet IT pundits… until something goes wrong. In general we should expect the things we buy to be reliable enough that we don’t need to constantly baby them to keep them going, especially when the product is the result of 40 or so years of development. The answer to a product failing to live up to expectations is not “well, you should have bought the advanced model”; if the base model isn’t doing the job, I’m not going to shell out twice as much again to see whether model 2 works or not, and I think there’s a fundamental problem with expecting people to do so.
As developers, we should expect people to be able to rely on our software without monitoring it constantly; they should be able to turn their backs on even the base version of the product and have it keep running without a thought. That creates the greatest customer satisfaction and is something worth aiming for.
Paging and Excessive Memory Usage
Windows 98 is still used in many ATMs all over the world despite the fact that it doesn’t have modern memory protection, because it has very few moving parts compared to more modern systems, and that simplicity is itself a source of reliability. For comparison, Windows 98 needs 16MB of RAM to run, and can get by on 8 with the right boot options. Windows 10 lists its minimum requirement as 2GB, but that won’t be enough to actually run any modern software at reasonable speed, as the machine is effectively paging even while booting.
I’m not going to go into whether all this is necessary, as I know people will argue that Windows “needs” that memory to do all the cool things it does. Which is the exact opposite of the argument they’d make if I were selling them a car with poor gas mileage; there they would insist the car shouldn’t use that much gas and should simply be better designed to use less fuel in the first place.
Most of my annoyance, however, stems from the way low-memory states are handled in Windows. Windows will use as much memory as it can to improve performance, but when that memory starts to run out, it completely freaks out. It begins a few “clever” tricks to manage its remaining memory that probably aren’t that clever and really only make the problem worse: firstly, Windows will push memory out to the paging file to make room for the file cache; secondly, the on-disk paging file and the memory compression functionality are linked together, to the point where paging can’t be turned off without also turning off memory compression.
First, let’s talk about swapping memory out to make room for file IO. During normal operation Windows keeps chunks of the files it reads off the disk in RAM. Disk IO can be slow, especially for random access to small chunks of data; how slow depends on your device, but if you’re running off an SD card the answer can be “very”. When Windows reads a file, unless the application passes a special flag to skip this process, it keeps that data in its “free” memory, and the next time that chunk is requested it comes from Windows’s memory at around 2GB/s (2048MB/s) instead of from the disk (which can be as good as 500MB/s… or as bad as 10MB/s, or worse for spinning disks and small files). It’s also worth noting that Windows references this data by file ID (a kind of internal name), not disk location, so the cache isn’t aware of disk position. When Windows detects that storage is being heavily accessed, it moves memory into its paging file to make more room for this cache. The problem is the effect this has in practice. Let’s assume you’re compressing a file on a spinning disk: a heavy operation that will hammer the disk for a while, while also making it respond more slowly, because a small number of large operations cause the other, smaller operations to queue behind them. Windows decides that this is a trigger to start paging, even though there’s plenty of free memory left. It starts moving the memory out in 4KB chunks, one at a time. For 2GB of memory that’s about 524,000 of them, each now deposited in the paging file. When the user does anything, the computer needs to consult its memory to see what to do, some of which is now on the disk. So the drive that is already running at 100% starts slowly reading those chunks back, one at a time, in between running the compression operation.
It does some compression, reads a few chunks, goes back to compression; the computer decides it needs more data, so it reads a few more chunks; more waiting. During this process the mouse is not moving and everything appears locked up. Any disk activity can trigger this, including background virus scanning, which is why Windows 10 is almost unusable on rotating drives. It’s also why a machine may seem slower after the user has been away from it: background processes will have caused memory to be unloaded and paged in their absence, and even if the memory is laid out neatly the first time it’s paged, it won’t be after a few dozen cycles of this. Even on SSDs this is a noticeable source of “jittery” behaviour; the system won’t lock up as distinctly, but it still has to load the data back one piece at a time.
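The numbers in that scenario are easy to verify: paging 2GB of memory out in 4KB pages really does mean hundreds of thousands of individual operations, each of which may later need to be read back from a disk that is already saturated. The 1MB/s leftover random-read throughput below is my own illustrative assumption, not a measured figure:

```python
PAGE_SIZE = 4 * 1024            # 4KB, the standard x86 page size
memory_to_page = 2 * 1024**3    # 2GB pushed out to the paging file

pages = memory_to_page // PAGE_SIZE
print(pages)                    # 524,288 individual 4KB pages

# Reading them all back from a busy spinning disk that has, say, 1MB/s
# of random-read throughput left over for paging would take roughly:
seconds = memory_to_page / (1 * 1024**2)
print(f"about {seconds / 60:.0f} minutes of stalls, spread across use")
```

That back-of-the-envelope half hour of cumulative stall time is why the machine feels locked up even though no single read takes very long.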
The problem with all this is that better performance in in-store tests is not the same as being a more usable machine while you’re doing your taxes on it or trying to reply quickly to an important email. People still have machines with rotating drives, and more of them are still being made and sold. A good system should run well on slow machines (so that it runs even better on fast ones). It’s unreasonable to say “well, it doesn’t run on this machine, but if you spend $2000 or more it’ll run really well, we promise”. Especially as you just know that after you’ve upgraded, you’ll be told you missed something and need to spend even more. It would be nice to have an option to disable this behaviour, like Linux used to offer (setting swappiness to 0), without also disabling the swap file, so that the system would only swap in emergencies.
Linked to this is the implementation of the new memory compression system. Instead of paging memory out directly, Windows will try to compress it in place before unloading it. This is much faster, compressing and decompressing at 400MB/s or so, on par with the very best SSDs operating at full speed, and as it’s an in-memory operation there are far fewer noticeable delays while it happens. The downside, of course, is that the compressed memory is still resident in RAM, so it frees far less memory than regular paging does.
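To get a feel for the tradeoff, here’s a toy sketch using zlib as a stand-in for whatever compressor the kernel actually uses; the sample data and fast compression level are purely illustrative:

```python
import zlib

# Simulate a chunk of process memory: partly repetitive, partly not.
data = (b"window handles, UI strings, zeroed heap " * 1000
        + bytes(range(256)) * 100) * 8

packed = zlib.compress(data, 1)   # fast setting, like an in-kernel compressor

# The win: a slow round-trip through the paging file is avoided entirely.
# The cost: the compressed copy stays resident, so less RAM is freed
# than if the pages had been written out to disk.
print(f"compressed to {len(packed) / len(data):.0%} of original size")
```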
One of the differences between Apple’s OSX and Microsoft’s Windows is that OSX allows you to disable the paging system while leaving memory compression enabled. This means that if you’re on a slow drive and your machine decides that it absolutely must start compressing (which it will, long before it actually gets low on memory), your computer won’t immediately become unusable. A Mac with 8GB of RAM and a regular drive (which are still sold) is fairly usable with this setting even if it’s almost completely unusable without it, though it does require a bit of care, as Macs can be even more memory hungry than PCs. Unfortunately this doesn’t work at all on PCs: even though on average a Windows PC uses less memory than a Mac for the same functionality, it’s not possible to disable paging but keep memory compression, although there are several ways to achieve the reverse.
These have always struck me as an annoying combination of design decisions. As someone whose job involves being stuck behind a computer for most of the day, having it stay responsive under load (even if it’s actually slower at finishing things) is one less source of annoyance in my life and generally makes my work more pleasant. So it’s really frustrating that in this one case, Windows seems dead set on its design decisions and leaves no way of changing them. The best you can really achieve is to disable paging completely and keep an eye on your memory usage; but even this is better than wondering whether the machine is getting slower or it’s just your imagination, then finally rebooting it and noticing that everything really does run faster than before (including one unfortunate experience after I’d spent ages debugging something), and that it wasn’t your imagination after all.
The machine our software will be run on is never as fast as we’d like it to be, and when developing applications it’s important to plan for the very worst piece of hardware that can support what you’re writing and take steps to ensure that your software runs on that hardware as well as possible. This has the added benefit of ensuring that when users do get the latest and greatest, your code will run really, really well.
What does this mean for software development?
Generally it means that while your app can exceed the amount of physical memory available on the system for a while, it’s likely to start hitting strange errors and crashing randomly once it does. Most applications don’t tolerate multi-second access times to their own memory very well, and that’s more and more likely the longer the system has been running. There are a few things you can do to mitigate the problem.
For example, you can do what old Java runtime environments used to do and touch all of your memory at intervals to ensure that none of it gets paged to disk. This is a cheap hack and is generally not recommended, but it does work at keeping your software in memory.
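A minimal sketch of that hack (page size taken from the `mmap` module; in a real application you’d call this from a background timer every minute or so):

```python
import mmap

def touch_pages(buf) -> int:
    """Read one byte from every page so the OS sees them as recently used."""
    pages = 0
    for offset in range(0, len(buf), mmap.PAGESIZE):
        _ = buf[offset]   # a single read is enough to fault the page back in
        pages += 1
    return pages

working_set = bytearray(64 * mmap.PAGESIZE)   # stand-in for the app's memory
print(touch_pages(working_set))               # → 64
```

The obvious downside is that you’re fighting the OS for resources on behalf of memory you may never actually need, which is exactly why it’s not recommended.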
For things like drivers, you can mark your code as non-pageable on Windows or “wired” on OSX, which ensures that it can never be paged out by the operating system. Note that on OSX this also means it can never be compressed, even if it would otherwise compress very well and react quickly when accessed.
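For ordinary user-mode code the closest analogue is memory pinning: `VirtualLock` on Windows, `mlock` on POSIX systems. Here’s a sketch of the POSIX version via `ctypes`; note that this pins regular process memory, not driver code, and it may fail under `RLIMIT_MEMLOCK` (for example inside containers):

```python
import ctypes
import mmap

libc = ctypes.CDLL(None, use_errno=True)  # POSIX-only; Windows uses VirtualLock

# mmap gives us a page-aligned buffer, which mlock requires.
buf = mmap.mmap(-1, mmap.PAGESIZE)
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))

# mlock pins the page into RAM: the kernel may no longer page it out.
rc = libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(mmap.PAGESIZE))
if rc == 0:
    libc.munlock(ctypes.c_void_p(addr), ctypes.c_size_t(mmap.PAGESIZE))
else:
    print("mlock failed, errno", ctypes.get_errno())
```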
If you’re configuring a system, it’s just useful to know that this is something that happens and that “it gets slower the longer it’s been running” may not be entirely your imagination.
Obviously this is not a great state of affairs, and I’m still hoping that both Microsoft and Apple implement some clever functionality to either let you switch off automatic paging unless memory is critically low, or to load code back into memory over time to avoid situations where memory fragments itself all over the disk. It’s not an entirely far-out suggestion either, as the Linux kernel had several patches during the 2.6 years that provided exactly this kind of functionality, allowing better responsiveness while the kernel was paging out to early-2000s spinning drives. If you’re doubting the benefits, remember that both spinning-drive computers and eMMC-based devices are still being sold today, and a cloud server with a spinning drive is almost 30% cheaper than an SSD machine on several hosts. If you’re writing server software, optimising your code saves money; if you’re writing for desktops, one additional pass right at the end to improve performance makes for a happier and larger customer base.
This complaint is going to stand out a bit, as people are generally always in favor of more security. However the issue with Microsoft implementing their own anti-virus solution into Windows is the way they did it, and to a certain extent the justifications for their implementation when people complain about how it was set up.
The biggest problem with the anti-virus is that it is a surprisingly massive performance hog, especially on older systems. More security is better, but not if it makes your computer completely unusable. If I sold someone a car that was safer than the previous model but couldn’t leave the driveway or reach highway speeds, they’d demand their money back; nobody accepts a safety feature that makes the product useless for its primary purpose. Yet that’s what people are asked to accept when Windows 10 makes an older computer too slow to use, especially when the things it’s being used for haven’t changed.
The way the Windows 10 anti-virus is configured, it will re-scan any file when it’s opened, regardless of when it was last scanned or whether any application has edited it in the meantime. This is a huge problem if you’re running a console application that operates by editing text files, especially if performance matters to you, as the files will be re-scanned every time the application touches them. It’s also a problem because the anti-virus is triggered by Windows’ own internal update functionality, which means that at any point your computer may find it necessary to read all of its core files into memory and process them several times. That wouldn’t be a problem if you were running a monstrous beast on the latest SSDs, but it’s a real concern for virtually anyone else trying to get work done on more reasonable systems.
What to keep in mind
Most people who need their Windows 10 desktop for work that demands good hardware performance will probably end up turning the anti-virus off through a registry hack, especially once they become aware of exactly how much it’s affecting their machine’s responsiveness.
I’m not against security. What I am against is mindless security implementations that exist largely as box-ticking exercises so that the provider can shift blame onto the people being hacked. It might be surprising to hear, but I’m actually in favor of a built-in anti-virus; it’s a good idea, but there are several ways it could have been much better. For example, as the anti-virus is built into Windows, it would have been entirely possible to keep a record of when files were last modified and only scan files that have changed since they were last scanned. The operating system doesn’t need to trust the “Last Modified” timestamp on the file for this, as it’s the source of the disk operations to begin with; and even if Microsoft wasn’t willing to go as far as integrating the anti-virus into the disk subsystem, Windows can still generate events when files are modified. It would be trivial to know immediately and reliably when files change on any mounted filesystem other than a network drive. This alone would probably cut the anti-virus resource usage by more than half and make it far more tolerable.
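As a toy illustration of that record-keeping idea, here’s a sketch that remembers each file’s (mtime, size) at its last scan and skips unchanged files. A real implementation would hook the write events themselves rather than trusting file metadata, as argued above, so treat this purely as an illustration:

```python
import os
import tempfile

_scanned = {}   # path -> (mtime, size) at the time of the last scan

def scan_file(path):
    print("expensive full scan of", path)   # stand-in for the real scanner

def scan_if_changed(path):
    st = os.stat(path)
    stamp = (st.st_mtime, st.st_size)
    if _scanned.get(path) == stamp:
        return False          # nothing changed since the last scan: skip it
    scan_file(path)
    _scanned[path] = stamp
    return True

# Demo: the second access of an unchanged file skips the scan entirely.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)
print(scan_if_changed(path), scan_if_changed(path))   # → True False
```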
Another possible change is to have the anti-virus scan only the changed portions of files. From watching it process text files, it looks like what’s actually happening is a full scan of the file every time, which is very inefficient, especially if the file has been scanned before.
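Partial rescanning could look something like this sketch: keep per-block hashes from the previous scan and only rescan blocks whose hash changed. The 64KB granularity is an invented figure, and a real scanner would also have to handle signatures that straddle block boundaries:

```python
import hashlib

BLOCK = 64 * 1024   # scan granularity; an assumption for illustration

def block_hashes(data):
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def changed_blocks(old_hashes, data):
    """Return (indices of blocks that differ from the last scan, new hashes)."""
    new = block_hashes(data)
    dirty = [i for i, h in enumerate(new)
             if i >= len(old_hashes) or old_hashes[i] != h]
    return dirty, new

# Demo: a one-byte edit in a 1MB file dirties a single 64KB block.
original = bytes(1024 * 1024)
hashes = block_hashes(original)
edited = original[:100] + b"X" + original[101:]
dirty, _ = changed_blocks(hashes, edited)
print(dirty)   # → [0]
```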
I suppose what I’m arguing for is that everyone who cites “security” as a reason for a software issue, without being specific about the attack vector and how the change mitigates it without hurting quality and usability, should have to do some kind of “security” dance to underline how blatantly self-serving the argument is. Anyone who’s recently had to configure an out-of-the-box MySQL install on Linux for a development machine will understand how security theater can turn what is generally a one-to-two-step process into a multi-hour slog. It doesn’t help that every distributor seems to have a different implementation and doesn’t feel the need to document their personal method, due to its inherent ‘obviousness’. Security discussions get silly when you’re working out which flags were set in this particular build of MySQL that prevent connections from happening, but show no errors (for security), so that there won’t be any surprise connections on your development VM that’s behind multiple NATs. This is time that someone will generally pay for, and I doubt there are many developers who can breeze through the process without getting stuck referencing documentation.
Security decisions made without thinking through the implications are decisions that other people end up out of pocket for when the issues start to surface, which often means that the people responsible for implementing these new security policies won’t actually go through and secure everything. That’s equally true for people who disable passwords on their computers because they can’t remember them, and for people who disable the anti-virus because it cuts disk IO by over half.
Advertising on a paid product
This is a big one, and crosses a line slightly, as now we’re talking less about how you should implement things in your own applications and more about “never do this because everyone will hate you” territory. Generally people expect that either a product is “free”, in which case the company makes money through advertising inside the product or by driving more users to another of its paid services than it would without the free product, or the user pays directly, either monthly or as a one-off, to fund the company making the product. Generally you don’t have both.
This is not the case in Windows 10, where advertisements and “product suggestions” (which are even more aggravating because they’re ads that can’t admit it) appear on both the lock screen and the start menu. Users have reported inconsistencies here, because not everyone seems to get them: what any given person sees depends on their region and its advertising laws, and thanks to A/B testing, Microsoft already has both the theory and the tools to deliver different versions of a product to different groups on the fly. On the whole, most users can expect any of the following…
- Clicking on the start menu will suggest products
- Searching will suggest products
- The lock screen will suggest products
- Notifications will pop up, suggesting products
The only thing missing from all this is a popup that works like the GWX patch and registers a purchase for you if you click the X in the top right. Even though Microsoft has spoken out against rumours that they’re investigating making Windows a free, ad-supported product, it’s not hard to see why those rumours might have some legs as more and different advertising methods pop in and out of the insider releases.
The learning from this
I think it’s worth it for developers as a whole to push back on this kind of dual-income-stream behaviour. For all the talk about how great the monetization is in products that charge up front as well as running advertising, it looks more and more like companies that pursue this end up with a worse product. We only need to look at EA (Anthem), Bethesda (Fallout 76) and Blizzard (Diablo Immortal) to see how products that rely on monetization experts instead of their previous high levels of quality crash and burn. Instead of making a killing, what’s more likely is that you’ll antagonize your users, burn your reputation and torpedo your company. Not to mention that even if you release a good-quality product that sells poorly, you’ll have the satisfaction of having contributed to the global pool of software that people can rely on; an unfinished product that’s built as a sales funnel won’t even do that as your business slides towards bankruptcy, and it won’t look nearly as good on your resume.
Green Screen with Anti-Cheat software
As of this writing, Windows 10’s slow ring (one of the insider branches, but better tested) has an issue with Green Screens of Death (GSOD, the insider-branch rename of the Blue Screen of Death) caused by many common types of anti-cheat software. This is software that multiplayer games install alongside themselves to add another level of validation ensuring that players can’t cheat in online games. One of the recent changes to the Windows kernel conflicts with this anti-cheat software: software that worked previously now causes crashes on more recent versions of Windows. This includes the software used by Fortnite.
There are several reasons this is worrying. Firstly, because it should be difficult for a driver to cause a full OS failure. There are many poorly programmed drivers out there, and in theory we’ve moved away from the old Windows XP days when the SMB networking driver was built into the kernel, which for a while meant it was possible to reboot a Windows XP system simply by sending it a malformed packet: the packet would crash the in-kernel driver and force a reboot. In theory, changes since then have fixed these issues, so it’s very disappointing to see Windows 10 once again crashing because of a misbehaving driver. It’s even worse when you consider that this anti-cheat driver was programmed against the written Windows 10 specifications and worked perfectly until the most recent change.
Secondly, it’s worrying because it represents a change in standards for OS development. One standard that the Linux kernel developers hold very strongly is that the kernel has to maintain backwards compatibility with the software running on it: if a driver or piece of software used to work, and a change in the kernel causes it to fail, that change is a bad change. Linux does have the advantage that many drivers are part of the kernel directly, so changes are generally made to both simultaneously when needed, but in practice there have been changes that caused software to fail in Linux, and they are always treated very seriously on the development mailing lists. Traditionally this has also been Windows’ approach, and for a long time backwards compatibility with a massive base of tested software for a huge variety of purposes has been one of Windows’ great strengths. This is something Microsoft has seemed to understand for a long time: despite the release of Windows S, Windows RT and even the Windows Store (which didn’t originally support Win32), Microsoft has continued to support and develop Win32 (the base API) even as the other side projects fall by the wayside.
Personally, Windows Core has always been something of a mystery to me. If I wanted an OS that runs only on the command line, supports drivers, runs very reliably and is strongly supported… I’d fire up Ubuntu Linux… or Debian. Rather than being a cut-down version of another OS, Linux is already a command-line OS at its core, with almost thirty years of testing behind its command-line server hosting. So I admit I don’t understand the Windows Core pitch. Windows’ strength is the massive forest of Windows applications that exists now, and the drivers that exist to run those applications. Breaking compatibility with drivers, then placing the blame on the suppliers while shrugging your shoulders online and saying they’re the only ones who can fix it, does not inspire a great deal of confidence in the system. Especially since Microsoft is effectively saying that any other driver might be subject to this at any point; and unless it belongs to a massively popular piece of software like Fortnite, there’s a really good chance that Microsoft will just decide to roll that version of Windows out to production anyway.
What’s the learning
There are a few things we need to take from this. Firstly and most importantly, whatever system you build should not depend on third parties doing their jobs perfectly to ensure that it doesn’t fall over. This applies to a lot of things in life, but especially in business: if your business depends entirely on an uninvolved third party, you’re in trouble. Ideally an OS should be able to survive a driver failure without blue-screening, and it’s not a good look when your changes to the kernel driver interface break some fairly recent SSDs, and then, when further driver changes cause BSODs, you proceed to blame the driver manufacturers.
I’ve harped on this previously, but the driver thing is something I really struggle to understand. Is there really an expectation that APIs are now constantly changing, and that all hardware manufacturers must play a constant game of catch-up for every single piece of hardware they’ve ever released? If your computer is not brand new but still works perfectly and does everything you need, is it now normal to expect that an update can be released in the middle of the night to turn it into a brick, and that you’ll just shrug and hurl it out a window while saying “I should have backed up”? It doesn’t paint a good picture when the most frightening thing out there isn’t a hacker targeting your machine but the security updates themselves.
Not doing anything on the telemetry and user feedback
Windows 10 collects loads of data from its users. While Microsoft assures everyone that nothing is personally identifiable and that the data collection is very limited and restrained, some people harbor doubts about this. Which is strange; after all, when have massive, trillion-dollar, world-spanning corporations ever lied to people to make money? If they were caught lying there would be huge financial penalties that would endanger the company; they’d never get away with just a formulaic apology and no penalties at all. To those of you running the latest sarcasm detectors, I owe an apology, as they’ve probably melted. Naturally people are skeptical about Microsoft’s intentions, especially since it’s not explicitly clear what is being collected and the EULA gives them extreme latitude. We do know that it includes everything typed into the start menu and… anything written with the pen, your system’s encryption key if the disk is encrypted, and the passwords to every wifi network your system has connected to. Microsoft also takes every possible opportunity to ask for feedback. So why don’t they ever do anything with it?
Nothing is more annoying than typing out a well-thought-out reply into a feedback box, watching it plunge into oblivion, and getting an automated email in response. Or giving a low rating and being redirected to another automated form, which is also ignored, with its own follow-up asking if you’re happy “now”. How could you not be, after they’ve pretended to care?
We know they’re not doing anything with that feedback, because if they were, the fiasco of user files disappearing during the Windows 10 major version upgrades wouldn’t have happened; it was reported at least three times. How hard would it be to get one or two people to sort through the feedback and pass things on to the engineers and QA people? Given the complaints, news articles and resulting fallout, there has to be a point where it’s easier to actually do QA and read the feedback than to keep not caring.
What’s the learning
Of course this is a bit hyperbolic. For one thing, we know that Microsoft (and, for that matter, most companies that request feedback) do read some of it. However, it’s a tiny percentage, and it only really affects things in aggregate; for example, if an update goes out and deletes data on several hundred thousand computers, the telemetry helps them know to pull that update. Although at that point they’d already be reading about the issue in The Register, so it doesn’t seem as useful as it could be. On the flip side, it’s super demoralizing to be pestered for feedback constantly. For my part, I’d suggest developers make the feedback process silent: as the people running a website or internet-connected platform, we already see the server calls being made, so rather than requesting feedback through popups that only interrupt people while they’re doing something, you can get similar information by checking which accounts are more active after changes.
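A sketch of what that silent, aggregate feedback might look like: compare each account’s activity before and after a release instead of asking anyone anything. The event format and numbers here are invented for illustration:

```python
def activity_delta(events, release_time):
    """Per-account change in event counts before vs. after a release."""
    before, after = {}, {}
    for account, when in events:
        bucket = before if when < release_time else after
        bucket[account] = bucket.get(account, 0) + 1
    return {a: after.get(a, 0) - before.get(a, 0)
            for a in set(before) | set(after)}

# alice got quieter after the release; bob got more active.
events = [("alice", 1), ("alice", 2), ("alice", 9),
          ("bob", 3), ("bob", 8), ("bob", 9)]
print(activity_delta(events, release_time=5))
```

If activity drops across the board after a change, that’s feedback too, and nobody had to fill in a form to deliver it.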
So far I’ve highlighted a few problems that I have with the result of the development process, and I’ll go into specific bugs that made it through testing and their effects later. However I think the most important lesson to take away is about the relationship between a software development company and their client base (and between people and the tools they use).
People buy tools (in this case operating systems) to solve problems (run their software). They don’t really care what’s going on under the hood, but they do care if what happens under the proverbial hood interferes with their software, even if the developers swear it’s going to be better. It becomes a real problem when that interference arises from developer attempts to monetise what is already an incredibly profitable product.
It’s not that I have some strange grudge against capitalism. It’s more that when Microsoft changes something, like adding advertisements in an update that breaks company software, or updating the driver framework in a way that breaks older hardware that works perfectly and still has years left in it, then we, as the devops / developers / closest person who can fix things, have to field the call to fix it. If you’re running a company that uses this software, you’ve just suffered a financial loss (inflicted on you out of the blue, without asking or giving anything in exchange) to help improve Microsoft’s bottom line. You can see why the end user might be less than happy about this. And thanks to updates now being bundled together, with more changes in each one, the end user no longer has a choice in what they’re going to get.
Don’t say LTSB. That only works if you’re a large company willing to pay a premium for every single supported desktop. And bugs still make it into the LTSB editions of Windows: the OneDrive file-erasure bug was scheduled for LTSB release before user complaints got the update pulled from production.
If you’re someone who’s been happily using Windows 10 and hasn’t noticed anything off about it, that’s great; I’m very happy for you. The problem is that as older versions of Windows fall off the back of the support train, more and more problems with Microsoft’s development processes are showing up. This is terrifying for a company with decades of software development history that much of the world depends on, and far more terrifying when you consider that mandatory updates chain you and your company to them completely.
If there’s nothing else to gain from this, then as developers the least we can do is pay close attention to everything that happens, so we can learn to avoid those pitfalls in our own projects. Microsoft is a massive company that employs some of the smartest people in the world, and even they can misallocate resources and fall flat, in either coding or design.
The real lesson is probably more along the lines of “it’s not what you have, but how you use it”: even low-budget development projects have made incredibly large contributions to the software industry, and massive, state-funded projects have fallen flat. We shouldn’t fall into the trap of thinking that just because our teams are filled with smart people we can’t fail, and neither should we think that just because a team is small or untested it can’t succeed. It’s probably best to remember that solid planning and small iterative changes between working revisions are the best way to keep a product on stable footing. Also, don’t break backwards compatibility.