That's a ridiculous statement. There is no "minimum pixel size". There is a maximum pixel size if you don't want to throw away too much analog resolution, but no minimum pixel size. If your pixels start getting small relative to the size of the diffraction blur (and any aberration blur), then you have diminishing returns, but diminishing returns are not losses or "negative returns"; they are just smaller gains. And what is a small gain as far as "acuity wow factor" goes can sometimes be a huge gain in avoiding aliasing.
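Just to put rough numbers on "small relative to the size of the diffraction blur", here is a minimal sketch, assuming a diffraction-limited lens, green light, and the usual first-zero Airy disk diameter of 2.44 * wavelength * N; the f-numbers and pitches are only illustrative:

```python
# Rough comparison of the diffraction blur diameter to a few pixel pitches.
# Assumes a diffraction-limited lens; first-zero Airy diameter = 2.44 * wavelength * N.
WAVELENGTH_UM = 0.55  # green light (assumed)

def airy_diameter_um(f_number):
    """First-zero Airy disk diameter, in microns."""
    return 2.44 * WAVELENGTH_UM * f_number

for f_number in (2.8, 5.6, 11.0):
    blur = airy_diameter_um(f_number)
    for pitch in (10.0, 4.0, 2.0):  # illustrative pixel pitches, in microns
        print(f"f/{f_number:g}: blur {blur:.1f} um, pixel {pitch:g} um "
              f"-> {blur / pitch:.1f} pixels across the blur")
```

A pixel the size of the Airy disk puts roughly one sample across the blur diameter, which is nowhere near critical sampling; the smaller pitches put several samples across it at moderate and small apertures.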
What you and your references are chasing is thresholds; thresholds that treat pixel-level sharpness as a quality to be pursued as an end in itself, completely independent of scale and context.
So let's say we had a magic sensor where you could turn a dial and change pixel density/size, and for the optics chosen, your reference picked 4 microns as the minimum. As you dial the sensor from 4 microns up to 10 microns, pixel-level sharpness may increase, but it is obvious that the image is becoming more pixelated and is losing something, namely the number of details recorded, while aliasing gets worse. Dial back down to 4 microns, and you still have a lot of potential contrast between neighboring pixels, but less pixelation and aliasing and more "details" than at 10 microns. Dial down to 2 microns, though, and you lose a bit of pixel-level sharpness or neighbor-pixel contrast, but what you have is superior to what you would have at 4 microns, has far less potential for aliasing, and survives interpolative editing better (things like CA correction, horizon leveling, perspective correction, any further resampling, etc.).
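You can see the dial's effect on aliasing in a toy 1-D simulation; this is only a sketch, with an assumed fixed Gaussian optical blur and a single sinusoidal detail whose frequency (62.5 lp/mm) sits beyond the Nyquist limit of a 10-micron pitch but within that of 4 and 2 microns:

```python
import numpy as np

# Toy 1-D model of the "pixel density dial": the same blurred scene sampled
# at 10, 4, and 2 micron pitches. The scene detail is a 16-micron-period sine
# (62.5 lp/mm), which is beyond Nyquist only for the 10 micron pitch.
dx = 0.1                                   # microns per sample of the quasi-continuous scene
x = np.arange(0, 2000, dx)                 # a 2 mm strip, in microns
detail_lp_mm = 62.5
scene = np.sin(2 * np.pi * (detail_lp_mm / 1000.0) * x)

# Fixed optical blur (an assumption): Gaussian with sigma = 2 microns.
sigma = 2.0
kx = np.arange(-8 * sigma, 8 * sigma + dx, dx)
kernel = np.exp(-0.5 * (kx / sigma) ** 2)
kernel /= kernel.sum()
blurred = np.convolve(scene, kernel, mode="same")

for pitch in (10.0, 4.0, 2.0):
    step = int(round(pitch / dx))
    # Each pixel averages the blurred scene over its aperture (box filter), then we sample.
    averaged = np.convolve(blurred, np.ones(step) / step, mode="same")
    samples = averaged[::step]
    # Find the dominant recorded spatial frequency, in lp/mm.
    spectrum = np.abs(np.fft.rfft(samples))
    spectrum[0] = 0.0                                          # ignore the DC term
    freqs = np.fft.rfftfreq(samples.size, d=pitch / 1000.0)    # cycles per mm
    recorded = freqs[np.argmax(spectrum)]
    nyquist = 1000.0 / (2.0 * pitch)
    print(f"{pitch:4.1f} um pitch: Nyquist {nyquist:5.1f} lp/mm, "
          f"{detail_lp_mm} lp/mm detail recorded as {recorded:.1f} lp/mm")
```

At the 10-micron setting the detail folds back to roughly 37.5 lp/mm as false, coarser detail; at 4 and 2 microns it is recorded at its true frequency.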
So what that recommended minimum pixel size is really about is a minimum for some arbitrary level of pixel-level sharpness, which is not actually a quality at all when the number of pixels used increases as they get smaller. A spatially-analog sensor would have no pixel-level sharpness, because it has no pixels, so how could a pixelated version of it have any better information? The smaller the pixels, the more you approach a virtually-analog capture.
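The same kind of toy setup can illustrate "approaching a virtually-analog capture": treat a very finely sampled, blurred scene as the analog reference, sample it at coarser pitches, reconstruct, and measure the error. A sketch only; the scene, the blur, and the linear-interpolation reconstruction are all assumptions:

```python
import numpy as np

# Sketch: how finely pixelated captures converge toward an "analog" reference.
# The reference is a quasi-continuous blurred scene; each pitch box-averages it,
# samples it, and reconstructs it by linear interpolation.
dx = 0.1
x = np.arange(0, 2000, dx)                      # a 2 mm strip, in microns
rng = np.random.default_rng(0)
# Broadband detail: a few sines (frequencies in lp/mm) with random phases.
scene = sum(np.sin(2 * np.pi * f * x / 1000.0 + p)
            for f, p in zip((20, 45, 80, 110), rng.uniform(0, 2 * np.pi, 4)))

sigma = 2.0                                     # fixed optical blur in microns (assumed)
kx = np.arange(-8 * sigma, 8 * sigma + dx, dx)
kernel = np.exp(-0.5 * (kx / sigma) ** 2)
kernel /= kernel.sum()
reference = np.convolve(scene, kernel, mode="same")

for pitch in (10.0, 4.0, 2.0, 1.0):
    step = int(round(pitch / dx))
    samples = np.convolve(reference, np.ones(step) / step, mode="same")[::step]
    reconstructed = np.interp(x, x[::step], samples)
    rms = np.sqrt(np.mean((reconstructed - reference) ** 2))
    print(f"{pitch:4.1f} um pitch: RMS error vs. the analog reference = {rms:.3f}")
```

The error shrinks steadily as the pitch shrinks; no setting beats the analog reference, but the finer pitches get closer and closer to it.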
Most of these photo-culture "match the pixel size to the optics" exercises are allowing for unnecessary aliasing and undersampling, and one could argue that the point of "very small returns", if we are trying to avoid aliasing and undersampling, actually lies at pixel densities up to 10 times those chosen by people who are fooled by pixel-level sharpness as an end in itself.
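For a sense of scale, consider the diffraction-limited case: the incoherent cutoff frequency is 1/(wavelength * N), so critical (Nyquist) sampling calls for a pitch of wavelength * N / 2, far finer than an Airy-disk-sized pixel. A rough sketch, assuming green light and ignoring aberrations, the Bayer CFA, and AA filters:

```python
# Pitch required to critically sample a diffraction-limited lens:
# incoherent cutoff = 1 / (wavelength * N), so Nyquist pitch = wavelength * N / 2.
WAVELENGTH_UM = 0.55  # green light (assumed)

for f_number in (2.8, 4.0, 5.6, 8.0, 11.0):
    nyquist_pitch = WAVELENGTH_UM * f_number / 2.0
    airy_pixel = 2.44 * WAVELENGTH_UM * f_number      # "pixel the size of the Airy disk" rule
    print(f"f/{f_number:g}: Nyquist pitch {nyquist_pitch:.2f} um, "
          f"Airy-sized pixel {airy_pixel:.1f} um "
          f"({airy_pixel / nyquist_pitch:.1f}x coarser linearly, "
          f"{(airy_pixel / nyquist_pitch) ** 2:.0f}x fewer pixels)")
```

However you count it, linearly or by pixel count, the gap between "Airy-sized pixels" and critical sampling is large at every aperture.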
Imagine two possible dials or controls: one that makes the lens sharper or softer, and one that varies pixel density. It makes sense, in most cases, to turn the lens dial toward "sharper optics" when pixel density is fixed, and that will increase pixel-level sharpness. That is a completely different thing, though, from turning the pixel density dial toward "sharper pixels" by decreasing the pixel density. The two should never be conflated.
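One way to see why the two dials are different: treat "pixel-level sharpness" as the system MTF at the pixel's own Nyquist frequency. Under a simple assumed model (Gaussian optics times the pixel-aperture sinc), both dials raise that number, but only the optics dial does so without shrinking the frequency band the sensor can record:

```python
import numpy as np

# "Pixel-level sharpness" modelled (an assumption) as the system MTF at the
# pixel's own Nyquist frequency: Gaussian optics times the pixel-aperture sinc.
#   MTF(f) = exp(-2 * pi**2 * sigma**2 * f**2) * |sinc(pitch * f)|
def mtf_at_pixel_nyquist(sigma_um, pitch_um):
    f_nyq = 1.0 / (2.0 * pitch_um)                    # cycles per micron
    optics = np.exp(-2.0 * np.pi ** 2 * sigma_um ** 2 * f_nyq ** 2)
    aperture = abs(np.sinc(pitch_um * f_nyq))         # np.sinc(x) = sin(pi*x)/(pi*x)
    return optics * aperture, 1000.0 * f_nyq          # (MTF, Nyquist in lp/mm)

settings = [("baseline (sigma 2 um, pitch 4 um)  ", 2.0, 4.0),
            ("optics dial: sharper lens (sigma 1)", 1.0, 4.0),
            ("density dial: coarser pitch (10 um)", 2.0, 10.0)]
for label, sigma, pitch in settings:
    mtf, nyq = mtf_at_pixel_nyquist(sigma, pitch)
    print(f"{label}: MTF at pixel Nyquist = {mtf:.2f}, recorded band = 0-{nyq:.0f} lp/mm")
```

The coarser pitch reports the highest "per-pixel sharpness" of the three settings while capturing the narrowest band, which is exactly the conflation being made.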