That's a ridiculous statement. There is no "minimum pixel size". There is a maximum pixel size if you don't want to throw away too much analog resolution, but no minimum pixel size. If your pixels start getting small relative to the size of the diffraction blur (and any aberration blur), then you get diminishing returns, but diminishing returns are not losses or "negative returns"; they are just smaller gains. What is a small gain in "acuity wow factor" can still be a huge gain in avoiding aliasing.
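For a rough sense of scale, here is a back-of-the-envelope sketch in Python, assuming green light at 0.55 microns and the standard Airy-disk formula; the apertures and pitches are just illustrative:

```python
# Rough scale check: diffraction blur versus pixel size.
# Airy disk diameter (to the first dark ring) ~ 2.44 * wavelength * f-number.
wavelength_um = 0.55                       # green light, assumed

for f_number in (2.8, 5.6, 11):
    airy_um = 2.44 * wavelength_um * f_number
    print(f"f/{f_number}: Airy diameter ~ {airy_um:.1f} um")
    for pitch_um in (2.0, 4.0, 10.0):
        print(f"  {pitch_um:4.1f} um pixels -> blur spans ~"
              f"{airy_um / pitch_um:.1f} pixels")
```

Even when the blur spans several pixels, the extra samples only mean smaller per-pixel gains; nothing is lost by sampling the blur finely.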
What you and your references are chasing is thresholds; thresholds that treat pixel-level sharpness as a quality to be pursued as an end in itself, completely independent of scale and context.
So let's say we had a magic sensor where you could turn a dial and change pixel density/size, and for the optics chosen, your reference chose 4 microns as the minimum. As you dial the sensor from 4 microns up to 10 microns, pixel-level sharpness may increase, but it is obvious that the image is becoming more pixelated and losing something, namely the number of details recorded, while aliasing gets worse. Dial back down to 4 microns, and you still have a lot of potential contrast between neighboring pixels, but less pixelation and aliasing and more "details" than at 10 microns. Dial down to 2 microns, though, and you lose a bit of pixel-level sharpness or neighbor-pixel contrast, but what you have is superior to what you would have at 4 microns: it has far less potential for aliasing, and it survives interpolative editing better (things like CA correction, horizon leveling, perspective correction, any further resampling, etc.).
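Here is a minimal 1-D sketch of that dial, assuming numpy, a Gaussian stand-in for the combined lens/diffraction blur (sigma of 2 microns), and an 80 cycles/mm test detail; all of these numbers are illustrative, not taken from the reference:

```python
import numpy as np

dx = 0.1                              # fine "analog" grid, microns
x = np.arange(0, 2000, dx)            # a 2 mm strip of scene
scene = np.sin(2 * np.pi * x / 12.5)  # 80 cycles/mm test detail

# Lens + diffraction blur, crudely modeled as a Gaussian PSF, sigma = 2 um
sigma = 2.0
k = np.arange(-5 * sigma, 5 * sigma + dx, dx)
psf = np.exp(-0.5 * (k / sigma) ** 2)
psf /= psf.sum()
on_sensor = np.convolve(scene, psf, mode="same")

for pitch in (2.0, 4.0, 10.0):        # candidate pixel pitches, microns
    step = int(round(pitch / dx))
    n = len(on_sensor) // step * step
    pixels = on_sensor[:n].reshape(-1, step).mean(axis=1)   # box sampling
    spec = np.abs(np.fft.rfft(pixels - pixels.mean()))
    freq = np.fft.rfftfreq(len(pixels), d=pitch / 1000.0)   # cycles/mm
    print(f"{pitch:4.1f} um pitch: strongest response at "
          f"{freq[spec.argmax()]:5.1f} c/mm "
          f"(Nyquist {1000 / (2 * pitch):.0f} c/mm)")
```

At the 2 and 4 micron pitches the strongest response stays at the real 80 cycles/mm; at 10 microns it folds down to a false ~20 cycles/mm, which is the pixelation/aliasing described above.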
So, what that recommended minimum pixel size is really about is maintaining some arbitrary level of pixel-level sharpness, which is not actually a quality at all when the pixel count increases as the pixels get smaller. A spatially-analog sensor would have no pixel-level sharpness, because it has no pixels, so how could a pixelated version of it have any better information? The smaller the pixels, the more you approach a virtually-analog capture.
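One way to see that: the coarser capture can always be synthesized from the finer one by binning, but never the reverse. A toy sketch with ideal box pixels (hypothetical numbers; noise and fill factor ignored):

```python
import numpy as np

dx = 0.1
x = np.arange(0, 1000, dx)                    # 1 mm strip, fine grid (um)
light = 1 + 0.5 * np.sin(2 * np.pi * x / 25)  # 40 cycles/mm detail

def box_sample(signal, pitch_um):
    """Average the fine grid over each pixel's width (ideal box pixels)."""
    step = int(round(pitch_um / dx))
    n = len(signal) // step * step
    return signal[:n].reshape(-1, step).mean(axis=1)

native_4um = box_sample(light, 4.0)            # a native 4 um capture
fine_2um = box_sample(light, 2.0)              # a native 2 um capture
binned = fine_2um.reshape(-1, 2).mean(axis=1)  # 2 um capture binned 2:1

print(np.allclose(binned, native_4um))         # True
# There is no operation that turns native_4um back into fine_2um.
```

Binning the 2 micron samples reproduces the native 4 micron capture, so the finer capture contains everything the coarser one does, and then some.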
Most of these photo-culture "match the pixel size to the optics" exercises allow for unnecessary aliasing and undersampling, and one could argue that the point of "very small returns", if we are trying to avoid aliasing and undersampling, actually comes at pixel densities up to 10 times those chosen by people who are fooled by pixel-level sharpness as an end in itself.
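To put rough numbers on that: for a diffraction-limited lens the incoherent cutoff frequency is 1/(lambda*N), so critical (Nyquist) sampling needs a pitch of lambda*N/2. A quick calc against the 4 micron figure quoted earlier, assuming 0.55 micron light; the apertures are illustrative:

```python
# How small must pixels be to avoid aliasing a diffraction-limited lens?
# Incoherent cutoff: 1/(lambda * N); critical pitch: lambda * N / 2.
wavelength_um = 0.55          # assumed green light
quoted_minimum_um = 4.0       # the "minimum pixel size" from the reference

for f_number in (2.8, 5.6, 8, 11):
    critical_pitch_um = wavelength_um * f_number / 2
    density_ratio = (quoted_minimum_um / critical_pitch_um) ** 2
    print(f"f/{f_number}: critical pitch {critical_pitch_um:.1f} um "
          f"-> ~{density_ratio:.0f}x the areal pixel density of a 4 um sensor")
```

Depending on the aperture, critically sampling the diffraction cutoff alone can call for several times the areal pixel density of a 4 micron sensor, before even considering lenses that are not diffraction-limited at wider apertures.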
Imagine two possible dials or controls: one that makes the lens sharper or softer, and one that varies pixel density. It makes sense, in most cases, to turn the lens dial to "sharper optics" when pixel density is fixed, and that will increase pixel-level sharpness. That is a completely different thing, though, from turning the pixel density dial to "sharper pixels" by decreasing the pixel density. The two should never be conflated.
What is the imaging currency of each pixel? It is undefined until we also include the number of such pixels used to render a subject. It is a tragicomedy that so many photographers miss the forest (image) for the trees (pixels), and then, when challenged, run to bigger sensor areas to reassure themselves that bigger pixels are better.
Anyone is free to "like" the look of larger pixels, but they are wrong if they claim that bigger pixels can give more detail, all other things being equal. What bigger pixels give is more artifacts from under-sampling a given unit of sensor area.
But, all else being equal, does not a larger pixel capture more light due to its larger area, and so have a better signal-to-noise ratio? Better S/N seems to me to produce an image with more detail if there are the same number of pixels. But usually not all else is equal.
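In the shot-noise-limited case the per-pixel S/N does scale with pixel area, but the S/N per unit of sensor area does not; binning the small pixels gets the big pixel's S/N back. A toy Monte Carlo sketch (read noise and fill-factor losses ignored; the exposure level is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
mean_photons_per_um2 = 25          # hypothetical exposure level
trials = 100_000

# One 4 um pixel vs. four 2 um pixels covering the same 16 um^2 of sensor
big = rng.poisson(mean_photons_per_um2 * 16, trials)
small = rng.poisson(mean_photons_per_um2 * 4, (trials, 4))
binned = small.sum(axis=1)

def snr(x):
    return x.mean() / x.std()

print(f"one 4 um pixel         : S/N ~ {snr(big):.1f}")
print(f"one 2 um pixel         : S/N ~ {snr(small[:, 0]):.1f}")
print(f"four 2 um pixels binned: S/N ~ {snr(binned):.1f}")
```

At the same output scale the finer sensor gives up nothing here, while keeping the option of resolving finer detail; per-pixel read noise is the main place where "not all else is equal" in practice.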
Thinking about it, the Q metric is even more applicable to high-magnification photography than to photography writ large, since in high-magnification work you are likely to be diffraction-limited.
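For reference, Q is usually defined as Q = lambda * N / p (pixel pitch p), with Q = 2 meaning the pixels critically sample the diffraction cutoff; below 2 is aliasing-prone, above 2 is oversampled. A quick sketch, assuming 0.55 micron light and, roughly, an effective f-number of N*(1+m) at magnification m:

```python
# Q = lambda * N_eff / pixel_pitch; Q = 2 is critical (Nyquist) sampling
# of the diffraction cutoff. At high magnification the effective f-number
# grows roughly as N * (1 + m), so Q climbs quickly.
wavelength_um = 0.55   # assumed green light

def q_metric(f_number, magnification, pitch_um):
    n_eff = f_number * (1 + magnification)
    return wavelength_um * n_eff / pitch_um

for m in (0, 1, 3):                     # magnification: distant, 1:1, 3:1
    for pitch in (2.0, 4.0):
        q = q_metric(8, m, pitch)
        label = "aliasing-prone" if q < 2 else "critically sampled or better"
        print(f"f/8, m={m}, {pitch:.0f} um pixels: Q = {q:.1f}  ({label})")
```

At 1:1 and beyond, even modest nominal apertures push Q well past 2, which is why the diffraction-limited assumption behind Q holds almost automatically in high-magnification work.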