But basically, one unspent output can only be spent in one confirmed transaction.
Once a transaction spending a certain unspent output is confirmed, all unconfirmed transactions spending the same unspent output become invalid.
Correct, if we see it from the perspective of someone looking at the blockchain at a later moment, when the transaction already has a lot of confirmations.
However, looking at it in "real time", one of the two transactions could land in a stale block (popularly known as an "orphan"). If a service which accepts 1-conf transactions is unlucky enough to be better connected to the miner who mined the stale block than to the one who mined the competing "canonical" block, then a double spend attack could succeed.
Thus in theory it would be interesting if someone repeated the experiment the OP made, but on a continuous basis (e.g. for 24 hours, trying to get 2 transactions included per block), sending both txes at the same time from different locations (preferably geographically distinct ones, e.g. one from a server in the US and the other from Asia), to see whether the outcome matches the "theoretical probability" of this kind of attack succeeding. Of course one could also use Testnet, but the outcome wouldn't be that interesting, as the network structure of testnet should be drastically different.
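
To make the "theoretical probability" a bit more concrete, here is a rough Python sketch of a toy model. It is only an illustration, not a measurement: it assumes the attack needs three things to happen for a given block, namely a block race that produces a stale block, the payment to the service landing in the stale block while the conflicting tx lands in the canonical one, and the service seeing the stale block first. The function names and all three probabilities are hypothetical placeholders, and the model treats the three events as independent, which the real network may not respect.

```python
import random

# Toy model, all parameters are assumptions chosen for illustration:
#   p_race        chance that a given block height produces a stale block
#   p_split       chance that, in a race, the payment to the service is in
#                 the stale block and the conflicting tx in the canonical one
#   p_stale_first chance that the service sees the stale block first

def analytic_success_probability(p_race, p_split, p_stale_first):
    """Per-block success probability, treating the three events as independent."""
    return p_race * p_split * p_stale_first

def simulate(n_blocks, p_race, p_split, p_stale_first, seed=0):
    """Monte Carlo estimate of the same per-block success probability."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_blocks):
        race = rng.random() < p_race
        split = rng.random() < p_split
        stale_first = rng.random() < p_stale_first
        if race and split and stale_first:
            successes += 1
    return successes / n_blocks

def expected_successes(p_per_block, n_blocks=144):
    """Expected successful double spends over n_blocks attempts.

    144 is roughly one day of blocks at a 10-minute block interval.
    """
    return p_per_block * n_blocks

if __name__ == "__main__":
    p = analytic_success_probability(0.01, 0.5, 0.5)
    print("analytic per-block probability:", p)
    print("simulated per-block probability:", simulate(200_000, 0.01, 0.5, 0.5))
    print("expected successes in ~24 hours:", expected_successes(p))
```

A real experiment would replace the placeholder probabilities with observed values, e.g. the measured stale rate during the test period, and compare the number of successful double spends against what `expected_successes` predicts.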