What's to stop someone from just scraping the web for audio files and putting them into a db, then playing back the matching clip whenever a given color, object, or whatever is asked for? I'm sure you could find many audio clips of someone saying the color green and just use a random one each time.
Also I would be curious to know what percentage of the people with internet access have a microphone.
It could change to different topics, like "a cow goes...". There are lots of different variations to protect against that. It would also take a lot of resources to mount a bot attack against that.
The access to a mic is a big concern that I didn't take into account....
What if it were reversed, and speech gave instructions like "click in the third box in the second column" and there was a grid of boxes for the user to click?
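To make the reversed idea concrete, here is a rough sketch of how the server side might work. Everything in it is an assumption for illustration: the names `new_challenge` and `check_click`, the 4x4 grid size, and the exact wording of the prompt are all made up. The prompt text would be handed to some text-to-speech step (not shown), and the expected cell would have to stay on the server, never in the page.

```python
import secrets

# Hypothetical helper names and grid size; nothing here comes from a real library.
ORDINALS = ["first", "second", "third", "fourth", "fifth"]

def new_challenge(rows=4, cols=4):
    """Pick a random cell; return (prompt_text, (row, col)) with 0-based indices."""
    if rows > len(ORDINALS) or cols > len(ORDINALS):
        raise ValueError("grid is larger than the ordinal words available")
    row = secrets.randbelow(rows)  # position of the box within the column
    col = secrets.randbelow(cols)  # which column
    # This text is what would be spoken to the user instead of displayed.
    prompt = f"Click the {ORDINALS[row]} box in the {ORDINALS[col]} column"
    return prompt, (row, col)

def check_click(expected, clicked):
    """Return True only if the clicked (row, col) matches the stored answer."""
    return tuple(clicked) == tuple(expected)

if __name__ == "__main__":
    prompt, answer = new_challenge()
    print(prompt)                       # would go to text-to-speech
    print(check_click(answer, answer))  # a correct click passes
```

With a 4x4 grid a blind guess still passes 1 time in 16, so a real version would probably need a bigger grid or several rounds in a row.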