Drawing on recent findings in the cognitive and neurosciences, this article discusses how people manage to predict each other’s actions, which is fundamental for joint action. We explore how a common coding of perceived and performed actions may allow actors to predict the what, when, and where of others’ actions. The “what” aspect refers to predictions about the kind of action the other will perform and to the intention that drives the action. The “when” aspect is critical for all joint actions requiring close temporal coordination. The “where” aspect is important for the online coordination of actions because actors need to effectively distribute a common space. We argue that although common coding of perceived and performed actions alone is not sufficient to enable one to engage in joint action, it provides a representational platform for integrating the actions of self and other. The final part of the article considers links between lower‐level processes like action simulation and higher‐level processes like verbal communication and mental state attribution that have previously been the focus of joint action research.