Humans can quickly and easily identify objects. This ability is widely thought to be supported by representations in ventral temporal cortex (VTC). However, prior evidence for this claim has not sufficiently distinguished whether VTC specifically represents objects or simply represents complex visual features regardless of their spatial arrangement, i.e., texture. If VTC directly supports object perception, human performance in discriminating objects from textures with scrambled object features should be predicted by the representational geometry of VTC. To test this prediction, we leveraged an image synthesis approach that, unlike previous methods, provides independent control over the complexity and the spatial arrangement of visual features. In a conventional categorization task, VTC responses indeed predicted human behavior. However, in a perceptual task in which subjects discriminated real objects from synthesized textures containing matching features in a scrambled arrangement, VTC representations failed to predict human performance. Whereas human observers were highly sensitive in detecting the real object, VTC representations were sensitive only to the complexity of features, not to their spatial arrangement, and were therefore unable to identify the real object amidst textures with matching features. We found the same insensitivity to feature arrangement, and the same inability to predict human performance, in a model of macaque inferotemporal cortex and in ImageNet-trained deep convolutional neural networks. These results suggest that representations in human VTC and in state-of-the-art visual recognition models cannot directly predict perception. How, then, might texture-like representations in VTC support object perception?
We found that a category-specific linear readout of VTC yielded a representation that was more selective for natural feature arrangement, demonstrating that the information necessary to directly support object perception is accessible, though accessing it requires prior experience and additional neural computation. Taken together, our results suggest that the role of human VTC is not to explicitly encode a fixed set of objects but rather to provide a basis set of texture-like features that can be infinitely reconfigured to flexibly learn and identify new object categories.