Introduction

Artificial deep neural networks (DNNs) are the most predictive models of neural responses to images in the primate high-level visual cortex1, 2. Many studies have reported that DNNs trained to perform image classification produce internal feature representations broadly similar to those in areas V4 and IT of the primate cortex, and that this similarity tends to be greater in models with better classification performance3. However, it remains opaque what aspects of the representations of these more performant models drive them to better match neural data. Moreover, beyond a certain threshold level of object classification performance, further improvement fails to produce a concomitant improvement in predicting primate neural responses2, 4, 5. This weakening trend motivates finding new normative principles, besides object classification ability, that push models to better match primate visual representations.

One strategy for achieving high object classification performance is to form representations that discard (are invariant to) all information besides object class. However, high-level cortical neurons in the primate ventral visual stream are known to encode other variables in visual input besides object identity, such as object pose6–9. How do visual representations encode these multiple forms of information simultaneously? Here, we introduce methods to quantify the relationships between different types of visual information in a population code (e.g., object pose vs. camera viewpoint), and in particular the degree to which different forms of information are “factorized.” Intuitively, if the variance driven by one parameter is encoded independently from, or uncorrelated with, the variance driven by other scene parameters, we say this code is factorized. Factorization is notably distinct from building invariance to scene parameters, which renders neural activity insensitive to variation in those parameters. Invariance is a zero-sum strategy: building invariance to some parameters improves the ability to decode others, but at the cost of discarding information about the parameters made invariant. Factorization, by contrast, can enable simultaneous decoding of many parameters at once, supporting diverse visually guided behaviors (e.g., spatial navigation, object manipulation or object classification)10. We note that our definition of factorization is closely related to the existing concept of manifold disentanglement11, 12 and can be seen as a generalization of disentanglement to high-dimensional visual scene parameters like object pose.

Across a broad library of DNN models that varied in their architecture and training objectives, we found that factorization of scene parameters in DNN feature representations was positively correlated with models’ matches to neural and behavioral data. While invariance to some scene parameters (background scene and lighting conditions) predicted neural fits, invariance to others (object pose and camera viewpoint) did not. Our results generalized across both monkey and human datasets using different measures (neural spiking, fMRI, and behavior; 12 datasets total) and could not be accounted for by models’ classification performance. Thus, we suggest that factorized encoding of multiple behaviorally-relevant scene variables is an important consideration in building more brainlike models that capture scene understanding as performed by biological vision.

Results

Disentangling object identity manifolds in population responses can be achieved by qualitatively different strategies, including building invariance of responses to non-identity scene parameters and/or factorizing non-identity-driven response variance into isolated (factorized) subspaces (Figure 1A, left vs. center panels, cylindrical/spherical shaded regions represent object manifolds). Both strategies maintain an “identity subspace” in which object manifolds are linearly separable. In non-invariant and non-factorized representations, other variables like camera viewpoint also drive variance within the identity subspace, “entangling” the representations of the two variables (Figure 1A, right; viewpoint driven variance is mainly in identity subspace, orange flat shaded region). To formalize these different representational strategies, we introduced measures of factorization and invariance to scene parameters in neural population responses (Figure 1B; see Equations 1 and 2 in Methods). Concretely, invariance to a scene variable (e.g. object motion) is computed by measuring the degree to which varying that parameter alone changes neural responses, relative to the changes induced by varying other parameters (lower relative influence on neural activity corresponds to higher invariance). Factorization is computed by identifying the axes in neural population activity space that are influenced by varying the parameter of interest and assessing how much they overlap with the axes influenced by other parameters (“a” in Figure 1B,C) (lower overlap corresponds to higher factorization).
For instance, a neural population in which one neural subpopulation encodes object identity and another separate subpopulation encodes object position exhibits a high degree of factorization of those two parameters (however, note that factorization may also be achieved by neural populations with mixed selectivity in which the “subpopulations” correspond to subspaces, or independent orthogonal linear projections, of neural activity space rather than physical subpopulations). Factorization, unlike invariance, has the potential to enable the simultaneous representation of multiple scene parameters in a decodable fashion. Intuitively, factorization increases with higher dimensionality as this decreases overlap, all other things being equal (in the limit, the angle between points will approach 90° or a fully orthogonal code in high dimensions), and for a given finite, fixed dimension, factorization is mainly driven by the angle between this dimension and other variable subspaces which measures the degree of contamination (Figure 1C; square vs. parallelogram). In a simulation, the extent to which the variables of interest were represented in a factorized way (i.e. along orthogonal axes, rather than correlated axes) influenced the ability of a linear discriminator to successfully decode both variables in a generalizable fashion from a few training samples (Figure 1C).

Framework for quantifying factorization in neural and model representations

(A) A subspace for encoding object identity in a linearly separable manner can be achieved by becoming invariant to non-class variables (compact spheres, middle column; colored dots represent example images within each class) or by encoding variance induced by non-identity variables in orthogonal neural axes to the identity subspace (extended cylinders, left column), but only the factorization strategy simultaneously represents multiple variables in a disentangled fashion. A code that is sensitive to non-identity parameters within the identity subspace corrupts the ability to decode identity (right column) (identity subspace in orange). (B) Variance across images within a class can be measured in two different linear subspaces: that containing the majority of variance for all other parameters (a, “other_param_subspace”) and that containing the majority of the variance for that parameter (b, “param_subspace”). Factorization is defined as the fraction of parameter-induced variance that avoids the other-parameter subspace (left). By contrast, invariance to the parameter of interest is computed by comparing the overall parameter-induced variance to the variance in response to other parameters (c, “var_other_param”) (right). (C) In a simulation of coding strategies for two binary variables out of 10 total dimensions that are varying (see Methods), a decrease in orthogonality of the relationship between the encoding of the two variables, or a>0 (going from a square to a parallelogram geometry; only 3 of 10 dimensions used in simulation are shown), despite maintaining linear separability of variables, results in poor classifier performance in the few training-samples regime.

Given the theoretically desirable properties of factorized representations, we next asked whether such representations are observed in neural data, and how much factorization contributes empirically to downstream decoding performance in real data. Specifically, we took advantage of an existing dataset in which the tested images independently varied object identity versus object pose plus background context13. We found that both V4 and IT responses exhibited more significant factorization of object identity information from non-identity information than a shuffle control (which accounts for effects on factorization due to changes in dimensionality between regions) (Figure S1; see Methods). Furthermore, the degree of factorization increased from V4 to IT (Figure 2A). Consistent with prior studies, we also found that invariance to non-identity information increased from V4 to IT in our analysis (Figure 2A, right, solid lines)14. Invariance to non-identity information was even more pronounced when measured in the subspace of population activity capturing the bulk (90%) of identity-driven variance, as a consequence of increased factorization of identity from non-identity information (Figure 2A, right, dashed lines). To illustrate the beneficial effect of factorization on decoding performance, we performed a statistical lesion experiment that precisely targeted this aspect of representational geometry. Specifically, we analyzed a transformed neural representation obtained by rotating the population data so that inter-class variance more strongly overlapped with the principal components of the within-class variance in the data (see Methods). Note that this transformation, designed to decrease factorization, acts on the angle between latent variable subspaces. 
The applied linear basis rotation leaves all other activity statistics completely intact (such as mean neural firing rates, covariance structure of the population, and its invariance to non-class variables) yet has the effect of strongly reducing object identity decoding performance in both V4 and IT (Figure 2B). Our analysis shows that maintaining invariance alone in the neural population code was insufficient to account for a large fraction of decoding performance in the cortex; factorization of non-identity variables is key to the decoding performance achieved by V4 and IT representations.

Benefit of factorization to neural decoding in macaque V4 and IT

(A) Factorization of object identity and position increased from macaque V4 to IT (dataset E1 – multiunit activity in macaque visual cortex) (left). Like factorization, invariance also increased from V4 to IT (note, “identity” refers to invariance to all non-identity position factors, solid black line) (right). Combined with increased factorization of the remaining variance, this led to higher invariance within the variable’s subspace (orange lines), representing a neural subspace for identity information with invariance to nuisance parameters which decoders can target for read-out. (B) Applying a transformation to the data that rotated the relative positions of mean responses to object classes (see Methods), designed to preserve relevant activity statistics (including invariance to non-class factors) while decreasing factorization of class information from non-class factors, has the effect of reducing object class decoding performance (light vs. dark red bars, chance = 1/64; n=128 multi-unit sites in V4 and 128 in IT).

We next asked whether factorization is found in deep neural network (DNN) model representations and whether this previously unconsidered metric is a strong indicator of more brainlike models. When working with computational models, we have the liberty to test an arbitrary number of stimuli; therefore, we could independently vary multiple scene parameters at sufficient scale to enable computing factorization and invariance for each, and we explored factorization in DNN model representations in more depth than previously measured in existing neural experiments. To gain insight back into neural representations, we also assessed the ability of each model to predict separately collected neural and behavioral data. In this fashion, we may indirectly assess the relative significance of geometric properties like factorization and invariance to biological visual representations – if, for instance, models with more factorized representations consistently match neural data more closely, we may infer that those neural representations likely exhibit factorization themselves (Figure 3A). To measure factorization, invariance, and decoding properties of DNN models, we generated an augmented image set, based on the images used in the previous dataset (Figure 2A), in which we independently varied the foreground object identity, foreground object pose, background identity, scene lighting, and 2D scene viewpoint. Specifically, for each base image from the original dataset, we generated sets of images that varied exactly one of the above scene parameters while keeping the others constant, allowing us to measure the variance induced by each parameter relative to the variance across all scene parameters (Figure 3B; 100 base scenes and 10 transformed images for each source of variation). We presented this large image dataset to models (4000 images total) to assess the relative degree of representational factorization of and invariance to each scene parameter.
We conducted this analysis across a broad range of DNNs varying in architecture and objective as well as other implementational choices to obtain the widest possible range of DNN representations for testing our hypothesis. These included models using supervised training for object classification15, 16, contrastive self-supervised training17, 18, and self-supervised models trained using auxiliary objective functions19–22 (see Methods and Table S2).

Measurement of factorization in DNN models and relationship to neural predictivity

(A) Schematic showing how meta-analysis on models and brain data was conducted by first computing various representational metrics on models and then measuring a model’s predictive power across a variety of datasets. The combination of model-layer metric and model-layer dataset predictivity for a choice of model, layer, metric, and dataset specifies the coordinates of a single dot on the scatter plots in (C), and the across-model correlation coefficient between a particular representational metric and neural predictivity for a dataset summarizes the potential importance of the metric in producing more brainlike models (see Figure 4). (B) For computing the representational metrics of factorization of and invariance to a scene parameter, variance in model responses was induced by individually varying each of four scene parameters (n=10 parameter levels) for each base scene (n=100 base scenes). (C) Scatter plots for example neural dataset (IT single-units, macaque E2 dataset) showing the correlation between a model’s predictive power as an encoding model for IT neural data versus a model’s ability to factorize or become invariant to different scene parameters (each dot is a different model, using each model’s penultimate layer). Note that factorization in trained models is consistently higher than that for an untrained, randomly initialized ResNet-50 DNN architecture (rightward shift relative to yellow vertical dashed line). Invariance to background and lighting but not to object pose and viewpoint increased in trained models relative to the untrained control (rightward versus leftward shift relative to yellow vertical dashed line). (D) Same as (C) except for human behavior performance patterns across images (human I2 dataset).

First, we asked whether, in the course of training, DNN models develop factorized representations at all. We found that the final layers of trained networks exhibited consistent increases in factorization of all tested scene parameters relative to a randomly initialized (untrained) baseline with the same architecture (Figure 3C, top row, rightward shift relative to black cross, a randomly initialized ResNet-50). By contrast, training DNNs produced mixed effects on invariance, typically increasing it for background and lighting but reducing it for object pose and camera viewpoint (Figure 3C, bottom row, leftward shift relative to black cross for left two panels). Moreover, we found that the degree of factorization in models correlated with the degree to which they predicted neural activity for single-unit IT data (Figure 3C, top row), which can be seen as correlative evidence that neural representations in IT exhibit factorization of all scene variables tested. Interestingly, we saw a different pattern for representational invariance to a scene parameter. Invariance showed mixed correlations with neural predictivity (Figure 3C, bottom row), suggesting that IT neural representations build invariance to some scene information (background and lighting) but not to others (object pose and observer viewpoint). Similar effects were observed when we assessed correlations between these metrics and fits to human behavioral data (rather than macaque neural data) (Figure 3D).

To assess the robustness of these findings to choice of images and brain regions used in an experiment, we conducted the same analyses across a large and diverse set of previously collected neural and behavioral datasets, from different primate species and visual regions (6 macaque datasets13, 23, 24: two V4, two IT, and two behavior; 6 human datasets24–26: two V4, two HVC, and two behavior; Table S1). Consistently, increased factorization of scene parameters in model representations correlated with models being more predictive of neural spiking responses, voxel BOLD signal, and behavioral responses to images (Figure 4A, black bars; see Figure S2 for scatter plots across all datasets). Although invariance to appearance factors (background identity and scene lighting) correlated with more brainlike models, invariance for spatial transforms (object pose and camera viewpoint) consistently did not (zero or negative correlation values; Figure 4C, red and green circles). Our results were preserved when we re-ran the analyses using only the subset of models with the identical ResNet-50 architecture (Figure S3) or when we evaluated model predictivity using representational dissimilarity matrices of the population (RDM) instead of linear regression fits of individual neurons or voxels (Figure S4). Furthermore, the main finding of a positive correlation between factorization and neural predictivity was robust to the particular choice of PCA threshold we used to quantify factorization (Figure S5). Finally, we tested whether our results generalized across the particular image set used for computing the model factorization scores in the first place.
Here, instead of relying on our synthetically generated images, where each scene parameter was directly controlled, we re-computed factorization from two types of relatively unconstrained natural movies, one where the observer moves in an urban environment (approximates camera viewpoint changes)27 and another where objects move in front of a fairly stationary observer (approximates object pose changes)28. Similar to the result found for factorization measured using augmentations of synthetic images, factorization of frame-by-frame variance (local in time, presumably dominated by either observer or object motion; see Methods) from other sources of variance across natural movies (non-local in time) was correlated with improved neural predictivity in both macaque and human data while invariance to local frame-by-frame differences was not (Figure 4B; black versus gray bars). Thus, we have shown that a main finding – the importance of object pose and camera viewpoint factorization for achieving brainlike representations – holds across types of brain signal (spiking vs. BOLD), species (monkey vs. human), cortical brain areas (V4 vs. IT), images used in experiments (synthetic, grayscale vs. natural, color), and image sets for computing the metric (synthetic images vs. natural movies).

Scene parameter factorization correlates with more brainlike DNN models

(A) Factorization of scene parameters in model representations consistently correlated with a model being more brainlike across multiple independent datasets measuring monkey neurons, human fMRI voxels, or behavioral performance in both macaques and humans (left vs. right column) (black bars). By contrast, increased invariance to camera viewpoint or object pose was not indicative of brainlike models (gray bars). In all cases, model representational metric and neural predictivity score were computed by averaging scores across the last 5 model layers. (B) Recomputing camera viewpoint or object pose factorization from natural movie datasets that primarily contained camera or object motion, respectively (right: example movie frames; also see Methods), gave similar results for predicting which model representations would be more brainlike as computing these factorization scores using our synthetic images. (C) Summary of the results from (A) across datasets (x-axis) for invariance (open symbols) versus factorization (closed symbols).

Our analysis of DNN models provides strong evidence that greater factorization of a variety of scene variables is consistently associated with a stronger match to neural and behavioral data. Prior work has identified a similar correlation between object classification performance (measured by fitting a decoder for object class using model representations) and fidelity to neural data3. A priori, it is possible that the correlations we have demonstrated between scene parameter factorization and neural fit can be entirely captured by the known correlation between classification performance and neural fits2, 3, as factorization and classification may themselves be correlated. However, we found that factorization scores significantly boosted cross-validated predictive power of neural/behavioral fit performance compared to using object classification alone (Figure 5A). Thus, considering factorization in addition to object classification performance improves upon our prior understanding of the properties of more brainlike models (Figure 5B).

Scene parameter factorization correlates with more brainlike DNN models

(A) Average brain predictivity across datasets of classification (black solid bar) and factorization (colored solid bars) in a model representation. Adding factorization to classification in a regression model produced significant improvements in predicting the most brainlike models, cross-validated (across models) performance averaged across datasets (unfaded bars exceed dashed line for classification alone as a metric). (B) Example scatter plots for neural and fMRI datasets (IT multi-units & single-units, macaque E1 & E2; fMRI with grayscale & color images, human F1 & F2) showing a reversing trend in neural (voxel) predictivity for models that are increasingly good at classification (left column). This saturating/reversing trend is no longer present when adding object pose factorization to classification as a combined, predictive metric for brainlikeness of a model (right column).

Discussion

Object classification, which has been proposed as a normative principle for the function of the ventral visual stream, can be supported by qualitatively different representational geometries3, 29. These include representations that are completely invariant to non-class information11, 12 and representations that retain a high-dimensional but factorized encoding of non-class information, which disentangles the representation of multiple variables (Figure 1A, left). Here, we presented evidence that factorization of non-class information is an important strategy used, alongside invariance, by the high-level visual cortex (Figure 2) and by DNNs that are predictive of primate neural and behavioral data (Figures 3,4). Prior work has indicated that building representations that support object classification performance, and representations that preserve high-dimensional information about natural images, are both important principles of the primate visual system1, 30 (though see Conwell et al.31 who find that effective dimensionality alone may not be a robust predictor in high-level visual cortex). Our work unifies these two perspectives, as factorization of multiple scene variables enables high object identity decoding performance while preserving high-dimensional information for various other properties of the visual scene. Disentangled, factorized variable manifolds can enable task performance in the low-training sample regime (Figure 1C), which was the focus of recent work on few-shot object classification in the visual system32. Our work is also complementary to work on representational straightening of natural movie trajectories in the population space33. This work suggests that visual representations maintain a locally linear code of latent variables like viewpoint, while our work focuses on the global arrangement of the linear subspaces affected by different variables. 
The idea of local straightening of natural movies was found to be predictive for early visual cortex neural responses but not necessarily for high-level visual cortex34, where the present work suggests factorization may play a role.

Going forward, we expect factorization could prove to be a useful objective function for optimizing neural network models that better resemble primate visual systems. Complementing prior theoretical work on representational geometry for manifolds of a single variable, namely object class12, we show that the orthogonal relationships between population subspaces encoding different variables are a consistent principle of brain-like visual representations35. An important limitation of our work is that we do not specify the details of how a particular scene parameter is encoded within its factorized subspace. Neural codes could adopt different strategies resulting in similar factorization scores at the population level, each with some support in visual cortex literature: (1) Each neuron encodes a single latent variable36, 37, (2) Separate brain subregions encode qualitatively different latent variables but using distributed representations within each region38–40, (3) Each neuron encodes multiple variables in a distributed population code, such that the factorization of different variables is only apparent as independent directions when assessed in high-dimensional population activity space36, 41. Future work can disambiguate among these possibilities by systematically examining ventral visual stream subregions8, 40, 42 and single neuron tuning curves within them43, 44.

Methods

Monkey Datasets

Macaque monkey datasets were of single-unit neural recordings23, multi-unit neural recordings13, and object recognition behavior24. Single-unit spiking responses to natural images were measured in V4 and anterior ventral IT23. An advantage of this dataset is that it contains well-isolated single neurons, the gold standard for electrophysiology. Furthermore, the IT recordings were obtained from penetrating electrodes targeting the anterior ventral portion of IT near the base of the skull, reflecting the highest level of the IT hierarchy. On the other hand, the multi-unit dataset was obtained from across IT with a bias toward where multi-unit arrays are more easily placed such as CIT and PIT13, complementing the recording locations of the single-unit dataset. An advantage of the multi-unit dataset using chronic recording arrays is that an order of magnitude more images were tested per recording site (see dataset comparisons in Supplementary Table S1). Finally, the monkey behavioral dataset came from a third study examining the image-by-image object classification performance of macaques and humans24.

Human Datasets

Three datasets from humans were used: two fMRI datasets and one object recognition behavior dataset4, 24, 25. The fMRI datasets used different images (color versus grayscale) but otherwise used a fairly similar number of images and voxel resolution in imaging. Human fMRI studies have found that different DNN layers tend to map to V4 and HVC human fMRI voxels4. The human behavioral dataset measured image-by-image classification performance and was collected in the same study as the monkey behavioral signatures24.

Computational Models

In recent years, a variety of approaches to training DNN vision models have been developed that learn representations that can be used for downstream classification (and other) tasks. Models differ in a variety of implementational choices including in their architecture, objective function, and training dataset. In the models we sampled, objectives included supervised learning of object classification (AlexNet, ResNet), self-supervised contrastive learning (MoCo, SimCLR), and other unsupervised learning algorithms based on auxiliary tasks (e.g. reconstruction, or colorization). A majority of the models that we considered relied on the widely used, performant ResNet-50 architecture, though some in our library utilized different architectures. The randomly initialized network control utilized ResNet-50 (see Figure 3C, D). The set of models we used is listed in Table S2.

Simulation of Factorized versus Non-factorized Representational Geometries

For the simulation in Figure 1C, we generated data in the following way. First we randomly sampled the values of N=10 binary features. Feature values corresponded to positions in an N-dimensional vector space as follows: each feature was assigned an axis in N-dimensional space, and the value of each feature (+1 or −1) was treated as a coefficient indicating the position along that axis. All but two of the feature axes were orthogonal to the rest. The last two features, which served as targets for the trained linear decoders, were assigned axes whose alignment ranged from 0 (orthogonal) to 1 (identical). In the noiseless case, factorization of these two variables with respect to one another is given by subtracting the square of the cosine of the angle between the axes from 1. We added Gaussian noise to the positions of each data point and randomly sampled K positive and negative examples for each variable of interest to use as training data for the linear classifier (a support vector machine).
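The simulation geometry described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used for Figure 1C: the function names, the noise level, the value of K, and the reading of "alignment" as the cosine between the two target axes are all assumptions, and a least-squares linear readout stands in for the support vector machine to keep the example dependency-free.

```python
import numpy as np


def make_axes(n_features=10, alignment=0.0):
    """One unit axis per feature; only the last two axes are non-orthogonal.

    `alignment` is taken here to be the cosine between the last two axes
    (0 = orthogonal, 1 = identical). This interpretation is an assumption.
    """
    axes = np.eye(n_features)
    axes[-1] = alignment * axes[-2] + np.sqrt(1.0 - alignment ** 2) * axes[-1]
    return axes


def noiseless_factorization(alignment):
    """1 minus the squared cosine of the angle between the two target axes."""
    return 1.0 - alignment ** 2


def sample_points(axes, n_samples, noise_sd, rng):
    """Sample binary (+1/-1) features and place them in activity space."""
    features = rng.choice([-1.0, 1.0], size=(n_samples, axes.shape[0]))
    noise = noise_sd * rng.normal(size=(n_samples, axes.shape[1]))
    return features @ axes + noise, features


def few_shot_accuracy(alignment, k=5, noise_sd=0.5, n_test=2000, seed=0):
    """Mean held-out accuracy of linear readouts for the two target features.

    A least-squares readout replaces the SVM of the text (assumption).
    """
    rng = np.random.default_rng(seed)
    axes = make_axes(alignment=alignment)
    accuracies = []
    for target in (-2, -1):
        # Draw a pool and keep K positive and K negative training examples.
        pool, pool_feats = sample_points(axes, 50 * k, noise_sd, rng)
        labels = pool_feats[:, target]
        train = np.concatenate([np.flatnonzero(labels > 0)[:k],
                                np.flatnonzero(labels < 0)[:k]])
        design = np.hstack([pool[train], np.ones((len(train), 1))])
        weights = np.linalg.lstsq(design, labels[train], rcond=None)[0]
        test_x, test_feats = sample_points(axes, n_test, noise_sd, rng)
        scores = np.hstack([test_x, np.ones((n_test, 1))]) @ weights
        accuracies.append(np.mean(np.sign(scores) == test_feats[:, target]))
    return float(np.mean(accuracies))
```

Sweeping `alignment` from 0 toward 1 and plotting `few_shot_accuracy` against `noiseless_factorization` gives a rough analogue of the square-versus-parallelogram comparison.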

Macaque neural data analyses

For the shuffle control used as a null model for factorization, we shuffled the object identity labels of the images (Figure S1). For the transformation used in Figure 2B, we computed the principal components of the mean neural activity response to each object class (“class centers”), referred to as the inter-class PCs. We also computed the principal components of the data with corresponding class centers subtracted from each activity pattern, referred to as the intra-class PCs. We transformed the data by applying to the class centers a change of basis matrix that rotated each inter-class PC into the corresponding intra-class PC. That is, the class centers were transformed by this matrix, but the relative positions of activity patterns for a given class were fixed. This transformation has the effect of preserving intra-class variance statistics exactly from the original data, and preserving everything about the statistics of inter-class variance except its orientation relative to intra-class variance. That is, the transformation is designed to affect (specifically decrease) factorization while controlling for all other statistics of the activity data that may be relevant to object classification performance (considering the simulation in Figure 1C of two binary variables, this basis change of the neural data in Figure 2B is equivalent to turning a square into the maximally flat parallelogram, the degenerate one where all the points are collinear).
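One possible reading of this class-center rotation is sketched below in NumPy. The function name and the pairing of inter-class and intra-class PCs by rank order are assumptions. The sketch is meant only to show that moving the class centers while leaving within-class residuals untouched preserves both the intra-class statistics and the distances between class centers.

```python
import numpy as np


def rotate_class_centers(x, labels):
    """Rotate inter-class PCs onto the intra-class PCs of matching rank.

    x: (n_samples, n_features) activity patterns.
    labels: (n_samples,) class label per pattern.
    Returns data with transformed class centers and unchanged residuals.
    """
    classes = np.unique(labels)
    class_index = np.searchsorted(classes, labels)
    centers = np.stack([x[labels == c].mean(axis=0) for c in classes])
    residuals = x - centers[class_index]  # within-class structure, kept fixed

    grand_mean = centers.mean(axis=0)
    centered = centers - grand_mean
    # Rows of each matrix are principal axes, ordered by variance.
    inter_pcs = np.linalg.svd(centered, full_matrices=False)[2]
    intra_pcs = np.linalg.svd(residuals - residuals.mean(axis=0),
                              full_matrices=False)[2]

    k = min(len(inter_pcs), len(intra_pcs))
    # Maps the i-th inter-class PC onto the i-th intra-class PC.
    rotation = inter_pcs[:k].T @ intra_pcs[:k]
    new_centers = grand_mean + centered @ rotation
    return new_centers[class_index] + residuals
```

Decoding object class from the output of `rotate_class_centers` versus the original data would then give a lesioned-versus-intact comparison in the spirit of Figure 2B.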

Scene Parameter Variation

Our generated scenes consisted of foreground objects superimposed on natural backgrounds. To measure variance associated with a particular parameter such as background identity, we randomly sampled ten different backgrounds while holding the other variables (e.g., foreground object identity and pose) constant. To measure variance associated with foreground object pose, we randomly varied object angle within [-90, 90] along all three axes independently, object position on the two in-plane axes (horizontal [-30%, 30%] and vertical [-60%, 60%]), and object size [×1/1.6, ×1.6]. To measure variance associated with camera position, we took crops of the image with scale uniformly varying from 20% to 100% of the image size, and position uniformly distributed across the image. To measure variance associated with lighting conditions, we applied random jitters to the brightness, contrast, saturation, and hue of an image, with jitter value bounds of [-0.4, 0.4] for brightness, contrast, and saturation and [-0.1, 0.1] for hue. These parameter choices follow standard data augmentation practices for self-supervised neural network training, as used, for example, in the SimCLR and MoCo models tested here17, 18.
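The sampling ranges above can be collected into a small configuration, sketched here. The dictionary keys and the helper name are hypothetical, and drawing every value uniformly within its range is an assumption (the text specifies uniform sampling only for the crops); rendering and image manipulation are not shown.

```python
import numpy as np

# Ranges copied from the text; key names are illustrative only.
POSE_RANGES = {
    "rotation_axis_1": (-90.0, 90.0),
    "rotation_axis_2": (-90.0, 90.0),
    "rotation_axis_3": (-90.0, 90.0),
    "shift_horizontal": (-0.30, 0.30),  # fraction of image size
    "shift_vertical": (-0.60, 0.60),    # fraction of image size
    "size_scale": (1 / 1.6, 1.6),       # multiplicative factor
}
LIGHTING_RANGES = {
    "brightness": (-0.4, 0.4),
    "contrast": (-0.4, 0.4),
    "saturation": (-0.4, 0.4),
    "hue": (-0.1, 0.1),
}
CROP_RANGES = {"crop_scale": (0.20, 1.00)}  # fraction of image size

def sample_uniform(ranges, n_samples=10, seed=None):
    """Draw n_samples parameter settings, uniformly within each range (assumed)."""
    rng = np.random.default_rng(seed)
    return [
        {name: float(rng.uniform(low, high)) for name, (low, high) in ranges.items()}
        for _ in range(n_samples)
    ]
```

Calling `sample_uniform(POSE_RANGES, n_samples=10)` would give ten pose settings for one base image, matching the ten variations per base image used for each parameter.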

Factorization and invariance metrics

Factorization and invariance were measured according to the following equations:
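
Written in terms of the quantities defined in the next paragraph, the two metrics take the following form. The notation is a reconstruction from those definitions and from the noiseless expression used for Figure 1C (one minus the squared cosine of the angle between axes), and should be read as an assumption about the exact form:

```latex
\begin{align}
\mathrm{factorization}_{\mathrm{param}}
  &= 1 - \frac{\mathrm{var}_{\mathrm{param}\,|\,\mathrm{other\_param\_subspace}}}{\mathrm{var}_{\mathrm{param}}} \\
\mathrm{invariance}_{\mathrm{param}}
  &= 1 - \frac{\mathrm{var}_{\mathrm{param}}}{\mathrm{var}_{\mathrm{all\_params}}}
\end{align}
```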

Variance induced by a parameter (varparam) is computed by measuring the variance (summed across all dimensions of neural activity space) of neural responses to the 10 augmented versions of a base image, where the augmentations are those obtained by varying the parameter of interest. This quantity is then averaged across the 100 base images. The variance induced by all parameters is simply the sum of the variances across all images and augmentations. To define the “other-parameter subspace,” we averaged neural responses for a given base image over all augmentations of the parameter of interest, and ran PCA on the resulting set of averaged responses. The subspace was defined as the space spanned by the top principal components containing 90% of the variance of these responses. Intuitively, because of the averaging step, this space captures the bulk of the variance driven by all parameters other than the parameter of interest. The variance of the parameter of interest within this “other-parameter subspace,” varparam|other_param_subspace, was computed in the same way as varparam, but using the projections of neural activity responses onto the other-parameter subspace.
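The procedure in this paragraph can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes factorization is defined as one minus the fraction of parameter-induced variance falling inside the other-parameter subspace, that invariance is one minus the ratio of parameter-induced variance to total variance, and that variances are population variances. The function name and array layout are hypothetical.

```python
import numpy as np

def factorization_and_invariance(responses, var_threshold=0.90):
    """responses: (n_base_images, n_augmentations, n_units) array.

    Axis 1 indexes versions of each base image obtained by varying
    the parameter of interest.
    """
    n_base, n_aug, n_units = responses.shape

    # Variance induced by the parameter: variance over augmentations,
    # summed over units, averaged over base images.
    base_means = responses.mean(axis=1, keepdims=True)
    centered = responses - base_means
    var_param = (centered**2).mean(axis=1).sum(axis=1).mean()

    # Variance induced by all parameters: variance over every image
    # and augmentation, summed over units.
    var_all = responses.reshape(-1, n_units).var(axis=0).sum()

    # Other-parameter subspace: PCA on the per-base-image averages,
    # keeping the top components that reach the variance threshold.
    means = base_means[:, 0, :]
    _, s, vt = np.linalg.svd(means - means.mean(axis=0), full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    k = min(int(np.searchsorted(explained, var_threshold)) + 1, len(s))
    basis = vt[:k]  # (k, n_units), orthonormal rows

    # Same variance computation on the projected responses.
    projected = centered @ basis.T
    var_param_in_subspace = (projected**2).mean(axis=1).sum(axis=1).mean()

    factorization = 1.0 - var_param_in_subspace / var_param
    invariance = 1.0 - var_param / var_all
    return factorization, invariance
```

With the values in the text, `responses` would have shape (100, 10, n_units). The `var_threshold` argument corresponds to the 90% PCA threshold.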

Natural movie factorization metrics

For natural movies, variance is not induced by explicit control of a parameter as in our synthetic scenes. Instead, it is induced implicitly: contiguous frames (separated by 200 ms in real time) are taken to reflect changes in one of two motion parameters (object versus observer motion), depending on how stationary the observer is (MIT Moments in Time movie set: stationary observer; UT-Austin Egocentric movie set: nonstationary)27, 28. Here, the all-parameters condition is simply the variance across all movie frames. For the MIT Moments in Time dataset, this includes variance across thousands of video clips taken in many different settings. For the UT-Austin Egocentric movie dataset, it includes variance across only 4 movies, but over long durations (3-5 hours) during which an observer translates extensively in an environment. Movie clips in the MIT Moments in Time movie set thus contained new scenes with different object identities, backgrounds, and lightings, and effectively captured variance induced by these non-spatial parameters28. In the UT-Austin Egocentric movie set, new objects are encountered as the subject navigates around the urban landscape27.

Model Neural Encoding Fits

Linear mappings between model features and neuron (or voxel) responses were computed using ridge regression (with the regularization coefficient selected by cross-validation) on a low-dimensional linear projection of model features (top 300 PCA components computed using images in each dataset). We also tested an alternative approach to measuring similarity between models and experimental data based on representational similarity analysis (RSA)45: we computed dot-product similarities between the representations of all pairs of images and measured the Spearman correlation coefficient between the pairwise similarity matrices obtained from a given model and a given neural dataset.
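Both comparison methods can be sketched with NumPy alone. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the ridge penalty `alpha` is passed in rather than selected by cross-validation, the RSA comparison uses only distinct image pairs (the upper triangle of each similarity matrix), and the rank helper does not average tied ranks.

```python
import numpy as np

def pca_project(features, n_components=300):
    """Project (n_images, n_features) model features onto their top PCs."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def ridge_predict(train_x, train_y, test_x, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = train_x.mean(axis=0), train_y.mean(axis=0)
    xc = train_x - x_mean
    gram = xc.T @ xc + alpha * np.eye(xc.shape[1])
    weights = np.linalg.solve(gram, xc.T @ (train_y - y_mean))
    return (test_x - x_mean) @ weights + y_mean

def _ranks(values):
    """Ranks of a 1-D array (ties broken arbitrarily, not averaged)."""
    order = np.argsort(values)
    ranks = np.empty(len(values))
    ranks[order] = np.arange(len(values))
    return ranks

def rsa_score(model_features, neural_responses):
    """Spearman correlation between dot-product similarity matrices."""
    pairs = np.triu_indices(model_features.shape[0], k=1)
    model_sim = (model_features @ model_features.T)[pairs]
    neural_sim = (neural_responses @ neural_responses.T)[pairs]
    return np.corrcoef(_ranks(model_sim), _ranks(neural_sim))[0, 1]
```

In practice, `alpha` would be chosen by evaluating `ridge_predict` on held-out folds over a grid of candidate values, and the encoding score for each neuron or voxel would be the correlation between predicted and measured responses on held-out images.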

Model Behavioral Signatures

We followed the approach of Rajalingham, Issa et al.24 We took human and macaque behavioral data from the object classification task and used it to create signatures of image-level difficulty (the “I1” vector) and image-by-distractor-object confusion rates (the “I2” matrix). We did the same for the DNN models, extracting model “behavior” by training logistic regression classifiers to classify object identity in the same image dataset used in the experiments of Rajalingham, Issa et al.24, using model layer activations as inputs. Model behavioral accuracy rates on image-by-distractor-object pairs were assessed using the classification probabilities output by the logistic regression model, and these were used to compute I1 and I2 metrics as was done for the true behavioral data. Behavioral similarity between models and data was assessed by measuring the correlation between the entries of the model and data I1 vectors, and between the entries of the model and data I2 matrices (both I1 and I2 results are reported).

Model Layer Choices

The scatter plots in Figure 3C,D and Figure S2 use metrics (factorization, invariance, and goodness of neural fit) taken from the final representational layer of the network (the layer prior to the logits layer used for classification in supervised networks, prior to the embedding head in contrastive learning models, or prior to any auxiliary task-specific layers in unsupervised models trained using auxiliary tasks). However, representational geometries of model activations, and their match to neural activity and behavior, vary across layers. This variability arises because different model layers correspond to different stages of processing in the model (convolutional layers in some cases, and pooling operations in others), and may even have different dimensionalities. To ensure that our results do not depend on idiosyncrasies of representations in one particular model layer and the particular network operations that precede it, summary correlation statistics in all other figures (Figure 4 and Figures S3-S5) show the results of the analysis in question averaged over the five final representational layers of the model. That is, the metrics of interest (factorization, invariance, neural fits, behavioral similarity scores) were computed independently for each of the five final representational layers of each model, and these five values were averaged prior to computing correlations between different metrics.

Correlation of Model Predictions and Experimental Data

A Spearman rank correlation coefficient was calculated for each model layer × biological dataset combination (6 monkey datasets and 6 human datasets). Here, we do not correct for noise in the biological data when computing the correlation coefficient, as this would require trial repeats (for computing intertrial variability) that were limited or not available in the fMRI data used. In any event, normalizing by the data noise ceiling applies a uniform scaling to all model prediction scores and does not affect model comparison, which depends only on ranking models as relatively better or worse at predicting brain data. Finally, we estimate how well model factorization or invariance, in combination with model object classification performance, predicts model neural and behavioral fit. To do so, we perform a linear regression on the particular dual-metric combination (e.g., classification plus object pose factorization) and report the Spearman correlation coefficient between the linearly weighted metric combination and the fit. The correlation was assessed on held-out models (80% used for training, 20% for testing), and the results were averaged over 100 randomly sampled train/test splits.
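The held-out evaluation of a dual-metric combination can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the regression includes an intercept, the split is drawn by random permutation, and the rank helper does not average tied ranks, all of which are assumptions.

```python
import numpy as np

def _ranks(values):
    """Ranks of a 1-D array (ties broken arbitrarily, not averaged)."""
    order = np.argsort(values)
    ranks = np.empty(len(values))
    ranks[order] = np.arange(len(values))
    return ranks

def spearman(a, b):
    """Spearman rank correlation as the Pearson correlation of ranks."""
    return np.corrcoef(_ranks(a), _ranks(b))[0, 1]

def combined_metric_predictivity(metric_a, metric_b, brain_fit,
                                 n_splits=100, train_frac=0.8, seed=0):
    """Mean held-out Spearman correlation of a linear two-metric combination.

    metric_a, metric_b, brain_fit: 1-D arrays with one entry per model.
    """
    rng = np.random.default_rng(seed)
    n_models = len(brain_fit)
    design = np.column_stack([metric_a, metric_b, np.ones(n_models)])
    n_train = int(round(train_frac * n_models))

    scores = []
    for _ in range(n_splits):
        perm = rng.permutation(n_models)
        train, test = perm[:n_train], perm[n_train:]
        # Fit the linear weighting on the training models only.
        weights, *_ = np.linalg.lstsq(design[train], brain_fit[train], rcond=None)
        # Score the weighted combination on the held-out models.
        scores.append(spearman(design[test] @ weights, brain_fit[test]))
    return float(np.mean(scores))
```

Here `metric_a` might hold classification accuracy and `metric_b` object pose factorization, with `brain_fit` holding the neural or behavioral fit of each model for one dataset.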

Acknowledgements

This work was performed on the Columbia Zuckerman Institute Axon GPU cluster and via generous access to Cloud TPUs from Google’s TPU Research Cloud (TRC). JWL was supported by the DOE CSGF (DE–SC0020347). EBI was supported by a Klingenstein-Simons fellowship, Sloan Foundation fellowship, and Grossman-Kavli Scholar Award. We thank Erica Shook for comments on a previous version of the manuscript. The authors declare no competing interests.

Author contributions

JWL and EBI designed the research. JWL performed the computational modeling and data analysis. JWL and EBI wrote the manuscript. EBI supervised the research.

Supplement

Factorization and invariance in V4 & IT neural data

Normalized factorization and invariance as in Figure 2A, but after subtracting the shuffle control, for the V4 and IT neural datasets. Shuffling the image identities of each population vector accounts for increases in factorization driven purely by changes in the covariance statistics of population responses between V4 and IT. However, normalized factorization scores remain significantly above zero for both brain areas.

Scatter plots for all datasets

Scatter plots as in Figure 3C,D for all datasets. Brain metrics (y-axes) by panel are: (A) macaque neuron/human voxel fits in V4 cortex, (B) macaque neuron/human voxel fits in ITC/HVC, and (C) macaque/human per-image classification performance (I1) and image-by-distractor class performance (I2). In all panels, the plots in the top half use DNN factorization scores on the x-axis while the bottom half use DNN invariance scores.

Predictivity of factorization and invariance restricting to ResNet-50 model architectures

Same format as Figure 4C, except with the analyses restricted to models with the ResNet-50 architecture. The main finding that factorization of scene parameters in DNNs is generally positively correlated with better predictions of brain data is replicated using this architecture-matched subset of models, controlling for potential confounds from model architecture.

Predictivity of factorization and invariance for RDMs

Same format as Figure 4C, except for predicting population representational dissimilarity matrices (RDMs) of macaque neurophysiological and human fMRI data (in the main analyses, linear encoding fits of each single neuron/voxel were used to measure brain predictivity of a model). The main finding that factorization of scene parameters in DNNs is positively correlated with better predictions of brain data is replicated using RDMs instead of neuron/voxel goodness of fit.

Effect on neural and behavioral predictivity of PCA threshold for computing factorization, Related to Figure 4

The percent-variance threshold used in the main text for estimating a PCA linear subspace capturing the bulk of the variance induced by all parameters other than the parameter of interest is somewhat arbitrary. Here we show that the results of our main analysis change little if we vary this threshold from 50% to 99%. In the main text, a PCA threshold of 90% was used for computing factorization scores.

Datasets used for measuring similarity of models to the brain

Datasets from both macaque and human high-level visual cortex as well as high-level visual behavior were collated for testing the brainlikeness of computational models. For neural and fMRI datasets, the features in the model were used to predict the image-by-image response pattern of each neuron or voxel. For behavior datasets, the performance of linear decoders built atop model representations was compared to the per-image performance of macaques and humans.

Models tested

For each model, we measured representational factorization and invariance in each of the final five representational layers of the model, and evaluated the brainlikeness of these layers using the datasets in Table S1.