The journey of Modernizing TorchVision

It’s been a while since I last posted a new entry in the TorchVision memoirs series. Though I’ve previously shared news on the official PyTorch blog and on Twitter, I thought it would be a good idea to talk more about what happened in the last release of TorchVision (v0.12), what’s coming in the next one (v0.13) and what our plans are for 2022H2. My aim is to go beyond an overview of new features and provide insights into where we want to take the project in the following months.
TorchVision v0.12 was a sizable release with dual focus: a) update our deprecation and model contribution policies to improve transparency and attract more community contributors and b) double down on our modernization efforts by adding popular new model architectures, datasets and ML techniques.
The key to a successful open-source project is maintaining a healthy, active community that contributes to it and drives it forward. Thus an important goal for our team is to increase the number of community contributions, with the long-term vision of enabling the community to contribute big features (new models, ML techniques, etc.) on top of the usual incremental improvements (bug/doc fixes, small features, etc.).
Historically, even though the community was eager to contribute such features, our team hesitated to accept them. The key blocker was the lack of a concrete model contribution and deprecation policy. To address this, Joao Gomes worked with the community to draft and publish our first model contribution guidelines, which provide clarity over the process of contributing new architectures, pre-trained weights and features that require model training. Moreover, Nicolas Hug worked with PyTorch core developers to formulate and adopt a concrete deprecation policy.
The aforementioned changes had immediate positive effects on the project. The new contribution policy helped us receive numerous community contributions for large features (more details below) and the clear deprecation policy enabled us to clean up our code-base while still ensuring that TorchVision offers strong Backwards Compatibility guarantees. Our team is very motivated to continue working with open-source developers, research teams and downstream library creators to keep TorchVision relevant and fresh. If you have any feedback, comments or feature requests, please reach out to us.
It’s no secret that for the last few releases our target was to add to TorchVision all the necessary Augmentations, Losses, Layers, Training utilities and novel architectures so that our users can easily reproduce SOTA results using PyTorch. TorchVision v0.12 continued down that route:
Our rockstar community contributors, Hu Ye and Zhiqiang Wang, have contributed the FCOS architecture, which is a one-stage object detection model.
Nicolas Hug has added support for optical flow in TorchVision by adding the RAFT architecture.
Yiwen Song has added support for Vision Transformer (ViT) and I have added the ConvNeXt architecture along with improved pre-trained weights.
Finally, with the help of our community, we’ve added 14 new classification and 5 new optical flow datasets.
As per usual, the release came with numerous smaller enhancements, bug fixes and documentation improvements. To see all of the new features and the list of our contributors please check the v0.12 release notes.
TorchVision v0.13 is just around the corner, with its expected release in early June. It is a very big release with a significant number of new features and big API improvements.
We are continuing our journey of modernizing the library by adding the necessary primitives, model architectures and recipe utilities to produce SOTA results for key Computer Vision tasks:
With the help of Victor Fomin, I have added important missing Data Augmentation techniques such as AugMix, Large Scale Jitter etc. These techniques enabled us to close the gap from SOTA and produce better weights (see below).
With the help of Aditya Oke, Hu Ye, Yassine Alouini and Abhijit Deo, we have added important common building blocks such as the DropBlock layer, the MLP block, the cIoU & dIoU losses etc. Finally, I worked with Shen Li to fix a long-standing issue in PyTorch’s SyncBatchNorm layer which affected the detection models.
Hu Ye with the support of Joao Gomes added Swin Transformer along with improved pre-trained weights. I added the EfficientNetV2 architecture and several post-paper architectural optimizations on the implementation of RetinaNet, FasterRCNN and MaskRCNN.
As I discussed earlier on the PyTorch blog, we have put significant effort into improving our pre-trained weights by creating an improved training recipe. This enabled us to improve the accuracy of our Classification models by 3 accuracy points, achieving new SOTA for various architectures. A similar effort was performed for Detection and Segmentation, where we improved the accuracy of the models by over 8.1 mAP on average. Finally, Yosua Michael M worked with Laura Gustafson, Mannat Singh and Aaron Adcock to add support for SWAG, a set of new highly accurate state-of-the-art pre-trained weights for ViT and RegNets.
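To make the dIoU and cIoU losses mentioned among the new building blocks more concrete, here is a minimal pure-Python sketch of the published formulas for a single pair of boxes. It is not TorchVision's implementation: the function names, the `(x1, y1, x2, y2)` box layout and the `eps` stabilizer are assumptions made for illustration.

```python
import math


def _iou_and_center_ratio(box_a, box_b, eps=1e-7):
    """Return (IoU, squared centre distance / squared enclosing diagonal)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / (union + eps)

    # Squared diagonal of the smallest box enclosing both inputs.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    diag_sq = enclose_w ** 2 + enclose_h ** 2 + eps

    # Squared distance between the two box centres.
    center_sq = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    return iou, center_sq / diag_sq


def diou_loss(box_a, box_b, eps=1e-7):
    """Distance-IoU loss: 1 - IoU + normalised centre distance."""
    iou, ratio = _iou_and_center_ratio(box_a, box_b, eps)
    return 1.0 - iou + ratio


def ciou_loss(box_a, box_b, eps=1e-7):
    """Complete-IoU loss: dIoU loss plus an aspect-ratio consistency term."""
    iou, ratio = _iou_and_center_ratio(box_a, box_b, eps)
    wa, ha = box_a[2] - box_a[0], box_a[3] - box_a[1]
    wb, hb = box_b[2] - box_b[0], box_b[3] - box_b[1]
    # atan2(w, h) equals atan(w / h) for h > 0 and avoids dividing by zero.
    v = (4 / math.pi ** 2) * (math.atan2(wb, hb) - math.atan2(wa, ha)) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return 1.0 - iou + ratio + alpha * v


pred = (0.0, 0.0, 2.0, 1.0)
target = (0.0, 0.0, 1.0, 2.0)
print(diou_loss(pred, target), ciou_loss(pred, target))
```

Unlike a plain IoU loss, both variants still provide a useful signal when the boxes do not overlap, because the centre-distance term keeps changing as the boxes move closer together.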
As I previously discussed on the PyTorch blog, TorchVision has extended its existing model builder mechanism to support multiple pre-trained weights. The new API is fully backwards compatible, allows models to be instantiated with different weights and provides mechanisms to get useful meta-data (such as categories, number of parameters, metrics etc.) and the preprocessing inference transforms of the model. There is a dedicated feedback issue on GitHub to help us iron out any rough edges.
Nicolas Hug led the efforts of restructuring the model documentation of TorchVision. The new structure makes use of features coming from the Multi-weight Support API to offer better documentation for the pre-trained weights and their use in the library. Massive shout-out to our community members for helping us document all architectures on time.
Though our detailed roadmap for 2022H2 is not yet finalized, here are some key projects that we are currently planning to work on:
We are working closely with Haoqi Fan and Christoph Feichtenhofer from PyTorch Video, to add the Improved Multiscale Vision Transformer (MViTv2) architecture to TorchVision.
Philip Meier and Nicolas Hug are working on an improved version of the Datasets API (v2) which uses TorchData and DataPipes. Philip Meier, Victor Fomin and I are also working on extending our Transforms API (v2) to support not only images but also bounding boxes, segmentation masks etc.
Finally the community is helping us keep TorchVision fresh and relevant by adding popular architectures and techniques. Lezwon Castelino is currently working with Victor Fomin to add the SimpleCopyPaste augmentation. Hu Ye is currently working to add the DeTR architecture.
If you would like to get involved with the project, please have a look at our good first issues and the help wanted lists. If you are a seasoned PyTorch/Computer Vision veteran and you would like to contribute, we have several candidate projects for new operators, losses, augmentations and models.
I hope you found the article interesting. If you want to get in touch, hit me up on LinkedIn or Twitter.

Facts Only

TorchVision v0.12 was released with a focus on modernizing the library and improving community contribution policies.
Joao Gomes and Nicolas Hug drafted model contribution and deprecation policies to clarify the process for adding new architectures and features.
Community contributors Hu Ye and Zhiqiang Wang added the FCOS object detection model.
Nicolas Hug introduced the RAFT architecture for optical flow support.
Yiwen Song added Vision Transformer (ViT), and the author added ConvNeXt with improved pre-trained weights.
Fourteen new classification and five new optical flow datasets were added.
TorchVision v0.13 is expected to release in early June with significant new features, including Swin Transformer, EfficientNetV2, and improved pre-trained weights.
Victor Fomin and the author added data augmentation techniques like AugMix and Large Scale Jitter.
Hu Ye, Aditya Oke, Yassine Alouini, and Abhijit Deo contributed building blocks such as DropBlock, MLP, and cIoU & dIoU loss.
The author worked with Shen Li to fix an issue with PyTorch’s SyncBatchNorm layer affecting detection models.
Yosua Michael M, Laura Gustafson, Mannat Singh, and Aaron Adcock added SWAG pre-trained weights for ViT and RegNets.
The model builder mechanism now supports multiple pre-trained weights with backward compatibility.
Nicolas Hug restructured the model documentation to better showcase pre-trained weights.
Planned projects for 2022H2 include adding MViTv2, improving the Datasets and Transforms APIs, and integrating DeTR.
Community members Lezwon Castelino and Hu Ye are working on SimpleCopyPaste augmentation and DeTR, respectively.
The project invites contributions through "good first issues" and "help wanted" lists.

Executive Summary

TorchVision, a key library in the PyTorch ecosystem, has undergone significant updates in its latest release (v0.12) and is preparing for v0.13, expected in early June. The v0.12 release focused on modernizing the library by introducing new model architectures, datasets, and machine learning techniques, while also establishing clearer contribution and deprecation policies to encourage community involvement. Notable additions include the FCOS object detection model, the RAFT optical flow model, Vision Transformer (ViT), ConvNeXt with improved pre-trained weights, and 19 new datasets.
Looking ahead, v0.13 will introduce further advancements, such as the Swin Transformer, EfficientNetV2, and enhanced training recipes that boost model accuracy. The library is also expanding its API to support multiple pre-trained weights and improving documentation. For the second half of 2022, plans include integrating the Multiscale Vision Transformer (MViTv2), refining the Datasets and Transforms APIs, and adding new architectures like DeTR. The project emphasizes community collaboration, with open invitations for contributions and a focus on maintaining backward compatibility while pushing the boundaries of computer vision research.

Full Take

This update from TorchVision reflects a deliberate shift toward structured openness—balancing the need for innovation with the practicalities of maintaining a large-scale open-source project. The introduction of clear contribution and deprecation policies is a pragmatic move to scale community involvement while preserving stability, a common tension in mature software ecosystems. The emphasis on state-of-the-art (SOTA) performance and modern architectures like transformers aligns with broader trends in computer vision, where attention-based models are rapidly displacing traditional CNNs. However, the focus on "SOTA chasing" raises questions about the trade-offs between cutting-edge performance and accessibility for smaller research teams or resource-constrained users.
The narrative leans heavily on the idea of community-driven progress, which is commendable, but it’s worth asking: who gets to define "relevance" in this context? The roadmap prioritizes architectures and techniques popular in industry and top-tier research labs, potentially sidelining niche but valuable use cases. The mention of "highly accurate state-of-the-art pre-trained weights" (SWAG) also hints at the growing importance of centralized, high-cost training regimes—a trend that could further concentrate power in the hands of well-funded institutions.
Patterns detected: none
Root cause: The paradigm here is one of controlled decentralization—opening the gates to community contributions while retaining curatorial authority to ensure alignment with PyTorch’s broader goals. This mirrors the "benevolent dictatorship" model seen in other large open-source projects, where inclusivity is encouraged but ultimately steered by a core team.
Implications: For researchers and practitioners, these updates lower the barrier to reproducing SOTA results, democratizing access to advanced tools. However, the push toward increasingly complex models may inadvertently raise the computational and expertise barriers for entry, counteracting some of that democratization. The focus on backward compatibility is a nod to the library’s role as infrastructure, but it also locks in certain design decisions that could become technical debt over time.
Bridge questions: How might TorchVision balance the pursuit of SOTA performance with the needs of users who prioritize interpretability, efficiency, or edge deployment? What mechanisms could ensure that community contributions reflect diverse perspectives, not just those aligned with industry trends? If the cost of training these models continues to rise, how sustainable is this model of open-source development?
Counterstrike scan: A coordinated influence campaign pushing this narrative would emphasize the "democratization" angle while downplaying the centralization of training resources and decision-making power. It might also frame community contributions as purely meritocratic, ignoring structural barriers to participation. The actual content does not match this pattern—it acknowledges the role of the core team in steering contributions and highlights specific, verifiable improvements rather than vague promises. The transparency about policies and roadmap suggests a genuine effort to build trust, not manipulate it.
