Research Focus
  • Visual Understanding and Interactive Vision

Research in this area focuses on the development of computer vision technologies, including object classification, object detection and tracking, object segmentation, representation learning, keypoint extraction, human pose estimation, gesture recognition, image captioning, and large-scale distributed training engines. The research aims to solve problems in e-commerce and general-purpose visual computing scenarios, such as recognizing products and people and understanding how these entities behave and interact.

  • Video Content Understanding and Video Data Mining

Research in this area focuses on the development of technologies such as video annotation, video search, video object detection, and video generation. The research aims to tackle challenges in terms of efficiency and accuracy of moderating, searching, and editing large amounts of video data.

  • 3D Vision

Research in this area focuses on the development of 3D modeling, 3D perception, 3D understanding, and 3D interaction. The research aims to resolve modeling and measuring issues in onboard computer vision and improve the overall experience of AR and VR.

  • Text Recognition

Research in this area focuses on the development of text detection, text recognition, and structured understanding of data in images and videos. The research aims to improve text recognition and information extraction from complex visual data, including scanned documents, photographs, and images that contain multiple languages and objects.

  • Image-text Understanding

Research in this area focuses on the development of core technologies in multimedia content understanding, including image-text matching, image-text joint search, and price estimation.
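To make image-text matching concrete, a common framing (a minimal sketch only, not the lab's actual models) embeds images and texts into a shared vector space with separate encoders and ranks pairs by cosine similarity; the dimensions and random vectors below are placeholders standing in for real encoder outputs.

    # Minimal sketch of embedding-based image-text matching (illustrative only).
    # Real systems train deep image and text encoders; here random vectors stand
    # in for encoder outputs so the ranking logic itself is runnable.
    import numpy as np

    def l2_normalize(x, axis=-1, eps=1e-12):
        """Scale each vector to unit length so dot products become cosine similarities."""
        return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

    rng = np.random.default_rng(0)
    dim = 128                                              # placeholder embedding size
    image_emb = l2_normalize(rng.normal(size=(4, dim)))    # pretend outputs of an image encoder
    text_emb = l2_normalize(rng.normal(size=(6, dim)))     # pretend outputs of a text encoder

    # Cosine similarity between every image and every caption.
    similarity = image_emb @ text_emb.T                    # shape: (num_images, num_texts)

    # For each image, pick the most similar caption.
    best_caption = similarity.argmax(axis=1)
    print("best caption index per image:", best_caption)

The same scoring matrix can be read column-wise for text-to-image retrieval, which is why the dual-encoder layout is a common basis for joint image-text search.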

  • Offline Intelligence

Research in this area focuses on device-side and edge-side vision processing and structured solutions, including object detection, object segmentation, multi-object tracking, object identification (pedestrian, vehicle, and face recognition), object attribute extraction, and behavior analysis algorithms. It also covers the analysis of remote sensing and X-ray imagery, such as change detection and object-oriented land cover classification. In addition, the research provides cost-effective optimization for deep neural networks, including model compression, inference acceleration, and network architecture search.
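Among the optimization techniques mentioned above, model compression is the simplest to illustrate. The toy sketch below shows symmetric int8 post-training quantization of a single weight tensor; it is an assumption-laden illustration of the general idea, not the lab's production method, which would involve per-channel scales, calibration data, or quantization-aware training.

    # Toy sketch of symmetric int8 post-training quantization of a weight tensor.
    # Illustrates the storage/accuracy trade-off behind model compression.
    import numpy as np

    def quantize_int8(weights):
        """Map float weights to int8 using a single symmetric per-tensor scale."""
        scale = np.abs(weights).max() / 127.0 if weights.size else 1.0
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        """Recover an approximate float tensor from the int8 codes."""
        return q.astype(np.float32) * scale

    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.05, size=(256, 256)).astype(np.float32)  # fake layer weights

    q, scale = quantize_int8(w)
    w_hat = dequantize(q, scale)

    print("storage: %.1f KB -> %.1f KB" % (w.nbytes / 1024, q.nbytes / 1024))
    print("mean abs error: %.6f" % np.abs(w - w_hat).mean())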

  • Low-level Vision

Research in this area focuses on the development of diverse vision computing technologies to solve challenges in low-level vision, including technologies used to preprocess images and videos for visual content analytics and understanding. The research is applied in real-world applications such as image and video restoration, enhancement, and denoising. Research in this area also covers image editing and generation to improve user experience and optimize human-machine interaction.
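As a concrete, if deliberately simplistic, view of what denoising as a preprocessing step means, the sketch below removes synthetic Gaussian noise with a plain box filter; real low-level vision pipelines rely on learned restoration models rather than this toy filter.

    # Toy denoising sketch: box-filter smoothing of a noisy synthetic image.
    import numpy as np

    def box_filter(img, k=3):
        """Average each pixel over a k x k neighborhood (edges handled by padding)."""
        pad = k // 2
        padded = np.pad(img, pad, mode="edge")
        out = np.zeros_like(img)
        for dy in range(k):
            for dx in range(k):
                out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
        return out / (k * k)

    rng = np.random.default_rng(0)
    clean = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))        # smooth synthetic image
    noisy = clean + rng.normal(scale=0.1, size=clean.shape)    # add Gaussian noise
    denoised = box_filter(noisy, k=3)

    print("noisy MSE:    %.5f" % np.mean((noisy - clean) ** 2))
    print("denoised MSE: %.5f" % np.mean((denoised - clean) ** 2))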


Products and Applications
  • Pailitao and Image Search

    The Vision Lab researches and develops cutting-edge image search and recognition technologies that can be used in a wide variety of scenarios. Pailitao is a smart image search feature integrated into Alibaba's e-commerce platforms (Taobao and Tmall). More than 20 million users use Pailitao every day to perform reverse image searches for products they want to purchase. Image Search is a cloud service provided by Alibaba Cloud that offers a comprehensive search-by-image solution for customers who want to find similar images on e-commerce platforms, photo galleries, and image-sharing websites. The service is well received by customers around the world, such as THE ICONIC, a leading online fashion and footwear retailer in Australia and New Zealand.
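    As a rough sketch of how a search-by-image service is typically structured (illustrative only; not Pailitao's or Image Search's actual pipeline), catalog images are indexed by embedding vectors and a query image is matched against that index with nearest-neighbor search. The feature size and random embeddings below are placeholders for real CNN features.

    # Minimal sketch of embedding-based reverse image search (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    dim = 256                                          # placeholder feature size
    catalog = rng.normal(size=(10_000, dim))           # pretend embeddings of catalog images
    catalog /= np.linalg.norm(catalog, axis=1, keepdims=True)

    def search(query_emb, index, top_k=5):
        """Return indices of the top_k most similar catalog items by cosine similarity."""
        q = query_emb / np.linalg.norm(query_emb)
        scores = index @ q
        return np.argsort(-scores)[:top_k]

    query = rng.normal(size=dim)                       # pretend embedding of a user's photo
    print("top matches:", search(query, catalog))

    At real catalog scale, the exact dot-product scan above would be replaced by a distributed approximate-nearest-neighbor index, which is what makes billion-item visual search practical.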

  • 3D Smart Manufacturing

    The Vision Lab studies 3D vision and computer graphics technologies to provide industry-specific solutions for digital and intelligent transformation. These solutions help create synergy among customers, brands, and manufacturers. In the footwear industry, for example, 3D scanning and matching algorithms increase the efficiency and precision of footwear manufacturing, marketing, and recommendations. In the real estate industry, these technologies create realistic and immersive experiences at low cost and high efficiency. In the e-commerce industry, they deliver immersive see-now-buy-now shopping experiences through AR and VR, which increases sales efficiency and conversion rates.

  • Virtual Avatars

    The Vision Lab integrates graphics, image, and speech technologies to build realistic 2D and 3D virtual avatars, which are now making their way into people's homes as virtual influencers on Taobao and virtual tutors on online education platforms. The Vision Lab utilizes advanced technologies to generate, operate, and control these avatars, and has developed industry-leading technologies in high-precision reconstruction of human faces and bodies, photo2avatar, video2avatar, speech2action, and dialogue interaction with virtual avatars. These technologies empower industries such as interactive entertainment, intelligent education, new retail, AR, VR, and XR.

  • AI Solutions in Media

    The rise of AI technologies has brought remarkable changes to the digital media industry. AI technologies used in video and audio processing include video and audio structuring, facial recognition, video and audio fingerprinting, content generation, smart content moderation, and multi-modal search. The Vision Lab provides AI solutions that support copyright protection, media cataloging, media editing, media moderation, and multi-modal search, helping digital media enterprises improve work efficiency and reduce costs. The Vision Lab has established partnerships with digital media giants such as CCTV, People's Daily, and Xinhua News Agency.

  • Analytical Insight of Earth (AIEARTH)

    The Vision Lab uses computer vision technologies to gain analytical insights from multi-source earth observation data, extract earth surface information, and monitor dynamic changes. Compared with traditional approaches, these technologies enable high efficiency and high precision in natural resource monitoring, ecological environment monitoring, crop yield estimation, and disaster prevention.
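    As a simplified illustration of change detection on multi-temporal imagery (a toy sketch, not AI Earth's actual algorithms), the example below thresholds the per-pixel difference between two co-registered observations to produce a change mask; production remote-sensing pipelines use deep segmentation models and radiometric correction.

    # Toy change-detection sketch on two co-registered single-band images.
    import numpy as np

    def change_mask(before, after, threshold=0.2):
        """Flag pixels whose absolute difference exceeds the threshold."""
        return np.abs(after - before) > threshold

    rng = np.random.default_rng(0)
    before = rng.random((128, 128))          # observation at time t0
    after = before.copy()
    after[40:60, 40:60] += 0.5               # simulate new construction in one patch

    mask = change_mask(before, after)
    print("changed pixels:", int(mask.sum()), "of", mask.size)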


Academic Achievements
Publications and Presentations
  • Bin Wang, Pan Pan, Qinjie Xiao, Likang Luo, Xiaofeng Ren, Rong Jin, and Xiaogang Jin. Seamless Color Mapping for 3D Reconstruction with Consumer-Grade Scanning Devices. In: Proceedings of the 4th International Workshop on Recovering 6D Object Pose Organized at ECCV 2018, Munich, Germany, 2018.
  • Yanhao Zhang, Pan Pan, Yun Zheng, Kang Zhao, Yingya Zhang, Xiaofeng Ren and Rong Jin. Visual Search at Alibaba. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD'18), London, UK, 2018.
  • Jie Song, Chengchao Shen, Yezhou Yang, Yang Liu, and Mingli Song. Transductive Unbiased Embedding for Zero-shot Learning. In: Proceedings of the 31st IEEE Conference on Computer Vision and Pattern Recognition (CVPR'18), Salt Lake City, UT, 2018.
  • Peisong Wang, Qinghao Hu, Yifan Zhang, Chunjie Zhang, Yang Liu, and Jian Cheng. Two-step Quantization for Low-bit Neural Networks. In: Proceedings of the 31st IEEE Conference on Computer Vision and Pattern Recognition (CVPR'18), Salt Lake City, UT, 2018.
  • Lechao Cheng, Zicheng Liao, Xiaowei Zhao, and Yang Liu. Exploiting Non-Local Action Relationships for Dense Video Captioning. In: Proceedings of the 29th British Machine Vision Conference (BMVC'18), Newcastle, UK, 2018.
  • Zhiqi Cheng, Xiao Wu, Yang Liu and Xiansheng Hua. Video2Shop: Exact Matching Clothes in Videos to Online Shopping Images. In: Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR'17), Honolulu, Hawaii, 2017.
  • Chen Chen, Xiaowei Zhao, and Yang Liu. Multi-modal Aggregation for Video Classification. In: Proceedings of the 25th ACM Multimedia Workshop 2017 (ACM MM'17), Mountain View, CA, 2017.
  • L. Cheng, X. Zhou, L. Zhao, D. Li, H. Shang, Y. Zheng, P. Pan, Y. Xu: Weakly Supervised Learning with Side Information for Noisy Labeled Images. ECCV 2020.
  • L. Song, P. Pan, K. Zhao, H. Yang, Y. Chen, Y. Zhang, Y. Xu, R. Jin: Large-Scale Training System for 100-Million Classification at Alibaba. KDD 2020.
  • X. Zhou, P. Pan, Y. Zheng, Y. Xu, R. Jin: Large scale long-tailed product recognition system at Alibaba. CIKM 2020.
  • J. Dong, Z. Cao, T. Zhang, J. Ye, S. Wang, F. Feng, L. Zhao, X. Liu, L. Song, L. Peng, Y. Guo, X. Jiang, L. Tang, Y. Du, Y. Zhang, P. Pan, Y. Xie: EFLOPS: Algorithm and System Co-Design for a High Performance Distributed Training Platform. HPCA 2020.
  • Q. Qian, L. Chen, H. Li, R. Jin. DR Loss: Improving Object Detection by Distributional Ranking. CVPR 2020.
  • L. Han, P. Wang, Z. Yin, F. Wang, H. Li. Exploiting Better Feature Aggregation for Video Object Detection. ACMMM 2020.
  • Q. Qian, J. Hu, H. Li. Hierarchically Robust Representation Learning. CVPR 2020.
  • C. Luo, Y. Zhu, L. Jin, Y. Wang. Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition. CVPR 2020.
  • Y. Huang, M. He, Y. Wang, L. Jin. RD-GAN: Chinese Character Font Transfer via Radical Decomposition and Rendering. ECCV 2020.
  • L. Li, F. Gao, J. Bu, Y. Wang, Z. Yu, Q. Zheng. An End-to-End OCR Text Re-organization Sequence Learning for Rich-text Detail Image Comprehension. ECCV 2020.
  • M. Zhou, Z. Niu. Adversarial Ranking Attack and Defense. ECCV 2020.
  • W. Wang, X. Liu, X. Ji, E. Xie, D. Liang, Z. Yang, T. Lu, C. Shen, P. Luo. AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting. ECCV 2020.
  • H. Wang, P. Lu, H. Zhang, M. Yang, X. Bai, Y. Xu, M. He, Y. Wang, W. Liu. All You Need is Boundary: Toward Arbitrary-Shaped Text Spotting. AAAI 2020.
  • C. Liu, Y. Liu, L. Jin, S. Zhang, C. Luo, Y. Wang. EraseNet: End-to-End Text Removal in the Wild. TIP 2020.
  • Y. Zhang, P. Pan, Y. Zheng, K. Zhao, J. Wu, Y. Xu, R. Jin: Virtual ID Discovery from E-commerce Media at Alibaba: Exploiting Richness of User Click Behavior for Visual Search Relevance. CIKM 2019.
  • K. Zhao, P. Pan, Y. Zheng, Y. Zhang, C. Wang, Y. Zhang, Y. Xu, R. Jin: Large-Scale Visual Search with Binary Distributed Graph at Alibaba. CIKM 2019.
  • Q. Qian, L. Shang, B. Sun, J. Hu, H. Li, R. Jin. SoftTriple Loss: Deep Metric Learning without Triplet Sampling. ICCV 2019.
  • Z. Tan, X. Nie, Q. Qian, N. Li, H. Li. Learning to Rank Proposals for Object Detection. ICCV 2019
  • Q. Qian, S. Zhu, J. Tang, B. Sun, H. Li, R. Jin. Robust Optimization over Multiple Domains. AAAI 2019
  • Y. Xu, Y. Wang, W. Zhou, Y. Wang, Z. Yang, X. Bai. TextField: Learning A Deep Direction Field for Irregular Scene Text Detection. TIP 2019
  • M. Zhou, Z. Niu. Ladder Loss for Coherent Visual-Semantic Embedding. AAAI 2019
  • Z. Gao, Z. Niu. Video imprint segmentation for temporal action detection in untrimmed videos. AAAI 2019
  • Z. Liu, Z. Niu. Weakly supervised temporal action localization through contrast based evaluation networks. ICCV 2019
  • M. Lin, S. Qiu, J. Ye, X. Song, Q. Qian, L. Sun, S. Zhu, R. Jin. Which Factorization Machine Modeling is Better: A Theoretical Answer with Optimal Guarantee. AAAI 2019
  • M. Lin, X. Song, Q. Qian, H. Li, L. Sun, S. Zhu, R. Jin. Robust Gaussian Process Regression for Real-Time High Precision GPS Signal Enhancement. KDD 2019
  • Q. Qian, J. Tang, H. Li, S. Zhu, R. Jin. Large-scale Distance Metric Learning with Uncertainty. CVPR 2018
  • Z. Gao, Z. Niu. Video Imprint. TPAMI 2018
  • Z. Liu, Z. Niu. Joint Video Object Discovery and Segmentation by Coupled Dynamic Markov Networks. TIP 2018
  • Y. Zhang, P. Pan, Y. Zheng, K. Zhao, Y. Zhang, X. Ren, R. Jin: Visual Search at Alibaba. KDD 2018
  • B. Wang, P. Pan, Q. Xiao, L. Luo, X. Ren, R. Jin, X. Jin: Seamless Color Mapping for 3D Reconstruction with Consumer-Grade Scanning Devices. ECCV Workshops 2018
  • C. Leng, Z. Dou, H. Li, S. Zhu, R. Jin. Extremely Low Bit Neural Network: Squeeze the Last Bit Out with ADMM. AAAI 2018
Competitions
  • ECCV 2020 VIPriors Semantic Segmentation Challenge: First Place
  • ECCV 2020 Tracking Any Objects Challenge: First Place
  • ECCV 2020 Visual Domain Adaptation Challenge: First Place
  • ECCV 2020 Large Vocabulary Instance Segmentation: Second Place
  • LPIRC 2019 Classification: First Place
  • ACM MM 2017 Large-scale Video Classification: First Place
  • CVPR 2019 WebVision Challenge on Visual Understanding by Learning from Web Data: First Place
  • ICCV 2019 COCO Detection and Segmentation Challenge: First Place
  • CVPR 2020 DAVIS Challenge on Video Object Segmentation: First Place
  • CVPR 2019 iNaturalist Fine-grained Image Classification: Second Place
  • CVPR 2020 BMTT Multiple Object Tracking and Segmentation: Second Place
  • CVPR 2020 ActivityNet Temporal Action Localization: First Place
  • CVPR 2020 HACS Temporal Action Localization: First Place
  • KITTI 2018 Road-Scene Segmentation (three tasks): First Place
  • ICCV 2019 Light Weight Face Recognition Challenge: Third Place
