Qwen-VL is a large-scale vision-language model (Large Vision Language Model, LVLM) developed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as input, and outputs both text and bounding boxes. The Qwen-VL series models have the following characteristics:
- **Strong performance**: Achieves the best results among generalist models of comparable size on standard English benchmarks across four classes of multimodal tasks (Zero-shot Captioning / VQA / DocVQA / Grounding).
- **Multilingual dialogue model**: Natively supports multilingual dialogue, with end-to-end recognition of long Chinese and English text in images.
- **Multi-image interleaved dialogue**: Supports multi-image input and comparison, question answering about specified images, multi-image creative writing, and more.
- **First generalist model supporting open-domain grounding in Chinese**: Marks bounding boxes from open-domain Chinese language expressions.
- **Fine-grained recognition and understanding**: Qwen-VL is the first open-source LVLM to operate at 448 resolution, compared with the 224 resolution used by other open-source LVLMs. Higher resolution improves fine-grained text recognition, document question answering, and bounding-box annotation.