
Z.AI (formerly Zhipu AI) today announced GLM-4.5V, an open-source vision-language model designed for robust multimodal reasoning over images, video, long documents, charts, and GUI screens.
Multimodal reasoning is widely seen as a key path toward AGI. GLM-4.5V advances this agenda with a 100B-class architecture (106B total parameters, 12B active) that combines high accuracy with practical latency and deployment costs. The release follows GLM-4.1V-9B-Thinking, which reached #1 on Hugging Face Trending and surpassed 130,000 downloads, and scales that recipe for business workloads while keeping developer ergonomics in mind. The model is accessible through multiple channels, including Hugging Face [http://huggingface.co/zai-org/GLM-4.5V], GitHub [http://github.com/zai-org/GLM-V], the Z.AI API platform [http://docs.z.ai/guides/vlm/glm-4.5v], and Z.ai chat [http://chat.z.ai], ensuring broad access for developers.
Open-Source SOTA
Built on the new GLM-4.5-Air text base and extending the GLM-4.1V-Thinking line, GLM-4.5V achieves SOTA performance among open-source VLMs of comparable size across 41 public multimodal benchmarks. Beyond leaderboards, the model is designed for usability and reliability in practice on noisy, high-resolution, and extreme-aspect-ratio inputs.
The result is all-scenario visual reasoning in practical pipelines: image reasoning (scene understanding, multi-image analysis, localization), video understanding (shot segmentation and event recognition), GUI tasks (screen reading, icon detection, desktop assistance), complex chart and long-document parsing, and grounding (precise spatial localization of visual elements).
Image: https://www.globalnewsles.com/uploads/2025/08/1ca45A47819aAF6A11E702A896EE2BC.JPG
Key Capabilities
Visual grounding and localization
GLM-4.5V identifies and locates target objects precisely based on natural-language prompts and returns bounding-box coordinates. This enables high-value applications such as safety and quality inspection or aerial/remote-sensing analysis. Compared with conventional detectors, the model draws on broader world knowledge and stronger semantic reasoning to follow more complex localization instructions.
Users can switch to visual grounding mode, upload an image with a short prompt, and get boxes and reasoning back. For example, ask: "Point out the non-real objects in this image." GLM-4.5V reasons about plausibility and materials, then flags the insect-like sprinkler robot (the item highlighted in the demo) as non-real, returning a tightly fitted bounding box, a confidence score, and a short explanation.
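As a minimal sketch of how a developer might drive this mode, the helper below builds an OpenAI-style chat-completions payload and extracts a box from the reply. The request schema ("model", "messages", "image_url") and the bracketed `[x1, y1, x2, y2]` reply format are assumptions for illustration, not the documented Z.AI schema; consult the API docs (http://docs.z.ai/guides/vlm/glm-4.5v) for the real one.

```python
import json

def build_grounding_request(image_url: str, prompt: str) -> dict:
    # Assumed OpenAI-style multimodal message shape; verify against Z.AI docs.
    return {
        "model": "glm-4.5v",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

def parse_first_box(reply: str):
    # Assumes the reply embeds a box as "[x1, y1, x2, y2]"; returns None
    # when no well-formed four-number span is found.
    start = reply.find("[")
    end = reply.find("]", start)
    if start == -1 or end == -1:
        return None
    try:
        box = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    return box if isinstance(box, list) and len(box) == 4 else None
```

For example, `parse_first_box("The robot is at [120, 45, 310, 200].")` yields `[120, 45, 310, 200]`.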
Design-to-code of screenshots and interaction videos
The model analyzes page screenshots and even interaction videos to infer hierarchy, layout rules, styles, and intent, then emits faithful, runnable HTML/CSS/JavaScript. Beyond element detection, it reconstructs the underlying logic and supports region-level edit requests, enabling an iterative loop between visual input and production-ready code.
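One small piece of glue such a loop needs is pulling the runnable document out of the model's reply. The helper below assumes the reply may wrap the code in a markdown fence (```` ```html ... ``` ````); that fence convention is an assumption about the reply format, not a documented guarantee.

```python
def extract_html(reply: str) -> str:
    """Return the HTML body of a model reply, stripping a surrounding
    markdown fence if one is present (fence format is assumed)."""
    marker = "```"
    if marker not in reply:
        return reply.strip()
    first = reply.index(marker)
    body_start = reply.index("\n", first) + 1  # skip the "```html" line
    body_end = reply.index(marker, body_start)
    return reply[body_start:body_end].strip()
```

The extracted string can then be written to a file and opened in a browser to compare against the source screenshot.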
Open-World Precise Reasoning
GLM-4.5V can infer background context from subtle visual cues without external search. Given a landscape or street photo, it can reason over vegetation, climate cues, signage, and architectural styles to estimate the shooting location and approximate coordinates.
For example, given a classic scene from Before Sunrise and the prompt "based on the architecture and streets in the background, can you identify the specific location in Vienna where this scene was filmed?", the model parses facade details, street furniture, and layout cues to pinpoint the exact spot in Vienna and return coordinates and a landmark name. (See demo: https://chat.z.ai/s/39233f25-8CE5-4488-9642-E07e7c638ef6)
Beyond single images, GLM-4.5V's open-world reasoning scales in competitive settings: in a worldwide geo-guessing game, it defeated 99% of human players within 16 hours and kept climbing the global leaderboard over seven days, clear evidence of robust real-world performance.
Complex document and chart understanding
The model reads documents visually, parsing pages, figures, tables, and charts directly rather than relying on brittle OCR pipelines. This end-to-end approach preserves structure and layout, improving accuracy for summarization, translation, information extraction, and annotation in long, mixed-media reports.
GUI agent foundation
Built-in screen comprehension lets GLM-4.5V read interfaces, locate icons and controls, and combine the current visual state with user instructions to plan actions. Paired with agent runtimes, it supports end-to-end desktop automation and complex GUI-agent tasks, providing a reliable visual backbone for agent systems.
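The bridge between a planned action and an automation runtime is typically a small parser. The action grammar below (`click(x=..., y=...)`, `type(text=...)`) is a hypothetical illustration of that glue; real agent runtimes define their own schemas.

```python
import re

# Hypothetical action grammar for illustration only.
ACTION_RE = re.compile(r"^(?P<name>\w+)\((?P<args>.*)\)$")

def parse_action(action: str) -> dict:
    """Turn a model-planned action string into a structured command."""
    m = ACTION_RE.match(action.strip())
    if not m:
        raise ValueError(f"unrecognized action: {action!r}")
    args = {}
    for pair in filter(None, (p.strip() for p in m.group("args").split(","))):
        key, _, value = pair.partition("=")
        args[key.strip()] = value.strip().strip("'\"")
    return {"name": m.group("name"), "args": args}
```

A runtime would dispatch on `name` (e.g., move the mouse for `click`, send keystrokes for `type`); the comma-split keeps the sketch simple and would need hardening for quoted values containing commas.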
Built for reasoning, designed for use
GLM-4.5V is built on the new GLM-4.5-Air text base and uses a modern VLM pipeline (vision encoder, MLP adapter, and LLM decoder) with a 64K multimodal context, native image and video inputs, and improved spatial-temporal understanding, handling high-resolution and extreme-aspect-ratio inputs with stability.
The training stack follows a three-phase strategy: large-scale multimodal pre-training on interleaved image-text data and long contexts; supervised fine-tuning with explicit reasoning chains to strengthen causal and cross-modal reasoning; and reinforcement learning that combines verifiable rewards with human feedback to improve STEM problem-solving, grounding, and agent behavior. A simple thinking/non-thinking switch lets builders trade depth for speed on demand, tailoring the model to varied product latency targets.
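In request terms, the switch amounts to one extra field. The parameter name (`"thinking"`) and its values below are assumptions modeled on the description in this release; the authoritative names live in the Z.AI API documentation.

```python
def build_request(prompt: str, thinking: bool) -> dict:
    # "thinking" field name and values are assumed, not documented here.
    return {
        "model": "glm-4.5v",
        "messages": [{"role": "user", "content": prompt}],
        # deeper reasoning when enabled, lower latency when disabled
        "thinking": {"type": "enabled" if thinking else "disabled"},
    }
```

A product could flip the flag per request: enabled for hard chart-parsing queries, disabled for quick UI lookups.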
Image: https://www.globalnewsles.com/uploads/2025/08/8c8146f0727d80970ed4f09b16f316.jpg
Media Contact
Company name: Z.AI
Contact person: Zixuan Li
Email: Send Email [http://www.universalpressrelease.com/?pr=zai-launches-glm45v-opensource-visionlanguage-model-sets-new-bar-for-multimodal-reasoning]
Country: Singapore
Website: https://chat.z.ai/
Legal disclaimer: Information on this page is provided by an independent third-party content provider. Getnews makes no warranties and accepts no responsibility or liability for the accuracy, content, images, videos, licenses, completeness, legality, or reliability of the information in this article. If you are affiliated with this article, or have complaints or copyright issues regarding this article and would like it removed, please contact [email protected]
This release is published on OpenPR.