

Easily understand 4K HD images! This large multi-modal model automatically analyzes the content of web posters, making it very convenient for workers.

Apr 23, 2024 am 08:04 AM

A large model that can automatically analyze the content of PDFs, web pages, posters, and Excel charts could hardly be more convenient for working people.

The InternLM-XComposer2-4KHD (abbreviated as IXC2-4KHD) model proposed by Shanghai AI Lab, the Chinese University of Hong Kong and other research institutions makes this a reality.


Whereas other multi-modal large models stay within a resolution limit of 1500x1500, this work raises the maximum input image resolution of a multi-modal large model to beyond 4K (3840x1600), and supports any aspect ratio and dynamic resolutions ranging from 336 pixels to 4K.

Three days after its release, the model topped the Hugging Face popularity list for visual question answering models.


Easy 4K image understanding

Let’s take a look at the results first.

The researchers fed in a screenshot of the paper "ShareGPT4V: Improving Large Multi-Modal Models with Better Captions" (resolution 2550x3300) and asked which model has the highest performance on MMBench.

It should be noted that this information is not mentioned in the text part of the input screenshot, but only appears in a rather complicated radar chart. Faced with such a tricky question, IXC2-4KHD successfully understood the information in the radar chart and answered the question correctly.


Faced with an image of more extreme dimensions (816x5133), IXC2-4KHD easily recognizes that the image consists of 7 parts and accurately describes the text content of each part.


The researchers also comprehensively tested IXC2-4KHD on 16 multi-modal large model benchmarks, 5 of which (DocVQA, ChartQA, InfographicVQA, TextVQA, OCRBench) focus on the model’s high-resolution image understanding capabilities.

Using only 7B parameters, IXC2-4KHD matched or even surpassed GPT-4V and Gemini Pro on 10 of the evaluations, demonstrating that it is not limited to high-resolution image understanding but is also versatile across a variety of tasks and scenarios.


△With only 7B parameters, the performance of IXC2-4KHD is comparable to GPT-4V and Gemini-Pro.

How to achieve 4K dynamic resolution?

In order to achieve the goal of 4K dynamic resolution, IXC2-4KHD includes three main designs:

(1) Dynamic resolution training:


△4K resolution image processing strategy

In the IXC2-4KHD framework, the input image is randomly enlarged to an intermediate size between its original area and the maximum area (no more than 55x336x336 pixels, equivalent to 3840x1617 resolution).

The image is then automatically cut into multiple 336x336 tiles, and visual features are extracted from each tile separately. This dynamic resolution training strategy allows the model to adapt to visual input of any resolution, while also compensating for the shortage of high-resolution training data.
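To make the tiling step concrete, here is a minimal sketch of how a tile grid could be chosen under a tile budget. Only the 336-pixel tile size and the 55-tile cap come from the article; the function names `tile_grid` and `tile_boxes`, the aspect-preserving rule `rows = ceil(cols * height / width)`, and the resize-and-pad step are illustrative assumptions, not the authors' code. The random enlargement used during training is not modelled.

```python
import math

TILE = 336  # tile side in pixels, as stated in the article


def tile_grid(width: int, height: int, max_tiles: int) -> tuple[int, int]:
    """Return (cols, rows) for the widest aspect-preserving grid within max_tiles.

    Assumption: rows = ceil(cols * height / width); the image would then be
    resized and padded to (cols * TILE) x (rows * TILE) before cutting.
    """
    if width <= 0 or height <= 0 or max_tiles < 1:
        raise ValueError("width, height and max_tiles must be positive")
    cols = 1
    # Grow the grid one column at a time until the next step exceeds the budget.
    while (cols + 1) * math.ceil((cols + 1) * height / width) <= max_tiles:
        cols += 1
    rows = min(math.ceil(cols * height / width), max_tiles)  # clamp very tall images
    return cols, rows


def tile_boxes(cols: int, rows: int) -> list[tuple[int, int, int, int]]:
    """Row-major (left, top, right, bottom) crop boxes for each 336x336 tile."""
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


# The 55-tile budget matches the 3840x1617 figure quoted above.
assert 55 * TILE * TILE == 3840 * 1617

cols, rows = tile_grid(3840, 1617, max_tiles=55)
print(cols, rows, len(tile_boxes(cols, rows)))  # 11 5 55
```

Under these assumptions a 3840x1617 input maps to an 11x5 grid that uses the full 55-tile budget, while the 816x5133 example from earlier maps to a 2x13 grid.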

Experiments show that as the upper limit of dynamic resolution increases, the model achieves steady performance improvement on high-resolution image understanding tasks (InfographicVQA, DocVQA, TextVQA), and the gains have still not saturated at 4K resolution, demonstrating the potential for further expansion to higher resolutions.


(2) Add tile layout information:

To enable the model to adapt to widely varying dynamic resolutions, the researchers found it necessary to add tile layout information as an additional input. To achieve this, they adopted a simple strategy: a special ‘newline’ (‘\n’) token is inserted after each row of tiles to inform the model of the tile layout. Experiments show that adding tile layout information has little impact on dynamic resolution training with relatively small variation (HD9, meaning the number of tiles does not exceed 9), but brings significant performance improvements to dynamic 4K resolution training.
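The layout trick described above can be sketched in a few lines. This is a simplified illustration rather than the model's implementation: each tile is represented by a single placeholder string, whereas a real model would carry many visual tokens per tile, and the helper name `add_layout_tokens` is hypothetical.

```python
NEWLINE = "\n"  # stand-in for the special newline token


def add_layout_tokens(tile_tokens: list[str], cols: int) -> list[str]:
    """Flatten row-major tile tokens, appending a newline token after each row."""
    if cols < 1 or len(tile_tokens) % cols != 0:
        raise ValueError("tile count must be a multiple of a positive cols")
    sequence = []
    for start in range(0, len(tile_tokens), cols):
        sequence.extend(tile_tokens[start:start + cols])  # one full row of tiles
        sequence.append(NEWLINE)  # marks where the row ends
    return sequence


tokens = add_layout_tokens(["t0", "t1", "t2", "t3", "t4", "t5"], cols=3)
print(tokens)  # ['t0', 't1', 't2', '\n', 't3', 't4', 't5', '\n']
```

Without the row markers, a flattened 2x3 grid and a flattened 3x2 grid would look identical to the language model; with them, the sequence itself encodes the grid shape.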


(3) Expanding the resolution during the inference phase:

The researchers also found that a model trained with dynamic resolution can be extended directly at inference time by raising the maximum tile limit, which brings additional performance gains. For example, directly testing a model trained with HD9 (up to 9 tiles) using HD16 yields a performance improvement of up to 8% on InfographicVQA.
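As a rough illustration of what raising the tile cap buys, the arithmetic below compares the pixel budget of each setting. It assumes only the 336x336 tile size from the article; the helper names are hypothetical, and the "largest square" figure is just one convenient way to read the budget.

```python
import math

TILE = 336  # tile side in pixels, as stated in the article


def pixel_budget(max_tiles: int) -> int:
    """Total number of pixels covered by max_tiles tiles."""
    return max_tiles * TILE * TILE


def largest_square_side(max_tiles: int) -> int:
    """Side length of the largest square image that fits a whole-tile grid."""
    return math.isqrt(max_tiles) * TILE


for name, cap in [("HD9", 9), ("HD16", 16), ("4K (55 tiles)", 55)]:
    print(f"{name}: {pixel_budget(cap)} pixels, "
          f"square up to {largest_square_side(cap)} px per side")
```

Moving from HD9 to HD16 raises the budget from 1008x1008 to 1344x1344 for a square input, roughly 1.8 times as many pixels, with no retraining involved.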


IXC2-4KHD raises the resolution supported by multi-modal large models to the 4K level. The researchers note that the current strategy of supporting larger image inputs by increasing the number of tiles runs into computational cost and GPU memory bottlenecks, so they plan to propose a more efficient strategy to support higher resolutions in the future.

Paper link:

https://arxiv.org/pdf/2404.06512.pdf

Project link:

https://github.com/InternLM/InternLM-XComposer

