Table of Contents
Generating a custom Q/A dataset
OpenAI embedding models
Open-source embedding models

Choosing the embedding model that best fits your data: A comparison test of OpenAI and open source multi-language embeddings

Feb 26, 2024, 06:10 PM

OpenAI recently announced the release of their latest generation of embedding models, embedding v3, which they describe as their most performant embedding models yet, with higher multilingual performance. The models come in two classes: the smaller text-embedding-3-small and the larger, more powerful text-embedding-3-large.

Very little information has been disclosed about how these models are designed and trained, and they are only accessible through a paid API. Many open-source embedding models are available, but how do these open-source models compare to OpenAI's closed-source ones?

This article empirically compares the performance of these new models with that of open-source models. We will set up a data-retrieval workflow in which the key task is to find the most relevant documents in a corpus for a given user query.

Our corpus is the European AI Act, which is currently in its validation phase. It is the world's first legal framework on artificial intelligence, and it has a distinctive feature: it exists in 24 language versions. This lets us compare the accuracy of data retrieval across different linguistic backgrounds, providing valuable support for cross-cultural applications of AI.

We plan to create a custom synthetic question/answer dataset from this multilingual text corpus and use it to compare the accuracy of OpenAI's models and state-of-the-art open-source embedding models. We share the full code, since our approach can easily be adapted to other data corpora.

Generating a custom Q/A dataset

First, we create a custom question/answer (Q/A) dataset. The advantage of doing so is that the dataset cannot have been a biasing factor in any model's training, avoiding the kind of contamination that can affect benchmark references such as MTEB. Moreover, generating a custom dataset lets us tailor the evaluation to a specific data corpus, which can be important for scenarios such as retrieval-augmented generation (RAG) applications.

We follow the simple process suggested in the Llama Index documentation. First, the corpus is split into chunks. Then, for each chunk, a large language model (LLM) generates a set of synthetic questions such that the answer lies in the corresponding chunk.

Implementing this strategy with an LLM data framework such as Llama Index is straightforward, as the code below shows.

from llama_index.readers.web import SimpleWebPageReader
from llama_index.core.node_parser import SentenceSplitter

language = "EN"
url_doc = "https://eur-lex.europa.eu/legal-content/"+language+"/TXT/HTML/?uri=CELEX:52021PC0206"

documents = SimpleWebPageReader(html_to_text=True).load_data([url_doc])

parser = SentenceSplitter(chunk_size=1000)
nodes = parser.get_nodes_from_documents(documents, show_progress=True)

The corpus is the English version of the EU AI Act, retrieved directly from the web at this official URL. This article uses the draft version of April 2021, because the final version was not yet available in all European languages. In the version we chose, the language code in the URL can be replaced by any of the 23 other official EU languages to retrieve the text in a different language (BG for Bulgarian, ES for Spanish, CS for Czech, and so on).

A SentenceSplitter object splits the document into chunks of 1000 tokens each. For English, this produces about 100 chunks. Each chunk is then provided as context to the following prompt (the default prompt suggested in the Llama Index library):

prompts = {}
prompts["EN"] = """\
Context information is below.

---------------------
{context_str}
---------------------

Given the context information and not prior knowledge, generate only questions based on the below query.

You are a Teacher/Professor. Your task is to setup {num_questions_per_chunk} questions for an upcoming quiz/examination. The questions should be diverse in nature across the document. Restrict the questions to the context information provided."
"""

This prompt generates questions about a document chunk. The number of questions to generate per chunk is passed as the parameter num_questions_per_chunk, which we set to 2. Questions are then generated by calling generate_qa_embedding_pairs from the Llama Index library:

from llama_index.llms import OpenAI
from llama_index.legacy.finetuning import generate_qa_embedding_pairs

qa_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo-0125", additional_kwargs={'seed': 42}),
    nodes=nodes,
    qa_generate_prompt_tmpl=prompts[language],
    num_questions_per_chunk=2
)

We rely on OpenAI's GPT-3.5-turbo-0125 for this task. The resulting qa_dataset object contains the question/answer (chunk) pairs. As an example, here are the first two generated questions (whose "answer" is the first chunk of text); a sketch of how the dataset is saved for later evaluation follows the examples:

  1. What are the main objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) according to the explanatory memorandum?
  2. How does the proposal for a Regulation on artificial intelligence aim to address the risks associated with the use of AI while promoting the uptake of AI in the European Union, as outlined in the context information?
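The generation step above does not show how the dataset is persisted, yet the evaluation loops below reload one JSON file per language. A minimal sketch of that step, assuming the save_json helper of Llama Index's EmbeddingQAFinetuneDataset and our own file-naming convention:

# Persist the generated dataset so the evaluation loops can reload it per language.
# save_json/from_json are Llama Index helpers; the file name is our own convention.
qa_dataset.save_json(language + "_dataset.json")  # e.g. "EN_dataset.json"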

OpenAI embedding models

The evaluation function also follows the Llama Index documentation. First, the embeddings of all answers (document chunks) are stored in a VectorStoreIndex for efficient retrieval. The evaluation function then loops over all queries, retrieves the top-k most similar documents, and assesses retrieval accuracy in terms of MRR (Mean Reciprocal Rank): each query contributes the reciprocal of the rank at which the expected document is retrieved (for example, 1/2 if it comes second), or 0 if it is not retrieved at all. The code is as follows:

import numpy as np
from tqdm import tqdm
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode

def evaluate(dataset, embed_model, insert_batch_size=1000, top_k=5):
    # Get corpus, queries, and relevant documents from the qa_dataset object
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    # Create TextNode objects for each document in the corpus and create a
    # VectorStoreIndex to efficiently store and retrieve embeddings
    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]
    index = VectorStoreIndex(nodes, embed_model=embed_model, insert_batch_size=insert_batch_size)
    retriever = index.as_retriever(similarity_top_k=top_k)

    # Prepare to collect evaluation results
    eval_results = []

    # Iterate over each query in the dataset to evaluate retrieval performance
    for query_id, query in tqdm(queries.items()):
        # Retrieve the top_k most similar documents for the current query
        # and extract the IDs of the retrieved documents
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]

        # Check if the expected document was among the retrieved documents
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc per query

        # Calculate the Mean Reciprocal Rank (MRR) and append to results
        if is_hit:
            rank = retrieved_ids.index(expected_id) + 1
            mrr = 1 / rank
        else:
            mrr = 0
        eval_results.append(mrr)

    # Return the average MRR across all queries as the final evaluation metric
    return np.average(eval_results)

The embedding model is passed to the evaluation function via the embed_model parameter. For OpenAI models, this parameter is an OpenAIEmbedding object initialized with the model name and the model dimensions.

from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model=model_spec['model_name'],
                              dimensions=model_spec['dimensions'])

The dimensions parameter shortens the embedding (i.e., removes some numbers from the end of the sequence) without the embedding losing its concept-representing properties. OpenAI suggested in their announcement that, on the MTEB benchmark, embeddings can be shortened to a size of 256 while still outperforming the unshortened text-embedding-ada-002 embedding (of size 1536).
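As an illustration of what this shortening amounts to, here is a minimal sketch of our own (an assumption about the mechanism, not OpenAI's actual implementation): truncate the vector, then re-normalize it to unit length so that cosine similarities remain meaningful.

import numpy as np

def shorten_embedding(embedding: np.ndarray, dim: int = 256) -> np.ndarray:
    # Keep only the first `dim` coordinates, then L2-renormalize so that
    # cosine/dot-product similarity remains well-behaved.
    truncated = embedding[:dim]
    return truncated / np.linalg.norm(truncated)

# Example: a full 3072-dimension text-embedding-3-large vector cut down to 256.
full_embedding = np.random.randn(3072)
short_embedding = shorten_embedding(full_embedding, 256)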

We run the evaluation function on four different embedding models:

Two versions of text-embedding-3-large: one with the lowest possible dimensionality (256) and one with the highest (3072), referred to as "OAI-large-256" and "OAI-large-3072".

OAI-small: text-embedding-3-small, with a dimensionality of 1536.

OAI-ada-002: the legacy text-embedding-ada-002 model, with a dimensionality of 1536.

Each model is evaluated on four different languages: English (EN), French (FR), Czech (CS), and Hungarian (HU), covering examples of Germanic, Romance, Slavic, and Uralic languages respectively.

import pandas as pd
from llama_index.legacy.finetuning import EmbeddingQAFinetuneDataset
from llama_index.embeddings.openai import OpenAIEmbedding

embeddings_model_spec = {}

embeddings_model_spec['OAI-Large-256'] = {'model_name': 'text-embedding-3-large', 'dimensions': 256}
embeddings_model_spec['OAI-Large-3072'] = {'model_name': 'text-embedding-3-large', 'dimensions': 3072}
embeddings_model_spec['OAI-Small'] = {'model_name': 'text-embedding-3-small', 'dimensions': 1536}
embeddings_model_spec['OAI-ada-002'] = {'model_name': 'text-embedding-ada-002', 'dimensions': None}

results = []

languages = ["EN", "FR", "CS", "HU"]

# Loop through all languages
for language in languages:

    # Load dataset
    file_name = language + "_dataset.json"
    qa_dataset = EmbeddingQAFinetuneDataset.from_json(file_name)

    # Loop through all models
    for model_name, model_spec in embeddings_model_spec.items():

        # Get model
        embed_model = OpenAIEmbedding(model=model_spec['model_name'],
                                      dimensions=model_spec['dimensions'])

        # Assess embedding score (in terms of MRR)
        score = evaluate(qa_dataset, embed_model)

        results.append([language, model_name, score])

df_results = pd.DataFrame(results, columns=["Language", "Embedding model", "MRR"])

The MRR accuracies are as follows:

[Figure: MRR scores of the OpenAI embedding models for each language]

The larger the embedding size, the better the performance.

Open-source embedding models

Open-source research around embeddings is also very active, and new embedding models are frequently published on Hugging Face's MTEB leaderboard.

For the comparison in this article, we selected a set of four recently published (2024) embedding models. The selection criteria were their average score on the MTEB leaderboard and their ability to handle multilingual data. The main characteristics of the selected models are summarized below.

[Table: main characteristics of the selected open-source embedding models]

e5-mistral-7b-instruct: this E5 embedding model from Microsoft is initialized from Mistral-7B-v0.1 and fine-tuned on a mixture of multilingual datasets. It performs best on the MTEB leaderboard, but it is also by far the largest (14GB).

multilingual-e5-large-instruct (ML-E5-large): another E5 model from Microsoft, designed to better handle multilingual data. It is initialized from xlm-roberta-large and trained on a mixture of multilingual datasets. It is much smaller than E5-Mistral (by a factor of 10), and its context size is also much smaller (514).

BGE-M3: designed by the Beijing Academy of Artificial Intelligence, this is their state-of-the-art multilingual embedding model, supporting more than 100 working languages. As of February 22, 2024, it had not yet been benchmarked on the MTEB leaderboard.

nomic-embed-text-v1 (Nomic-Embed): designed by Nomic, this model is claimed to outperform OpenAI's Ada-002 and text-embedding-3-small while being only 0.55GB in size. It is the first model to be fully reproducible and auditable (open data and open-source training code).

The code for evaluating these open-source models is similar to the code used for the OpenAI models. The main change lies in the model specifications:

import time
import torch
import pandas as pd
from transformers import AutoTokenizer, AutoModel
from llama_index.legacy.finetuning import EmbeddingQAFinetuneDataset

embeddings_model_spec = {}

embeddings_model_spec['E5-mistral-7b'] = {
    'model_name': 'intfloat/e5-mistral-7b-instruct', 'max_length': 32768,
    'pooling_type': 'last_token', 'normalize': True, 'batch_size': 1,
    'kwargs': {'load_in_4bit': True, 'bnb_4bit_compute_dtype': torch.float16}}
embeddings_model_spec['ML-E5-large'] = {
    'model_name': 'intfloat/multilingual-e5-large', 'max_length': 512,
    'pooling_type': 'mean', 'normalize': True, 'batch_size': 1,
    'kwargs': {'device_map': 'cuda', 'torch_dtype': torch.float16}}
embeddings_model_spec['BGE-M3'] = {
    'model_name': 'BAAI/bge-m3', 'max_length': 8192,
    'pooling_type': 'cls', 'normalize': True, 'batch_size': 1,
    'kwargs': {'device_map': 'cuda', 'torch_dtype': torch.float16}}
embeddings_model_spec['Nomic-Embed'] = {
    'model_name': 'nomic-ai/nomic-embed-text-v1', 'max_length': 8192,
    'pooling_type': 'mean', 'normalize': True, 'batch_size': 1,
    'kwargs': {'device_map': 'cuda', 'trust_remote_code': True}}

results = []

languages = ["EN", "FR", "CS", "HU"]

# Loop through all models
for model_name, model_spec in embeddings_model_spec.items():

    print("Processing model : " + str(model_spec))

    # Get model
    tokenizer = AutoTokenizer.from_pretrained(model_spec['model_name'])
    embed_model = AutoModel.from_pretrained(model_spec['model_name'], **model_spec['kwargs'])

    if model_name == "Nomic-Embed":
        embed_model.to('cuda')

    # Loop through all languages
    for language in languages:

        # Load dataset
        file_name = language + "_dataset.json"
        qa_dataset = EmbeddingQAFinetuneDataset.from_json(file_name)

        start_time_assessment = time.time()

        # Assess embedding score (in terms of MRR)
        score = evaluate(qa_dataset, tokenizer, embed_model, model_spec['normalize'],
                         model_spec['max_length'], model_spec['pooling_type'])

        # Get duration of score assessment
        duration_assessment = time.time() - start_time_assessment

        results.append([language, model_name, score, duration_assessment])

df_results = pd.DataFrame(results, columns=["Language", "Embedding model", "MRR", "Duration"])
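Note that evaluate is called here with a different signature than before: it receives the tokenizer, the raw Hugging Face model, and the pooling strategy, so it must compute embeddings itself rather than going through a Llama Index embedding object. That modified function is not reproduced here, so below is a minimal sketch of the embedding step it would need (the helper name generate_embeddings is hypothetical, and right-side padding is assumed):

import torch
import torch.nn.functional as F

def generate_embeddings(tokenizer, model, texts, normalize=True,
                        max_length=512, pooling_type='mean'):
    # Tokenize and run the model; hidden states have shape (batch, seq_len, dim).
    inputs = tokenizer(texts, padding=True, truncation=True,
                       max_length=max_length, return_tensors='pt').to(model.device)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs['attention_mask']

    if pooling_type == 'cls':
        # First-token representation (e.g. BGE-M3).
        emb = hidden[:, 0]
    elif pooling_type == 'last_token':
        # Last non-padding token (e.g. E5-Mistral).
        last_idx = mask.sum(dim=1) - 1
        emb = hidden[torch.arange(hidden.size(0)), last_idx]
    else:
        # Mean-pool over non-padding tokens (e.g. ML-E5-large, Nomic-Embed).
        emb = (hidden * mask.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True)

    # L2-normalize so that dot product equals cosine similarity.
    return F.normalize(emb, p=2, dim=1) if normalize else emb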

The results are as follows:

[Figure: MRR scores of the open-source embedding models for each language]

BGE-M3 performed best, followed by ML-E5-large, E5-mistral-7b, and Nomic-Embed. The BGE-M3 model has not yet been benchmarked on the MTEB leaderboard, and our results suggest it could rank higher than other models. Although BGE-M3 is optimized for multilingual data, it also performs better on English than the other models do.

Because open-source models generally need to be run locally, we also deliberately recorded the processing time of each embedding model.

[Figure: embedding processing time per model and language]

E5-mistral-7b is more than 10 times larger than the other models, so it is normal for it to be the slowest.

Summary

We summarize all the results below.

[Figure: summary of MRR scores for all models and languages]

The best performance was obtained with an open-source model: BGE-M3 performed best overall. It has the same context length as the OpenAI models (8K tokens) and is 2.2GB in size.

The performance of OpenAI's large (3072), small, and ada models is very similar. Reducing the embedding size of large to 256 degraded its performance, and contrary to what OpenAI states, it did not outperform ada.

Almost all models (except ML-E5-large) perform best in English. In languages such as Czech and Hungarian there are significant differences in performance, possibly because less training data is available for them.

Should we pay for an OpenAI subscription, or host an open-source embedding model?

OpenAI's recent price adjustment makes its API much more affordable: it now costs $0.13 per million tokens. If you process a million queries per month (assuming each query involves about 1K tokens), the cost is about $130 per month. You can therefore decide whether to host an open-source embedding model based on your actual needs.
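As a back-of-the-envelope check of that figure (a sketch under the stated pricing assumptions, not a quote of OpenAI's billing rules):

# Rough monthly cost estimate, assuming $0.13 per million tokens
# (OpenAI's pricing at the time of writing) and ~1K tokens per query.
queries_per_month = 1_000_000
tokens_per_query = 1_000
price_per_million_tokens = 0.13  # USD
monthly_tokens = queries_per_month * tokens_per_query  # 1e9 tokens
monthly_cost = monthly_tokens / 1_000_000 * price_per_million_tokens
print(f"${monthly_cost:.2f} per month")  # $130.00 per month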

Of course, cost-effectiveness is not the only consideration; other factors such as latency, privacy, and control over data-processing workflows may also matter. Open-source models offer the advantages of complete data control, enhanced privacy, and customizability.

As for latency, OpenAI's API can itself suffer latency issues that occasionally stretch response times, so it is not necessarily the fastest choice.

In short, there is no simple answer when choosing between open-source models and a proprietary solution like OpenAI's. Open-source embeddings offer a compelling option that combines performance with greater control over your data, while OpenAI's products may still appeal to those who prioritize convenience, especially if privacy concerns are secondary.

The code for this article: https://github.com/Yannael/multilingual-embeddings
