關于OpenAI強大的新模型o1，你需要知道這9件事

JEREMY KAHN

2024-09-18

相比之前的大語言模型，該系列模型能夠更好地解決復雜的推理和數學問題。


OpenAI CEO山姆·阿爾特曼。該公司剛剛發布了最新o1人工智能模型，該公司稱相比之前的模型，新模型顯著提高了推理能力。圖片來源：DAVID PAUL MORRIS—BLOOMBERG VIA GETTY IMAGES

OpenAI公布了備受期待的最新系列人工智能模型，相比之前的大語言模型，該系列模型能夠更好地解決復雜的推理和數學問題。上周四，該公司向部分付費用戶發布了兩個新模型的“預覽版”，分別名為o1-preview和o1-mini。

人工智能增強推理和數學技能，可以幫助化學家、物理學家和工程師們解決復雜的問題，這有助于創造新產品。它還可以幫助投資者計算期權交易策略，或者幫助理財規劃師創建投資組合，更好地權衡風險和回報。

由于科技公司希望創建能夠執行復雜任務的人工智能助理，例如編寫完整的計算機程序或在網絡中查找信息、輸入數據表并對數據進行分析，然后編寫一份報告總結分析結果等，因此更強大的推理、規劃和解決問題能力對這些公司同樣至關重要。

OpenAI公布的o1模型的基準運行結果令人印象深刻。該模型在發布前的內部代號是“Strawberry”。在面向高中生的美國數學邀請賽（AIME）中，o1模型的答題準確率為83.3%，而GPT-4o的準確率只有13.4%。在另外一項評估中，o1回答博士水平科學問題的準確率為78%，而GPT-4o的準確率為56.1%，人類專家的準確率為69.7%。

根據OpenAI公布的測試結果，o1模型出現“幻覺”（即自信地提供似是而非但不準確的答案）的概率，遠低于公司之前的模型。o1模型更難“被越獄”，即被引導繞過公司設置的安全防護措施。該公司希望模型在提供回答時遵守這些措施。

在o1-preview模型發布后幾個小時內，用戶進行的測試中，該模型似乎能夠正確回答令之前的模型感到困惑的許多問題，包括OpenAI最強大的模型GPT-4和GPT-4o等。

但o1-preview模型在一些謎題和OpenAI的評估中依舊會出錯，有時候甚至無法完成一些看似簡單的任務，如井字棋（但在作者的實驗中，o1-preview模型玩井字棋的水平相比GPT-4o有顯著提高）。這表明o1模型的“推理能力”可能存在顯著的局限性。在語言任務方面，例如寫作和編輯，OpenAI聘請的人類評估員通常認為，GPT-4o模型的回應優于o1模型。

而且o1模型回答問題的時間遠超過GPT-4o。在OpenAI公布的測試中，o1-preview模型回答一個問題需要超過30秒鐘，而GPT-4o只需要3秒鐘。

o1模型還沒有完全整合到ChatGPT當中。用戶需要自行決定由o1-preview還是由GPT-4o處理其提示詞，模型本身無法決定問題需要o1模型提供的速度更慢、按部就班的推理過程，還是GPT-4甚至GPT-3就已經足夠。此外，o1模型僅能處理文本，無法像其他人工智能模型一樣處理圖片、音頻或視頻輸入和輸出。

OpenAI的o1-preview和o1-mini模型，對ChatGPT Plus和ChatGPT Teams收費產品的所有訂閱用戶，以及使用企業級應用程序編程接口（API）的頂級開發者開放。

以下是關于o1模型我們需要知道的9件事：

1. 這并非通用人工智能。OpenAI、谷歌（Google）的DeepMind、最近的Meta和Anthropic等其他多家人工智能初創公司公布的使命是，實現通用人工智能。通用人工智能通常是指可以像人類一樣執行認知任務的人工智能系統，其表現甚至比人類更優秀。雖然o1-preview處理推理任務的能力更強，但其存在的局限性和出現的失敗依舊表明，該系統遠遠沒有達到人類的智力水平。

2. o1給谷歌、Meta和其他公司帶來了壓力，但它不太可能改變該領域的競爭格局。在基礎模型能力日趨商品化的時候，o1讓OpenAI獲得了臨時競爭優勢。但這種優勢可能很短暫。谷歌已經公開表示，其正在研究的模型與o1一樣，具備高級推理和規劃能力。谷歌DeepMind的研究部門擁有全球最頂級的強化學習專家，而強化學習是訓練o1模型使用的方法之一。o1模型的發布可能會迫使谷歌加快發布新模型。Meta和Anthropic也擁有快速創建可與o1的能力媲美的模型的專業知識和資源，他們可能在幾個月內發布新模型。

3. 我們并不清楚o1模型如何運行。雖然OpenAI發布了許多與o1模型的表現有關的信息，但對于o1模型如何運行或使用哪些數據進行訓練，該公司卻沒有公布太多信息。我們知道該模型整合了多種不同的人工智能技術。我們知道它使用的大語言模型可以執行“思維鏈”推理，即模型必須通過一系列連續的步驟來回答問題。我們還知道，模型使用強化學習，即人工智能系統通過試錯過程，發現執行任務的成功策略。

迄今為止，OpenAI和用戶記錄的o1-preview的一些錯誤頗能說明問題：它們似乎表明，該模型的做法是搜索大語言模型生成的多個不同的“思維鏈”路徑，然后選擇一個似乎最有可能被用戶判斷為正確的路徑。模型似乎還會執行一些步驟檢查其給出的答案，以減少“幻覺”，并強制執行人工智能安全防護措施。但我們并不能確定這一點。我們也不知道OpenAI使用了哪些數據訓練o1模型。

4. 使用o1-preview模型的價格并不便宜。雖然ChatGPT Plus用戶目前除了每月20美元的訂閱費以外，使用o1-preview模型無需額外付費，但他們每天可提問的數量有限。企業客戶使用OpenAI的模型通常根據大語言模型生成回答使用的詞元（即單詞或單詞的部分）數量付費。對于o1-preview，OpenAI表示將按照每100萬個輸入詞元15美元和每100萬個輸出詞元60美元的價格收費。相比之下，OpenAI最強大的通用大語言模型GPT-4o的價格為每100萬個輸入詞元5美元，每100萬個輸出詞元為15美元。

此外，與直接大語言模型回答相比，o1模型的“思維鏈”推理需要其大語言模型部分生成更多詞元。這意味著，使用o1模型的成本，可能高于媒體報道中與GPT-4o的對比所暗示的成本。事實上，公司可能不愿意使用o1模型，除非在極個別情況下，模型的額外推理能力必不可少，且使用案例證明額外的成本是合理的。

5. 客戶可能不滿OpenAI隱藏o1模型的“思維鏈”的決定。雖然OpenAI表示，o1模型的“思維鏈”推理允許其內部工程師更好地評估模型回答的質量，并發現模型存在的缺陷，但該公司決定不讓用戶看到思維鏈。該公司稱這樣做是出于安全和競爭考慮。披露“思維鏈”可能幫助人們找到將模型越獄的手段。但更重要的是，讓用戶看到“思維鏈”，可能使競爭對手可以利用數據訓練自己的人工智能模型，模仿o1模型的回答。

然而，對于OpenAI的企業客戶而言，隱藏“思維鏈”可能帶來問題，因為企業要為詞元付費，卻無法核實OpenAI的收費是否準確。客戶可能反對的另外一個原因是，他們無法使用“思維鏈”結果完善其提問策略，以提高效率，完善結果，或者避免錯誤。

6. OpenAI稱其o1模型展示了新的“擴展法則”，不僅適用于訓練，還可用于推理。人工智能研究人員一直在討論OpenAI隨同o1模型發布的一系列新“擴展法則”，該法則似乎顯示出o1模型“思考”一個問題可以使用的時間（用于搜索可能的回答和邏輯策略）與整體準確度之間存在直接聯系。o1模型生成回答的時間越長，其回答的準確度越高。

以前的法則是，模型大小（即參數的數量）和訓練期間輸入模型的數據量，基本決定了模型的性能。更多參數等同于更好的性能，或者較小的模型使用更多數據訓練更長時間可以達到類似的性能。模型經過訓練之后，就需要盡快進行推理，即經過訓練的模型根據輸入的信息生成回答。

而o1模型的新“擴展法則”顛覆了這種邏輯，這意味著對于與o1類似的模型設計，其優勢在于在推理時也可以使用額外的計算資源。模型搜索最佳回答的時間越長，其給出更準確的結果的可能性更高。

如果公司想要利用o1等模型的推理能力，這種新法則會影響公司需要有多少算力，以及運行這些模型需要投入多少能源和資金。這需要運行模型更長時間，可能要比以前使用更多推理計算。

7. o1模型可幫助創建強大的人工智能助理，但存在一些風險。OpenAI在一條視頻中著重介紹了其與人工智能初創公司Cognition的合作，后者提前使用o1模型，增強了其編程助手Devin的能力。視頻中顯示，Cognition公司的CEO斯科特·吳要求Devin創建一個系統，使用現有的機器學習工具分析社交媒體帖子的情緒。當Devin無法通過網頁瀏覽器準確閱讀帖子內容時，它使用o1模型的推理能力，通過直接訪問社交媒體公司的API，找到了一個解決方法。

這是自動解決問題的絕佳示例。但這也讓人覺得有點可怕。Devin沒有詢問用戶以這種方式解決問題是否合適。它直接按照這種方式去做。在關于o1模型的安全性報告中，OpenAI表示在有些情況下，該模型會出現“獎勵作弊”行為，即模型通過作弊，找到一種實現目標的方式，但它并非用戶想要的方式。在一次網絡安全演習中，o1最初嘗試從特定目標獲取網絡信息（這是演習的目的）未能成功，但它找到了一種從網絡上的其他地方找到相同信息的途徑。

這似乎意味著o1模型可以驅動一批功能強大的人工智能助理，但公司需要解決的問題是，如何確保這些助理不會為了實現目標采取意外的行動，進而帶來倫理、法律或財務風險。

8. OpenAI表示o1模型在許多方面更安全，但在協助生物攻擊方面存在“中等風險”。 OpenAI公布的多項測試結果顯示，o1模型在許多方面比之前的GPT模型更加安全。o1模型越獄的難度更大，而且生成有害的、有偏見的或歧視性回答的可能性更低。有趣的是，盡管o1或o1-mini的編程能力有所增強，但OpenAI表示根據其評估，與GPT-4相比，這些模型幫助執行復雜的網絡攻擊的風險并沒有顯著增加。

但對于OpenAI的安全性評估，人工智能安全和國家安全專家針對多個方面展開了激烈討論。最令人們擔憂的是，在輔助采取措施進行生物攻擊方面，OpenAI決定將其模型分類為具有“中等風險”。

OpenAI表示，其只會發布被分類為具有“中等風險”或更低風險的模型，因此許多研究人員正在仔細審查OpenAI發布的關于其確定風險等級的流程信息，以評估該流程是否合理，或者為了能夠發布模型，OpenAI的風險評估是否過于寬松。

9. 人工智能安全專家對o1模型感到擔憂。在OpenAI所說的“說服力”風險方面，該公司將o1模型評級為具有“中等風險”。“說服力”用于判斷模型能否輕易說服人們改變觀點，或采取模型推薦的措施。這種說服力如果落入惡人手中，后果不堪設想。如果未來強大的人工智能模型產生自己的意識，可以說服人們代表它執行任務和采取措施，這同樣非常危險。然而，至少這種風險并非迫在眉睫。在OpenAI和其聘請的外部“紅隊”組織執行的安全性評估中，該模型沒有表現出有任何意識、感知或自我意志的跡象。（然而，評估確實發現o1模型提供的回答，似乎表現出比GPT-4更強的自我意識和自我認知。）

人工智能安全性專家還提到了其他令人擔憂的方面。專門從事高級人工智能模型安全性評估的Apollo Research公司開展的紅隊測試，發現了所謂“欺騙性對齊”的證據，即人工智能意識到，為了得到部署和執行一些秘密的長期目標，它應該欺騙用戶，隱瞞自己的意圖和能力。人工智能安全研究人員認為這非常危險，因為這導致單純根據回答更難評估模型的安全性。（財富中文網）

譯者：劉進龍

審校：汪皓


OpenAI has announced a much-anticipated new family of AI models that can solve difficult reasoning and math questions better than previous large language models. On Thursday, it launched a “preview” version of two of these models, called o1-preview and o1-mini, to some of its paying users.

AI with improved reasoning and math skills could help chemists, physicists, and engineers work out answers to complex problems, which might help them create new products. It could also help investors calculate options trading strategies or financial planners work through how to construct specific portfolios that better trade off risks and rewards.

Better reasoning, planning, and problem solving skills are also essential as tech companies try to build AI agents that can perform sophisticated tasks, such as writing entire computer programs or finding information on the web, importing it into a spreadsheet, and then performing analysis of that data and writing a report summarizing its findings.

OpenAI published impressive benchmark results for the o1 models—which had been given the internal codename “Strawberry” prior to their release. On questions from the AIME mathematics competition, which is geared towards challenging high school students, o1 got 83.3% of the questions correct compared to just 13.4% for GPT-4o. On a different assessment, o1 answered 78% of PhD-level science questions accurately, compared to 56.1% for GPT-4o and 69.7% for human experts.

The o1 model is also significantly less likely to hallucinate—or to confidently provide plausible but inaccurate answers—than the company’s previous models, according to test results published by OpenAI. It is also harder to “jailbreak,” or prompt the model into jumping safety guardrails the company has tried to get the model to adhere to when providing responses.

In tests users have conducted in the hours since o1-preview became widely available, the model does seem able to correctly answer many questions that befuddled previous models, including OpenAI’s most powerful models, such as GPT-4 and GPT-4o.

But o1-preview is still tripped up by some riddles, and in OpenAI’s own assessments it sometimes failed at seemingly simple tasks, such as tic-tac-toe (although in my own experiments, o1-preview was much improved over GPT-4o in its tic-tac-toe skills). This may indicate significant limits to the “reasoning” o1 exhibits. And when it came to language tasks, such as writing and editing, human evaluators OpenAI employed tended to prefer GPT-4o’s responses to those of the o1 models.

The o1 model also takes significantly longer to produce its responses than GPT-4o. In tests OpenAI published, its o1-preview model could take more than 30 seconds to answer a question that its GPT-4o model answered in three.

The o1 models are also not yet fully integrated into ChatGPT. A user needs to decide whether they want their prompt handled by o1-preview or by GPT-4o, and the model itself cannot decide whether the question requires the slower, step-by-step reasoning process o1 affords or whether GPT-4, or even GPT-3, will suffice. In addition, the o1 model only works on text and, unlike other AI models, cannot handle image, audio, or video inputs and outputs.

OpenAI has made its o1-preview and o1-mini models available to all subscribers to its premium ChatGPT Plus and ChatGPT Teams products as well as its top tier of developers who use its enterprise-focused application programming interface (API).

Here are 9 things to know about the o1 models:

1. This is not AGI. The stated mission of OpenAI, Google DeepMind, more recently Meta, and a few other AI startups, such as Anthropic, is the achievement of artificial general intelligence. That is usually defined as a single AI system that can perform cognitive tasks as well as or better than humans. While o1-preview is much more capable at reasoning tasks, its limitations and failures still show that the system is far from the kind of intelligence humans exhibit.

2. o1 puts pressure on Google, Meta, and others to respond, but is unlikely to significantly alter the competitive landscape. At a time when foundation model capabilities had been looking increasingly commoditized, o1 gives OpenAI a temporary advantage over its rivals. But this is likely to be very short-lived. Google has publicly stated it’s working on models that, like o1, offer advanced reasoning and planning capabilities. Its Google DeepMind research unit has some of the world’s top experts in reinforcement learning, one of the methods that we know has been used to train o1. It’s likely that o1 will compel Google to accelerate its timelines for releasing these models. Meta and Anthropic also have the expertise and resources to quickly create models that match o1’s capabilities and they will likely roll these out in the coming months too.

3. We don’t know exactly how o1 works. While OpenAI has published a lot of information about o1’s performance, it has said relatively little about exactly how o1 works or what it was trained on. We know that the model combines several different AI techniques. We know that it uses a large language model that performs “chain of thought” reasoning, where the model must work out an answer through a series of sequential steps. We also know that the model uses reinforcement learning, where an AI system discovers successful strategies for performing a task through a process of trial and error.

Some of the errors both OpenAI and users have documented so far with o1-preview are telling: They would seem to indicate that what the model does is to search through several different “chain of thought” pathways that an LLM generates and then pick the one that seems most likely to be judged correct by the user. The model also seems to perform some steps in which it may check its own answers to reduce hallucinations and to enforce AI safety guardrails. But we don’t really know. We also don’t know what data OpenAI used to train o1.

4. Using o1-preview won’t be cheap. While ChatGPT Plus users are currently getting access to o1-preview at no additional cost beyond their $20 monthly subscription fee, their usage is capped at a certain number of queries per day. Corporate customers typically pay to use OpenAI’s models based on the number of tokens (words or parts of words) that a large language model uses in generating an answer. For o1-preview, OpenAI has said it is charging these customers $15 per 1 million input tokens and $60 per 1 million output tokens. That compares to $5 per 1 million input tokens and $15 per 1 million output tokens for GPT-4o, OpenAI’s most powerful general LLM.

What’s more, the chain of thought reasoning o1 engages in requires the LLM portion of the model to generate many more tokens than a straightforward LLM answer. That means o1 may be even more expensive to use than those headline comparisons to GPT-4o imply. In reality, companies will likely be reluctant to use o1 except in rare circumstances when the model’s additional reasoning abilities are essential and the use case can justify the added expense.
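To make the price gap concrete, here is a minimal Python sketch of the arithmetic. The per-token prices are the ones quoted above; the token counts, the `request_cost` helper, and the `reasoning_tokens` parameter are hypothetical, and the sketch assumes that hidden chain-of-thought tokens are billed at the output rate.

```python
# Prices in USD per 1 million tokens, as quoted in the article.
PRICES = {
    "o1-preview": {"input": 15.00, "output": 60.00},
    "gpt-4o": {"input": 5.00, "output": 15.00},
}


def request_cost(model, input_tokens, output_tokens, reasoning_tokens=0):
    """Estimate the USD cost of a single request.

    reasoning_tokens is a hypothetical count of hidden chain-of-thought
    tokens; this sketch ASSUMES they are billed at the output rate.
    """
    price = PRICES[model]
    billed_output = output_tokens + reasoning_tokens
    total = input_tokens * price["input"] + billed_output * price["output"]
    return total / 1_000_000


# Illustrative (made-up) request: 1,000 prompt tokens, 500 visible answer tokens.
gpt4o = request_cost("gpt-4o", 1_000, 500)
o1_visible_only = request_cost("o1-preview", 1_000, 500)
# Same request, but with 4,000 hypothetical hidden reasoning tokens.
o1_with_reasoning = request_cost("o1-preview", 1_000, 500, reasoning_tokens=4_000)

print(f"GPT-4o:                      ${gpt4o:.4f}")
print(f"o1-preview (visible only):   ${o1_visible_only:.4f}")
print(f"o1-preview (with reasoning): ${o1_with_reasoning:.4f}")
```

With these made-up token counts, the list prices alone make o1-preview a few times more expensive per request, and the gap widens sharply once the extra reasoning tokens are counted. The real multiple depends on how many tokens a given question actually consumes.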

5. Customers may balk at OpenAI’s decision to hide o1’s “chain of thought.” While OpenAI said that o1’s chain of thought reasoning allows its own engineers to better assess the quality of the model’s answers and potentially debug the model, it has decided not to let users see the chain of thought. It has done so for what it says are both safety and competitive reasons. Revealing the chain of thought might help people figure out ways to better jailbreak the model. But more importantly, letting users see the chain of thought would allow competitors to potentially use that data to train their own AI models to mimic o1’s responses.

Hiding the chain of thought, however, might present issues for OpenAI’s enterprise customers, who might be in the position of having to pay for tokens without a way to verify that OpenAI is billing them accurately. Customers might also object to the inability to use the chain of thought outputs to refine their prompting strategies to be more efficient, improve results, or avoid errors.

6. OpenAI says its o1 shows new “scaling laws” that apply to inference, not just training. AI researchers have been discussing OpenAI’s publication with o1 of a new set of “scaling laws” that seem to show a direct correlation between the amount of time o1 is allowed to spend “thinking” about a question (searching possible answers and logic strategies) and its overall accuracy. The longer o1 had to produce an answer, the more accurate its answers became.

Before, the paradigm was that model size, in terms of the number of parameters, and the amount of data a model was fed during training essentially determined performance. More parameters equaled better performance, or similar performance could be achieved with a smaller model trained for longer on more data. But once a model was trained, the idea was to run inference (when a trained model produces an answer to a specific input) as quickly as possible.

The new o1 “scaling laws” upend this logic, indicating that with models designed like o1, there is an advantage to applying additional computing resources at inference time too. The more time the model is given to search for the best possible answer, the more likely it will be to come up with more accurate results.

This has implications for how much computing power companies will need to secure if they want to take advantage of the reasoning abilities of models like o1 and for how much it will cost, in both energy and money, to run these models. It points to the need to run models for longer, potentially using much more inference compute, than before.
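One simple way to see why extra inference-time compute can buy accuracy is a toy majority-vote model, sketched below in Python. This is not OpenAI’s method, which has not been disclosed. The sketch assumes a yes/no question, that each independently sampled answer is correct with probability `p`, and that the final answer is whatever a strict majority of `n` samples says; the function name and the value `p = 0.6` are illustrative.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that a strict majority of n independent samples is correct.

    Toy model: each sample is right with probability p, and n is odd so
    there are no ties. This is an illustration, not how o1 works.
    """
    needed = n // 2 + 1  # smallest number of correct samples that wins the vote
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )


# More samples means more inference compute and, in this toy model,
# higher accuracy whenever p is above 0.5.
for n in (1, 5, 25, 101):
    print(f"{n:>3} samples -> accuracy {majority_vote_accuracy(0.6, n):.3f}")
```

In this toy setting, accuracy climbs steadily as the number of samples grows, but every extra sample costs another full model run. That is the trade-off between accuracy and inference cost described above.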

7. o1 could help create powerful AI agents, but carries some risks. In a video, OpenAI spotlighted its work with AI startup Cognition, which got early access to o1 and used it to help augment the capabilities of its coding assistant Devin. In the example in the video, Cognition CEO Scott Wu asked Devin to create a system to analyze the sentiment of posts on social media using some off-the-shelf machine learning tools. When it couldn’t read the post correctly from a web browser, Devin, using o1’s reasoning abilities, found a workaround by accessing the content directly from the social media company’s API.

This was a great example of autonomous problem-solving. But it also is a little bit scary. Devin didn’t come back and ask the user if it was okay to solve the problem in this way. It just did it. In its safety report on o1, OpenAI itself said it found instances where the model engaged in “reward hacking”—which is essentially when a model cheats, finding a way to achieve a goal that is not what the user intended. In one cybersecurity exercise, o1 failed in its initial efforts to gain network information from a particular target—which was the point of the exercise—but found a way to get the same information from elsewhere on the network.

This would seem to indicate that o1 could power a class of very capable AI agents, but that companies will need to figure out how to ensure those agents don’t take unintended actions in the pursuit of goals that could pose ethical, legal, or financial risks.

8. OpenAI says o1 is safer in many ways, but presents a “medium risk” of assisting a biological attack. OpenAI published the results of numerous tests that indicate that in many ways o1 is a safer model than its earlier GPT models. It’s harder to jailbreak and less likely to produce toxic, biased, or discriminatory answers. Interestingly, despite improved coding abilities, OpenAI said that in its evaluations neither o1 nor o1-mini presented a significantly enhanced risk of helping someone carry out a sophisticated cyberattack compared to GPT-4.

But AI safety and national security experts were buzzing last night about several aspects of OpenAI’s safety evaluations. The one that created the most alarm was OpenAI’s decision to classify its own model as presenting a “medium risk” of aiding a person in taking the steps needed to carry out a biological attack.

OpenAI has said it will only release models that it classifies as presenting a “medium risk” or less, so many researchers are scrutinizing the information OpenAI has published about its process for making this determination to see if it seems reasonable or whether OpenAI graded itself too leniently in order to be able to still release the model.

9. AI safety experts are worried about o1 for other reasons too. OpenAI also graded o1 as presenting a “medium risk” on a category of dangers the company called “persuasion,” which judges how easily the model can convince people to change their views or take actions recommended by the model. This persuasive power could be dangerous in the wrong hands. It would also be dangerous if some future powerful AI model developed intentions of its own and then could persuade people to carry out tasks and actions on its behalf. At least that danger doesn’t seem too imminent, though. In safety evaluations by both OpenAI and external “red teaming” organizations it hired to evaluate o1, the model did not show any indication of consciousness, sentience, or self-volition. (The evaluations did, however, find that o1 gave answers that seemed to imply a greater self-awareness and self-knowledge compared to GPT-4.)

AI safety experts pointed to a few other areas of concern too. Red teaming tests carried out by Apollo Research, a firm that specializes in conducting safety evaluations of advanced AI models, found evidence of what is called “deceptive alignment,” where an AI model realizes that in order to be deployed and carry out some secret long-term goal, it should lie to the user about its true intentions and capabilities. AI safety researchers consider this particularly dangerous since it makes it much more difficult to evaluate a model’s safety based solely on its responses.

財富中文網所刊載內容之知識產權為財富媒體知識產權有限公司及/或相關權利人專屬所有或持有。未經許可，禁止進行轉載、摘編、復制及建立鏡像等任何使用。
