• Google’s PaLM 2 large language model uses nearly five times as much text data for training as its predecessor, CNBC has learned.
  • Google, which announced PaLM 2 last week, said the model is smaller than the previous PaLM, but uses more efficient “techniques.”
  • The lack of transparency around training data in artificial intelligence models is becoming an increasingly hot topic among researchers.

Sundar Pichai, CEO of Alphabet Inc., during the Google I/O Developers Conference on Wednesday, May 10, 2023, in Mountain View, California.

David Paul Morris | Bloomberg | Getty Images

Google’s new large language model, which the company announced last week, uses almost five times as much training data as its 2022 predecessor, allowing it to perform more advanced coding, math and creative writing tasks, CNBC has learned.

According to internal documents viewed by CNBC, PaLM 2, the company’s new general-purpose large language model (LLM) introduced at Google I/O, was trained on 3.6 trillion tokens. Tokens, which are strings of words, are an important building block for training LLMs, because they teach the model to predict the next word that will appear in a sequence.
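To make that objective concrete, here is a minimal Python sketch of how a token sequence becomes next-token prediction examples. It is purely illustrative, not Google’s pipeline: the whitespace “tokenizer” is a hypothetical stand-in for the subword schemes (such as SentencePiece or BPE) that production LLMs actually use.

```python
# Illustrative sketch only, not Google's training pipeline.
# A whitespace split stands in for a real subword tokenizer.

def tokenize(text: str) -> list[str]:
    """Hypothetical tokenizer: real LLMs split text into subword units."""
    return text.split()

def next_token_examples(tokens: list[str]) -> list[tuple[list[str], str]]:
    """Turn one token sequence into (context, target) pairs:
    the model learns to predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

if __name__ == "__main__":
    for context, target in next_token_examples(tokenize("the cat sat on the mat")):
        print(context, "->", target)
```

Scaling the number of such examples, which is what a jump from 780 billion to 3.6 trillion tokens amounts to, gives the model more sequences from which to learn these predictions.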

The previous version of PaLM, which stands for Google’s Pathways Language Model, was released in 2022 and was trained on 780 billion tokens.

While Google has been eager to demonstrate the power of its AI technology and how it can be incorporated into search, email, word processing and spreadsheets, the company has been reluctant to publish the size or other details of its training data. OpenAI, the Microsoft-backed creator of ChatGPT, has also kept the details of its latest LLM, GPT-4, under wraps.

Both companies attribute the lack of disclosure to the competitive nature of the business. Google and OpenAI are rushing to attract users who want to search for information using conversational chatbots rather than traditional search engines.

But as the AI arms race heats up, the research community is demanding more transparency.

Since announcing PaLM 2, Google has said the new model is smaller than prior LLMs, which is significant because it means the company’s technology is becoming more efficient while accomplishing more sophisticated tasks. PaLM 2, according to internal documents, is trained on 340 billion parameters, an indication of the model’s complexity. The initial PaLM was trained on 540 billion parameters.

Google did not immediately comment for this story.

In a blog post about PaLM 2, Google said the model uses a “new technique” called “compute-optimal scaling,” which makes the LLM “more efficient with overall better performance, including faster inference, fewer parameters to serve, and a lower serving cost.”
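Google did not elaborate, but in the research literature “compute-optimal scaling” generally refers to DeepMind’s Chinchilla finding (Hoffmann et al., 2022): for a fixed training compute budget, loss is minimized by growing parameter count and training tokens roughly in proportion, rather than growing the model alone. A sketch of that relationship, under the assumption that this is the technique Google is alluding to:

```latex
% Chinchilla-style compute-optimal scaling (assumed reading of the term;
% Google has not published PaLM 2's recipe). C: training FLOPs,
% N: parameters, D: training tokens.
C \approx 6ND, \qquad N_{\mathrm{opt}} \propto C^{1/2}, \qquad D_{\mathrm{opt}} \propto C^{1/2}
```

If that reading is right, the reported numbers move in the expected direction: a smaller model (340 billion versus 540 billion parameters) trained on far more data (3.6 trillion versus 780 billion tokens).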

In announcing PaLM 2, Google confirmed CNBC’s earlier reporting that the model is trained on 100 languages and can perform a broad range of tasks. It’s already being used to power 25 features and products, including the company’s experimental chatbot Bard. It comes in four sizes, from smallest to largest: Gecko, Otter, Bison and Unicorn.

Based on public disclosures, PaLM 2 is more powerful than any existing model. Facebook’s LLaMA, the LLM it announced in February, was trained on 1.4 trillion tokens. The last time OpenAI shared a training size was with GPT-3, which the company said at the time was trained on 300 billion tokens. OpenAI released GPT-4 in March and said it demonstrated “human-level performance” on many professional tests.

LaMDA, a conversational LLM that Google introduced two years ago and touted in February alongside Bard, was trained on 1.5 trillion tokens, according to the latest documents viewed by CNBC.

As new AI applications rapidly gain popularity, the debate over the underlying technology is heating up.

El Mahdi El Mhamdi, a senior Google Research scientist, resigned in February over the company’s lack of transparency. On Tuesday, OpenAI CEO Sam Altman testified at a Senate Judiciary subcommittee hearing on privacy and technology, and agreed with lawmakers that a new system for dealing with artificial intelligence is needed.

“For a very new technology we need a new framework,” Altman said. “Certainly companies like ours bear a lot of responsibility for the tools that we put out in the world.”

— CNBC’s Jordan Novet contributed to this report.

WATCH: OpenAI CEO Sam Altman calls for oversight of artificial intelligence
