
What is an AI token?

A presenter at Google IO shows information on a new AI project.
Google

Google recently announced that the context window of Gemini 1.5 Pro would increase from 1 million tokens to 2 million. That sounds impressive, but what in the world is a token anyway?

Even chatbots need help processing the text they receive so they can understand concepts and communicate with you in a human-like fashion. In generative AI, this is accomplished with a token system that breaks data down so AI models can digest it more easily.

What is an AI token?

An infographic highlighting Gemini's 1 million token long context window capability.
Google

An AI token is the smallest unit that a word or phrase is broken down into when it is processed by a large language model (LLM). Tokens can be whole words, subwords, or punctuation marks, and they allow models to efficiently analyze and interpret text and then generate content in the same unit-based fashion. This is similar to how a computer converts data into binary zeros and ones for easier processing. Tokens allow a model to identify patterns and relationships within words and phrases so it can predict the terms that follow and respond in the context of your prompt.

When you input a prompt, the phrase and words are too long for a chatbot to interpret as is – they must be broken down into smaller pieces before the LLM can even process the request. They are converted into tokens, then the request is submitted and analyzed, and a response is returned to you.

The process of turning text into tokens is called tokenization. There are many tokenization methods, which differ based on factors such as dictionary rules, word combinations, and language. For example, the space-based tokenization method splits text up based on the spaces between words. The phrase “It’s raining outside” would be split into the tokens ‘It’s’, ‘raining’, ‘outside’.
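As a rough illustration of the space-based method described above, here is a minimal Python sketch. The function name `space_tokenize` is made up for this example, and production LLM tokenizers generally use more sophisticated sub-word schemes than a plain split on whitespace.

```python
def space_tokenize(text: str) -> list[str]:
    """Split text into tokens wherever whitespace occurs."""
    # str.split() with no arguments splits on any run of whitespace
    # and drops empty strings, so double spaces are handled too.
    return text.split()


print(space_tokenize("It's raining outside"))
# ["It's", 'raining', 'outside']
```

Note that punctuation stays attached to the neighboring word with this approach, which is one reason other tokenization methods exist.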

How do AI tokens work?

The general token conversion breakdown followed in the generative AI space denotes that one token equals approximately four characters in English — or 3/4 of a word — and 100 tokens equals approximately 75 words. Other conversions suggest one to two sentences equals about 30 tokens, one paragraph equals about 100 tokens, and 1,500 words equals about 2,048 tokens.
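The rule-of-thumb conversions can be expressed as a few lines of Python. This is only a back-of-the-envelope sketch: the function names are invented for illustration, and real token counts depend on the specific tokenizer and on the text itself.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb for English text
WORDS_PER_TOKEN = 0.75   # one token is roughly 3/4 of a word


def tokens_from_chars(char_count: int) -> int:
    """Estimate tokens from a character count (rounded up)."""
    return math.ceil(char_count / CHARS_PER_TOKEN)


def tokens_from_words(word_count: int) -> int:
    """Estimate tokens from a word count."""
    return round(word_count / WORDS_PER_TOKEN)


def words_from_tokens(token_count: int) -> int:
    """Estimate how many English words fit in a number of tokens."""
    return round(token_count * WORDS_PER_TOKEN)


print(tokens_from_words(75))     # 100
print(words_from_tokens(100))    # 75
print(words_from_tokens(2048))   # 1536
```

The last line shows why 1,500 words and 2,048 tokens are quoted together: the figures are approximations and only line up roughly.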

Whether you’re a general user, a developer, or an enterprise, the AI program you’re using is employing tokens to perform its tasks. Once you begin paying for generative AI services, you’re paying for tokens to maintain the service at its optimum level.

Most generative AI brands also have basic rules around how tokens function on their AI models. Many companies have token limits, which cap the number of tokens that can be processed in one turn. If a request is larger than the token limit on an LLM, the tool won’t be able to complete it in a single turn. For example, if you input a 10,000-word article for translation into a GPT with a 4,096-token limit, it won’t be able to process the article fully, because the input alone would require roughly 13,000 tokens under the rule of thumb above.
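To make the token-limit example concrete, the sketch below estimates whether a request fits in a single turn and, if not, how many pieces it would need to be split into. It assumes the approximate ratio of 0.75 English words per token; the function names are hypothetical, and the sketch ignores tokens needed for instructions and for the model's reply.

```python
import math

WORDS_PER_TOKEN = 0.75  # approximate ratio for English text (assumption)


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate for a given number of English words."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def fits_in_limit(word_count: int, token_limit: int) -> bool:
    """True if the estimated token count is within the limit."""
    return estimate_tokens(word_count) <= token_limit


def chunks_needed(word_count: int, token_limit: int) -> int:
    """Minimum number of turns needed if the input is split evenly."""
    return math.ceil(estimate_tokens(word_count) / token_limit)


article_words = 10_000
limit = 4_096

print(estimate_tokens(article_words))       # 13334
print(fits_in_limit(article_words, limit))  # False
print(chunks_needed(article_words, limit))  # 4
```

In practice a tool would also reserve part of each turn for the response, so the real number of chunks would likely be higher.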

However, companies have been quickly advancing the capabilities of their LLMs, raising token limits with new versions. Google’s research-based BERT model had a maximum input length of 512 tokens. OpenAI’s GPT-3.5 LLM, which runs the free version of ChatGPT, has a max of 4,096 input tokens, while its GPT-4 LLM, which runs the paid version of ChatGPT, has a max of 32,768 input tokens. This equates to approximately 25,000 words or 50 pages of text.

Google’s Gemini 1.5 Pro, which provides audio functionality to the brand’s AI Studio, has a standard 128,000-token context window. The Claude 2.1 LLM has a limit of up to 200,000 context tokens. This equates to approximately 150,000 words or 500 pages of text.

What are the different types of AI tokens?

There are several types of tokens used in the generative AI space that allow LLMs to identify the smallest units available for analysis. Here are some of the main tokens that are of interest to an AI model.

  • Word Tokens are words that represent single units on their own, such as “bird,” “house,” or “television.”
  • Sub-word Tokens are pieces of words that can be split into smaller units, such as splitting Tuesday into “Tues” and “day.”
  • Punctuation Tokens take the place of punctuation marks, including commas (,), periods (.), and others.
  • Number Tokens take the place of numerical figures, including the number “10.”
  • Special Tokens can note several unique instructions within executing queries and training data.
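The token types listed above can be illustrated with a toy tokenizer. This is a simplified sketch built on a regular expression, not how any particular LLM tokenizes text. The `<|start|>` marker is a hypothetical special token chosen for the example, and sub-word splitting is left out because it depends on a learned vocabulary.

```python
import re

# Order matters: special markers first, then numbers, words, punctuation.
TOKEN_PATTERN = re.compile(r"<\|\w+\|>|\d+|[A-Za-z]+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text into special, number, word, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def classify(token: str) -> str:
    """Label a token with one of the types from the list above."""
    if token.startswith("<|") and token.endswith("|>"):
        return "special"
    if token.isdigit():
        return "number"
    if token.isalpha():
        return "word"
    return "punctuation"


for tok in tokenize("<|start|>10 birds, 1 house."):
    print(f"{tok!r}: {classify(tok)}")
# '<|start|>': special
# '10': number
# 'birds': word
# ',': punctuation
# '1': number
# 'house': word
# '.': punctuation
```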

What are the benefits of tokens?

There are several benefits to tokens in the generative AI space. Primarily, they act as a connector between human language and computer language when working with LLMs and other AI processes. Tokens help models process large amounts of data at once, which is especially beneficial in enterprise spaces that use LLMs. Companies can work with token limits to optimize the performance of AI models. As future LLM versions are introduced, tokens will allow models to have a larger memory through higher limits or context windows.

Other benefits of tokens lie in the training of LLMs. Because tokens are small units, they make it easier to optimize the speed at which data is processed. Because models learn to predict tokens, they can build a greater understanding of concepts and improve the sequences they generate over time. Tokens also assist in bringing multimodal inputs such as images, video, and audio into LLMs, alongside text and speech in chatbots.

Tokens are also credited with some data security and cost-efficiency benefits, since text is handled in an encoded form rather than as raw strings and longer text is condensed into a more compact representation.

Fionna Agomuoh
Fionna Agomuoh is a technology journalist with over a decade of experience writing about various consumer electronics topics…