Transformers Agent: AI Tool That Automates Everything

A new AI tool called Transformers Agent can automate a wide range of tasks. It can generate and edit images, video, and audio, answer questions about documents, convert speech to text, and more.

Hugging Face, a well-known name in the open-source AI world, released Transformers Agent, which provides a natural language API on top of transformers. The API is designed to be easy to use. With a single line of code, it gives access to a variety of tools for tasks such as question answering, image generation, video generation, text to speech, text classification, and summarization.



How does Transformers Agent work?

Let's understand the two terms: Transformers and Agent.

Transformers are models widely used for natural language processing (NLP) tasks. Consider a chatbot that helps users book flights. When a user types a query like "I want to book a flight from New York to San Francisco on June 15th," the chatbot's transformer model breaks the input text into a sequence of tokens, such as "book", "flight", "New York", "San Francisco", and "June 15th".

The transformer will then use self-attention to analyze each token in the sequence and determine its relevance to the overall meaning of the query. For instance, it might pay more attention to the "New York" and "San Francisco" tokens to identify the user's departure and destination cities.

Once the self-attention step is complete, the transformer will generate a response based on the input sequence. In this case, it might respond with flight options that match the user's query, such as "Here are some flights from New York to San Francisco on June 15th."
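To make the self-attention step a little more concrete, here is a minimal sketch in plain Python. The two-dimensional embeddings are made up for illustration, and the queries, keys, and values are simply the embeddings themselves. Real transformer models learn separate projections for each and use far larger vectors, so treat this as a toy version of the idea rather than what any particular model does.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token to every token in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # New representation: weighted average of all token vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

tokens = ["book", "flight", "New York", "San Francisco"]
# Made-up embeddings: the two cities point in a similar direction.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

outputs, weights = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how much one token attends to every other token. With these toy vectors, "New York" puts more weight on "San Francisco" than on "book", which is the kind of relevance signal described above.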

In layman's terms, the Agent in Transformers Agent is a computer program that uses transformers to perform tasks. Here the computer program is a large language model. In the flight booking example, the agent would fetch flight schedules and prices. Developers give the language model a description of the task they want, such as finding available flights between two cities on a specific date.

Tools

Tools are functions that produce the final output for a prompt. For example, the agent calls an image generation tool if the prompt asks it to draw a picture. Some of the tools that run in the backend are listed below.

Function Name	Description
image_generator	Generates images based on a text prompt.
image_captioner	Generates captions for images.
image_transformer	Transforms images such as resizing, cropping, and rotating.
classifier	Classifies text into predefined categories.
translator	Translates text from one language to another.
speaker	Reads text aloud.
summarizer	Summarizes a long piece of text into a shorter, more concise version.
transcriber	Converts speech to text.
text_qa	Answers questions about text.
text_downloader	Downloads text from the internet.
image_qa	Answers questions about images.
video_generator	Generates videos based on a text prompt.
document_qa	Answers questions about documents.
image_segmenter	Segments images into their parts.
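How do these descriptions help the agent pick a tool? Roughly speaking, the tool names and descriptions are placed in the prompt sent to the language model, which then writes code that calls them. The sketch below shows that idea with a few tools from the table. The function name build_prompt and the wording of the template are hypothetical and are not the prompt the library actually uses.

```python
# A few tools from the table above, keyed by function name.
TOOLS = {
    "image_generator": "Generates images based on a text prompt.",
    "image_captioner": "Generates captions for images.",
    "summarizer": "Summarizes a long piece of text into a shorter, more concise version.",
    "translator": "Translates text from one language to another.",
}

def build_prompt(task, tools=TOOLS):
    """Assemble a prompt listing the available tools, then the task (hypothetical template)."""
    lines = ["You can use the following tools:"]
    for name, description in tools.items():
        lines.append(f"- {name}: {description}")
    lines.append("")
    lines.append(f"Task: {task}")
    lines.append("Write Python code that solves the task using only these tools.")
    return "\n".join(lines)

prompt = build_prompt("Draw me a picture of a person sitting beside a river.")
print(prompt)
```

Because the model only sees names and descriptions, a clear description matters as much as the tool itself.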

Benefits of Transformers Agent

Some of the benefits of using the Transformers Agent API are as follows.

The Transformers Agent API is easy to use. It provides a high-level interface that hides the complexity of transformers.
It is efficient, so it can be used to perform natural language tasks at scale.
It can be easily extended to use new transformer models or parameters.
It has use cases in fields such as customer service, marketing, sales, and research.

How to run Transformers Agent

You can use my Google Colab Notebook to explore Transformers Agent. Click on the link below to access it.

Install the required libraries

To get started with the Transformers Agent API, install the required libraries: transformers, openai, accelerate, and diffusers.

!pip install transformers openai accelerate diffusers -q

Import transformers library

import transformers

Once the transformers library is installed and loaded, check its version and make sure it is 4.29 or later.

print(transformers.__version__)
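If you would rather have the notebook check the version for you, a small helper like the one below can do it. The name meets_minimum is made up for this example. It compares only the major and minor numbers, so a development version string such as "4.30.0.dev0" is handled too.

```python
import re

def meets_minimum(version, minimum=(4, 29)):
    """Return True if a version string like '4.29.0' is at least `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    return major_minor >= minimum

# In the notebook you would pass transformers.__version__ instead of a literal.
print(meets_minimum("4.29.0"))
print(meets_minimum("4.28.1"))
```

Comparing tuples of integers avoids the trap of comparing strings, where "4.9" would sort after "4.29".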

Create an Agent

First, you need to create an agent. An agent is essentially a large language model. It can be an OpenAI model, the StarCoder model, or the OpenAssistant model.

To use an OpenAI model, you will need an OpenAI API key. It is not free, but the cost is typically low and depends on the number of tokens (roughly, pieces of words) you use. The StarCoder and OpenAssistant models, on the other hand, can be loaded from the Hugging Face Hub. Using the Hub is free, but you will need a Hugging Face Hub API token.

OpenAI

import openai
import os
os.environ['OPENAI_API_KEY'] = "sk-xxxxxxxxxxxxx"

from transformers import OpenAiAgent
agent = OpenAiAgent(model="gpt-3.5-turbo")
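Hardcoding the key in a notebook makes it easy to share by accident. As an alternative, you can set the environment variable outside the notebook and only verify that it exists before creating the agent. The helper name require_env below is hypothetical, and this is just one possible way to do the check.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before creating the agent."
        )
    return value

# Hypothetical usage with the agent from this article:
# require_env("OPENAI_API_KEY")
# agent = OpenAiAgent(model="gpt-3.5-turbo")
```

Failing early with a readable message is usually nicer than an authentication error from deep inside an API call.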

StarCoder

from huggingface_hub import login
login("YOUR_TOKEN")

from transformers import HfAgent
agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder")

OpenAssistant

from huggingface_hub import login
login("YOUR_TOKEN")

from transformers import HfAgent
agent = HfAgent(url_endpoint="https://api-inference.huggingface.co/models/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5")
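If you switch between the three options often, you can wrap them in one function. The sketch below reuses the endpoints and class names shown above. The names make_agent and HF_ENDPOINTS are made up for this example, and it assumes a transformers version that still ships OpenAiAgent and HfAgent, with the API key or Hub login already set up as described earlier.

```python
# Inference endpoints copied from the StarCoder and OpenAssistant examples above.
HF_ENDPOINTS = {
    "starcoder": "https://api-inference.huggingface.co/models/bigcode/starcoder",
    "openassistant": "https://api-inference.huggingface.co/models/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5",
}

def make_agent(kind):
    """Create an agent by short name: 'openai', 'starcoder', or 'openassistant'."""
    # transformers is imported lazily so this file loads even without it installed.
    if kind == "openai":
        from transformers import OpenAiAgent
        return OpenAiAgent(model="gpt-3.5-turbo")
    if kind in HF_ENDPOINTS:
        from transformers import HfAgent
        return HfAgent(HF_ENDPOINTS[kind])
    raise ValueError(f"Unknown agent kind: {kind!r}")

# Example (requires transformers and valid credentials):
# agent = make_agent("starcoder")
```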

Run Agent

agent.run is a single-execution method that selects the tool for the task automatically. For example, it selects the image generator tool to create an image.

agent.run("Draw me a picture of person sitting outside river.")

If you want to see the tool and code used to produce the final result, pass the argument return_code = True. With this argument, the agent returns the generated code instead of executing it.

agent.run("Draw me a picture of person sitting outside river", return_code = True)

Output

==Explanation from the agent==
I will use the following tool: `image_generator` to generate an image according to the prompt.


==Code generated by the agent==
image = image_generator(prompt="person sitting outside river")
from transformers import load_tool
image_generator = load_tool("huggingface-tools/text-to-image")
image = image_generator(prompt="person sitting outside river")
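Since the returned code is ordinary Python text, you can inspect it before deciding to run it. As a small sketch, the snippet below uses the standard ast module to list which tools the code loads. The sample string is the output shown above, and the function name loaded_tools is made up for this example.

```python
import ast

# The code returned by the agent in the output above.
generated_code = '''from transformers import load_tool
image_generator = load_tool("huggingface-tools/text-to-image")
image = image_generator(prompt="person sitting outside river")'''

def loaded_tools(code):
    """Return the repository names passed to load_tool(...) in the given code."""
    names = []
    for node in ast.walk(ast.parse(code)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "load_tool"
            and node.args
            and isinstance(node.args[0], ast.Constant)
        ):
            names.append(node.args[0].value)
    return names

print(loaded_tools(generated_code))  # ['huggingface-tools/text-to-image']
```

Parsing the code only reads it. Nothing is downloaded or executed, so this is a cheap check to run first.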

Chat

The differences between .run and .chat are as follows:

.run does not remember prior conversation but performs better for running multiple tools in a row from a given instruction.
.chat keeps chat history, which means it remembers prior chats.

agent.chat("Draw me a picture of saint sitting outside river")
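The toy class below illustrates the difference in memory between the two methods. It is not the library's implementation. ToyAgent only builds the text that would be sent to the model, so you can see what each method "remembers". The reset method borrows its name from the library's prepare_for_new_chat, which serves the same purpose of starting a fresh conversation.

```python
class ToyAgent:
    """Illustrates stateless run() versus stateful chat(). Not the real agent."""

    def __init__(self):
        self.history = []

    def run(self, task):
        # Stateless: the prompt is built from this task alone.
        return self._build_prompt([task])

    def chat(self, message):
        # Stateful: every earlier message is included again.
        self.history.append(message)
        return self._build_prompt(self.history)

    def prepare_for_new_chat(self):
        # Forget the conversation so far.
        self.history.clear()

    @staticmethod
    def _build_prompt(messages):
        return "\n".join(f"Human: {m}" for m in messages)

toy = ToyAgent()
toy.chat("Draw me a picture of a river")
print(toy.chat("Add an island to it"))   # both messages appear
print(toy.run("Draw me a boat"))         # only this task appears
```

This is why a follow-up such as "Add an island to it" works in chat mode: the earlier request is still part of the prompt.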

How to update image

You can update a previously generated image by passing it to the agent as a keyword argument, such as picture= below, and referring to that name in the prompt.

picture = agent.run("Generate a picture of rivers and lakes.")
updated_picture = agent.run("Transform the image in `picture` to add an island to it.", picture=picture)
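Why does naming the variable in the prompt work? The keyword argument becomes a variable that the generated code can use. Here is a minimal sketch of that mechanism using Python's built-in exec as a stand-in. The real library evaluates generated code with its own restricted interpreter, and the function name run_generated is made up for this example.

```python
def run_generated(code, **variables):
    """Execute code with the keyword arguments available as variables.

    exec is used only to illustrate the idea. It is not a sandbox, so never
    use it on code you have not reviewed.
    """
    state = dict(variables)
    exec(code, {}, state)  # names are looked up in `state`, assignments land there
    return state

# Strings stand in for images to keep the example self-contained.
state = run_generated(
    'updated_picture = picture + " with an island"',
    picture="rivers and lakes",
)
print(state["updated_picture"])  # rivers and lakes with an island
```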

Text to Speech

In the example below, we are converting text to speech.

audio = agent.run("Read out loud the summary of [URL]")
play_audio(audio)

Let's take another example in which we ask the agent to run multiple operations: first generate an image, then caption it, and finally convert the caption to speech.

audio = agent.run("Can you generate an image of a boat? Please read out loud the contents of the image afterwards")
play_audio(audio)
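Note that play_audio is not defined in the snippets above, so it has to come from the notebook or from your own code. One way to get something playable using only the standard library is to save the samples as a WAV file. The sketch below assumes the agent returns a one-dimensional sequence of float samples between -1 and 1 at a 16 kHz sample rate. Both are assumptions, and the function name save_wav is made up for this example.

```python
import struct
import wave

def save_wav(samples, path, sample_rate=16000):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    # Tensors and arrays expose tolist(); plain lists pass through unchanged.
    if hasattr(samples, "tolist"):
        samples = samples.tolist()
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return path

# Hypothetical usage with the audio returned by the agent:
# save_wav(audio, "speech.wav")
```

In a notebook, you can then play the saved file with whatever audio widget your environment provides.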

Difference between Transformers and LangChain Agent

Both the Transformers Agent and the LangChain Agent allow for the creation of custom agents, and they both utilize Python files to represent each tool as a class. While they share similarities in terms of objectives, it's important to be aware of the few differences between them before using them.

Stability: The Transformers Agent is still in the experimental phase and has a more limited scope and flexibility than the LangChain Agent.
Tools: The Transformers Agent offers a variety of tools powered by Transformer models, enabling multimodal capabilities and specialized models for specific tasks. It can interact with over 100,000 Hugging Face models. The LangChain Agent uses external APIs for its tools, but it also supports Hugging Face Tools integration.
Code Execution: The Transformers Agent includes code execution as a step after selecting tools and focuses specifically on executing Python code. The LangChain Agent includes code execution as one of its tools, which provides more flexibility in defining the task goal beyond just executing Python code.
Framework: The Transformers Agent uses a prompt template to determine the appropriate tool based on its description, and provides explanations and few-shot learning examples. The LangChain Agent uses the ReAct framework to determine the tool and provides similar thought processes and reasoning.

Conclusion

If you are looking for an efficient way to handle various natural language tasks, the Transformers Agent API is worth a look. It is designed to handle a broad spectrum of natural language processing tasks, and it is both user-friendly and extensible. Note that the API is currently in an experimental phase and subject to change. However, it holds the promise of greater robustness and new features in the future.

About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.
