Post

A Handy Python Toolkit for LLMs

Not long after the release of OpenAI’s ChatGPT, data security vendor Cyberhaven observed that employees across industries were sharing sensitive and privacy-protected information while interacting with the large language model. Within the defense industry, usage of such external services poses a significant security risk. However, since the release of open large language models from organizations like Meta and MistralAI, it is now possible to efficiently run ChatGPT-like models on your own machines using non-public data, without sharing information externally.

The onprem Python package makes it easier to run large langugage models (LLMs) on-premises and apply them to non-public data on your own machines. This allows you to run ChatGPT-like large language models behind corporate firewalls and within air-gapped networks with no risk of data leakage.

After the package is installed, you can load an LLM as follows:

1
2
from onprem import LLM
llm = LLM(n_gpu_layers=-1)

As of this writing, the default LLM as a 7B-parameter Mistral model, but you can load any available LLM of your choosing. This is especially useful if the default LLM struggles with a particular task.

Let’s run through some quick examples of using LLMs to solve various tasks. Some examples will use the default LLM. For others, we will employ the use of different LLMs.

Question-Answering

The tookit comes pre-packages with a vector database (Chroma as of this writing). This allows you ask questions about a set of documents and generate answers with sources. In this example, we’ll download the 2024 National Defense Authorization Act (NDAA) and ask it a question. We will first ingest the document(s) into the vector database and then pose questions.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# STEP 1: Downlaod the 2024 NDAA
!wget https://www.congress.gov/118/crpt/hrpt125/CRPT-118hrpt125.pdf-O /tmp/ndaa2024/report.pdf

# STEP 2: Insantiate the LLM
from onprem import LLM
llm = LLM(n_gpu_layers=-1, embedding_model_kwargs={'device':'cuda'}, verbose=False)

# STEP 3: Store documents in vector DB
llm.ingest('/tmp/ndaa2024')

# STEP 4: 
output = llm.ask('What is said about hypersonics?')

# OUTPUT
# The context discussees the concern of the US military regarding advancements in hypersonic capabilities
# made by peer and near-peer adversaries. To effectively deter and defeat these national security threats,
# the Department of Defense must make investments to enhance its ability to develop, test, and field advanced
# hypersonic capabilities. This inclueds researh and development funding for reusable hypersonic platforms with
# aircraft-like operations and qualities. The committee also recommends increased funding for advaned hypersonics facilities for research and graduate-level education.

In the second step, the embedding_model_kwargs parameter activates the GPU for speeding up STEP 3, while the n_gpu_layers parameter speeds up STEP 4.

Document Summarization

By insantiating a Summarizer object, you can summarize a document by pointing directly to the file. In this example, we will use the Zephyr-7B-beta model, which performs really well on summarization tasks. When using a model that is different than the default, be sure to change the prompt template as specified on the model’s home page.

1
2
3
4
5
6
from onprem import LLM
from onprem.pipelines import Summarizer
llm = LLM(model_url='https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf',
          prompt_template= "<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>",
          n_gpu_layers=-1, verbose=False, mute_stream=True, temperature=0) # set based on your system
summarizer = Summarizer(llm)

Let’s now summarize a blog post:

1
2
3
4
5
6
7
8
9
10
11
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()
with open('/tmp/blog.txt', 'w') as f:
    f.write(docs[0].page_content)
summary = summarizer.summarize('/tmp/blog.txt')

# OUTPUT
#  This document discusses techniques and approaches for autonomous agents to plan, reason, act, and reflect.
# The document covers two main approaches: ReAct and Reflexion.  ReAct is a technique used in ...

Data Cleaning

LLMs can be used to clean data, as well. In the example below, we correct grammar.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
from onprem import LLM
llm = LLM()

prompt = """Correct the grammar and spelling in the supplied sentences.  Here are some examples.
[Sentence]:
I love goin to the beach.
[Correction]: I love going to the beach.
[Sentence]:
Let me hav it!
[Correction]: Let me have it!
[Sentence]:
It have too many drawbacks.
[Correction]: It has too many drawbacks.
[Sentence]:
I do not wan to go
[Correction]:"""
saved_output = llm.prompt(prompt, stop=['\n\n'])

# OUTPUT
#  I do not want to go.

We find prompts like the above to be particularly useful in fixing OCR errors prior to running text through an LLM.

Text Generation

Of course, LLMs are great at generating text.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
from onprem import LLM
llm = LLM()
prompt = """Generate a tweet based on the supplied Keyword. Here are some examples.
[Keyword]:
markets
[Tweet]:
Take feedback from nature and markets, not from people
###
[Keyword]:
children
[Tweet]:
Maybe we die so we can come back as children.
###
[Keyword]:
startups
[Tweet]:
Startups should not worry about how to put out fires, they should worry about how to start them.
###
[Keyword]:
climate change
[Tweet]:"""

saved_output = llm.prompt(prompt)

# OUTPUT
#  Climate change is not a problem for our grandchildren to solve, it's a problem for us to solve for our grandchildren. #actonclimate #climateaction

Information Extraction

In the example below, we use an LLM to extract the name, position, and employer from bios.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
from onprem import LLM
llm = LLM(n_gpu_layers=-1)

prompt = """ Extract the Name, Current Position, and Current Company from each piece of Text.

Text: Alan F. Estevez serves as the Under Secretary of Commerce for Industry and Security.  As Under Secretary, Mr. Estevez leads
the Bureau of Industry and Security, which advances U.S. national security, foreign policy, and economic objectives by ensuring an
effective export control and treaty compliance system and promoting U.S. strategic technology leadership.
A: Name:  Alan F. Estevez | Current Position: Under Secretary | Current Company: Bureau of Industry and Security

Text: Pichai Sundararajan (born June 10, 1972[3][4][5]), better known as Sundar Pichai (/ˈsʊndɑːr pɪˈtʃaɪ/), is an Indian-born American
business executive.[6][7] He is the chief executive officer (CEO) of Alphabet Inc. and its subsidiary Google.[8]
A: Name:   Sundar Pichai | Current Position: CEO | Current Company: Google

Text: Norton Allan Schwartz (born December 14, 1951)[1] is a retired United States Air Force general[2] who served as the 19th Chief of Staff of the
Air Force from August 12, 2008, until his retirement in 2012.[3] He previously served as commander, United States Transportation Command from
September 2005 to August 2008. He is currently the president of the Institute for Defense Analyses, serving since January 2, 2020.[4]
A:"""
saved_output = llm.prompt(prompt, stop=['\n\n'])

# OUTPUT
#  Name: Norton Allan Schwartz | Current Position: President | Current Company: Institute for Defense Analyses

In the example above, we supplied the input text directly as part of the prompt. You can also perform information extraction by pointing to a file directly:

1
2
3
4
5
6
7
8
9
10
11
12
extractor = Extractor(llm)
prompt = """Extract the names of research institutions (e.g., universities, research labs, corporations, etc.)
from the following sentence delimitated by three backticks. If there are no organizations, return NA.
If there are multiple organizations, separate them with commas.
```{text}```
"""
!wget --user-agent="Mozilla" https://arxiv.org/pdf/2104.12871.pdf -O /tmp/mitchell.pdf -q
df = extractor.apply(prompt, fpath='/tmp/mitchell.pdf', pdf_pages=[1], stop=['\n'])
df.loc[df['Extractions'] != 'NA'].Extractions[0]

# OUTPUT
# Santa Fe Insitute

For more information, checkout the full documentation, which has many more code examples.

This post is licensed under CC BY 4.0 by the author.