Adventures with Langchain: Building Siri

There’s this fun library that I’ve seen on Twitter called LangChain. I wanted to take it for a spin and see if I could build some little fun thing with it – let’s see how it goes!

Let’s start out with some boilerplate HuggingFace code, this is often useful when debugging:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")

inputs = tokenizer("how tall is barack obama", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
['1.8 m']

When chaining together LLMs, one obvious thing that comes to mind is modality-transition – or concretely, transitioning from audio -> text, and maybe even audio -> text -> image -> video (speak a movie into existence!)

I know Whisper is popular for Audio -> Text, so let’s try to get that running:

from pathlib import Path 
import whisper

model = whisper.load_model("base")
result = model.transcribe(str(Path.home() / 'Downloads/testing.m4a'))
print(result["text"])
 My name is Daniel. I live in California. I like to travel.

Cool – using the whisper library is quite easy! I tried to use the HuggingFace Whisper model, but ran into the following error:

MemoryError: Cannot allocate write+execute memory for ffi.callback(). 
You might be running on a system that prevents this. For more information, see 
https://cffi.readthedocs.io/en/latest/using.html#callbacks

Possibly due to running locally on MacOS? I googled around for a bit and decided to just go with the Whisper library, as it worked out of the box, no need to fight with FFI issues.

Now, let’s figure out how to record audio from a Jupyter Notebook. This is a bit clunky but not hard:

from ipywebrtc import AudioRecorder, CameraStream
import torchaudio
from IPython.display import Audio

camera = CameraStream(constraints={'audio': True, 'video': False})
recorder = AudioRecorder(stream=camera)
recorder

Now, let’s bring in LangChain. They already have strong support for OpenAI and HuggingFace which is great, but I need a way to use the locally running Whisper model. Thankfully, I can add a new primitive and as long as I define the input_keys, output_keys, and a _call method, LangChain will happily accept it as part of a SequentialChain!

from more_itertools import one
from typing import List, Dict

import whisper
from langchain.chains.base import Chain

class AudioToTextChain(Chain):
    MODEL = whisper.load_model("base")

    def _call(self, inputs: Dict[str, str]) -> Dict[str, str]:
        result = self.MODEL.transcribe(inputs[one(self.input_keys)], fp16=False)
        return {one(self.output_keys): result['text']}

    @property
    def input_keys(self) -> List[str]:
        return ['audio_fname']

    @property
    def output_keys(self) -> List[str]:
        return ['transcription']

transcription_chain = AudioToTextChain()

Okay, we have the Audio -> Text setup, now let’s use a small FLAN model to answer questions as a virtual assistant to create a dumber version of Siri!

This runs locally on my MacBook in a couple seconds, which isn’t too bad – but it doesn’t have any internet access. Maybe I’ll add that as a second step.

from langchain.chains import SequentialChain
from langchain import (
    PromptTemplate, 
    HuggingFacePipeline, 
    LLMChain,
)

template = """You are a virtual assistant. Given a request, answer it to the best of your abilities.

Request:
{transcription}
Answer:"""

siri_chain = LLMChain(
    llm=HuggingFacePipeline.from_model_id('google/flan-t5-large', task='text2text-generation'), 
    prompt=PromptTemplate(input_variables=["transcription"], template=template), 
    output_key="answer",
)

overall_chain = SequentialChain(
    chains=[transcription_chain, siri_chain],
    input_variables=["audio_fname"],
    # Here we return multiple variables:
    output_variables=["transcription", "answer"],
    verbose=True,
)

Putting it all together

Okay, now let’s run it! First, we record some audio, and then save it as recording.webm, which we feed in as the input to the SequentialChain, which is then transcribed by Whisper, and then processed by FLAN!

from ipywebrtc import AudioRecorder, CameraStream

camera = CameraStream(constraints={'audio': True, 'video': False})
recorder = AudioRecorder(stream=camera)
recorder
with open('recording.webm', 'wb') as f:
    f.write(recorder.audio.value)

review = overall_chain({"audio_fname": "recording.webm"})


> Entering new SequentialChain chain...
Chain 0:
{'transcription': " What's the best place to go surfing in Australia?"}

Chain 1:
{'answer': 'The Gold Coast'}


> Finished chain.

There we go! Using the LangChain library and a few off the shelf models, we can create a little virtual assistant that runs locally and can answer all sorts of trivia questions!

TODO:

  • Host it somewhere so I can ask it questions all day when I’m out
  • Train it on all my documents and data so it knows me
  • Connect it to all my applications / calendar / etc so it can take actions on my behalf when I ask it to