What’s going on in a HuggingFace Tokenizer?

You’ve probably seen this sort of sample code on the HF website, for example on the FLAN LLM page:

prompt = "A step by step recipe to make bolognese pasta:"
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small", use_fast=False)

inputs = tokenizer("A step by step recipe to make bolognese pasta:", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
['Pour a cup of bolognese into a large bowl and add the pasta']

The LLM output is impressive, but I was curious what was happening inside the tokenizer(...) line. What does it actually do, and how is the library designed to support all sorts of different encoding styles and parameters?

I chose to investigate the use_fast=False option, which selects the slow, pure-Python tokenizer rather than the Rust-backed “fast” implementation.

As we’ll see, it mostly delegates to the SentencePiece library, but there are a few layers of Python in between that reveal some nice helper functions and show a bit more about how HuggingFace handles tokenization.

When we call tokenizer(...), Python is really calling tokenizer.__call__(...) under the hood – standard syntactic sugar for callable objects.
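Any object that defines __call__ gets this behavior – here’s a minimal standalone sketch (nothing HuggingFace-specific, just plain Python):

class Shouter:
    def __call__(self, text):
        return text.upper()

shout = Shouter()
shout("hi")           # 'HI'
shout.__call__("hi")  # exactly the same call, written out explicitly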

Using Jupyter’s nice ?? we can inspect the source code and start digging in:
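tokenizer.__call__??  # prints the signature, docstring, and source
                      # (it's defined on PreTrainedTokenizerBase in tokenization_utils_base.py)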

Here’s the call stack – most of these functions pretty trivially delegate to the next one in the list, with lots of error checking and batching and extra-argument passing.

tokenizer.__call__ # delegates to _call_one...
tokenizer._call_one # delegates to encode_plus...
tokenizer.encode_plus # delegates to _encode_plus...
tokenizer._encode_plus

_encode_plus splits into three separate calls of interest (I am adding some additional variable names for legibility here):

tokens = self.tokenize(text, **kwargs)
ids = self.convert_tokens_to_ids(tokens)
self.prepare_for_model(ids, ...)

Let’s look at them one by one:

tokenize

tokenizer.tokenize(prompt) # delegates to _tokenize...
tokenizer._tokenize(prompt) # delegates to sp_model.encode...
tokenizer.sp_model.encode(prompt, out_type=str)
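As a sanity check – for a plain prompt like ours, with no special “added tokens” mixed into the text, the top and bottom of this chain should agree:

tokenizer.tokenize(prompt) == tokenizer.sp_model.encode(prompt, out_type=str)
True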

Okay, we’re down to the sp_model, aka the SentencePiece processor that the tokenizer holds inside of it. This calls _EncodeAsPieces, which returns a list of strings – our first transformation! We’ve gone from a single prompt string to a list of tokens. You can see how the more common words (by, step, make) were kept whole, but bolognese was split into pieces. Maybe next time I’ll dig more into which words are part of the vocab, which aren’t, and why.

tokenizer.sp_model._EncodeAsPieces(
    text=prompt, 
    enable_sampling=tokenizer.sp_model._enable_sampling, 
    nbest_size=tokenizer.sp_model._nbest_size,
    alpha=tokenizer.sp_model._alpha, 
    add_bos=tokenizer.sp_model._add_bos, 
    add_eos=tokenizer.sp_model._add_eos, 
    reverse=tokenizer.sp_model._reverse, 
    emit_unk_piece=tokenizer.sp_model._emit_unk_piece,
)
['▁A',
 '▁step',
 '▁by',
 '▁step',
 '▁recipe',
 '▁to',
 '▁make',
 '▁',
 'b',
 'ologne',
 's',
 'e',
 '▁pasta',
 ':']

Ok, _EncodeAsPieces has taken us down to the compiled _sentencepiece.so layer – no longer pure Python – which I’ll leave as an adventure for another day.
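As a small teaser for that vocab question from above: the tokenizer will hand you the whole piece-to-id mapping via get_vocab, so you can check membership directly:

vocab = tokenizer.get_vocab()  # dict mapping each piece string to its id
'▁step' in vocab
True
'▁bolognese' in vocab
False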

That was tokenization – let’s move on to…

convert_tokens_to_ids

tokenizer.convert_tokens_to_ids
tokenizer._convert_token_to_id_with_added_voc # basically a list comprehension, operates on each str, not the full list
tokenizer._convert_token_to_id
tokenizer.sp_model.piece_to_id
tokenizer.sp_model.piece_to_id??
Signature: tokenizer.sp_model.piece_to_id(arg)
Docstring: <no docstring>
Source:   
  def _batched_func(self, arg):
    if type(arg) is list:
      return [_func(self, n) for n in arg]
    else:
      return _func(self, arg)
File:      ~/miniconda3/lib/python3.9/site-packages/sentencepiece/__init__.py
Type:      method
tokenizer.sp_model.piece_to_id('▁A')
71
tokenizer.convert_tokens_to_ids(tokenizer.tokenize(prompt))
[71, 1147, 57, 1147, 2696, 12, 143, 3, 115, 23443, 7, 15, 13732, 10]

Okay, so now we’ve converted each human-readable token to an id! Notice that step is 1147, appearing at both index 1 and index 3.

Once again it looks like we’re hitting the compiled layer (the source IPython shows for sp_model.piece_to_id is misleading – it’s the generic _batched_func batching wrapper, not the real implementation), but it’s interesting how many Python layers you have to push through to get there.
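Handily, the mapping runs both ways, at both the HuggingFace layer and the SentencePiece layer:

tokenizer.convert_ids_to_tokens([71, 1147])
['▁A', '▁step']
tokenizer.sp_model.id_to_piece(1147)
'▁step'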

Let’s move on to the third and final call:

prepare_for_model

tokenizer.prepare_for_model(
    ids=tokenizer.convert_tokens_to_ids(tokenizer.tokenize(prompt)),
)
{'input_ids': [71, 1147, 57, 1147, 2696, 12, 143, 3, 115, 23443, 7, 15, 13732, 10, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

This returns a dictionary with an attention mask – all 1s here, since nothing is padded (more on that below) – and the input_ids, which are almost the same as the previous output of convert_tokens_to_ids. The only difference is a trailing 1, which is the eos_token, aka end-of-sequence token:

tokenizer.convert_tokens_to_ids(tokenizer.eos_token)
1

Sure enough, when I looked at the source for prepare_for_model, it does a bunch of different things (padding, truncation, backwards compatibility, etc.) – but most importantly here, it adds an EOS token.

The specific function that does it is build_inputs_with_special_tokens:

tokenizer.build_inputs_with_special_tokens(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(prompt)))
[71, 1147, 57, 1147, 2696, 12, 143, 3, 115, 23443, 7, 15, 13732, 10, 1]
tokenizer.special_tokens_map['eos_token']
'</s>'
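Back to that attention mask: it stops being all 1s as soon as padding kicks in. A quick sketch – batching two prompts of different lengths, and relying on T5 padding on the right with pad token id 0:

batch = tokenizer(["A step", "A step by step recipe"], padding=True)
batch["attention_mask"]
[[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1]]

The first prompt is only 3 ids (two tokens plus EOS), so it gets padded out to match the second, and the 0s in its mask tell the model to ignore those pad positions.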

Summary

So – to wrap it all up – for this simple example from the HuggingFace website, you can get the same result by chaining these smaller helper methods as by calling the higher-level __call__:

tokenizer.prepare_for_model(
    ids=tokenizer.convert_tokens_to_ids(tokenizer.tokenize(prompt)),
)
{'input_ids': [71, 1147, 57, 1147, 2696, 12, 143, 3, 115, 23443, 7, 15, 13732, 10, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
tokenizer(prompt)
{'input_ids': [71, 1147, 57, 1147, 2696, 12, 143, 3, 115, 23443, 7, 15, 13732, 10, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
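And to close the loop, decode (which the sample at the top used via batch_decode) reverses the whole pipeline – it should hand back the original prompt:

tokenizer.decode([71, 1147, 57, 1147, 2696, 12, 143, 3, 115, 23443, 7, 15, 13732, 10, 1], skip_special_tokens=True)
'A step by step recipe to make bolognese pasta:'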

This was mostly an educational exercise to see what’s going on under the hood of the HuggingFace tokenizer – I definitely wouldn’t recommend doing this sort of thing in production – but it’s a great way to get a bit deeper understanding of the library!