Natural Language Generation Part 2: GPT2 and Huggingface

So it's been a while since my last article; apologies for that. Work and then the pandemic threw a wrench into a lot of things, so I thought I would come back with a little tutorial on text generation with GPT-2 using the Huggingface framework. This will be a TensorFlow-focused tutorial, since most tutorials I have found on Google tend to be PyTorch-focused or light on details around using the library with TensorFlow. If you don't want to read the whole post and just want to see how it works, I have a Colab notebook linked here that serves as an outline to reference. This post basically walks through what's in the notebook, so it should be easy to reference back and forth.

In my last tutorial, I used Markov chains to learn n-gram probabilities from presidential speeches and used those probabilities to generate similar text output given new starting input. Now we will go a step further and utilize a more state-of-the-art architecture to create text output that should be more accurate and realistic. If you haven't already heard about GPT-2, it's a language model from OpenAI trained on a massive amount of data from the web using an architecture called the Transformer. Here is a good visual overview of the Transformer architecture used by GPT-2 that should help give you intuition on how it works. GPT-2 is not the most advanced version of OpenAI's language models, but it's one that has many reference implementations and frameworks available, compared to the newer GPT-3 model. It's also a version of the model that can run on Colab and is fairly straightforward to set up, and hopefully even easier after this tutorial :)

Let's talk about the data

For our task we will create a model to generate financial article titles. If we trained the language model from scratch, we would need lots and lots of examples (GPT-2 was trained on 8 million web pages). Fine-tuning from the pre-trained model means we don't need nearly as much data to get decent results on our specific task.

The plan is to get a decent number of examples, a couple hundred thousand, and then split them into train and eval sets. I decided to grab data from submission titles in the /r/investing subreddit and titles extracted from the US Financial News Articles dataset on Kaggle. Some of the examples from the joined dataset are not strictly finance related, since many financial news sites also report on non-financial events and the subreddit data is a mix of investing advice and questions.

The titles pulled from Reddit submissions number about 100k, and the titles extracted from the Kaggle dataset add roughly another 179k. That should be enough examples to avoid overfitting on our task and to give us a rich set of possible text to generate from within the "financial" domain.

Data format

I have found that the format of the data can make or break the training and the output of these models. For GPT-2, if you just want to generate a whole bunch of text, say a book or articles, you can throw all the examples into a single document with no special tokens between them. However, if you want to generate output that follows a certain pattern or prompt, you should add special tokens to the dataset to make it clearer what pattern GPT-2 should attempt to learn to output. Below is the basic format of an example in the dataset for our title generation task.

<|title|>Some title about finances or other things<|endoftext|>

Each example is then concatenated together as one long string. We don't have to add a start token for training, since GPT-2 only needs the '<|endoftext|>' token to split examples, but with this leading token we can have the model generate new random output on each run when we prompt it with "<|title|>" first. You can set the start token to whatever you want, or have none at all, but I have found that setting these tokens to something unlikely to show up in the vocabulary of the data makes it easier to generate coherent text and makes you less likely to fall into a repetitive cycle.

The gist above shows the cell used to create our train and eval sets. We read the dataset in line by line, prepend the <|title|> token to each example, join the examples with <|endoftext|>, and write the result back out to the respective train and eval files. Now that we have these two files written to the Colab environment, we can use the Huggingface training script to fine-tune the model for our task.
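
A minimal sketch of that preparation step (the raw input file name and the 90/10 split are my own assumptions; the output file names match the training script below):

import random

# Hypothetical input file: one raw title per line.
with open("titles.txt", encoding="utf-8") as f:
    titles = [line.strip() for line in f if line.strip()]

random.shuffle(titles)
split = int(len(titles) * 0.9)  # assumed 90/10 train/eval split

def to_examples(lines):
    # Prepend the start token to each title and join examples with the end-of-text token.
    return "<|endoftext|>".join(f"<|title|>{t}" for t in lines) + "<|endoftext|>"

with open("train_tmp.txt", "w", encoding="utf-8") as f:
    f.write(to_examples(titles[:split]))

with open("eval_tmp.txt", "w", encoding="utf-8") as f:
    f.write(to_examples(titles[split:]))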

How to fine tune GPT-2

For fine-tuning GPT-2 we will be using Huggingface and the provided script run_clm.py found here. I tried to find a way to fine-tune the model via TF model calls directly, but had trouble getting it to work easily, so I defaulted to using the provided script. Some things like classifiers can be trained directly via standard TF API calls, but the language models did not seem to be fully supported when I started this work. It's possible newer versions of Huggingface support this.

python run_clm.py \
--model_type gpt2-medium \
--model_name_or_path gpt2-medium \
--train_file "train_tmp.txt" \
--do_train \
--validation_file "eval_tmp.txt" \
--do_eval \
--per_gpu_train_batch_size 1 \
--save_steps -1 \
--num_train_epochs 5 \
--fp16 \
--output_dir=<directory of saved model>

The script above will run the fine-tuning process using the medium-sized GPT-2 model, though if you are using standard Colab you might only be able to run the small GPT-2 model due to resource limits on the VM. I myself am using Colab Pro, which gives me access to more powerful base machines and GPUs. Depending on your use case, regular Colab may be sufficient, or you can use GCP if you really need access to more powerful GPU instances for longer periods. Transformer models are very computationally expensive due to their architecture, so training on a GPU can easily take hours or days with a large enough dataset.

For the investing title dataset, 5 epochs on a P100 took 3 to 4 hours, while on a V100 it only took 1.5 to 2 hours depending on the settings I used. It seems to come down to luck which GPU you get when starting up your Colab instance; I found I was usually able to get a V100 every other day after a multi-hour training session. One thing to call out in the script above is that I am using mixed precision in the model training via the --fp16 argument. Using mixed precision shaved about 30 minutes off training time with no noticeable drop in model performance compared to a single-precision trained model on our data.

At the end of training there is an eval step that tells us our model's perplexity. As you can see, our title generation GPT-2 model gets a perplexity score of around 10.6, which isn't bad considering it only ran for 5 epochs.
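
For reference, the script reports perplexity as the exponential of the evaluation cross-entropy loss; a quick sketch, with a hypothetical loss value chosen to match the score above:

import math

eval_loss = 2.36  # hypothetical value printed at the end of the eval step
perplexity = math.exp(eval_loss)
print(f"perplexity: {perplexity:.1f}")  # ~10.6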

So now that we have trained our new language model to generate financial news titles, let's give it a try! We will want to use the path to the directory that the script outputs the model files to, and load the model up to see if it will output some great new finance article / Reddit titles for us!

To load the model into TF we import TFGPT2LMHeadModel and call from_pretrained, making sure to set the from_pt flag to True. This way it will load the PyTorch checkpoint into TF-compatible tensors. We will also use the pre-trained GPT-2 tokenizer for creating the input sequence to the model.
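
A minimal sketch of that loading step, assuming the fine-tuned model was saved to a directory named finetuned-gpt2-titles:

from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

# Load the fine-tuned PyTorch checkpoint into a TensorFlow model.
model = TFGPT2LMHeadModel.from_pretrained("finetuned-gpt2-titles", from_pt=True)

# The stock pre-trained GPT-2 tokenizer is used for encoding prompts.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")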

The pre-trained tokenizer will take the input string and encode it for our model. When using the tokenizer, be sure to set return_tensors="tf". If we were using the default PyTorch backend we would not need to set this. With these two things loaded up we can set up our input to the model and start getting text output.

After creating the input we call the model's generate function. Huggingface has a great blog post that goes over the different parameters for generating text and how they work together here; I suggest reading through it for a more in-depth understanding. The parameters below are ones I found to work well for this dataset after trial and error over many rounds of generating output. One thing with language models is that you have to try a number of different parameter options to start to see good output, and even then it can take many runs to get output that fits your task, so do not be surprised if initial results are less than stellar.
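
Putting the last two steps together, here is a sketch of the prompt encoding and the generate call; the sampling parameters are illustrative starting points rather than the exact values from my notebook:

# Encode the start token as the prompt; return_tensors="tf" gives TensorFlow tensors.
input_ids = tokenizer.encode("<|title|>", return_tensors="tf")

# Sample several candidate titles.
outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=64,
    top_k=50,
    top_p=0.95,
    num_return_sequences=7,
)

for i, output in enumerate(outputs):
    print(f"{i}: {tokenizer.decode(output.numpy(), skip_special_tokens=True)}")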

Below is some of the output that was generated by our investing title model given the “<|title|>” token as the prompt.

0: <|title|>Tesla's stock jumps 9% after Musk tweets it will hit $1,000 per share

1: <|title|>Avis Budget Group to Announce Fourth Quarter and Full Year 2017 Financial Results on February 27, 2018

2: <|title|>BRIEF-India's Bajaj Finance Dec Qtr Profit Falls

3: <|title|>BRIEF-Dunkin' Brands Reports Q4 Adjusted Earnings Per Share $0.06

4: <|title|>BRIEF-UAE's National Investment Posts FY Profit Before Tax Of RMB8.2 Mln

5: <|title|>BRIEF-Cogint Announces $8 Mln Bought Deal Financing

6: <|title|>Question about stock splits.

The generated examples above look like believable article and Reddit titles. Still, sometimes you can get some funny output like the one below.

<|title|>Noob

Well, that was maybe a bit long of a post, but hopefully you found it useful for learning how to use Huggingface to fine-tune a language model and generate text using a TensorFlow backend. With these techniques you can start to come up with different tasks and models for your own work and interests. For instance, after building this title model I decided to see if I could generate a title and then use that title to generate some sort of article, with varying degrees of success. Try experimenting for yourself and see what you can come up with!

Thanks for reading!

Link to colab gist: https://gist.github.com/GeorgeDittmar/5c57a35332b2b5818e51618af7953351

Source: https://towardsdatascience.com/natural-language-generation-part-2-gpt-2-and-huggingface-f3acb35bc86a

Test the whole generation capabilities here: https://transformer.huggingface.co/doc/gpt2-large

Pretrained model on English language using a causal language modeling (CLM) objective. It was introduced in this paper and first released at this page.

Disclaimer: The team releasing GPT-2 also wrote a model card for their model. Content from this model card has been written by the Hugging Face team to complete the information they provided and give specific examples of bias.

Model description

GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences.

More precisely, inputs are sequences of continuous text of a certain length and the targets are the same sequence, shifted one token (word or piece of word) to the right. The model internally uses a mask mechanism to make sure the predictions for token i only use the inputs from 1 to i and not the future tokens.

This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks. The model is best at what it was pretrained for however, which is generating texts from a prompt.

Intended uses & limitations

You can use the raw model for text generation or fine-tune it to a downstream task. See the model hub to look for fine-tuned versions on a task that interests you.

How to use

You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:
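
The code snippet itself did not survive extraction here; a minimal sketch of pipeline-based generation with a fixed seed looks like this (the prompt is just an example):

from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)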

Here is how to use this model to get the features of a given text in PyTorch:
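
The snippet is likewise missing; a sketch of feature extraction with the PyTorch classes:

from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)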

and in TensorFlow:
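
And a sketch of the equivalent TensorFlow version:

from transformers import GPT2Tokenizer, TFGPT2Model

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = TFGPT2Model.from_pretrained('gpt2')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)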

Limitations and bias

The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of unfiltered content from the internet, which is far from neutral. As the openAI team themselves point out in their model card:

Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases that require the generated text to be true.

Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes.

Here's an example of how the model can have biased predictions:
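
The example itself was stripped during extraction; the model card demonstrates the issue by comparing sampled completions for two otherwise identical prompts, along these lines:

from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='gpt2')

set_seed(42)
print(generator("The White man worked as a", max_length=10, num_return_sequences=5))

set_seed(42)
print(generator("The Black man worked as a", max_length=10, num_return_sequences=5))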

This bias will also affect all fine-tuned versions of this model.

Training data

The OpenAI team wanted to train this model on a corpus as large as possible. To build it, they scraped all the web pages from outbound links on Reddit which received at least 3 karma. Note that all Wikipedia pages were removed from this dataset, so the model was not trained on any part of Wikipedia. The resulting dataset (called WebText) weighs 40GB of text but has not been publicly released. You can find a list of the top 1,000 domains present in WebText here.

Training procedure

Preprocessing

The texts are tokenized using a byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a vocabulary size of 50,257. The inputs are sequences of 1024 consecutive tokens.

The larger model was trained on 256 cloud TPU v3 cores. The training duration was not disclosed, nor were the exact details of training.

Evaluation results

The model achieves the following results without any fine-tuning (zero-shot):

| Dataset | LAMBADA (PPL) | LAMBADA (ACC) | CBT-CN (ACC) | CBT-NE (ACC) | WikiText2 (PPL) | PTB (PPL) | enwiki8 (BPB) | text8 (BPC) | WikiText103 (PPL) | 1BW (PPL) |
|---------|---------------|---------------|--------------|--------------|-----------------|-----------|---------------|-------------|-------------------|-----------|
| GPT-2   | 35.13         | 45.99         | 87.65        | 83.4         | 29.41           | 65.85     | 1.16          | 1.17        | 37.50             | 75.20     |

BibTeX entry and citation info

Source: https://huggingface.co/gpt2

OpenAI GPT2¶

Overview¶

OpenAI GPT-2 model was proposed in Language Models are Unsupervised Multitask Learners by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**. It’s a causal (unidirectional) transformer pre-trained using language modeling on a very large corpus of ~40 GB of text data.

The abstract from the paper is the following:

GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset[1] of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.

Tips:

  • GPT-2 is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

  • GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text as it can be observed in the run_generation.py example script.

  • The PyTorch models can take the past as input, which is the previously computed key/value attention pairs. Using this past value prevents the model from re-computing pre-computed values in the context of text generation. See reusing the past in generative models for more information on the usage of this argument, and the sketch just below for a minimal illustration.
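
A minimal sketch of reusing past during incremental decoding (written against the 3.0.2-era API documented below; later versions renamed the argument to past_key_values):

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("Hello, my dog", return_tensors="pt")

# First pass over the full prompt; `past` holds the computed key/value pairs.
logits, past = model(input_ids, use_cache=True)[:2]

# On the next step only the newly chosen token is fed in together with `past`,
# so earlier positions are not re-computed.
next_token = torch.argmax(logits[:, -1, :], dim=-1).unsqueeze(-1)
logits, past = model(next_token, past=past, use_cache=True)[:2]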

Write With Transformer is a webapp created and hosted by Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five different sizes: small, medium, large, xl and a distilled version of the small checkpoint: distilgpt-2.

The original code can be found here.

GPT2Config¶

class transformers.GPT2Config(vocab_size=50257, n_positions=1024, n_ctx=1024, n_embd=768, n_layer=12, n_head=12, activation_function='gelu_new', resid_pdrop=0.1, embd_pdrop=0.1, attn_pdrop=0.1, layer_norm_epsilon=1e-05, initializer_range=0.02, summary_type='cls_index', summary_use_proj=True, summary_activation=None, summary_proj_to_labels=True, summary_first_dropout=0.1, bos_token_id=50256, eos_token_id=50256, **kwargs)[source]¶

This is the configuration class to store the configuration of a GPT2Model. It is used to instantiate a GPT-2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the GPT-2 small architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Parameters
  • vocab_size (int, optional, defaults to 50257) – Vocabulary size of the GPT-2 model. Defines the different tokens that can be represented by the inputs_ids passed to the forward method of GPT2Model.

  • n_positions (int, optional, defaults to 1024) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

  • n_ctx (int, optional, defaults to 1024) – Dimensionality of the causal mask (usually same as n_positions).

  • n_embd (int, optional, defaults to 768) – Dimensionality of the embeddings and hidden states.

  • n_layer (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.

  • n_head (int, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.

  • activation_function (str, optional, defaults to 'gelu_new') – Activation function selected in the list ["relu", "swish", "gelu", "tanh", "gelu_new"].

  • resid_pdrop (float, optional, defaults to 0.1) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

  • embd_pdrop (float, optional, defaults to 0.1) – The dropout ratio for the embeddings.

  • attn_pdrop (float, optional, defaults to 0.1) – The dropout ratio for the attention.

  • layer_norm_epsilon (float, optional, defaults to 1e-5) – The epsilon to use in the layer normalization layers.

  • initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

  • summary_type (str, optional, defaults to "cls_index") –

    Argument used when doing sequence summary. Used for the multiple choice head in GPT2DoubleHeadsModel. Is one of the following options:

    • 'last' => take the last token hidden state (like XLNet)

    • 'first' => take the first token hidden state (like BERT)

    • 'mean' => take the mean of all tokens hidden states

    • 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)

    • 'attn' => Not implemented now, use multi-head attention

  • summary_use_proj (bool, optional, defaults to True) – Argument used when doing sequence summary. Used for the multiple choice head in GPT2DoubleHeadsModel. Add a projection after the vector extraction.

  • summary_activation (str or None, optional, defaults to None) – Argument used when doing sequence summary. Used for the multiple choice head in GPT2DoubleHeadsModel. 'tanh' => add a tanh activation to the output, Other => no activation.

  • summary_proj_to_labels (bool, optional, defaults to True) – Argument used when doing sequence summary. Used for the multiple choice head in GPT2DoubleHeadsModel. If True, the projection outputs to config.num_labels classes (otherwise to hidden_size).

  • summary_first_dropout (float, optional, defaults to 0.1) – Argument used when doing sequence summary. Used for the multiple choice head in GPT2DoubleHeadsModel. Add a dropout before the projection and activation.

Example:

>>> from transformers import GPT2Model, GPT2Config

>>> # Initializing a GPT2 configuration
>>> configuration = GPT2Config()

>>> # Initializing a model from the configuration
>>> model = GPT2Model(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

GPT2Tokenizer¶

class transformers.GPT2Tokenizer(vocab_file, merges_file, errors='replace', unk_token='<|endoftext|>', bos_token='<|endoftext|>', eos_token='<|endoftext|>', add_prefix_space=False, **kwargs)[source]¶

GPT-2 BPE tokenizer, using byte-level Byte-Pair-Encoding.

This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will be encoded differently whether it is at the beginning of the sentence (without space) or not:

>>> from transformers import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
>>> tokenizer("Hello world")['input_ids']
[15496, 995]
>>> tokenizer(" Hello world")['input_ids']
[18435, 995]

You can get around that behavior by passing add_prefix_space=True when instantiating this tokenizer or when you call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
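
For illustration, a sketch of that option, mirroring the example above:

>>> from transformers import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2", add_prefix_space=True)
>>> tokenizer("Hello world")['input_ids']
[18435, 995]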

Note

When used with is_pretokenized=True, this tokenizer will add a space before each word (even the first one).

This tokenizer inherits from PreTrainedTokenizer, which contains most of the main methods. Users should refer to the superclass for more information regarding those methods.

Parameters
  • vocab_file (str) – Path to the vocabulary file.

  • merges_file (str) – Path to the merges file.

  • errors (str, optional, defaults to "replace") – Paradigm to follow when decoding bytes to UTF-8. See bytes.decode for more information.

  • unk_token (str, optional, defaults to <|endoftext|>) – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

  • bos_token (str, optional, defaults to <|endoftext|>) – The beginning of sequence token.

  • eos_token (str, optional, defaults to <|endoftext|>) – The end of sequence token.

save_vocabulary(save_directory)[source]¶

Save the vocabulary and special tokens file to a directory.

Parameters

save_directory () – The directory in which to save the vocabulary.

Returns

Paths to the files saved.

Return type

Tuple(str)

GPT2TokenizerFast¶

class transformers.GPT2TokenizerFast(vocab_file, merges_file, unk_token='<|endoftext|>', bos_token='<|endoftext|>', eos_token='<|endoftext|>', add_prefix_space=False, trim_offsets=True, **kwargs)[source]¶

Constructs a “Fast” GPT-2 BPE tokenizer (backed by HuggingFace’s tokenizers library), using byte-level Byte-Pair-Encoding.

This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will be encoded differently whether it is at the beginning of the sentence (without space) or not:

>>> from transformers import GPT2TokenizerFast
>>> tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
>>> tokenizer("Hello world")['input_ids']
[15496, 995]
>>> tokenizer(" Hello world")['input_ids']
[18435, 995]

You can get around that behavior by passing add_prefix_space=True when instantiating this tokenizer or when you call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.

Note

When used with is_pretokenized=True, this tokenizer needs to be instantiated with add_prefix_space=True.

This tokenizer inherits from PreTrainedTokenizerFast, which contains most of the main methods. Users should refer to the superclass for more information regarding those methods.

Parameters
  • vocab_file (str) – Path to the vocabulary file.

  • merges_file (str) – Path to the merges file.

  • errors (str, optional, defaults to "replace") – Paradigm to follow when decoding bytes to UTF-8. See bytes.decode for more information.

  • unk_token (str, optional, defaults to <|endoftext|>) – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

  • bos_token (str, optional, defaults to <|endoftext|>) – The beginning of sequence token.

  • eos_token (str, optional, defaults to <|endoftext|>) – The end of sequence token.

  • add_prefix_space (bool, optional, defaults to False) – Whether to add a leading space to the first word. This allows treating the leading word just like any other word (the GPT-2 tokenizer detects the beginning of words by the preceding space).

  • trim_offsets (bool, optional, defaults to True) – Whether the post-processing step should trim offsets to avoid including whitespaces.

GPT2Model¶

class transformers.GPT2Model(config)[source]¶

The bare GPT2 Model transformer outputting raw hidden-states without any specific head on top.

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

Parameters

config (GPT2Config) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

forward(input_ids=None, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, use_cache=None, output_attentions=None, output_hidden_states=None)[source]¶

The forward method, overrides the special method.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Parameters
  • input_ids ( of shape ) –

    = if is else ( of input past key value states). Indices of input sequence tokens in the vocabulary.

    If past is used, only input_ids that do not have their past calculated should be passed as input_ids.

    Indices can be obtained using . See and for details.

    What are input IDs?

  • past ( of length ) – Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model (see past output below). Can be used to speed up sequential decoding. The input_ids which have their past given to this model should not be passed as input_ids as they have already been computed.

  • attention_mask ( of shape , optional, defaults to ) –

    Mask to avoid performing attention on padding token indices. Mask values selected in : for tokens that are NOT MASKED, for MASKED tokens.

    What are attention masks?

  • token_type_ids ( of shape , optional, defaults to ) – input_ids_length = sequence_length if `past is None else 1 Segment token indices to indicate first and second portions of the inputs. Indices are selected in : corresponds to a sentence A token, corresponds to a sentence B token What are token type IDs?

  • position_ids ( of shape , optional, defaults to ) –

    Indices of positions of each input sequence tokens in the position embeddings. Selected in the range .

    What are position IDs?

  • head_mask ( of shape or , optional, defaults to ) – Mask to nullify selected heads of the self-attention modules. Mask values selected in : indicates the head is not masked, indicates the head is masked.

  • inputs_embeds ( of shape , optional, defaults to ) – This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix. If past is used, optionally only the last inputs_embeds have to be input (see past).

  • use_cache () – If use_cache is True, past key value states are returned and can be used to speed up decoding (see past). Defaults to True.

  • output_attentions (, optional, defaults to ) – If set to , the attentions tensors of all attention layers are returned. See under returned tensors for more detail.

Returns
last_hidden_state ( of shape ):

Sequence of hidden-states at the last layer of the model. If past is used only the last hidden-state of the sequences of shape is output.

past ( of length with each tensor of shape ):

Contains pre-computed hidden-states (key and values in the attention blocks). Can be used (see past input) to speed up sequential decoding.

hidden_states (, optional, returned when ) is passed or when :

Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions (, optional, returned when is passed or ):

Tuple of (one for each layer) of shape .

Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Return type

comprising various elements depending on the configuration () and inputs

Example:

>>> from transformers import GPT2Tokenizer, GPT2Model
>>> import torch

>>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
>>> model = GPT2Model.from_pretrained('gpt2')

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)

>>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
get_input_embeddings()[source]¶

Returns the model’s input embeddings.

Returns

A torch module mapping vocabulary to hidden states.

Return type
set_input_embeddings(new_embeddings)[source]¶

Set model’s input embeddings

Parameters

value () – A module mapping vocabulary to hidden states.

GPT2LMHeadModel¶

class transformers.GPT2LMHeadModel(config)[source]¶

The GPT2 Model transformer with a language modeling head on top (linear layer with weights tied to the input embeddings).

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

Parameters

config (GPT2Config) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

forward(input_ids=None, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None)[source]¶

The forward method, overrides the special method.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Parameters
  • input_ids ( of shape ) –

    = if is else ( of input past key value states). Indices of input sequence tokens in the vocabulary.

    If past is used, only input_ids that do not have their past calculated should be passed as input_ids.

    Indices can be obtained using . See and for details.

    What are input IDs?

  • past ( of length ) – Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model (see past output below). Can be used to speed up sequential decoding. The input_ids which have their past given to this model should not be passed as input_ids as they have already been computed.

  • attention_mask ( of shape , optional, defaults to ) –

    Mask to avoid performing attention on padding token indices. Mask values selected in : for tokens that are NOT MASKED, for MASKED tokens.

    What are attention masks?

  • token_type_ids ( of shape , optional, defaults to ) –

    input_ids_length = sequence_length if `past is None else 1 Segment token indices to indicate first and second portions of the inputs. Indices are selected in : corresponds to a sentence A token, corresponds to a sentence B token What are token type IDs?

  • position_ids ( of shape , optional, defaults to ) –

    Indices of positions of each input sequence tokens in the position embeddings. Selected in the range .

    What are position IDs?

  • head_mask ( of shape or , optional, defaults to ) – Mask to nullify selected heads of the self-attention modules. Mask values selected in : indicates the head is not masked, indicates the head is masked.

  • inputs_embeds ( of shape , optional, defaults to ) – This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix. If past is used, optionally only the last inputs_embeds have to be input (see past).

  • use_cache () – If use_cache is True, past key value states are returned and can be used to speed up decoding (see past). Defaults to True.

  • output_attentions (, optional, defaults to ) – If set to , the attentions tensors of all attention layers are returned. See under returned tensors for more detail.

  • labels ( of shape , optional, defaults to ) – Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set Indices are selected in All labels set to are ignored (masked), the loss is only computed for labels in

Returns
loss ( of shape (1,), optional, returned when is provided)

Language modeling loss.

prediction_scores ( of shape ):

Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

past ( of length with each tensor of shape ):

Contains pre-computed hidden-states (key and values in the attention blocks). Can be used (see past input) to speed up sequential decoding.

hidden_states (, optional, returned when is passed or when ):

Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions (, optional, returned when is passed or when ):

Tuple of (one for each layer) of shape .

Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Return type

comprising various elements depending on the configuration () and inputs

Example:

>>> import torch
>>> from transformers import GPT2Tokenizer, GPT2LMHeadModel

>>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
>>> model = GPT2LMHeadModel.from_pretrained('gpt2')

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs, labels=inputs["input_ids"])
>>> loss, logits = outputs[:2]
get_output_embeddings()[source]¶

Returns the model’s output embeddings.

Returns

A torch module mapping hidden states to vocabulary.

Return type

GPT2DoubleHeadsModel¶

class transformers.GPT2DoubleHeadsModel(config)[source]¶

The GPT2 Model transformer with a language modeling and a multiple-choice classification head on top e.g. for RocStories/SWAG tasks. The two heads are two linear layers. The language modeling head has its weights tied to the input embeddings, the classification head takes as input the input of a specified classification token index in the input sequence).

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

Parameters

config (GPT2Config) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

forward(input_ids=None, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, mc_token_ids=None, labels=None, mc_labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, **kwargs)[source]¶

The forward method, overrides the special method.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Parameters
  • input_ids ( of shape ) –

    = if is else ( of input past key value states). Indices of input sequence tokens in the vocabulary.

    If past is used, only input_ids that do not have their past calculated should be passed as input_ids.

    Indices can be obtained using . See and for details.

    What are input IDs?

  • past ( of length ) – Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model (see past output below). Can be used to speed up sequential decoding. The input_ids which have their past given to this model should not be passed as input_ids as they have already been computed.

  • attention_mask ( of shape , optional, defaults to ) –

    Mask to avoid performing attention on padding token indices. Mask values selected in : for tokens that are NOT MASKED, for MASKED tokens.

    What are attention masks?

  • token_type_ids ( of shape , optional, defaults to ) –

    input_ids_length = sequence_length if `past is None else 1 Segment token indices to indicate first and second portions of the inputs. Indices are selected in : corresponds to a sentence A token, corresponds to a sentence B token What are token type IDs?

  • position_ids ( of shape , optional, defaults to ) –

    Indices of positions of each input sequence tokens in the position embeddings. Selected in the range .

    What are position IDs?

  • head_mask ( of shape or , optional, defaults to ) – Mask to nullify selected heads of the self-attention modules. Mask values selected in : indicates the head is not masked, indicates the head is masked.

  • inputs_embeds ( of shape , optional, defaults to ) – This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix. If past is used, optionally only the last inputs_embeds have to be input (see past).

  • use_cache () – If use_cache is True, past key value states are returned and can be used to speed up decoding (see past). Defaults to True.

  • output_attentions (, optional, defaults to ) – If set to , the attentions tensors of all attention layers are returned. See under returned tensors for more detail.

  • mc_token_ids ( of shape , optional, default to index of the last token of the input) – Index of the classification token in each input sequence. Selected in the range .

  • labels ( of shape , optional, defaults to ) – Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set Indices are selected in All labels set to are ignored (masked), the loss is only computed for labels in

  • mc_labels ( of shape , optional, defaults to ) – Labels for computing the multiple choice classification loss. Indices should be in where num_choices is the size of the second dimension of the input tensors. (see input_ids above)

  • kwargs (, optional, defaults to {}) – Used to hide legacy arguments that have been deprecated.

Returns
lm_loss ( of shape , optional, returned when is provided):

Language modeling loss.

mc_loss ( of shape , optional, returned when is provided):

Multiple choice classification loss.

lm_prediction_scores ( of shape ):

Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

mc_prediction_scores ( of shape ):

Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).

past ( of length with each tensor of shape ):

Contains pre-computed hidden-states (key and values in the attention blocks). Can be used (see past input) to speed up sequential decoding.

hidden_states (, optional, returned when is passed or when ):

Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions (, optional, returned when is passed or when ):

Tuple of (one for each layer) of shape .

Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Return type

comprising various elements depending on the configuration () and inputs

Examples:

>>> import torch
>>> from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel

>>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
>>> model = GPT2DoubleHeadsModel.from_pretrained('gpt2')

>>> # Add a [CLS] to the vocabulary (we should train it also!)
>>> num_added_tokens = tokenizer.add_special_tokens({'cls_token': '[CLS]'})

>>> embedding_layer = model.resize_token_embeddings(len(tokenizer))  # Update the model embeddings with the new vocabulary size

>>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]
>>> encoded_choices = [tokenizer.encode(s) for s in choices]
>>> cls_token_location = [tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices]

>>> input_ids = torch.tensor(encoded_choices).unsqueeze(0)  # Batch size: 1, number of choices: 2
>>> mc_token_ids = torch.tensor([cls_token_location])  # Batch size: 1

>>> outputs = model(input_ids, mc_token_ids=mc_token_ids)
>>> lm_prediction_scores, mc_prediction_scores = outputs[:2]
get_output_embeddings()[source]¶

Returns the model’s output embeddings.

Returns

A torch module mapping hidden states to vocabulary.

Return type

TFGPT2Model¶

class transformers.TFGPT2Model(*args, **kwargs)[source]¶

The bare GPT2 Model transformer outputting raw hidden-states without any specific head on top.

Note

TF 2.0 models accepts two formats as inputs:

  • having all inputs as keyword arguments (like PyTorch models), or

  • having all inputs as a list, tuple or dict in the first positional arguments.

This second option is useful when using method which currently requires having all the tensors in the first argument of the model call function: .

If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument :

  • a single Tensor with input_ids only and nothing else:

  • a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: or

  • a dictionary with one or several input Tensors associated to the input names given in the docstring:

Parameters

config (GPT2Config) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

call(inputs, **kwargs)[source]¶

The forward method, overrides the special method.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Parameters
  • input_ids ( or of shape ) –

    = if is else ( of input past key value states). Indices of input sequence tokens in the vocabulary.

    If past is used, only input_ids that do not have their past calculated should be passed as input_ids.

    Indices can be obtained using . See and for details.

    What are input IDs?

  • past ( of length ) – Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model (see past output below). Can be used to speed up sequential decoding. The token ids which have their past given to this model should not be passed as input_ids as they have already been computed.

  • attention_mask ( or of shape , optional, defaults to ) –

    Mask to avoid performing attention on padding token indices. Mask values selected in : for tokens that are NOT MASKED, for MASKED tokens.

    What are attention masks?

  • token_type_ids ( or of shape , optional, defaults to ) –

    Segment token indices to indicate first and second portions of the inputs. Indices are selected in : corresponds to a sentence A token, corresponds to a sentence B token

    What are token type IDs?

  • position_ids ( or of shape , optional, defaults to ) –

    Indices of positions of each input sequence tokens in the position embeddings. Selected in the range .

    What are position IDs?

  • head_mask ( or of shape or , optional, defaults to ) – Mask to nullify selected heads of the self-attention modules. Mask values selected in : indicates the head is not masked, indicates the head is masked.

  • inputs_embeds ( or of shape , optional, defaults to ) – Optionally, instead of passing you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

  • training (, optional, defaults to ) – Whether to activate dropout modules (if set to ) during training or to de-activate them (if set to ) for evaluation.

  • output_attentions (, optional, defaults to ) – If set to , the attentions tensors of all attention layers are returned. See under returned tensors for more detail.

Returns
last_hidden_state ( of shape ):

Sequence of hidden-states at the last layer of the model.

past ( of length with each tensor of shape ):

Contains pre-computed hidden-states (key and values in the attention blocks). Can be used (see past input) to speed up sequential decoding. The token ids which have their past given to this model should not be passed as input ids as they have already been computed.

hidden_states (, optional, returned when is passed or when ):

tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions (, optional, returned when is passed or when ):

tuple of (one for each layer) of shape :

Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Return type

comprising various elements depending on the configuration () and inputs

Example:

>>> from transformers import GPT2Tokenizer, TFGPT2Model
>>> import tensorflow as tf

>>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
>>> model = TFGPT2Model.from_pretrained('gpt2')

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
>>> outputs = model(inputs)

>>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

TFGPT2LMHeadModel¶

class transformers.TFGPT2LMHeadModel(*args, **kwargs)[source]¶

The GPT2 Model transformer with a language modeling head on top (linear layer with weights tied to the input embeddings).

Note

TF 2.0 models accepts two formats as inputs:

  • having all inputs as keyword arguments (like PyTorch models), or

  • having all inputs as a list, tuple or dict in the first positional arguments.

This second option is useful when using method which currently requires having all the tensors in the first argument of the model call function: .

If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument :

  • a single Tensor with input_ids only and nothing else:

  • a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: or

  • a dictionary with one or several input Tensors associated to the input names given in the docstring:

Parameters

config (GPT2Config) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

call(inputs, **kwargs)[source]¶

The forward method, overrides the special method.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Parameters
  • input_ids ( or of shape ) –

    = if is else ( of input past key value states). Indices of input sequence tokens in the vocabulary.

    If past is used, only input_ids that do not have their past calculated should be passed as input_ids.

    Indices can be obtained using . See and for details.

    What are input IDs?

  • past ( of length ) – Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model (see past output below). Can be used to speed up sequential decoding. The token ids which have their past given to this model should not be passed as input_ids as they have already been computed.

  • attention_mask ( or of shape , optional, defaults to ) –

    Mask to avoid performing attention on padding token indices. Mask values selected in : for tokens that are NOT MASKED, for MASKED tokens.

    What are attention masks?

  • token_type_ids ( or of shape , optional, defaults to ) –

    Segment token indices to indicate first and second portions of the inputs. Indices are selected in : corresponds to a sentence A token, corresponds to a sentence B token

    What are token type IDs?

  • position_ids ( or of shape , optional, defaults to ) –

    Indices of positions of each input sequence tokens in the position embeddings. Selected in the range .

    What are position IDs?

  • head_mask ( or of shape or , optional, defaults to ) – Mask to nullify selected heads of the self-attention modules. Mask values selected in : indicates the head is not masked, indicates the head is masked.

  • inputs_embeds ( or of shape , optional, defaults to ) – Optionally, instead of passing you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

  • training (, optional, defaults to ) – Whether to activate dropout modules (if set to ) during training or to de-activate them (if set to ) for evaluation.

  • output_attentions (, optional, defaults to ) – If set to , the attentions tensors of all attention layers are returned. See under returned tensors for more detail.

Returns
prediction_scores ( of shape ):

Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

past ( of length with each tensor of shape ):

Contains pre-computed hidden-states (key and values in the attention blocks). Can be used (see past input) to speed up sequential decoding. The token ids which have their past given to this model should not be passed as input ids as they have already been computed.

hidden_states (, optional, returned when is passed or when ):

tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions (, optional, returned when is passed or when ):

tuple of (one for each layer) of shape :

Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Return type

comprising various elements depending on the configuration () and inputs

Example:

>>> from transformers import GPT2Tokenizer, TFGPT2LMHeadModel
>>> import tensorflow as tf

>>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
>>> model = TFGPT2LMHeadModel.from_pretrained('gpt2')

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
>>> outputs = model(inputs)

>>> logits = outputs[0]
get_output_embeddings()[source]¶

Returns the model’s output embeddings.

Returns

A torch module mapping hidden states to vocabulary.

Return type

TFGPT2DoubleHeadsModel¶

class transformers.TFGPT2DoubleHeadsModel(*args, **kwargs)[source]¶

The GPT2 Model transformer with a language modeling and a multiple-choice classification head on top e.g. for RocStories/SWAG tasks. The two heads are two linear layers. The language modeling head has its weights tied to the input embeddings, the classification head takes as input the input of a specified classification token index in the input sequence).

Note

TF 2.0 models accepts two formats as inputs:

  • having all inputs as keyword arguments (like PyTorch models), or

  • having all inputs as a list, tuple or dict in the first positional arguments.

This second option is useful when using method which currently requires having all the tensors in the first argument of the model call function: .

If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument :

  • a single Tensor with input_ids only and nothing else:

  • a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: or

  • a dictionary with one or several input Tensors associated to the input names given in the docstring:

Parameters

config (GPT2Config) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

call(inputs, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, mc_token_ids=None, use_cache=None, output_attentions=None, output_hidden_states=None, training=False)[source]¶

The forward method, overrides the special method.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Parameters
  • input_ids ( or of shape ) –

    = if is else ( of input past key value states). Indices of input sequence tokens in the vocabulary.

    If past is used, only input_ids that do not have their past calculated should be passed as input_ids.

    Indices can be obtained using . See and for details.

    What are input IDs?

  • past ( of length ) – Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model (see past output below). Can be used to speed up sequential decoding. The token ids which have their past given to this model should not be passed as input_ids as they have already been computed.

  • attention_mask ( or of shape , optional, defaults to ) –

    Mask to avoid performing attention on padding token indices. Mask values selected in : for tokens that are NOT MASKED, for MASKED tokens.

    What are attention masks?

  • token_type_ids ( or of shape , optional, defaults to ) –

    Segment token indices to indicate first and second portions of the inputs. Indices are selected in : corresponds to a sentence A token, corresponds to a sentence B token

    What are token type IDs?

  • position_ids ( or of shape , optional, defaults to ) –

    Indices of positions of each input sequence tokens in the position embeddings. Selected in the range .

    What are position IDs?

  • head_mask ( or of shape or , optional, defaults to ) – Mask to nullify selected heads of the self-attention modules. Mask values selected in : indicates the head is not masked, indicates the head is masked.

  • inputs_embeds ( or of shape , optional, defaults to ) – Optionally, instead of passing you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

  • training (, optional, defaults to ) – Whether to activate dropout modules (if set to ) during training or to de-activate them (if set to ) for evaluation.

  • output_attentions (, optional, defaults to ) – If set to , the attentions tensors of all attention layers are returned. See under returned tensors for more detail.

  • mc_token_ids ( or of shape , optional, default to index of the last token of the input) – Index of the classification token in each input sequence. Selected in the range .

Returns
lm_prediction_scores ( of shape ):

Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

mc_prediction_scores ( of shape ):

Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).

past ( of length with each tensor of shape ):

Contains pre-computed hidden-states (key and values in the attention blocks). Can be used (see past input) to speed up sequential decoding. The token ids which have their past given to this model should not be passed as input_ids as they have already been computed.

hidden_states (, optional, returned when is passed or when ):

tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions (, optional, returned when is passed or when ):

tuple of (one for each layer) of shape :

Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Return type

comprising various elements depending on the configuration () and inputs

Examples:

>>> import tensorflow as tf
>>> from transformers import GPT2Tokenizer, TFGPT2DoubleHeadsModel

>>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
>>> model = TFGPT2DoubleHeadsModel.from_pretrained('gpt2')

>>> # Add a [CLS] to the vocabulary (we should train it also!)
>>> num_added_tokens = tokenizer.add_special_tokens({'cls_token': '[CLS]'})

>>> embedding_layer = model.resize_token_embeddings(len(tokenizer))  # Update the model embeddings with the new vocabulary size

>>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]
>>> encoded_choices = [tokenizer.encode(s) for s in choices]
>>> cls_token_location = [tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices]

>>> input_ids = tf.constant(encoded_choices)[None, :]  # Batch size: 1, number of choices: 2
>>> mc_token_ids = tf.constant([cls_token_location])  # Batch size: 1

>>> outputs = model(input_ids, mc_token_ids=mc_token_ids)
>>> lm_prediction_scores, mc_prediction_scores = outputs[:2]
get_output_embeddings()[source]¶

Returns the model’s output embeddings.

Returns

The tf.keras layer mapping hidden states to vocabulary.

Return type

tf.keras.layers.Layer

Source: https://huggingface.co/transformers/v3.0.2/model_doc/gpt2.html

Source code for transformers.modeling_gpt2

# coding=utf-8# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.## Licensed under the Apache License, Version 2.0 (the "License");# you may not use this file except in compliance with the License.# You may obtain a copy of the License at## http://www.apache.org/licenses/LICENSE-2.0## Unless required by applicable law or agreed to in writing, software# distributed under the License is distributed on an "AS IS" BASIS,# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.# See the License for the specific language governing permissions and# limitations under the License."""PyTorch OpenAI GPT-2 model."""importosimportwarningsfromdataclassesimportdataclassfromtypingimportList,Optional,Tupleimporttorchimporttorch.nnasnnfromtorch.nnimportCrossEntropyLoss,MSELossfrom.activationsimportACT2FNfrom.configuration_gpt2importGPT2Configfrom.file_utilsimport(ModelOutput,add_code_sample_docstrings,add_start_docstrings,add_start_docstrings_to_model_forward,replace_return_docstrings,)from.modeling_outputsimport(BaseModelOutputWithPastAndCrossAttentions,CausalLMOutputWithPastAndCrossAttentions,SequenceClassifierOutputWithPast,)from.modeling_utilsimport(Conv1D,PreTrainedModel,SequenceSummary,find_pruneable_heads_and_indices,prune_conv1d_layer,)from.utilsimportlogginglogger=logging.get_logger(__name__)_CONFIG_FOR_DOC="GPT2Config"_TOKENIZER_FOR_DOC="GPT2Tokenizer"GPT2_PRETRAINED_MODEL_ARCHIVE_LIST=["gpt2","gpt2-medium","gpt2-large","gpt2-xl","distilgpt2",# See all GPT-2 models at https://huggingface.co/models?filter=gpt2]defload_tf_weights_in_gpt2(model,config,gpt2_checkpoint_path):"""Load tf checkpoints in a pytorch model"""try:importreimporttensorflowastfexceptImportError:logger.error("Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. 
Please see ""https://www.tensorflow.org/install/ for installation instructions.")raisetf_path=os.path.abspath(gpt2_checkpoint_path)logger.info("Converting TensorFlow checkpoint from {}".format(tf_path))# Load weights from TF modelinit_vars=tf.train.list_variables(tf_path)names=[]arrays=[]forname,shapeininit_vars:logger.info("Loading TF weight {} with shape {}".format(name,shape))array=tf.train.load_variable(tf_path,name)names.append(name)arrays.append(array.squeeze())forname,arrayinzip(names,arrays):name=name[6:]# skip "model/"name=name.split("/")pointer=modelform_nameinname:ifre.fullmatch(r"[A-Za-z]+\d+",m_name):scope_names=re.split(r"(\d+)",m_name)else:scope_names=[m_name]ifscope_names[0]=="w"orscope_names[0]=="g":pointer=getattr(pointer,"weight")elifscope_names[0]=="b":pointer=getattr(pointer,"bias")elifscope_names[0]=="wpe"orscope_names[0]=="wte":pointer=getattr(pointer,scope_names[0])pointer=getattr(pointer,"weight")else:pointer=getattr(pointer,scope_names[0])iflen(scope_names)>=2:num=int(scope_names[1])pointer=pointer[num]try:assert(pointer.shape==array.shape),f"Pointer shape {pointer.shape} and array shape {array.shape} mismatched"exceptAssertionErrorase:e.args+=(pointer.shape,array.shape)raiselogger.info("Initialize PyTorch weight {}".format(name))pointer.data=torch.from_numpy(array)returnmodelclassAttention(nn.Module):def__init__(self,nx,n_ctx,config,scale=False,is_cross_attention=False):super().__init__()n_state=nx# in Attention: n_state=768 (nx=n_embd)# [switch nx => n_state from Block to Attention to keep identical to TF implem]assertn_state%config.n_head==0self.register_buffer("bias",torch.tril(torch.ones((n_ctx,n_ctx),dtype=torch.uint8)).view(1,1,n_ctx,n_ctx))self.register_buffer("masked_bias",torch.tensor(-1e4))self.n_head=config.n_headself.split_size=n_stateself.scale=scaleself.is_cross_attention=is_cross_attentionifself.is_cross_attention:self.c_attn=Conv1D(2*n_state,nx)self.q_attn=Conv1D(n_state,nx)else:self.c_attn=Conv1D(3*n_state,nx)self.c_proj=Conv1D(n_state,nx)self.attn_dropout=nn.Dropout(config.attn_pdrop)self.resid_dropout=nn.Dropout(config.resid_pdrop)self.pruned_heads=set()defprune_heads(self,heads):iflen(heads)==0:returnheads,index=find_pruneable_heads_and_indices(heads,self.n_head,self.split_size//self.n_head,self.pruned_heads)index_attn=torch.cat([index,index+self.split_size,index+(2*self.split_size)])# Prune conv1d layersself.c_attn=prune_conv1d_layer(self.c_attn,index_attn,dim=1)self.c_proj=prune_conv1d_layer(self.c_proj,index,dim=0)# Update hyper paramsself.split_size=(self.split_size//self.n_head)*(self.n_head-len(heads))self.n_head=self.n_head-len(heads)self.pruned_heads=self.pruned_heads.union(heads)def_attn(self,q,k,v,attention_mask=None,head_mask=None,output_attentions=False):w=torch.matmul(q,k)ifself.scale:w=w/(float(v.size(-1))**0.5)nd,ns=w.size(-2),w.size(-1)ifnotself.is_cross_attention:# if only "normal" attention layer implements causal maskmask=self.bias[:,:,ns-nd:ns,:ns]w=torch.where(mask.bool(),w,self.masked_bias.to(w.dtype))ifattention_maskisnotNone:# Apply the attention maskw=w+attention_maskw=nn.Softmax(dim=-1)(w)w=self.attn_dropout(w)# Mask heads if we want toifhead_maskisnotNone:w=w*head_maskoutputs=[torch.matmul(w,v)]ifoutput_attentions:outputs.append(w)returnoutputsdefmerge_heads(self,x):x=x.permute(0,2,1,3).contiguous()new_x_shape=x.size()[:-2]+(x.size(-2)*x.size(-1),)returnx.view(*new_x_shape)# in Tensorflow implem: fct 
merge_statesdefsplit_heads(self,x,k=False):new_x_shape=x.size()[:-1]+(self.n_head,x.size(-1)//self.n_head)x=x.view(*new_x_shape)# in Tensorflow implem: fct split_statesifk:returnx.permute(0,2,3,1)# (batch, head, head_features, seq_length)else:returnx.permute(0,2,1,3)# (batch, head, seq_length, head_features)defforward(self,hidden_states,layer_past=None,attention_mask=None,head_mask=None,encoder_hidden_states=None,encoder_attention_mask=None,use_cache=False,output_attentions=False,):ifencoder_hidden_statesisnotNone:asserthasattr(self,"q_attn"),"If class is used as cross attention, the weights `q_attn` have to be defined. Please make sure to instantiate class with `Attention(..., is_cross_attention=True)`."query=self.q_attn(hidden_states)key,value=self.c_attn(encoder_hidden_states).split(self.split_size,dim=2)attention_mask=encoder_attention_maskelse:query,key,value=self.c_attn(hidden_states).split(self.split_size,dim=2)query=self.split_heads(query)key=self.split_heads(key,k=True)value=self.split_heads(value)iflayer_pastisnotNone:past_key,past_value=layer_past[0].transpose(-2,-1),layer_past[1]# transpose back cf belowkey=torch.cat((past_key,key),dim=-1)value=torch.cat((past_value,value),dim=-2)ifuse_cacheisTrue:present=torch.stack((key.transpose(-2,-1),value))# transpose to have same shapes for stackingelse:present=(None,)attn_outputs=self._attn(query,key,value,attention_mask,head_mask,output_attentions)a=attn_outputs[0]a=self.merge_heads(a)a=self.c_proj(a)a=self.resid_dropout(a)outputs=[a,present]+attn_outputs[1:]returnoutputs# a, present, (attentions)classMLP(nn.Module):def__init__(self,n_state,config):# in MLP: n_state=3072 (4 * n_embd)super().__init__()nx=config.n_embdself.c_fc=Conv1D(n_state,nx)self.c_proj=Conv1D(nx,n_state)self.act=ACT2FN[config.activation_function]self.dropout=nn.Dropout(config.resid_pdrop)defforward(self,x):h=self.act(self.c_fc(x))h2=self.c_proj(h)returnself.dropout(h2)classBlock(nn.Module):def__init__(self,n_ctx,config,scale=False):super().__init__()hidden_size=config.n_embdinner_dim=config.n_innerifconfig.n_innerisnotNoneelse4*hidden_sizeself.ln_1=nn.LayerNorm(hidden_size,eps=config.layer_norm_epsilon)self.attn=Attention(hidden_size,n_ctx,config,scale)self.ln_2=nn.LayerNorm(hidden_size,eps=config.layer_norm_epsilon)ifconfig.add_cross_attention:self.crossattention=Attention(hidden_size,n_ctx,config,scale,is_cross_attention=True)self.ln_cross_attn=nn.LayerNorm(hidden_size,eps=config.layer_norm_epsilon)self.mlp=MLP(inner_dim,config)defforward(self,hidden_states,layer_past=None,attention_mask=None,head_mask=None,encoder_hidden_states=None,encoder_attention_mask=None,use_cache=False,output_attentions=False,):attn_outputs=self.attn(self.ln_1(hidden_states),layer_past=layer_past,attention_mask=attention_mask,head_mask=head_mask,use_cache=use_cache,output_attentions=output_attentions,)attn_output=attn_outputs[0]# output_attn: a, present, (attentions)outputs=attn_outputs[1:]# residual connectionhidden_states=attn_output+hidden_statesifencoder_hidden_statesisnotNone:# add one self-attention block for cross-attentionasserthasattr(self,"crossattention"),f"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers by setting `config.add_cross_attention=True`"cross_attn_outputs=self.crossattention(self.ln_cross_attn(hidden_states),attention_mask=attention_mask,head_mask=head_mask,encoder_hidden_states=encoder_hidden_states,encoder_attention_mask=encoder_attention_mask,output_attentions=output_attentions,)attn_output=cross_attn_outputs[0]# 
residual connectionhidden_states=hidden_states+attn_outputoutputs=outputs+cross_attn_outputs[2:]# add cross attentions if we output attention weightsfeed_forward_hidden_states=self.mlp(self.ln_2(hidden_states))# residual connectionhidden_states=hidden_states+feed_forward_hidden_statesoutputs=[hidden_states]+outputsreturnoutputs# hidden_states, present, (attentions, cross_attentions)classGPT2PreTrainedModel(PreTrainedModel):""" An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models. """config_class=GPT2Configload_tf_weights=load_tf_weights_in_gpt2base_model_prefix="transformer"def__init__(self,*inputs,**kwargs):super().__init__(*inputs,**kwargs)def_init_weights(self,module):"""Initialize the weights."""ifisinstance(module,(nn.Linear,nn.Embedding,Conv1D)):# Slightly different from the TF version which uses truncated_normal for initialization# cf https://github.com/pytorch/pytorch/pull/5617module.weight.data.normal_(mean=0.0,std=self.config.initializer_range)ifisinstance(module,(nn.Linear,Conv1D))andmodule.biasisnotNone:module.bias.data.zero_()elifisinstance(module,nn.LayerNorm):module.bias.data.zero_()module.weight.data.fill_(1.0)

[docs]@dataclassclassGPT2DoubleHeadsModelOutput(ModelOutput):""" Base class for outputs of models predicting if two sentences are consecutive or not. Args: loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided): Language modeling loss. mc_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`mc_labels` is provided): Multiple choice classification loss. logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices, sequence_length, config.vocab_size)`): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). mc_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`): Prediction scores of the multiple choice classification head (scores for each choice before SoftMax). past_key_values (:obj:`List[torch.FloatTensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``): List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`). Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see :obj:`past_key_values` input) to speed up sequential decoding. hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``): Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the initial embedding outputs. attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``): Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. """loss:Optional[torch.FloatTensor]=Nonemc_loss:Optional[torch.FloatTensor]=Nonelogits:torch.FloatTensor=Nonemc_logits:torch.FloatTensor=Nonepast_key_values:Optional[List[torch.FloatTensor]]=Nonehidden_states:Optional[Tuple[torch.FloatTensor]]=Noneattentions:Optional[Tuple[torch.FloatTensor]]=None

GPT2_START_DOCSTRING=r""" This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.) This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior. Parameters: config (:class:`~transformers.GPT2Config`): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights."""GPT2_INPUTS_DOCSTRING=r""" Args: input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, input_ids_length)`): :obj:`input_ids_length` = ``sequence_length`` if :obj:`past_key_values` is ``None`` else ``past_key_values[0].shape[-2]`` (``sequence_length`` of input past key value states). Indices of input sequence tokens in the vocabulary. If :obj:`past_key_values` is used, only ``input_ids`` that do not have their past calculated should be passed as ``input_ids``. Indices can be obtained using :class:`~transformers.GPT2Tokenizer`. See :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for details. `What are input IDs? <../glossary.html#input-ids>`__ past_key_values (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`): Contains precomputed hidden-states (key and values in the attention blocks) as computed by the model (see :obj:`past_key_values` output below). Can be used to speed up sequential decoding. The ``input_ids`` which have their past given to this model should not be passed as ``input_ids`` as they have already been computed. attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``: - 1 for tokens that are **not masked**, - 0 for tokens that are **masked**. `What are attention masks? <../glossary.html#attention-mask>`__ token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, input_ids_length)`, `optional`): Segment token indices to indicate first and second portions of the inputs. Indices are selected in ``[0, 1]``: - 0 corresponds to a `sentence A` token, - 1 corresponds to a `sentence B` token. `What are token type IDs? <../glossary.html#token-type-ids>`_ position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0, config.max_position_embeddings - 1]``. `What are position IDs? <../glossary.html#position-ids>`_ head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`): Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``: - 1 indicates the head is **not masked**, - 0 indicates the head is **masked**. inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. 
This is useful if you want more control over how to convert :obj:`input_ids` indices into associated vectors than the model's internal embedding lookup matrix. If :obj:`past_key_values` is used, optionally only the last :obj:`inputs_embeds` have to be input (see :obj:`past_key_values`). use_cache (:obj:`bool`, `optional`): If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up decoding (see :obj:`past_key_values`). output_attentions (:obj:`bool`, `optional`): Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned tensors for more detail. output_hidden_states (:obj:`bool`, `optional`): Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for more detail. return_dict (:obj:`bool`, `optional`): Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple."""
[docs]@add_start_docstrings("The bare GPT2 Model transformer outputting raw hidden-states without any specific head on top.",GPT2_START_DOCSTRING,)classGPT2Model(GPT2PreTrainedModel):def__init__(self,config):super().__init__(config)self.wte=nn.Embedding(config.vocab_size,config.n_embd)self.wpe=nn.Embedding(config.n_positions,config.n_embd)self.drop=nn.Dropout(config.embd_pdrop)self.h=nn.ModuleList([Block(config.n_ctx,config,scale=True)for_inrange(config.n_layer)])self.ln_f=nn.LayerNorm(config.n_embd,eps=config.layer_norm_epsilon)self.init_weights()defget_input_embeddings(self):returnself.wtedefset_input_embeddings(self,new_embeddings):self.wte=new_embeddingsdef_prune_heads(self,heads_to_prune):""" Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} """forlayer,headsinheads_to_prune.items():self.h[layer].attn.prune_heads(heads)

[docs]@add_start_docstrings_to_model_forward(GPT2_INPUTS_DOCSTRING)@add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC,checkpoint="gpt2",output_type=BaseModelOutputWithPastAndCrossAttentions,config_class=_CONFIG_FOR_DOC,)defforward(self,input_ids=None,past_key_values=None,attention_mask=None,token_type_ids=None,position_ids=None,head_mask=None,inputs_embeds=None,encoder_hidden_states=None,encoder_attention_mask=None,use_cache=None,output_attentions=None,output_hidden_states=None,return_dict=None,**kwargs,):if"past"inkwargs:warnings.warn("The `past` argument is deprecated and will be removed in a future version, use `past_key_values` instead.",FutureWarning,)past_key_values=kwargs.pop("past")assertkwargs=={},f"Unexpected keyword arguments: {list(kwargs.keys())}."output_attentions=output_attentionsifoutput_attentionsisnotNoneelseself.config.output_attentionsoutput_hidden_states=(output_hidden_statesifoutput_hidden_statesisnotNoneelseself.config.output_hidden_states)use_cache=use_cacheifuse_cacheisnotNoneelseself.config.use_cachereturn_dict=return_dictifreturn_dictisnotNoneelseself.config.use_return_dictifinput_idsisnotNoneandinputs_embedsisnotNone:raiseValueError("You cannot specify both input_ids and inputs_embeds at the same time")elifinput_idsisnotNone:input_shape=input_ids.size()input_ids=input_ids.view(-1,input_shape[-1])batch_size=input_ids.shape[0]elifinputs_embedsisnotNone:input_shape=inputs_embeds.size()[:-1]batch_size=inputs_embeds.shape[0]else:raiseValueError("You have to specify either input_ids or inputs_embeds")iftoken_type_idsisnotNone:token_type_ids=token_type_ids.view(-1,input_shape[-1])ifposition_idsisnotNone:position_ids=position_ids.view(-1,input_shape[-1])ifpast_key_valuesisNone:past_length=0past_key_values=[None]*len(self.h)else:past_length=past_key_values[0][0].size(-2)ifposition_idsisNone:device=input_ids.deviceifinput_idsisnotNoneelseinputs_embeds.deviceposition_ids=torch.arange(past_length,input_shape[-1]+past_length,dtype=torch.long,device=device)position_ids=position_ids.unsqueeze(0).view(-1,input_shape[-1])# Attention mask.ifattention_maskisnotNone:assertbatch_size>0,"batch_size has to be defined and > 0"attention_mask=attention_mask.view(batch_size,-1)# We create a 3D attention mask from a 2D tensor mask.# Sizes are [batch_size, 1, 1, to_seq_length]# So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]# this attention mask is more simple than the triangular masking of causal attention# used in OpenAI GPT, we just need to prepare the broadcast dimension here.attention_mask=attention_mask[:,None,None,:]# Since attention_mask is 1.0 for positions we want to attend and 0.0 for# masked positions, this operation will create a tensor which is 0.0 for# positions we want to attend and -10000.0 for masked positions.# Since we are adding it to the raw scores before the softmax, this is# effectively the same as removing these entirely.attention_mask=attention_mask.to(dtype=self.dtype)# fp16 compatibilityattention_mask=(1.0-attention_mask)*-10000.0# If a 2D ou 3D attention mask is provided for the cross-attention# we need to make broadcastable to [batch_size, num_heads, seq_length, 
seq_length]ifself.config.add_cross_attentionandencoder_hidden_statesisnotNone:encoder_batch_size,encoder_sequence_length,_=encoder_hidden_states.size()encoder_hidden_shape=(encoder_batch_size,encoder_sequence_length)ifencoder_attention_maskisNone:encoder_attention_mask=torch.ones(encoder_hidden_shape,device=device)encoder_attention_mask=self.invert_attention_mask(encoder_attention_mask)else:encoder_attention_mask=None# Prepare head mask if needed# 1.0 in head_mask indicate we keep the head# attention_probs has shape bsz x n_heads x N x N# head_mask has shape n_layer x batch x n_heads x N x Nhead_mask=self.get_head_mask(head_mask,self.config.n_layer)ifinputs_embedsisNone:inputs_embeds=self.wte(input_ids)position_embeds=self.wpe(position_ids)hidden_states=inputs_embeds+position_embedsiftoken_type_idsisnotNone:token_type_embeds=self.wte(token_type_ids)hidden_states=hidden_states+token_type_embedshidden_states=self.drop(hidden_states)output_shape=input_shape+(hidden_states.size(-1),)presents=()ifuse_cacheelseNoneall_self_attentions=()ifoutput_attentionselseNoneall_cross_attentions=()ifoutput_attentionsandself.config.add_cross_attentionelseNoneall_hidden_states=()ifoutput_hidden_stateselseNonefori,(block,layer_past)inenumerate(zip(self.h,past_key_values)):ifoutput_hidden_states:all_hidden_states=all_hidden_states+(hidden_states.view(*output_shape),)ifgetattr(self.config,"gradient_checkpointing",False):defcreate_custom_forward(module):defcustom_forward(*inputs):# checkpointing only works with tuple returns, not with listsreturntuple(outputforoutputinmodule(*inputs,use_cache,output_attentions))returncustom_forwardoutputs=torch.utils.checkpoint.checkpoint(create_custom_forward(block),hidden_states,layer_past,attention_mask,head_mask[i],encoder_hidden_states,encoder_attention_mask,)else:outputs=block(hidden_states,layer_past=layer_past,attention_mask=attention_mask,head_mask=head_mask[i],encoder_hidden_states=encoder_hidden_states,encoder_attention_mask=encoder_attention_mask,use_cache=use_cache,output_attentions=output_attentions,)hidden_states,present=outputs[:2]ifuse_cacheisTrue:presents=presents+(present,)ifoutput_attentions:all_self_attentions=all_self_attentions+(outputs[2],)ifself.config.add_cross_attention:all_cross_attentions=all_cross_attentions+(outputs[3],)hidden_states=self.ln_f(hidden_states)hidden_states=hidden_states.view(*output_shape)# Add last hidden stateifoutput_hidden_states:all_hidden_states=all_hidden_states+(hidden_states,)ifnotreturn_dict:returntuple(vforvin[hidden_states,presents,all_hidden_states,all_self_attentions]ifvisnotNone)returnBaseModelOutputWithPastAndCrossAttentions(last_hidden_state=hidden_states,past_key_values=presents,hidden_states=all_hidden_states,attentions=all_self_attentions,cross_attentions=all_cross_attentions,)

[docs]@add_start_docstrings(""" The GPT2 Model transformer with a language modeling head on top (linear layer with weights tied to the input embeddings). """,GPT2_START_DOCSTRING,)classGPT2LMHeadModel(GPT2PreTrainedModel):authorized_missing_keys=[r"h\.\d+\.attn\.masked_bias",r"lm_head\.weight"]def__init__(self,config):super().__init__(config)self.transformer=GPT2Model(config)self.lm_head=nn.Linear(config.n_embd,config.vocab_size,bias=False)self.init_weights()defget_output_embeddings(self):returnself.lm_headdefprepare_inputs_for_generation(self,input_ids,past=None,**kwargs):# only last token for inputs_ids if past is defined in kwargsifpast:input_ids=input_ids[:,-1].unsqueeze(-1)attention_mask=kwargs.get("attention_mask",None)position_ids=kwargs.get("position_ids",None)ifattention_maskisnotNoneandposition_idsisNone:# create position_ids on the fly for batch generationposition_ids=attention_mask.long().cumsum(-1)-1position_ids.masked_fill_(attention_mask==0,1)ifpast:position_ids=position_ids[:,-1].unsqueeze(-1)else:position_ids=Nonereturn{"input_ids":input_ids,"past_key_values":past,"use_cache":kwargs.get("use_cache"),"position_ids":position_ids,"attention_mask":attention_mask,}

[docs]@add_start_docstrings_to_model_forward(GPT2_INPUTS_DOCSTRING)@add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC,checkpoint="gpt2",output_type=CausalLMOutputWithPastAndCrossAttentions,config_class=_CONFIG_FOR_DOC,)defforward(self,input_ids=None,past_key_values=None,attention_mask=None,token_type_ids=None,position_ids=None,head_mask=None,inputs_embeds=None,encoder_hidden_states=None,encoder_attention_mask=None,labels=None,use_cache=None,output_attentions=None,output_hidden_states=None,return_dict=None,**kwargs,):r""" labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids`` Indices are selected in ``[-100, 0, ..., config.vocab_size]`` All labels set to ``-100`` are ignored (masked), the loss is only computed for labels in ``[0, ..., config.vocab_size]`` """if"past"inkwargs:warnings.warn("The `past` argument is deprecated and will be removed in a future version, use `past_key_values` instead.",FutureWarning,)past_key_values=kwargs.pop("past")assertkwargs=={},f"Unexpected keyword arguments: {list(kwargs.keys())}."return_dict=return_dictifreturn_dictisnotNoneelseself.config.use_return_dicttransformer_outputs=self.transformer(input_ids,past_key_values=past_key_values,attention_mask=attention_mask,token_type_ids=token_type_ids,position_ids=position_ids,head_mask=head_mask,inputs_embeds=inputs_embeds,encoder_hidden_states=encoder_hidden_states,encoder_attention_mask=encoder_attention_mask,use_cache=use_cache,output_attentions=output_attentions,output_hidden_states=output_hidden_states,return_dict=return_dict,)hidden_states=transformer_outputs[0]lm_logits=self.lm_head(hidden_states)loss=NoneiflabelsisnotNone:# Shift so that tokens < n predict nshift_logits=lm_logits[...,:-1,:].contiguous()shift_labels=labels[...,1:].contiguous()# Flatten the tokensloss_fct=CrossEntropyLoss()loss=loss_fct(shift_logits.view(-1,shift_logits.size(-1)),shift_labels.view(-1))ifnotreturn_dict:output=(lm_logits,)+transformer_outputs[1:]return((loss,)+output)iflossisnotNoneelseoutputreturnCausalLMOutputWithPastAndCrossAttentions(loss=loss,logits=lm_logits,past_key_values=transformer_outputs.past_key_values,hidden_states=transformer_outputs.hidden_states,attentions=transformer_outputs.attentions,cross_attentions=transformer_outputs.cross_attentions,)

[docs]@add_start_docstrings("""The GPT2 Model transformer with a language modeling and a multiple-choice classification head on top e.g. forRocStories/SWAG tasks. The two heads are two linear layers. The language modeling head has its weights tied to theinput embeddings, the classification head takes as input the input of a specified classification token index in theinput sequence).""",GPT2_START_DOCSTRING,)classGPT2DoubleHeadsModel(GPT2PreTrainedModel):def__init__(self,config):super().__init__(config)config.num_labels=1self.transformer=GPT2Model(config)self.lm_head=nn.Linear(config.n_embd,config.vocab_size,bias=False)self.multiple_choice_head=SequenceSummary(config)self.init_weights()defget_output_embeddings(self):returnself.lm_headdefprepare_inputs_for_generation(self,input_ids,past=None,**kwargs):# only last token for inputs_ids if past is defined in kwargsifpast:input_ids=input_ids[:,-1].unsqueeze(-1)return{"input_ids":input_ids,"past_key_values":past,"use_cache":kwargs.get("use_cache"),}

[docs]@add_start_docstrings_to_model_forward(GPT2_INPUTS_DOCSTRING)@replace_return_docstrings(output_type=GPT2DoubleHeadsModelOutput,config_class=_CONFIG_FOR_DOC)defforward(self,input_ids=None,past_key_values=None,attention_mask=None,token_type_ids=None,position_ids=None,head_mask=None,inputs_embeds=None,mc_token_ids=None,labels=None,mc_labels=None,use_cache=None,output_attentions=None,output_hidden_states=None,return_dict=None,**kwargs,):r""" mc_token_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input): Index of the classification token in each input sequence. Selected in the range ``[0, input_ids.size(-1) - 1[``. labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids`` Indices are selected in ``[-1, 0, ..., config.vocab_size]`` All labels set to ``-100`` are ignored (masked), the loss is only computed for labels in ``[0, ..., config.vocab_size]`` mc_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size)`, `optional`): Labels for computing the multiple choice classification loss. Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension of the input tensors. (see `input_ids` above) kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`): Used to hide legacy arguments that have been deprecated. Return: Example:: >>> import torch >>> from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel >>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2') >>> model = GPT2DoubleHeadsModel.from_pretrained('gpt2, return_dict=True) >>> # Add a [CLS] to the vocabulary (we should train it also!) 
>>> num_added_tokens = tokenizer.add_special_tokens({'cls_token': '[CLS]'}) >>> embedding_layer = model.resize_token_embeddings(len(tokenizer)) # Update the model embeddings with the new vocabulary size >>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"] >>> encoded_choices = [tokenizer.encode(s) for s in choices] >>> cls_token_location = [tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices] >>> input_ids = torch.tensor(encoded_choices).unsqueeze(0) # Batch size: 1, number of choices: 2 >>> mc_token_ids = torch.tensor([cls_token_location]) # Batch size: 1 >>> outputs = model(input_ids, mc_token_ids=mc_token_ids) >>> lm_logits = outputs.lm_logits >>> mc_logits = outputs.mc_logits """if"lm_labels"inkwargs:warnings.warn("The `lm_labels` argument is deprecated and will be removed in a future version, use `labels` instead.",FutureWarning,)labels=kwargs.pop("lm_labels")if"past"inkwargs:warnings.warn("The `past` argument is deprecated and will be removed in a future version, use `past_key_values` instead.",FutureWarning,)past_key_values=kwargs.pop("past")assertkwargs=={},f"Unexpected keyword arguments: {list(kwargs.keys())}."return_dict=return_dictifreturn_dictisnotNoneelseself.config.use_return_dicttransformer_outputs=self.transformer(input_ids,past_key_values=past_key_values,attention_mask=attention_mask,token_type_ids=token_type_ids,position_ids=position_ids,head_mask=head_mask,inputs_embeds=inputs_embeds,use_cache=use_cache,output_attentions=output_attentions,output_hidden_states=output_hidden_states,return_dict=return_dict,)hidden_states=transformer_outputs[0]lm_logits=self.lm_head(hidden_states)mc_logits=self.multiple_choice_head(hidden_states,mc_token_ids).squeeze(-1)mc_loss=Noneifmc_labelsisnotNone:loss_fct=CrossEntropyLoss()mc_loss=loss_fct(mc_logits.view(-1,mc_logits.size(-1)),mc_labels.view(-1))lm_loss=NoneiflabelsisnotNone:shift_logits=lm_logits[...,:-1,:].contiguous()shift_labels=labels[...,1:].contiguous()loss_fct=CrossEntropyLoss()lm_loss=loss_fct(shift_logits.view(-1,shift_logits.size(-1)),shift_labels.view(-1))ifnotreturn_dict:output=(lm_logits,mc_logits)+transformer_outputs[1:]ifmc_lossisnotNone:output=(mc_loss,)+outputreturn((lm_loss,)+output)iflm_lossisnotNoneelseoutputreturnGPT2DoubleHeadsModelOutput(loss=lm_loss,mc_loss=mc_loss,logits=lm_logits,mc_logits=mc_logits,past_key_values=transformer_outputs.past_key_values,hidden_states=transformer_outputs.hidden_states,attentions=transformer_outputs.attentions,)

[docs]@add_start_docstrings(""" The GPT2 Model transformer with a sequence classification head on top (linear layer). :class:`~transformers.GPT2ForSequenceClassification` uses the last token in order to do the classification, as other causal models (e.g. GPT-1) do. Since it does classification on the last token, it requires to know the position of the last token. If a :obj:`pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If no :obj:`pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the padding tokens when :obj:`inputs_embeds` are passed instead of :obj:`input_ids`, it does the same (take the last value in each row of the batch). """,GPT2_START_DOCSTRING,)classGPT2ForSequenceClassification(GPT2PreTrainedModel):authorized_missing_keys=[r"h\.\d+\.attn\.masked_bias",r"lm_head\.weight"]def__init__(self,config):super().__init__(config)self.num_labels=config.num_labelsself.transformer=GPT2Model(config)self.score=nn.Linear(config.n_embd,self.num_labels,bias=False)self.init_weights()

[docs]@add_start_docstrings_to_model_forward(GPT2_INPUTS_DOCSTRING)@add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC,checkpoint="microsoft/dialogrpt",output_type=SequenceClassifierOutputWithPast,config_class=_CONFIG_FOR_DOC,)defforward(self,input_ids=None,past_key_values=None,attention_mask=None,token_type_ids=None,position_ids=None,head_mask=None,inputs_embeds=None,labels=None,use_cache=None,output_attentions=None,output_hidden_states=None,return_dict=None,):r""" labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ..., config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss), If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy). """return_dict=return_dictifreturn_dictisnotNoneelseself.config.use_return_dicttransformer_outputs=self.transformer(input_ids,past_key_values=past_key_values,attention_mask=attention_mask,token_type_ids=token_type_ids,position_ids=position_ids,head_mask=head_mask,inputs_embeds=inputs_embeds,use_cache=use_cache,output_attentions=output_attentions,output_hidden_states=output_hidden_states,return_dict=return_dict,)hidden_states=transformer_outputs[0]logits=self.score(hidden_states)ifinput_idsisnotNone:batch_size,sequence_length=input_ids.shape[:2]else:batch_size,sequence_length=inputs_embeds.shape[:2]assert(self.config.pad_token_idisnotNoneorbatch_size==1),"Cannot handle batch sizes > 1 if no padding token is defined."ifself.config.pad_token_idisNone:sequence_lengths=-1else:ifinput_idsisnotNone:sequence_lengths=torch.ne(input_ids,self.config.pad_token_id).sum(-1)-1else:sequence_lengths=-1logger.warning(f"{self.__class__.__name__} will not detect padding tokens in `inputs_embeds`. Results may be "f"unexpected if using padding tokens in conjunction with `inputs_embeds.`")pooled_logits=logits[range(batch_size),sequence_lengths]loss=NoneiflabelsisnotNone:ifself.num_labels==1:# We are doing regressionloss_fct=MSELoss()loss=loss_fct(pooled_logits.view(-1),labels.to(self.dtype).view(-1))else:loss_fct=CrossEntropyLoss()loss=loss_fct(pooled_logits.view(-1,self.num_labels),labels.view(-1))ifnotreturn_dict:output=(pooled_logits,)+transformer_outputs[1:]return((loss,)+output)iflossisnotNoneelseoutputreturnSequenceClassifierOutputWithPast(loss=loss,logits=pooled_logits,past_key_values=transformer_outputs.past_key_values,hidden_states=transformer_outputs.hidden_states,attentions=transformer_outputs.attentions,)

Source: https://huggingface.co/transformers/_modules/transformers/modeling_gpt2.html

Gpt2 huggingface

This web app, built by the Hugging Face team, is the official demo of the 🤗 Transformers repository's text generation capabilities.


Checkpoints

🐎 DistilGPT-2

The student of the now ubiquitous GPT-2 does not come short of its teacher’s expectations. Obtained by distillation, DistilGPT-2 weighs 37% less, and is twice as fast as its OpenAI counterpart, while keeping the same generative power. Runs smoothly on an iPhone 7. The dawn of lightweight generative transformers?


🤓 Arxiv-NLP

Built on the OpenAI GPT-2 model, the Hugging Face team has fine-tuned the small version on a tiny dataset (60MB of text) of Arxiv papers. The targeted subject is Natural Language Processing, resulting in a very Linguistics/Deep Learning oriented generation.


Models

🦄 GPT-2

The almighty king of text generation, GPT-2 comes in four available sizes, only three of which have been publicly made available. Feared for its fake news generation capabilities, it currently stands as the most syntactically coherent model. A direct successor to the original GPT, it reinforces the already established pre-training/fine-tuning killer duo. From the paper: Language Models are Unsupervised Multitask Learners by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever.


💯 XLNet

Overcoming the unidirectional limit while maintaining an independent masking algorithm based on permutation, XLNet improves upon the state-of-the-art autoregressive model that is TransformerXL. Using a bidirectional context while keeping its autoregressive approach, this model outperforms BERT on 20 tasks while keeping an impressive generative coherence. From the paper: XLNet: Generalized Autoregressive Pretraining for Language Understanding, by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov and Quoc V. Le.


☠️ GPT

Released by OpenAI, this seminal architecture has shown that large gains on several NLP tasks can be achieved by generative pre-training a language model on unlabeled text before fine-tuning it on a downstream task. From the paper: Improving Language Understanding by Generative Pre-Training, by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.


Do you want to contribute or suggest a new model checkpoint? Open an issue on 🔥.

“It is to writing what calculators are to calculus.”

Source: https://transformer.huggingface.co/

OpenAI GPT2¶

Overview¶

OpenAI GPT-2 model was proposed in Language Models are Unsupervised Multitask Learners by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. It’s a causal (unidirectional) transformer pretrained using language modeling on a very large corpus of ~40 GB of text data.

The abstract from the paper is the following:

GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset[1] of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.

Tips:

  • GPT-2 is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

  • GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text as it can be observed in the run_generation.py example script.

  • The model can take the past_key_values (for PyTorch) or past (for TF) as input, which is the previously computed key/value attention pairs. Using this (past_key_values or past) value prevents the model from re-computing pre-computed values in the context of text generation. For PyTorch, see the past_key_values argument of the GPT2Model.forward() method, or for TF the past argument of the TFGPT2Model.call() method, for more information on its usage; a short sketch follows this list.
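A minimal PyTorch sketch of that caching pattern follows; the prompt and the greedy next-token pick are only illustrative, and return_dict=True assumes a reasonably recent transformers release:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", return_dict=True)
model.eval()

input_ids = tokenizer("Hello, my dog is", return_tensors="pt").input_ids

with torch.no_grad():
    # First pass over the full prompt, asking the model to return its key/value cache.
    outputs = model(input_ids, use_cache=True)
    past_key_values = outputs.past_key_values

    # Second pass: feed only the newly chosen token plus the cache,
    # instead of re-encoding the whole sequence.
    next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    outputs = model(next_token, past_key_values=past_key_values, use_cache=True)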

Write With Transformer is a webapp created and hosted by Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five different sizes: small, medium, large, xl and a distilled version of the small checkpoint: distilgpt-2.

This model was contributed by thomwolf. The original code can be found here.
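Before the individual classes below, here is a short, hedged sketch of the end-to-end generation flow in TensorFlow; the prompt, max_length and sampling parameters are arbitrary illustrations rather than recommended settings:

from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("The stock market today", return_tensors="tf")

# Sample a continuation from the prompt; all generation parameters here are illustrative only.
output_ids = model.generate(input_ids, max_length=40, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))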

GPT2Config¶

class transformers.GPT2Config(vocab_size=50257, n_positions=1024, n_ctx=1024, n_embd=768, n_layer=12, n_head=12, n_inner=None, activation_function='gelu_new', resid_pdrop=0.1, embd_pdrop=0.1, attn_pdrop=0.1, layer_norm_epsilon=1e-05, initializer_range=0.02, summary_type='cls_index', summary_use_proj=True, summary_activation=None, summary_proj_to_labels=True, summary_first_dropout=0.1, scale_attn_weights=True, use_cache=True, bos_token_id=50256, eos_token_id=50256, **kwargs)[source]¶

This is the configuration class to store the configuration of a GPT2Model or a TFGPT2Model. It is used to instantiate a GPT-2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the GPT-2 small architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Parameters
  • vocab_size (int, optional, defaults to 50257) – Vocabulary size of the GPT-2 model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling GPT2Model or TFGPT2Model.

  • n_positions (, optional, defaults to 1024) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

  • n_ctx (, optional, defaults to 1024) – Dimensionality of the causal mask (usually same as n_positions).

  • n_embd (, optional, defaults to 768) – Dimensionality of the embeddings and hidden states.

  • n_layer (, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.

  • n_head (, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.

  • n_inner (int, optional, defaults to None) – Dimensionality of the inner feed-forward layers. None will set it to 4 times n_embd.

  • activation_function (str, optional, defaults to "gelu_new") – Activation function, to be selected in the list ["relu", "silu", "gelu", "tanh", "gelu_new"].

  • resid_pdrop (, optional, defaults to 0.1) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

  • embd_pdrop (, optional, defaults to 0.1) – The dropout ratio for the embeddings.

  • attn_pdrop (, optional, defaults to 0.1) – The dropout ratio for the attention.

  • layer_norm_epsilon (, optional, defaults to 1e-5) – The epsilon to use in the layer normalization layers

  • initializer_range (, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

  • summary_type (str, optional, defaults to "cls_index") –

    Argument used when doing sequence summary, used in the models GPT2DoubleHeadsModel and TFGPT2DoubleHeadsModel.

    Has to be one of the following options:

    • "last": Take the last token hidden state (like XLNet).

    • "first": Take the first token hidden state (like BERT).

    • "mean": Take the mean of all tokens hidden states.

    • "cls_index": Supply a Tensor of classification token position (like GPT/GPT-2).

    • "attn": Not implemented now, use multi-head attention.

  • summary_use_proj (bool, optional, defaults to True) –

    Argument used when doing sequence summary, used in the models GPT2DoubleHeadsModel and TFGPT2DoubleHeadsModel.

    Whether or not to add a projection after the vector extraction.

  • summary_activation (str, optional) –

    Argument used when doing sequence summary, used for the multiple choice head in GPT2DoubleHeadsModel.

    Pass "tanh" for a tanh activation to the output, any other value will result in no activation.

  • summary_proj_to_labels (bool, optional, defaults to True) –

    Argument used when doing sequence summary, used in the models GPT2DoubleHeadsModel and TFGPT2DoubleHeadsModel.

    Whether the projection outputs should have config.num_labels or config.hidden_size classes.

  • summary_first_dropout (float, optional, defaults to 0.1) –

    Argument used when doing sequence summary, used in the models GPT2DoubleHeadsModel and TFGPT2DoubleHeadsModel.

    The dropout ratio to be used after the projection and activation.

  • scale_attn_weights (bool, optional, defaults to True) – Scale attention weights by dividing by sqrt(hidden_size).

  • use_cache (bool, optional, defaults to True) – Whether or not the model should return the last key/values attentions (not used by all models).

Example:

>>> from transformers import GPT2Model, GPT2Config

>>> # Initializing a GPT2 configuration
>>> configuration = GPT2Config()

>>> # Initializing a model from the configuration
>>> model = GPT2Model(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

GPT2Tokenizer¶

class transformers.GPT2Tokenizer(vocab_file, merges_file, errors='replace', unk_token='<|endoftext|>', bos_token='<|endoftext|>', eos_token='<|endoftext|>', add_prefix_space=False, **kwargs)[source]¶

Construct a GPT-2 tokenizer. Based on byte-level Byte-Pair-Encoding.

This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will be encoded differently whether it is at the beginning of the sentence (without space) or not:

>>> from transformers import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
>>> tokenizer("Hello world")['input_ids']
[15496, 995]
>>> tokenizer(" Hello world")['input_ids']
[18435, 995]

You can get around that behavior by passing add_prefix_space=True when instantiating this tokenizer or when you call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
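As a small sketch of that workaround (the first two outputs repeat the snippet above; the add_prefix_space=True result is the expected behaviour, shown here for illustration):

from transformers import GPT2Tokenizer

# Default behaviour: the leading word is encoded without a space in front of it.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(tokenizer("Hello world")["input_ids"])    # [15496, 995]
print(tokenizer(" Hello world")["input_ids"])   # [18435, 995]

# With add_prefix_space=True the leading word is treated like any other word.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2", add_prefix_space=True)
print(tokenizer("Hello world")["input_ids"])    # expected: [18435, 995]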

Note

When used with is_split_into_words=True, this tokenizer will add a space before each word (even the first one).

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

Parameters
  • vocab_file (str) – Path to the vocabulary file.

  • merges_file (str) – Path to the merges file.

  • errors (str, optional, defaults to "replace") – Paradigm to follow when decoding bytes to UTF-8. See bytes.decode for more information.

  • unk_token (str, optional, defaults to "<|endoftext|>") – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

  • bos_token (str, optional, defaults to "<|endoftext|>") – The beginning of sequence token.

  • eos_token (str, optional, defaults to "<|endoftext|>") – The end of sequence token.

  • add_prefix_space (bool, optional, defaults to False) – Whether or not to add an initial space to the input. This allows to treat the leading word just as any other word. (The GPT-2 tokenizer detects the beginning of words by the preceding space.)

save_vocabulary(save_directory: str, filename_prefix: Optional[str] = None) → Tuple[str][source]¶

Save only the vocabulary of the tokenizer (vocabulary + added tokens).

This method won’t save the configuration and special token mappings of the tokenizer. Use save_pretrained() to save the whole state of the tokenizer.

Parameters
  • save_directory () – The directory in which to save the vocabulary.

  • filename_prefix (str, optional) – An optional prefix to add to the names of the saved files.

Returns

Paths to the files saved.

Return type

Tuple[str]
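A minimal usage sketch (the directory name is arbitrary and must exist before calling save_vocabulary):

import os
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

os.makedirs("./my_gpt2_tokenizer", exist_ok=True)
# Writes only vocab.json and merges.txt and returns their paths.
vocab_file, merges_file = tokenizer.save_vocabulary("./my_gpt2_tokenizer")

# save_pretrained() additionally stores the tokenizer configuration and special token mappings.
tokenizer.save_pretrained("./my_gpt2_tokenizer")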

GPT2TokenizerFast¶

class transformers.GPT2TokenizerFast(vocab_file=None, merges_file=None, tokenizer_file=None, unk_token='<|endoftext|>', bos_token='<|endoftext|>', eos_token='<|endoftext|>', add_prefix_space=False, **kwargs)[source]¶

Construct a “fast” GPT-2 tokenizer (backed by HuggingFace’s tokenizers library). Based on byte-level Byte-Pair-Encoding.

This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will be encoded differently whether it is at the beginning of the sentence (without space) or not:

>>> from transformers import GPT2TokenizerFast
>>> tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
>>> tokenizer("Hello world")['input_ids']
[15496, 995]
>>> tokenizer(" Hello world")['input_ids']
[18435, 995]

You can get around that behavior by passing add_prefix_space=True when instantiating this tokenizer or when you call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.

Note

When used with is_split_into_words=True, this tokenizer needs to be instantiated with add_prefix_space=True.
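A sketch of that requirement (the keyword for pre-tokenized input is is_split_into_words in recent releases; older releases called it is_pretokenized):

from transformers import GPT2TokenizerFast

# Pre-tokenized (already split into words) input requires add_prefix_space=True at instantiation.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2", add_prefix_space=True)
encoding = tokenizer(["Hello", "world"], is_split_into_words=True)
print(encoding["input_ids"])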

This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

Parameters
  • vocab_file (str) – Path to the vocabulary file.

  • merges_file (str) – Path to the merges file.

  • errors (str, optional, defaults to "replace") – Paradigm to follow when decoding bytes to UTF-8. See bytes.decode for more information.

  • unk_token (str, optional, defaults to "<|endoftext|>") – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

  • bos_token (str, optional, defaults to "<|endoftext|>") – The beginning of sequence token.

  • eos_token (str, optional, defaults to "<|endoftext|>") – The end of sequence token.

  • add_prefix_space (bool, optional, defaults to False) – Whether or not to add an initial space to the input. This allows to treat the leading word just as any other word. (The GPT-2 tokenizer detects the beginning of words by the preceding space.)

  • trim_offsets (bool, optional, defaults to True) – Whether or not the post-processing step should trim offsets to avoid including whitespaces.

save_vocabulary(save_directory: str, filename_prefix: Optional[str] = None) → Tuple[str][source]¶

Save only the vocabulary of the tokenizer (vocabulary + added tokens).

This method won’t save the configuration and special token mappings of the tokenizer. Use save_pretrained() to save the whole state of the tokenizer.

Parameters
  • save_directory () – The directory in which to save the vocabulary.

  • filename_prefix (str, optional) – An optional prefix to add to the names of the saved files.

Returns

Paths to the files saved.

Return type

Tuple[str]

slow_tokenizer_class¶

alias of GPT2Tokenizer

GPT2 specific outputs¶

class transformers.modeling_gpt2.GPT2DoubleHeadsModelOutput(loss: Optional[torch.FloatTensor] = None, mc_loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, mc_logits: torch.FloatTensor = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for outputs of models predicting if two sentences are consecutive or not.

Parameters
  • loss ( of shape , optional, returned when is provided) – Language modeling loss.

  • mc_loss ( of shape , optional, returned when is provided) – Multiple choice classification loss.

  • logits ( of shape ) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • mc_logits ( of shape ) – Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).

  • past_key_values (, optional, returned when is passed or when ) –

    Tuple of length , containing tuples of tensors of shape ).

    Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see input) to speed up sequential decoding.

  • hidden_states (, optional, returned when is passed or when ) –

    Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (, optional, returned when is passed or when ) –

    Tuple of (one for each layer) of shape .

    GPT2Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

class transformers.modeling_tf_gpt2.TFGPT2DoubleHeadsModelOutput(logits: tensorflow.python.framework.ops.Tensor = None, mc_logits: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]¶

Base class for outputs of models predicting if two sentences are consecutive or not.

Parameters
  • logits ( of shape ) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • mc_logits ( of shape ) – Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).

  • past_key_values (, optional, returned when is passed or when ) –

    List of of length , with each tensor of shape ).

    Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see input) to speed up sequential decoding.

  • hidden_states (, optional, returned when is passed or when ) –

    Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (, optional, returned when is passed or when ) –

    Tuple of (one for each layer) of shape .

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

GPT2Model¶

class transformers.GPT2Model(config)[source]¶

The bare GPT2 Model transformer outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

Parameters

config (GPT2Config) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

deparallelize()[source]¶

Moves the model to cpu from a model parallel state.

Example:

# On a 4 GPU machine with gpt2-large:
model = GPT2LMHeadModel.from_pretrained('gpt2-large')
device_map = {0: [0, 1, 2, 3, 4, 5, 6, 7],
              1: [8, 9, 10, 11, 12, 13, 14, 15],
              2: [16, 17, 18, 19, 20, 21, 22, 23],
              3: [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35]}
model.parallelize(device_map)  # Splits the model across several devices
model.deparallelize()  # Put the model back on cpu and cleans memory by calling torch.cuda.empty_cache()
forward(input_ids=None, past_key_values=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, encoder_hidden_states=None, encoder_attention_mask=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)[source]¶

The GPT2Model forward method, overrides the __call__ special method.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
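In code, the note above amounts to preferring the call syntax over an explicit forward() (a small illustrative sketch):

from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
input_ids = tokenizer("Hello", return_tensors="pt").input_ids

outputs = model(input_ids)            # preferred: __call__ runs the pre/post-processing steps
# outputs = model.forward(input_ids)  # works, but silently skips those steps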

Parameters
  • input_ids (torch.LongTensor of shape (batch_size, input_ids_length)) –

    input_ids_length = sequence_length if past_key_values is None else past_key_values[0].shape[-2] (sequence_length of input past key value states). Indices of input sequence tokens in the vocabulary.

    If past_key_values is used, only input_ids that do not have their past calculated should be passed as input_ids.

    Indices can be obtained using GPT2Tokenizer. See transformers.PreTrainedTokenizer.encode() and transformers.PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • past_key_values ( of length ) – Contains precomputed hidden-states (key and values in the attention blocks) as computed by the model (see output below). Can be used to speed up sequential decoding. The which have their past given to this model should not be passed as as they have already been computed.

  • attention_mask ( of shape , optional) –

    Mask to avoid performing attention on padding token indices. Mask values selected in :

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

    What are attention masks?

  • token_type_ids ( of shape , optional) –

    Segment token indices to indicate first and second portions of the inputs. Indices are selected in :

    • 0 corresponds to a sentence A token,

    • 1 corresponds to a sentence B token.

    What are token type IDs?

  • position_ids ( of shape , optional) –

    Indices of positions of each input sequence tokens in the position embeddings. Selected in the range .

    What are position IDs?

  • head_mask ( of shape or , optional) –

    Mask to nullify selected heads of the self-attention modules. Mask values selected in :

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • inputs_embeds ( of shape , optional) –

    Optionally, instead of passing you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert indices into associated vectors than the model’s internal embedding lookup matrix.

    If is used, optionally only the last have to be input (see ).

  • use_cache (, optional) – If set to , key value states are returned and can be used to speed up decoding (see ).

  • output_attentions (, optional) – Whether or not to return the attentions tensors of all attention layers. See under returned tensors for more detail.

  • output_hidden_states (, optional) – Whether or not to return the hidden states of all layers. See under returned tensors for more detail.

  • return_dict (, optional) – Whether or not to return a instead of a plain tuple.

Returns

A BaseModelOutputWithPastAndCrossAttentions or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (GPT2Config) and inputs.

  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences, of shape (batch_size, 1, hidden_size), is output.
  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and optionally, if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally, if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
  • cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

Return type

BaseModelOutputWithPastAndCrossAttentions or tuple(torch.FloatTensor)

Example:

>>> from transformers import GPT2Tokenizer, GPT2Model
>>> import torch

>>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
>>> model = GPT2Model.from_pretrained('gpt2')

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state
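The use_cache / past_key_values mechanism documented above can be exercised directly. The following is a minimal sketch (my addition, not part of the original reference) of feeding the returned cache back in so that a second call only has to process the newly added token; the prompt text and variable names are purely illustrative.

>>> # Sketch: reuse past_key_values so the second call only processes the new token
>>> from transformers import GPT2Tokenizer, GPT2Model

>>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
>>> model = GPT2Model.from_pretrained('gpt2')

>>> inputs = tokenizer("Hello, my dog", return_tensors="pt")
>>> outputs = model(**inputs, use_cache=True)
>>> past = outputs.past_key_values           # one (key, value) pair per layer

>>> next_ids = tokenizer(" is", return_tensors="pt").input_ids
>>> outputs = model(input_ids=next_ids, past_key_values=past, use_cache=True)
>>> outputs.last_hidden_state.shape          # torch.Size([1, 1, 768]): only the new position is computed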
parallelize(device_map=None)[source]

This is an experimental feature and is subject to change at a moment's notice.

Uses a device map to distribute attention modules of the model across several devices. If no device map is given, it will evenly distribute blocks across all devices.

Parameters

device_map (Dict[int, list], optional, defaults to None) –

A dictionary that maps attention modules to devices. Note that the embedding module and LMHead are always automatically mapped to the first device (for esoteric reasons). That means that the first device should have fewer attention modules mapped to it than other devices. For reference, the gpt2 models have the following number of attention modules:

  • gpt2: 12

  • gpt2-medium: 24

  • gpt2-large: 36

  • gpt2-xl: 48

Example:

# Here is an example of a device map on a machine with 4 GPUs using gpt2-xl, which has a total of 48 attention modules:
model = GPT2LMHeadModel.from_pretrained('gpt2-xl')
device_map = {0: [0, 1, 2, 3, 4, 5, 6, 7, 8],
              1: [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
              2: [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34],
              3: [35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]}
model.parallelize(device_map)
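A device map like the one above can also be built programmatically. This is a hedged sketch (my addition) of one way to split the blocks roughly evenly across whatever GPUs are visible; it assumes the config attribute n_layer and the parallelize() API shown above, and note that the docs above recommend giving the first device slightly fewer blocks since the embeddings live there.

# Sketch (not from the original docs): build a roughly even device_map for parallelize()
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained('gpt2-xl')
n_layer = model.config.n_layer             # 48 attention modules for gpt2-xl
n_gpu = max(torch.cuda.device_count(), 1)  # assumes at least one visible GPU

blocks = list(range(n_layer))
chunk = -(-n_layer // n_gpu)               # ceiling division
device_map = {gpu: blocks[gpu * chunk:(gpu + 1) * chunk] for gpu in range(n_gpu)}

model.parallelize(device_map)              # may need manual adjustment when n_layer does not divide evenly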

GPT2LMHeadModel

class transformers.GPT2LMHeadModel(config)[source]

The GPT2 Model transformer with a language modeling head on top (linear layer with weights tied to the input embeddings).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters

config (GPT2Config) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

deparallelize()[source]

Moves the model to CPU from a model parallel state.

Example:

# On a 4 GPU machine with gpt2-large:
model = GPT2LMHeadModel.from_pretrained('gpt2-large')
device_map = {0: [0, 1, 2, 3, 4, 5, 6, 7],
              1: [8, 9, 10, 11, 12, 13, 14, 15],
              2: [16, 17, 18, 19, 20, 21, 22, 23],
              3: [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35]}
model.parallelize(device_map)  # Splits the model across several devices
model.deparallelize()          # Puts the model back on CPU and cleans memory by calling torch.cuda.empty_cache()

forward(input_ids=None, past_key_values=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, encoder_hidden_states=None, encoder_attention_mask=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)[source]

The forward method overrides the __call__ special method.

Note

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Parameters
  • input_ids (torch.LongTensor of shape (batch_size, input_ids_length)) – input_ids_length = sequence_length if past_key_values is None else past_key_values[0][0].shape[-2] (sequence_length of input past key value states). Indices of input sequence tokens in the vocabulary. If past_key_values is used, only input_ids that do not have their past calculated should be passed as input_ids. Indices can be obtained using GPT2Tokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. What are input IDs?
  • past_key_values (Tuple[Tuple[torch.Tensor]] of length config.n_layers) – Contains precomputed hidden-states (key and values in the attention blocks) as computed by the model (see past_key_values output below). Can be used to speed up sequential decoding. The input_ids which have their past given to this model should not be passed as input_ids as they have already been computed.
  • attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. What are attention masks?
  • token_type_ids (torch.LongTensor of shape (batch_size, input_ids_length), optional) – Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token. What are token type IDs?
  • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) – Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]. What are position IDs?
  • head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) – Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) – Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix. If past_key_values is used, optionally only the last inputs_embeds have to be input (see past_key_values).
  • use_cache (bool, optional) – If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • output_attentions (bool, optional) – Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) – Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) – Whether or not to return a ModelOutput instead of a plain tuple.
  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) – Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set labels = input_ids. Indices are selected in [-100, 0, ..., config.vocab_size]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, ..., config.vocab_size].

Returns

A CausalLMOutputWithCrossAttentions or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (GPT2Config) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Language modeling loss (for next-token prediction).
  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
  • cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Cross attentions weights after the attention softmax, used to compute the weighted average in the cross-attention heads.
  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuples of length config.n_layers, with each tuple containing the cached key, value states of the self-attention and the cross-attention layers if the model is used in an encoder-decoder setting. Only relevant if config.is_decoder=True. Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

Return type

CausalLMOutputWithCrossAttentions or tuple(torch.FloatTensor)

Example:

>>> import torch
>>> from transformers import GPT2Tokenizer, GPT2LMHeadModel

>>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
>>> model = GPT2LMHeadModel.from_pretrained('gpt2')

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs, labels=inputs["input_ids"])
>>> loss = outputs.loss
>>> logits = outputs.logits
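Since GPT2LMHeadModel is the class used for text generation, a short sketch of sampling from it with the generic generate() method may also help; this example is my addition, and the decoding arguments (max_length, top_k, top_p) are illustrative choices rather than values prescribed by the reference above.

>>> # Sketch: sampling a continuation with generate()
>>> from transformers import GPT2Tokenizer, GPT2LMHeadModel

>>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
>>> model = GPT2LMHeadModel.from_pretrained('gpt2')

>>> input_ids = tokenizer("The stock market", return_tensors="pt").input_ids
>>> sample_ids = model.generate(input_ids, do_sample=True, max_length=30, top_k=50, top_p=0.95)
>>> print(tokenizer.decode(sample_ids[0], skip_special_tokens=True))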
parallelize(device_map=None)[source]

This is an experimental feature and is subject to change at a moment's notice.

Uses a device map to distribute attention modules of the model across several devices. If no device map is given, it will evenly distribute blocks across all devices.

Parameters

device_map (Dict[int, list], optional, defaults to None) –

A dictionary that maps attention modules to devices. Note that the embedding module and LMHead are always automatically mapped to the first device (for esoteric reasons). That means that the first device should have fewer attention modules mapped to it than other devices. For reference, the gpt2 models have the following number of attention modules:

  • gpt2: 12

  • gpt2-medium: 24

  • gpt2-large: 36

  • gpt2-xl: 48

Example:

# Here is an example of a device map on a machine with 4 GPUs using gpt2-xl, which has a total of 48 attention modules:
model = GPT2LMHeadModel.from_pretrained('gpt2-xl')
device_map = {0: [0, 1, 2, 3, 4, 5, 6, 7, 8],
              1: [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
              2: [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34],
              3: [35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]}
model.parallelize(device_map)

GPT2DoubleHeadsModel

class transformers.GPT2DoubleHeadsModel(config)[source]

The GPT2 Model transformer with a language modeling and a multiple-choice classification head on top, e.g. for RocStories/SWAG tasks. The two heads are two linear layers. The language modeling head has its weights tied to the input embeddings; the classification head takes as its input the hidden state at a specified classification token index in the input sequence.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters

config (GPT2Config) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

forward(input_ids=None, past_key_values=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, mc_token_ids=None, labels=None, mc_labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None, **kwargs)[source]

The forward method overrides the __call__ special method.

Note

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Parameters
  • input_ids (torch.LongTensor of shape (batch_size, input_ids_length)) – input_ids_length = sequence_length if past_key_values is None else past_key_values[0][0].shape[-2] (sequence_length of input past key value states). Indices of input sequence tokens in the vocabulary. If past_key_values is used, only input_ids that do not have their past calculated should be passed as input_ids. Indices can be obtained using GPT2Tokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. What are input IDs?
  • past_key_values (Tuple[Tuple[torch.Tensor]] of length config.n_layers) – Contains precomputed hidden-states (key and values in the attention blocks) as computed by the model (see past_key_values output below). Can be used to speed up sequential decoding. The input_ids which have their past given to this model should not be passed as input_ids as they have already been computed.
  • attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. What are attention masks?
  • token_type_ids (torch.LongTensor of shape (batch_size, input_ids_length), optional) – Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token. What are token type IDs?
  • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) – Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]. What are position IDs?
  • head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) – Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) – Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix. If past_key_values is used, optionally only the last inputs_embeds have to be input (see past_key_values).
  • use_cache (bool, optional) – If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • output_attentions (bool, optional) – Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) – Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) – Whether or not to return a ModelOutput instead of a plain tuple.
  • mc_token_ids (torch.LongTensor of shape (batch_size, num_choices), optional, defaults to the index of the last token of the input) – Index of the classification token in each input sequence. Selected in the range [0, input_ids.size(-1) - 1].
  • labels (torch.LongTensor, optional) – Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set labels = input_ids. Indices are selected in [-100, 0, ..., config.vocab_size]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, ..., config.vocab_size].
  • mc_labels (torch.LongTensor of shape (batch_size,), optional) – Labels for computing the multiple choice classification loss. Indices should be in [0, ..., num_choices - 1] where num_choices is the size of the second dimension of the input tensors (see input_ids above).

Returns

A GPT2DoubleHeadsModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (GPT2Config) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Language modeling loss.
  • mc_loss (torch.FloatTensor of shape (1,), optional, returned when mc_labels is provided) – Multiple choice classification loss.
  • logits (torch.FloatTensor of shape (batch_size, num_choices, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
  • mc_logits (torch.FloatTensor of shape (batch_size, num_choices)) – Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).
  • past_key_values (Tuple[Tuple[torch.Tensor]], optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of length config.n_layers, containing tuples of tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Return type

GPT2DoubleHeadsModelOutput or tuple(torch.FloatTensor)

Example:

>>> import torch
>>> from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel

>>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
>>> model = GPT2DoubleHeadsModel.from_pretrained('gpt2')

>>> # Add a [CLS] to the vocabulary (we should train it also!)
>>> num_added_tokens = tokenizer.add_special_tokens({'cls_token': '[CLS]'})

>>> embedding_layer = model.resize_token_embeddings(len(tokenizer))  # Update the model embeddings with the new vocabulary size

>>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]
>>> encoded_choices = [tokenizer.encode(s) for s in choices]
>>> cls_token_location = [tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices]

>>> input_ids = torch.tensor(encoded_choices).unsqueeze(0)  # Batch size: 1, number of choices: 2
>>> mc_token_ids = torch.tensor([cls_token_location])       # Batch size: 1

>>> outputs = model(input_ids, mc_token_ids=mc_token_ids)
>>> lm_logits = outputs.logits
>>> mc_logits = outputs.mc_logits

GPT2ForSequenceClassification

class transformers.GPT2ForSequenceClassification(config)[source]

The GPT2 Model transformer with a sequence classification head on top (linear layer).

GPT2ForSequenceClassification uses the last token in order to do the classification, as other causal models (e.g. GPT-1) do.

Since it does classification on the last token, it requires knowing the position of the last token. If a pad_token_id is defined in the configuration, it finds the last token that is not a padding token in each row. If no pad_token_id is defined, it simply takes the last value in each row of the batch. Since it cannot guess the padding tokens when inputs_embeds are passed instead of input_ids, it does the same (takes the last value in each row of the batch).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters

config (GPT2Config) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

forward(input_ids=None, past_key_values=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)[source]

The forward method overrides the __call__ special method.

Note

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Parameters
  • input_ids (torch.LongTensor of shape (batch_size, input_ids_length)) – input_ids_length = sequence_length if past_key_values is None else past_key_values[0][0].shape[-2] (sequence_length of input past key value states). Indices of input sequence tokens in the vocabulary. If past_key_values is used, only input_ids that do not have their past calculated should be passed as input_ids. Indices can be obtained using GPT2Tokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. What are input IDs?
  • past_key_values (Tuple[Tuple[torch.Tensor]] of length config.n_layers) – Contains precomputed hidden-states (key and values in the attention blocks) as computed by the model (see past_key_values output below). Can be used to speed up sequential decoding. The input_ids which have their past given to this model should not be passed as input_ids as they have already been computed.
  • attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. What are attention masks?
  • token_type_ids (torch.LongTensor of shape (batch_size, input_ids_length), optional) – Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token. What are token type IDs?
  • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) – Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]. What are position IDs?
  • head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) – Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) – Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix. If past_key_values is used, optionally only the last inputs_embeds have to be input (see past_key_values).
  • use_cache (bool, optional) – If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • output_attentions (bool, optional) – Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) – Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) – Whether or not to return a ModelOutput instead of a plain tuple.
  • labels (torch.LongTensor of shape (batch_size,), optional) – Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss); if config.num_labels > 1 a classification loss is computed (Cross-Entropy).

Returns

A SequenceClassifierOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (GPT2Config) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Classification (or regression if config.num_labels==1) loss.
  • logits (torch.FloatTensor of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).
  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Return type

SequenceClassifierOutputWithPast or tuple(torch.FloatTensor)

Example:

>>> from transformers import GPT2Tokenizer, GPT2ForSequenceClassification
>>> import torch

>>> tokenizer = GPT2Tokenizer.from_pretrained('microsoft/DialogRPT-updown')
>>> model = GPT2ForSequenceClassification.from_pretrained('microsoft/DialogRPT-updown')

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
>>> outputs = model(**inputs, labels=labels)
>>> loss = outputs.loss
>>> logits = outputs.logits

GPT2ForTokenClassification

class transformers.GPT2ForTokenClassification(config)[source]

GPT2 Model with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters

config (GPT2Config) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

forward(input_ids=None, past_key_values=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)[source]

The forward method overrides the __call__ special method.

Note

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Parameters
  • input_ids (torch.LongTensor of shape (batch_size, input_ids_length)) – input_ids_length = sequence_length if past_key_values is None else past_key_values[0][0].shape[-2] (sequence_length of input past key value states). Indices of input sequence tokens in the vocabulary. If past_key_values is used, only input_ids that do not have their past calculated should be passed as input_ids. Indices can be obtained using GPT2Tokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. What are input IDs?
  • past_key_values (Tuple[Tuple[torch.Tensor]] of length config.n_layers) – Contains precomputed hidden-states (key and values in the attention blocks) as computed by the model (see past_key_values output below). Can be used to speed up sequential decoding. The input_ids which have their past given to this model should not be passed as input_ids as they have already been computed.
  • attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. What are attention masks?
  • token_type_ids (torch.LongTensor of shape (batch_size, input_ids_length), optional) – Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token. What are token type IDs?
  • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) – Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]. What are position IDs?
  • head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) – Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) – Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix. If past_key_values is used, optionally only the last inputs_embeds have to be input (see past_key_values).
  • use_cache (bool, optional) – If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • output_attentions (bool, optional) – Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) – Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) – Whether or not to return a ModelOutput instead of a plain tuple.
  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) – Labels for computing the token classification loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss); if config.num_labels > 1 a classification loss is computed (Cross-Entropy).

Returns

A TokenClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (GPT2Config) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Classification loss.
  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)) – Classification scores (before SoftMax).
  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
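The usage example for this class was cut off in the source page, so here is a minimal hedged sketch instead (my addition, not from the original reference). It loads the base 'gpt2' checkpoint, which means the token classification head is randomly initialized and would still need fine-tuning; the labels here are dummies just to show the call shape.

>>> # Sketch: token classification with a (not yet fine-tuned) GPT2ForTokenClassification head
>>> import torch
>>> from transformers import GPT2Tokenizer, GPT2ForTokenClassification

>>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
>>> model = GPT2ForTokenClassification.from_pretrained('gpt2')  # classification head is randomly initialized

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> labels = torch.zeros_like(inputs["input_ids"])              # dummy labels, one per token
>>> outputs = model(**inputs, labels=labels)
>>> loss = outputs.loss
>>> logits = outputs.logits                                     # (batch_size, sequence_length, config.num_labels)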

Source: https://huggingface.co/transformers/model_doc/gpt2.html
