qaihm-bot committed (verified)
Commit: 345102e
Parent(s): 1c5b794

See https://github.com/quic/ai-hub-models/releases/v0.45.0 for the changelog.

Files changed (1): README.md (+1 −10)
README.md CHANGED
@@ -33,22 +33,13 @@ More details on model performance across various devices can be found
  - Input sequence length for Prompt Processor: 128
  - Maximum context length: 4096
  - Precision: w4a16 + w8a16 (few layers)
- - Num of key-value heads: 8
- - Model-1 (Prompt Processor): Llama-PromptProcessor-Quantized
- - Prompt processor input: 128 tokens + position embeddings + attention mask + KV cache inputs
- - Prompt processor output: 128 output tokens + KV cache outputs
- - Model-2 (Token Generator): Llama-TokenGenerator-Quantized
- - Token generator input: 1 input token + position embeddings + attention mask + KV cache inputs
- - Token generator output: 1 output token + KV cache outputs
- - Use: Initiate conversation with the prompt processor, then use the token generator for subsequent iterations.
- - Minimum QNN SDK version required: 2.27.7
  - Language(s) supported: English.
  - TTFT: Time To First Token is the time it takes to generate the first response token. It is expressed as a range because it varies with prompt length: the lower bound is for a short prompt (up to 128 tokens, i.e., one iteration of the prompt processor) and the upper bound is for a prompt using the full context length (4096 tokens).
  - Response Rate: rate of response generation after the first response token.

  | Model | Precision | Device | Chipset | Target Runtime | Response Rate (tokens per second) | Time To First Token (range, seconds) |
  |---|---|---|---|---|---|---|
- | Llama-v3.1-8B-Instruct | w4a16 | Snapdragon 8 Elite Gen 5 QRD | Snapdragon® 8 Elite Gen5 Mobile | GENIE | 14.91757 | 0.109992 - 3.519753 |
+ | Llama-v3.1-8B-Instruct | w4a16 | Snapdragon 8 Elite Gen 5 QRD | Snapdragon® 8 Elite Gen 5 Mobile | GENIE | 14.91757 | 0.109992 - 3.519753 |
  | Llama-v3.1-8B-Instruct | w4a16 | Snapdragon 8 Elite QRD | Snapdragon® 8 Elite Mobile | GENIE | 13.0546 | 0.154517 - 4.944544 |
  | Llama-v3.1-8B-Instruct | w4a16 | Snapdragon X Elite CRD | Snapdragon® X Elite | GENIE | 9.357 | 0.207727 - 6.647264 |
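The removed bullets describe a two-stage pipeline: a prompt processor that prefills the KV cache in 128-token chunks, and a token generator that then emits one token per step. That chunked prefill is also why TTFT is reported as a range (one iteration for a short prompt, 4096 / 128 = 32 iterations at full context). A minimal sketch of the control flow, where `prompt_processor` and `token_generator` are illustrative stand-ins for the two quantized models, not the actual Genie API:

```python
import math

PP_SEQ_LEN = 128    # input sequence length of the prompt processor
MAX_CONTEXT = 4096  # maximum context length

def generate(prompt_tokens, max_new_tokens, prompt_processor, token_generator):
    """Two-stage decode: chunked prefill, then one token per step.

    `prompt_processor` and `token_generator` stand in for the two model
    stages; their call signatures here are assumptions for illustration.
    Each call returns (next_token, updated_kv_cache).
    """
    if len(prompt_tokens) + max_new_tokens > MAX_CONTEXT:
        raise ValueError("prompt + response would exceed the context window")

    kv_cache = None
    next_token = None
    # Prefill: feed the prompt in 128-token chunks, carrying the KV cache
    # forward. A full-context prompt takes 4096 / 128 = 32 iterations,
    # which is the upper bound of the TTFT range.
    for i in range(math.ceil(len(prompt_tokens) / PP_SEQ_LEN)):
        chunk = prompt_tokens[i * PP_SEQ_LEN:(i + 1) * PP_SEQ_LEN]
        next_token, kv_cache = prompt_processor(chunk, kv_cache)

    response = [next_token]  # first response token: TTFT is measured here
    # Decode: the token generator consumes one token and emits one token,
    # reusing the KV cache built during prefill.
    while len(response) < max_new_tokens:
        next_token, kv_cache = token_generator(response[-1], kv_cache)
        response.append(next_token)
    return response
```

The Response Rate column in the table above corresponds to the per-step throughput of the decode loop, while the TTFT column corresponds to the prefill loop.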