qaihm-bot committed (verified)
Commit: 345102e
Parent(s): 1c5b794

See https://github.com/quic/ai-hub-models/releases/v0.45.0 for the changelog.

Files changed (1): README.md (+1 −10)
README.md CHANGED
@@ -33,22 +33,13 @@ More details on model performance across various devices can be found
  - Input sequence length for Prompt Processor: 128
  - Maximum context length: 4096
  - Precision: w4a16 + w8a16 (few layers)
- - Num of key-value heads: 8
- - Model-1 (Prompt Processor): Llama-PromptProcessor-Quantized
- - Prompt processor input: 128 tokens + position embeddings + attention mask + KV cache inputs
- - Prompt processor output: 128 output tokens + KV cache outputs
- - Model-2 (Token Generator): Llama-TokenGenerator-Quantized
- - Token generator input: 1 input token + position embeddings + attention mask + KV cache inputs
- - Token generator output: 1 output token + KV cache outputs
- - Use: Initiate conversation with the prompt processor, then use the token generator for subsequent iterations.
- - Minimum QNN SDK version required: 2.27.7
  - Language(s) supported: English.
  - TTFT: Time To First Token is the time it takes to generate the first response token. It is expressed as a range because it varies with prompt length: the lower bound is for a short prompt (up to 128 tokens, i.e., one iteration of the prompt processor) and the upper bound is for a prompt using the full context length (4096 tokens).
  - Response Rate: rate of response generation after the first response token.

  | Model | Precision | Device | Chipset | Target Runtime | Response Rate (tokens per second) | Time To First Token (range, seconds) |
  |---|---|---|---|---|---|---|
- | Llama-v3.1-8B-Instruct | w4a16 | Snapdragon 8 Elite Gen 5 QRD | Snapdragon® 8 Elite Gen5 Mobile | GENIE | 14.91757 | 0.109992 - 3.519753 |
+ | Llama-v3.1-8B-Instruct | w4a16 | Snapdragon 8 Elite Gen 5 QRD | Snapdragon® 8 Elite Gen 5 Mobile | GENIE | 14.91757 | 0.109992 - 3.519753 |
  | Llama-v3.1-8B-Instruct | w4a16 | Snapdragon 8 Elite QRD | Snapdragon® 8 Elite Mobile | GENIE | 13.0546 | 0.154517 - 4.944544 |
  | Llama-v3.1-8B-Instruct | w4a16 | Snapdragon X Elite CRD | Snapdragon® X Elite | GENIE | 9.357 | 0.207727 - 6.647264 |
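The removed bullets describe a two-stage pipeline: a prompt processor that prefills the KV cache in 128-token chunks, and a token generator that then emits one token per step. That chunked prefill is also why TTFT is reported as a range (one iteration for a short prompt, 4096 / 128 = 32 iterations at full context). A minimal sketch of the control flow, where `prompt_processor` and `token_generator` are illustrative stand-ins for the two quantized models, not the actual Genie API:

```python
import math

PP_SEQ_LEN = 128    # input sequence length of the prompt processor
MAX_CONTEXT = 4096  # maximum context length

def generate(prompt_tokens, max_new_tokens, prompt_processor, token_generator):
    """Two-stage decode: chunked prefill, then one token per step.

    `prompt_processor` and `token_generator` stand in for the two model
    stages; their call signatures here are assumptions for illustration.
    Each call returns (next_token, updated_kv_cache).
    """
    if len(prompt_tokens) + max_new_tokens > MAX_CONTEXT:
        raise ValueError("prompt + response would exceed the context window")

    kv_cache = None
    next_token = None
    # Prefill: feed the prompt in 128-token chunks, carrying the KV cache
    # forward. A full-context prompt takes 4096 / 128 = 32 iterations,
    # which is the upper bound of the TTFT range.
    for i in range(math.ceil(len(prompt_tokens) / PP_SEQ_LEN)):
        chunk = prompt_tokens[i * PP_SEQ_LEN:(i + 1) * PP_SEQ_LEN]
        next_token, kv_cache = prompt_processor(chunk, kv_cache)

    response = [next_token]  # first response token: TTFT is measured here
    # Decode: the token generator consumes one token and emits one token,
    # reusing the KV cache built during prefill.
    while len(response) < max_new_tokens:
        next_token, kv_cache = token_generator(response[-1], kv_cache)
        response.append(next_token)
    return response
```

The Response Rate column in the table above corresponds to the per-step throughput of the decode loop, while the TTFT column corresponds to the prefill loop.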