|
146 | 146 | "Integrated Inference allows you to specify the creation of a Pinecone index with a specific Pinecone-hosted embedding model, which makes it easy to interact with the index. To learn more about Integrated Inference, including what other models are available, take a [look here](https://docs.pinecone.io/guides/get-started/overview#integrated-embedding).\n",
|
147 | 147 | "\n",
|
148 | 148 | "\n",
|
149 |
| - "Here, we specify a starter tier index with the [multilingual-e5-large](https://docs.pinecone.io/models/multilingual-e5-large) embedding model. We also specify a mapping for what field in our\n", |
150 |
| - "records we will embed with this model. Then, we grab the index we just created for embedding later." |
| 149 | + "Here, we specify a starter tier index with the [llama-text-embed-v2](https://docs.pinecone.io/models/llama-text-embed-v2) embedding model. We also specify a mapping for what field in our records we will embed with this model. Then, we grab the index we just created for embedding later.\n", |
| 150 | + "\n", |
| 151 | + "Want to instead embed a subset with multiple languages? Use the [multilingual-e5-large model](https://docs.pinecone.io/models/multilingual-e5-large) and simply specify this inplace of the previous model when creating an index." |
151 | 152 | ]
|
152 | 153 | },
|
153 | 154 | {
|
|
167 | 168 | "{'dimension': 1024,\n",
|
168 | 169 | " 'index_fullness': 0.0,\n",
|
169 | 170 | " 'metric': 'cosine',\n",
|
170 |
| - " 'namespaces': {'english-sentences': {'vector_count': 416}},\n", |
171 |
| - " 'total_vector_count': 416,\n", |
| 171 | + " 'namespaces': {},\n", |
| 172 | + " 'total_vector_count': 0,\n", |
172 | 173 | " 'vector_type': 'dense'}"
|
173 | 174 | ]
|
174 | 175 | },
|
|
187 | 188 | " cloud=\"aws\",\n",
|
188 | 189 | " region=\"us-east-1\",\n",
|
189 | 190 | " embed={\n",
|
190 |
| - " \"model\":\"multilingual-e5-large\",\n", |
| 191 | + " # Use this if you want to instead embed non-english or a multilingual subset of the data\n", |
| 192 | + " #\"model\":\"multilingual-e5-large\",\n", |
| 193 | + " \"model\": \"llama-text-embed-v2\",\n", |
191 | 194 | " \"field_map\":{\"text\": \"chunk_text\"}\n",
|
192 | 195 | " }\n",
|
193 | 196 | " )\n",
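For reference, the embed configuration this hunk switches over can be sketched as below. This is a minimal sketch: the `create_index_for_model` call is left commented out because it requires a Pinecone API key, and the index name shown is hypothetical.

```python
# Embed configuration matching the updated hunk: llama-text-embed-v2,
# mapping the index's "text" input to each record's "chunk_text" field.
embed_config = {
    # Swap in "multilingual-e5-large" here to embed a multilingual subset.
    "model": "llama-text-embed-v2",
    "field_map": {"text": "chunk_text"},
}

# With the Pinecone SDK, this config is passed to index creation
# (requires an API key; "example-index" is a hypothetical name):
# from pinecone import Pinecone
# pc = Pinecone(api_key="...")
# pc.create_index_for_model(
#     name="example-index", cloud="aws", region="us-east-1", embed=embed_config
# )
print(embed_config["model"])
```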
|
|
241 | 244 | },
|
242 | 245 | {
|
243 | 246 | "cell_type": "code",
|
244 |
| - "execution_count": 13, |
| 247 | + "execution_count": 6, |
245 | 248 | "metadata": {},
|
246 | 249 | "outputs": [
|
247 | 250 | {
|
|
255 | 258 | " {'en': 'I have to go to sleep.', 'es': 'Tengo que irme a dormir.'}]}"
|
256 | 259 | ]
|
257 | 260 | },
|
258 |
| - "execution_count": 13, |
| 261 | + "execution_count": 6, |
259 | 262 | "metadata": {},
|
260 | 263 | "output_type": "execute_result"
|
261 | 264 | }
|
|
266 | 269 | },
|
267 | 270 | {
|
268 | 271 | "cell_type": "code",
|
269 |
| - "execution_count": 6, |
| 272 | + "execution_count": 7, |
270 | 273 | "metadata": {},
|
271 |
| - "outputs": [], |
| 274 | + "outputs": [ |
| 275 | + { |
| 276 | + "name": "stderr", |
| 277 | + "output_type": "stream", |
| 278 | + "text": [ |
| 279 | + "Filter: 100%|██████████| 214127/214127 [00:00<00:00, 439387.27 examples/s]\n", |
| 280 | + "Flattening the indices: 100%|██████████| 416/416 [00:00<00:00, 237004.95 examples/s]\n" |
| 281 | + ] |
| 282 | + } |
| 283 | + ], |
272 | 284 | "source": [
|
273 | 285 | "keywords= [\"park\"]\n",
|
274 | 286 | "\n",
|
|
295 | 307 | " translation_pairs = translation_pairs.flatten()\n",
|
296 | 308 | " translation_pairs = translation_pairs.shuffle(seed=1)\n",
|
297 | 309 | "\n",
|
| 310 | + " # If you want to include the spanish subset, simply repeat the below steps with \"es\" instead of \"en\"\n", |
| 311 | + " # Be sure to create your index with multilingual-e5-large as well in this case!\n", |
298 | 312 | " english_sentences = translation_pairs.rename_column(\"translation.en\", \"text\").remove_columns(\"translation.es\")\n",
|
299 | 313 | "\n",
|
300 | 314 | " # add lang column to indicate embedding origin\n",
|
|
346 | 360 | },
|
347 | 361 | {
|
348 | 362 | "cell_type": "code",
|
349 |
| - "execution_count": 7, |
| 363 | + "execution_count": 8, |
350 | 364 | "metadata": {
|
351 | 365 | "colab": {
|
352 | 366 | "base_uri": "https://localhost:8080/",
|
|
373 | 387 | "name": "stderr",
|
374 | 388 | "output_type": "stream",
|
375 | 389 | "text": [
|
376 |
| - "Upserting records batch: 100%|██████████| 5/5 [00:04<00:00, 1.18it/s]\n" |
| 390 | + "Upserting records batch: 100%|██████████| 5/5 [00:02<00:00, 1.79it/s]\n" |
377 | 391 | ]
|
378 | 392 | }
|
379 | 393 | ],
|
|
420 | 434 | },
|
421 | 435 | {
|
422 | 436 | "cell_type": "code",
|
423 |
| - "execution_count": 8, |
| 437 | + "execution_count": 12, |
424 | 438 | "metadata": {},
|
425 | 439 | "outputs": [
|
426 | 440 | {
|
427 | 441 | "name": "stdout",
|
428 | 442 | "output_type": "stream",
|
429 | 443 | "text": [
|
430 |
| - "Sentence: I have the afternoon off today, so I plan to go to the park, sit under a tree and read a book. Semantic Similarity Score: 0.8618775606155396\n", |
| 444 | + "Sentence: I have the afternoon off today, so I plan to go to the park, sit under a tree and read a book. Semantic Similarity Score: 0.4675264060497284\n", |
431 | 445 | "\n",
|
432 |
| - "Sentence: Let's go to the park where it's not noisy. Semantic Similarity Score: 0.8588659167289734\n", |
| 446 | + "Sentence: I went to the park to play tennis. Semantic Similarity Score: 0.4330753684043884\n", |
433 | 447 | "\n",
|
434 |
| - "Sentence: Let's go to the park where it is not noisy. Semantic Similarity Score: 0.8588587045669556\n", |
| 448 | + "Sentence: I go to the park. Semantic Similarity Score: 0.4261631369590759\n", |
435 | 449 | "\n",
|
436 |
| - "Sentence: Let's go to the park where it isn't noisy. Semantic Similarity Score: 0.858812153339386\n", |
| 450 | + "Sentence: I went to the park yesterday. Semantic Similarity Score: 0.42239895462989807\n", |
437 | 451 | "\n",
|
438 |
| - "Sentence: I go to the park. Semantic Similarity Score: 0.858041524887085\n", |
| 452 | + "Sentence: I went to the park last Sunday. Semantic Similarity Score: 0.42069774866104126\n", |
439 | 453 | "\n",
|
440 |
| - "Sentence: I'll go to the park. Semantic Similarity Score: 0.8502914905548096\n", |
| 454 | + "Sentence: I like going for a walk in the park. Semantic Similarity Score: 0.41970351338386536\n", |
441 | 455 | "\n",
|
442 |
| - "Sentence: I like going for a walk in the park. Semantic Similarity Score: 0.847651481628418\n", |
| 456 | + "Sentence: I went to the park last Saturday. Semantic Similarity Score: 0.4103226661682129\n", |
443 | 457 | "\n",
|
444 |
| - "Sentence: Let's take a walk in the park. Semantic Similarity Score: 0.8399631977081299\n", |
| 458 | + "Sentence: I need light plates because today my family is going to eat lunch in the park. Semantic Similarity Score: 0.40211308002471924\n", |
445 | 459 | "\n",
|
446 |
| - "Sentence: Who wants to go to the park? Semantic Similarity Score: 0.8391842842102051\n", |
| 460 | + "Sentence: Linda went to the park to listen to music. Semantic Similarity Score: 0.4012303650379181\n", |
447 | 461 | "\n",
|
448 |
| - "Sentence: Do you like to walk in the park? Semantic Similarity Score: 0.8343247771263123\n", |
| 462 | + "Sentence: I'll go to the park. Semantic Similarity Score: 0.3996794819831848\n", |
449 | 463 | "\n"
|
450 | 464 | ]
|
451 | 465 | }
|
|
476 | 490 | },
|
477 | 491 | {
|
478 | 492 | "cell_type": "code",
|
479 |
| - "execution_count": 9, |
| 493 | + "execution_count": 13, |
480 | 494 | "metadata": {},
|
481 | 495 | "outputs": [
|
482 | 496 | {
|
483 | 497 | "name": "stdout",
|
484 | 498 | "output_type": "stream",
|
485 | 499 | "text": [
|
486 |
| - "Sentence: Where can I park? Semantic Similarity Score: 0.8843114376068115\n", |
| 500 | + "Sentence: I can't find a spot to park my spaceship. Semantic Similarity Score: 0.44190075993537903\n", |
487 | 501 | "\n",
|
488 |
| - "Sentence: Where can I park? Semantic Similarity Score: 0.8841626048088074\n", |
| 502 | + "Sentence: I can't find a spot to park my spaceship. Semantic Similarity Score: 0.44190075993537903\n", |
489 | 503 | "\n",
|
490 |
| - "Sentence: Where can we park? Semantic Similarity Score: 0.8696897625923157\n", |
| 504 | + "Sentence: There isn't anywhere else to park. Semantic Similarity Score: 0.4017431437969208\n", |
491 | 505 | "\n",
|
492 |
| - "Sentence: Where can I park my car? Semantic Similarity Score: 0.8663355112075806\n", |
| 506 | + "Sentence: I have to park my car here. Semantic Similarity Score: 0.3978813886642456\n", |
493 | 507 | "\n",
|
494 |
| - "Sentence: May I park here for a while? Semantic Similarity Score: 0.864980161190033\n", |
| 508 | + "Sentence: Where can I park? Semantic Similarity Score: 0.39125218987464905\n", |
495 | 509 | "\n",
|
496 |
| - "Sentence: I'm double-parked. Could you hurry it up? Semantic Similarity Score: 0.8629273176193237\n", |
| 510 | + "Sentence: Where can I park? Semantic Similarity Score: 0.39125218987464905\n", |
497 | 511 | "\n",
|
498 |
| - "Sentence: I'm double-parked. Could you hurry it up? Semantic Similarity Score: 0.8629273176193237\n", |
| 512 | + "Sentence: I am parking my car near the office. Semantic Similarity Score: 0.37668246030807495\n", |
499 | 513 | "\n",
|
500 |
| - "Sentence: I'm double-parked. Could you hurry it up? Semantic Similarity Score: 0.8626684546470642\n", |
| 514 | + "Sentence: May I park here for a while? Semantic Similarity Score: 0.3707844614982605\n", |
501 | 515 | "\n",
|
502 |
| - "Sentence: I'm double-parked. Could you hurry it up? Semantic Similarity Score: 0.8626684546470642\n", |
| 516 | + "Sentence: I parked on the left side of the street just in front of the school. Semantic Similarity Score: 0.37002164125442505\n", |
503 | 517 | "\n",
|
504 |
| - "Sentence: \"May I park here?\" \"No, you can't.\" Semantic Similarity Score: 0.8602052927017212\n", |
| 518 | + "Sentence: Where can I park my car? Semantic Similarity Score: 0.3609045743942261\n", |
505 | 519 | "\n"
|
506 | 520 | ]
|
507 | 521 | }
|
|
549 | 563 | },
|
550 | 564 | {
|
551 | 565 | "cell_type": "code",
|
552 |
| - "execution_count": 10, |
| 566 | + "execution_count": 11, |
553 | 567 | "metadata": {
|
554 | 568 | "id": "-cWdeKzhAtww"
|
555 | 569 | },
|
|