An Android library that provides a port of sentence-transformers, used to generate sentence embeddings (fixed-size vectors for text/sentences).
- Add support for 16 KB page-size Android devices by updating the project NDK version to r28b
- Modify the Gradle scripts of the `sentence_embeddings` and `model2vec` modules to publish the AARs as packages on Maven Central
- Move the Rust source code from the `libs` branch to the `main` branch: we now use the rust-android-plugin to initiate `cargo build` from Gradle
- Remove Git LFS: the ONNX models present in `app/src/main/assets` have been removed from the repository. Instead, `app/build.gradle.kts` downloads the models and tokenizer configs from HuggingFace using the `download_model.sh` shell script.
- Add Model2Vec: Model2Vec provides static sentence embeddings through a fast lookup
- Remove JitPack: a GitHub CI script now builds AARs for the `model2vec` and `sentence_embeddings` Gradle modules that can be included in other projects
- Along with `token_ids` and `attention_mask`, the native library now also returns `token_type_ids` to support additional models like `bge-small-en-v1.5` (issue #3)
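The fast-lookup idea behind Model2Vec can be illustrated with a small sketch: every token maps directly to a precomputed static vector, and the sentence embedding is simply the average of the token vectors, so no transformer forward pass is needed at inference time. The tiny vocabulary below is hypothetical and for illustration only, not the library's actual implementation:

```python
import numpy as np

# Hypothetical precomputed static token embeddings (token -> vector).
# A real Model2Vec model ships a large vocabulary of such vectors.
vocab = {
    "delhi": np.array([0.9, 0.1]),
    "city": np.array([0.8, 0.2]),
    "apple": np.array([0.1, 0.9]),
}

def embed(sentence: str) -> np.ndarray:
    # Fast lookup: no model inference, just averaging the token vectors
    vectors = [vocab[tok] for tok in sentence.lower().split() if tok in vocab]
    return np.mean(vectors, axis=0)

print(embed("delhi city"))  # average of the "delhi" and "city" vectors
```

This is why Model2Vec encoding is orders of magnitude faster than running a transformer: the per-token cost is a dictionary lookup.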
To add more models, refer to the Adding New Models section.

Include the following in your `build.gradle` script:

```groovy
dependencies {
    // ... other packages
    // To use sentence-embeddings
    implementation 'io.gitlab.shubham0204:sentence-embeddings:v6'
    // To also use model2vec
    implementation 'io.gitlab.shubham0204:model2vec:v6'
}
```
The AARs for the `sentence_embeddings` and `model2vec` modules are also available for download from the Releases. Add the AARs to the `app/libs` directory and then, in `app/build.gradle.kts`:

```kotlin
dependencies {
    // ...
    // Add one or both of them as needed
    implementation(file("libs/sentence_embeddings.aar"))
    implementation(file("libs/model2vec.aar"))
    // ...
}
```
- Set up Android NDK version r27c

  ```shell
  # Using the nttld/setup-ndk action
  # Example manual equivalent:
  wget https://dl.google.com/android/repository/android-ndk-r27c-linux.zip
  unzip android-ndk-r27c-linux.zip
  export ANDROID_NDK_HOME=/path/to/android-ndk-r27c
  ```

- Install Rust targets for Android

  ```shell
  rustup target add aarch64-linux-android armv7-linux-androideabi i686-linux-android x86_64-linux-android
  ```

- Build the Rust code

  ```shell
  ./gradlew cargoBuild --stacktrace
  ```

- Build the AAR for the `sentence_embeddings` module

  ```shell
  ./gradlew :sentence_embeddings:assembleRelease --stacktrace
  ```

- Build the AAR for the `model2vec` module

  ```shell
  ./gradlew :model2vec:assembleRelease --stacktrace
  ```

- Build the APK for the `app` module

  ```shell
  ./gradlew :app:assembleRelease --stacktrace
  ```

- Build the APK for the `app-model2vec` module

  ```shell
  ./gradlew :app-model2vec:assembleRelease --stacktrace
  ```
The library provides a `SentenceEmbedding` class with `init` and `encode` suspend functions that initialize the model and generate the sentence embedding, respectively. The `init` function takes two mandatory arguments, `modelFilepath` and `tokenizerBytes`.
```kotlin
import com.ml.shubham0204.sentence_embeddings.SentenceEmbedding
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import java.io.File

val sentenceEmbedding = SentenceEmbedding()

// Download the model and store it in the app's internal storage
// (OR) copy the model from the assets folder (see the app module in the repo)
val modelFile = File(filesDir, "model.onnx")
val tokenizerFile = File(filesDir, "tokenizer.json")
val tokenizerBytes = tokenizerFile.readBytes()

CoroutineScope(Dispatchers.IO).launch {
    sentenceEmbedding.init(
        modelFilepath = modelFile.absolutePath,
        tokenizerBytes = tokenizerBytes,
        useTokenTypeIds = false,
        outputTensorName = "sentence_embedding",
        useFP16 = false,
        useXNNPack = false
    )
}
```
Once the `init` function completes its execution, we can call the `encode` function to transform the given sentence into an embedding:
```kotlin
CoroutineScope(Dispatchers.IO).launch {
    val embedding: FloatArray = sentenceEmbedding.encode("Delhi has a population 32 million")
    // Use contentToString() to print the array's elements instead of its reference
    println("Embedding: ${embedding.contentToString()}")
    println("Embedding size: ${embedding.size}")
}
```
The embeddings are vectors whose relative similarity can be computed by measuring the cosine of the angle between the vectors, also termed cosine similarity.

**Tip:** Here's an excellent blog to understand cosine similarity.
```kotlin
import kotlin.math.pow
import kotlin.math.sqrt

// Returns the cosine similarity between the two vectors
private fun cosineDistance(
    x1: FloatArray,
    x2: FloatArray
): Float {
    var mag1 = 0.0f
    var mag2 = 0.0f
    var product = 0.0f
    for (i in x1.indices) {
        mag1 += x1[i].pow(2)
        mag2 += x2[i].pow(2)
        product += x1[i] * x2[i]
    }
    mag1 = sqrt(mag1)
    mag2 = sqrt(mag2)
    return product / (mag1 * mag2)
}
```
```kotlin
CoroutineScope(Dispatchers.IO).launch {
    val e1: FloatArray = sentenceEmbedding.encode("Delhi has a population 32 million")
    val e2: FloatArray = sentenceEmbedding.encode("What is the population of Delhi?")
    val e3: FloatArray =
        sentenceEmbedding.encode("Cities with a population greater than 4 million are termed as metro cities")
    val d12 = cosineDistance(e1, e2)
    val d13 = cosineDistance(e1, e3)
    println("Similarity between e1 and e2: $d12")
    println("Similarity between e1 and e3: $d13")
}
```
We demonstrate how the `snowflake-arctic-embed-s` model can be added to the sample application present in the `app` module.
- Download the `model.onnx` and `tokenizer.json` files from the HF `snowflake-arctic-embed-s` repository.
- Create a new sub-directory in `app/src/main/assets` named `snowflake-arctic-embed-s`, then copy the two files to the sub-directory.
- In `Config.kt`, add a new entry to the `Model` enum and a new branch in `getModelConfig` corresponding to the new model entry added in the enum:
```kotlin
enum class Model {
    ALL_MINILM_L6_V2,
    BGE_SMALL_EN_V1_5,
    SNOWFLAKE_ARCTIC_EMBED_S // Add the new entry
}

fun getModelConfig(model: Model): ModelConfig {
    return when (model) {
        Model.ALL_MINILM_L6_V2 -> ModelConfig(
            modelName = "all-minilm-l6-v2",
            modelAssetsFilepath = "all-minilm-l6-v2/model.onnx",
            tokenizerAssetsFilepath = "all-minilm-l6-v2/tokenizer.json",
            useTokenTypeIds = false,
            outputTensorName = "sentence_embedding"
        )
        Model.BGE_SMALL_EN_V1_5 -> ModelConfig(
            modelName = "bge-small-en-v1.5",
            modelAssetsFilepath = "bge-small-en-v1_5/model.onnx",
            tokenizerAssetsFilepath = "bge-small-en-v1_5/tokenizer.json",
            useTokenTypeIds = true,
            outputTensorName = "last_hidden_state"
        )
        // Add a new branch for the model
        Model.SNOWFLAKE_ARCTIC_EMBED_S -> ModelConfig(
            modelName = "snowflake-arctic-embed-s",
            modelAssetsFilepath = "snowflake-arctic-embed-s/model.onnx",
            tokenizerAssetsFilepath = "snowflake-arctic-embed-s/tokenizer.json",
            useTokenTypeIds = true,
            outputTensorName = "last_hidden_state"
        )
    }
}
```
- To determine the values for `useTokenTypeIds` and `outputTensorName`, open the model with Netron or load the model in Python with `onnxruntime`, and check the names of the input and output tensors. With Netron, check whether `token_type_ids` is the name of an input tensor, and set the value of `useTokenTypeIds` accordingly while creating an instance of `ModelConfig`. For `outputTensorName`, choose the name of the output tensor which provides the embedding; for the `snowflake-arctic-embed-s` model, the name of that output tensor is `last_hidden_state`.
The same information can be printed to the console with the following Python snippet using the `onnxruntime` package:

```python
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")

print("Inputs:")
print([t.shape for t in session.get_inputs()])
print([t.type for t in session.get_inputs()])
print([t.name for t in session.get_inputs()])

print("Outputs:")
print([t.shape for t in session.get_outputs()])
print([t.type for t in session.get_outputs()])
print([t.name for t in session.get_outputs()])
```
- Run the app on the test device.
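As background on the `outputTensorName` choice: when a model's output is `last_hidden_state`, it emits one vector per token rather than a single sentence vector, and a sentence embedding is then typically derived by attention-masked mean pooling over the token vectors. The NumPy sketch below illustrates that pooling step; it is an illustrative sketch of the standard technique, not the library's exact implementation:

```python
import numpy as np

def mean_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the per-token vectors, ignoring padding positions.

    last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    """
    mask = attention_mask[:, :, None].astype(np.float32)  # broadcast over the hidden dim
    summed = (last_hidden_state * mask).sum(axis=1)       # sum of unmasked token vectors
    counts = np.clip(mask.sum(axis=1), 1e-9, None)        # number of unmasked tokens
    return summed / counts

# Toy example: 1 sentence, 3 token slots (the last one is padding), hidden size 2
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(hidden, mask))  # averages the two unmasked token vectors -> [[2. 3.]]
```

Models exporting a `sentence_embedding` tensor have this pooling baked into the ONNX graph, which is why no pooling configuration is needed for them.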