Travaux pratiques - Locality Sensitive Hashing (LSH) avec Spark¶
(correspond à 1 séance de TP)
Références externes utiles :
L’objectif de cette séance de TP est de vous familiariser avec l’utilisation des méthodes de hachage sensible à la similarité (LSH) pour la recherche et la jointure par similarité sur la plateforme de calcul distribué Spark.
Noter que pour le moment l’implémentation des fonctionalités LSH sur Spark n’est pas complète, seule étant disponible l’amplification OR (utilisation de plusieurs tables de hachage indépendantes) et pas encore l’amplification AND (combinaison de plusieurs fonctions LSH élémentaires dans chaque table de hachage).
Création de l’environnement de travail, téléchargement des données¶
Si vous avez déjà récupéré certaines de ces données (comme Base10000.zip
) lors des séances de TP précédentes il n’est pas nécessaire de les rechercher de nouveau, vous pouvez les réutiliser.
Attention : les commandes qui suivent %%bash
sont à entrer dans une fenêtre terminal linux (ou MacOS). Si vous travaillez directement sur Windows vous devez exécuter les manipulations équivalentes.
%%bash
mkdir -p tplsh tplsh/rqref
# recuperer les donnees
wget -nc -q http://cedric.cnam.fr/~ferecatu/NFE205/Base10000.zip -P tplsh/
wget -nc -q http://cedric.cnam.fr/~crucianm/src/ResNetDescriptors -P tplsh/
unzip -qq tplsh/Base10000.zip
# recuperer les reponses de reference de deux requetes
wget -nc http://cedric.cnam.fr/~crucianm/src/181081_resnet -P tplsh/rqref/
wget -nc http://cedric.cnam.fr/~crucianm/src/415008_resnet -P tplsh/rqref/
Lancement de PySpark¶
Ouvrez une nouvelle fenêtre terminal, positionnez-vous dans le répertoire tplsh
(par ex. avec cd tplsh
si vous êtes dans le répertoire parent) et lancez PySpark (l’interface de Spark en Python) :
%%bash
pyspark
Si la commande pyspark
n’est pas trouvée, il faut identifier l’endroit où est installé Spark sur votre système (par ex. avec la commande linux whereis spark
, ou avec echo $SPARK_HOME
si la variable système $SPARK_HOME
a été correctement initialisée) ; pyspark
sera dans le sous-répertoire /bin
du répertoire de Spark.
Importation des fonctionalités nécessaires¶
Les fonctionalités associées à la manipulation de vecteurs, à l’accès explicite à des colonnes de DataFrame et à l’utilisation de LSH doivent être importées dans la session Spark :
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler
Lecture des données¶
Les données sont des descripteurs obtenus par un réseau de neurones de type ResNet. Chaque ligne du fichier ResNetDescriptors
contient le nom d’un fichier image suivi de 512 valeur réelles qui constituent le descripteur obtenu par ResNet. Regardez la documentation Spark concernant la lecture de fichiers csv
pour mieux comprendre l’utilisation de la fonction de lecture.
L”« assembleur » sert à regrouper les 512 colonnes issues de la lecture des composantes du vecteur descripteur en une seule colonne de DataFrame (qui contient des vecteurs de dimension 512).
dfResnetData = spark.read.load("tplsh/ResNetDescriptors", \
format="csv", sep=" ", inferSchema="true", header="false")
# affichage schema Dataframe
#dfResnetData.printSchema()
# regroupement des composantes du vecteur descripteur
assembler = VectorAssembler( \
inputCols=["_c"+str(i) for i in range(1,513)], \
outputCol="features")
dfResnetFeatures = assembler.transform(dfResnetData)
#dfResnetFeatures.printSchema()
Construction des fonctions de hachage (LSH)¶
Examinez la documentation sur les fonctionalités LSH de Spark pour mieux comprendre la définition des fonctions de hachage et l’utilisation de ces fonctionalités.
bRP = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes", \
bucketLength=10.0, numHashTables=3)
lshModel = bRP.fit(dfResnetFeatures)
# quels sont les parametres d'un modele LSH
lshModel.params
Construction des tables de hachage à partir des données de la base¶
Les fonctions de hachage sont appliquées aux données de la base de données (DataFrame dfResnetFeatures
) pour remplir les tables de hachage. Chaque table de hachage est issue d’une seule fonction LSH élémentaire (l’amplification AND n’est pas encore implémentée), plusieurs tables de hachage indépendantes (combien ?) sont employées. La concaténation des valeurs de hash correspondant aux différentes tables se trouve, pour les vecteurs features
, dans la colonne hashes
.
dfRFtransformed = lshModel.transform(dfResnetFeatures)
#dfRFtransformed.printSchema()
dfRFtransformed.select("hashes").show()
Recherche par similarité avec LSH¶
Nous utilisons deux requêtes pour la recherche par similarité. Pour chaque requête nous retournons les 200 plus proches voisins. Ces deux requêtes ont déjà été traitées, avec des recherches exhaustives des 200 plus proches voisins, dans un TP image antérieur.
query181081 = Vectors.dense([0.8251823,0.91035753,0.09495628,1.4742385,0.47077715,0.72143084,0.2326101,1.9182028,1.602904,3.2716675,0.40041175,0.51895624,1.3964912,0.18736123,0.007291014,0.012191964,0.06572782,0.74409735,0.06422359,1.0140035,0.2881519,0.8197073,0.14281192,1.568304,0.14897016,0.08950109,1.1391809,1.8378319,0.76647186,1.4851955,1.8594424,0.2714822,0.74353796,1.9263119,0.1745341,0.15795814,1.7530701,0.060984455,1.0865763,0.7848891,0.30945352,1.2342236,0.23379217,1.1462557,1.267692,0.0,0.7408862,0.09214586,0.18567479,0.12656471,0.021630852,0.955537,0.15970318,0.089150205,1.684812,0.5671266,2.5707386,0.98088783,0.90231615,1.3630576,0.14114097,0.4463689,1.0327705,0.705891,0.12658072,0.48323864,0.0045423703,0.039263647,2.8394363,0.15548013,2.7657537,0.37192833,0.1715558,1.3063127,0.24360695,0.046071324,0.25840083,0.9103306,0.08077669,1.0398829,0.24748424,2.989612,0.3707291,0.87966746,0.2937655,0.748611,0.98078614,0.16775414,2.2089913,0.3963593,0.093315706,1.7715102,0.06331823,2.422794,1.1966951,0.73674244,0.7618964,0.013256618,0.28509787,1.5148723,0.5090185,2.211134,0.7279506,0.060473062,1.4912022,0.032703873,3.8989232,0.029223189,1.845185,0.80920166,1.4031032,0.27063265,0.99870557,0.8388382,0.04311324,0.76778144,0.91299003,0.8400017,1.0793769,0.98688084,1.2363697,0.9989632,0.6078563,1.2884163,1.1453526,0.0,1.303353,0.96292967,1.2171977,0.38198313,0.13100779,0.7961989,0.24733944,0.83064073,0.82503337,0.34207195,0.3793606,0.77659965,1.3392318,1.6045268,0.39323044,1.6529642,0.47640368,0.9393637,0.5614654,0.46222636,0.5178515,1.3867533,0.24465682,0.19497502,0.22383977,0.27872846,4.3903155,0.106261075,1.5941149,0.91558516,1.6847173,0.20818281,0.26747233,0.5019687,1.8784739,0.15575074,0.12299685,4.160663,0.3325281,0.6760867,1.3686028,0.010640076,0.989378,0.16756538,0.6657893,0.2904656,0.91679025,0.11874362,1.856527,1.5264468,0.012967174,1.2890414,0.41133478,0.17087613,0.297161,0.17655171,1.0926753,0.6619003,1.9088618,1.9291886,0.6224267,0.8459799,0.48735592,1.645272,0.3632541,0.5489407,1.7378443,2.7794728,0.5258269,0.32889587,1.0048896,1.1523347,1.2362299,0.5641484,0.70279956,2.050117,0.46726876,1.3492838,0.73215735,0.14453815,0.66694397,0.23404352,0.54396266,1.5131798,0.19280887,0.22775684,0.120462,1.2465322,0.67533493,0.15239385,0.30662766,0.65457433,0.16800117,2.587592,0.0,1.2646242,0.22110419,0.7204129,0.22643027,0.5203356,0.4501484,0.44751263,0.78501004,0.8660302,1.4740266,0.024843605,0.53146344,0.03903717,0.046770047,2.0057867,4.166679,0.25391626,0.09374673,1.0774068,0.7967229,0.0,0.060316835,0.30588898,0.22259046,0.0066905566,0.74676704,2.1363032,0.4819522,0.84701777,1.373439,0.13733654,1.031783,0.6279997,1.6977885,0.550729,0.0052132397,0.9728978,0.27339548,0.31880563,0.353831,0.94810313,2.1523573,0.22927774,0.40312037,0.39954814,0.37173545,1.1860722,0.2620945,0.1612127,0.25837606,0.79436976,2.403545,0.7575483,0.4023421,0.43574995,2.5502756,1.3306122,0.010389698,0.22068267,0.47496215,0.42805463,0.6734205,0.35107052,0.097786106,0.31571063,0.00723672,0.5044784,1.8777064,1.2576957,0.38982293,0.7543205,0.11731417,1.8303478,1.0231175,0.46368614,0.33111092,0.80494094,1.4011557,0.16004916,1.5364176,0.4860146,0.8693424,0.51418024,0.39284527,0.1627969,0.65889996,0.95257366,0.02478139,0.56915337,0.7425185,0.5774835,0.6707734,0.24455981,0.4102656,0.93017966,1.0721194,0.02066911,0.8164203,0.14819643,1.6285479,0.2553866,0.053712346,0.5492008,0.35030517,0.50301754,0.32716075,0.22063479,0.22277375,0.89927036,0.45242822,0.70615697,0.74781585,0.040721733,0.42760938,0.31101212,0.14304462,0.29368323,0.5602965,1.3290967,0.14858778,1.8203995,0.4973353,2.1680048,0.42841884,0.34959275,0.5989331,1.0380756,0.6601667,0.82948405,0.4460639,0.3634462,0.06947707,0.55306345,0.0766763,0.60772634,0.3708077,0.8644778,0.24274044,0.5678104,0.5981921,0.76298964,0.46475554,0.36066583,1.4107724,0.5725621,0.23255217,0.23682031,1.6758224,0.8106787,0.99636894,0.08012349,0.28527904,0.041957702,1.0439851,1.8015053,0.3204132,0.025273405,0.286521,1.1788357,0.13538794,0.88002384,0.14288142,0.18516172,0.023365242,1.0207225,0.32568777,0.0,0.06257125,0.27742037,2.0107121,0.43492678,0.58344,0.1484052,0.0640609,1.2602185,1.0186493,0.9508419,0.089205205,0.02448311,0.077032864,2.8769832,0.5279616,0.16285227,0.8276714,0.741521,1.0101919,0.8734434,1.7919364,1.2205344,0.39959314,0.77236027,0.8495284,0.07038107,0.27102917,0.4469076,2.8866527,0.07179235,0.31700227,0.23275726,0.044392522,0.30589294,1.6301312,0.42484328,3.0158224,0.0337908,4.7898583,0.5846617,0.3214625,0.3125182,0.048198152,0.24820589,0.010746905,0.24370219,0.5718805,1.4477926,0.026398629,0.15819904,1.5892365,1.2872169,0.41291657,1.4439493,5.00339,0.016206438,0.6383031,0.059255827,1.631618,0.7806632,0.573832,1.2643466,0.15279052,0.026495773,0.33062527,0.12674388,0.6784105,2.3379147,0.085744515,3.034874,0.6813558,0.4370824,0.10580053,0.65050006,1.3085049,0.29220173,0.12848659,0.12956078,0.19968098,0.5995456,0.16255303,1.6808985,0.014256702,0.16226277,1.532355,0.23470142,0.07883693,0.51016384,0.81855696,0.80197203,0.22353828,0.76727724,0.6553169,0.9994703,0.020010041,0.45054847,1.9366511,0.09807539,0.7416031,0.41733074,0.5268382,0.34511793,0.21454094,1.2320131,0.96679515,0.51581913,0.297835,0.6849035,1.2363681,0.66226053,0.056623697,0.9213674,0.39946777,0.6344144,0.29380774,1.1244763,0.31652156,2.147081,0.38551554,0.9383485,0.7212884,0.47756374,0.81234723,0.3816825])
dfNN200query181081 = lshModel.approxNearestNeighbors(dfRFtransformed, query181081, 200)
# quels sont les ppv et les distances correspondantes
dfNN200query181081.select("_c0","distCol").show()
query415008 = Vectors.dense([0.45263392,0.8956163,0.87070477,0.14816852,0.52650183,1.1561016,0.6293764,0.9284468,0.38421723,4.6905303,0.86750746,0.053848274,0.2803663,0.6544485,0.8749838,3.4025502,0.42714152,0.9133633,0.17565206,0.9649471,0.1318799,3.7233472,0.20796484,0.97474825,0.3717961,1.9554434,1.8707739,3.3039918,0.99182826,0.27886763,1.7895261,1.491852,0.33849856,0.19214988,0.88428277,0.7720143,1.6419953,0.34475076,0.06494037,1.2440069,0.5226596,1.6443497,0.44598198,1.2996941,0.20559053,0.7820821,0.0,0.42952016,0.7608008,2.3894308,1.345929,0.01645541,1.3150162,0.2413582,0.34225136,0.8223587,2.4202332,2.5830383,0.058790546,1.6307446,2.0328014,3.0888891,0.22440873,1.5643606,2.3543608,0.5895053,0.002720334,0.12608899,0.50581914,0.8462622,0.0001264555,0.85995185,1.0312561,2.4007227,0.09775712,2.4886932,0.42553523,0.7466256,1.2594223,0.2238102,0.32247618,0.32852304,0.0,5.099753,0.1117684,2.5473087,0.26172626,1.2242079,1.3455564,0.00774833,0.6307779,2.023976,0.13854727,0.9647044,0.37276256,0.02528299,0.74835,1.2677828,0.0,0.010909967,2.5560832,1.2682337,0.11524897,0.47091496,0.5218562,0.87411904,1.0009542,1.5979035,0.10322508,0.5076439,0.0071715433,0.14258493,0.005567905,0.23197368,0.05281311,3.0220897,1.5834353,1.2366724,0.40156943,0.83163947,4.875465,0.0,0.737171,0.041035485,3.0698123,0.14297217,1.9236431,2.9314566,0.00917082,0.9888363,0.43763638,0.047162723,0.29775804,1.3054109,2.5959437,0.3400502,1.8285668,0.02136938,0.0,0.07793838,0.0014133605,0.12934381,0.3312592,0.7955034,0.3549969,1.3283514,1.7875737,1.1491022,1.4925776,0.3165809,0.5673069,1.3876493,2.2754574,1.5444372,1.6619296,0.2176583,1.8861767,0.15535067,0.041536473,0.42179826,0.6103136,0.65160865,2.7445345,3.4402654,0.030973628,1.0283326,2.2999504,0.25287843,0.2802209,1.054336,0.33884877,2.989419,3.2136192,3.5766551,0.11569985,0.17507003,0.35161313,0.60311687,0.45556974,0.40456232,2.9602437,0.07201544,1.2705411,0.28510845,0.039949536,0.10283875,1.2801944,0.9016231,0.78658295,0.8646321,0.0034093002,0.76072377,0.8297463,0.29100403,1.4552735,0.804808,0.21929853,0.16287188,0.09505152,0.21681665,0.7554719,0.020883337,1.717731,0.07955182,2.0607517,0.217322,0.64428794,1.0605676,0.25314844,6.383006,0.47646192,0.5124927,1.4541179,0.3532697,0.2513428,1.5772153,1.7872788,0.26646078,0.03598957,1.7086523,0.00768404,3.403396,2.0536177,2.0303457,0.6957179,0.012387057,1.1551886,0.0011552443,1.0009739,0.0028497053,0.12016697,0.19505928,1.4286954,0.5523408,0.15032008,1.948101,0.16595522,0.07704008,0.0068218997,1.5121472,0.03246734,0.3834128,2.3934941,1.2801512,0.57466936,0.13358139,1.4905454,0.21677655,0.012241403,0.6634224,1.9990369,1.4888697,0.046250198,0.567633,0.37149772,0.0695715,0.7751208,3.3232234,1.9529496,0.6433706,0.29582456,0.46578053,0.116250485,0.26288375,1.9813393,0.9387551,1.157241,1.215413,0.4552548,0.050808307,1.1613209,0.13422644,0.21431296,1.188897,0.044735916,2.1704319,0.81057507,0.07723919,1.8554053,0.5872315,1.7745929,0.62496847,1.4463258,0.014328592,0.18433736,0.35202718,0.78132397,0.46283117,0.010407659,4.443083,0.6800797,0.36474577,0.98243487,0.0,0.8563841,1.5238401,0.35596544,1.4687487,1.1605676,0.755615,1.5548942,2.861918,0.4937421,0.55141145,0.35684788,0.029336939,0.90690106,0.16248806,0.3778433,0.024810692,0.13464822,0.29201332,0.32607546,0.45219728,1.249203,0.68752927,0.02114641,1.9693781,0.068342045,0.3474129,1.0815232,0.32458854,0.417667,3.1885743,0.29306456,1.7536173,2.300651,2.1006317,0.2613018,1.7619803,0.014848229,1.9740887,1.2056358,1.3646326,1.5821574,0.4507597,0.48707905,0.04200224,0.12469209,1.1967199,0.8319932,0.28118965,0.44353834,0.70064133,0.011465581,0.17592621,0.15014814,0.32085174,0.11769223,1.0594741,0.047729336,0.35906988,1.2250327,1.3953635,2.1535811,0.04823087,1.8396713,0.9330752,2.931754,1.8205609,0.38561186,1.2825079,0.17464894,0.30233335,0.34293148,0.01875776,3.5774667,1.3687884,0.39101276,0.040731475,2.3296523,0.048506614,3.4258864,1.2192748,0.2334504,0.1541113,0.87802356,0.39543322,0.5851682,0.14774382,0.58431256,1.8559775,2.437189,0.12203799,0.43733776,1.2441306,0.19521865,0.046307057,0.25530845,2.3219538,0.12728329,1.345146,0.51866275,0.1148046,0.22727555,0.2201568,0.18787695,0.36578515,0.3327393,0.3755282,2.7952793,0.1699241,1.1266867,0.29099792,0.0012813747,4.0054636,1.8603879,0.4337331,0.66320294,5.437162,0.9894746,0.04256366,0.23364277,0.5691044,0.32171965,2.577568,0.1098858,0.020045849,0.62123513,0.4086426,0.9725301,0.02237483,0.30959535,0.045249205,0.59393895,0.2213169,1.2741151,0.6631232,0.23609507,2.7268214,1.2567642,3.6897569,1.663225,0.82428616,0.256715,1.2060919,4.110202,1.4713094,0.15731384,0.003555506,0.5560543,0.6526428,0.6432671,0.6893146,0.19068211,3.681301,0.5673297,0.26445177,0.102776594,0.03640779,1.0631367,0.20791848,0.13886246,0.5782891,0.7237094,0.657786,0.23636226,0.6911904,0.40478846,0.09338616,0.032393206,0.08497512,1.9046504,0.23404187,1.1589246,0.2819047,0.1652394,1.1480924,0.82744086,0.11994312,0.89856035,0.13563877,0.94947875,0.8775872,0.19169709,0.5171593,0.09551654,0.6022481,0.0,0.5338754,0.47721133,1.3559333,0.75558114,0.12732002,2.094488,0.5249446,1.2903894,0.0004469834,0.57952213,0.044969417,0.6313561,0.5405882,0.6595507,2.0314913,0.0072759674,0.7821973,0.10841969,2.2669275,0.88126385,0.036794934,0.33831996,1.1073501,0.43316153,1.2839894,0.54020905,0.09020492,0.29618382,1.0281754,0.0036136385,0.24706824,1.655145,0.0])
dfNN200query415008 = lshModel.approxNearestNeighbors(dfRFtransformed, query415008, 200)
# quels sont les ppv et les distances correspondantes
dfNN200query415008.select("_c0","distCol").show()
Question :
Afficher les résultats obtenus avec LSH pour ces deux requêtes sous la forme de pages HTML. Pour cela il est nécessaire de regarder la documentation Spark concernant l’écriture des données.
Evaluation des résultats obtenus¶
Les résultats obtenus pour ces recherches avec LSH sont comparés aux résultats obtenus pour les mêmes requêtes par des recherches exhaustives, résultats sauvegardés préalablement dans les fichiers 181081_resnet
et 415008_resnet
. La comparaison est réalisée par une jointure entre le DataFrame résultant de la recherche avec LSH et le DataFrame lu à partir du fichier, sur la colonne qui contient le nom de l’image : si le nombre de noms identiques entre une recherche exhaustive et une recherche approximative est égale au nombre l’éléments obtenus par la recherche exhaustive, alors cela veut dire que la recherche approximative est, dans cet exemple, exacte (le rappel et la précision sont de 1).
dfResQuery181081 = spark.read.load("tplsh/rqref/181081_resnet", \
format="csv", inferSchema="true", header="false")
# affichage schema Dataframe
print("Schéma résultats recherche exhaustive Query181081 : ")
dfResQuery181081.printSchema()
print("Extrait résultats recherche exhaustive Query181081 : ")
dfResQuery181081.show()
print("Nb résultats retenus recherche exhaustive Query181081 : ", dfResQuery181081.count())
print("Nb résultats communs avec recherche LSH Query181081 : ", \
dfNN200query181081.join(dfResQuery181081,dfNN200query181081._c0 == dfResQuery181081._c0,"inner").count())
dfResQuery415008 = spark.read.load("tplsh/rqref/415008_resnet", \
format="csv", inferSchema="true", header="false")
# affichage
print("Schéma résultats recherche exhaustive Query415008 : ")
dfResQuery415008.printSchema()
print("Extrait résultats recherche exhaustive Query415008 : ")
dfResQuery415008.show()
print("Nb résultats retenus recherche exhaustive Query415008 : ", dfResQuery415008.count())
print("Nb résultats communs avec recherche LSH Query415008 : ", \
dfNN200query415008.join(dfResQuery415008,dfNN200query415008._c0 == dfResQuery415008._c0,"inner").count())
Question :
Modifier la valeur de bucketLength
et examiner l’impact de chaque modification sur les résultats obtenus.
Correction :
Le code pour la construction des fonctions de hachage et pour leur évaluation est le même que celui déjà présenté, seule change la valeur de bucketLength
, qui représente la taille moyenne des pages (buckets) de hachage. Lorsque la valeur de bucketLength
augmente, la probabilité de collision augmente ; augmente donc le nombre de faux positifs (données retournées qui ne sont pas assez similaires à la requête), en diminuant ainsi la précision (taux de vrais positifs), et diminue le nombre de faux négatifs (données similaires à la requête mais qui ne sont pas retournées), en augmentant ainsi le rappel.
Question :
Modifier le nombre de tables de hachage et examiner l’impact de chaque modification sur les résultats obtenus.
Correction :
Le code pour la construction des fonctions de hachage et pour leur évaluation est le même que celui déjà présenté, seule change la valeur de numHashTables
. La diminution du nombre de tables de hachage augmente le nombre de faux négatifs (données similaires à la requête mais qui ne sont pas retournées), donc diminue le rappel.
Jointure par similarité¶
Afin d’étudier la jointure par similarité avec LSH, nous nous servirons de deux bases différentes qui sont les résultats retournés par les deux requêtes LSH query181081
et query415008
: dfNN200query181081
et dfNN200query415008
. La jointure par similarité avec LSH est donc faite entre les résultats de ces deux requêtes.
dfLSHjoin = lshModel.approxSimilarityJoin(dfNN200query181081, dfNN200query415008, \
1.5, distCol="EuclideanDistance")
#dfLSHjoin.printSchema()
dfLSHjoin.select(col("datasetA._c0").alias("idA"), \
col("datasetB._c0").alias("idB"), \
col("EuclideanDistance")).show()
print("Nb paires images (très) proches entre les deux bases : ", dfLSHjoin.count())
# "verification"
print("Nb images communes entre les deux bases : ", \
dfNN200query181081.join(dfNN200query415008,dfNN200query181081._c0 == dfNN200query415008._c0,"inner").count())
Question :
Expliquer le principe de la « vérification » qui consiste à comparer à une jointure classique (par égalité).
Correction :
La « vérification » faite par une jointure classique (par égalité) entre les deux bases indique simplement combien d’images sont communes entre les deux bases (entre les résultats de la requête query181081
et ceux de la requête query415008
). La jointure par similarité devrait trouver autant de paires que la jointure classique si le seuil de distance est proche de 0 (ici nous avons utilisé 1.5) et plus de paires si le seuil de distance est plus élevé.
Question :
Modifier la valeur de bucketLength
et le nombre de tables de hachage, examiner l’impact de chaque modification sur les résultats obtenus.
Correction :
Le code pour la construction des fonctions de hachage et pour leur évaluation est le même que celui déjà présenté, seules changent les valeurs de bucketLength
et respectivement numHashTables
. Les résultats sont cohérents avec ce que nous avons constaté plus haut pour la modification de ces paramètres dans le cas des recherches par similarité.