Lab session - Locality Sensitive Hashing (LSH) with Spark

(corresponds to 1 lab session)

The goal of this lab session is to familiarize you with the use of locality-sensitive hashing (LSH) methods for similarity search and similarity join on the Spark distributed computing platform.

Note that, for the time being, the implementation of the LSH functionality in Spark is not complete: only OR amplification (the use of several independent hash tables) is available, not yet AND amplification (the combination of several elementary LSH functions within each hash table).
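The effect of OR amplification can be made concrete with a short computation (illustrative Python, not Spark code): if a single elementary LSH function makes two vectors collide with probability p, then with b independent tables they become candidates as soon as they collide in at least one table.

```python
def or_amplified_collision_probability(p: float, b: int) -> float:
    """Probability that two vectors collide in at least one of b
    independent hash tables, each using one elementary LSH function
    with collision probability p (OR amplification)."""
    return 1.0 - (1.0 - p) ** b

# more tables -> higher candidate probability, for similar
# as well as for dissimilar pairs
for b in (1, 3, 10):
    print(b, or_amplified_collision_probability(0.2, b))
```

Increasing the number of tables therefore raises the chance of finding the true neighbors, but also the number of false candidates that must be filtered afterwards.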

Setting up the working environment and downloading the data

Warning: the commands following %%bash are to be entered in a Linux (or macOS) terminal window. If you work directly on Windows you must perform the equivalent operations.

If you already retrieved some of these data (such as Base10000.zip) during previous lab sessions, there is no need to download them again; you can reuse them.

%%bash
mkdir -p tplsh tplsh/rqref
cd tplsh
# retrieve the data
wget http://cedric.cnam.fr/~ferecatu/NFE205/Base10000.zip
wget http://cedric.cnam.fr/~crucianm/src/ResNetDescriptors
unzip Base10000.zip
cd rqref
# retrieve the reference answers for two queries
wget http://cedric.cnam.fr/~crucianm/src/181081_resnet
wget http://cedric.cnam.fr/~crucianm/src/415008_resnet

Launching PySpark

Open a new terminal window, move to the tplsh directory (e.g. with cd tplsh if you are in its parent directory) and launch PySpark (the Python interface to Spark):

%%bash
pyspark

If the pyspark command is not found, locate where Spark is installed on your system (e.g. with the Linux command whereis spark, or with echo $SPARK_HOME if the $SPARK_HOME environment variable has been properly set); pyspark is in the bin subdirectory of the Spark directory.

Importing the required functionality

The functionality related to vector manipulation, explicit access to DataFrame columns and the use of LSH must be imported into the Spark session:

from pyspark.ml.feature import BucketedRandomProjectionLSH, VectorAssembler
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col

Reading the data

The data are descriptors produced by a ResNet-type neural network. Each line of the ResNetDescriptors file contains the name of an image file followed by the 512 real values that make up the descriptor computed by ResNet. Look at the Spark documentation on reading CSV files to better understand the use of the read function.

The "assembler" groups the 512 columns obtained by reading the components of the descriptor vector into a single DataFrame column (which contains vectors of dimension 512).

dfResnetData = spark.read.load("ResNetDescriptors", \
                     format="csv", sep=" ", inferSchema="true", header="false")
# display the DataFrame schema
dfResnetData.printSchema()

# group the components of the descriptor vector into a single column
assembler = VectorAssembler( \
    inputCols=["_c"+str(i) for i in range(1,513)], \
    outputCol="features")

dfResnetFeatures = assembler.transform(dfResnetData)
dfResnetFeatures.printSchema()

Building the hash functions (LSH)

Examine the Spark documentation on the LSH functionality to better understand the definition of the hash functions and the use of these features.

bRP = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes", \
                                  bucketLength=10.0, numHashTables=3)
lshModel = bRP.fit(dfResnetFeatures)
# what are the parameters of an LSH model?
lshModel.params

Building the hash tables from the database contents

The hash functions are applied to the database contents (the DataFrame dfResnetFeatures) to fill the hash tables. Each hash table comes from a single elementary LSH function (AND amplification is not yet implemented), and several independent hash tables (how many?) are used. For the features vectors, the concatenation of the hash values corresponding to the different tables is found in the hashes column.

dfRFtransformed = lshModel.transform(dfResnetFeatures)
dfRFtransformed.printSchema()
dfRFtransformed.select("hashes").show()
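To see what each elementary hash function computes, here is a standalone Python sketch (independent of Spark's actual implementation): the vector is projected onto a random unit direction and the projection axis is divided into buckets of width bucketLength, so that nearby vectors tend to fall into the same bucket.

```python
import math
import random

def make_hash_function(dim: int, bucket_length: float, seed: int = 0):
    """Build one elementary bucketed-random-projection hash function:
    h(x) = floor((x . v) / bucket_length) for a random unit vector v."""
    rng = random.Random(seed)
    # random Gaussian direction, normalized to unit length
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]

    def h(x):
        projection = sum(a * b for a, b in zip(x, v))
        return math.floor(projection / bucket_length)

    return h

h = make_hash_function(dim=4, bucket_length=10.0, seed=42)
# two close vectors usually share a bucket; a distant one usually does not
print(h([1.0, 2.0, 3.0, 4.0]), h([1.1, 2.0, 3.0, 4.1]), h([100.0, -50.0, 80.0, -20.0]))
```

A larger bucketLength merges more vectors into the same bucket (more candidates, fewer misses); a smaller one separates them more aggressively.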

Similarity search with LSH

We use two queries for the similarity search. For each query we return the 200 nearest neighbors. These two queries were already processed, with exhaustive searches for the 200 nearest neighbors, in earlier lab sessions.

query181081 = Vectors.dense([0.8251823,0.91035753,0.09495628,1.4742385,0.47077715,0.72143084,0.2326101,1.9182028,1.602904,3.2716675,0.40041175,0.51895624,1.3964912,0.18736123,0.007291014,0.012191964,0.06572782,0.74409735,0.06422359,1.0140035,0.2881519,0.8197073,0.14281192,1.568304,0.14897016,0.08950109,1.1391809,1.8378319,0.76647186,1.4851955,1.8594424,0.2714822,0.74353796,1.9263119,0.1745341,0.15795814,1.7530701,0.060984455,1.0865763,0.7848891,0.30945352,1.2342236,0.23379217,1.1462557,1.267692,0.0,0.7408862,0.09214586,0.18567479,0.12656471,0.021630852,0.955537,0.15970318,0.089150205,1.684812,0.5671266,2.5707386,0.98088783,0.90231615,1.3630576,0.14114097,0.4463689,1.0327705,0.705891,0.12658072,0.48323864,0.0045423703,0.039263647,2.8394363,0.15548013,2.7657537,0.37192833,0.1715558,1.3063127,0.24360695,0.046071324,0.25840083,0.9103306,0.08077669,1.0398829,0.24748424,2.989612,0.3707291,0.87966746,0.2937655,0.748611,0.98078614,0.16775414,2.2089913,0.3963593,0.093315706,1.7715102,0.06331823,2.422794,1.1966951,0.73674244,0.7618964,0.013256618,0.28509787,1.5148723,0.5090185,2.211134,0.7279506,0.060473062,1.4912022,0.032703873,3.8989232,0.029223189,1.845185,0.80920166,1.4031032,0.27063265,0.99870557,0.8388382,0.04311324,0.76778144,0.91299003,0.8400017,1.0793769,0.98688084,1.2363697,0.9989632,0.6078563,1.2884163,1.1453526,0.0,1.303353,0.96292967,1.2171977,0.38198313,0.13100779,0.7961989,0.24733944,0.83064073,0.82503337,0.34207195,0.3793606,0.77659965,1.3392318,1.6045268,0.39323044,1.6529642,0.47640368,0.9393637,0.5614654,0.46222636,0.5178515,1.3867533,0.24465682,0.19497502,0.22383977,0.27872846,4.3903155,0.106261075,1.5941149,0.91558516,1.6847173,0.20818281,0.26747233,0.5019687,1.8784739,0.15575074,0.12299685,4.160663,0.3325281,0.6760867,1.3686028,0.010640076,0.989378,0.16756538,0.6657893,0.2904656,0.91679025,0.11874362,1.856527,1.5264468,0.012967174,1.2890414,0.41133478,0.17087613,0.297161,0.17655171,1.0926753,0.6619003,1.9088618,1.9291886,0.6224267,0.8459799,0.48735592,1.6
45272,0.3632541,0.5489407,1.7378443,2.7794728,0.5258269,0.32889587,1.0048896,1.1523347,1.2362299,0.5641484,0.70279956,2.050117,0.46726876,1.3492838,0.73215735,0.14453815,0.66694397,0.23404352,0.54396266,1.5131798,0.19280887,0.22775684,0.120462,1.2465322,0.67533493,0.15239385,0.30662766,0.65457433,0.16800117,2.587592,0.0,1.2646242,0.22110419,0.7204129,0.22643027,0.5203356,0.4501484,0.44751263,0.78501004,0.8660302,1.4740266,0.024843605,0.53146344,0.03903717,0.046770047,2.0057867,4.166679,0.25391626,0.09374673,1.0774068,0.7967229,0.0,0.060316835,0.30588898,0.22259046,0.0066905566,0.74676704,2.1363032,0.4819522,0.84701777,1.373439,0.13733654,1.031783,0.6279997,1.6977885,0.550729,0.0052132397,0.9728978,0.27339548,0.31880563,0.353831,0.94810313,2.1523573,0.22927774,0.40312037,0.39954814,0.37173545,1.1860722,0.2620945,0.1612127,0.25837606,0.79436976,2.403545,0.7575483,0.4023421,0.43574995,2.5502756,1.3306122,0.010389698,0.22068267,0.47496215,0.42805463,0.6734205,0.35107052,0.097786106,0.31571063,0.00723672,0.5044784,1.8777064,1.2576957,0.38982293,0.7543205,0.11731417,1.8303478,1.0231175,0.46368614,0.33111092,0.80494094,1.4011557,0.16004916,1.5364176,0.4860146,0.8693424,0.51418024,0.39284527,0.1627969,0.65889996,0.95257366,0.02478139,0.56915337,0.7425185,0.5774835,0.6707734,0.24455981,0.4102656,0.93017966,1.0721194,0.02066911,0.8164203,0.14819643,1.6285479,0.2553866,0.053712346,0.5492008,0.35030517,0.50301754,0.32716075,0.22063479,0.22277375,0.89927036,0.45242822,0.70615697,0.74781585,0.040721733,0.42760938,0.31101212,0.14304462,0.29368323,0.5602965,1.3290967,0.14858778,1.8203995,0.4973353,2.1680048,0.42841884,0.34959275,0.5989331,1.0380756,0.6601667,0.82948405,0.4460639,0.3634462,0.06947707,0.55306345,0.0766763,0.60772634,0.3708077,0.8644778,0.24274044,0.5678104,0.5981921,0.76298964,0.46475554,0.36066583,1.4107724,0.5725621,0.23255217,0.23682031,1.6758224,0.8106787,0.99636894,0.08012349,0.28527904,0.041957702,1.0439851,1.8015053,0.3204132,0.025273405,0.286521,1.1788357,0.1
3538794,0.88002384,0.14288142,0.18516172,0.023365242,1.0207225,0.32568777,0.0,0.06257125,0.27742037,2.0107121,0.43492678,0.58344,0.1484052,0.0640609,1.2602185,1.0186493,0.9508419,0.089205205,0.02448311,0.077032864,2.8769832,0.5279616,0.16285227,0.8276714,0.741521,1.0101919,0.8734434,1.7919364,1.2205344,0.39959314,0.77236027,0.8495284,0.07038107,0.27102917,0.4469076,2.8866527,0.07179235,0.31700227,0.23275726,0.044392522,0.30589294,1.6301312,0.42484328,3.0158224,0.0337908,4.7898583,0.5846617,0.3214625,0.3125182,0.048198152,0.24820589,0.010746905,0.24370219,0.5718805,1.4477926,0.026398629,0.15819904,1.5892365,1.2872169,0.41291657,1.4439493,5.00339,0.016206438,0.6383031,0.059255827,1.631618,0.7806632,0.573832,1.2643466,0.15279052,0.026495773,0.33062527,0.12674388,0.6784105,2.3379147,0.085744515,3.034874,0.6813558,0.4370824,0.10580053,0.65050006,1.3085049,0.29220173,0.12848659,0.12956078,0.19968098,0.5995456,0.16255303,1.6808985,0.014256702,0.16226277,1.532355,0.23470142,0.07883693,0.51016384,0.81855696,0.80197203,0.22353828,0.76727724,0.6553169,0.9994703,0.020010041,0.45054847,1.9366511,0.09807539,0.7416031,0.41733074,0.5268382,0.34511793,0.21454094,1.2320131,0.96679515,0.51581913,0.297835,0.6849035,1.2363681,0.66226053,0.056623697,0.9213674,0.39946777,0.6344144,0.29380774,1.1244763,0.31652156,2.147081,0.38551554,0.9383485,0.7212884,0.47756374,0.81234723,0.3816825])
dfNN200query181081 = lshModel.approxNearestNeighbors(dfRFtransformed, query181081, 200)
# what are the nearest neighbors and the corresponding distances?
dfNN200query181081.select("_c0","distCol").show()

query415008 = Vectors.dense([0.45263392,0.8956163,0.87070477,0.14816852,0.52650183,1.1561016,0.6293764,0.9284468,0.38421723,4.6905303,0.86750746,0.053848274,0.2803663,0.6544485,0.8749838,3.4025502,0.42714152,0.9133633,0.17565206,0.9649471,0.1318799,3.7233472,0.20796484,0.97474825,0.3717961,1.9554434,1.8707739,3.3039918,0.99182826,0.27886763,1.7895261,1.491852,0.33849856,0.19214988,0.88428277,0.7720143,1.6419953,0.34475076,0.06494037,1.2440069,0.5226596,1.6443497,0.44598198,1.2996941,0.20559053,0.7820821,0.0,0.42952016,0.7608008,2.3894308,1.345929,0.01645541,1.3150162,0.2413582,0.34225136,0.8223587,2.4202332,2.5830383,0.058790546,1.6307446,2.0328014,3.0888891,0.22440873,1.5643606,2.3543608,0.5895053,0.002720334,0.12608899,0.50581914,0.8462622,0.0001264555,0.85995185,1.0312561,2.4007227,0.09775712,2.4886932,0.42553523,0.7466256,1.2594223,0.2238102,0.32247618,0.32852304,0.0,5.099753,0.1117684,2.5473087,0.26172626,1.2242079,1.3455564,0.00774833,0.6307779,2.023976,0.13854727,0.9647044,0.37276256,0.02528299,0.74835,1.2677828,0.0,0.010909967,2.5560832,1.2682337,0.11524897,0.47091496,0.5218562,0.87411904,1.0009542,1.5979035,0.10322508,0.5076439,0.0071715433,0.14258493,0.005567905,0.23197368,0.05281311,3.0220897,1.5834353,1.2366724,0.40156943,0.83163947,4.875465,0.0,0.737171,0.041035485,3.0698123,0.14297217,1.9236431,2.9314566,0.00917082,0.9888363,0.43763638,0.047162723,0.29775804,1.3054109,2.5959437,0.3400502,1.8285668,0.02136938,0.0,0.07793838,0.0014133605,0.12934381,0.3312592,0.7955034,0.3549969,1.3283514,1.7875737,1.1491022,1.4925776,0.3165809,0.5673069,1.3876493,2.2754574,1.5444372,1.6619296,0.2176583,1.8861767,0.15535067,0.041536473,0.42179826,0.6103136,0.65160865,2.7445345,3.4402654,0.030973628,1.0283326,2.2999504,0.25287843,0.2802209,1.054336,0.33884877,2.989419,3.2136192,3.5766551,0.11569985,0.17507003,0.35161313,0.60311687,0.45556974,0.40456232,2.9602437,0.07201544,1.2705411,0.28510845,0.039949536,0.10283875,1.2801944,0.9016231,0.78658295,0.8646321,0.0034093002,0.7
6072377,0.8297463,0.29100403,1.4552735,0.804808,0.21929853,0.16287188,0.09505152,0.21681665,0.7554719,0.020883337,1.717731,0.07955182,2.0607517,0.217322,0.64428794,1.0605676,0.25314844,6.383006,0.47646192,0.5124927,1.4541179,0.3532697,0.2513428,1.5772153,1.7872788,0.26646078,0.03598957,1.7086523,0.00768404,3.403396,2.0536177,2.0303457,0.6957179,0.012387057,1.1551886,0.0011552443,1.0009739,0.0028497053,0.12016697,0.19505928,1.4286954,0.5523408,0.15032008,1.948101,0.16595522,0.07704008,0.0068218997,1.5121472,0.03246734,0.3834128,2.3934941,1.2801512,0.57466936,0.13358139,1.4905454,0.21677655,0.012241403,0.6634224,1.9990369,1.4888697,0.046250198,0.567633,0.37149772,0.0695715,0.7751208,3.3232234,1.9529496,0.6433706,0.29582456,0.46578053,0.116250485,0.26288375,1.9813393,0.9387551,1.157241,1.215413,0.4552548,0.050808307,1.1613209,0.13422644,0.21431296,1.188897,0.044735916,2.1704319,0.81057507,0.07723919,1.8554053,0.5872315,1.7745929,0.62496847,1.4463258,0.014328592,0.18433736,0.35202718,0.78132397,0.46283117,0.010407659,4.443083,0.6800797,0.36474577,0.98243487,0.0,0.8563841,1.5238401,0.35596544,1.4687487,1.1605676,0.755615,1.5548942,2.861918,0.4937421,0.55141145,0.35684788,0.029336939,0.90690106,0.16248806,0.3778433,0.024810692,0.13464822,0.29201332,0.32607546,0.45219728,1.249203,0.68752927,0.02114641,1.9693781,0.068342045,0.3474129,1.0815232,0.32458854,0.417667,3.1885743,0.29306456,1.7536173,2.300651,2.1006317,0.2613018,1.7619803,0.014848229,1.9740887,1.2056358,1.3646326,1.5821574,0.4507597,0.48707905,0.04200224,0.12469209,1.1967199,0.8319932,0.28118965,0.44353834,0.70064133,0.011465581,0.17592621,0.15014814,0.32085174,0.11769223,1.0594741,0.047729336,0.35906988,1.2250327,1.3953635,2.1535811,0.04823087,1.8396713,0.9330752,2.931754,1.8205609,0.38561186,1.2825079,0.17464894,0.30233335,0.34293148,0.01875776,3.5774667,1.3687884,0.39101276,0.040731475,2.3296523,0.048506614,3.4258864,1.2192748,0.2334504,0.1541113,0.87802356,0.39543322,0.5851682,0.14774382,0.58431256,1.8559775,2
.437189,0.12203799,0.43733776,1.2441306,0.19521865,0.046307057,0.25530845,2.3219538,0.12728329,1.345146,0.51866275,0.1148046,0.22727555,0.2201568,0.18787695,0.36578515,0.3327393,0.3755282,2.7952793,0.1699241,1.1266867,0.29099792,0.0012813747,4.0054636,1.8603879,0.4337331,0.66320294,5.437162,0.9894746,0.04256366,0.23364277,0.5691044,0.32171965,2.577568,0.1098858,0.020045849,0.62123513,0.4086426,0.9725301,0.02237483,0.30959535,0.045249205,0.59393895,0.2213169,1.2741151,0.6631232,0.23609507,2.7268214,1.2567642,3.6897569,1.663225,0.82428616,0.256715,1.2060919,4.110202,1.4713094,0.15731384,0.003555506,0.5560543,0.6526428,0.6432671,0.6893146,0.19068211,3.681301,0.5673297,0.26445177,0.102776594,0.03640779,1.0631367,0.20791848,0.13886246,0.5782891,0.7237094,0.657786,0.23636226,0.6911904,0.40478846,0.09338616,0.032393206,0.08497512,1.9046504,0.23404187,1.1589246,0.2819047,0.1652394,1.1480924,0.82744086,0.11994312,0.89856035,0.13563877,0.94947875,0.8775872,0.19169709,0.5171593,0.09551654,0.6022481,0.0,0.5338754,0.47721133,1.3559333,0.75558114,0.12732002,2.094488,0.5249446,1.2903894,0.0004469834,0.57952213,0.044969417,0.6313561,0.5405882,0.6595507,2.0314913,0.0072759674,0.7821973,0.10841969,2.2669275,0.88126385,0.036794934,0.33831996,1.1073501,0.43316153,1.2839894,0.54020905,0.09020492,0.29618382,1.0281754,0.0036136385,0.24706824,1.655145,0.0])
dfNN200query415008 = lshModel.approxNearestNeighbors(dfRFtransformed, query415008, 200)
# what are the nearest neighbors and the corresponding distances?
dfNN200query415008.select("_c0","distCol").show()

Question:

Display the results obtained with LSH for these two queries as HTML pages. To do this, look at the Spark documentation on writing data.
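One possible approach (a sketch, assuming the results have been brought back to the driver, e.g. with rows = dfNN200query181081.select("_c0", "distCol").collect()) is to build the HTML page directly in Python; the image names and the res181081.html output name below are illustrative:

```python
def results_to_html(pairs, title="LSH results"):
    """Build a simple HTML page listing images with their distances.
    `pairs` is a list of (image file name, distance) tuples."""
    lines = ["<html><head><title>%s</title></head><body>" % title,
             "<h1>%s</h1>" % title]
    for name, dist in pairs:
        # each image name is assumed to be usable as an <img> source path
        lines.append('<p><img src="%s" height="120"> distance = %.4f</p>'
                     % (name, dist))
    lines.append("</body></html>")
    return "\n".join(lines)

# illustrative data; in the lab, use the pairs collected from the DataFrame
page = results_to_html([("181081.jpg", 0.0), ("202133.jpg", 12.3456)])
with open("res181081.html", "w") as f:
    f.write(page)
```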

Evaluation of the results

The results of these LSH searches are compared with the results obtained for the same queries by exhaustive searches, previously saved in the files 181081_resnet and 415008_resnet. The comparison is performed by an inner join, on the column containing the image name, between the DataFrame produced by the LSH search and the DataFrame read from the file.

dfResQuery181081 = spark.read.load("rqref/181081_resnet", \
                     format="csv", inferSchema="true", header="false")
# display the DataFrame schema
dfResQuery181081.printSchema()
dfResQuery181081.show()

dfNN200query181081.join(dfResQuery181081,dfNN200query181081._c0 == dfResQuery181081._c0,"inner").count()

dfResQuery415008 = spark.read.load("rqref/415008_resnet", \
                     format="csv", inferSchema="true", header="false")
# display the DataFrame schema
dfResQuery415008.printSchema()
dfResQuery415008.show()

dfNN200query415008.join(dfResQuery415008,dfNN200query415008._c0 == dfResQuery415008._c0,"inner").count()

Question:

Modify the value of bucketLength and examine the impact of each change on the results obtained.
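To quantify the impact of a parameter change, a convenient measure is the recall of the LSH search with respect to the exhaustive reference. The sketch below assumes the image-name lists have been collected from the two DataFrames (e.g. lsh_names = [r["_c0"] for r in dfNN200query181081.select("_c0").collect()]):

```python
def recall(lsh_names, exact_names):
    """Fraction of the exact nearest neighbors also found by LSH."""
    return len(set(lsh_names) & set(exact_names)) / len(set(exact_names))

# toy example: 3 of the 4 exact neighbors were recovered
print(recall(["a", "b", "c", "x"], ["a", "b", "c", "d"]))  # 0.75
```

Recomputing this value after each change of bucketLength makes the trend easy to read off.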

Question:

Modify the number of hash tables and examine the impact of each change on the results obtained.

Similarity join

To study the similarity join with LSH, we use two different datasets, namely the results returned by the two queries query181081 and query415008. The LSH similarity join is therefore performed between the results of these two queries.

dfLSHjoin = lshModel.approxSimilarityJoin(dfNN200query181081, dfNN200query415008, \
                                          1.5, distCol="EuclideanDistance")
dfLSHjoin.printSchema()
dfLSHjoin.select(col("datasetA._c0").alias("idA"), \
                 col("datasetB._c0").alias("idB"), \
                 col("EuclideanDistance")).show()
# "verification"
dfResQuery181081.join(dfResQuery415008,dfResQuery181081._c0 == dfResQuery415008._c0,"inner").count()
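In plain Python, on name lists collected from the two exhaustive result files, the inner-join count above amounts to the size of the intersection of the two result sets:

```python
def exact_join_count(names_a, names_b):
    """Number of image names present in both result lists."""
    return len(set(names_a) & set(names_b))

print(exact_join_count(["1.jpg", "2.jpg", "3.jpg"],
                       ["2.jpg", "3.jpg", "4.jpg"]))  # 2
```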

Question:

Explain the principle of the "verification".

Question:

Modify the value of bucketLength and the number of hash tables, and examine the impact of each change on the results obtained.