Progress on the tests and the display of results

Commit 5e033e19 authored 5 months ago by Ivan Pavlovich
--- a/TODO.md
+++ b/TODO.md
+- [ ] Check Ollama
+- [ ] Restructure the project
+- [ ] Test results for different text lengths
\ No newline at end of file
--- a/dataSources/PubMed/pubmedApi.py
+++ b/dataSources/PubMed/pubmedApi.py
+import sys
+import os
+# Add the parent directory to the module search path
+sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../")))
 from requests import get
 from parsers.xmlParser import parseXmlFile
 import json
-FILENAME = "pubmedData.xml"
+TMP_DIR_NAME = "./tmp"
+TMP_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), TMP_DIR_NAME))
+TMP_FILENAME = "pubmedData.xml"
 # term = "diabetes+type+1+OR+diabetes+type+2+OR+mental+health"
 # term = '"Diabetes+Mellitus"[Mesh:noexp]'
@@ -36,10 +44,10 @@ def getPubmedData(term, date_min, date_max, nb_items = -1, debug = False, store
    response = get(url)
-    with open(f"tmp/{FILENAME}", "w", encoding="utf-8") as file:
+    with open(f"{TMP_DIR}/{TMP_FILENAME}", "w+", encoding="utf-8") as file:
        file.write(response.text)
-    obj = parseXmlFile(f"tmp/{FILENAME}")
+    obj = parseXmlFile(f"{TMP_DIR}/{TMP_FILENAME}")
    data_list = []
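
Opening a file in "w" or "w+" mode creates the file but not its parent directory, so the write in this hunk raises FileNotFoundError if TMP_DIR does not exist yet. Below is a minimal sketch of one way to guard against that, assuming the directory is not already tracked in the repository. `write_tmp_file` is a hypothetical helper, not part of the commit.

```python
import os
import tempfile


def write_tmp_file(tmp_dir, filename, text):
    """Write text to tmp_dir/filename, creating tmp_dir first if needed."""
    # exist_ok=True makes the call a no-op when the directory already exists
    os.makedirs(tmp_dir, exist_ok=True)
    path = os.path.join(tmp_dir, filename)
    with open(path, "w", encoding="utf-8") as file:
        file.write(text)
    return path


# Demo in a throwaway location; the "tmp" sub-directory does not exist beforehand
base = tempfile.mkdtemp()
demo_path = write_tmp_file(os.path.join(base, "tmp"), "pubmedData.xml", "<xml/>")
print(os.path.exists(demo_path))
```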

--- a/models/ZeroShotClassifier/HuggingFace/__pycache__/zero_shot_classification.cpython-313.pyc
+++ b/models/ZeroShotClassifier/HuggingFace/__pycache__/zero_shot_classification.cpython-313.pyc
--- a/parsers/__pycache__/jsonParser.cpython-313.pyc
+++ b/parsers/__pycache__/jsonParser.cpython-313.pyc
--- a/rapports/img/facebook_results.png
+++ b/rapports/img/facebook_results.png
--- a/rapports/rapport_2.md
+++ b/rapports/rapport_2.md
+# Report 26.02.2025
+## File structure
+I restructured the project files to make navigation easier. There are now three main folders: dataSources, testModel and variables.
+- dataSources contains the data sources used, along with the Python file that retrieves the data through their APIs. It also includes Python files containing various tests run on the data sources (for example, PubMed/data_num.py, which retrieves the number of articles published on PubMed). For now, only the PubMed source is included.
+- testModel contains all the Python scripts for testing the models, as well as the results and the datasets used.
+- variables groups all the Python files containing variables reused across the project.
+All other folders contain files that are useful to or used by the project.
+## Model tests
+### Length of the classified texts
+After discussing with Mr. Glück, we decided to examine whether text length influences model performance. The goal is also to see whether some models perform better on long texts but are less accurate on short ones, and vice versa.
+I started by analysing the length of the texts in my dataset:
+```sh
+Longuest: 3863
+Shortest: 31
+Mean: 823.4525
+Median: 538.5
+```
+Based on these results, I decided to split the texts into four categories: SHORT, MEDIUM, LONG and VERY LONG:
+- Short: 0-300 characters
+- Medium: 301-600 characters
+- Long: 601-900 characters
+- Very Long: 901-inf characters
+This gives the following distribution:
+```sh
+SHORT: 144
+MEDIUM: 300
+LONG: 75
+VERY LONG: 281
+```
+Later on, the dataset will need to be rebuilt to get a more balanced distribution of article lengths.
+I also modified my test code to group the results by the four length categories. This will make comparisons possible and so make it easier to choose a model.
+In parallel, I thought about how to display the results. The version I have settled on for now is the following:
+![Display of the test results](../testModel/results/zero_shot/v2/results.png)
+However, I find that this representation lacks readability. I will therefore keep looking for a clearer and more relevant display.
+In the image above, only the values for the facebook/bart-large-mnli model are correct, because I only regained access to UNIGE's Baobab server this morning (due to problems with my SSH keys).
+I could not run the tests on my personal PC, because every run caused problems and took more than an hour per model. I therefore preferred to wait for access to Baobab.
+However, I still tried to get results for one model, to get a first idea of the direction I was taking. Here are the results for facebook/bart-large-mnli:
+![Tests on facebook/bart-large-mnli](./img/facebook_results.png)
+The order of the results, from top to bottom in the image, is: SHORT, MEDIUM, LONG, VERY LONG and ALL. This model performs best on VERY LONG texts (TPR of about 0.81, against about 0.57 on SHORT texts).
+## Plan for next week
+I am aware that I have not made as much progress as planned, but here are the tasks I intend to complete by Friday:
+- Re-run my tests on the Baobab servers
+- Display all the results in a readable way
+- Look into Mistral AI (an LLM a friend told me about)
+- Try Ollama
+- Clean up all the results for the LLMs and the HuggingFace models
\ No newline at end of file
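
The report mentions rebuilding the dataset to get a more balanced length distribution. One possible approach, sketched here under the assumption that discarding articles is acceptable, is to randomly downsample every category to the size of the smallest one. `downsample_by_category` is a hypothetical helper, and integers stand in for the real articles.

```python
import random


def downsample_by_category(articles_by_category, seed=0):
    """Keep the same number of articles in every category.

    articles_by_category maps a category name to a list of articles.
    Every list is randomly reduced to the size of the smallest one.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    target = min(len(items) for items in articles_by_category.values())
    return {
        name: rng.sample(items, target)
        for name, items in articles_by_category.items()
    }


# Placeholder integers stand in for articles; the counts are those of the report
counts = {"SHORT": 144, "MEDIUM": 300, "LONG": 75, "VERY LONG": 281}
dataset = {name: list(range(n)) for name, n in counts.items()}
balanced = downsample_by_category(dataset)
print({name: len(items) for name, items in balanced.items()})
```

With the counts from the report, this would leave 75 articles per category (300 in total), so fetching more LONG articles from PubMed may be preferable if the reduced dataset turns out to be too small.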
--- a/testModel/__pycache__/metrics.cpython-313.pyc
+++ b/testModel/__pycache__/metrics.cpython-313.pyc
--- a/testModel/__pycache__/utils.cpython-313.pyc
+++ b/testModel/__pycache__/utils.cpython-313.pyc
--- a/testModel/doc/articles_length.md
+++ b/testModel/doc/articles_length.md
+# Tests
+All information was collected using the "testModel/test_articles_len.py" Python script.
+I looked at article lengths to separate them into categories. I measured the lengths on my dataset:
+```sh
+Longuest: 3863
+Shortest: 31
+Mean: 823.4525
+Median: 538.5
+```
+I think I'll use 4 categories:
+- Short: 0-300 characters
+- Medium: 301-600 characters
+- Long: 601-900 characters
+- Very Long: 901-inf characters
+If I follow this split, we have:
+- Short: 144 articles
+- Medium: 300 articles
+- Long: 75 articles
+- Very long: 281 articles
\ No newline at end of file
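
The four ranges above can be turned into a lookup with a sorted list of upper bounds. This is only a sketch: `UPPER_BOUNDS`, `CATEGORIES` and `length_category` are hypothetical names, and the project's real constants (LENGTH_CATEGORIES and LENGTH_CATEGORIES_TRESHOLDS in variables/articles.py) are not shown in this commit, so their values may differ.

```python
from bisect import bisect_left

# Inclusive upper bounds of SHORT, MEDIUM and LONG, taken from the list above
UPPER_BOUNDS = [300, 600, 900]
CATEGORIES = ["SHORT", "MEDIUM", "LONG", "VERY LONG"]


def length_category(text):
    """Return the length category of a text, based on its character count."""
    # bisect_left returns the index of the first bound >= len(text),
    # or len(UPPER_BOUNDS) when the text is longer than every bound
    return CATEGORIES[bisect_left(UPPER_BOUNDS, len(text))]


for n in (31, 300, 301, 900, 3863):
    print(n, length_category("x" * n))
```

Here a text of exactly 300 characters counts as SHORT, as the ranges state. The scripts in this commit compare with a strict `<` against each threshold, so whether they agree on the boundary values depends on the thresholds they import.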
--- a/testModel/metrics.py
+++ b/testModel/metrics.py
+def confusion_matrix(wanted, prediction):
+    # Layout: [[TP, FP], [FN, TN]] (rows = predicted True/False, columns = wanted True/False)
+    matrix = [[0, 0], [0, 0]]
+    for key in wanted.keys():
+        if wanted[key]:
+            if prediction[key]:
+                matrix[0][0] += 1
+            else:
+                matrix[1][0] += 1
+        else:
+            if prediction[key]:
+                matrix[0][1] += 1
+            else:
+                matrix[1][1] += 1
+    return matrix
+def add_confusion_matrices(confusion_matrix, tmp_confusion_matrix):
+    for i in range(2):
+        for j in range(2):
+            confusion_matrix[i][j] += tmp_confusion_matrix[i][j]
+    return confusion_matrix
+def get_tpr(confusion_matrix):
+    # True positive rate (recall) = TP / (TP + FN)
+    return confusion_matrix[0][0] / (confusion_matrix[0][0] + confusion_matrix[1][0])
+def get_tnr(confusion_matrix):
+    # True negative rate (specificity) = TN / (TN + FP)
+    return confusion_matrix[1][1] / (confusion_matrix[1][1] + confusion_matrix[0][1])
+def get_precision(confusion_matrix):
+    # Precision = TP / (TP + FP)
+    return confusion_matrix[0][0] / (confusion_matrix[0][0] + confusion_matrix[0][1])
\ No newline at end of file
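
The three metric functions divide without checking the denominator, so a category with no positive label (or no positive prediction) would raise ZeroDivisionError. The sketch below shows the same arithmetic on the [[TP, FP], [FN, TN]] layout used in this file, with a guard added. `safe_ratio` and `summarize` are hypothetical names and the counts are made up for illustration.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is 0."""
    return numerator / denominator if denominator else None


def summarize(matrix):
    """Compute TPR, TNR and precision from a [[TP, FP], [FN, TN]] matrix."""
    (tp, fp), (fn, tn) = matrix
    return {
        "TPR": safe_ratio(tp, tp + fn),        # share of wanted labels that were found
        "TNR": safe_ratio(tn, tn + fp),        # share of unwanted labels that were rejected
        "Precision": safe_ratio(tp, tp + fp),  # share of positive predictions that are correct
    }


# Made-up counts, only to show the arithmetic
print(summarize([[8, 2], [2, 88]]))
# A matrix with no positive label and no positive prediction
print(summarize([[0, 0], [0, 5]]))
```

Returning None rather than 0 keeps an undefined metric distinguishable from a genuinely bad score. Which convention suits the result files is a design choice left open here.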
--- a/testModel/results/zero_shot/v2/MoritzLaurer-bge_m3_zeroshot_v2_0.txt
+++ b/testModel/results/zero_shot/v2/MoritzLaurer-bge_m3_zeroshot_v2_0.txt
--- a/testModel/results/zero_shot/v2/facebook-bart_large_mnli.txt
+++ b/testModel/results/zero_shot/v2/facebook-bart_large_mnli.txt
--- a/testModel/results/zero_shot/v2/results.json
+++ b/testModel/results/zero_shot/v2/results.json
+{
+    "results":[
+        {
+            "model" : "facebook/bart-large-mnli",
+            "data": [
+                {
+                    "Name": "ALL",
+                    "TPR": 0.6965944272445821,
+                    "TNR": 0.953047320935371,
+                    "Precision": 0.7258064516129032
+                },
+                {
+                    "Name": "SHORT",
+                    "TPR": 0.569060773480663,
+                    "TNR": 0.9505664263645726,
+                    "Precision": 0.6821192052980133
+                },
+                {
+                    "Name": "MEDIUM",
+                    "TPR": 0.6657681940700808,
+                    "TNR": 0.9507146377525875,
+                    "Precision": 0.7118155619596542
+                },
+                {
+                    "Name": "LONG",
+                    "TPR": 0.6666666666666666,
+                    "TNR": 0.9529411764705882,
+                    "Precision": 0.7142857142857143
+                },
+                {
+                    "Name": "VERY LONG",
+                    "TPR": 0.8103975535168195,
+                    "TNR": 0.956793336803748,
+                    "Precision": 0.7614942528735632
+                }
+            ]
+        },
+        {
+            "model" : "MoritzLaurer/bge-m3-zeroshot-v2.0",
+            "data": [
+                {
+                    "Name": "ALL",
+                    "TPR": 0.6965944272445821,
+                    "TNR": 0.953047320935371,
+                    "Precision": 0.7258064516129032
+                },
+                {
+                    "Name": "SHORT",
+                    "TPR": 0.569060773480663,
+                    "TNR": 0.9505664263645726,
+                    "Precision": 0.6821192052980133
+                },
+                {
+                    "Name": "MEDIUM",
+                    "TPR": 0.6657681940700808,
+                    "TNR": 0.9507146377525875,
+                    "Precision": 0.7118155619596542
+                },
+                {
+                    "Name": "LONG",
+                    "TPR": 0.6666666666666666,
+                    "TNR": 0.9529411764705882,
+                    "Precision": 0.7142857142857143
+                },
+                {
+                    "Name": "VERY LONG",
+                    "TPR": 0.8103975535168195,
+                    "TNR": 0.956793336803748,
+                    "Precision": 0.7614942528735632
+                }
+            ]
+        }
+    ]
+}
\ No newline at end of file
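
Both model entries in this file carry exactly the same numbers, which matches the report's remark that only the facebook/bart-large-mnli values are correct so far. A small check along these lines could flag such placeholder entries before plotting. `find_duplicate_results` is a hypothetical helper, and the sample is a shortened copy of the structure above.

```python
def find_duplicate_results(results):
    """Return pairs of model names whose metric entries are strictly identical."""
    entries = results["results"]
    duplicates = []
    for i, first in enumerate(entries):
        for second in entries[i + 1:]:
            if first["data"] == second["data"]:
                duplicates.append((first["model"], second["model"]))
    return duplicates


# Shortened sample: only the "ALL" entry of each model is kept
all_entry = {
    "Name": "ALL",
    "TPR": 0.6965944272445821,
    "TNR": 0.953047320935371,
    "Precision": 0.7258064516129032,
}
sample = {
    "results": [
        {"model": "facebook/bart-large-mnli", "data": [dict(all_entry)]},
        {"model": "MoritzLaurer/bge-m3-zeroshot-v2.0", "data": [dict(all_entry)]},
    ]
}
print(find_duplicate_results(sample))
```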
--- a/testModel/results/zero_shot/v2/results.png
+++ b/testModel/results/zero_shot/v2/results.png
--- a/testModel/show_results.py
+++ b/testModel/show_results.py
+import sys
+import os
+import matplotlib.pyplot as plt
+import numpy as np
+# Add the parent directory to the module search path
+sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../")))
+from parsers.jsonParser import parseJsonFile
+RESULTS_DIR_NAME = "./results"
+RESULTS_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), RESULTS_DIR_NAME))
+try:
+    results = parseJsonFile(f"{RESULTS_DIR}/zero_shot/v2/results.json")
+    print(results["results"])
+except Exception as e:
+    # Without the results there is nothing to plot, so stop here
+    sys.exit(f"Error: {e}")
+data = results["results"]
+models = [entry["model"] for entry in data]
+categories = [entry["Name"] for entry in data[0]["data"]]
+fig, axes = plt.subplots(1, 5, figsize=(18, 5), sharey=True)
+colors = ["skyblue", "orange", "green"]
+metrics = ["TPR", "TNR", "Precision"]
+bar_width = 0.2
+x_positions = np.arange(len(models))
+for i, category in enumerate(categories):
+    ax = axes[i]
+    for j, metric in enumerate(metrics):
+        values = []
+        for model_data in data:
+            category_data = next(item for item in model_data["data"] if item["Name"] == category)
+            values.append(category_data[metric])
+        ax.bar(x_positions + j * bar_width, values, width=bar_width, label=metric, color=colors[j])
+    ax.set_title(category)
+    ax.set_xticks(x_positions + bar_width)
+    ax.set_xticklabels(models, rotation=15)
+    ax.set_ylabel("Value")
+    ax.legend()
+plt.tight_layout()
+plt.savefig(f"{RESULTS_DIR}/zero_shot/v2/results.png")
\ No newline at end of file
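
Since the report finds the bar chart hard to read, a plain-text table per model may be a useful complement. This is a sketch rather than the project's display code: `format_table` is a hypothetical helper, and the two rows are copied from the facebook/bart-large-mnli entry of results.json.

```python
def format_table(model_name, rows, metrics=("TPR", "TNR", "Precision")):
    """Render one model's results as a fixed-width text table."""
    header = f"{'Category':<10}" + "".join(f"{m:>11}" for m in metrics)
    lines = [model_name, header]
    for row in rows:
        # Three decimals are enough to compare categories at a glance
        cells = "".join(f"{row[m]:>11.3f}" for m in metrics)
        lines.append(f"{row['Name']:<10}{cells}")
    return "\n".join(lines)


rows = [
    {"Name": "SHORT", "TPR": 0.569060773480663,
     "TNR": 0.9505664263645726, "Precision": 0.6821192052980133},
    {"Name": "VERY LONG", "TPR": 0.8103975535168195,
     "TNR": 0.956793336803748, "Precision": 0.7614942528735632},
]
table = format_table("facebook/bart-large-mnli", rows)
print(table)
```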
--- a/testModel/test.py
+++ b/testModel/test.py
+import sys
+import os
+import statistics
+import time
+# Add the parent directory to the module search path
+sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../")))
+from variables.diseases import DISEASES_LABELS
+from variables.huggingface import HUGGINGFACE_MODELS
+from variables.articles import LENGTH_CATEGORIES, LENGTH_CATEGORIES_TRESHOLDS
+from testModel.utils import get_dataset_filename, get_article_data, get_wanted_predictions
+from testModel.metrics import confusion_matrix, add_confusion_matrices, get_tpr, get_tnr, get_precision
+from parsers.jsonParser import parseJsonFile
+from models.ZeroShotClassifier.HuggingFace.zero_shot_classification import create_classifier, classify
+RESULTS_DIR_NAME = "./results"
+DATASET_DIR_NAME = "./dataset"
+RESULTS_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), RESULTS_DIR_NAME))
+DATASET_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), DATASET_DIR_NAME))
+TRESHOLD = 0.7
+for model in HUGGINGFACE_MODELS:
+    classifier = create_classifier(model)
+    result_filename = model.replace(" ", "_").replace("-", "_").replace(".", "_").replace("/", "-")
+    with open(f"{RESULTS_DIR}/zero_shot/v2/{result_filename}.txt", "w+") as file:
+        print("---------------------------------", file=file)
+        print(f"MODEL: {model}", file=file)
+        print(f"TRESHOLD: {TRESHOLD}", file=file)
+        print("---------------------------------", file=file)
+        nb_articles = 0
+        result_matrix = [[0, 0], [0, 0]]
+        length_matrix = {}
+        for length_category in LENGTH_CATEGORIES:
+            length_matrix[length_category] = [[0, 0], [0, 0]]
+        start = time.time()
+        for disease_label in DISEASES_LABELS:
+            try:
+                filename = get_dataset_filename(disease_label)
+                articles = parseJsonFile(f"{DATASET_DIR}/{filename}.json")
+            except Exception as e:
+                print(f"Error: {e}")
+                # Skip this label: `articles` would be undefined or stale
+                continue
+            for article in articles:
+                nb_articles += 1
+                print("---------------------------------", file=file)
+                title, abstract = get_article_data(article)
+                wanted = get_wanted_predictions(article, DISEASES_LABELS)
+                print(f"PMID: {article['PMID']}", file=file)
+                pred = article["Predictions"]
+                print(f"Predictions: {pred}", file=file)
+                print(f"MeshTerm: {article['MeshTerms']}", file=file)
+                predictions = {}
+                selected_labels = []
+                for predict_label in DISEASES_LABELS:
+                    results = classify(classifier, title+abstract, [predict_label])
+                    print(f"Labels: {results['labels']}", file=file)
+                    print(f"Scores: {results['scores']}", file=file)
+                    predictions[predict_label] = results["scores"][0] > TRESHOLD
+                    if results["scores"][0] > TRESHOLD:
+                        selected_labels.append(predict_label)
+                print(f"Wanted: {wanted}", file=file)
+                print(f"Predicted: {predictions}", file=file)
+                print(f"Selected labels: {selected_labels}", file=file)
+                matrix = confusion_matrix(wanted, predictions)
+                print(f"Confusion matrix: {matrix}", file=file)
+                result_matrix = add_confusion_matrices(result_matrix, matrix)
+                added = False
+                for id, length_category_treshold in enumerate(LENGTH_CATEGORIES_TRESHOLDS):
+                    if len(title+abstract) < length_category_treshold:
+                        length_matrix[LENGTH_CATEGORIES[id]] = add_confusion_matrices(length_matrix[LENGTH_CATEGORIES[id]], matrix)
+                        added = True
+                        break
+                if not added:
+                    length_matrix[LENGTH_CATEGORIES[-1]] = add_confusion_matrices(length_matrix[LENGTH_CATEGORIES[-1]], matrix)
+                print("---------------------------------", file=file)
+        end = time.time()
+        for length_category in LENGTH_CATEGORIES:
+            print(f"Result confusion matrix: {length_matrix[length_category]}", file=file)
+            print(f"True Positive Rate (TPR): {get_tpr(length_matrix[length_category])}", file=file)
+            print(f"True Negative Rate (TNR): {get_tnr(length_matrix[length_category])}", file=file)
+            print(f"Precision: {get_precision(length_matrix[length_category])}", file=file)
+            print("---------------------------------", file=file)
+        print(f"Time to classify all articles: {end-start} seconds", file=file)
+        print(f"Result confusion matrix: {result_matrix}", file=file)
+        print(f"True Positive Rate (TPR): {get_tpr(result_matrix)}", file=file)
+        print(f"True Negative Rate (TNR): {get_tnr(result_matrix)}", file=file)
+        print(f"Precision: {get_precision(result_matrix)}", file=file)
+        print()
\ No newline at end of file
--- a/testModel/test_articles_len.py
+++ b/testModel/test_articles_len.py
+import sys
+import os
+import statistics
+# Add the parent directory to the module search path
+sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../")))
+from variables.diseases import DISEASES_LABELS
+from variables.articles import LENGTH_CATEGORIES, LENGTH_CATEGORIES_TRESHOLDS
+from testModel.utils import get_dataset_filename, get_article_data
+from parsers.jsonParser import parseJsonFile
+DATASET_DIR_NAME = "./dataset"
+DATASET_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), DATASET_DIR_NAME))
+nb_articles = 0
+nb_articles_categories = {}
+for length_category in LENGTH_CATEGORIES:
+    nb_articles_categories[length_category] = 0
+lens = []
+for disease_label in DISEASES_LABELS:
+    try:
+        filename = get_dataset_filename(disease_label)
+        articles = parseJsonFile(f"{DATASET_DIR}/{filename}.json")
+    except Exception as e:
+        print(f"Error: {e}")
+        # Skip this label: `articles` would be undefined or stale
+        continue
+    for article in articles:
+        nb_articles += 1
+        title, abstract = get_article_data(article)
+        text = title + abstract
+        lens.append(len(text))
+        added = False
+        for id, length_category_treshold in enumerate(LENGTH_CATEGORIES_TRESHOLDS):
+            if len(text) < length_category_treshold:
+                nb_articles_categories[LENGTH_CATEGORIES[id]] += 1
+                added = True
+                break
+        if not added:
+            nb_articles_categories[LENGTH_CATEGORIES[-1]] += 1
+print(f"Nb articles: {nb_articles}")
+for length_category in LENGTH_CATEGORIES:
+    print(f"Nb articles ({length_category}): {nb_articles_categories[length_category]}")
+print()
+print(f"Longuest: {max(lens)}")
+print(f"Shortest: {min(lens)}")
+print(f"Mean: {statistics.mean(lens)}")
+print(f"Median: {statistics.median(lens)}")
--- a/testModel/utils.py
+++ b/testModel/utils.py
+def get_dataset_filename(name):
+    return name.replace(" ", "_").replace(",", "").lower()
+def get_article_data(article):
+    title = article["ArticleTitle"]
+    abstract = article["Abstract"]
+    if not isinstance(title, str):
+        title = ""
+    if not isinstance(abstract, str):
+        abstract = ""
+    return title, abstract
+def get_wanted_predictions(article, labels):
+    wanted = {}
+    for label in labels:
+        wanted[label] = label in article["Predictions"]
+    return wanted
\ No newline at end of file
--- a/variables/__pycache__/articles.cpython-313.pyc
+++ b/variables/__pycache__/articles.cpython-313.pyc
--- a/variables/__pycache__/diseases.cpython-313.pyc
+++ b/variables/__pycache__/diseases.cpython-313.pyc