diff --git a/ProjectNotes.md b/ProjectNotes.md
new file mode 100644
index 0000000000000000000000000000000000000000..69f6a7cb37e4ad2193b2d480c3b30a8dbf29dbfa
--- /dev/null
+++ b/ProjectNotes.md
@@ -0,0 +1,18 @@
+# NCD: data aggregation and classification
+
+## Remarks
+
+-
+
+## TODO
+
+### Must
+- (Fully understand how ATM works)
+- Find out how long it takes for MeSH terms to be assigned
+- Explore unsupervised approaches (text classification)
+
+- ~~MeSH term vs keyword~~
+- ~~Where it searches for the terms (title, keywords, MeSH terms, etc.)~~
+- ~~Study the operators (A or B and C) (A or B) and C A or (B and C)~~
+- Background service that retrieves the data
+- Think about the storage structure
\ No newline at end of file
diff --git a/PubmedInfo.md b/PubmedInfo.md
index 0b2f77571cfc62b9fd5500953baac1e3300c160c..15d8df11c297274460c026749fa510db58c74f4e 100644
--- a/PubmedInfo.md
+++ b/PubmedInfo.md
@@ -134,6 +134,100 @@ where the terms are search terms, each limited to a particular Entrez field in s
 
 Example: human[organism] AND topoisomerase[protein name]
 
+When searching for terms, pay attention to how the query is structured. For example, if you search for **breast cancer** without quotes, the search engine matches records that contain both **"breast"** and **"cancer"** (not necessarily adjacent) as well as records that contain the exact phrase **"breast cancer"**. To search only for the phrase **breast cancer** as a whole, enclose it in quotation marks (" ").
+
+
+
+### API syntax
+
+When using the API, the search term must be properly URL-encoded. Spaces should be replaced by "+", and all other special characters must be replaced by "%" followed by their ASCII hexadecimal code, except for a small number of URL-safe characters such as "()" and "[]".
+
+## Content
+PubMed is composed of:
+
+- Primarily the MEDLINE database.
+- Older references from the print version of Index Medicus, dating back to 1951 and earlier.
+- References to some journals before they were indexed in Index Medicus and MEDLINE, such as Science, BMJ, and Annals of Surgery.
+- Very recent entries for articles before they are indexed with Medical Subject Headings (MeSH) and added to MEDLINE.
+- A collection of full-text books and other subsets of NLM (National Library of Medicine) records.
+- Citations from PMC (PubMed Central).
+- The NCBI Bookshelf, which provides access to books and documents in life sciences and healthcare.
+
+## MeSH
+
+Medical Subject Headings (MeSH) are standardised keywords from the MeSH database, used to index articles in the MEDLINE database. The MeSH database is a controlled vocabulary thesaurus developed by the National Library of Medicine (NLM) for indexing articles in PubMed. MeSH terms describe the content of an article. Changes in the medical terminology of published literature are reflected in the MeSH headings, which the NLM updates regularly. MeSH terms are assigned by trained indexers based on the article's content and the MeSH vocabulary.
+
+### MeSH headings
+
+The primary subject or concept that an article is about. It is the main term used for indexing and organizing literature on a specific topic.
+
+### MeSH subheadings
+
+Subheadings further refine the topic with specific aspects, such as therapy, diagnosis, surgery, blood, etc.
+
+### Search syntax
+
+MeSH terms are organized into categories and subcategories, which makes them hierarchical.
+
+To search for articles that specifically discuss a MeSH term, specify the term followed by "[MeSH]". For example:
+
+```
+diabetes[MeSH]
+```
+
+This will automatically match your term to a MeSH term if it exists.
+
+You can also visit the MeSH database at https://www.ncbi.nlm.nih.gov/mesh to search for terms. After selecting a term, you can specify a subheading. The subheading is added after the MeSH term with a forward slash (/):
+
+```
+"Diabetes Mellitus/surgery"[MeSH]
+```
+
+This searches for the MeSH term "Diabetes Mellitus" with the subheading "surgery".
+
+As mentioned earlier, MeSH terms are hierarchical. When you search for a term, PubMed also searches for all related, narrower terms (children of the term). To disable this automatic expansion, add :noexp to the field tag:
+
+```
+"Diabetes Mellitus"[MeSH:noexp]
+```
+
+This ensures that PubMed searches for the exact MeSH term "Diabetes Mellitus" and does not expand to narrower terms.
+
+## Keywords
+
+Keywords are words or phrases selected by the authors to describe the main content of an article. They are not standardized, meaning they can vary between authors and articles. These keywords help readers quickly identify the key themes or topics discussed in the publication.
+
+## Fields
+
+In PubMed, fields are specific parts of a database record where different types of information about an article are stored. Fields help organize and retrieve information effectively, as they allow you to target searches to specific sections of an article's metadata (such as the title, author, publication date, or MeSH terms).
+
+Each field has a field tag (enclosed in square brackets, e.g., [TI] for Title) that you can append to a term to restrict the search to that field.
+
+When no field tags are specified, PubMed searches across all fields. It also applies Automatic Term Mapping (ATM) to map your terms to MeSH terms, journal names, author names, and other indexed fields, which helps broaden the search intelligently.
+
+
+
+### ATM
+
+Each time you enter a search term into PubMed, the system checks for that term in what it calls its "Translation Tables". These are a set of tools that map terms to other terms.
+
+There are a number of different ways a term might get mapped, but one of the primary ways is to a particular Subject Heading or MeSH Term. Essentially, the database checks whether an official MeSH Term is mapped to the keyword you entered. If it finds a match, it searches for that MeSH term in addition to the term you entered.
+
+### List
+
+This list may not cover all the fields; there are probably more.
+
+- https://www.ncbi.nlm.nih.gov/books/NBK49540/
+- (To be verified) https://wayback.archive-it.org/org-350/20240424200258/https://www.nlm.nih.gov/bsd/mms/medlineelements.html#ab
+
+## Operators
+
+When performing a search in PubMed, you can use Boolean operators to structure your query for more precise results. The available operators are: AND, OR, and NOT. These operators must be written in uppercase, as lowercase versions may not be recognized as operators and could be interpreted as search terms instead.
+
+These operators are applied in order, from left to right. For example:
+- aa AND bb OR cc <=> (aa AND bb) OR cc
+- aa OR bb AND cc <=> (aa OR bb) AND cc
+
 ## Source
 https://www.ncbi.nlm.nih.gov/home/develop/api/
 https://www.ncbi.nlm.nih.gov/books/NBK25501/ (All the books)
\ No newline at end of file
diff --git a/api/__pycache__/whoApi.cpython-312.pyc b/api/__pycache__/whoApi.cpython-312.pyc
index 83ace5edce560eb8f581c13738c1798960ff76c3..c2ccfaa62328b440870f9545706215014d433fd6 100644
Binary files a/api/__pycache__/whoApi.cpython-312.pyc and b/api/__pycache__/whoApi.cpython-312.pyc differ
diff --git a/api/data/pubmedData.xml b/api/data/pubmedData.xml
index 6b3970e82258231c1746bd2235d62921c36faca6..04adf9e58d0f3432f64e9f8dd563b81b9583ad7d 100644
Binary files a/api/data/pubmedData.xml and b/api/data/pubmedData.xml differ
diff --git a/api/parser/__pycache__/xlsxParser.cpython-312.pyc b/api/parser/__pycache__/xlsxParser.cpython-312.pyc
index f1d9612058f093d1b690c1cd97686eb11049413b..f86b635812b48746651294ed6c362deefb6a84ae 100644
Binary files a/api/parser/__pycache__/xlsxParser.cpython-312.pyc and b/api/parser/__pycache__/xlsxParser.cpython-312.pyc differ
diff --git a/api/pubmedApi.py b/api/pubmedApi.py
index 2bb6e83fbb9b7c4b64e030bdfcadcd14316bba07..266c873dee445978829b99ff27737ca054c503d3 100644
--- a/api/pubmedApi.py
+++ b/api/pubmedApi.py
@@ -3,11 +3,15 @@ from parser.xmlParser import parseXmlFile
 
 FILENAME = "pubmedData.xml"
 
-#term = "diabetes+type+1+OR+diabetes+type+2+OR+mental+health"
-term = "diabetes+type+1"
-date_min = "2024/11/07"
+term = "diabetes+type+1+OR+diabetes+type+2+OR+mental+health"
+# term = "diabetes+type+1"
+date_min = "2024/09/08"
 date_max = "2024/11/08"
 
+print(f"Term: {term}")
+print(f"Date min: {date_min}")
+print(f"Date max: {date_max}")
+
 url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term={term}&retmode=json&mindate={date_min}&maxdate={date_max}&usehistory=y"
 response = get(url)
 
@@ -29,4 +33,10 @@ with open(f"data/{FILENAME}", "w", encoding="utf-8") as file:
 
 obj = parseXmlFile(f"data/{FILENAME}")
 
-print(obj)
\ No newline at end of file
+# print(obj)
+
+for key in obj.keys():
+    if isinstance(obj[key], list):
+        print(f"{key} has {len(obj[key])} entries")
+    else:
+        print(f"{key} has 1 entry")
\ No newline at end of file
diff --git a/docImg/SearchATMExample.png b/docImg/SearchATMExample.png
new file mode 100644
index 0000000000000000000000000000000000000000..e1999eaeb87e34b38d7e38902f944b70b8a3925f
Binary files /dev/null and b/docImg/SearchATMExample.png differ
diff --git a/docImg/SearchQuotationExample.png b/docImg/SearchQuotationExample.png
new file mode 100644
index 0000000000000000000000000000000000000000..9a03c5a91e4ebe5644c2764ee280b4c73c5a52c1
Binary files /dev/null and b/docImg/SearchQuotationExample.png differ
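
Review note on the `### API syntax` section of `PubmedInfo.md` and the hand-encoded `term` in `api/pubmedApi.py`: the encoding rule described there (spaces as "+", "()" and "[]" left as-is, other special characters percent-encoded) could probably be delegated to the standard library rather than applied by hand. The following is only a minimal sketch, assuming Python's `urllib.parse.quote_plus`; `encode_term` and `build_esearch_url` are hypothetical helper names that do not exist in the repository, and the URL parameters are copied from the script in this diff.

```python
from urllib.parse import quote_plus

# Base URL taken from api/pubmedApi.py in this diff.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def encode_term(term: str) -> str:
    """Encode a PubMed query for use in a URL (hypothetical helper).

    Spaces become '+', '()' and '[]' are left untouched, and any other
    character outside the URL-unreserved set is percent-encoded.
    """
    return quote_plus(term, safe="()[]")


def build_esearch_url(term: str, date_min: str, date_max: str) -> str:
    """Build the same ESearch URL as the script, from a readable query."""
    return (
        f"{ESEARCH_URL}?db=pubmed&term={encode_term(term)}&retmode=json"
        f"&mindate={date_min}&maxdate={date_max}&usehistory=y"
    )


if __name__ == "__main__":
    # A readable query instead of a pre-encoded string.
    query = "diabetes type 1 OR diabetes type 2 OR mental health"
    print(encode_term(query))
    print(build_esearch_url(query, "2024/09/08", "2024/11/08"))
```

With this approach the query can be kept in its readable form (including quoted phrases and field tags such as `"breast cancer"[TI]`) and encoded only at the point where the URL is built.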